Sums of Chi-square Variables and Approximation of Variance – Introduction to Statistics via Spreadsheets

Let $\displaystyle{ V_1 }$ and $\displaystyle{ V_2 }$ be two independent $\displaystyle{ \chi^2 }$ random variables with $\displaystyle{ n_1 }$ and $\displaystyle{ n_2 }$ degrees of freedom, respectively. Consider the problem of finding the distribution of their sum $\displaystyle{ V =V_1 +V_2 }\;$ . As the two are independent, the moment generating function of $V$ is the product of the MGFs of $\displaystyle{ V_1 }$ and $\displaystyle{ V_2 }$ . That is
$$ M_V(t) =M_{V_1}(t)\cdot M_{V_2}(t) \;\text{.} $$
But we know the MGF of a $\displaystyle{ \chi^2 }$ random variable:
$$
M_{V_1}(t) =(1-2\, t)^{-\frac{n_1}{2}} \quad\text{and}\quad M_{V_2}(t) =(1-2\, t)^{-\frac{n_2}{2}} \;\text{.}
$$
So
$$
M_V(t) =(1-2\, t)^{-\frac{n_1}{2}} \cdot (1-2\, t)^{-\frac{n_2}{2}} =(1-2\, t)^{-\frac{(n_1 +n_2)}{2}} \;\text{.}
$$
That is, $\, V$ is a $\displaystyle{ \chi^2 }$ random variable with $\displaystyle{ n_1 +n_2 }$ degrees of freedom.

Theorem: If $\displaystyle{ V_1 }$ and $\displaystyle{ V_2 }$ are two independent $\displaystyle{ \chi^2 }$ random variables with $\displaystyle{ n_1 }$ and $\displaystyle{ n_2 }$ degrees of freedom, respectively, then their sum $\displaystyle{ V =V_1 +V_2 }$ will be a $\displaystyle{ \chi^2 }$ random variable with $\displaystyle{ n_1 +n_2 }$ degrees of freedom.
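
The theorem can be checked informally by simulation. The sketch below is only an illustration: it draws $\chi^2$ variates through Python's `random.gammavariate`, using the standard fact (not derived in this section) that a $\chi^2$ variable with $n$ degrees of freedom is a gamma variable with shape $n/2$ and scale $2\,$ . The values $n_1 =3\,$ , $n_2 =4$ and $t =0.1$ are arbitrary choices, and the helper names are hypothetical.

```python
import math
import random

def chi2_sample(df, rng):
    """Draw one chi-square variate with `df` degrees of freedom.

    Assumption of this sketch: chi-square(df) equals Gamma(shape=df/2, scale=2).
    """
    return rng.gammavariate(df / 2.0, 2.0)

def empirical_mgf_of_sum(n1, n2, t, trials=50_000, seed=0):
    """Estimate E[exp(t * (V1 + V2))] for independent chi-square V1 and V2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        v = chi2_sample(n1, rng) + chi2_sample(n2, rng)
        total += math.exp(t * v)
    return total / trials

def chi2_mgf(df, t):
    """Closed-form MGF (1 - 2t)^(-df/2), valid for t < 1/2."""
    return (1.0 - 2.0 * t) ** (-df / 2.0)

n1, n2, t = 3, 4, 0.1  # illustrative values, not taken from the text
estimate = empirical_mgf_of_sum(n1, n2, t)
exact = chi2_mgf(n1 + n2, t)
print(f"simulated MGF of V1 + V2: {estimate:.4f}")
print(f"closed form with n1 + n2 degrees of freedom: {exact:.4f}")
```

The two printed numbers should agree up to simulation noise, which is consistent with the sum having $n_1 +n_2$ degrees of freedom.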

We use this theorem to find the distribution of the sum of squares of a set of independent standard normal random variables. To this end, let $Z$ be a standard normal variable. Then, for $t \lt \frac{1}{2}\,$ (so that the integral converges), the MGF of $\displaystyle{ Z^2 }$ is
$$
\begin{array}{rl}
M_{Z^2}(t) & =\int_{-\infty}^{\infty}\, {\rm e}^{t\, x^2}\,\frac{{\rm e}^{-\left( \frac{x^2}{2} \right)}}{\sqrt{2\,\pi}}\, {\rm d}x \\
& =\frac{1}{\sqrt{2\,\pi}} \,\int_{-\infty}^{\infty}\, {\rm e}^{-\left( \frac{x^2}{2} \right)\,(1-2\, t)}\, {\rm d}x \;\text{.}
\end{array}
$$
Letting $y =x\,\sqrt{1-2\,t}\,$ , this integral reduces to
$$
\begin{array}{rl}
M_{Z^2}(t) & =(1-2\, t)^{-\frac{1}{2}}\, \int_{-\infty}^{\infty}\, \frac{{\rm e}^{-\left( \frac{y^2}{2} \right)}}{\sqrt{2\,\pi}}\, {\rm d}y \\
& =(1-2\, t)^{-\frac{1}{2}} \;\text{.}
\end{array}
$$

This is the MGF of a $\displaystyle{ \chi^2 }$ variable with one degree of freedom. Thus, using the above theorem, we have that the sum of squares of a set of $n$ independent standard normal variables will be a $\displaystyle{ \chi^2 }$ variable with $n$ degrees of freedom. We will now use this in considering several problems relating to the variance of a normal distribution.
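
The closed form $(1-2\, t)^{-\frac{1}{2}}$ can also be checked numerically. The following sketch approximates the defining integral with a midpoint rule for a few values of $t \lt \frac{1}{2}\,$ ; the integration range, step count, and function name are choices made for this illustration only.

```python
import math

def mgf_z_squared_numeric(t, half_width=20.0, steps=100_000):
    """Midpoint-rule approximation of E[exp(t Z^2)] for standard normal Z.

    Only meaningful for t < 1/2, where the integrand decays at infinity.
    """
    h = 2.0 * half_width / steps
    total = 0.0
    for i in range(steps):
        x = -half_width + (i + 0.5) * h
        # exp(t x^2) times the standard normal kernel exp(-x^2 / 2)
        total += math.exp(t * x * x - x * x / 2.0)
    return total * h / math.sqrt(2.0 * math.pi)

for t in (-1.0, 0.0, 0.2, 0.4):
    numeric = mgf_z_squared_numeric(t)
    closed = (1.0 - 2.0 * t) ** -0.5
    print(f"t={t:+.1f}  numeric={numeric:.6f}  closed form={closed:.6f}")
```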

The Distribution of $\displaystyle{ \sum_{k=1}^n\, \left( X_k -\mu\right)^2 }$

If $X$ is a normal random variable with mean $\mu$ and variance $\sigma^2\,$ , consider a random sample $X_1\,$ , … , $\,X_n$ of size $n\;$ . Then $\displaystyle{ Z_1 =\frac{X_1 -\mu}{\sigma}}$ , … , $\displaystyle{ Z_n =\frac{X_n -\mu}{\sigma}}$ are independent standard normal variables. It follows that
$$ V =\sum_{k=1}^n\, \left( \frac{X_k -\mu}{\sigma} \right)^2 =\sum_{k=1}^n\, Z_k^2 $$
is a $\displaystyle{ \chi^2 }$ variable with $n$ degrees of freedom.

Theorem: If $X$ is a normal random variable with mean $\mu$ and variance $\sigma^2\,$ , and $X_1\,$ , … , $\,X_n$ is a random sample of size $n$ of $X\,$ , then $ V =\sum_{k=1}^n\, \left( X_k -\mu\right)^2\big/\sigma^2 $ is a $\displaystyle{ \chi^2 }$ variable with $n$ degrees of freedom.

We can use this to address various problems concerning the parameter $\sigma$ of a normal random variable.

Note that the quantity $ \sum_{k=1}^n\, \left( X_k -\mu\right)^2\big/ n $ looks a lot like the sample variance, and thus it will be useful for estimating $\sigma^2$ when $\mu$ is known. If $\mu$ is not known, we will need to replace it by the sample mean $\bar{X}\;$ . The resulting random variable
$$ V =\sum_{k=1}^n\, \left( \frac{X_k -\bar{X}}{\sigma} \right)^2 $$
would naturally be expected to be approximately a $\displaystyle{ \chi^2 }$ random variable with $n$ degrees of freedom, with the approximation improving as $n$ gets larger. In fact, $V$ is exactly a $\displaystyle{ \chi^2 }$ random variable, but with $n-1$ degrees of freedom (a fact that will be justified later). Thus we will use
$$ \sum_{k=1}^n\, \left( \frac{X_k -\mu}{\sigma} \right)^2 $$
when $\mu$ is available, and
$$ \sum_{k=1}^n\, \left( \frac{X_k -\bar{X}}{\sigma} \right)^2 $$
when it’s not.
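
A small simulation can make the difference between centering at $\mu$ and centering at $\bar{X}$ visible. A $\displaystyle{ \chi^2 }$ variable with $k$ degrees of freedom has mean $k\,$ (each squared standard normal has mean $1$), so the two sums should average to about $n$ and $n-1$ respectively. The sample size, mean, and standard deviation below are arbitrary values assumed for this sketch, and the function name is hypothetical.

```python
import random

def mean_sums_of_squares(n, mu, sigma, trials=20_000, seed=1):
    """Average the two standardized sums of squares over many simulated samples."""
    rng = random.Random(seed)
    total_known_mu = 0.0
    total_sample_mean = 0.0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        # Sum of squares centered at the true mean mu
        total_known_mu += sum(((x - mu) / sigma) ** 2 for x in xs)
        # Sum of squares centered at the sample mean
        total_sample_mean += sum(((x - xbar) / sigma) ** 2 for x in xs)
    return total_known_mu / trials, total_sample_mean / trials

# Illustrative values (assumptions for this sketch): n = 5, mu = 10, sigma = 3.
avg_mu, avg_xbar = mean_sums_of_squares(n=5, mu=10.0, sigma=3.0)
print(f"centered at mu:   average = {avg_mu:.3f}  (compare with n = 5)")
print(f"centered at xbar: average = {avg_xbar:.3f}  (compare with n - 1 = 4)")
```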

Example 1

Suppose that $X$ is normal with mean $12$ and unknown variance $\sigma^2\;$ . Consider the problem of estimating this variance using a random sample of size $25\;$ . The quantity
$$ \frac{1}{25}\,\sum_{k=1}^{25}\, \left( X_k -12 \right)^2 $$
will be used to estimate $\sigma^2\;$ . What is the probability that this will not be off by more than $10\%\;$ ?

Our accuracy requirement can be expressed as
$$
0.9\,\sigma^2 \lt \frac{1}{25}\,\sum_{k=1}^{25}\, \left( X_k -12 \right)^2 \lt 1.1\,\sigma^2 \;\text{.}
$$
Multiplying through by $25/\sigma^2\,$ , this is equivalent to
$$ 22.5 \lt \sum_{k=1}^{25}\, \left( \frac{X_k -12}{\sigma} \right)^2 \lt 27.5 \;\text{.} $$
In terms of our discussion, this is
$$ 22.5 \lt V \lt 27.5 \,\text{,} $$
where $V$ is a $\displaystyle{ \chi^2 }$ random variable with $25$ degrees of freedom. The distribution function of a $\displaystyle{ \chi^2 }$ random variable cannot, in general, be written as a simple combination of elementary functions; it has to be approximated numerically. For this reason, spreadsheets have been designed with approximate values of $\displaystyle{ \chi^2 }$ densities and distributions built in. We can obtain the probability that $V$ is between $22.5$ and $27.5$ using a spreadsheet. The command “=CHISQDIST(22.5,25,1)” gives that the probability of a $\displaystyle{ \chi^2 }$ random variable with $25$ degrees of freedom being at most $22.5$ is $0.3933\, $ , and similarly “=CHISQDIST(27.5,25,1)” gives the probability of being at most $27.5$ as $0.6686\; $ . Thus the probability that such a variable takes values between $22.5$ and $27.5$ is $0.6686 -0.3933 \approx 0.275\; $ . This is the probability that our estimate of $\sigma^2$ is off by less than $10\%\;$ .
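
For readers without a spreadsheet at hand, the same probabilities can be approximated in a few lines of Python. The sketch below is not how any particular spreadsheet computes its values; it evaluates the $\displaystyle{ \chi^2 }$ distribution function through the power series of the regularized lower incomplete gamma function, and the name `chi2_cdf` is hypothetical.

```python
import math

def chi2_cdf(x, df, tol=1e-12, max_terms=10_000):
    """P(V <= x) for a chi-square variable V with `df` degrees of freedom.

    Evaluates the regularized lower incomplete gamma function P(df/2, x/2)
    by its power series; adequate for moderate x, not a library-grade routine.
    """
    if x <= 0:
        return 0.0
    a, z = df / 2.0, x / 2.0
    # First series term: z^a * e^(-z) / Gamma(a + 1), computed in log space.
    term = math.exp(a * math.log(z) - z - math.lgamma(a + 1.0))
    total = term
    for n in range(1, max_terms):
        term *= z / (a + n)  # next term: multiply by z / (a + n)
        total += term
        if term < tol * total:
            break
    return total

lower = chi2_cdf(22.5, 25)
upper = chi2_cdf(27.5, 25)
print(f"P(V <= 22.5) = {lower:.4f}")
print(f"P(V <= 27.5) = {upper:.4f}")
print(f"P(22.5 < V < 27.5) = {upper - lower:.4f}")
```

The printed values should land close to the spreadsheet figures quoted above.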

Example 2

Suppose now that we do not know the actual mean of our normal variable, and must use the sample mean. That is, we use
$$ \sum_{k=1}^n\, \left( \frac{X_k -\bar{X}}{\sigma} \right)^2 $$
to address problems concerning $\sigma^2\;$ .

By way of illustration, suppose that a sample of size $20$ is taken of a normal variable $X\;$ . What is the probability that the sample variance will be more than $25\%$ larger than $\sigma^2\;$ ? We represent this as the probability that
$$ \sum_{k=1}^{20}\, \frac{\left( X_k -\bar{X} \right)^2}{19} \gt 1.25\,\sigma^2 \;\text{.} $$
This is equivalent to
$$ \sum_{k=1}^{20}\, \frac{\left( X_k -\bar{X} \right)^2}{\sigma^2} \gt 23.75 \;\text{.} $$

As noted above, the left-hand side is a $\displaystyle{ \chi^2 }$ random variable with $19$ degrees of freedom. Again using a spreadsheet, we find the probability that
$$ \sum_{k=1}^{20}\, \frac{\left( X_k -\bar{X} \right)^2}{\sigma^2} \le 23.75 $$
to be $0.7941$ via the command “=CHISQDIST(23.75,19,1)” . Thus the desired probability is $1 -0.7941 =0.2059\;$ . That is, there is roughly a $21\%$ chance that our sample variance will exceed the actual variance by more than $25\%\;$ .
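
This probability can be cross-checked by brute force, without any $\displaystyle{ \chi^2 }$ tables. The sketch below repeatedly draws samples of size $20$ from a normal distribution and counts how often the sample variance exceeds $1.25\,\sigma^2\;$ . The choice of mean $0$ and standard deviation $1$ is an arbitrary assumption (the answer does not depend on them), and the function name is hypothetical.

```python
import random

def prob_sample_variance_exceeds(n, factor, trials=20_000, seed=2):
    """Estimate P(S^2 > factor * sigma^2) for normal samples of size n."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # arbitrary: the probability does not depend on them
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(xs) / n
        s2 = sum((x - xbar) ** 2 for x in xs) / (n - 1)  # sample variance
        if s2 > factor * sigma ** 2:
            hits += 1
    return hits / trials

estimate = prob_sample_variance_exceeds(n=20, factor=1.25)
print(f"simulated P(S^2 > 1.25 sigma^2) = {estimate:.3f}")
```

Up to simulation noise, the estimate should sit near the value $0.2059$ obtained from the spreadsheet.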

Example 3

As a different type of example, consider the problem of trying to deliver supplies to a colony on Mars. What is the probability that a delivery will miss the drop-spot by more than a mile, assuming that the north-south and east-west errors are independently normally distributed with mean zero and common standard deviation of $0.25\ {\rm mi}\;$ ?

For convenience, we will measure in units of quarter miles. Let $N$ denote the north-south displacement of our delivery, and $W$ the east-west displacement. Our choice of units means that $N$ and $W$ are independent standard normal variables. The square of the total displacement is
$$ N^2 +W^2 \,\text{,} $$
and possesses a $\displaystyle{ \chi^2 }$ distribution with $2$ degrees of freedom. Thus the desired probability is
$$ P\left(\sqrt{ N^2 +W^2 } \gt 4\right) =P\left( N^2 +W^2 \gt 16 \right) \;\text{.} $$
We again use a spreadsheet with the command “=CHISQDIST(16,2,1)” to find that the probability that a $\displaystyle{ \chi^2 }$ random variable with $2$ degrees of freedom takes on a value no greater than $16$ is $0.9997\;$ . The probability that our delivery misses its drop-spot by more than a mile is thus $1-0.9997 =0.0003\,$ , so that there is about a $0.03\%$ chance of missing the drop-spot by over a mile.
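
With $2$ degrees of freedom this particular case can also be checked by hand. A $\displaystyle{ \chi^2 }$ variable with $2$ degrees of freedom has density $\frac{1}{2}\,{\rm e}^{-x/2}$ for $x \gt 0\,$ (a standard fact not derived in this section), so its upper tail is simply ${\rm e}^{-x/2}\;$ . The short sketch below evaluates that tail at $16\,$ ; the function name is hypothetical.

```python
import math

def chi2_2df_tail(x):
    """P(V > x) for a chi-square variable V with 2 degrees of freedom, x >= 0."""
    return math.exp(-x / 2.0)

# 16 is the squared miss distance in quarter-mile units: (1 / 0.25)^2
miss_probability = chi2_2df_tail(16.0)
print(f"P(miss by more than a mile) = {miss_probability:.6f}")
```

The result, ${\rm e}^{-8}\,$ , agrees with the rounded spreadsheet figure of about $0.0003\;$ .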

There are other interesting and important applications of $\displaystyle{ \chi^2 }$ random variables, and some of these will be introduced later.