Sums of Chi-square Variables and Approximation of Variance

Let   $\displaystyle{ V_1 }$   and   $\displaystyle{ V_2 }$   be two independent   $\displaystyle{ \chi^2 }$   random variables with   $\displaystyle{ n_1 }$   and   $\displaystyle{ n_2 }$   degrees of freedom, respectively.   Consider the problem of finding the distribution of their sum   $\displaystyle{ V =V_1 +V_2 }\;$ .   As the two are independent, the moment generating function of   $V$   is the product of the MGFs of   $\displaystyle{ V_1 }$   and   $\displaystyle{ V_2 }$ .   That is
$$ M_V(t) =M_{V_1}(t)\cdot M_{V_2}(t) \;\text{.} $$
But we know the MGF of a   $\displaystyle{ \chi^2 }$   random variable:
$$
M_{V_1}(t) =(1-2\, t)^{-\frac{n_1}{2}} \quad\text{and}\quad M_{V_2}(t) =(1-2\, t)^{-\frac{n_2}{2}} \;\text{.}
$$
So
$$
M_V(t) =(1-2\, t)^{-\frac{n_1}{2}} \cdot (1-2\, t)^{-\frac{n_2}{2}} =(1-2\, t)^{-\frac{(n_1 +n_2)}{2}} \;\text{.}
$$
That is, $\, V$   is a   $\displaystyle{ \chi^2 }$   random variable with   $\displaystyle{ n_1 +n_2 }$   degrees of freedom.

Theorem:   If   $\displaystyle{ V_1 }$   and   $\displaystyle{ V_2 }$   are two independent   $\displaystyle{ \chi^2 }$   random variables with   $\displaystyle{ n_1 }$   and   $\displaystyle{ n_2 }$   degrees of freedom, respectively, then their sum   $\displaystyle{ V =V_1 +V_2 }$   will be a   $\displaystyle{ \chi^2 }$   random variable with   $\displaystyle{ n_1 +n_2 }$   degrees of freedom.
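
As a quick check of the theorem in a case that can be worked by hand, take   $n_1 =n_2 =2\;$ .   This sketch assumes the standard density   $\frac{1}{2}\,{\rm e}^{-x/2}$   (for   $x \gt 0$ ) of a   $\displaystyle{ \chi^2 }$   variable with   $2$   degrees of freedom, which is not derived in the text above.   Since the density of a sum of independent variables is the convolution of their densities, we get
$$
f_V(x) =\int_0^x\, \frac{1}{2}\,{\rm e}^{-\frac{u}{2}} \cdot \frac{1}{2}\,{\rm e}^{-\frac{(x-u)}{2}}\, {\rm d}u =\frac{1}{4}\,{\rm e}^{-\frac{x}{2}}\,\int_0^x\, {\rm d}u =\frac{x}{4}\,{\rm e}^{-\frac{x}{2}} \;\text{,}
$$
which is the standard density of a   $\displaystyle{ \chi^2 }$   variable with   $4 =n_1 +n_2$   degrees of freedom, in agreement with the theorem.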

We use this theorem to find the distribution of the sum of squares of a set of independent standard normal random variables.   To this end, let   $Z$   be a standard normal variable. Then the MGF of   $\displaystyle{ Z^2 }$   is
$$
\begin{array}{rl}
M_{Z^2}(t) & =\int_{-\infty}^{\infty}\, {\rm e}^{t\, x^2}\,\frac{{\rm e}^{-\left( \frac{x^2}{2} \right)}}{\sqrt{2\,\pi}}\, {\rm d}x \\
& =\frac{1}{\sqrt{2\,\pi}} \,\int_{-\infty}^{\infty}\, {\rm e}^{-\left( \frac{x^2}{2} \right)\,(1-2\, t)}\, {\rm d}x \;\text{.}
\end{array}
$$
Letting   $y =x\,\sqrt{1-2\,t}\,$   (which requires   $t \lt \frac{1}{2}$ ), this integral reduces to
$$
\begin{array}{rl}
M_{Z^2}(t) & =(1-2\, t)^{-\frac{1}{2}}\, \int_{-\infty}^{\infty}\, \frac{{\rm e}^{-\left( \frac{y^2}{2} \right)}}{\sqrt{2\,\pi}}\, {\rm d}y \\
& =(1-2\, t)^{-\frac{1}{2}} \;\text{.}
\end{array}
$$

This is the MGF of a   $\displaystyle{ \chi^2 }$   variable with one degree of freedom.   Thus, using the above theorem, we have that the sum of squares of a set of   $n$   independent standard normal variables will be a   $\displaystyle{ \chi^2 }$   variable with   $n$   degrees of freedom.   We will now use this in considering several problems relating to the variance of a normal distribution.
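
The claim above can be checked informally by simulation.   The following is a minimal sketch, not a proof: it draws sums of squares of   $n$   independent standard normals and compares their sample mean and variance with   $n$   and   $2\,n\,$ , the standard mean and variance of a   $\displaystyle{ \chi^2 }$   variable with   $n$   degrees of freedom (facts assumed here, not derived in the text).   The function names, seed, and trial count are illustrative choices.

```python
import random


def sum_of_squared_normals(n, rng):
    """Draw n independent standard normals and return the sum of their squares."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n))


def simulate_moments(n, trials=20000, seed=0):
    """Return the sample mean and sample variance of `trials` simulated sums."""
    rng = random.Random(seed)
    draws = [sum_of_squared_normals(n, rng) for _ in range(trials)]
    mean = sum(draws) / trials
    var = sum((d - mean) ** 2 for d in draws) / (trials - 1)
    return mean, var


mean, var = simulate_moments(n=5)
# For a chi-square variable with 5 degrees of freedom we expect
# values near 5 (mean) and 10 (variance), up to sampling noise.
print(f"sample mean     {mean:.3f}")
print(f"sample variance {var:.3f}")
```

Matching two moments does not identify the distribution, so this only serves as a sanity check on the MGF argument.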

The Distribution of   $\displaystyle{ \sum_{k=1}^n\, \left( X_k -\mu\right)^2 }$

If   $X$   is a normal random variable with mean   $\mu$   and variance   $\sigma^2\,$ , consider a random sample   $X_1\,$ , … , $\,X_n$   of size   $n\;$ .   Then   $\displaystyle{ Z_1 =\frac{X_1 -\mu}{\sigma}}$ , … , $\displaystyle{ Z_n =\frac{X_n -\mu}{\sigma}}$   are independent standard normal variables.   It follows that
$$ V =\sum_{k=1}^n\, \left( \frac{X_k -\mu}{\sigma} \right)^2 =\sum_{k=1}^n\, Z_k^2 $$
is a   $\displaystyle{ \chi^2 }$   variable with   $n$   degrees of freedom.

Theorem:   If   $X$   is a normal random variable with mean   $\mu$   and variance   $\sigma^2\,$ , and   $X_1\,$ , … , $\,X_n$   is a random sample of size   $n$   of   $X\,$ , then   $ V =\sum_{k=1}^n\, \left( X_k -\mu\right)^2\big/\sigma^2 $   is a   $\displaystyle{ \chi^2 }$   variable with   $n$   degrees of freedom.

We can use this to address various problems concerning the parameter   $\sigma$   of a normal random variable.

Note that the quantity   $ \sum_{k=1}^n\, \left( X_k -\mu\right)^2\big/ n $   looks a lot like the sample variance, and thus it will be useful for estimating   $\sigma^2$   when   $\mu$   is known.   If   $\mu$   is not known, we will need to replace it by the sample mean   $\bar{X}\;$ .   The resulting random variable
$$ V =\sum_{k=1}^n\, \left( \frac{X_k -\bar{X}}{\sigma} \right)^2 $$
would naturally be expected to be approximately a   $\displaystyle{ \chi^2 }$   random variable with   $n$   degrees of freedom, with the approximation improving as   $n$   gets larger.   In fact,   $V$   is exactly a   $\displaystyle{ \chi^2 }$   random variable, but with   $n-1$   degrees of freedom (a fact that will be justified later).   Thus we will use
$$ \sum_{k=1}^n\, \left( \frac{X_k -\mu}{\sigma} \right)^2 $$
when   $\mu$   is available, and
$$ \sum_{k=1}^n\, \left( \frac{X_k -\bar{X}}{\sigma} \right)^2 $$
when it’s not.
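
The loss of one degree of freedom can be seen informally in a simulation.   The sketch below assumes the standard fact that a   $\displaystyle{ \chi^2 }$   variable with   $m$   degrees of freedom has mean   $m\,$ , and compares the average of the sum centered at   $\mu$   with the average of the sum centered at   $\bar{X}\;$ .   The parameter values   $\mu =12\,$ , $\sigma =3\,$ , the seed, and the function names are arbitrary illustrative choices, and checking the mean alone does not establish the full distribution.

```python
import random


def scaled_sum_of_squares(sample, center, sigma):
    """Sum of ((x - center) / sigma)^2 over the sample."""
    return sum(((x - center) / sigma) ** 2 for x in sample)


def average_v(n, mu=12.0, sigma=3.0, trials=20000, seed=0):
    """Average the statistic over many samples, centered at mu and at the sample mean."""
    rng = random.Random(seed)
    total_known_mean = 0.0
    total_sample_mean = 0.0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        total_known_mean += scaled_sum_of_squares(sample, mu, sigma)
        total_sample_mean += scaled_sum_of_squares(sample, xbar, sigma)
    return total_known_mean / trials, total_sample_mean / trials


with_mu, with_xbar = average_v(n=5)
# Expect roughly n = 5 when centering at mu and roughly n - 1 = 4
# when centering at the sample mean, up to sampling noise.
print(f"centered at mu:   {with_mu:.3f}")
print(f"centered at xbar: {with_xbar:.3f}")
```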

Example 1

Suppose that   $X$   is normal with mean   $12$   and unknown variance   $\sigma^2\;$ .   Consider the problem of estimating this variance using a random sample of size   $25\;$ .   The quantity
$$ \sum_{k=1}^{25}\, \frac{\left( X_k -12 \right)^2}{25} $$
will be used to estimate   $\sigma^2\;$ .   What is the probability that this estimate will not be off by more than   $10\%\;$ ?

Our accuracy requirement can be expressed as
$$
0.9\,\sigma^2 \lt \sum_{k=1}^{25}\, \frac{\left( X_k -12 \right)^2}{25} \lt 1.1\,\sigma^2 \;\text{.}
$$
Multiplying through by   $25/\sigma^2\,$ , this is equivalent to
$$ 22.5 \lt \sum_{k=1}^{25}\, \left( \frac{X_k -12}{\sigma} \right)^2 \lt 27.5 \;\text{.} $$
In terms of our discussion, this is
$$ 22.5 \lt V \lt 27.5 \,\text{,} $$
where   $V$   is a   $\displaystyle{ \chi^2 }$   random variable with   $25$   degrees of freedom.   Now   $\displaystyle{ \chi^2 }$   distribution functions cannot, in general, be expressed as simple combinations of classical functions; their values can only be approximated numerically.   For this reason, spreadsheets have been designed with approximate values of   $\displaystyle{ \chi^2 }$   densities and distributions built in.   We can obtain the probability that   $V$   is between   $22.5$   and   $27.5$   using a spreadsheet.   The command “=CHISQDIST(22.5,25,1)” gives that the probability of a   $\displaystyle{ \chi^2 }$   random variable with   $25$   degrees of freedom being at most   $22.5$   is   $0.3933\, $ , and similarly the probability of being at most   $27.5$   is   $0.6686\; $ .   Thus the probability that such a variable takes values between   $22.5$   and   $27.5$   is   $0.6686 -0.3933 \approx 0.275\; $ .   This is the probability that our estimate of   $\sigma^2$   is off by less than   $10\%\;$ .
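
For readers without a spreadsheet, the same probability can be approximated by integrating the density numerically.   The following is a minimal sketch, assuming the standard   $\displaystyle{ \chi^2 }$   density   $x^{n/2-1}\,{\rm e}^{-x/2}\big/\left( 2^{n/2}\,\Gamma(n/2) \right)$   (not stated in the text above) and using composite Simpson's rule; the function names and step count are illustrative choices, not part of any spreadsheet or library.

```python
import math


def chi2_pdf(x, df):
    """Density of a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 0.0
    half = df / 2.0
    # Work with logarithms to avoid overflow in the gamma function.
    log_pdf = ((half - 1.0) * math.log(x) - x / 2.0
               - half * math.log(2.0) - math.lgamma(half))
    return math.exp(log_pdf)


def simpson(f, a, b, steps=2000):
    """Composite Simpson's rule on [a, b]; steps must be even."""
    h = (b - a) / steps
    total = f(a) + f(b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3.0


def chi2_prob_between(lo, hi, df):
    """Approximate P(lo < V < hi) for a chi-square variable V with df degrees of freedom."""
    return simpson(lambda x: chi2_pdf(x, df), lo, hi)


p = chi2_prob_between(22.5, 27.5, 25)
print(f"P(22.5 < V < 27.5) is approximately {p:.4f}")
```

The result should agree with the spreadsheet value to within the rounding used in the text.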

Example 2

Suppose now that we do not know the actual mean of our normal variable, and must use the sample mean.   That is, we use
$$ \sum_{k=1}^n\, \left( \frac{X_n -\bar{X}}{\sigma} \right)^2 $$
to address problems concerning   $\sigma^2\;$ .

By way of illustration, suppose that a sample of size   $20$   is taken of a normal variable   $X\;$ .   What is the probability that the sample variance will be more than   $25\%$   larger than   $\sigma^2\;$ ?   We represent this as the probability that
$$ \sum_{k=1}^{20}\, \frac{\left( X_k -\bar{X} \right)^2}{19} \gt 1.25\,\sigma^2 \;\text{.} $$
This is equivalent to
$$ \sum_{k=1}^{20}\, \frac{\left( X_k -\bar{X} \right)^2}{\sigma^2} \gt 23.75 \;\text{.} $$

As noted above, the left-hand side is a   $\displaystyle{ \chi^2 }$   random variable with   $19$   degrees of freedom.   Again using a spreadsheet we find the probability that
$$ \sum_{k=1}^{20}\, \frac{\left( X_k -\bar{X} \right)^2}{\sigma^2} \le 23.75 $$
to be   $0.7941$   via the command “=CHISQDIST(23.75,19,1)” .   Thus the desired probability is   $1 -0.7941 =0.2059\;$ .   That is, there is an approximately   $20\%$   chance that our sample variance will exceed the actual variance by more than   $25\%\;$ .
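
The spreadsheet value can also be reproduced directly.   The sketch below assumes the standard identity that the   $\displaystyle{ \chi^2 }$   distribution function with   $n$   degrees of freedom is the regularized lower incomplete gamma function   $P(n/2,\, x/2)\,$ , and evaluates it with its power series; `chi2_cdf` and `prob_sample_variance_exceeds` are hypothetical helper names, not spreadsheet or library functions.

```python
import math


def chi2_cdf(x, df):
    """P(V <= x) for a chi-square variable V with df degrees of freedom.

    Uses the series P(a, z) = e^{-z} z^a / Gamma(a) * sum_{k>=0} z^k / (a (a+1) ... (a+k))
    with a = df / 2 and z = x / 2.
    """
    if x <= 0:
        return 0.0
    a = df / 2.0
    z = x / 2.0
    term = 1.0 / a
    total = term
    k = 1
    # Terms eventually shrink because z / (a + k) tends to zero.
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    return total * math.exp(-z + a * math.log(z) - math.lgamma(a))


def prob_sample_variance_exceeds(ratio, n):
    """P(S^2 > ratio * sigma^2) for a normal sample of size n with unknown mean."""
    df = n - 1
    return 1.0 - chi2_cdf(ratio * df, df)


p = prob_sample_variance_exceeds(1.25, 20)
print(f"P(sample variance > 1.25 sigma^2) is approximately {p:.4f}")
```

The threshold `ratio * df` is the   $1.25 \cdot 19 =23.75$   computed above, so the same helper covers other sample sizes and ratios.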

Example 3

As a different type of example, consider the problem of trying to deliver supplies to a colony on Mars.   What is the probability that a delivery will miss the drop-spot by more than a mile, assuming that the north-south and east-west errors are independently normally distributed with mean zero and common standard deviation of   $0.25\,{\rm mi}\;$ ?

For convenience, we will measure in units of quarter miles.   Let   $N$   denote the north-south displacement of our delivery, and   $W$   the east-west displacement.   Our choice of units means that   $N$   and   $W$   are independent standard normal variables.   The square of the total displacement is
$$ N^2 +W^2 \,\text{,} $$
and possesses a   $\displaystyle{ \chi^2 }$   distribution with   $2$   degrees of freedom.   Thus the desired probability is
$$ P\left(\sqrt{ N^2 +W^2 } \gt 4\right) =P\left( N^2 +W^2 \gt 16 \right) \;\text{.} $$
We again use a spreadsheet with the command “=CHISQDIST(16,2,1)” to find that the probability that a   $\displaystyle{ \chi^2 }$   distribution with   $2$   degrees of freedom takes on values no greater than   $16$   is   $0.9997\;$ .   The probability that our delivery misses its drop-spot by more than a mile is thus   $1-0.9997 =0.0003\,$ , so that there is about a   $0.03\%$   chance of missing the drop-spot by over a mile.
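
With   $2$   degrees of freedom no spreadsheet is strictly needed.   The following sketch assumes the standard fact (not derived in the text) that a   $\displaystyle{ \chi^2 }$   variable   $V$   with   $2$   degrees of freedom satisfies   $P(V \gt x) ={\rm e}^{-x/2}\,$ ; the function name `miss_probability` and its parameters are illustrative.

```python
import math


def miss_probability(radius, sigma):
    """P(miss distance > radius) when the two coordinate errors are
    independent normal variables with mean zero and standard deviation sigma.

    (N^2 + W^2) / sigma^2 is chi-square with 2 degrees of freedom,
    whose upper tail at x is exp(-x / 2).
    """
    x = (radius / sigma) ** 2
    return math.exp(-x / 2.0)


p = miss_probability(radius=1.0, sigma=0.25)
print(f"P(miss by more than a mile) is approximately {p:.6f}")
```

Here   $x =(1/0.25)^2 =16\,$ , so the result is   ${\rm e}^{-8}\,$ , consistent with the value   $0.0003$   found above.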

There are other interesting and important applications of   $\displaystyle{ \chi^2 }$   random variables, and some of these will be introduced later.