Small Sample Sizes

In the preceding sections on confidence intervals, we assumed that sample sizes were large enough to justify replacing unknown variances by their sample estimates.   If sample sizes are small, we need a different method.   This method is based on Student's   $\displaystyle{ T_{\nu} }$   variable, where   $\nu$   is a parameter determined by the size of the sample.   It is a continuous random variable introduced when we originally discussed important continuous random variables, and it has density function
$$
T_{\nu}(x) =\frac{\Gamma\left( \frac{\nu +1}{2} \right)}{ \sqrt{\nu\,\pi}\,\Gamma\left( \frac{\nu}{2} \right) } \, \left( 1+\frac{x^2}{\nu}\right)^{-\frac{\nu +1}{2}} \;\text{.}
$$
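As a quick numerical sanity check (not part of the original discussion), here is a short Python sketch comparing the density formula above with SciPy's built-in $T$ distribution; the evaluation point   $x=1$   and   $\nu =5$   are arbitrary illustrative choices.

```python
# A minimal check of the density formula above against scipy's implementation.
import math
from scipy import stats

def t_density(x, nu):
    """Student's T density with nu degrees of freedom, from the formula above."""
    coeff = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coeff * (1 + x * x / nu) ** (-(nu + 1) / 2)

print(t_density(1.0, 5), stats.t.pdf(1.0, df=5))  # the two printed values agree
```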
This variable is usually defined by
$$ T_\nu =\frac{Z}{V}\,\sqrt{\nu} \,\text{,} $$
where   $Z$   is standard normal and   $\displaystyle{ V^2 }$   is a   $\displaystyle{ {\chi^2}_{\nu} }$   random variable, i.e. a   $\displaystyle{ \chi^2 }$   variable with   $\nu$   degrees of freedom.   These variables must be
independently distributed.
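
To see this definition in action, the following Python sketch (again, an illustration rather than anything from the original text) simulates   $Z$   and   $V$   independently and compares a simulated tail probability of   $Z\,\sqrt{\nu}/V$   with the exact   $T_5$   value; the cutoff $2.0$ and the sample count are arbitrary choices.

```python
# A quick Monte Carlo check of the definition T_nu = (Z / V) * sqrt(nu),
# with Z standard normal and V^2 chi-square with nu degrees of freedom,
# generated independently of one another.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
nu = 5
z = rng.standard_normal(100_000)
v = np.sqrt(rng.chisquare(nu, size=100_000))
t_samples = z * np.sqrt(nu) / v

# The simulated tail probability should be close to the exact T_5 value.
print((t_samples > 2.0).mean(), stats.t.sf(2.0, df=nu))
```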

We will usually just say “Student’s T variable”, and use   $T\,$ , when speaking generally or when the sample size is understood.

For our purposes, the formula for the   $T$   density is not essential.   It will be derived later, so that   $T$   variables will be seen to arise naturally rather than as arbitrarily imposed quantities.   More important for now is that when   $\nu$   is fairly large, say   $\nu\gt 30\,$ , the graph of the   $T$   density function differs very little from that of the standard normal.   Even for fairly small values of   $\nu$   the graphs do not differ by much.   This was seen when the   $T$   variable was first mentioned, but the tool below allows you to explore the comparison yourself.

JSXGRAPH TOOL: interactive comparison of the $T_{\nu}$ density with the standard normal density as $\nu$ varies
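
For readers without access to the interactive tool, a rough numerical version of the same comparison can be done in Python; the grid of points and the values of   $\nu$   below are arbitrary choices.

```python
# Maximum gap between the T_nu density and the standard normal density
# over a grid of points, for several values of nu.
import numpy as np
from scipy import stats

xs = np.linspace(-4, 4, 401)
for nu in (2, 5, 10, 30):
    gap = np.max(np.abs(stats.t.pdf(xs, df=nu) - stats.norm.pdf(xs)))
    print(nu, round(float(gap), 4))  # the gap shrinks as nu grows
```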

Just as with   $\displaystyle{ \chi^2 }$   variables, the number   $\nu$   is called the number of degrees of freedom for a   $T$   variable.   Neither the density nor the distribution function for   $T$   variables is particularly convenient.   Instead, it is common to approximate these numerically using either computer algebra systems or spreadsheets.   For most of the last century the values of both of these functions for various   $\nu$   were compiled in books, but such tables are now neither necessary nor convenient.

Using a spreadsheet, one can find values
$$ P\left( T_{\nu}\gt A \right) $$
for   $A\ge 0$   using the command “=TDIST(A, $\,\nu\,$ , 1)” and
$$ P\left( \big| T_{\nu}\big|\gt A \right) $$
using the command “=TDIST(A, $\,\nu\,$ , 2)”.   The final argument (1 or 2) gives the number of tails considered; these tail probabilities are used to obtain one- or two-sided confidence intervals.   These regions are illustrated below for   $\nu =5\;$ .   The symmetry of the density function makes it evident that   $P\left( \big| T_{\nu}\big|\gt A \right) =2\,P\left( T_{\nu}\gt A \right)\;$ .
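
For those working outside a spreadsheet, the same two probabilities can be computed with SciPy's survival function; the values   $A=1.5$   and   $\nu =5$   below are illustrative only.

```python
# One- and two-tailed probabilities for a T variable, analogous to the
# spreadsheet commands =TDIST(A, nu, 1) and =TDIST(A, nu, 2).
from scipy import stats

nu = 5    # degrees of freedom (matching the illustration below)
A = 1.5   # an illustrative cutoff

one_tailed = stats.t.sf(A, df=nu)      # P(T_nu > A)
two_tailed = 2 * stats.t.sf(A, df=nu)  # P(|T_nu| > A), by symmetry

print(one_tailed, two_tailed)
```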

ILLUSTRATION w/ TWO PICS

As we will use these random variables to construct two-sided confidence intervals for means (or differences of two means) when sample sizes are small, we will need to know   $A$   values for which
$$ P\left( \big| T_{\nu}\big|\gt A \right) =B $$
for one of the values   $B=0.2\,$ , $\,0.1\,$ , $\,0.05\,$ , $\,0.02\,$ , $\,0.01\;$ .   These are the   $B$   values corresponding to two-sided   $80\%$   (or   $90\%\,$ , $\, 95\%\,$ , $\, 98\%\,$ , or   $99\%\,$ ) confidence intervals.   We find the corresponding   $A$   value within spreadsheets using the command “=TINV(B, $\nu$ )”.

Example: Find   $A$   such that   $\displaystyle{ P\left( \big| T_{10}\big|\gt A \right) =.1 }\;$ .

We can find this   $A$ value directly by means of “=TINV(.1, 10 )”, and get   $1.8124611228\;$ . That is, the shaded region in the diagram below has area   $0.1\;$ .

IMAGE: two tailed T_10 dist shaded at ends

the area under the graph for   $|x| \gt 1.8125$   is approximately   $0.1$
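
The same critical value can be obtained outside a spreadsheet; the sketch below uses SciPy's inverse survival function, which plays the role of “=TINV” once   $B$   is halved to account for the two tails.

```python
# The spreadsheet command =TINV(B, nu) returns A with P(|T_nu| > A) = B.
# In scipy this is the inverse survival function evaluated at B/2.
from scipy import stats

B, nu = 0.1, 10
A = stats.t.isf(B / 2, df=nu)
print(A)                          # approximately 1.8125
print(2 * stats.t.sf(A, df=nu))   # recovers B = 0.1
```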

For one-sided confidence intervals, we need the observation, due to the symmetry of the   $T$   densities, that   $\displaystyle{ P\left( \big| T_{\nu}\big|\gt A \right) =2\,B }$   precisely when   $\displaystyle{ P\left( T_{\nu}\gt A \right) =B }\;$ .

Example: Find   $A$   such that   $\displaystyle{ P\left( T_{10}\gt A \right) =0.05 }\;$ .

Using our observation, we note that the   $A$   value satisfying   $\displaystyle{ P\left( T_{10}\gt A \right) =0.05 }$   is precisely the   $A$   value satisfying   $\displaystyle{ P\left( \big|T_{10}\big|\gt A \right) =0.1 }$ .   We found this by means of “=TINV(.1, 10 )” to be   $1.8124611228\;$ .   The shaded region in the diagram below has area   $0.05\;$ .

IMAGE: one tailed T_10 dist shaded at right end

the area under the graph for   $x \gt 1.8125$   is approximately   $0.05$
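
For completeness, the one-sided critical value can also be computed directly, without passing through the two-sided one, by applying SciPy's inverse survival function to   $0.05$   itself.

```python
# One-sided critical value: the A with P(T_10 > A) = 0.05.
from scipy import stats

print(stats.t.isf(0.05, df=10))   # the same value, approximately 1.8125
```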

Confidence Intervals for Small Samples

Confidence Interval for a Normal Mean

In order to apply the formula
$$ T =\frac{Z}{V}\,\sqrt{\nu} $$
to the problem of finding a confidence interval for the mean of a small sample, we choose
$$ Z=\frac{\bar{X} -\mu}{\sigma_{\bar{X}}} =\frac{\bar{X}-\mu}{\sigma}\,\sqrt{n} $$
and
$$ V^2=\frac{\sum_{k=1}^n\,\left( X_k -\bar{X}\right)^2}{\sigma^2} \;\text{.} $$

As   $X$   is a normal variable, $\,\bar{X}$   is also a normal variable, and   $Z$   is thus a standard normal variable.   Moreover, as stated earlier, $\,V^2$   has a   $\chi^2$   distribution with   $n-1$   degrees of freedom.   It can be shown (and will be shown later) that these two variables are independently distributed.   Thus, with   $\nu =n-1\,$ , we have
$$
T =\frac{\left(\bar{X} -\mu\right)\,\sqrt{n}}{\sqrt{\sum_{k=1}^n\,\left( X_k -\bar{X}\right)^2}}\,\sqrt{n-1} \;\text{.}
$$
In terms of the sample variance
$$ S^2 =\frac{\sum_{k=1}^n\,\left( X_k -\bar{X}\right)^2}{n-1} \,\text{,} $$
this reads
$$ T =\frac{\left(\bar{X} -\mu\right)}{S}\,\sqrt{n} \;\text{.} $$

If   $S$   were replaced by   $\sigma\,$ , $\,T$   would be standard normal.   It is thus not surprising that the distribution of   $T$   is very similar to that of a standard normal variable, provided that the sample size is not very small.   The big advantage of   $T$   is that its distribution is known exactly; we no longer have to pretend that replacing   $\sigma$   by its sample estimate   $S$   leaves us with a standard normal variable, so the error of that approximation is eliminated.   The method does require that   $X$   be a normal variable, although experiments show that even when   $n$   is small, moderate deviations from normality do not have much effect on the distribution of   $T\;$ .   And for large values of   $n\,$ , even large deviations from normality cause little harm.
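
To see the whole procedure in one place, here is a minimal Python sketch of the resulting confidence interval computation.   The data values are purely hypothetical; only the interval   $\bar{X}\pm A\,S/\sqrt{n}\,$ , with   $A$   taken from the   $T_{n-1}$   distribution, reflects the discussion above.

```python
# Two-sided t confidence interval for a normal mean from a small sample.
# The data values below are purely illustrative.
import math
from scipy import stats

data = [5.2, 4.9, 5.5, 5.1, 4.8, 5.3]   # hypothetical sample
n = len(data)
xbar = sum(data) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))  # sample std. dev.

conf = 0.90
A = stats.t.isf((1 - conf) / 2, df=n - 1)   # P(|T_{n-1}| > A) = 1 - conf

half_width = A * s / math.sqrt(n)
print(f"{conf:.0%} CI for the mean: ({xbar - half_width:.3f}, {xbar + half_width:.3f})")
```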

Example

A buffering agent is being tested to see if it helps with drug absorption and efficacy. Find a   $90\%$   confidence interval for   $\mu\;$ .

First we find   $A$   so that   $P( |T| \gt A ) =1-0.9 =0.1\;$ .   Given our sample size, this is

???

Thus the probability is   $0.9$   that
$$ -A \lt \frac{\left(\bar{X} -\mu\right)}{S}\,\sqrt{n} \lt A \;\text{.} $$
This is equivalent to
$$ \bar{X} -A\,\frac{S}{\sqrt{n}} \lt \mu \lt \bar{X} +A\,\frac{S}{\sqrt{n}} \;\text{.} $$

Using our numerical values ***** and ***** , we have
$$ ***** \,\text{,} $$
or
$$ ***** \;\text{.} $$
This is our desired confidence interval.

Conclusion

Compare this with the large-sample method: the procedure is the same, except that the critical value   $A$   is taken from the   $T_{n-1}$   distribution rather than from the standard normal.

Confidence Interval for the Difference of Two Means

Suppose that   $X$   and   $Y$   are two independent, normally distributed variables with means   $\displaystyle{ \mu_X }$   and   $\displaystyle{ \mu_Y }\,$ , and that they have a common variance   $\sigma^2\;$ .   Suppose further that we have random samples of sizes   $\displaystyle{ n_X }$   and   $\displaystyle{ n_Y }$   from the two populations.   Denote by   $\bar{X}\,$ , $\,\bar{Y}\,$ , $\,\displaystyle{ S_X^2 }\,$ , and   $\displaystyle{ S_Y^2 }$   the sample means and variances.   We now apply our formula

$$ T =\frac{Z}{V}\,\sqrt{\nu} $$

to the problem of finding a small sample confidence interval for   $\displaystyle{ \mu_X -\mu_Y }\;$ .

Choose
$$ Z=\frac{\left(\bar{X} -\bar{Y}\right) -\left( \mu_X -\mu_Y\right)}{\sigma\,\sqrt{\frac{1}{n_X} +\frac{1}{n_Y}}} $$
and
$$ V^2 =\frac{\sum_{k=1}^{n_X}\,\left( X_k -\bar{X}\right)^2 +\sum_{k=1}^{n_Y}\,\left( Y_k -\bar{Y}\right)^2}{\sigma^2} \;\text{.} $$
Since
$$ \frac{\sum_{k=1}^{n_X}\,\left( X_k -\bar{X}\right)^2}{\sigma^2} $$
and
$$ \frac{\sum_{k=1}^{n_Y}\,\left( Y_k -\bar{Y}\right)^2}{\sigma^2} $$
are each sums of squared deviations of a normal sample divided by the population variance, these two quantities possess   $\chi^2$   distributions with   $\displaystyle{ n_X -1 }$   and   $\displaystyle{ n_Y -1 }$   degrees of freedom, respectively.   Also, they are independently distributed because the data came from independent experiments.   Thus their sum   $V^2$   possesses a   $\chi^2$   distribution with   $\displaystyle{ \nu =n_X +n_Y -2 }$   degrees of freedom.   With the pooled sample variance
$$ S_p^2 =\frac{\sum_{k=1}^{n_X}\,\left( X_k -\bar{X}\right)^2 +\sum_{k=1}^{n_Y}\,\left( Y_k -\bar{Y}\right)^2}{n_X +n_Y -2} \,\text{,} $$
the resulting variable,
$$ T =\frac{Z}{V}\,\sqrt{\nu} =\frac{\left(\bar{X} -\bar{Y}\right) -\left( \mu_X -\mu_Y\right)}{S_p\,\sqrt{\frac{1}{n_X} +\frac{1}{n_Y}}} \,\text{,} $$
has   $\displaystyle{ n_X +n_Y -2 }$   degrees of freedom.
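
Here is a minimal Python sketch of the corresponding two-sample computation under the equal-variance assumption made above.   The two samples are purely hypothetical; the interval is   $\left(\bar{X} -\bar{Y}\right)\pm A\,S_p\,\sqrt{\tfrac{1}{n_X} +\tfrac{1}{n_Y}}$   with   $A$   taken from the   $T_{n_X +n_Y -2}$   distribution.

```python
# Two-sided t confidence interval for mu_X - mu_Y with pooled variance,
# assuming equal population variances.  The samples below are illustrative only.
import math
from scipy import stats

x = [7.1, 6.8, 7.4, 7.0, 6.9]        # hypothetical sample from X
y = [6.2, 6.5, 6.1, 6.6, 6.3, 6.4]   # hypothetical sample from Y

nx, ny = len(x), len(y)
xbar, ybar = sum(x) / nx, sum(y) / ny
ssx = sum((v - xbar) ** 2 for v in x)
ssy = sum((v - ybar) ** 2 for v in y)
sp = math.sqrt((ssx + ssy) / (nx + ny - 2))   # pooled standard deviation

conf = 0.90
nu = nx + ny - 2
A = stats.t.isf((1 - conf) / 2, df=nu)        # P(|T_nu| > A) = 1 - conf

half_width = A * sp * math.sqrt(1 / nx + 1 / ny)
diff = xbar - ybar
print(f"{conf:.0%} CI for mu_X - mu_Y: ({diff - half_width:.3f}, {diff + half_width:.3f})")
```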

Example XXX

A buffering agent is being tested to see if it helps with drug absorption and efficacy. Find a   $90\%$   confidence interval for   $\displaystyle{ \mu_X -\mu_Y }\;$ .

First we find   $A$   so that   $P( |T| \gt A ) =1-0.9 =0.1\;$ .   Given our sample sizes, this is

Thus the probability is   $0.9$   that
$$ -A \lt \frac{\left(\bar{X} -\bar{Y}\right) -\left( \mu_X -\mu_Y\right)}{S_p\,\sqrt{\frac{1}{n_X} +\frac{1}{n_Y}}} \lt A \;\text{.} $$
This is equivalent to
$$ \left(\bar{X} -\bar{Y}\right) -A\,S_p\,\sqrt{\tfrac{1}{n_X} +\tfrac{1}{n_Y}} \lt \mu_X -\mu_Y \lt \left(\bar{X} -\bar{Y}\right) +A\,S_p\,\sqrt{\tfrac{1}{n_X} +\tfrac{1}{n_Y}} \;\text{.} $$

Using our numerical values ***** and ***** , we have
$$ ***** \,\text{,} $$
or
$$ ***** \;\text{.} $$
This is our desired confidence interval.

Conclusion

Compare this with the large-sample method: the procedure is the same, except that the critical value   $A$   is taken from the   $T_{n_X +n_Y -2}$   distribution rather than from the standard normal.