Estimating Means and Variances in General

The technique introduced earlier for constructing confidence intervals for the mean of a normal variable from sample data can be adapted to construct confidence intervals for the means of other types of random variables.   If the sample size is sufficiently large, the central limit theorem justifies the same construction for the mean of a non-normal variable.   In particular, the method can be used to find a confidence interval for a binomial or Bernoulli parameter   $p\;$ .   Throughout we will assume that the sample size is large enough to justify the use of the central limit theorem for means.   The following examples illustrate the point.

Example 1: A sample of size   $80$   from an unknown distribution yields mean and standard deviation   $\bar{x} =51$   and   $s=7\;$ .   Find a   $90\%$   confidence interval for the mean   $\mu$   of the distribution.

Since   $80 \gt 30\,$ , the sample is sufficiently large to treat   $\bar{X}$   as approximately normal, and since   $s$   from a sample of this size is likely to be a fairly good point estimate of   $\sigma\,$ , an approximate   $90\%$   confidence interval for   $\mu$   is given by
$$ \bar{x} -1.64\,\frac{s}{\sqrt{n}} \lt \mu \lt \bar{x} +1.64\,\frac{s}{\sqrt{n}} \;\text{.} $$
Substitution of the sample values gives
$$ 51 -1.64\,\frac{7}{\sqrt{80}} \lt \mu \lt 51 +1.64\,\frac{7}{\sqrt{80}} \,\text{,} $$
or
$$ 49.72 \lt \mu \lt 52.28 \;\text{.} $$
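For readers who want to check the arithmetic, here is a minimal Python sketch of the computation in Example 1, using only the sample statistics given above (the variable names are ours).

```python
# Minimal sketch of Example 1: a large-sample 90% confidence interval for the mean,
# assuming the sample statistics n = 80, x_bar = 51, s = 7 given in the text.
import math

n, x_bar, s = 80, 51.0, 7.0
z = 1.64                                   # z-value for 90% confidence, as used in the text

margin = z * s / math.sqrt(n)              # half-width of the interval
lower, upper = x_bar - margin, x_bar + margin
print(f"90% CI for the mean: ({lower:.2f}, {upper:.2f})")   # roughly (49.72, 52.28)
```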

Example 2: A random sample of   $50$   students showed that   $35$   of them worked at least   $5$   hours per week at some job.   Use this data to find a $95\%$   confidence interval for the proportion   $p$   of all students who work at least   $5$   hours per week at some job.

Since   $50 \gt 30\,$ , the sample is sufficiently large to use the normal approximation to this binomial random variable.   The desired   $95\%$   confidence interval is of the form
$$ \hat{p} -1.960\,s_{\hat{p}} \lt p \lt \hat{p} +1.960\,s_{\hat{p}} \;\text{.} $$
Since
$$\hat{p} =\frac{35}{50} =0.7 $$
and
$$
s_{\hat{p}} =\sqrt{\frac{\hat{p}\,\left( 1-\hat{p} \right)}{50}} =\sqrt{\frac{(.7)(.3)}{50}} \doteq 0.065 \,\text{,}
$$
this becomes
$$ 0.7 -(1.960)(0.065) \lt p \lt 0.7 +(1.960)(0.065) $$
or
$$ 0.5726 \lt p \lt 0.8274 \;\text{.} $$

Here we used an estimate of the standard deviation derived from the sample proportion, since knowing the mean of a Bernoulli process determines its standard deviation.
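The same calculation can be sketched in Python, assuming only the counts from Example 2.

```python
# Minimal sketch of Example 2: a large-sample 95% confidence interval for a
# Bernoulli parameter p, using the sample proportion 35/50 from the text.
import math

n, successes = 50, 35
p_hat = successes / n                       # point estimate of p
s_p = math.sqrt(p_hat * (1 - p_hat) / n)    # estimated standard deviation of p_hat
z = 1.960                                   # z-value for 95% confidence

lower, upper = p_hat - z * s_p, p_hat + z * s_p
# About (0.57, 0.83); the text's 0.5726 and 0.8274 use s_p rounded to 0.065.
print(f"95% CI for p: ({lower:.4f}, {upper:.4f})")
```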

A common problem that arises in many situations is that of comparing two different processes to determine whether one is preferable.   For example, consider the problem of comparing two toothpastes for effectiveness.   Such a comparison can often be made by comparing means or proportions.   If the variable under consideration is continuous, one may wish to compare means, and if interest is in the proportion of positive outcomes, the comparison becomes one of proportions.   Such comparisons can be carried out by finding confidence intervals for the difference of two means or proportions.   Here we will examine the method for finding such confidence intervals.

Estimating the Difference of Two Means

To estimate the difference of two means and determine the accuracy of the estimate, it is necessary to obtain the distribution of the difference of the two sample means, which will be denoted by   $\bar{X}$   and   $\bar{Y}\;$ .

We will assume for now that   $\bar{X}$   and   $\bar{Y}$   are based on samples of sizes   $n_X$   and   $n_Y\,$ , respectively, and that   $\bar{X}$   and   $\bar{Y}$   are independently normally distributed.   With this assumption, $\bar{X} -\bar{Y}$   possesses a normal distribution (explained below).   We have the following.

Theorem: If   $\bar{X}$   and   $\bar{Y}$   are independent normal variables then   $\bar{X} -\bar{Y}$   is normally distributed with
$$ \mu_{\bar{X} -\bar{Y}} =\mu_X -\mu_Y $$
and
$$ \sigma_{\bar{X} -\bar{Y}}^2 =\sigma_{\bar{X}}^2 +\sigma_{\bar{Y}}^2 \;\text{.} $$

Why?

As discussed earlier, if   $X$   and   $Y$   are independent random variables then
$$ M_{X-Y}(t) =M_X(t)\, M_{-Y}(t) \;\text{.} $$
Also, if   $X$   and   $Y$   are both normal and   $\bar{X}$   and   $\bar{Y}$   are means of samples of sizes   $n_X$   and   $n_Y\,$ , then
$$ M_{\bar{X}}(t) ={\rm e}^{\mu_X\, t +\frac{1}{2}\left(\frac{\sigma_X^2}{n_X}\right)\, t^2 } $$
and
$$ M_{-\bar{Y}}(t) ={\rm e}^{-\mu_Y\, t +\frac{1}{2}\left(\frac{\sigma_Y^2}{n_Y}\right)\, t^2 } \;\text{.} $$
It follows that
$$
M_{\bar{X} -\bar{Y}}(t) ={\rm e}^{\mu_X\, t +\frac{1}{2}\left(\frac{\sigma_X^2}{n_X}\right)\, t^2 }\cdot {\rm e}^{-\mu_Y\, t +\frac{1}{2}\left(\frac{\sigma_Y^2}{n_Y}\right)\, t^2 } =
{\rm e}^{\left(\mu_X-\mu_Y\right)\, t +\frac{1}{2}\left(\frac{\sigma_X^2}{n_X} +\frac{\sigma_Y^2}{n_Y}\right)\, t^2 } \,\text{,}
$$
and   $\bar{X} -\bar{Y}$   has the moment generating function of a normal variable with mean   $\mu_X-\mu_Y$   and variance   $\displaystyle{ \frac{\sigma_X^2}{n_X} +\frac{\sigma_Y^2}{n_Y} }\;$ .

The formulae for the mean and variance of   $\bar{X} -\bar{Y}$   can be obtained in general, but we specifically want to know that   $\bar{X} -\bar{Y}$   is normally distributed.   Our use of this result will be through the central limit theorem: for large enough samples we will approximate   $\bar{X}$   and   $\bar{Y}$   by normal variables even if   $X$   and   $Y$   are not.   Here is an example of how this theorem is used.
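A short simulation (with illustrative parameter values not taken from the text) can be used to check the theorem numerically: the observed mean and variance of many simulated values of   $\bar{X} -\bar{Y}$   should be close to   $\mu_X -\mu_Y$   and   $\frac{\sigma_X^2}{n_X} +\frac{\sigma_Y^2}{n_Y}\;$ .

```python
# Illustrative simulation check of the theorem; the means, standard deviations,
# and sample sizes below are arbitrary choices for the sketch.
import numpy as np

rng = np.random.default_rng(0)
mu_X, sigma_X, n_X = 10.0, 2.0, 40
mu_Y, sigma_Y, n_Y = 12.0, 3.0, 50
reps = 100_000

# Draw many pairs of samples and record the difference of the sample means.
diffs = (rng.normal(mu_X, sigma_X, (reps, n_X)).mean(axis=1)
         - rng.normal(mu_Y, sigma_Y, (reps, n_Y)).mean(axis=1))

print(diffs.mean())   # close to mu_X - mu_Y = -2
print(diffs.var())    # close to sigma_X^2/n_X + sigma_Y^2/n_Y = 4/40 + 9/50 = 0.28
```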

Example

The length of time it takes workers to perform a certain job is under review.   Suppose that a job is performed by fifty workers using method I and the same job is performed by forty workers using method II.   The sample means and standard deviations of the two groups are   $\bar{x} =115\,$ , $\,\bar{y} =125\,$ , $\, s_X =6\,$ , and   $s_Y =4\,$ , respectively.   Our problem is to estimate the difference between the mean times of the two methods.

A confidence interval for   $\mu_X -\mu_Y$   can be constructed using the same technique as that for finding a confidence interval for a normal mean   $\mu\,$ .   Assume that   $\bar{X}$   and   $\bar{Y}$   are independent and normally distributed.   Then with   $90\%$   probability
$$
\mu_{X -Y} -1.64\,\sigma_{\bar{X} -\bar{Y}} \lt \bar{X} -\bar{Y} \lt \mu_{X-Y} + 1.64\,\sigma_{\bar{X} -\bar{Y}} \;\text{.}
$$
This can be rearranged to give
$$
\bar{X} -\bar{Y} -1.64\,\sigma_{\bar{X}-\bar{Y}} \lt \mu_{X-Y} \lt \bar{X} -\bar{Y} +1.64\,\sigma_{\bar{X}-\bar{Y}} \;\text{.}
$$
Here
$$ \sigma_{\bar{X}-\bar{Y}} =\sqrt{\frac{\sigma_X^2}{n_X} +\frac{\sigma_Y^2}{n_Y}} \;\text{.} $$

We don’t know   $\sigma_X^2$   or   $\sigma_Y^2\,$ , so we estimate them with   $s_X^2 =36$   and   $s_Y^2 =16\;$ .   Thus we get the approximate confidence interval
$$
115 -125 -1.64\,\sqrt{\frac{36}{50} +\frac{16}{40}} \lt \mu_{X-Y} \lt
115 -125 +1.64\,\sqrt{\frac{36}{50} +\frac{16}{40}} \,\text{,}
$$
or
$$
-11.74 \lt \mu_{X-Y} \lt -8.26 \;\text{.}
$$
We can be quite confident that more than eight minutes, and possibly almost as much as twelve minutes, will be saved if method I is used instead of method II.
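The same computation, expressed as a minimal Python sketch using only the summary statistics given above:

```python
# Minimal sketch of the worked example: an approximate 90% confidence interval
# for mu_X - mu_Y from the summary statistics in the text.
import math

n_X, xbar, s_X = 50, 115.0, 6.0
n_Y, ybar, s_Y = 40, 125.0, 4.0
z = 1.64                                            # z-value for 90% confidence

se = math.sqrt(s_X**2 / n_X + s_Y**2 / n_Y)         # estimate of sigma_{X_bar - Y_bar}
diff = xbar - ybar
print(f"({diff - z*se:.2f}, {diff + z*se:.2f})")    # roughly (-11.74, -8.26)
```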

Estimating the Difference of Two Proportions

If we are sampling two Bernoulli processes, we proceed as in the previous section but work with the proportion of successes.   Thus, on tossing two coins, we would not compare   $45$   heads in   $100$   flips with   $43$   heads in   $80$   flips.   Rather, the first coin comes up heads   $45\%$   of the time in   $100$   flips while the second comes up heads   $53.75\%$   of the time in   $80$   flips.   Using the central limit theorem, we know that the proportion of heads in each case ( $\,\hat{p} =\frac{X}{n}\, $ ) is approximately normally distributed   –   with mean   $p$   and variance   $\frac{p(1-p)}{n}$   –   since the sample sizes are each   $\gt 30\;$ .

Let   $\hat{p}_1$   and   $\hat{p}_2$   denote two independent sample proportions from samples of sizes   $n_1$   and   $n_2\,$ , respectively, from two Bernoulli processes with probabilities   $p_1$   and   $p_2\,$ , respectively.   Assume that   $n_1$   and   $n_2$   are large enough to treat   $\hat{p}_1$   and   $\hat{p}_2$   as normal variables.   Then the preceding theorem may be applied in the following manner.

Theorem: When the number of trials   $n_1$   and   $n_2$   are sufficiently large, the difference of the sample proportions   $\hat{p}_1 -\hat{p}_2$   will be approximately normally distributed with mean   $\mu_{\hat{p}_1 -\hat{p}_2} =p_1 -p_2$   and variance   $\sigma_{\hat{p}_1 -\hat{p}_2}^2 =\frac{p_1 \left(1-p_1\right)}{n_1} +\frac{p_2 \left(1-p_2\right)}{n_2}\;$ .

As with simple binomial distributions, the normal approximation will usually be satisfactory if, for each   $i\,$ ,   $n_i\, p_i \ge 5$   when   $p_i \le .5$   and   $n_i\, \left(1-p_i\right) \ge 5$   when   $p_i \ge .5\;$ .
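As a rough numerical check of the theorem (with illustrative values of   $p_1\,$ , $\,p_2\,$ , $\,n_1\,$ , and   $n_2$   chosen only for the sketch), one can simulate the two Bernoulli processes and compare the observed mean and variance of   $\hat{p}_1 -\hat{p}_2$   with the formulas above.

```python
# Illustrative simulation check of the theorem for the difference of two sample
# proportions; the parameter values below are arbitrary choices for the sketch.
import numpy as np

rng = np.random.default_rng(1)
p1, n1 = 0.30, 60
p2, n2 = 0.45, 80
reps = 100_000

# Proportion of successes in each simulated sample, then their difference.
diffs = (rng.binomial(n1, p1, reps) / n1) - (rng.binomial(n2, p2, reps) / n2)

print(diffs.mean())   # close to p1 - p2 = -0.15
print(diffs.var())    # close to p1(1-p1)/n1 + p2(1-p2)/n2 ≈ 0.00659
```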

Example

Suppose that a glass-maker is investigating two methods for making plate glass, and is concerned with the percentage of panes that break on transportation.   Suppose that method I had seven panes break from a batch of three hundred, and method II had eleven break from a batch of four hundred.   Give a   $95\%$   confidence interval for   $p_1 -p_2\;$ .

We proceed as in the case of the difference of two means.   A   $95\%$   confidence interval for   $p_1 -p_2$   is given by
$$
\hat{p}_1 -\hat{p}_2 -1.96\,\sigma_{\hat{p}_1-\hat{p}_2} \lt p_1 -p_2 \lt
\hat{p}_1 -\hat{p}_2 +1.96\,\sigma_{\hat{p}_1-\hat{p}_2}
$$
with
$$
\sigma_{\hat{p}_1-\hat{p}_2} =\sqrt{\frac{p_1 (1-p_1)}{n_1} +\frac{p_2 (1-p_2)}{n_2}} \;\text{.}
$$
As in the example considering the difference of two means, we will use the sample estimates   $\hat{p}_1$   and   $\hat{p}_2$   for   $p_1$   and   $p_2$   in the expression for   $\sigma_{\hat{p}_1-\hat{p}_2}\;$ .

In our example, we have   $\hat{p}_1 \doteq 0.0233$   and   $\hat{p}_2 \doteq 0.0275\,$ , with
$$
\sqrt{\frac{\hat{p}_1 (1-\hat{p}_1)}{n_1} +\frac{\hat{p}_2 (1-\hat{p}_2)}{n_2}} =
\sqrt{\frac{0.0233 (1-0.0233)}{300} +\frac{0.0275 (1-0.0275)}{400}} \doteq 0.0120 \,\text{,}
$$
so that
$$ 0.0233 -0.0275 -(1.96)(0.0120) \lt p_1 -p_2 \lt 0.0233 -0.0275 +(1.96)(0.0120) $$
or
$$ -0.0277 \lt p_1 -p_2 \lt 0.0193 \;\text{.} $$

This approximate   $95\%$   confidence interval contains both negative and positive values, so its sign is not determined.   As a result, no clear statement can be made regarding a preference for method I or method II.
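For reference, here is a minimal Python sketch of the glass-maker calculation, using only the counts given above; the computed interval straddles zero, in agreement with the conclusion.

```python
# Minimal sketch of the glass-maker example: an approximate 95% confidence
# interval for p1 - p2 from the counts in the text.
import math

x1, n1 = 7, 300      # breaks and batch size, method I
x2, n2 = 11, 400     # breaks and batch size, method II
z = 1.96             # z-value for 95% confidence

p1_hat, p2_hat = x1 / n1, x2 / n2
se = math.sqrt(p1_hat * (1 - p1_hat) / n1 + p2_hat * (1 - p2_hat) / n2)
diff = p1_hat - p2_hat
# About (-0.028, 0.019); the interval contains 0.
print(f"({diff - z*se:.4f}, {diff + z*se:.4f})")
```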