Hypotheses, Hypothesis Testing, and Error

Hypotheses

A statistical hypothesis is, in general, a statement about the density function of a random variable.   So, for example, the claim that data is generated by a binomial distribution is a statistical hypothesis.   The statement that the standard deviation of a normal random variable is   $2$   is also a statistical hypothesis.   In much of what follows, we will be formulating statistical hypotheses of the latter sort   –   that is, assuming that a density function is of a certain type and making assertions regarding values of its parameters.

Here is an example involving a discrete random variable.   Suppose that a certain region has more male than female births.   Assuming that the data supporting this observation are generated by a Bernoulli process (reasonable in this case), and letting   $p$   denote the proportion of male births, the statement that   $p=0.55$   is a statistical hypothesis.

Here is an example involving a continuous random variable.   Suppose that the random variable   $X$   is the height of a female student at a certain university, where heights are known to have mean   $1.62$   meters.   Modelling   $X$   by a normal random variable, i.e. assuming that   $X$   has density function
$$ \frac{1}{s\,\sqrt{2\,\pi}}\, {\rm e}^{-\frac{1}{2}((x-1.62)/s)^2} \,\text{,} $$
the statement that   $s=0.17$   is a statistical hypothesis.
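
Such a hypothesis pins down every probability statement about   $X\;$ .   As a quick illustration   –   a minimal sketch using Python's standard-library `statistics.NormalDist`, with the threshold   $1.80$   m chosen purely for illustration   –   under the hypothesis   $s=0.17$   one can compute the probability that a student is taller than   $1.80$   meters:

```python
from statistics import NormalDist

# Height model under the hypothesis s = 0.17 (the mean 1.62 m is given).
H = NormalDist(mu=1.62, sigma=0.17)

# Probability that a randomly chosen student is taller than 1.80 m.
p_tall = 1 - H.cdf(1.80)
print(round(p_tall, 3))   # about 0.145
```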

Testing Hypotheses

Hypothesis testing means, for us, a mechanism for determining whether to accept or reject a particular hypothesis.

There are, of course, silly and useless tests   –   for example, tossing a coin to decide whether to accept or reject the hypothesis on the standard deviation   $s$   of the normal variable   $X$   in the discussion above.   The toss of the coin has nothing to do with the value of   $s\,$ , and so tells us nothing about the hypothesis.

On the other hand, there are many real-life problems concerning the truth/falsehood of some proposed hypothesis.   For example, is a purported anti-viral drug an effective treatment against Ebola?   Or is one type of rubber less brittle than another at low temperature (think of the space shuttle Challenger).   Such problems can often be interpreted as a test of some hypothesis regarding the parameters of a probability density function.

Here is an example of how statistics might guide us when attempting to design a test with desirable properties.   Consider a marketer wishing to decide how many garments of each clothing size to make.   She understands that the average woman in her market has height   $1.62$ m , and that heights are normally distributed.   Thus she models the heights in her market by the density function
$$ \frac{1}{s\,\sqrt{2\,\pi}}\, {\rm e}^{-\frac{1}{2}((x-1.62)/s)^2} $$
above.   She needs to know how heights are distributed about this average within her market   –   that is, how large   $s$   is.   Suppose further that she knows that in another, similar market the standard deviation is   $s=.10\,$ , but that this market is somewhat more diverse, so that   $s=.20$   might be more appropriate.   To inform the business decision, the statistician might proceed as follows.

Assume that density function
$$ \frac{1}{s\,\sqrt{2\,\pi}}\, {\rm e}^{-\frac{1}{2}((x-1.62)/s)^2} $$
holds, and assume that   $s=.10\;$ .   This is the hypothesis to be tested, and will be denoted by   $H_0\;$ .   Let   $H_1$   denote the alternative hypothesis that   $s=.20\;$ .

To test   $H_0\,$ , a single observation will be made on the random variable   $X\,$ ; that is, the height of a random woman in the desired market will be measured.   Of course, in practice one takes several (or many) observations.   But for the sake of simplicity at this stage, only one observation will be made.   On the basis of this observation, a decision will be made to either accept or reject   $H_0\,$ , with rejection meaning acceptance of   $H_1\;$ .   The problem is to determine which values of   $X$   should lead us to accept   $H_0\,$ , and which should lead us to reject it.   It is customary to call the   $X$   values leading to rejection of   $H_0$   the critical region of the test.

For this problem, the sample space may be taken to be the   $x$   axis, as a normal random variable can assume any value.   We keep in mind that when measuring the height of a woman we cannot get a negative value; in fact, we are quite certain not to get a value less than   $x=1$   (remembering that this is in meters).   Since only one observation is being taken here, the sample space is one-dimensional.   If   $n$   observations were being taken, the sample space would be   $n$-dimensional, with one coordinate axis for each observation.   In these terms, the problem of designing a test for   $H_0$   is that of choosing a critical region on the   $x$   axis.

Two Types of Error

Suppose that we select the part of the   $x$   axis to the right of   $x=1$   as our critical region.   That is, if the sample takes on a value   $X\gt 1 $   we reject hypothesis   $H_0\,$ , and if it takes on a value   $X\le 1 $   we accept hypothesis   $H_0\;$ .   To decide whether this was a wise choice, consider its consequences.   If   $H_0$   is actually true and the observed value of   $X$   is greater than   $1\,$ , $\, H_0$   will be rejected because of our choice of critical region.   This, clearly, is an error.   We call it an error of type I   –   a false positive, since a true   $H_0$   is wrongly rejected.   If, on the other hand, $\, H_1$   is actually true and $\, X \le 1\,$ , $\, H_0$   will be accepted.   This is also an error.   We call this an error of type II   –   a false negative.   The following table lays out these two types of error, along with the two types of correct decision possible in this test.

$$
\begin{array}{c|cc}
\hline & H_0 \quad \text{true} & H_1 \quad \text{true} \\
\hline x\gt 1 \quad (\text{reject } H_0\,) & \text{type I error} & \text{correct decision} \\
\hline x\le 1 \quad (\text{accept } H_0\,) & \text{correct decision} & \text{type II error} \\ \hline
\end{array}
$$

To judge whether our choice of critical region was appropriate, we need some measure of the seriousness of each type of error.   There are various ways of doing this, but we will take what is called “the size of an error” as our measure.   The size of a type I error is the probability of making this error   –   that is, the probability that the sample point falls in the critical region when   $H_0$   is true.   Similarly, the size of a type II error is the probability that the sample point falls outside the critical region when   $H_1$   is true.
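
To make these definitions concrete, here is a minimal sketch (again using Python's `statistics.NormalDist`; the code is ours, not part of the original analysis) computing both sizes for the critical region   $x\gt 1$   chosen above:

```python
from statistics import NormalDist

H0 = NormalDist(mu=1.62, sigma=0.10)   # hypothesis H0: s = .10
H1 = NormalDist(mu=1.62, sigma=0.20)   # hypothesis H1: s = .20

# Critical region: reject H0 when x > 1.
alpha = 1 - H0.cdf(1.0)   # size of type I error: P(X > 1 | H0)
beta = H1.cdf(1.0)        # size of type II error: P(X <= 1 | H1)
print(alpha, beta)        # alpha is essentially 1; beta is about 0.001
```

The numbers confirm what the earlier remark about heights below one meter suggested: essentially all of the probability lies to the right of   $x=1$   under either hypothesis, so this critical region rejects a true   $H_0$   almost every time.   It is a legitimate test, but a poorly chosen one.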

In terms of the sizes of the two types of error, there is a simple principle for judging whether our critical region is well chosen.   Among all of the possible tests having the same size type I error, is the size of the type II error as small as possible?   If so, then we consider our critical region to be well chosen.

Before examining the mathematical meaning of this principle, let us return to the comment above that there are various ways of judging whether a choice of critical region is appropriate.   Another possible criterion is to minimize the sum of the sizes of the two errors.   But the preceding principle has been widely accepted, for the following reason.

In many settings, the statistician will determine in advance what size of type I error is tolerable.   Then, with the number of experimental trials fixed in advance, a test is constructed to minimize the size of the type II error.   For a fixed number of trials, the size of the type II error generally increases as the size of the type I error decreases.   Hence one cannot make the type I error small without paying for it with a large type II error.   In practice, it is often necessary to adjust the type I error until a satisfactory balance has been reached between the sizes of the two errors.

The type I and type II error sizes will be denoted by   $\alpha$   and   $\beta\;$ .   Without going into the practical considerations behind possible choices of   $\alpha$   and   $\beta\,$ , we note that it is conventional to choose   $\alpha =.05\;$ .   This means that a true hypothesis being tested will be rejected approximately   $5\%$   of the time.   The value   $\alpha =.05$   is arbitrary, and some other value could have been agreed upon; but   $\alpha =.05$   is commonly used.   In any applied problem one can calculate (or approximate) the value of   $\beta$   and then adjust the value of   $\alpha$   if the value of   $\beta$   is unsatisfactory when   $\alpha =.05\;$ .   This naturally works both ways.   In a particular experiment it might turn out that with   $\alpha$   set at   $.05$   the value of   $\beta$   would be very small, perhaps considerably smaller than   $.05\;$ .   If the type I error were considered more serious than the type II error, we could then make   $\alpha$   smaller at the cost of making   $\beta$   greater.
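
The tradeoff between the two error sizes is easy to see numerically.   As an illustration (our own sketch, using the two hypotheses of the marketing example and a one-tailed critical region   $x\gt x_0\,$ ), shrinking   $\alpha$   pushes the cutoff   $x_0$   to the right and inflates   $\beta\,$ :

```python
from statistics import NormalDist

H0 = NormalDist(mu=1.62, sigma=0.10)   # H0: s = .10
H1 = NormalDist(mu=1.62, sigma=0.20)   # H1: s = .20

# One-tailed test: reject H0 when x > x0.
for alpha in (0.20, 0.10, 0.05, 0.01):
    x0 = H0.inv_cdf(1 - alpha)   # cutoff giving type I size alpha
    beta = H1.cdf(x0)            # resulting type II size
    print(f"alpha = {alpha:.2f}   x0 = {x0:.3f}   beta = {beta:.3f}")
```

Here   $\beta$   climbs from roughly   $.66$   at   $\alpha =.20$   to roughly   $.88$   at   $\alpha =.01\;$ .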

An example of when it might be prudent to have   $\beta$   smaller than   $\alpha$   occurs when a new cancer identifier is being tested.   Consider the setting where there is an existing and not terribly effective cancer identifier, and a new identifier is developed.   We wish to determine whether the new identifier is superior to the old.   If   $H_0$   is the hypothesis that the old identifier is superior, and   $H_1$   is that the new one is superior, then   $\alpha$   is the probability that the new identifier is seen as superior when it is in fact not, and   $\beta$   is the probability that the old identifier is seen as superior when it is not.   Given that the old identifier is not very effective, it is worth our while not to reject the new identifier too easily.   If the new identifier is less effective than the old, this will become evident after it has been in use for a while.   But if the new identifier is superior yet easily rejected, it will never have a chance to demonstrate its superiority.

In what follows, we consider the example mentioned in “Testing Hypotheses” above, with supporting mathematics.

Our Example, Mathematically Analyzed

Consider the problem under discussion from the point of view of our principle.   Suppose the critical region is
$$ \left| \frac{x-1.62}{.10}\right|\gt 1.8 \,\text{,} $$
that is,   $x\in (-\infty, 1.44) \cup (1.80, +\infty)\;$ .   If our sample takes on a value in this set, our marketer will proceed on the assumption that   $s=.20\;$ .   The sizes of the two types of error are
$$
\alpha = \left(\int_{-\infty}^{1.44} +\int_{1.80}^{+\infty}\right)
\frac{1}{.10\,\sqrt{2\,\pi}}\, {\rm e}^{-\frac{1}{2}((x-1.62)/.10)^2} \, {\rm d}x = .072
$$
and
$$
\beta = \int_{1.44}^{1.80} \,
\frac{1}{.20\,\sqrt{2\,\pi}}\, {\rm e}^{-\frac{1}{2}((x-1.62)/.20)^2} \, {\rm d}x = .632 \;\text{.}
$$
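
These values are easy to check numerically: the integrals are just normal tail and central areas, so no explicit integration is needed.   A minimal check (our sketch, not part of the original text):

```python
from statistics import NormalDist

H0 = NormalDist(mu=1.62, sigma=0.10)   # H0: s = .10
H1 = NormalDist(mu=1.62, sigma=0.20)   # H1: s = .20

# Critical region (-inf, 1.44) U (1.80, +inf).
alpha = H0.cdf(1.44) + (1 - H0.cdf(1.80))   # mass H0 puts in the region
beta = H1.cdf(1.80) - H1.cdf(1.44)          # mass H1 puts outside it
print(round(alpha, 3), round(beta, 3))      # 0.072 0.632
```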

Since the integrals are precisely areas under curves, they may be represented graphically as follows.

[IMAGE: $\alpha$ and $\beta$ shown as areas under the $H_0$ and $H_1$ density curves.]

To decide whether or not the preceding choice of critical region was a good one, we compare it with others for which   $\alpha =.072\;$ .   Here is another test that might be a good challenger   –   one that uses only a right “tail” as the critical region, rather than our symmetric combination of right and left “tails”.   That is, the critical region in this new test consists of all   $x$   values greater than some fixed value   $x_0\,$ , where   $x_0$   satisfies
$$
\int_{x_0}^{+\infty} \,\frac{1}{.10\,\sqrt{2\,\pi}}\, {\rm e}^{-\frac{1}{2}((x-1.62)/.10)^2} \, {\rm d}x = .072 \;\text{.}
$$
This value can be determined numerically (for example, within a spreadsheet) to be approximately   $1.766\;$ .   With the critical region   $x\in ( 1.766, +\infty)$   we find that
$$
\beta = \int_{-\infty}^{1.766} \,
\frac{1}{.20\,\sqrt{2\,\pi}}\, {\rm e}^{-\frac{1}{2}((x-1.62)/.20)^2} \, {\rm d}x = .767 \;\text{.}
$$
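
The same tools recover both the cutoff and the type II size of this competing test (again a sketch of ours):

```python
from statistics import NormalDist

H0 = NormalDist(mu=1.62, sigma=0.10)   # H0: s = .10
H1 = NormalDist(mu=1.62, sigma=0.20)   # H1: s = .20

x0 = H0.inv_cdf(1 - 0.072)   # right-tail cutoff with alpha = .072
beta = H1.cdf(x0)            # type II size of the one-tailed test
print(round(x0, 3), round(beta, 3))   # 1.766 0.767
```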

Here is a graphical representation of the two types of error in this competing test.

[IMAGE: the two types of error for the one-tailed test, shown as areas under the density curves.]

On comparing the two values of   $\beta$   it is clear that the first test is superior to the second.   When   $H_1$   is true, the second test would incorrectly accept   $H_0$   almost   $77\%$   of the time, while the first test would do so only about   $63\%$   of the time.   Both tests have very large type II errors, but this is typical when only one observation is taken.   It can be shown (by the Neyman–Pearson lemma) that the first test is optimal, given our principle of minimizing   $\beta$   for fixed   $\alpha\,$ , but this would take us too far afield at present.
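
While a full proof of optimality is beyond our scope, a numerical scan makes the claim plausible.   Consider critical regions of the form   $(-\infty, a) \cup (b, +\infty)$   with total type I size   $.072\,$ , where a fraction   $w$   of   $\alpha$   sits in the left tail.   The sketch below (ours, not a proof) shows that   $\beta$   is smallest at the symmetric split   $w=0.5\,$ :

```python
from statistics import NormalDist

H0 = NormalDist(mu=1.62, sigma=0.10)   # H0: s = .10
H1 = NormalDist(mu=1.62, sigma=0.20)   # H1: s = .20
ALPHA = 0.072

# Two-tailed regions (-inf, a) U (b, +inf) with total type I size ALPHA,
# where a fraction w of ALPHA sits in the left tail.
for w in (0.1, 0.3, 0.5, 0.7, 0.9):
    a = H0.inv_cdf(w * ALPHA)             # left cutoff
    b = H0.inv_cdf(1 - (1 - w) * ALPHA)   # right cutoff
    beta = H1.cdf(b) - H1.cdf(a)          # type II size of this region
    print(f"w = {w:.1f}   beta = {beta:.3f}")
```

The minimum, $\beta \approx .632\,$ , occurs at   $w=0.5\,$ , which is the symmetric region used in our first test.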

The principle of test construction applies equally to problems with discrete random variables.   Here is an example illustrating this.

A Discrete Example, Mathematically Analyzed

XXX