In an earlier section we mentioned the range of a data set. This can serve as a measure of how spread out the data is. Here we introduce two refinements of the concept of spread. These are more precise measures, giving us an idea of how the data fills out the range.
Consider the following two data sets, each containing five values.

There are several measures of dispersion, but we will concern ourselves with only two closely related ones: the variance and its square root, the standard deviation. Variance is easier to calculate, but standard deviation has the same units as the data and so is more easily interpreted when visualizing data. When the standard deviation is small the data will tend to be clustered close to the mean, and when the standard deviation is large the data will tend to be spread further away from the mean.
Variance
The variance of a set of numbers is a measure of how spread out the set is about its mean. Here is an explanation of how it is computed.
Since the square of a number is a way to measure how far it is from
NOTE! The above is a fine notion of dispersal of data values about the mean, except for a small issue that cannot reasonably be explained quite yet.
Because of this small issue, a variation on the above notion of dispersal is routinely used, and implemented in spreadsheets and statistical analysis packages. Given data values
When comparing the variances of two data sets, the greater variance corresponds to data points being on average further from the mean.
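For reference, the two notions of dispersal discussed above are commonly written as follows. The notation is an assumption and is not taken from this text: $x_1, \dots, x_n$ are the data values, $\bar{x}$ is their mean, $\sigma^2$ denotes the version that divides by $n$, and $s^2$ denotes the routinely used variation that divides by $n - 1$. The $n - 1$ divisor agrees with the spreadsheet formula "=SQRT(SUM(D1:D20)/(COUNT(D1:D20)-1))" that appears later in this section.

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2,
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2
```

The two quantities differ only by the factor $n/(n-1)$, which is close to 1 for large data sets.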
Example: Find the variances of data sets 1 and 2 above and compare.
The variance of data set 1 is
Spreadsheets usually allow us to represent the variance of a collection of cells by a command like “=VAR(A1; A2; B5; C3)” for scattered cells, or “=VAR(A1:A8)” or “=VAR(A1:F1)” for values one after another in a row or column, or for values in a rectangular region with a command like “=VAR(A1:F8)”. With this in mind, examine the following spreadsheet. In column A are given twenty data values. The mean of the values is given in cell D1 and the variance, using the command “=VAR(A1:A20)”, in cell D2. In column E are given the values obtained by squaring the difference between the values in column A and the value in cell D1. These are summed and then divided by
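The spreadsheet computation described above can also be sketched in Python. The twenty values below are hypothetical stand-ins for cells A1:A20, and the variable names are chosen only for illustration. The sketch assumes, as the spreadsheet formulas in this section suggest, that the sum of squared differences is divided by one less than the number of values.

```python
import statistics

# Hypothetical column of twenty values standing in for cells A1:A20.
values = [12, 15, 11, 18, 14, 16, 13, 17, 15, 14,
          19, 10, 16, 15, 13, 17, 14, 12, 18, 15]

# Mean of the column, like the value shown in the spreadsheet's mean cell.
mean = sum(values) / len(values)

# One squared difference per data value, like the spreadsheet's helper column.
squared_diffs = [(x - mean) ** 2 for x in values]

# Sum the helper column and divide by (count - 1).
variance_by_hand = sum(squared_diffs) / (len(values) - 1)

# Library equivalent of a sample variance (divisor n - 1).
variance_builtin = statistics.variance(values)

print(variance_by_hand, variance_builtin)
```

The two printed numbers should agree up to rounding, which is the same check the spreadsheet performs by placing the single-command value next to the step-by-step one.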
The following is an example illustrating this for more complicated data sets.
Example: Find the variances of two data sets, and compare their histograms.
Here are two data sets, each with forty values: the first is in cells A2 through A41, and the second in cells E2 through E41. In cells C2 and G2 you will find the means of the two data sets. Note that they are close (i.e., if you reload the page several times you will routinely see that the difference between the two means is small compared to their respective ranges). In cells C3 and G3 you will find the ranges of the two data sets. In cells C4 and G4 you will find the variances of the two data sets. Histograms for the two data sets are plotted. Note that the first data set is routinely more dispersed than the second, and its variance is correspondingly larger.
Standard Deviation
The standard deviation of a set of numbers is the square root of the variance. Thus, if
Example: Find the standard deviations of data sets 1 and 2 above. Then double the data values in set 2 and observe how this influences the standard deviation.
The variance of data set 1 is
Doubling the second data set gives data set 3 with
Spreadsheets usually allow us to represent the standard deviation of a collection of cells by a command like “=STDEV(A1; A2; B5; C3)” for scattered cells, or “=STDEV(A1:A8)” or “=STDEV(A1:F1)” for values one after another in a row or column, or for values in a rectangular region with a command like “=STDEV(A1:F8)”. With this in mind, examine the following spreadsheet, an extension of the spreadsheet in the section above. In cell C3 you will find the standard deviation given by the command “=STDEV(A1:A20)”. In cell C4 you will find the same value given by the command “=SQRT(VAR(A1:A20))”. Finally, in cell F2 you will again find the standard deviation, this time given by the command “=SQRT(SUM(D1:D20)/(COUNT(D1:D20)-1))”. Thus, as in the case of the variance, the standard deviation can be computed from its definition when necessary, rather than with a single command.
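The three spreadsheet routes to the standard deviation have direct counterparts in Python. The sketch below uses a small hypothetical data set (not one of the data sets from this section) and a helper named `sample_std`, introduced here only for illustration. It also doubles the data to show the effect on the standard deviation.

```python
import math
import statistics

def sample_std(values):
    """Standard deviation from the definition: square root of the (n - 1) variance."""
    mean = sum(values) / len(values)
    squared_diffs = [(x - mean) ** 2 for x in values]
    return math.sqrt(sum(squared_diffs) / (len(values) - 1))

data = [3, 5, 6, 8, 13]            # hypothetical values
doubled = [2 * x for x in data]    # every value doubled

std_direct = statistics.stdev(data)                   # like =STDEV(...)
std_from_var = math.sqrt(statistics.variance(data))   # like =SQRT(VAR(...))
std_by_hand = sample_std(data)                        # from the definition
std_doubled = sample_std(doubled)

print(std_direct, std_from_var, std_by_hand, std_doubled)
```

The first three values should coincide, and the last should be twice as large: doubling every data value doubles each distance from the mean, so the variance is multiplied by four and the standard deviation by two.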
We continue our example above with the more complicated data sets.
Example (continued): Find the standard deviations of the two data sets in the example above, and illustrate.
Repeated here are the two data sets from our preceding complicated example. As before, in cells D1 and H1 you will find the means of the two data sets. In cells D2 and H2 you will find the ranges of the two data sets. In cells D3 and H3 you will find the standard deviations of the two data sets. Plotted below the spreadsheet are histograms for the two data sets, this time indicating the respective standard deviations.
If data is compiled into frequencies (or relative frequencies), then, just as with the mean, the computation of variance and standard deviation is greatly simplified.
Computing Spread from Frequency Data
As in the case of computing means, when data is already collected into frequency information, it is very easy to compute the variance and standard deviation.
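As a rough sketch of the idea, suppose the frequency information is stored as pairs of a value and the number of times it occurs. The function name `spread_from_frequencies` and the frequency table below are hypothetical, and the sketch assumes the same $n - 1$ divisor that the spreadsheet formulas in this section use. Each distinct value contributes its squared distance from the mean once, weighted by its frequency.

```python
import math
import statistics

def spread_from_frequencies(freq):
    """Return (mean, variance, std) for data given as {value: frequency}."""
    n = sum(freq.values())                                   # total number of data values
    mean = sum(x * f for x, f in freq.items()) / n           # frequency-weighted mean
    ss = sum(f * (x - mean) ** 2 for x, f in freq.items())   # weighted squared differences
    variance = ss / (n - 1)
    return mean, variance, math.sqrt(variance)

# Hypothetical frequency table: value -> how many times it occurs.
freq = {1: 2, 2: 5, 3: 8, 4: 4, 5: 1}
mean, variance, std = spread_from_frequencies(freq)

# Cross-check against the raw, expanded data set.
raw = [x for x, f in freq.items() for _ in range(f)]
print(variance, statistics.variance(raw))
```

The cross-check expands the table back into individual values; both approaches should give the same variance, but the frequency version needs only one term per distinct value.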
Consider again the data set
Of course, for the standard deviation we take the square root:
Here is a general summary of this process.
In general, when given frequencies for a data set, we compute the mean of the data set as follows. Let
This is illustrated in the following example.
Example: Compute Variance and Standard Deviation from Frequency Data
Consider the fifty data values, along with frequency information, in the following spreadsheet.
We compute the mean of the data to be
When the data has been analyzed far enough to present relative frequency information (as may be the case with huge data sets, where large frequency values can be hard to read), the computation of data descriptors is easier still. The mean is again quite simple to calculate.
Consider the discussion above, with the following data and frequencies redisplayed, along with relative frequencies as shown. We again reorganize the computation of the variance.
In general, when given relative frequencies for a data set, we compute the variance of the data set as indicated above. We summarize this as follows. Let
When the relative frequencies are given but the frequencies themselves are not known, we have to adjust this computation slightly. We do not know the fraction
Example: Compute Variance and Standard Deviation from Relative Frequency Data
Consider the fifty data values already seen above, but now with relative frequency information, in the following spreadsheet.
We now approximate the variance by
This is highlighted in yellow in cell H2. Compare with the actual variance, highlighted in green in I2.
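The comparison between the approximate and the actual variance can be sketched in Python as follows. The relative frequencies are hypothetical, and the function name `variance_from_relative_frequencies` is introduced only for illustration. The sketch assumes that the adjustment needed when the total count $n$ is known is the factor $n/(n-1)$, which is the usual relation between the relative-frequency sum and the $n - 1$ variance.

```python
import math

def variance_from_relative_frequencies(rel_freq, n=None):
    """Variance from {value: relative frequency}.

    Without n, return the weighted sum of squared differences from the mean.
    With n, rescale by n / (n - 1) to recover the (n - 1) variance.
    """
    mean = sum(x * p for x, p in rel_freq.items())
    approx = sum(p * (x - mean) ** 2 for x, p in rel_freq.items())
    if n is None:
        return approx
    return approx * n / (n - 1)

# Hypothetical relative frequencies (they sum to 1).
rel_freq = {1: 0.1, 2: 0.2, 3: 0.4, 4: 0.2, 5: 0.1}

approx = variance_from_relative_frequencies(rel_freq)        # n unknown
exact = variance_from_relative_frequencies(rel_freq, n=50)   # n known to be 50

print(approx, exact)
```

With fifty values the two results differ by only about two percent, and the gap shrinks as the data set grows.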
When data is presented in histograms, or in bins where the exact values of the data are not available, we can use a variation on the technique introduced above for approximating the mean to approximate the variance of the data. We will again assume that the width of the bins is fixed.
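A minimal sketch of such an approximation is given here, under the assumption that each bin is represented by its midpoint, so that every value in a bin is treated as if it sat at the center of the bin. The function name `binned_spread`, the bin edges, and the counts are all hypothetical, and the $n - 1$ divisor is assumed because the frequencies are known.

```python
import math

def binned_spread(edges, counts):
    """Approximate (mean, variance, std) from equal-width bins using bin midpoints."""
    midpoints = [(lo + hi) / 2 for lo, hi in zip(edges[:-1], edges[1:])]
    n = sum(counts)                                                 # total number of data values
    mean = sum(m * c for m, c in zip(midpoints, counts)) / n
    ss = sum(c * (m - mean) ** 2 for m, c in zip(midpoints, counts))
    variance = ss / (n - 1)
    return mean, variance, math.sqrt(variance)

# Hypothetical histogram: four bins of width 10 and their frequencies.
edges = [0, 10, 20, 30, 40]
counts = [3, 7, 6, 4]

mean, variance, std = binned_spread(edges, counts)
print(mean, variance, std)
```

Because the exact data values inside each bin are unknown, the result is only an approximation, and it generally improves as the bins become narrower.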
Let
Let
Here is an example illustrating this.
Example: Approximating Variance and Standard Deviation from a Histogram
Given here is a histogram with frequencies indicated for each bin.
Our task is to approximate the variance of the data generating this histogram. To do so, we collect the necessary information.
Since the cut-offs for the bins are
Here the frequencies are known, so that we will use the approximation to variance
We first approximate the average by
If we just had the relative frequency data and not the total number of data, we use the observation that for large data sets