Presentation of Data and Histograms: Frequency, Relative Frequency, and Weights – Introduction to Statistics via Spreadsheets

Given a collection of numerical data, and a sense of where the data values lie, it is common to want to display the data in a way that helps us understand it. Graphical representation is a common presentation method.

The idea is easy to understand when there are few values that the data can assume. For example, in the case of a coin flip, with $0$ representing tails and $1$ representing heads, there are two possible data values. In the case of a die roll, there are six possible outcomes: $1\,$ , $\, 2\,$ , $\, 3\,$ , $\, 4\,$ , $\, 5\,$ , and $6\,$ . We first establish a place for counting each of the possible outcomes, and then count the number of occurrences of each. If we go through the data one datum at a time, we can record each value with a mark in the appropriate place as we encounter it. Given below is a list of one hundred values resulting from one hundred rolls of a die. Included in the spreadsheet are three different graphic tabulations of these one hundred data. At left is a chart using horizontal bars to indicate the number of occurrences. At center is a chart of the same data using vertical bars (columns); the height of a column indicates the number of occurrences of the corresponding value. Finally, in the third chart at right we have a polygonal curve connecting the points $(1, \,$ number of occurrences of $1)$ to $(2, \,$ number of occurrences of $2)$ to … to $(6, \,$ number of occurrences of $6)$ .

Click here to open a copy of this so you can experiment with it. You will need to be signed in to a Google account.

A graphical representation of the number of occurrences of each data value is called a histogram. The chart at left is a horizontal bar-chart, at center is a vertical bar-chart, and at right is a dependency plot because it is the graph of a function. Histograms are commonly used to organize and present data. We use function plots when we want to emphasize the functional relationship between the values (here $1$ through $6\,$ ) and the number of occurrences for each value.

Below is given a spreadsheet with six hundred randomly generated data (in six columns of a hundred values) representing rolls of a die. To the right of this data, column H lists the possible values $1\,$ , $\, 2\,$ , $\, 3\,$ , $\, 4\,$ , $\, 5$ and $\, 6\,$ , and column I gives the number of times each of these values occurs. For each number in column H, the value in column I is called the frequency of the value, and the column of these values is so labelled. The frequency numbers are compiled using the “=COUNTIF” command. For example, we obtain the number of occurrences of $3$ using the command “=COUNTIF(A1:F100;"=3")” , which counts the number of entries in the rectangle A1:F100 which satisfy the condition “=3”. As a check on these counts, their sum is given below them, computed with the command “=SUM(I2:I7)” ; it should equal the total number of data values, six hundred. A histogram of the data is included.
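For readers who would like to see the same tally outside a spreadsheet, here is a minimal Python sketch. It is not part of the spreadsheet above: the six hundred rolls are simulated here, and the names `rolls`, `frequency`, and `total` are illustrative assumptions. The `Counter` object plays the role of one “=COUNTIF” per value, and the final sum mirrors the “=SUM(I2:I7)” check.

```python
import random
from collections import Counter

# Assumption: simulated data standing in for the block A1:F100 (600 die rolls).
random.seed(0)  # fixed seed only so that a rerun gives the same rolls
rolls = [random.randint(1, 6) for _ in range(600)]

# Counter tallies every value in one pass, like one COUNTIF per face.
counts = Counter(rolls)
frequency = {face: counts[face] for face in range(1, 7)}

for face, freq in frequency.items():
    print(face, freq)

# Same check as =SUM(I2:I7): the frequencies must add up to the number of rolls.
total = sum(frequency.values())
print("total:", total)
```

Because the rolls are random, the individual frequencies will differ from those in the spreadsheet, but the total is always six hundred.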

What follows is another example, with somewhat more complicated data.

Example

Given below is another randomly generated spreadsheet, with one hundred data values from $0$ to $9$ in column A. Frequency data is given in column D, and a histogram is available again at the click of a button. Note that in order to produce the histogram all that is needed is the frequency information. This is compiled and a vertical bar-chart histogram is presented.

In the above examples, we compiled the frequency of occurrence of the various data values. Sometimes it is useful to know what fraction of the data satisfies a condition, particularly when the data set is huge. In the example of six hundred rolls of a die, we might be interested less in the exact number of times each of the six possible outcomes occurred than in the fraction of the time each occurred. This would arise, for instance, in deciding whether the die is fair or not. For a given data value, this information is called the relative frequency of the value. Below is given the same data and frequency compilation as above, with relative frequencies given in column J. Note that these values can be obtained by dividing a “=COUNTIF” result by a “=COUNT” result. Thus to obtain the relative frequency of $3\,$ , we use the command “=COUNTIF(A1:F100;"=3")/COUNT(A1:F100)”, which divides the number of occurrences of $3$ by the number of data values.
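The division of a count by the total number of values can also be sketched in a few lines of Python. This is only an illustration: the helper `relative_frequencies` and the ten-value `sample` are hypothetical and do not come from the spreadsheet.

```python
from collections import Counter

def relative_frequencies(data):
    """Return {value: fraction of the data equal to that value}."""
    n = len(data)                      # plays the role of COUNT
    counts = Counter(data)             # plays the role of COUNTIF per value
    return {value: count / n for value, count in sorted(counts.items())}

# Assumption: a small made-up sample of ten die rolls.
sample = [1, 3, 3, 6, 2, 3, 5, 3, 4, 1]
rel = relative_frequencies(sample)

for value, r in rel.items():
    print(value, r)
```

In this sample the value $3$ occurs four times out of ten, so its relative frequency is $0.4\,$ , and the relative frequencies of all values add up to $1\,$ .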

Bins and Data Organization

Consider for a moment the following problem. We have a school with one hundred fifth-graders. We have the birthdays of these kids, and wish to understand how their birthdays are organized. It might be of some use to create a histogram showing their birthdays, but this will probably be fairly uninformative. It is easy to imagine that there might be a few days on which two children share birthdays, and maybe even a day or two on which more than two share. But, for the most part, there are probably not enough data points for this to be of much use.

We would have the same problem if we considered the results of only two rolls of a die. A histogram would not reveal much. The following data and histogram are for two data points, each taking a value from $1$ through $6$ (i.e., there are three times as many possible values as data points).

To understand how the data is distributed, it may be better to sort the data into ranges of values, with the size of each range chosen to capture what is under consideration. Such a grouping of data by value is called a bin. In the case of the one hundred fifth-graders, we might wish to sort them by month. This would allow us to see that each month has about one-twelfth of the birthdays (with some variation, of course). In columns A and B of the following spreadsheet are given one hundred values from $1$ through $360$ (the length of the year is rounded down to $360$ days for purposes of illustration, so that every month has $30$ days). Think of these as the days of the year. We think of data bin #1 as representing January births, and we consider any data value from $1$ to $30$ to be in this bin. Similarly, we think of data bin #2 as representing February births, and we consider any data value from $31$ to $60$ to be in this bin, etc.
Notice that in column D the bounds of the bins are listed, in column E are given the frequencies of occurrence of the data in the respective bins, and in column F are given the relative frequencies of occurrence for the respective bins. A histogram of the data by bin is shown.
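The sorting of day numbers into thirty-day bins can be sketched in Python as follows. This is an illustration under stated assumptions: the one hundred day numbers are simulated rather than taken from the spreadsheet, and `bin_index`, `freq`, and `rel_freq` are hypothetical names.

```python
import random

BIN_WIDTH = 30   # every "month" is taken to have 30 days
NUM_BINS = 12

def bin_index(day):
    """Map a day number 1..360 to a bin index 0..11 (0 = January, 1 = February, ...)."""
    return (day - 1) // BIN_WIDTH

# Assumption: simulated birthdays standing in for the spreadsheet data.
random.seed(1)
days = [random.randint(1, 360) for _ in range(100)]

# Frequency of each bin, then relative frequency of each bin.
freq = [0] * NUM_BINS
for day in days:
    freq[bin_index(day)] += 1
rel_freq = [f / len(days) for f in freq]

for i, (f, r) in enumerate(zip(freq, rel_freq), start=1):
    low, high = (i - 1) * BIN_WIDTH + 1, i * BIN_WIDTH
    print(f"bin #{i:2d} (days {low:3d} to {high:3d}): frequency {f:3d}, relative frequency {r:.2f}")
```

Subtracting $1$ before the integer division is what places day $30$ in bin #1 and day $31$ in bin #2, in agreement with the bounds described above.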

When a tail end of the data (for example, the greatest values) is rather sparsely distributed, it is often convenient to define a bin of greatest values as one which simply contains all values beyond a certain reasonable point. In the example below, the data takes any non-negative integer value (i.e. $0\,$ , $\, 1\,$ , … ). Most of the data is clustered from $1$ to $8\,$ , so it is reasonable to cut off our display of bins by making the upper-most bin contain everything greater than or equal to $10\;$ .
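A catch-all top bin of this kind can be sketched in Python as below. The function name `bin_with_overflow` and the sixteen data values are made up for illustration, and the sketch assumes the data are non-negative integers as in the example.

```python
def bin_with_overflow(data, cutoff):
    """Count each value 0..cutoff-1 separately and lump every value >= cutoff together.

    Returns a list of length cutoff + 1; the last entry is the overflow bin.
    Assumes the data are non-negative integers.
    """
    freq = [0] * (cutoff + 1)
    for x in data:
        freq[min(x, cutoff)] += 1   # anything at or beyond the cutoff lands in the last bin
    return freq

# Assumption: made-up data, mostly small values with a sparse upper tail.
data = [1, 3, 2, 5, 8, 4, 2, 17, 6, 3, 1, 10, 7, 2, 4, 31]
freq = bin_with_overflow(data, cutoff=10)

for value, f in enumerate(freq[:-1]):
    print(value, f)
print("10 or more:", freq[-1])
```

Here the values $17\,$ , $\, 10\,$ , and $31$ all fall into the top bin, so the display needs only eleven bins no matter how far the tail extends.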

Computing Averages

When data is already collected into frequency information, it is very easy to compute many of the items used to describe the data. The mean, in particular, is quite simple to calculate.

Consider for a moment the data set $$ 3, 5, 4, 2, 2, 3, 1, 4, 3, 2, 1, 3, 4, 5, 3, 4, 1, 1, 1, 1 $$ so that the occurring values $1$ through $5$ have frequencies $6\,$ , $\,3\,$ , $\,5\,$ , $\,4\,$ , and $2\;$ . This might be presented to us in a histogram or a spreadsheet, and we could organize it into value-frequency pairs as $$ [[1,6], [2,3], [3,5], [4,4], [5,2]] $$ (with the first entry of each pair representing the value, and the second the frequency). To calculate the mean of these values, we sum them and divide by the number of values. That is, $$ \bar{x} =\frac{3 + 5 + 4 + 2 + 2 + 3 + 1 + 4 + 3 + 2 + 1 + 3 + 4 + 5 + 3 + 4 + 1 + 1 + 1 + 1}{20} \;\text{.} $$ Reorganizing this sum, we have $$
\begin{array}{rl}
\bar{x} & =\frac{(1+1+1+1+1+1) + (2+2+2) + (3+3+3+3+3) + (4+4+4+4) + (5+5)}{20} \\ & \\
& =\frac{1\cdot 6 +2\cdot 3 +3\cdot 5 +4\cdot 4 +5\cdot 2}{6+3+5+4+2} \\ & \\ & =2.65 \;\text{.}
\end{array}
$$
We see that the multipliers $6$ of $1\,$ , $\, 3$ of $2\,$ , $\, 5$ of $3\,$ , $\, 4$ of $4\,$ , and $2$ of $5\,$ , are precisely the frequencies of $1\,$ , $\, 2\,$ , $\, 3\,$ , $\, 4\,$ , and $5\;$ . This was seen in the way we organized the data above. The computation of $\bar{x}$ is then precisely the sum of the various values of the data set multiplied by their frequencies, divided by the sum of the frequencies (i.e. the total number of data values).

Here is a summary of this process.

In general, when given frequencies for a data set, we compute the mean of the data set as follows. Let $\displaystyle{ f_1 }\,$ , $\,\displaystyle{ f_2 }\,$ , … , $\,\displaystyle{ f_k }$ be the frequencies of values $\displaystyle{ v_1 }\,$ , $\,\displaystyle{ v_2 }\,$ , … , $\,\displaystyle{ v_k }\;$ . The total number of values is exactly the sum of the frequencies: $\displaystyle{ f_1 +f_2 +\cdots +f_k }\,$ , and the mean is $$ \bar{x} =\frac{f_1\cdot v_1 +f_2\cdot v_2 +\cdots +f_k\cdot v_k}{f_1 +f_2 +\cdots +f_k} \; \text{.} $$
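As a quick check of this formula, here is a short Python sketch using the value-frequency pairs from the twenty-value data set discussed above. The function name `mean_from_frequencies` is an illustrative choice, not something from the spreadsheets.

```python
def mean_from_frequencies(pairs):
    """Mean of a data set given as [value, frequency] pairs."""
    total_count = sum(f for _, f in pairs)        # f_1 + f_2 + ... + f_k
    weighted_sum = sum(f * v for v, f in pairs)   # f_1*v_1 + ... + f_k*v_k
    return weighted_sum / total_count

# Value-frequency pairs from the data set above.
pairs = [[1, 6], [2, 3], [3, 5], [4, 4], [5, 2]]
mean_by_frequency = mean_from_frequencies(pairs)

# The same mean computed the long way, from the raw data.
raw = [3, 5, 4, 2, 2, 3, 1, 4, 3, 2, 1, 3, 4, 5, 3, 4, 1, 1, 1, 1]
mean_by_raw_data = sum(raw) / len(raw)

print(mean_by_frequency, mean_by_raw_data)
```

Both computations give $53/20 = 2.65\,$ ; the first needs only five products, while the second sums all twenty values.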

This is illustrated in the following example.

Example: Compute Mean from Frequency Data

Consider the fifty data, along with frequency information, in the following spreadsheet.

We compute the mean of the data to be $$\frac{ 6\cdot 0 +3\cdot 1 +\cdots +7\cdot 10 }{ ( 6+3+\cdots +7) } =\frac{ 277 }{50} = 5.54 \;\text{.} $$ The reader is invited to sum the data values and divide by the number of values. But it should be clear, given the already compiled frequency information, that the above computation is significantly more efficient.

When the data has been analyzed far enough to present relative frequency information (as may be the case with huge data sets, where large frequency values can obscure the picture), computation of data descriptors is easier still. The mean is again quite simple to calculate.

Consider the discussion above, with the following data and frequencies redisplayed, along with relative frequencies as shown. We again reorganize the computation of the mean. $$
\begin{array}{rl}
\bar{x} & =\frac{f_1\cdot v_1 +f_2\cdot v_2 +\cdots +f_k\cdot v_k}{f_1 +f_2 +\cdots +f_k} \\
& \\
& =\color{red}{\frac{f_1}{f_1 +f_2 +\cdots +f_k}} \cdot v_1 +\color{red}{\frac{f_2}{f_1 +f_2 +\cdots +f_k}}
\cdot v_2 +\cdots +\color{red}{\frac{f_k}{f_1 +f_2 +\cdots +f_k}} \cdot v_k
\end{array}
$$
We see that the coefficients $\displaystyle{\frac{f_1}{f_1 +f_2 +\cdots +f_k}}$ of $\displaystyle{v_1}\,$ , $\displaystyle{\frac{f_2}{f_1 +f_2 +\cdots +f_k}}$ of $\displaystyle{v_2}\,$ , $\,\cdots\,$ , and $\displaystyle{\frac{f_k}{f_1 +f_2 +\cdots +f_k}}$ of $\displaystyle{v_k}\,$ , are precisely the relative frequencies of $\displaystyle{v_1}\,$ , $\displaystyle{v_2}\,$ , $\,\cdots\,$ , and $\displaystyle{v_k}\;$ .

In general, when given relative frequencies for a data set, we compute the mean of the data set as indicated above. We summarize this as follows. Let $\displaystyle{ r_1 }\,$ , $\,\displaystyle{ r_2 }\,$ , … , $\,\displaystyle{ r_k }$ be the relative frequencies of values $\displaystyle{ v_1 }\,$ , $\,\displaystyle{ v_2 }\,$ , … , $\,\displaystyle{ v_k }\;$ . The mean is $$ \bar{x} =\color{red}{r_1}\cdot v_1 +\color{red}{r_2}\cdot v_2 +\cdots +\color{red}{r_k}\cdot v_k \; \text{.} $$ This is because the relative frequency $\displaystyle{ r_1 }$ is obtained from the frequencies $\displaystyle{ f_1 }\,$ , $\,\displaystyle{ f_2 }\,$ , … , $\,\displaystyle{ f_k }$ by $$ r_1 =\frac{f_1}{f_1 +f_2 +\cdots +f_k} \,\text{,} $$ etc.

Example: Compute Mean from Relative Frequency Data

Consider the fifty data already seen above, but now with relative frequency information, in the following spreadsheet.

We now compute the mean of the data to be $$ 0.12 \cdot 0 +0.06 \cdot 1 +0.06\cdot 2 +0.06\cdot 3 +0.08\cdot 4 +0.02\cdot 5 +0.14\cdot 6 +0.1\cdot 7 +0.16\cdot 8 +0.06\cdot 9 +0.14\cdot 10 = 5.54 \;\text{.} $$ Again, compare the ease of this computation with the process of summing the data values and dividing by the number of values.
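The same sum of products can be sketched in Python, using the values $0$ through $10$ and the relative frequencies appearing in the computation above. The variable names are illustrative only.

```python
# Values and relative frequencies as they appear in the computation above.
values = list(range(11))   # 0, 1, ..., 10
weights = [0.12, 0.06, 0.06, 0.06, 0.08, 0.02, 0.14, 0.10, 0.16, 0.06, 0.14]

# Relative frequencies should add up to 1 (up to floating-point rounding).
weight_total = sum(weights)

# Mean = r_1*v_1 + r_2*v_2 + ... + r_k*v_k; no division is needed.
mean = sum(r * v for r, v in zip(weights, values))

print(round(weight_total, 10), round(mean, 10))
```

The rounding in the last line only hides floating-point noise; the result agrees with the value $5.54$ found above.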

When data is presented in histograms, or in bins where the exact values of the data are not available, we can use the above technique to approximate the mean of the data. We will assume that all bins have the same width (as this computation otherwise has problems). When a histogram is presented, the presentation contains either the number of data points in each bin or the relative frequency of each bin. What is missing for the above computation is a value to associate with each bin. Here is how this is usually handled.

Let $\displaystyle{r_1}\,$ , $\, \displaystyle{r_2}\,$ , $\,\cdots\,$ , $\,\displaystyle{r_k}$ be the relative frequencies for the $k$ bins. (If frequency data is given, the relative frequencies can easily be computed. Also, in this setting where we are discussing the relative frequencies of occurrences in bins, the term weight is routinely used instead of “relative frequency”. We demand that the sum of the weights $\displaystyle{r_1 +r_2 +\cdots +r_k =1\;}$ .)
The convention is to assign values to bins as follows. Let $\displaystyle{B_1}\,$ , $\, \displaystyle{B_2}\,$ , $\,\cdots\,$ , $\,\displaystyle{B_{k-1}}$ be the cut-offs for the $k$ bins. We have assumed that $\displaystyle{B_2 -B_1 =B_3 -B_2 = \cdots = B_{k-1} -B_{k-2}}\,$ , as the bins all are taken to have the same width. With $\displaystyle{ m =\frac{B_2 -B_1}{2} =\cdots =\frac{B_{k-1} -B_{k-2}}{2} }$ being half the width of each bin, we take $$ v_2 =\frac{B_1 +B_2}{2}, \cdots , v_{k-1} =\frac{B_{k-2} +B_{k-1}}{2} \,\text{,} $$ i.e., the midpoint of each interior bin, as an approximating value for the bin. Slight adjustments need to be made for the first and last bins, since each has only one cut-off. For the first bin we take its value to be its upper bound less half the width of the general bin, $\displaystyle{ v_1 =B_1 -m }\,$ , and for the last bin we take its value to be its lower bound plus half the width of the general bin, $\displaystyle{ v_k =B_{k-1} +m }\;$ . We use these values as above to approximate: $$ \bar{x}_{\rm approx} =r_1\cdot v_1 +r_2\cdot v_2 +\cdots +r_k\cdot v_k \; \text{.} $$

Here is an example illustrating this.

Example: Approximating an Average from a Histogram

Given here is a histogram with frequencies indicated for each bin.

Our task is to approximate the mean of the data generating this histogram. To do so, we collect the necessary information.

Since the cut-offs for the bins are $\displaystyle{ B_1 =3 }\, $ , $\, \displaystyle{ B_2 =8 }\, $ , $\, \displaystyle{ B_3 =13 }\, $ , $\, \displaystyle{ B_4 =18 }\, $ , and $\displaystyle{ B_5 =23 }\, $ , half of the width of the interior bins is $\displaystyle{ \frac{B_2 -B_1}{2} =\frac{5}{2} }\; $ . Thus the values we give to the bins are $\displaystyle{ 3 -\frac{5}{2} =\frac{1}{2} }$ to the first bin, $\, \displaystyle{ \frac{3+8}{2} =\frac{11}{2} }$ to the second bin, $\, \displaystyle{ \frac{8+13}{2} =\frac{21}{2} }$ to the third bin, $\, \displaystyle{ \frac{13+18}{2} =\frac{31}{2} }$ to the fourth bin, $\, \displaystyle{ \frac{18+23}{2} =\frac{41}{2} }$ to the fifth bin, and $\displaystyle{ 23 +\frac{5}{2} =\frac{51}{2} }$ to the sixth bin. We also find that the relative frequencies are $\displaystyle{ \frac{38}{38+25+14+7+4+12} =\frac{38}{100} }$ for the first bin, $\, \displaystyle{ \frac{25}{100} }$ for the second bin, $\, \displaystyle{ \frac{14}{100} }$ for the third bin, $\, \displaystyle{ \frac{7}{100} }$ for the fourth bin, $\, \displaystyle{ \frac{4}{100} }$ for the fifth bin, and $\displaystyle{ \frac{12}{100} }$ for the sixth bin.

We thus approximate the average by
$$ \bar{x}_{\rm approx} =\frac{38}{100}\cdot \frac{1}{2} +\frac{25}{100}\cdot \frac{11}{2} +\frac{14}{100}\cdot \frac{21}{2} +\frac{7}{100}\cdot \frac{31}{2} +\frac{4}{100}\cdot \frac{41}{2} +\frac{12}{100}\cdot \frac{51}{2} = 8.00\; \text{.} $$
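
The whole procedure can be sketched in Python, using the cut-offs $3\,$ , $\, 8\,$ , $\, 13\,$ , $\, 18\,$ , $\, 23$ and the bin frequencies $38\,$ , $\, 25\,$ , $\, 14\,$ , $\, 7\,$ , $\, 4\,$ , $\, 12$ from this example. The function names `bin_values` and `approximate_mean` are hypothetical, and the sketch assumes equally spaced cut-offs, as in the discussion above.

```python
def bin_values(cutoffs):
    """Representative value for each of the k bins defined by k-1 equally spaced cut-offs."""
    m = (cutoffs[1] - cutoffs[0]) / 2                        # half the common bin width
    interior = [(a + b) / 2 for a, b in zip(cutoffs, cutoffs[1:])]  # midpoints of interior bins
    return [cutoffs[0] - m] + interior + [cutoffs[-1] + m]   # adjusted first and last bins

def approximate_mean(cutoffs, frequencies):
    """Approximate the mean from bin frequencies: sum of weight times bin value."""
    values = bin_values(cutoffs)
    total = sum(frequencies)
    weights = [f / total for f in frequencies]               # relative frequencies
    return sum(r * v for r, v in zip(weights, values))

cutoffs = [3, 8, 13, 18, 23]
frequencies = [38, 25, 14, 7, 4, 12]

values = bin_values(cutoffs)
approx = approximate_mean(cutoffs, frequencies)
print(values)
print(round(approx, 2))
```

The bin values come out as the halves $\frac{1}{2}\,$ , $\, \frac{11}{2}\,$ , … , $\, \frac{51}{2}$ found above, and the approximate mean agrees with $8.00\,$ . Keep in mind that this is only an estimate: the true mean depends on where the data actually fall inside each bin.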