Where are the Data?

First of all, data is the plural of datum.  So the verb in the title is correct.   Secondly, when we collect numbers as data, the question “Where are the data?” refers to where the numbers are situated on a number-line.  Are they all tiny (as might be the case when studying manufacturing variations of nano-bots)?  Are they huge (as might be the case when studying variations in estimates of the US national debt)?  Are they varying from the small to the huge (as might be the case when considering income distribution in Seattle)?  How we answer this question helps us understand how we view the data.

Assuming we’ve collected a bunch of numbers in an attempt to quantitatively understand our phenomenon, the usual first task is to organize the numbers in a way that makes sense.  One place to start this is to identify the span of values, where they seem to be centered, and how they are clustered.


Location and Spread of Data

Given a data set, perhaps the most primitive description is where the data are is to identify the maximum and minimum values of the set.  Below is given a set of twenty data values in the first column (in cells A1 through A20) of a spreadsheet.  Within spreadsheets it is common to have a built-in function to identify the maximum and minimum of a collection of values.  Thus “=MAX(A1; A5; A17)” will find the maximum of the entries in cells A1, A4, and A17.  The maximum and minimum of a cluster of cells can usually be simplified rather than demanding that we list each cell in the command; thus “=MAX(A1:A20)” will find the maximum over the entire column.  And if we have a rectangle of numbers (as in a subsequent example) the command “=MAX(A1:F10)” will find the maximum over the six-by-ten rectangular cluster of sixty cells formed by A1 through A10, then B1 through B10, etc. up to F1 through F10.

In a similar fashion we have the commands “=MIN(A1; A5; A17)” and “=MAX(A1:A20)” and “=MAX(A1:F10)” for the miniima over the appropriate collections of cells.  Some spreadsheet applications include a command like “=RANGE(A1; A5; A17)” and “=RANGE(A1:A20)” and “=RANGE(A1:F10)” by which one can quickly identify the breadth of the data set, i.e. the difference between the maximum and the minimum.  Thus “=RANGE(A1:A20)” gives precisely the same value as “=MAX(A1:A20)-MIN(A1:A20)”. Pay attention to the commands generating the values in D1, D2, and D3 below.

Click here to open a copy of this so you can experiment with it. You will need to be signed in to a Google account.

Given below are three commonly used descriptions of where the data values are centered: mean, median, and mode.  These differ from one another in terms of what they try to extract from a data set.

Mean

The mean of a set of numbers is the average.  That is, the sum of the numbers divided by how many there are.  In formula, if we have  $n$  data values   $\displaystyle{ x_1 }\,$ , $\,\displaystyle{ x_2 }\,$ , $\,\cdots\,$ , $\,\displaystyle{ x_n }\,$ , the mean is $$ \frac{x_1 + x_2 + \cdots x_n}{n} = \frac{\sum_{k=1}^n\, x_k}{n} \;\text{.} $$ This introduces the notation $\Sigma\,$ , which is a shorthand notation for a sum.  The notation  $\displaystyle{ \sum_{k=1}^n\, x_k }$ here should be read as meaning the sum of the numbers  $\displaystyle{ x_k }$ as  $k$   goes from $1$  to   $n\;$ .  This notation will recur and should not be ignored here.

Example: Find the mean of the values  $4\,$ , $\, 1\,$ , $\, 4\,$ , $\, 5\,$ , $\, 2\,$ , $\, 6\,$ , $\, 6\,$ , $\, 1\,$ , $\, 3\,$ , $\, 1$ obtained on rolling a die ten times.

The mean is $$ \frac{4+1+4+5+2+6+6+1+3+1}{10} =\frac{31}{10} =3.1 \;\text{.} $$

Spreadsheets usually allow us to represent the sum of a collection of cells by a command like “=SUM(A1; A2; B5; C3)” for scattered cells, or “=SUM(A1:A8)” or “=SUM(A1:F1)” for values one after another in a row or column. They will usually allow for summing over values in a rectangular region with a command like “=SUM(A1:F8)”.  With this in mind, if we know how many items are being added we can compute averages by including the appropriate division: “=SUM(A1; A2; B5; C3)/4” or “=SUM(A1:A8)/8” or “=SUM(A1:F1)/6” or “=SUM(A1:F8)/48”.  Spreadsheets can also keep track of the number of terms being added with a “COUNT” command, so that “=COUNT(A1; A2; B5; C3)” for scattered cells, or “=COUNT(A1:A8)” or “=COUNT(A1:F1)” for values one after another in a row or column, or “=COUNT(A1:F8)” for values in a rectangular region.  Thus we can combine these to average: “=SUM(A1; A2; B5; C3)/COUNT(A1; A2; B5; C3)” or “=SUM(A1:A8)/COUNT(A1:A8)” or “=SUM(A1:F1)/COUNT(A1:F1)” or “=SUM(A1:F8)/COUNT(A1:F8)”.

Spreadsheets also allow for averaging directly with a command like “AVERAGE”.  Thus, following the preceding paragraph we have “=AVERAGE(A1; A2; B5; C3)” or “=AVERAGE(A1:A8)” or “=AVERAGE(A1:F1)” or “=AVERAGE(A1:F8)”.

Example: Examine ways to sum and average numbers in a spreadsheet.

Consider the data in rectangular array of cells A1 through J10 (one hundred cells).   You will find the sum of all the values in the data set in cell L1, the count of all the values in the data set in L2, the average computed by “=SUM/COUNT” in cell L3, and the average computed using “=AVERAGE” in L4.  Note that the values in L3 and L4 are equal.  You should play with this to compute averages over scattered collections of cells, columns and rows.

Click here to open a copy of this so you can experiment with it and the commands. You will need to be signed in to a Google account.

Median

The median of a set of numbers is the middle value of the set. It is not always so clear what is meant here, so what follows is a bit of explanation and a couple of examples. If there is an odd number ( $\, 2\, n+1\,$ ) of data, then there is a datum for which there are no more than $n$ values no greater, and no more than  $n$  values no less.  This value is the median.  If there is an even number ( $\, 2\, n\,$ ) of data, then there is a minimum datum (call it  $\displaystyle{ x_1 }\,$ ) for which there are no more than  $n-1$  values no greater, and a maximum datum (call it $\displaystyle{ x_2 }\,$ ) for which there are no more than $n-1$  values no greater.  In this case, the median is the average ( $\,\displaystyle{ \frac{x_1 +x_2}{2} }\,$ ) of  $\displaystyle{ x_1 }$ and  $\displaystyle{ x_2 }\;$ .

Example: Find the median of the values  $4\,$ , $\, 1\,$ , $\, 4\,$ , $\, 5\,$ , $\, 2\,$ , $\, 6\,$ , $\, 6\,$ , $\, 1\,$ , $\, 3$ obtained on rolling a die nine times.

Ordering the nine values ( $\,1\,$ , $\, 1\,$ , $\, 2\,$ , $\, 3\,$ , $\, 4\,$ , $\, 4\,$ , $\, 5\,$ , $\, 6\,$ , $\, 6\,$ ), we see that the middle value is  $4\;$ .  This is the median.

Example: Find the median of the values  $4\,$ , $\, 1\,$ , $\, 4\,$ , $\, 5\,$ , $\, 2\,$ , $\, 6\,$ , $\, 6\,$ , $\, 1\,$ , $\, 3\,$ , $\, 1$  obtained on rolling a die ten times.

Ordering the ten values ( $\,1\,$ , $\, 1\,$ , $\, 1\,$ , $\, 2\,$ , $\, 3\,$ , $\, 4\,$ , $\, 4\,$ , $\, 5\,$ , $\, 6\,$ , $\, 6\,$ ), we see that there are five data no greater than $3\,$ , and five data no less than $4\;$ . Thus, with $\displaystyle{ x_1 =3 }$  and $\displaystyle{ x_2 =4 }\,$ , the median is  $\displaystyle{ \frac{x_1 +x_2}{2} =\frac{3+4}{2} =3.5 }\;$ .



Spreadsheets usually allow for determination of median with a command like “MEDIAN”.   Thus, following the preceding paragraph we have “=MEDIAN(A1; A2; B5; C3)” or “=MEDIAN(A1:F8)”.


Try these commands in the following example.


Example: Determine the median of numbers in a spreadsheet.

Similar to the preceding sheet, consider again the values in cells A1 through J10. You will find in cell M1 the number of occurrences of  $0$  in the data set, in cell M2 the number of occurrences of $1\,$ , etc.  The median, obtained using MEDIAN is in M12. You should play with this to compute medians over scattered collections of cells, columns and rows.

Click here to open a copy of this so you can experiment with it. You will need to be signed in to a Google account.

Mode

The mode of a set of numbers is the most frequently occurring value. If there are multiple values occurring most frequently, one might say either that the mode is not unique or that it is not well-defined.  Different treatments handle this differently. Here we will say that the mode is not unique.


Example: Find the mode of the values   $4\,$ , $\, 1\,$ , $\, 4\,$ , $\, 5\,$ , $\, 2\,$ , $\, 6\,$ , $\, 6\,$ , $\, 1\,$ , $\, 3\,$ , $\, 1$   obtained on rolling a die ten times.

The values are given here with their frequencies. $$ \begin{array}{c|cccccc} \text{value} & 1 & 2 & 3 & 4 & 5 & 6 \\ \hline \text{frequency} & 3 & 1 & 1 & 2 & 1 & 2 \end{array} $$ The value $1$  occurs most often –  three times.


Spreadsheets usually allow us to represent the mode by a command like “=MODE(A1; A2; B5; C3)” for scattered cells, or “=MODE(A1:F8)” for values in a rectangular region.

Observe this command in the following example.


Example: Find the mode of numbers in a spreadsheet.

Once again, consider again the values in cells A1 through J10.  You will find in cell M1 the number of occurrences of $0$ in the data set, in cell M2 the number of occurrences of $1\,$ , etc. The mode, obtained using “=MODE” is in M12.  You should play with this to compute modes over scattered collections of cells, columns and rows. Note also cell M14. If a set has more than one mode, the command “=MODE.MULT” will list them all in increasing order.

Click here to open a copy of this so you can experiment with it. You will need to be signed in to a Google account.