Other Descriptions of Data: Skewness, Kurtosis – Introduction to Statistics via Spreadsheets

There are other, more precise, ways to describe data sets. These are for a more refined use than we will pursue here, but two such measures will be mentioned briefly for the sake of completeness.

Skewness

Consider the following two data sets.
$$ -3, -3, -2, -2, -2, -1, -1, 6, 8 $$
and
$$ -8, -6, 1, 1, 2, 2, 2, 3, 3 \;\text{.} $$
These are given with histograms below.

Both data sets have mean $0$ and standard deviation $\displaystyle{ \sqrt{16.5}} \doteq 4.062\;$ . But in the first, relatively many data values are less than the mean, while the few that are greater lie further from the mean. In the second, this is the other way around: relatively many data values are greater than the mean, while the few that are less lie further from it. There is a way to identify such imbalance, called skewness. The computation is based on the fact that the cube (third power) of the difference between a datum and the mean keeps the sign of that difference, so it records whether the datum is less than or greater than the mean, and it weights data far from the mean much more heavily than data near it.
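The claims about the two data sets can be checked outside a spreadsheet as well. The following is a minimal sketch in Python, using only the standard `statistics` module; the helper name `summarize` is an illustrative choice, not something from the text.

```python
from statistics import mean, stdev

first = [-3, -3, -2, -2, -2, -1, -1, 6, 8]
second = [-8, -6, 1, 1, 2, 2, 2, 3, 3]

def summarize(data):
    """Return (mean, sample standard deviation, count below mean, count above mean)."""
    m = mean(data)
    below = sum(1 for x in data if x < m)
    above = sum(1 for x in data if x > m)
    # stdev uses the n-1 denominator, like the spreadsheet STDEV function
    return round(float(m), 3), round(stdev(data), 3), below, above

print(summarize(first))   # (0.0, 4.062, 7, 2)
print(summarize(second))  # (0.0, 4.062, 2, 7)
```

Both sets share the same mean and standard deviation, yet seven of the nine values sit on one side of the mean and only two on the other, which is the imbalance that skewness is meant to detect.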

With this in mind, we compute for each datum the cube of its signed distance from the mean (with a scaling for the standard deviation): i.e. for each datum $\displaystyle{ x_i }$ we compute
$$ \left( \frac{x_i -\bar{x}}{s} \right)^3 $$
and average over the entire data set. That is, we compute
$$ \frac{1}{n}\cdot \sum_{i=1}^n\, \left( \frac{x_i -\bar{x}}{s} \right)^3 \;\text{.} $$
But, just as we made an adjustment to the computation of standard deviation when we changed – with temporary apology – the denominator from $n$ to $n-1\,$ , we adjust this computation to define
$$ g = \frac{n^2}{(n-1)\, (n-2)} \cdot \left( \frac{1}{n}\cdot \sum_{i=1}^n\, \left( \frac{x_i -\bar{x}}{s} \right)^3 \right) \;\text{.} $$
We call this the skewness of the data set. This adjustment is needed for essentially the same technical reasons as our computational tweak in computation of variance, and cannot reasonably be explained at present. Notice that the adjustment factor $\displaystyle{ \frac{n^2}{(n-1)\, (n-2)} }$ gets closer and closer to $1$ as $n\to +\infty\; $ .
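As a rough illustration of the formula for $g$, here is a short Python sketch. The function names `sample_skewness` and `adjustment_factor` are assumptions made for this example, and the code simply follows the formula term by term rather than calling any statistics library.

```python
from math import sqrt

def adjustment_factor(n):
    """The factor n^2 / ((n-1)(n-2)) that multiplies the plain average of cubes."""
    return n**2 / ((n - 1) * (n - 2))

def sample_skewness(data):
    """Skewness g as defined above; needs at least 3 values that are not all equal."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 data values")
    xbar = sum(data) / n
    s = sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))  # n-1 denominator
    if s == 0:
        raise ValueError("all data values are equal")
    average_of_cubes = sum(((x - xbar) / s) ** 3 for x in data) / n
    return adjustment_factor(n) * average_of_cubes

# The adjustment factor shrinks toward 1 as n grows.
factors = [adjustment_factor(n) for n in (9, 100, 10_000)]
print([round(f, 4) for f in factors])  # [1.4464, 1.0307, 1.0003]
```

For small data sets the factor matters (about $1.45$ when $n=9$), while for large ones it makes almost no difference.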

On considering our two data sets above, we obtain
$$
g_1 =\frac{9}{56}\cdot \left( \frac{(-3)^3 +(-3)^3 +(-2)^3 +(-2)^3 +(-2)^3 +(-1)^3 +(-1)^3 +6^3 +8^3}{16.5^{3/2}} \right)\doteq 1.55
$$
as the skewness of the first data set, and
$$
g_2 =\frac{9}{56}\cdot \left( \frac{(-8)^3 +(-6)^3 +1^3 +1^3 +2^3 +2^3 +2^3 +3^3 +3^3}{16.5^{3/2}} \right) \doteq -1.55
$$
for the skewness of the second. The signs indicate to which side of the mean the data is more widely spread. That is, in the first case $\displaystyle{ g_1 \doteq 1.55 \gt 0 }\,$ , indicating that the data is on average further to the right from the mean, and in the second case $\displaystyle{ g_2 \doteq -1.55 \lt 0 }\,$ , indicating that the data is on average further to the left from the mean.
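For readers who want to follow the arithmetic, the intermediate values for the first data set are approximately as follows (since the mean is $0$, each deviation is the datum itself):
$$
\sum_{i=1}^9 (x_i -\bar{x})^3 = -27 -27 -8 -8 -8 -1 -1 +216 +512 = 648 \;\text{,} \qquad 16.5^{3/2} \doteq 67.02 \;\text{,}
$$
$$
g_1 \doteq \frac{9}{56}\cdot \frac{648}{67.02} \doteq \frac{9}{56}\cdot 9.668 \doteq 1.55 \;\text{.}
$$
The second data set consists of the negatives of the values in the first, so its cubes sum to $-648$ and $g_2 = -g_1\,$ .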

In the following spreadsheet we have the first data set above in column A, and use commands “=AVERAGE(A1:A9)” , “=STDEV(A1:A9)” , and “=SKEW(A1:A9)” to obtain the average, standard deviation, and skewness of the data in cells D1, D2 and D3. In column F we compute “=((A1-D1)/D2)^3” through “=((A9-D1)/D2)^3” . Finally in cell G1 we sum the entries of column F and multiply by $\frac{9}{8\cdot 7}$ as per our formula to compute the skewness explicitly. Note that this computation agrees with the result in cell D3 (the two cells are colored).
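The same cell-by-cell steps can be mirrored in a few lines of Python. This is only a sketch: the variable names `column_a`, `d1`, `d2`, `column_f` and `g1` are chosen here to echo the cell references, and `statistics.stdev` is assumed to play the role of the spreadsheet's STDEV (both use the $n-1$ denominator).

```python
from statistics import mean, stdev

column_a = [-3, -3, -2, -2, -2, -1, -1, 6, 8]  # cells A1:A9

d1 = mean(column_a)   # like =AVERAGE(A1:A9)
d2 = stdev(column_a)  # like =STDEV(A1:A9)

# Column F: =((A1-D1)/D2)^3 through =((A9-D1)/D2)^3
column_f = [((a - d1) / d2) ** 3 for a in column_a]

# Cell G1: the sum of column F multiplied by 9/(8*7)
g1 = sum(column_f) * 9 / (8 * 7)
print(round(g1, 2))  # 1.55
```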

Kurtosis

There is also a measure of the flatness of the data, particularly near the mean. Consider the following two data sets.
$$ 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6 $$
and
$$ 2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6 \;\text{.} $$
These are given with smooth-curve histograms below.

Note that the first data set is quite flat, while the second has a sharp peak. This distinction is what we wish to detect.

With this in mind, we compute for each datum the fourth power of its signed distance from the mean (with a scaling for the standard deviation): i.e. for each datum $\displaystyle{ x_i }$ we compute
$$ \left( \frac{x_i -\bar{x}}{s} \right)^4 $$
and average over the entire data set. That is, we compute
$$ \frac{1}{n}\cdot \sum_{i=1}^n\, \left( \frac{x_i -\bar{x}}{s} \right)^4 \;\text{.} $$
First, an adjustment is made by comparing with an important, commonly occurring family of data sets by computing
$$ \frac{1}{n}\cdot \sum_{i=1}^n\, \left( \frac{x_i -\bar{x}}{s} \right)^4 -3 \;\text{.} $$
Without going into detail here, for the commonly occurring family the average above takes on the value $3\;$ . When this computation is negative, our data will be flatter than the common family, and when it is positive, our data will be more peaked than the common family. Secondly, we include two adjustment factors, as was done for variance and skewness (above). These adjustments are needed for essentially the same technical reasons as our computational tweak in the computation of variance, and again cannot reasonably be explained at present. The resulting computation is
$$
k = \frac{n^2\, (n+1)}{(n-1)\, (n-2)\, (n-3)} \cdot \left( \frac{1}{n}\cdot
\sum_{i=1}^n\, \left( \frac{x_i -\bar{x}}{s} \right)^4 \right) -3\,\frac{(n-1)^2}{(n-2)\, (n-3)} \;\text{.}
$$
We call this the kurtosis of the data set. The word comes from the Greek “kyrtos” or “kurtos”, meaning “curved, arching”. Notice that the adjustment factors
$\displaystyle{ \frac{n^2\, (n+1)}{(n-1)\, (n-2)\, (n-3)} }$ and $\displaystyle{ \frac{(n-1)^2}{(n-2)\, (n-3)} }$ each get closer and closer to $1$ as $n\to +\infty\; $ .
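The formula for $k$ can be sketched in Python in the same spirit. The name `sample_kurtosis` and the variable names are assumptions made for this example; the code follows the formula above term by term and is applied to the two data sets listed earlier in this section.

```python
from math import sqrt

def sample_kurtosis(data):
    """Kurtosis k as defined above; needs at least 4 values that are not all equal."""
    n = len(data)
    if n < 4:
        raise ValueError("need at least 4 data values")
    xbar = sum(data) / n
    s = sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))  # n-1 denominator
    if s == 0:
        raise ValueError("all data values are equal")
    average_of_fourths = sum(((x - xbar) / s) ** 4 for x in data) / n
    leading_factor = n**2 * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    offset = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return leading_factor * average_of_fourths - offset

flat = [2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6]
peaked = [2, 3, 4, 4, 4, 4, 4, 4, 4, 5, 6]

print(round(sample_kurtosis(flat), 2))    # -1.05
print(round(sample_kurtosis(peaked), 2))  # 2.07
```

The requirement of at least four values comes from the factor $(n-3)$ in the denominators.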

On considering our two data sets above, we obtain
$$
\begin{array}{rl} k_1 & =\textstyle{ \frac{132}{720}\cdot \left(
\frac{(2-4)^4 +(2-4)^4 +(3-4)^4 +(3-4)^4 +(4-4)^4 +(4-4)^4 +(4-4)^4 +(5-4)^4 +(5-4)^4 +(6-4)^4 +(6-4)^4}{\sqrt{2}^4}
\right) -3\cdot \frac{100}{72} } \\ & \\ & \doteq -1.05
\end{array}
$$
as the kurtosis of the first data set, and
$$
\begin{array}{rl} k_2 & =\textstyle{ \frac{132}{720}\cdot \left(
\frac{(2-4)^4 +(3-4)^4 +(4-4)^4 +(4-4)^4 +(4-4)^4 +(4-4)^4 +(4-4)^4 +(4-4)^4 +(4-4)^4 +(5-4)^4 +(6-4)^4}{1^4}
\right) -3\cdot \frac{100}{72} } \\ & \\ & \doteq 2.07
\end{array}
$$
for the kurtosis of the second. The signs indicate whether the data is flat or peaked. That is, in the first case $\displaystyle{ k_1 \doteq -1.05 \lt 0 }\,$ , indicating that the data is rather flat about the mean, and in the second case $\displaystyle{ k_2 \doteq 2.07 \gt 0 }\,$ , indicating that the data is peaked at the mean.
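The intermediate arithmetic behind these two values can be laid out as follows. Both data sets have mean $4\,$ ; the first has $s^2 = 20/10 = 2$ and the second has $s^2 = 10/10 = 1\,$ .
$$
k_1 = \frac{132}{720}\cdot \frac{68}{4} - 3\cdot \frac{100}{72} \doteq 3.1167 - 4.1667 = -1.05 \;\text{,}
$$
$$
k_2 = \frac{132}{720}\cdot \frac{34}{1} - 3\cdot \frac{100}{72} \doteq 6.2333 - 4.1667 \doteq 2.07 \;\text{.}
$$
Here $68 = 4\cdot 2^4 + 4\cdot 1^4$ and $34 = 2\cdot 2^4 + 2\cdot 1^4$ are the sums of the fourth powers of the deviations from the mean.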

Note that we have illustrated the computational method here using data sets which are symmetric about the mean. This is not necessary for either the definition or computation of kurtosis.

In the following spreadsheet we have the first data set above in column A, and use commands “=AVERAGE(A1:A11)” , “=STDEV(A1:A11)” , and “=KURT(A1:A11)” to obtain the average, standard deviation, and kurtosis of the data in cells D1, D2 and D3. In column F we compute “=((A1-D1)/D2)^4” through “=((A11-D1)/D2)^4” . Finally in cell G1 we average the entries of column F and multiply by $\frac{11^2\cdot 12}{10\cdot 9\cdot 8}\,$ , then subtract $3\cdot\frac{10^2}{9\cdot 8}$ as per our formula, to compute the kurtosis explicitly. Note that this computation agrees with the result in cell D3 (the two cells are colored).

For our purposes, it suffices to know that these measures of the shape of a data set exist, and that they are available as standard functions in spreadsheets.