Range and standard deviation
Today I illustrate the concepts of range and standard deviation, along with a few closely related functions that fall under the heading of descriptive statistics. (Inferential statistics, the title of our course, is still a few weeks away.)
Here are two sets of cholesterol readings from an old study. You can see they have the same mean of 200, but in the first set the values are more spread out.
set1 <- c(177,185,195,198,199,200,202,218,226)
set2 <- c(192,196,198,199,200,200,202,204,209)
mean(set1)
## [1] 200
mean(set2)
## [1] 200
How do we quantify “more spread out”? One way is the range, which is simply the difference between the highest and lowest values. The range() function in R achieves this:
range(set1)
## [1] 177 226
range(set2)
## [1] 192 209
Note that this shows you the minimum and maximum values, not their difference. There are two working definitions of "range" - one is the pair of minimum and maximum values, the other is the difference between them.
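If you want the single number, one simple option is to wrap range() in diff(), which subtracts the first value from the second:
diff(range(set1)) #maximum minus minimum
## [1] 49
diff(range(set2))
## [1] 17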
Another useful measure is the interquartile range, or IQR. It is the difference between the third quartile value and the first quartile value - that is, between the value that is 1/4 of the way from the top and the value that is 3/4 of the way from the top (or 1/4 from the bottom). I deliberately chose 9 data points here so that the quartiles can be calculated exactly: the third quartile is the data point with 2 values above and 6 below, and the first quartile is the data point with 6 values above and 2 below. If this sounds a lot like the median, which is the data point with half of the values above and half below, you are right.
Another way to think of the IQR is the range of the middle half of the data.
The IQR() function reports the other definition of range - the difference:
IQR(set1)
## [1] 7
IQR(set2)
## [1] 4
However, we can get the actual first and third quartile values from the summary() function:
summary(set1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     177     195     199     200     202     226
summary(set2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     192     198     200     200     202     209
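If you just want those two quartile values, the quantile() function can also return them directly (using R's default method for computing quantiles):
quantile(set1, c(0.25, 0.75)) #first and third quartiles
## 25% 75%
## 195 202
quantile(set2, c(0.25, 0.75))
## 25% 75%
## 198 202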
The range is easy to calculate but is very sensitive to isolated extreme values. Since data sets often contain outliers (another name for extreme values), the range frequently ends up just highlighting them. The IQR solves this problem but is still a crude measure.
The most commonly used measure of spread is the standard deviation. It is calculated by taking the difference between each observation and the mean, squaring these differences, adding them up, dividing by n-1 (where n is the sample size), and finally taking the square root. In past semesters I’ve had students calculate standard deviations by hand for smallish data sets, but this semester I am going to show you the R function sd() straight away:
sd(set1)
## [1] 14.94992
sd(set2)
## [1] 4.821825
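To connect the formula above to the function, here is the same calculation for set1 spelled out step by step; it matches what sd() returns:
devs <- set1 - mean(set1) #differences from the mean
sqrt(sum(devs^2)/(length(set1) - 1)) #square, sum, divide by n-1, take the square root
## [1] 14.94992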
We can see that the standard deviation of the first set is much higher, meaning the data are much more spread out. I invite you to play around with this a bit in R. For example, here I’ve replaced the last value of set1 with 1000 and the standard deviation got massively larger, as would be expected:
set1[9] <- 1000 #bracket notation for replacing a single value
sd(set1)
## [1] 267.9897
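By contrast, a measure based on quartiles barely notices the extreme value. Even with the 1000 in place, the quartiles of set1 are the same, so the IQR is unchanged:
IQR(set1) #same as before, despite the outlier
## [1] 7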
Since the standard deviation is extremely important in statistics – it will feature prominently throughout the course – I will expand a bit on its meaning. What does this number mean exactly? If we say some data have a mean of 50 and a standard deviation of 10, is that high? low? good? bad?
If your data are normally distributed, which means they follow a bell-shaped curve, then the standard deviation tells you a lot. About 2/3 of such data will fall within 1 standard deviation of the mean, or between 40 and 60 in our example. About 95% of the data will fall within 2 standard deviations, or between 30 and 70. Anything outside of 3 standard deviations – below 20 or above 80 – would be very unusual, a true outlier. In fact, we can predict, with high confidence and accuracy, how much of your data will fall within any range that you choose. The 2/3 (for one standard deviation) and 95% (for two standard deviations) come up a lot in the course, and are worth committing to memory. We never have data that perfectly follow a bell-shaped curve, but these rules work pretty well even when the data are kind-of bell-shaped, which is most of the time. Even when your data are not at all bell-shaped, there are tricks for making them so.
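If you would like to check these percentages yourself, here is a quick simulation with made-up bell-shaped data that have a mean of 50 and a standard deviation of 10 (the sample size and seed are arbitrary choices; your proportions will be close to, but not exactly, 2/3 and 95%):
set.seed(1) #arbitrary seed, just for reproducibility
z <- rnorm(10000, mean = 50, sd = 10) #simulated bell-shaped data
mean(z > 40 & z < 60) #proportion within 1 standard deviation, roughly 2/3
mean(z > 30 & z < 70) #proportion within 2 standard deviations, roughly 95%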
What about when your data have a fixed upper or lower limit? For example, suppose the results of a class exam have a mean of 80 and a standard deviation of 15. You would say that 95% of the scores should fall between 50 and 110, but you know that none of them could have been above 100. It’s a bell-shaped curve that is truncated at the upper end (and, in principle, at the lower end too, since no one can score below zero). There are complicated ways of dealing with this issue, but it’s too early in the course for that. For now, understand that even a truncated distribution of data can be described reasonably well with a standard deviation.
How about the standard deviation formula? Why do we sum the squares of the differences from the mean? That hardly seems intuitive. Why not just sum the differences?
That is actually an alternative measure, known as the average deviation (more precisely, we average the absolute values of the differences, since the raw differences always cancel out to zero).
Here is a simple example. Suppose you have the following data:
3,4,5,6,7,8,9
It’s easy to see the mean here is 6.
The absolute differences from the mean are 3,2,1,0,1,2,3. The average of these is therefore 12/7, or about 1.71.
In R, I can get this result from the MeanAD() function in the DescTools package:
x <- c(3,4,5,6,7,8,9)
mean(x)
## [1] 6
library(DescTools)
MeanAD(x)
## [1] 1.714286
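If you would rather not load a package for this, the same number is just the mean of the absolute differences:
mean(abs(x - mean(x))) #average absolute difference from the mean
## [1] 1.714286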
The variance is the name for the average of the squares of the differences, which here are 9,4,1,0,1,4,9. Add these and divide by 7 and you get 4. The variance is confusing to work with because it’s in square units. To get back to regular units we take the square root, which is 2, which is the standard deviation!
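In R, that calculation looks like this (note that I am dividing by 7, the full sample size, for now):
sum((x - mean(x))^2)/7 #average of the squared differences: the variance
## [1] 4
sqrt(sum((x - mean(x))^2)/7) #square root gets us back to regular units
## [1] 2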
Notice that the standard deviation is higher than the average deviation. That’s because the squaring gives more weight to the values farthest from the mean. It happens to be the “correct” amount of weight for many purposes, and it ties into all those nice properties of the bell-shaped curve I mentioned earlier. How do we know this is the “correct” amount? Trust me, it has been proven through a lot of theoretical work that is well beyond this course.
You could decide to give even more weight to the farthest-away values and use the cubes of the differences, which are 27,8,1,0,1,8,27. Then you would average them and take the cube root, getting about 2.17 - an even bigger number. As far as I know, this measure doesn’t even have a name, because I don’t think anyone has ever found it useful. If you like, you can think of this as a Goldilocks-type problem. The average deviation gives too little weight to outliers. Cubing the deviations gives too much weight. Squaring them is just right.
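For the curious, here is what that cubed version would look like in R. It is not a standard statistic, just an illustration of how the weighting changes:
(sum(abs(x - mean(x))^3)/7)^(1/3) #average the cubed deviations, then take the cube root: about 2.17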
Above, I calculated the standard deviation by dividing by 7. It is customary to divide by n-1, in this case 6, which represents a conservative adjustment that accounts for the sample size (the sd() function does this by default). By conservative adjustment, I mean it gives you a bigger number. It is like saying, “you have so few data points, just to be safe, we will say the data are a bit more spread out than they seem”. When your sample size is large, this has little effect (dividing by 100 vs. 101 will give almost the same result). It only matters when your sample size is small.
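Here is the small data set from before with both divisors, so you can see that the n-1 version is what sd() reports:
n <- length(x) #7 data points
sqrt(sum((x - mean(x))^2)/n) #divide by n
## [1] 2
sqrt(sum((x - mean(x))^2)/(n - 1)) #divide by n-1; this is what sd(x) gives
## [1] 2.160247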
We will see a bunch of these conservative adjustments for small sample sizes in this course. You can avoid them by only working with large data sets, but of course that is not always possible.
One more point about standard deviations. We can compare them between similar data sets, but not across dissimilar ones. Suppose we have:
x <- c(3,4,5,6,7,8,9)
y <- c(30,40,50,60,70,80,90)
sd(x)
## [1] 2.160247
sd(y)
## [1] 21.60247
Is y more spread out than x? Well, yes, obviously, but only because all the numbers are larger. We want to know if y is more spread out than x relative to the size of its values. This is the only fair way to compare data from a place like Vermont to a place like California.
The way around this is to calculate the coefficient of variation, which is the ratio of the standard deviation to the mean. I am not aware of any function in R to do this, though I’m sure it exists somewhere. But it is easy enough to calculate:
cvx <- sd(x)/mean(x)
cvy <- sd(y)/mean(y)
cvx
## [1] 0.3600411
cvy
## [1] 0.3600411
Here we see the data are equally spread out relative to their own size (not surprising, since y is just x with every value multiplied by 10).