Quantiles
Today’s topic is quantiles. Quantiles are the cut points that divide your data into groups with an equal number of observations in each group. There are special names for particular numbers of groups: three groups give tertiles, four give quartiles, five give quintiles, ten give deciles, and a hundred give percentiles. There is presumably a name for every number, but these are the only ones I’ve ever seen used.
Quantiles are a way to simplify a data set into more manageable pieces. The quantile() function in R does the job and is vastly easier than doing it by hand:
x <- c(1:100) #create data consisting of the numbers 1 to 100
head(x)
## [1] 1 2 3 4 5 6
tail(x)
## [1] 95 96 97 98 99 100
quantile(x)
## 0% 25% 50% 75% 100%
## 1.00 25.75 50.50 75.25 100.00
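To see where a value like 25.75 comes from, here is a minimal sketch that recomputes the 25% cut point by hand. It assumes the default method (type=7), which places the quantile at position (n - 1) * p + 1 in the sorted data and interpolates linearly between the two neighbouring values. The variable names (p, h, lo, hi, by_hand) are mine, chosen for illustration.

```r
# Hand computation of the default (type = 7) quantile for p = 0.25.
x <- 1:100
p <- 0.25
n <- length(x)
xs <- sort(x)                 # already sorted here, but needed in general

h  <- (n - 1) * p + 1         # fractional position in the sorted data
lo <- floor(h)                # index of the value just below that position
hi <- ceiling(h)              # index of the value just above that position

# Linear interpolation between the two neighbouring values
by_hand <- xs[lo] + (h - lo) * (xs[hi] - xs[lo])

by_hand                           # 25.75
unname(quantile(x, probs = p))    # 25.75
```

The position works out to 25.75, so the result lies three quarters of the way from the 25th value (25) to the 26th value (26).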
Note that when you pass no other arguments, you get quartiles by default. To get something other than quartiles, add the probs argument, as in the call below. Here seq(0,1,1/10) lists the probabilities from the lowest (0) to the highest (1) in steps of 1/10 (I could also have typed 0.1), which cuts the data into 10 pieces:
quantile(x, probs=seq(0,1,1/10))
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1.0 10.9 20.8 30.7 40.6 50.5 60.4 70.3 80.2 90.1 100.0
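The same pattern should work for the other named quantiles mentioned at the start. The sketch below is only an illustration: it uses seq() with length.out instead of a step size (my choice, so that the number of cut points is explicit), and the names tertiles and quintiles are mine.

```r
x <- 1:100

# k groups need k + 1 cut points, from probability 0 to probability 1
tertiles  <- quantile(x, probs = seq(0, 1, length.out = 4))   # 3 groups
quintiles <- quantile(x, probs = seq(0, 1, length.out = 6))   # 5 groups

tertiles
quintiles
```

With the default type, the tertile cut points come out at about 1, 34, 67 and 100, and the quintile cut points at about 1, 20.8, 40.6, 60.4, 80.2 and 100.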
You’ll notice the cut points are not whole numbers, even though our data only contain integers. That is because the default method interpolates between neighbouring values, so the exact decimals are somewhat arbitrary. But if you look at how many observations fall in each decile, you’ll see it works out. The first decile should include 1-10, and quantile() gives cut points of 1 and 10.9. Check. The second should include 11-20, and quantile() returns 10.9 and 20.8. Check.
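One way to check the counts rather than eyeballing them is to hand the decile cut points to cut() and tabulate the result. This is a sketch, not the only way to do it; the names breaks, decile and counts are mine.

```r
x <- 1:100

# Decile cut points from the default method (type = 7)
breaks <- quantile(x, probs = seq(0, 1, 1/10))

# Assign each observation to a decile (1 to 10).
# include.lowest = TRUE keeps the minimum value in the first group;
# labels = FALSE returns plain integer group codes.
decile <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)

counts <- table(decile)
counts    # each of the 10 groups holds 10 observations
```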
In fact, quantile() offers 9 different ways to calculate quantiles, selected with the type parameter!
quantile(x, probs=seq(0,1,1/10), type=1)
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1 10 20 31 40 50 61 70 80 90 100
quantile(x, probs=seq(0,1,1/10), type=2)
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1.0 10.5 20.5 31.0 40.5 50.5 61.0 70.5 80.5 90.5 100.0
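The 31 and 61 in the type=1 and type=2 outputs above stand out against the 30 and 60 you might expect. A plausible explanation, offered here as an assumption rather than something the output itself proves, is floating-point rounding: seq(0,1,1/10) builds 0.3 as 3 * 0.1, which is stored as a number very slightly above 0.3. The sketch below only inspects those probabilities; it does not call quantile(), and the variable names are mine.

```r
probs <- seq(0, 1, 1/10)

# The fourth element prints as 0.3 but is not exactly the literal 0.3
probs[4] == 0.3                   # FALSE
print(probs[4], digits = 17)

# Multiplying by n = 100 lands just above 30, so rounding up gives 31
pos <- 100 * probs[4]
pos > 30                          # TRUE
ceiling(pos)                      # 31
```

If that is indeed the cause, passing exact probabilities such as c(0.3, 0.6), or running a different version of R, may give 30 and 60 instead.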
quantile(x, probs=seq(0,1,1/10), type=3)
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1 10 20 30 40 50 60 70 80 90 100
quantile(x, probs=seq(0,1,1/10), type=4)
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1 10 20 30 40 50 60 70 80 90 100
quantile(x, probs=seq(0,1,1/10), type=5)
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1.0 10.5 20.5 30.5 40.5 50.5 60.5 70.5 80.5 90.5 100.0
quantile(x, probs=seq(0,1,1/10), type=6)
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1.0 10.1 20.2 30.3 40.4 50.5 60.6 70.7 80.8 90.9 100.0
quantile(x, probs=seq(0,1,1/10), type=7) #default
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
## 1.0 10.9 20.8 30.7 40.6 50.5 60.4 70.3 80.2 90.1 100.0
quantile(x, probs=seq(0,1,1/10), type=8)
## 0% 10% 20% 30% 40% 50% 60% 70%
## 1.00000 10.36667 20.40000 30.43333 40.46667 50.50000 60.53333 70.56667
## 80% 90% 100%
## 80.60000 90.63333 100.00000
quantile(x, probs=seq(0,1,1/10), type=9)
## 0% 10% 20% 30% 40% 50% 60% 70% 80% 90%
## 1.000 10.400 20.425 30.450 40.475 50.500 60.525 70.550 80.575 90.600
## 100%
## 100.000
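In this case, type=5 is how I would do it by hand: each cut point sits halfway between the last value of one group and the first value of the next. Notice that type=1 returns 31 and 61 where you would expect 30 and 60, so it doesn’t even put the same number into each category! This is probably a floating-point artifact of seq(0,1,1/10) rather than part of the type=1 definition, and other versions of R may print 30 and 60 here. I include all of this only to point out that many statistics do not have a single agreed-upon definition.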
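As a rough sketch of the by-hand approach, the code below places each decile cut point at position n * p + 0.5 in the sorted data and interpolates between the two neighbouring values, which is my reading of how type=5 behaves. The names p, h, lo and by_hand are mine.

```r
x <- 1:100
n <- length(x)
p <- (1:9) / 10               # the interior decile probabilities

h  <- n * p + 0.5             # fractional position for each cut point
lo <- floor(h)                # index of the value just below each position

# Interpolate between the neighbouring values x[lo] and x[lo + 1]
by_hand <- x[lo] + (h - lo) * (x[lo + 1] - x[lo])

by_hand                                   # 10.5 20.5 ... 90.5
unname(quantile(x, probs = p, type = 5))  # should agree
```

Each cut point falls exactly between two consecutive integers, so every decile contains ten values.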