Quantiles

Today’s topic is quantiles. Quantiles are the cut points that divide your data into groups with an equal number of observations in each group. Particular numbers of groups have their own names: three groups are tertiles, four are quartiles, five are quintiles, ten are deciles, and a hundred are percentiles. There is presumably a name for every number, but these are the only ones I’ve ever seen used.

Quantiles are a way to simplify a data set into more manageable pieces. The quantile() function in R does the job and is vastly easier than doing it by hand:

x <- 1:100 #create data consisting of the numbers 1 to 100
head(x)
## [1] 1 2 3 4 5 6
tail(x) 
## [1]  95  96  97  98  99 100
quantile(x)
##     0%    25%    50%    75%   100% 
##   1.00  25.75  50.50  75.25 100.00

Note that when you supply no other arguments, you get quartiles by default. To get something other than quartiles, add the probs argument shown here. It takes the cumulative probabilities at which to cut, from the lowest (0) to the highest (1). Here seq(0,1,1/10) steps from 0 to 1 in increments of 1/10 (I could also have typed 0.1), which cuts the data into 10 pieces:

quantile(x, probs=seq(0,1,1/10))
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##   1.0  10.9  20.8  30.7  40.6  50.5  60.4  70.3  80.2  90.1 100.0
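
The same probs argument covers the other named quantiles from the opening paragraph, and it does not have to be evenly spaced. A small sketch (the variable names tertiles, quintiles and tails are my own, and x is redefined so the snippet runs on its own):

```r
x <- 1:100  # same data as above

# Tertiles: cut points at 0, 1/3, 2/3 and 1
tertiles <- quantile(x, probs = seq(0, 1, 1/3))

# Quintiles: cut points every 20%
quintiles <- quantile(x, probs = seq(0, 1, 1/5))

# probs can be any set of probabilities, e.g. just the 5th and 95th percentiles
tails <- quantile(x, probs = c(0.05, 0.95))

print(tertiles)
print(quintiles)
print(tails)
```

The quintiles should line up with every second decile in the output above, since both calls use the default method.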

You’ll notice the answer contains decimals even though our data contain only integers. Exactly where each cut point falls between two integers is kind of arbitrary; the default method interpolates linearly between neighboring observations, so the 10% cut point lands between 10 and 11. If you look at how many observations fall in each decile, you’ll see it works out. The first decile should include 1-10, and quantile() gives 1-10.9. Check. The second should include 11-20, and quantile() returns 10.9-20.8. Check.
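
One way to run that check for all ten deciles at once is to bin the data with cut() and count with table(). This is only a sketch of the idea; the names breaks, decile and counts are mine:

```r
x <- 1:100

# Decile cut points from the default method
breaks <- quantile(x, probs = seq(0, 1, 1/10))

# Assign each observation to a decile (1 to 10). include.lowest = TRUE keeps
# the minimum value in the first bin instead of turning it into NA.
decile <- cut(x, breaks = breaks, include.lowest = TRUE, labels = FALSE)

# Count the observations in each decile
counts <- table(decile)
print(counts)
```

Each decile should hold 10 of the 100 observations, whatever the decimals in the cut points look like.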

In fact, quantile() offers 9 different ways to calculate the quantiles, using the type parameter!

quantile(x, probs=seq(0,1,1/10), type=1)
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##    1   10   20   31   40   50   61   70   80   90  100
quantile(x, probs=seq(0,1,1/10), type=2)
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##   1.0  10.5  20.5  31.0  40.5  50.5  61.0  70.5  80.5  90.5 100.0
quantile(x, probs=seq(0,1,1/10), type=3)
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##    1   10   20   30   40   50   60   70   80   90  100
quantile(x, probs=seq(0,1,1/10), type=4)
##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
##    1   10   20   30   40   50   60   70   80   90  100
quantile(x, probs=seq(0,1,1/10), type=5)
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##   1.0  10.5  20.5  30.5  40.5  50.5  60.5  70.5  80.5  90.5 100.0
quantile(x, probs=seq(0,1,1/10), type=6)
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##   1.0  10.1  20.2  30.3  40.4  50.5  60.6  70.7  80.8  90.9 100.0
quantile(x, probs=seq(0,1,1/10), type=7) #default
##    0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100% 
##   1.0  10.9  20.8  30.7  40.6  50.5  60.4  70.3  80.2  90.1 100.0
quantile(x, probs=seq(0,1,1/10), type=8)
##        0%       10%       20%       30%       40%       50%       60%       70% 
##   1.00000  10.36667  20.40000  30.43333  40.46667  50.50000  60.53333  70.56667 
##       80%       90%      100% 
##  80.60000  90.63333 100.00000
quantile(x, probs=seq(0,1,1/10), type=9)
##      0%     10%     20%     30%     40%     50%     60%     70%     80%     90% 
##   1.000  10.400  20.425  30.450  40.475  50.500  60.525  70.550  80.575  90.600 
##    100% 
## 100.000
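
For a more compact comparison, you can look at a single cut point across all nine types. A quick sketch, assuming the same x as above (the names tenth and ty are mine):

```r
x <- 1:100

# The 10% cut point under each of the nine definitions
tenth <- sapply(1:9, function(ty) {
  quantile(x, probs = 0.1, type = ty, names = FALSE)
})
names(tenth) <- paste0("type", 1:9)

print(round(tenth, 4))
```

The values should match the 10% column of the nine outputs above, ranging from 10 to 10.9.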

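In this case, type=5 is how I would do it by hand. Notice that type=1 doesn’t even put the same number into each category: its 30% and 60% cut points come out as 31 and 61 rather than 30 and 60. That looks like a floating-point effect rather than part of the definition, since seq(0,1,1/10) produces values a hair above 0.3 and 0.6, and types 1 and 2 are step functions, so a tiny excess tips them over to the next observation. I include all of this only to point out that many statistics do not have a single agreed-upon definition.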
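
To make "by hand" concrete, here is a rough sketch of the interpolation behind type=5 and the default type=7. It uses the position formulas as I read them in R's ?quantile help page: n*p + 0.5 for type 5 and (n-1)*p + 1 for type 7. The helper interpolate() is hypothetical, written only for this illustration, and is not part of R. The last lines probe the probabilities that seq() produces:

```r
x <- sort(1:100)
n <- length(x)
p <- 0.1

# Hypothetical helper: linear interpolation at a fractional, 1-based
# position h in already-sorted data
interpolate <- function(sorted, h) {
  h  <- min(max(h, 1), length(sorted))  # clamp to valid positions
  lo <- floor(h)
  hi <- min(lo + 1, length(sorted))
  sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo])
}

by_hand_5 <- interpolate(x, n * p + 0.5)      # position 10.5
by_hand_7 <- interpolate(x, (n - 1) * p + 1)  # position 10.9

# Compare each hand calculation with quantile()
print(c(by_hand_5, quantile(x, p, type = 5, names = FALSE)))
print(c(by_hand_7, quantile(x, p, type = 7, names = FALSE)))

# seq() builds its values as multiples of the step, so the fourth value
# is 3 * 0.1, which is not exactly 0.3 in binary floating point
probs <- seq(0, 1, 1/10)
print(probs[4] == 0.3)      # FALSE
print(100 * probs[4] > 30)  # is n * p slightly above 30?
```

The two definitions differ only in where they place the position h, which is why every interpolating type agrees at 0%, 50% and 100% on this data and disagrees in between.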