Classifying observations

There are several ways to classify your data into a smaller number of groups. Here are two. The first uses what is known as bracket notation (indexing a column with a logical condition), and the second uses the cut() function.

It is easiest to illustrate this with a tiny example of made-up data, which is generated in the first two lines.

df <- as.data.frame(c(3,4,5,10,15,20,30,38,77))
colnames(df)[1] <- "somedata"

#method 1 - bracket notation

df$cat[df$somedata < 10] <- "low"
df$cat[df$somedata >= 10 & df$somedata < 20] <- "medium"
df$cat[df$somedata >= 20] <- "high"

Think of the above as assigning a value to a new variable based on some condition.
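As a quick check on the bracket-notation approach, the sketch below rebuilds the same toy data and then confirms that every row received a category. It builds the data frame with data.frame() directly, which is just a shorter equivalent of the two setup lines above. Any row that matched none of the conditions would be left as NA, so counting the NAs is a cheap way to spot a gap in your conditions.

```r
# Rebuild the toy data (equivalent to the two setup lines above)
df <- data.frame(somedata = c(3, 4, 5, 10, 15, 20, 30, 38, 77))

# Classify with bracket notation, one condition per group
df$cat[df$somedata < 10] <- "low"
df$cat[df$somedata >= 10 & df$somedata < 20] <- "medium"
df$cat[df$somedata >= 20] <- "high"

# Rows that matched no condition would still be NA
n_unclassified <- sum(is.na(df$cat))
print(n_unclassified)

# Count how many observations landed in each group
print(table(df$cat))
```

Note that df$cat is a plain character column here, so table() lists the groups alphabetically rather than from low to high.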

The second method is perhaps a bit more popular. You cut the data into intervals at break points you define. -Inf and Inf (note the capital letter I) refer to negative infinity and positive infinity. That’s kind of overkill (your data has a much smaller range than that), but it’s how most people code it, and it guarantees that no finite value falls outside the breaks.

#method 2 - cut() function

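# default (right=TRUE): each interval includes its upper break point
df$cat2 <- cut(df$somedata, breaks=c(-Inf,10,20,Inf))

# right=FALSE: each interval includes its lower break point instead
df$cat3 <- cut(df$somedata, breaks=c(-Inf,10,20,Inf), right=FALSE)

# the same two cuts, with custom labels replacing the interval notation
df$cat2 <- cut(df$somedata, breaks=c(-Inf,10,20,Inf),
               labels = c("3-10","11-20","21-77"))

df$cat3 <- cut(df$somedata, breaks=c(-Inf,10,20,Inf), right=FALSE,
               labels = c("3-9","10-19","20-77"))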

The difference between the first two cut() calls above is how they handle values that land exactly on a break point. The default is right=TRUE (recall you do not have to type out the defaults, though you may choose to do so). This means that a value exactly on a break point will land in the lower category. right=FALSE puts it in the upper category, which matches what the bracket-notation version does with 10 and 20. Another way to handle this is to choose cut points that are not values in the data, like 10.01 and 20.01.
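To see the boundary behavior directly, here is a minimal sketch that cuts the same values both ways and prints only the observations that sit exactly on a break point. The names x, brks, closed_right, closed_left and on_boundary are made up for this illustration.

```r
x <- c(3, 4, 5, 10, 15, 20, 30, 38, 77)
brks <- c(-Inf, 10, 20, Inf)

# Default: right = TRUE, so 10 and 20 fall in the lower interval
closed_right <- cut(x, breaks = brks)

# right = FALSE, so 10 and 20 fall in the upper interval
closed_left <- cut(x, breaks = brks, right = FALSE)

# Show only the values that land exactly on a break point
on_boundary <- x %in% c(10, 20)
print(data.frame(x = x[on_boundary],
                 right_TRUE = closed_right[on_boundary],
                 right_FALSE = closed_left[on_boundary]))
```

Every other value is classified identically by the two calls; only 10 and 20 move.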

The labels parameter lets you customize how this variable will appear in tables and graphs.
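As a small illustration of how labels carry through to a table, the sketch below reuses the "low", "medium" and "high" names from method 1 as labels (that choice is just for this example). cut() returns a factor whose levels follow the order of the breaks, so table() reports the groups in that order. labels needs one entry per interval, which is one fewer than the number of breaks.

```r
x <- c(3, 4, 5, 10, 15, 20, 30, 38, 77)

# Three intervals, so three labels (one fewer than the four breaks)
grp <- cut(x, breaks = c(-Inf, 10, 20, Inf), right = FALSE,
           labels = c("low", "medium", "high"))

# Levels keep the order of the breaks, not alphabetical order
print(levels(grp))

# The table uses the custom labels, in level order
print(table(grp))
```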