Basic functions in the dplyr package

Today we are going to learn about an important R package called dplyr (pronounced “de” as in design + “plier” as in a pair of pliers). Just like ggplot2, you will have to install it first. As a refresher: under the Tools menu, go to Install Packages, then start typing the package name. By the time you get to dp you should see dplyr as one of your choices. Once it is installed, enter and run this line of code in the scripting window:

library(dplyr)

dplyr adds a bunch of functions and commands for organizing your data. Within the R community, people like to use the word “tidy” in this context. When you remove unwanted observations, sort, group, rename, filter and so on, this is called “tidying” your data. Packages that are related to tidying your data are collectively called the “tidyverse”. ggplot2 is also part of the tidyverse. In fact, there is an R package called tidyverse which contains 8 packages related to tidying your data. You could, if you like, install the tidyverse package, and then when you run library(tidyverse) you’ll have both dplyr and ggplot2 plus a bunch of others we won’t be using.
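If you would rather type than click through the Tools menu, the install step can also be done in code. This is only a minimal sketch, assuming you have an internet connection and a default package repository set up; install.packages() and requireNamespace() are ordinary base R functions:

```r
# Install dplyr only if it is not already installed (a one-time step)
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}

# Load it for the current session (needed every time you restart R)
library(dplyr)

# Optional alternative: install.packages("tidyverse"), then library(tidyverse)
```

Installing happens once per computer, while library() has to be run in every new R session.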

Once more we will use the heart dataset, available here.

Today we are going to cover just 3 dplyr functions: select(), filter(), and rename(). First, though, let’s use a built-in function, colnames(), to see what columns we have in our heart data. We could use head() or tail() for this, but colnames() gives just the names, without any data, and so is easier to read on the screen:

library(dplyr)
heart <- read.csv("C:/epi551/heart.csv", header=TRUE)
colnames(heart)

##  [1] "Age"                      "Sex"                     
##  [3] "Chest_Pain_Type"          "Resting_Blood_Pressure"  
##  [5] "Serum_Cholesterol"        "Fasting_Blood_Sugar"     
##  [7] "Resting_ECG"              "Max_Heart_Rate_Achieved" 
##  [9] "Exercise_Induced_Angina"  "ST_Depression_Exercise"  
## [11] "Peak_Exercise_ST_Segment" "Num_Major_Vessels_Flouro"
## [13] "Thalassemia"              "Diagnosis_Heart_Disease" 
## [15] "diagnosis"

First, let’s use select() to select a subset of these variables. Here I am creating a new data frame consisting of just 3 variables. I named it heart3 to signify it is the heart data with only 3 variables – you can name it anything you like. Notice in the statement I wrote capital A for Age, but lowercase d for diagnosis - remember R is case-sensitive.

heart3 <- select(heart,Age,Sex,diagnosis)
colnames(heart3)

## [1] "Age"       "Sex"       "diagnosis"
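select() can also drop columns or keep a run of neighbouring columns. Here is a small sketch of those variations. The toy data frame is hypothetical (made-up values standing in for the heart data), so you can run it without the CSV file:

```r
library(dplyr)

# Hypothetical toy data standing in for the heart data (values are made up)
toy <- data.frame(
  Age = c(29, 39, 45, 50, 64, 65, 65, 77),
  Sex = c(1, 0, 1, 1, 0, 1, 0, 1),
  diagnosis = c("No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes")
)

# Keep columns by name (same pattern as heart3 above)
keep <- select(toy, Age, diagnosis)

# Drop a column by putting a minus sign in front of its name
dropped <- select(toy, -Sex)

# Keep a run of adjacent columns with a colon
run <- select(toy, Age:Sex)

colnames(keep)     # "Age" "diagnosis"
colnames(dropped)  # "Age" "diagnosis"
colnames(run)      # "Age" "Sex"
```

Dropping with a minus sign is handy when you want almost everything and it is shorter to list what you don’t want.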

Next, let’s use filter() to filter the observations. First, let’s pick only those over 50 years of age. In the first line I’m creating a new data frame called heartolder, and in the second line I’m applying the summary() function to heartolder to confirm that it worked. You can see that the minimum Age is now 51:

heartolder <- filter(heart3,Age>50)
summary(heartolder)

##       Age             Sex          diagnosis        
##  Min.   :51.00   Min.   :0.0000   Length:209        
##  1st Qu.:55.00   1st Qu.:0.0000   Class :character  
##  Median :59.00   Median :1.0000   Mode  :character  
##  Mean   :59.38   Mean   :0.6651                     
##  3rd Qu.:63.00   3rd Qu.:1.0000                     
##  Max.   :77.00   Max.   :1.0000
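If summary() feels like more output than you need, a quicker check is to count the rows and look at the smallest age left. A minimal sketch on a hypothetical toy data frame (made-up values, not the real heart data), which also shows the difference between > and >=:

```r
library(dplyr)

# Hypothetical toy data standing in for the heart data (values are made up)
toy <- data.frame(
  Age = c(29, 39, 45, 50, 64, 65, 65, 77),
  Sex = c(1, 0, 1, 1, 0, 1, 0, 1),
  diagnosis = c("No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes")
)

# Strictly greater than 50: the 50-year-old is NOT kept
older <- filter(toy, Age > 50)
nrow(older)      # 4
min(older$Age)   # 64

# Greater than or equal to 50: the 50-year-old IS kept
older_incl <- filter(toy, Age >= 50)
nrow(older_incl)     # 5
min(older_incl$Age)  # 50
```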

If you want a range of ages, the and operator in R is &. Here we get ages greater than or equal to 45 and less than 65:

heartmiddleage <- filter(heart3,Age>=45 & Age < 65)
summary(heartmiddleage)

##       Age             Sex         diagnosis        
##  Min.   :45.00   Min.   :0.000   Length:207        
##  1st Qu.:52.00   1st Qu.:0.000   Class :character  
##  Median :56.00   Median :1.000   Mode  :character  
##  Mean   :55.43   Mean   :0.686                     
##  3rd Qu.:59.00   3rd Qu.:1.000                     
##  Max.   :64.00   Max.   :1.000
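There is more than one way to write an age range. In the sketch below, which uses a hypothetical toy data frame with made-up values, the first version uses &, the second separates the conditions with a comma (filter() treats commas as “and”), and the third uses dplyr’s between() helper. Note that between() includes both endpoints, so for whole-number ages the upper limit has to be written as 64 to match Age < 65:

```r
library(dplyr)

# Hypothetical toy data standing in for the heart data (values are made up)
toy <- data.frame(
  Age = c(29, 39, 45, 50, 64, 65, 65, 77),
  Sex = c(1, 0, 1, 1, 0, 1, 0, 1),
  diagnosis = c("No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes")
)

# Version 1: explicit "and" operator
mid_and <- filter(toy, Age >= 45 & Age < 65)

# Version 2: conditions separated by a comma are combined with "and"
mid_comma <- filter(toy, Age >= 45, Age < 65)

# Version 3: between() is inclusive at both ends (45 <= Age <= 64)
mid_between <- filter(toy, between(Age, 45, 64))

mid_and$Age      # 45 50 64
mid_comma$Age    # 45 50 64
mid_between$Age  # 45 50 64
```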

The or operator is | (on most US keyboards, it’s Shift plus the backslash key). Here we get ages less than 40 or above 70. In the third line of code, I’ve used another function called unique() that helps illustrate that we have successfully removed everyone aged 40 through 70:

heartyoungold <- filter(heart3,Age <40 | Age > 70)
summary(heartyoungold)

##       Age             Sex          diagnosis        
##  Min.   :29.00   Min.   :0.0000   Length:21         
##  1st Qu.:35.00   1st Qu.:0.0000   Class :character  
##  Median :38.00   Median :1.0000   Mode  :character  
##  Mean   :46.81   Mean   :0.5238                     
##  3rd Qu.:71.00   3rd Qu.:1.0000                     
##  Max.   :77.00   Max.   :1.0000

unique(heartyoungold$Age)

##  [1] 37 71 39 34 35 29 77 38 74 76
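A common slip is to use & where | is needed. The sketch below, on a hypothetical toy data frame with made-up values, shows why: nobody can be both under 40 and over 70, so the & version returns no rows. It also wraps unique() in sort(), which makes the list of remaining ages easier to scan:

```r
library(dplyr)

# Hypothetical toy data standing in for the heart data (values are made up)
toy <- data.frame(
  Age = c(29, 39, 45, 50, 64, 65, 65, 77),
  Sex = c(1, 0, 1, 1, 0, 1, 0, 1),
  diagnosis = c("No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes")
)

# "or": keep rows where EITHER condition is true
young_or_old <- filter(toy, Age < 40 | Age > 70)
sort(unique(young_or_old$Age))  # 29 39 77

# "and": keep rows where BOTH conditions are true (impossible here)
nobody <- filter(toy, Age < 40 & Age > 70)
nrow(nobody)  # 0
```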

If you want an exact age, the operator is == (two equals signs, which is not very intuitive, I know, but a single = is already used for assignment and for naming arguments). Here we filter on just the 65-year-olds:

heart65 <- filter(heart3,Age==65)
summary(heart65)

##       Age          Sex       diagnosis        
##  Min.   :65   Min.   :0.0   Length:8          
##  1st Qu.:65   1st Qu.:0.0   Class :character  
##  Median :65   Median :0.5   Mode  :character  
##  Mean   :65   Mean   :0.5                     
##  3rd Qu.:65   3rd Qu.:1.0                     
##  Max.   :65   Max.   :1.0
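Here is a short sketch of == alongside its opposite, != (“not equal to”), on a hypothetical toy data frame with made-up values. The last part shows what I would expect to happen if you type a single = by mistake: as far as I know, dplyr stops with an error rather than quietly doing the wrong thing, and tryCatch() is used here only so the script keeps running:

```r
library(dplyr)

# Hypothetical toy data standing in for the heart data (values are made up)
toy <- data.frame(
  Age = c(29, 39, 45, 50, 64, 65, 65, 77),
  Sex = c(1, 0, 1, 1, 0, 1, 0, 1),
  diagnosis = c("No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes")
)

# Exactly 65
exactly65 <- filter(toy, Age == 65)
nrow(exactly65)  # 2

# Everyone except the 65-year-olds
not65 <- filter(toy, Age != 65)
nrow(not65)      # 6

# A single = is a slip; dplyr is expected to raise an error here.
# tryCatch() catches that error so the rest of the script still runs.
slip <- tryCatch(filter(toy, Age = 65), error = function(e) "error")
```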

You can write statements that involve multiple variables. Here I am filtering for 65-year-old males (Sex equal to 1) with diagnosed heart disease:

heartx <- filter(heart3,Age==65 & Sex==1 & diagnosis=="Yes")
summary(heartx)

##       Age          Sex     diagnosis        
##  Min.   :65   Min.   :1   Length:3          
##  1st Qu.:65   1st Qu.:1   Class :character  
##  Median :65   Median :1   Mode  :character  
##  Mean   :65   Mean   :1                     
##  3rd Qu.:65   3rd Qu.:1                     
##  Max.   :65   Max.   :1
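When you compare against text, remember that the match has to be exact, including capitalization. A minimal sketch on a hypothetical toy data frame (made-up values; the coding of Sex and diagnosis simply mirrors the example above):

```r
library(dplyr)

# Hypothetical toy data standing in for the heart data (values are made up)
toy <- data.frame(
  Age = c(29, 39, 45, 50, 64, 65, 65, 77),
  Sex = c(1, 0, 1, 1, 0, 1, 0, 1),
  diagnosis = c("No", "No", "Yes", "No", "Yes", "Yes", "No", "Yes")
)

# Three conditions on three different variables, all of which must hold
match_yes <- filter(toy, Age == 65 & Sex == 1 & diagnosis == "Yes")
nrow(match_yes)  # 1

# Text comparisons are case-sensitive: "yes" does not match "Yes"
match_lower <- filter(toy, Age == 65 & Sex == 1 & diagnosis == "yes")
nrow(match_lower)  # 0
```

If a filter unexpectedly returns zero rows, a mismatch in spelling or capitalization is one of the first things to check.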

Finally, notice above how some of the variable names are quite long. That’s a matter of style and preference – long names are more informative, but they also take longer to type and take up more of your screen space. I’m more of a minimalist namer myself. Let’s change Chest_Pain_Type to pain and Fasting_Blood_Sugar to sugar using rename():

heart <- rename(heart,pain=Chest_Pain_Type,sugar=Fasting_Blood_Sugar)
colnames(heart)

##  [1] "Age"                      "Sex"                     
##  [3] "pain"                     "Resting_Blood_Pressure"  
##  [5] "Serum_Cholesterol"        "sugar"                   
##  [7] "Resting_ECG"              "Max_Heart_Rate_Achieved" 
##  [9] "Exercise_Induced_Angina"  "ST_Depression_Exercise"  
## [11] "Peak_Exercise_ST_Segment" "Num_Major_Vessels_Flouro"
## [13] "Thalassemia"              "Diagnosis_Heart_Disease" 
## [15] "diagnosis"

Notice that I didn’t create a new data frame this time – I simply updated heart. I generally don’t recommend this, because you may accidentally erase something you didn’t mean to. As you become more experienced, you’ll gain the confidence to let yourself apply minor changes this way. Even as an advanced user, I am very cautious about doing this.
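The more cautious pattern is to save the renamed version under a new name so the original stays untouched. A minimal sketch, using a hypothetical two-row data frame with made-up values; heart_toy and heart_short are names I chose for illustration:

```r
library(dplyr)

# Hypothetical stand-in for the heart data (values are made up)
heart_toy <- data.frame(
  Age = c(29, 65),
  Chest_Pain_Type = c(1, 4),
  Fasting_Blood_Sugar = c(0, 1)
)

# rename() takes new_name = old_name pairs; the result goes into a NEW object
heart_short <- rename(heart_toy,
                      pain = Chest_Pain_Type,
                      sugar = Fasting_Blood_Sugar)

colnames(heart_short)  # "Age" "pain" "sugar"
colnames(heart_toy)    # "Age" "Chest_Pain_Type" "Fasting_Blood_Sugar"
```

If something goes wrong with the renamed copy, the original is still there to fall back on, at the cost of one more object in your workspace.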

Try coming up with a few of your own examples using select(), filter(), and rename().