Today we are going to learn about an important R package called dplyr (pronounced “de” as in design + “plier” as in a pair of pliers). Just like ggplot2, you will have to install it first. As a refresher: in RStudio, under the Tools menu, go to Install Packages and start typing the package name. By the time you get to dp you should see dplyr as one of your choices. Once it is installed, enter and run this line of code in the scripting window to load it:
library(dplyr)
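If you prefer typing to clicking through menus, the same installation can be done with code. This is only a minimal sketch, and it assumes you have an internet connection and a CRAN mirror configured (RStudio normally sets one for you):

```r
# Install dplyr only if it is not already on this machine, then load it.
# install.packages() downloads from CRAN, so it needs a network connection.
if (!requireNamespace("dplyr", quietly = TRUE)) {
  install.packages("dplyr")
}
library(dplyr)
```

You only need to install a package once per computer, but you need to run library(dplyr) in every new R session.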
dplyr adds a bunch of functions for organizing your data. Within the R community, people like to use the word “tidy” in this context. When you remove unwanted observations, sort, group, rename, filter and so on, this is called “tidying” your data. Packages that are related to tidying your data are collectively called the “tidyverse”. ggplot2 is also part of the tidyverse. In fact, there is an R package called tidyverse which bundles a core set of packages related to tidying your data (the exact count depends on the version you install). You could, if you like, install the tidyverse package, and then when you run library(tidyverse) you’ll have both dplyr and ggplot2 plus a bunch of others we won’t be using.
Once more we will use the heart dataset, available here.
Today we are going to cover just 3 dplyr functions: select(), filter(), and rename(). First, though, let’s use a built-in function, colnames(), to see what columns we have in our heart data. We could use head() or tail() for this, but colnames() gives just the names, without any data, and so is easier to see on the screen:
library(dplyr)
heart <- read.csv("C:/epi551/heart.csv", header=TRUE)
colnames(heart)
## [1] "Age" "Sex"
## [3] "Chest_Pain_Type" "Resting_Blood_Pressure"
## [5] "Serum_Cholesterol" "Fasting_Blood_Sugar"
## [7] "Resting_ECG" "Max_Heart_Rate_Achieved"
## [9] "Exercise_Induced_Angina" "ST_Depression_Exercise"
## [11] "Peak_Exercise_ST_Segment" "Num_Major_Vessels_Flouro"
## [13] "Thalassemia" "Diagnosis_Heart_Disease"
## [15] "diagnosis"
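There are a few other ways to peek at column names. The sketch below uses a tiny made-up data frame called toy (my own invention for illustration, not the real heart data) so that you can run it without the CSV file:

```r
library(dplyr)

# Hypothetical toy data, made up for illustration (not the real heart data)
toy <- data.frame(
  Age = c(63, 41),
  Sex = c(1, 0),
  diagnosis = c("Yes", "No")
)

names(toy)    # same result as colnames() for a data frame
str(toy)      # names plus each column's type and first few values
glimpse(toy)  # dplyr's version of str(), laid out one column per line
```

Which one you use is a matter of taste. I find colnames() the least cluttered when all I want is the names.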
First, let’s use select() to select a subset of these variables. Here I am creating a new data frame consisting of just 3 variables. I named it heart3 to signify it is the heart data with only 3 variables – you can name it anything you like. Notice in the statement I wrote capital A for Age, but lowercase d for diagnosis - remember R is case-sensitive.
heart3 <- select(heart,Age,Sex,diagnosis)
colnames(heart3)
## [1] "Age" "Sex" "diagnosis"
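select() can do more than list the columns you want one by one. Here is a small sketch of three other patterns, again on a made-up data frame (the toy name and its values are my own assumptions, chosen to resemble a few of the heart columns):

```r
library(dplyr)

# Hypothetical toy data, made up for illustration (not the real heart data)
toy <- data.frame(
  Age = c(63, 41, 57),
  Sex = c(1, 0, 1),
  Resting_ECG = c(2, 0, 0),
  Resting_Blood_Pressure = c(145, 130, 120),
  diagnosis = c("Yes", "No", "No")
)

# Drop one column and keep the rest by putting a minus sign in front of it
no_sex <- select(toy, -Sex)
colnames(no_sex)

# Keep every column whose name starts with "Resting"
resting <- select(toy, starts_with("Resting"))
colnames(resting)

# Keep a run of adjacent columns, from Age through Resting_ECG
first3 <- select(toy, Age:Resting_ECG)
colnames(first3)
```

The minus sign is handy when you want almost everything, and starts_with() saves typing when related columns share a prefix.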
Next, let’s use filter() to filter the observations. First, let’s pick only those over 50 years of age. In the first line I’m creating a new data frame called heartolder, and in the second line I’m applying the summary() function to heartolder to confirm that it worked. You can see that the minimum Age is now 51:
heartolder <- filter(heart3,Age>50)
summary(heartolder)
## Age Sex diagnosis
## Min. :51.00 Min. :0.0000 Length:209
## 1st Qu.:55.00 1st Qu.:0.0000 Class :character
## Median :59.00 Median :1.0000 Mode :character
## Mean :59.38 Mean :0.6651
## 3rd Qu.:63.00 3rd Qu.:1.0000
## Max. :77.00 Max. :1.0000
If you want a range of ages, the and operator in R is &. Here we get ages greater than or equal to 45 and less than 65:
heartmiddleage <- filter(heart3,Age>=45 & Age < 65)
summary(heartmiddleage)
## Age Sex diagnosis
## Min. :45.00 Min. :0.000 Length:207
## 1st Qu.:52.00 1st Qu.:0.000 Class :character
## Median :56.00 Median :1.000 Mode :character
## Mean :55.43 Mean :0.686
## 3rd Qu.:59.00 3rd Qu.:1.000
## Max. :64.00 Max. :1.000
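The same range can be written a couple of other ways. This sketch uses a made-up data frame (toy is hypothetical, not the heart data) and shows that all three versions keep the same rows:

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
toy <- data.frame(Age = c(44, 45, 56, 64, 65), Sex = c(1, 0, 1, 1, 0))

# Version 1: the & operator, as in the lesson
mid_and <- filter(toy, Age >= 45 & Age < 65)

# Version 2: separate conditions with commas; filter() combines them with "and"
mid_comma <- filter(toy, Age >= 45, Age < 65)

# Version 3: between() includes BOTH endpoints, so the upper limit is 64 here
mid_between <- filter(toy, between(Age, 45, 64))

mid_and$Age
mid_comma$Age
mid_between$Age
```

Be careful with between(): because it includes both ends, between(Age, 45, 65) would also keep the 65-year-olds, which is not the same as Age < 65.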
The or operator is | (on most US keyboards, it shares a key with the backslash \; hold Shift to type it). Here we get ages less than 40 or above 70. In the third line, I’ve used another function called unique() that helps illustrate that we have successfully removed everyone aged 40 through 70:
heartyoungold <- filter(heart3,Age <40 | Age > 70)
summary(heartyoungold)
## Age Sex diagnosis
## Min. :29.00 Min. :0.0000 Length:21
## 1st Qu.:35.00 1st Qu.:0.0000 Class :character
## Median :38.00 Median :1.0000 Mode :character
## Mean :46.81 Mean :0.5238
## 3rd Qu.:71.00 3rd Qu.:1.0000
## Max. :77.00 Max. :1.0000
unique(heartyoungold$Age)
## [1] 37 71 39 34 35 29 77 38 74 76
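One thing to watch when you mix | with &: R evaluates & before |, much like multiplication comes before addition. The sketch below uses made-up data (toy is hypothetical) and assumes Sex equal to 1 codes males, as in the heart data:

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
toy <- data.frame(
  Age = c(35, 38, 55, 72, 75),
  Sex = c(1, 0, 1, 0, 1)
)

# Intended: males (Sex == 1) who are under 40 or over 70
males_young_old <- filter(toy, (Age < 40 | Age > 70) & Sex == 1)
males_young_old$Age

# Without parentheses, & is evaluated before |, so this means
# "anyone under 40, OR males over 70" and lets the 38-year-old female through
no_parens <- filter(toy, Age < 40 | Age > 70 & Sex == 1)
no_parens$Age
```

When in doubt, add parentheses. They cost nothing and make your intention obvious to the next reader.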
If you want an exact age, the operator is == (two equals signs, which is not very intuitive, I know; a single = is already used for assigning values and naming arguments). Here we filter on just the 65-year-olds:
heart65 <- filter(heart3,Age==65)
summary(heart65)
## Age Sex diagnosis
## Min. :65 Min. :0.0 Length:8
## 1st Qu.:65 1st Qu.:0.0 Class :character
## Median :65 Median :0.5 Mode :character
## Mean :65 Mean :0.5
## 3rd Qu.:65 3rd Qu.:1.0
## Max. :65 Max. :1.0
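Two close relatives of == are worth knowing: != means "not equal to", and %in% checks whether a value appears in a list of values. A minimal sketch on made-up data (toy is hypothetical):

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
toy <- data.frame(Age = c(60, 65, 65, 70), Sex = c(0, 1, 0, 1))

# Exactly 65
exact <- filter(toy, Age == 65)
nrow(exact)

# Everyone except the 65-year-olds
not65 <- filter(toy, Age != 65)
not65$Age

# Any of several exact ages; shorter than Age == 60 | Age == 70
several <- filter(toy, Age %in% c(60, 70))
several$Age
```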
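You can write statements that involve multiple variables. Here I am keeping only the 65-year-old males (Sex equal to 1) with diagnosed heart disease. Note that text values such as "Yes" need quotation marks and, like variable names, are case-sensitive.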
heartx <- filter(heart3,Age==65 & Sex==1 & diagnosis=="Yes")
summary(heartx)
## Age Sex diagnosis
## Min. :65 Min. :1 Length:3
## 1st Qu.:65 1st Qu.:1 Class :character
## Median :65 Median :1 Mode :character
## Mean :65 Mean :1
## 3rd Qu.:65 3rd Qu.:1
## Max. :65 Max. :1
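Real data often has missing values, and it helps to know what filter() does with them: it keeps only rows where the condition is TRUE, so rows where the answer is unknown (NA) are quietly dropped. A small sketch on made-up data (toy and its values are hypothetical):

```r
library(dplyr)

# Hypothetical toy data, made up for illustration; one diagnosis is missing
toy <- data.frame(
  Age = c(65, 65, 65, 50),
  Sex = c(1, 1, 0, 1),
  diagnosis = c("Yes", NA, "Yes", "Yes")
)

# The second row is a 65-year-old male, but his diagnosis is NA,
# so the condition evaluates to NA and the row is dropped
matches <- filter(toy, Age == 65 & Sex == 1 & diagnosis == "Yes")
nrow(matches)  # a quick way to count how many rows matched

# To find the rows with a missing diagnosis, use is.na()
unknown <- filter(toy, is.na(diagnosis))
nrow(unknown)

# Text comparisons are case-sensitive: "yes" does not match "Yes"
wrong_case <- filter(toy, diagnosis == "yes")
nrow(wrong_case)
```

nrow() is a lighter alternative to summary() when all you want to confirm is how many observations survived the filter.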
Finally, notice above how some of the variable names are quite long. That’s a matter of style and preference – long names are more informative, but they also take longer to type and take up more of your screen space. I’m more of a minimalist namer myself. Let’s change Chest_Pain_Type to pain and Fasting_Blood_Sugar to sugar using rename(). Inside rename(), the new name goes on the left of the equals sign and the existing name on the right:
heart <- rename(heart,pain=Chest_Pain_Type,sugar=Fasting_Blood_Sugar)
colnames(heart)
## [1] "Age" "Sex"
## [3] "pain" "Resting_Blood_Pressure"
## [5] "Serum_Cholesterol" "sugar"
## [7] "Resting_ECG" "Max_Heart_Rate_Achieved"
## [9] "Exercise_Induced_Angina" "ST_Depression_Exercise"
## [11] "Peak_Exercise_ST_Segment" "Num_Major_Vessels_Flouro"
## [13] "Thalassemia" "Diagnosis_Heart_Disease"
## [15] "diagnosis"
Notice that I didn’t create a new data frame this time – I simply updated heart. I generally don’t recommend this, because you may accidentally overwrite something you didn’t mean to. As you become more experienced, you’ll gain the confidence to apply minor changes this way, though even as an advanced user I am very cautious about doing it.
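If you would rather play it safe, save the renamed version under a new name and leave the original alone. A minimal sketch on made-up data (toy and toy_short are hypothetical names of my own):

```r
library(dplyr)

# Hypothetical toy data, made up for illustration
toy <- data.frame(
  Chest_Pain_Type = c(1, 4),
  Fasting_Blood_Sugar = c(0, 1),
  Age = c(63, 41)
)

# new name = old name; the result goes into a NEW data frame
toy_short <- rename(toy, pain = Chest_Pain_Type, sugar = Fasting_Blood_Sugar)

colnames(toy_short)  # the renamed copy
colnames(toy)        # the original still has the long names
```

This way, if you mistype something, the original data frame is still there and you can simply try again.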
Try coming up with a few of your own examples using select(), filter(), and rename().