Handling Missing Data

Managing missing values in R

Many data sets used in teaching have no missing values, but that is not very realistic – in the real world, data are rarely perfect. One thing I like about our textbook is that it faces this topic head on in the very first chapter.

R uses “NA” to indicate a missing value, which stands for “not available”. But it is not a two-letter string; it literally means there is no value there. Many researchers do not follow this practice and instead use values like 9, 99, 999, -999, and so on (or even a question mark, as we will see). This is generally frowned upon because very often these numbers can also represent legitimate values. For example, there is a famous data set where 99 means missing in every field except for IQ score, where it means an IQ of 99!
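
If a file arrives with placeholder codes like these, one option is to convert them to NA while the file is being read. The sketch below is only an illustration: the small inline data set and its column names are made up, and it assumes that ? and -999 never stand for a real value in any column.

```r
# Hypothetical data, typed inline so the example runs on its own
raw <- "id,iq,age
1,99,-999
2,?,34
3,110,99"

# na.strings lists the codes that read.csv() should turn into NA
toy <- read.csv(text = raw, na.strings = c("?", "-999"))

toy$iq   # the legitimate IQ of 99 is kept; the ? becomes NA
toy$age  # the -999 becomes NA; the age of 99 is kept
```

Because na.strings applies to every column, a code such as 99 that means missing in one field but is a real value in another has to be handled column by column instead.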

The advantage of NA is that R functions and packages recognise it and treat it as missing rather than as a number. summary() sets NAs aside and reports how many there were, and mean() returns NA unless you pass na.rm = TRUE, so a missing value cannot slip silently into a result. When you calculate a mean for a column containing -999, those values are averaged in like any other number and you get the wrong answer with no warning. This is definitely something to be mindful of throughout the course.
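
A quick way to see the difference for yourself is to try both kinds of "missing" on a toy vector. The numbers below are made up for illustration. Note that mean() needs na.rm = TRUE before it will drop an NA; without it, the result is itself NA.

```r
# Made-up measurements; the third one is missing
with_na   <- c(10, 20, NA)
with_code <- c(10, 20, -999)

mean(with_na)                 # NA: R will not guess at the missing value
mean(with_na, na.rm = TRUE)   # 15: the NA is dropped before averaging
mean(with_code)               # -323: the placeholder is treated as real data
```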

The heart data that we’ve been using has only a small amount of missing data, for example in the Thalassemia field. (Note the capital T, and the use of the table() function to see the distribution of values.)

library(dplyr)
heart <- read.csv("C:/epi551/heart.csv", header=T)
table(heart$Thalassemia)
## 
##   ? 3.0 6.0 7.0 
##   2 166  18 117

Even if you’ve never heard of thalassemia, you might suspect that any observations with a value of ? are questionable. In these data, only three values are legitimate for this variable:

thal: 3 = normal; 6 = fixed defect; 7 = reversable defect

To have this data set work the way we expect it to, we want to replace those ? values with NA. We do that with na_if(), a function from the dplyr package we’ve been using. The next code chunk updates the contents of heart so that every ? is replaced by NA. The output of table() and unique() shows that this was successful: the ? category is gone, and NA now appears among the unique values.

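# In current dplyr (1.1.0 and later) na_if() works on one vector at a time,
# so apply it to every character column rather than to the whole data frame
heart <- mutate(heart, across(where(is.character), ~ na_if(.x, "?")))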
table(heart$Thalassemia)
## 
## 3.0 6.0 7.0 
## 166  18 117
unique(heart$Thalassemia)
## [1] "6.0" "3.0" "7.0" NA

The question marks are gone! Note that table() did not report the NA values but unique() did. A bit tricky, I know.
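
If you would like table() to show the missing values as well, its useNA argument does that. A minimal sketch on a made-up vector (not the heart data):

```r
# Made-up vector standing in for a column such as Thalassemia
x <- c("3.0", "6.0", NA, "7.0", "3.0", NA)

table(x)                    # NA values are left out by default
table(x, useNA = "ifany")   # adds an <NA> column when any are present
sum(is.na(x))               # a direct count of the missing values
```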

Suppose we now wanted Thalassemia to be an integer and not a character. The easiest way to do that is to create a new variable, using the as.integer() function (there are also as.numeric() and as.character() functions that perform the same way).

When we run the summary() function on the new variable thal, we see that it calculates the mean and median and makes no mention of question marks. Compare that with the summary() output for Thalassemia:

heart$thal <- as.integer(heart$Thalassemia)
summary(heart$thal)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##   3.000   3.000   3.000   4.734   7.000   7.000       2
summary(heart$Thalassemia)
##    Length     Class      Mode 
##       303 character character
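
It can help to see what as.integer() does with text it cannot read as a number. In the sketch below, on made-up values, a leftover "?" is also turned into NA, but R raises a coercion warning when it does so; replacing the placeholder with NA first keeps the conversion quiet and deliberate.

```r
# Made-up character values, including a leftover "?"
x <- c("3.0", "6.0", "?", NA)

# "?" cannot be read as a number, so it becomes NA and R issues a
# coercion warning (silenced here to keep the output tidy)
as_int <- suppressWarnings(as.integer(x))
as_int
is.integer(as_int)
```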

Functions such as table() and summary() set NAs aside by default, which is often what you want. But not always - sometimes you want to delete the observations altogether. Here the desired function is drop_na(), which is part of the tidyr package. Below I load this package and then drop the NA observations from the data set. In heart2 I remove the NAs from only Thalassemia. In heart3 I remove the NAs from all variables.

Note that heart has 303 rows, heart2 has 301 rows (we’ve dropped the two NA values from Thalassemia), and heart3 has 297 rows (four other observations contain NA values somewhere; we don’t know where).

library(tidyr)
heart2 <- drop_na(heart,Thalassemia)
heart3 <- drop_na(heart)
nrow(heart2)
## [1] 301
nrow(heart3)
## [1] 297
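
To find out where the remaining NAs are, one option is to count them column by column with base R. The data frame below is made up for illustration; the same two calls should work on heart.

```r
# Made-up data frame with missing values in two different columns
toy <- data.frame(
  age  = c(63, 67, NA, 41),
  thal = c(6, NA, 7, 3),
  sex  = c(1, 1, 0, 1)
)

colSums(is.na(toy))          # number of NAs in each column
sum(complete.cases(toy))     # rows with no NA anywhere
sum(!is.na(toy$thal))        # rows with a value for one chosen column
```

The last two counts are the number of rows that drop_na(toy) and drop_na(toy, thal) would keep, so they are a handy check before deleting anything.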