Here I talk about a few very basic R functions, including:
read.csv()
head()
tail()
mean()
min()
max()
summary()
A reminder that R is case sensitive! If something doesn’t work as expected, the first thing you should do is make sure you didn’t type V where you meant v.
Now I return to the heart dataset from the other day, which again is available here. I run a function called select() to limit it to just four variables to keep things simple:
library(dplyr)
heart <- read.csv("C:/epi551/old/heart.csv", header=TRUE)
heart <- select(heart,Age,Sex,Chest_Pain_Type,Diagnosis_Heart_Disease)
Notice the <- operator here. We are assigning the contents of the file to an entity called heart. The R term of art here is not entity, but data frame: a table made up of rows and columns.
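To make the idea of a data frame a little more concrete, here is a minimal sketch that builds a tiny one by hand and inspects it. The name toy and the values in it are made up for illustration; they are not taken from the heart file.

```r
# A small, made-up data frame (the values are illustrative only)
toy <- data.frame(
  Age = c(63, 67, 37),
  Sex = c(1, 1, 0)
)

class(toy)   # "data.frame"
dim(toy)     # number of rows, then number of columns: 3 2
names(toy)   # the column names: "Age" "Sex"
```

The same three functions work on heart, which is a quick way to confirm that the file was read in the way you expected.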
read.csv() is a function whose purpose is to read in a comma-separated file from the specified location. header=TRUE is called a parameter, which here means that the first row of the file contains a header consisting of the column names. If this were not the case, we would type header=FALSE.
This is how all functions work: the name, followed by one or more parameters in parentheses, separated by commas.
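As a small sketch of what the header parameter changes, the example below reads the same made-up CSV content twice. To keep it self-contained it passes the content through the text parameter rather than a file path; csv_text and its values are assumptions for illustration, not part of the heart data.

```r
# Made-up CSV content, supplied as text instead of a file path
csv_text <- "Age,Sex\n63,1\n67,1"

with_header    <- read.csv(text = csv_text, header = TRUE)
without_header <- read.csv(text = csv_text, header = FALSE)

names(with_header)     # "Age" "Sex"
nrow(with_header)      # 2
names(without_header)  # "V1" "V2" (default names supplied by R)
nrow(without_header)   # 3, because the header line is treated as data
```

With header=FALSE, the column names end up as the first row of data, which is usually a sign that the wrong setting was used.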
Here’s an example:
x <- pnorm(0.3, mean=1, sd=2)
Don’t worry about what this is doing; that will be the subject of a later post. For now, simply recognize that we are assigning to x the result of a function, pnorm(), that is called with three parameters.
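Without getting into what pnorm() computes, the sketch below shows only how parameters are matched: by position, by name, or a mix of the two. The names x1, x2 and x3 are made up for this illustration.

```r
x1 <- pnorm(0.3, mean = 1, sd = 2)      # first parameter by position, the rest by name
x2 <- pnorm(0.3, 1, 2)                  # all three by position
x3 <- pnorm(sd = 2, mean = 1, q = 0.3)  # all by name, so the order does not matter

identical(x1, x2)   # TRUE
identical(x1, x3)   # TRUE
```

Naming the parameters takes a little more typing, but it makes the call easier to read and protects against putting values in the wrong order.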
To inspect the heart data itself, you just need to enter the name either as a line of code or in the console:
heart
## Age Sex Chest_Pain_Type Diagnosis_Heart_Disease
## 1 63 1 1 0
## 2 67 1 4 2
## 3 67 1 4 1
## 4 37 1 3 0
## 5 41 0 2 0
## 6 56 1 2 0
## 7 62 0 4 3
## 8 57 0 4 0
## 9 63 1 4 2
## 10 53 1 4 1
## 11 57 1 4 0
## 12 56 0 2 0
## 13 56 1 3 2
## 14 44 1 2 0
## 15 52 1 3 0
## 16 57 1 3 0
## 17 48 1 2 1
## 18 54 1 4 0
## 19 48 0 3 0
## 20 49 1 2 0
## 21 64 1 1 0
## 22 58 0 1 0
## 23 58 1 2 1
## 24 58 1 3 3
## 25 60 1 4 4
## 26 50 0 3 0
## 27 58 0 3 0
## 28 66 0 1 0
## 29 43 1 4 0
## 30 40 1 4 3
## 31 69 0 1 0
## 32 60 1 4 2
## 33 64 1 3 1
## 34 59 1 4 0
## 35 44 1 3 0
## 36 42 1 4 0
## 37 43 1 4 3
## 38 57 1 4 1
## 39 55 1 4 3
## 40 61 1 3 0
## 41 65 0 4 4
## 42 40 1 1 0
## 43 71 0 2 0
## 44 59 1 3 0
## 45 61 0 4 1
## 46 58 1 3 4
## 47 51 1 3 0
## 48 50 1 4 4
## 49 65 0 3 0
## 50 53 1 3 0
## 51 41 0 2 0
## 52 65 1 4 0
## 53 44 1 4 2
## 54 44 1 2 0
## 55 60 1 4 1
## 56 54 1 4 1
## 57 50 1 3 1
## 58 41 1 4 1
## 59 54 1 3 0
## 60 51 1 1 0
## 61 51 0 4 2
## 62 46 0 3 0
## 63 58 1 4 1
## 64 54 0 3 0
## 65 54 1 4 2
## 66 60 1 4 2
## 67 60 1 3 1
## 68 54 1 3 0
## 69 59 1 4 2
## 70 46 1 3 1
## 71 65 0 3 0
## 72 67 1 4 3
## 73 62 1 4 1
## 74 65 1 4 1
## 75 44 1 4 1
## 76 65 0 3 0
## 77 60 1 4 1
## 78 51 0 3 0
## 79 48 1 2 0
## 80 58 1 4 3
## 81 45 1 4 0
## 82 53 0 4 0
## 83 39 1 3 0
## 84 68 1 3 3
## 85 52 1 2 0
## 86 44 1 3 0
## 87 47 1 3 0
## 88 53 0 3 0
## 89 53 0 4 0
## 90 51 0 3 0
## 91 66 1 4 0
## 92 62 0 4 3
## 93 62 1 3 0
## 94 44 0 3 0
## 95 63 0 3 0
## 96 52 1 4 1
## 97 59 1 4 2
## 98 60 0 4 3
## 99 52 1 2 0
## 100 48 1 4 0
## 101 45 1 4 0
## 102 34 1 1 0
## 103 57 0 4 0
## 104 71 0 3 0
## 105 49 1 3 3
## 106 54 1 2 0
## 107 59 1 4 2
## 108 57 1 3 1
## 109 61 1 4 2
## 110 39 1 4 3
## 111 61 0 4 1
## 112 56 1 4 1
## 113 52 1 1 0
## 114 43 0 4 2
## 115 62 0 3 2
## 116 41 1 2 0
## 117 58 1 3 0
## 118 35 0 4 0
## 119 63 1 4 3
## 120 65 1 4 2
## 121 48 1 4 3
## 122 63 0 4 4
## 123 51 1 3 0
## 124 55 1 4 3
## 125 65 1 1 1
## 126 45 0 2 0
## 127 56 0 4 3
## 128 54 1 4 3
## 129 44 1 2 0
## 130 62 0 4 0
## 131 54 1 3 0
## 132 51 1 3 0
## 133 29 1 2 0
## 134 51 1 4 0
## 135 43 0 3 0
## 136 55 0 2 0
## 137 70 1 4 4
## 138 62 1 2 3
## 139 35 1 4 1
## 140 51 1 3 0
## 141 59 1 2 0
## 142 59 1 1 1
## 143 52 1 2 0
## 144 64 1 3 1
## 145 58 1 3 0
## 146 47 1 3 1
## 147 57 1 4 4
## 148 41 1 3 0
## 149 45 1 2 0
## 150 60 0 3 0
## 151 52 1 1 0
## 152 42 0 4 0
## 153 67 0 3 0
## 154 55 1 4 4
## 155 64 1 4 3
## 156 70 1 4 1
## 157 51 1 4 1
## 158 58 1 4 1
## 159 60 1 4 2
## 160 68 1 3 0
## 161 46 1 2 0
## 162 77 1 4 4
## 163 54 0 3 0
## 164 58 0 4 0
## 165 48 1 3 0
## 166 57 1 4 0
## 167 52 1 3 0
## 168 54 0 2 0
## 169 35 1 4 1
## 170 45 0 2 0
## 171 70 1 3 3
## 172 53 1 4 0
## 173 59 0 4 1
## 174 62 0 4 0
## 175 64 1 4 4
## 176 57 1 4 1
## 177 52 1 4 0
## 178 56 1 4 1
## 179 43 1 3 0
## 180 53 1 3 0
## 181 48 1 4 3
## 182 56 0 4 2
## 183 42 1 1 0
## 184 59 1 1 0
## 185 60 0 4 1
## 186 63 0 2 0
## 187 42 1 3 0
## 188 66 1 2 2
## 189 54 1 2 1
## 190 69 1 3 2
## 191 50 1 3 0
## 192 51 1 4 3
## 193 43 1 4 1
## 194 62 0 4 2
## 195 68 0 3 0
## 196 67 1 4 3
## 197 69 1 1 0
## 198 45 0 4 0
## 199 50 0 2 0
## 200 59 1 1 1
## 201 50 0 4 0
## 202 64 0 4 0
## 203 57 1 3 0
## 204 64 0 3 0
## 205 43 1 4 0
## 206 45 1 4 3
## 207 58 1 4 3
## 208 50 1 4 3
## 209 55 1 2 0
## 210 62 0 4 1
## 211 37 0 3 0
## 212 38 1 1 4
## 213 41 1 3 0
## 214 66 0 4 3
## 215 52 1 4 1
## 216 56 1 1 0
## 217 46 0 2 0
## 218 46 0 4 0
## 219 64 0 4 0
## 220 59 1 4 0
## 221 41 0 3 0
## 222 54 0 3 0
## 223 39 0 3 0
## 224 53 1 4 3
## 225 63 0 4 1
## 226 34 0 2 0
## 227 47 1 4 0
## 228 67 0 3 0
## 229 54 1 4 3
## 230 66 1 4 2
## 231 52 0 3 0
## 232 55 0 4 2
## 233 49 1 3 1
## 234 74 0 2 0
## 235 54 0 3 0
## 236 54 1 4 3
## 237 56 1 4 2
## 238 46 1 4 1
## 239 49 0 2 0
## 240 42 1 2 0
## 241 41 1 2 0
## 242 41 0 2 0
## 243 49 0 4 0
## 244 61 1 1 2
## 245 60 0 3 0
## 246 67 1 4 2
## 247 58 1 4 2
## 248 47 1 4 1
## 249 52 1 4 3
## 250 62 1 2 0
## 251 57 1 4 0
## 252 58 1 4 1
## 253 64 1 4 0
## 254 51 0 3 0
## 255 43 1 4 0
## 256 42 0 3 0
## 257 67 0 4 0
## 258 76 0 3 0
## 259 70 1 2 0
## 260 57 1 2 1
## 261 44 0 3 0
## 262 58 0 2 3
## 263 60 0 1 0
## 264 44 1 3 0
## 265 61 1 4 4
## 266 42 1 4 2
## 267 52 1 4 2
## 268 59 1 3 2
## 269 40 1 4 1
## 270 42 1 3 0
## 271 61 1 4 1
## 272 66 1 4 0
## 273 46 1 4 2
## 274 71 0 4 0
## 275 59 1 1 1
## 276 64 1 1 0
## 277 66 0 3 0
## 278 39 0 3 0
## 279 57 1 2 1
## 280 58 0 4 0
## 281 57 1 4 2
## 282 47 1 3 0
## 283 55 0 4 3
## 284 35 1 2 0
## 285 61 1 4 2
## 286 58 1 4 4
## 287 58 0 4 2
## 288 58 1 2 0
## 289 56 1 2 0
## 290 56 1 2 0
## 291 67 1 3 1
## 292 55 0 2 0
## 293 44 1 4 2
## 294 63 1 4 2
## 295 63 0 4 1
## 296 41 1 2 0
## 297 59 1 4 3
## 298 57 0 4 1
## 299 45 1 1 1
## 300 68 1 4 2
## 301 57 1 4 3
## 302 57 0 2 1
## 303 38 1 3 0
Try this, and you will see that it prints the entire contents of heart to the console. You’ll need to scroll to see it all. Depending on how you have your environment set up, it may show only the first 50 or 100 rows and then tell you that the rest were not printed. Regardless, this is not particularly helpful. A friendlier way to look at the data is to just look at the first few rows, and the function which accomplishes this is head().
head(heart)
## Age Sex Chest_Pain_Type Diagnosis_Heart_Disease
## 1 63 1 1 0
## 2 67 1 4 2
## 3 67 1 4 1
## 4 37 1 3 0
## 5 41 0 2 0
## 6 56 1 2 0
You should see the first 6 rows of the data. To see the last 6 rows, the function is tail():
tail(heart)
## Age Sex Chest_Pain_Type Diagnosis_Heart_Disease
## 298 57 0 4 1
## 299 45 1 1 1
## 300 68 1 4 2
## 301 57 1 4 3
## 302 57 0 2 1
## 303 38 1 3 0
To see a number of rows other than 6, add the desired number as a parameter:
head(heart,12)
## Age Sex Chest_Pain_Type Diagnosis_Heart_Disease
## 1 63 1 1 0
## 2 67 1 4 2
## 3 67 1 4 1
## 4 37 1 3 0
## 5 41 0 2 0
## 6 56 1 2 0
## 7 62 0 4 3
## 8 57 0 4 0
## 9 63 1 4 2
## 10 53 1 4 1
## 11 57 1 4 0
## 12 56 0 2 0
tail(heart,2)
## Age Sex Chest_Pain_Type Diagnosis_Heart_Disease
## 302 57 0 2 1
## 303 38 1 3 0
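Here is a short sketch of the same idea on a small, made-up data frame (toy is a hypothetical name, not part of the heart data), including what happens when you ask for more rows than exist.

```r
toy <- data.frame(id = 1:10, value = (1:10)^2)  # made-up data

head(toy, 3)         # the first 3 rows
tail(toy, 3)         # the last 3 rows
head(toy, n = 3)     # same as head(toy, 3), with the parameter named

nrow(head(toy))      # 6, the default
nrow(head(toy, 20))  # 10, because asking for too many rows just returns them all
```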
If your data frame contains many variables, or some of them have very long names, you will likely still have to scroll to see everything, but it’s a lot more manageable than before.
Very often it is helpful to clear the RStudio console to remove things we no longer need. That is accomplished by clicking on the little broom icon at the top right of the console window, or by pressing Ctrl+L.
Now it’s time for your first statistical function! The mean (also known as the average) is the sum of all the values divided by the number of values. In R, this function is mean().
To refer to a particular column in R, we use a $. So, for example, to find the mean age, we type:
mean(heart$Age)
## [1] 54.43894
Note that Age contains a capital A. If you type mean(heart$age), you will not get the mean: heart$age does not match any column, so R returns NA along with a warning. Personally, I try not to use variable names with capital letters for this reason, but in this case the variable names were chosen by someone else, so it is beyond our control.
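The sketch below ties the definition of the mean to the $ notation on a small, made-up data frame (toy and its four ages are assumptions for illustration). It also shows that a lowercase name does not match a capitalized column.

```r
toy <- data.frame(Age = c(63, 67, 37, 41))  # made-up ages

toy$Age                         # the Age column as a plain vector
sum(toy$Age) / length(toy$Age)  # 52, the mean computed by hand
mean(toy$Age)                   # 52, the same value

is.null(toy$age)                # TRUE: there is no column named "age"
```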
To find the minimum and maximum ages, the functions are min() and max():
min(heart$Age)
## [1] 29
max(heart$Age)
## [1] 77
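As a small aside, base R also has range(), which returns the minimum and maximum together. The sketch below uses a made-up vector called ages rather than the heart data.

```r
ages <- c(63, 67, 37, 41)  # made-up values

min(ages)              # 37
max(ages)              # 67
range(ages)            # 37 67, the minimum and maximum in one call
max(ages) - min(ages)  # 30, the spread between youngest and oldest
```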
Finally, the summary() function gives you some basic information about all the variables. We will be using this one a lot.
summary(heart)
## Age Sex Chest_Pain_Type Diagnosis_Heart_Disease
## Min. :29.00 Min. :0.0000 Min. :1.000 Min. :0.0000
## 1st Qu.:48.00 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:0.0000
## Median :56.00 Median :1.0000 Median :3.000 Median :0.0000
## Mean :54.44 Mean :0.6799 Mean :3.158 Mean :0.9373
## 3rd Qu.:61.00 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:2.0000
## Max. :77.00 Max. :1.0000 Max. :4.000 Max. :4.0000
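For each variable in the data frame, it gives us the minimum, first quartile, median, mean, third quartile, and maximum. If a variable contains missing values, summary() also reports how many there are in an extra row labelled NA's; none of the four variables here has any, so that row does not appear.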
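To see how missing values show up, here is a minimal sketch using a made-up data frame with one missing age. R represents a missing value as NA; the name toy and the values are assumptions for illustration only.

```r
toy <- data.frame(
  Age = c(63, 67, NA, 41),  # one missing age
  Sex = c(1, 1, 0, 0)
)

summary(toy)                 # the Age column gains an NA's entry

sum(is.na(toy$Age))          # 1, the number of missing ages
mean(toy$Age)                # NA, because one value is unknown
mean(toy$Age, na.rm = TRUE)  # 57, the mean of the three known ages
```

The na.rm = TRUE parameter tells mean() to drop missing values before calculating; min() and max() accept it as well.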