Z scores
R has many built-in data sets. We haven’t used any of them yet, but here I will draw upon some built-in air quality data from New York in 1973. The built-in data set is called airquality and the variable we want is called Ozone.
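Before pulling out a single variable, it can help to glance at the whole data set. A quick look with base R:

```r
# The built-in airquality data set: 153 daily observations of
# Ozone, Solar.R, Wind, Temp, Month and Day (New York, 1973)
str(airquality)

# The Ozone column has missing values -- note the NA count
summary(airquality$Ozone)
n_missing <- sum(is.na(airquality$Ozone))
n_missing
## [1] 37
```

So 37 of the 153 Ozone readings are missing, which is why the next step needs na.omit().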
o3 <- airquality$Ozone #because o3 is the chemical formula for ozone
mean(o3) #we get NA because there is missing data
## [1] NA
o3 <- na.omit(airquality$Ozone) #fixes that problem
mean(o3)
## [1] 42.12931
sd(o3)
## [1] 32.98788
Suppose we want to know the probability of a measurement being over 100 (assuming normality). We can use the pnorm() function that we saw yesterday. Note that instead of typing 42.12931 and 32.98788, which carries the risk of entering a typo, I can put the mean() and sd() functions right inside the pnorm() function:
1-pnorm(100,mean=mean(o3),sd=sd(o3))
## [1] 0.03968944
So about 4%. Since there are 116 observations, we would expect about 4.6 such values. Let’s see how close it is to reality. Note the use of bracket notation to find the answer quickly, but I could have used filter() as well.
o3[o3>100]
## [1] 115 135 108 122 110 168 118
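For reference, here is what the filter() route mentioned above would look like. This is a sketch that assumes the dplyr package is installed; since filter() works on data frames rather than vectors, we filter airquality itself instead of the o3 vector:

```r
library(dplyr)

# Keep rows whose Ozone reading exceeds 100;
# filter() drops the NA rows automatically
high_o3 <- airquality %>% filter(Ozone > 100)
high_o3$Ozone   # the same 7 values as the bracket-notation approach
nrow(high_o3)
## [1] 7
```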
We see there were 7 observations over 100, not 4.6. That is a bit on the high side, which suggests the data may be skewed, but that’s tangential to our main objective here.
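If you want a quick check on that suspicion of skewness, here is one sketch using base R. For right-skewed data the mean is typically pulled above the median by the long right tail:

```r
o3 <- na.omit(airquality$Ozone)

mean(o3)    # pulled upward by the large readings
## [1] 42.12931
median(o3)  # the middle of the data sits well below the mean
## [1] 31.5
hist(o3, main = "Ozone readings", xlab = "Ozone (ppb)")  # long right tail
```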
Here we will use the method of standardization or normalization (in this context the two terms are interchangeable) to answer the same question: what is the probability of a reading above 100? To do this, we calculate what is known as a z-score. The formula is (X-mean)/sd, where X is our value of interest.
z <- (100-mean(o3))/sd(o3)
z
## [1] 1.754301
This means that 100 is 1.75 standard deviations above the mean. Recall from earlier lectures that the probability of being more than 1 standard deviation above the mean is about 16% and the probability of being more than 2 standard deviations above the mean is about 2.5%. Therefore we already know that our answer will be between 2.5% and 16%, but closer to 2.5%.
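We can confirm those benchmark probabilities directly with pnorm() on the standard normal scale:

```r
# P(Z > 1): about 16% (half of the ~32% lying outside +/- 1 sd)
1 - pnorm(1)
## [1] 0.1586553

# P(Z > 2): about 2.3%, commonly rounded to 2.5%
1 - pnorm(2)
## [1] 0.02275013
```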
Now prepare to have your mind blown:
1-pnorm(z)
## [1] 0.03968944
If we apply pnorm() to our normalized value, we get the same answer! That’s because what we really did was:
1-pnorm(z,mean=0, sd=1)
## [1] 0.03968944
Remember that mean=0 and sd=1 are the default values. What we did with the z-score was to transform our variable into one with a mean of zero and a standard deviation of 1.
Why go to all this trouble just to get the same answer? I offer two reasons:
1. Old-school statisticians were taught to do things the second way. That’s because before computers this was a hard problem to solve. You could only look up the z-score in a table in the appendix of your textbook. With computers, this is no longer necessary.
2. When you have many different variables on very different scales (imagine ozone, temperature, particulate matter, wind speed, etc.) it’s hard to compare the variables. Standardization puts them all on the same scale so you can. A z-score of 1.75 for ozone is the same as a z-score of 1.75 for temperature, in terms of ‘how unusual is this’. The next time someone accuses you of trying to compare apples and oranges, tell them that’s totally fine as long as you standardize them first.
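As an illustration of that second point, here is a sketch using two other variables from airquality. Once each variable is standardized, its values are in standard-deviation units relative to its own mean, so they can be compared directly:

```r
# Three variables on very different original scales
z_ozone <- scale(na.omit(airquality$Ozone))  # parts per billion
z_temp  <- scale(airquality$Temp)            # degrees Fahrenheit
z_wind  <- scale(airquality$Wind)            # miles per hour

# After standardizing, a value of +1.75 means the same thing in each:
# 1.75 standard deviations above that variable's own mean
max(z_temp)  # how unusual was the hottest day?
max(z_wind)  # how unusual was the windiest day?
```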