The normal distribution*

The normal distribution is the most important distribution of all. It is used for continuous variables like weight, blood pressure, and disease rates: just about anything that can take on fractional values. Even when the true distribution is unknown (which in real-world situations is most of the time), the normal distribution is often assumed! The shape of the normal distribution is the familiar bell-shaped curve. Most of the measurements are clustered in the middle, with fewer and fewer the farther you move in either direction.

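For example, if I reported that a quiz had a mean of 75 and standard deviation of 10, you would probably picture a normal distribution, and I would be implicitly inviting you to. Strictly speaking, a standard deviation can be computed for any distribution, but a mean and a standard deviation together are exactly what it takes to describe a normal distribution completely, so reporting only those two numbers suggests that the bell-shaped curve is a reasonable summary.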

Right away we know the scores cannot be truly normally distributed, because a normally distributed variable can take on any value between negative infinity and positive infinity. You could not have gotten a score of -45, or 158, or even 83.43652, because there were no fractional points awarded. Nevertheless, if the scores look like they roughly follow a bell-shaped curve, then that is close enough to a normal distribution that we typically ignore those details.

I’ve pasted a picture of a bell-shaped curve below. The probability of an outcome is given as a function of the number of standard deviations from the mean. Thus the probability that an outcome is within one standard deviation of the mean is about 68% (15% + 19.1% + 19.1% + 15%, the four half-standard-deviation bands nearest the mean). Within two standard deviations is about 95%, and within three standard deviations is about 99.7%. Remember these three values: 68, 95, 99.7. You’ll see them over and over.
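If you would like to check the 68, 95, 99.7 values yourself, here is a minimal sketch in Python. It assumes Python 3.8 or later, where the standard library provides `statistics.NormalDist`; the helper name `prob_within` and the variable names are my own, chosen only for illustration.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
z = NormalDist(mu=0.0, sigma=1.0)

def prob_within(k: float) -> float:
    """Probability that an outcome lies within k standard deviations of the mean."""
    return z.cdf(k) - z.cdf(-k)

# The three values worth remembering.
within = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in within.items():
    print(f"within {k} standard deviation(s): {p:.1%}")

# The half-standard-deviation bands on one side of the mean.
inner_band = z.cdf(0.5) - z.cdf(0.0)  # from the mean out to 0.5 standard deviations
outer_band = z.cdf(1.0) - z.cdf(0.5)  # from 0.5 out to 1 standard deviation
print(f"inner band: {inner_band:.1%}, outer band: {outer_band:.1%}")
```

Because the curve is symmetric, doubling the sum of the two bands gives the probability of landing within one standard deviation.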

In the class grade example, suppose there are 30 students in the class. Then about 20 of them would have gotten a score between 65 and 85 (within one standard deviation). About 28 or 29 would have gotten between 55 and 95, and all 30 should have gotten between 45 and 105. Of course, these are only expectations. If the observed values are too far away from the expected values, that’s when we can begin to question whether the normal distribution is appropriate.
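The expected head counts for the class can be sketched the same way. This assumes the quiz scores are treated as exactly normal with mean 75 and standard deviation 10, which we already know is only an approximation; `expected_count` is a hypothetical helper name, and `statistics.NormalDist` requires Python 3.8 or later.

```python
from statistics import NormalDist

# Quiz scores modeled as normal with the mean and standard deviation from the example.
scores = NormalDist(mu=75, sigma=10)
n_students = 30

def expected_count(low: float, high: float) -> float:
    """Expected number of students scoring between low and high under the normal model."""
    return n_students * (scores.cdf(high) - scores.cdf(low))

one_sd = expected_count(65, 85)     # within one standard deviation
two_sd = expected_count(55, 95)     # within two standard deviations
three_sd = expected_count(45, 105)  # within three standard deviations

for label, count in (("65 to 85", one_sd), ("55 to 95", two_sd), ("45 to 105", three_sd)):
    print(f"expected students scoring {label}: {count:.1f}")
```

The results are fractional because they are expectations, not counts of actual students. A real class of 30 will wobble around them.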

Here is the equation for the curve:

$$Y = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

It’s a pretty complicated equation, even though it has pieces that should look familiar. The Greek letters in there represent the mean (μ) and standard deviation (σ). You have the base of the natural logarithm (e) and that special number which represents the ratio of the circumference to the diameter of a circle (π). You don’t need to memorize this; just admire it for a moment and consider how many features of the human and natural worlds it describes.
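To see that the equation really is just arithmetic, here is a small sketch that evaluates it piece by piece and compares the result with Python's built-in `statistics.NormalDist.pdf` (Python 3.8 or later). The function name `normal_density` is my own, and the mean of 75 and standard deviation of 10 are borrowed from the quiz example.

```python
import math
from statistics import NormalDist

def normal_density(x: float, mu: float, sigma: float) -> float:
    """Height of the normal curve at x, computed directly from the equation."""
    coefficient = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

mu, sigma = 75.0, 10.0  # quiz example: mean and standard deviation
library = NormalDist(mu, sigma)

for x in (55.0, 65.0, 75.0, 85.0, 95.0):
    by_hand = normal_density(x, mu, sigma)
    print(f"x = {x:5.1f}  by hand = {by_hand:.6f}  library = {library.pdf(x):.6f}")
```

Note that Y is the height of the curve, not a probability. Probabilities come from the area under the curve between two values, which is what the 68, 95, 99.7 figures describe.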

When I was an undergraduate, we did not work with this equation directly either. Instead, our statistics textbooks came with pages of printed tables you would use to look up solutions. A traditional textbook still has them. Part 2 of today’s post will illustrate how to calculate properties of the normal distribution using R.