Histograms
Histograms are a type of bar chart in which the x-axis is a continuous variable divided into evenly spaced bins and the y-axis shows the count of observations in each bin. In this example we have a file of 1,000 birth weights from an old study.
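Before loading the real file, here is a minimal sketch of what "bins and counts" means, using base R's cut() and table(). The ten weights and the 400 g bin width are made up for illustration; the point is that each bar height is the number of values that fall in that bin.

```r
# Made-up weights in grams, for illustration only (not the study data)
x <- c(2500, 2900, 3100, 3200, 3300, 3400, 3600, 3900, 4100, 4400)

# Five evenly spaced bins of width 400 g covering 2500 to 4500
breaks <- seq(2500, 4500, by = 400)

# Assign each value to a bin, then count the values per bin;
# these counts are the bar heights a histogram would draw
bins <- cut(x, breaks = breaks, include.lowest = TRUE)
counts <- table(bins)
print(counts)
```

The counts always add up to the number of observations, whatever bin width you choose.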
library(ggplot2)
library(dplyr)
bw <- read.csv("C:/epi551/old/BIRTHWEIGHT.csv", header=TRUE)
#we see there are only 2 fields - id number and birthweight in ounces
summary(bw)
## ï..id weight
## Min. : 0.0 Min. : 8.00
## 1st Qu.:249.8 1st Qu.: 97.75
## Median :499.5 Median :112.00
## Mean :499.5 Mean :106.86
## 3rd Qu.:749.2 3rd Qu.:125.00
## Max. :999.0 Max. :198.00
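The odd first column name in the output above (ï..id) is what you typically see when a CSV file begins with a UTF-8 byte order mark. That cause is an assumption here, since it has not been confirmed for this file. If it applies, one option is to have read.csv strip the mark with fileEncoding="UTF-8-BOM". The sketch below builds a small made-up file in a temporary location rather than touching the real one.

```r
# Write a tiny CSV that starts with a UTF-8 byte order mark (EF BB BF);
# the file and its two rows are made up for illustration
tmp <- tempfile(fileext = ".csv")
con <- file(tmp, open = "wb")
writeBin(as.raw(c(0xEF, 0xBB, 0xBF)), con)
writeBin(charToRaw("id,weight\n0,112\n1,97\n"), con)
close(con)

# Ask read.csv to strip the byte order mark while reading
demo <- read.csv(tmp, header = TRUE, fileEncoding = "UTF-8-BOM")
print(colnames(demo))
unlink(tmp)
```

Renaming the column after the fact works just as well; this only saves a step when you know the file has the mark.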
#Converting ounces to grams with a new field (could have also used mutate)
bw$grams <- bw$weight * 28.35
#The header of the first field came in strangely - this happens sometimes
#fixing it here
colnames(bw)[1] <- "id"
#And now a ggplot statement to create a histogram
ggplot(data=bw, aes(grams)) +
  geom_histogram()
The histogram shows a bell-shaped curve, but with a strange spike at the low end. This deserves some investigation.
low_bw <- filter(bw, grams<750)
head(low_bw,20)
## id weight grams
## 1 17 13 368.55
## 2 37 15 425.25
## 3 57 11 311.85
## 4 77 13 368.55
## 5 97 11 311.85
## 6 117 8 226.80
## 7 137 14 396.90
## 8 157 9 255.15
## 9 177 13 368.55
## 10 197 11 311.85
## 11 217 11 311.85
## 12 237 12 340.20
## 13 257 15 425.25
## 14 277 11 311.85
## 15 297 10 283.50
## 16 317 11 311.85
## 17 337 11 311.85
## 18 357 12 340.20
## 19 377 9 255.15
## 20 397 12 340.20
Each of these records has a value between 8 and 17 in the ounces column. A baby this small could not survive, so these weights may have been recorded in pounds, or they may be the ounces part of a pounds-and-ounces weight with the pounds missing. You would want to check with the data provider or, if that is not possible (as is the case here), drop these records.
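Before dropping anything, it can help to flag the suspect records and count them. The sketch below uses a small made-up data frame in the same shape as bw and a hypothetical column name, implausible. It also shows the mutate version of the ounces-to-grams conversion.

```r
library(dplyr)

# Toy data in the same shape as bw (id, weight in ounces); values are made up
toy <- data.frame(id = 1:6, weight = c(13, 112, 9, 125, 98, 15))

toy <- toy %>%
  mutate(
    grams = weight * 28.35,     # same ounce-to-gram conversion as above
    implausible = grams < 750   # same cutoff used to build low_bw
  )

# How many records are flagged, and what range of ounces do they span?
flagged <- toy %>%
  filter(implausible) %>%
  summarise(n = n(), min_oz = min(weight), max_oz = max(weight))
print(flagged)
```

Keeping a flag column instead of deleting rows right away lets you report how many records were excluded and why.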
One of the useful parameters for a histogram is the number of bins. The default is 30, but you can choose more or fewer, as seen here:
#dropping the implausibly low weights found above
bw <- filter(bw, grams>750)
#fewer, wider bins
ggplot(data=bw, aes(grams)) +
  geom_histogram(bins=10)
#more, narrower bins
ggplot(data=bw, aes(grams)) +
  geom_histogram(bins=50)
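Instead of choosing the number of bins, you can fix the width of each bin with the binwidth argument of geom_histogram, which is often easier to explain (for example, "each bar covers 250 grams"). The sketch below uses simulated weights, not the study data; the mean, standard deviation, and 250 g width are made-up values for illustration.

```r
library(ggplot2)

# Simulated weights in grams, for illustration only (not the study data)
set.seed(1)
sim <- data.frame(grams = rnorm(1000, mean = 3200, sd = 500))

# Fix the bin width at 250 g instead of fixing the number of bins;
# boundary = 0 aligns the bin edges to multiples of 250
p <- ggplot(data = sim, aes(grams)) +
  geom_histogram(binwidth = 250, boundary = 0)
print(p)

# Inspect the bins ggplot2 actually computed
built <- ggplot_build(p)$data[[1]]
unique(round(built$xmax - built$xmin))
```

With binwidth set, the number of bars follows from the range of the data, so two plots of different datasets stay comparable bar for bar.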