One of the well-known data sets used to teach the principles of predictive modeling is the Boston housing data set. It contains 506 observations and 14 variables related to housing and environmental conditions in census tracts of Boston, Massachusetts and its suburbs in the 1970s. A common teaching exercise is to predict median house value from the other 13 variables. The data are included in the mlbench package and described here
library(mlbench)
data(BostonHousing2)
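A quick check of the dimensions is a good habit. Note that BostonHousing2 is the corrected version of the data set: in addition to the 14 classic variables it carries the town name, census tract, coordinates, and a corrected median value (cmedv), so it has 19 columns in all; the models below use only the original variables.
dim(BostonHousing2)  # 506 rows; 19 columns (the corrected data adds town, tract, lon, lat, cmedv)
str(BostonHousing2)  # variable names and types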
These data are used extensively in An Introduction to Statistical Learning by Gareth James et al. Here is one such example, a simple linear model to predict median house value (medv) from percent of households with low socioeconomic status (lstat).
lm.fit <- lm(medv ~ lstat, data=BostonHousing2)
summary(lm.fit)
##
## Call:
## lm(formula = medv ~ lstat, data = BostonHousing2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.168 -3.990 -1.318 2.034 24.500
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.55384 0.56263 61.41 <2e-16 ***
## lstat -0.95005 0.03873 -24.53 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.216 on 504 degrees of freedom
## Multiple R-squared: 0.5441, Adjusted R-squared: 0.5432
## F-statistic: 601.6 on 1 and 504 DF, p-value: < 2.2e-16
The summary output matches what appears in the book on page 111. About 55% of the variation in housing price (as measured by R-squared) is explained by just this one variable. We might be tempted to think that including all of the variables will give us the best model.
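Before fitting that bigger model, it is worth a quick look at the simple fit itself. This is just a base-graphics sketch of my own: it plots medv against lstat with the fitted line, and the visible curvature in the scatter is why a squared lstat term will show up in a later model.
plot(medv ~ lstat, data = BostonHousing2,
     xlab = "Percent lower-status households (lstat)",
     ylab = "Median home value in $1000s (medv)")
abline(lm.fit, col = "red", lwd = 2)  # fitted line from the simple one-variable model
With that picture in mind, here is the full model.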
lm.fit2 <- lm(medv ~ crim + zn + indus + chas + nox + rm + age + dis + rad + tax + ptratio + b + lstat, data=BostonHousing2)
summary(lm.fit2)
##
## Call:
## lm(formula = medv ~ crim + zn + indus + chas + nox + rm + age +
## dis + rad + tax + ptratio + b + lstat, data = BostonHousing2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.595 -2.730 -0.518 1.777 26.199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.646e+01 5.103e+00 7.144 3.28e-12 ***
## crim -1.080e-01 3.286e-02 -3.287 0.001087 **
## zn 4.642e-02 1.373e-02 3.382 0.000778 ***
## indus 2.056e-02 6.150e-02 0.334 0.738288
## chas1 2.687e+00 8.616e-01 3.118 0.001925 **
## nox -1.777e+01 3.820e+00 -4.651 4.25e-06 ***
## rm 3.810e+00 4.179e-01 9.116 < 2e-16 ***
## age 6.922e-04 1.321e-02 0.052 0.958229
## dis -1.476e+00 1.995e-01 -7.398 6.01e-13 ***
## rad 3.060e-01 6.635e-02 4.613 5.07e-06 ***
## tax -1.233e-02 3.760e-03 -3.280 0.001112 **
## ptratio -9.527e-01 1.308e-01 -7.283 1.31e-12 ***
## b 9.312e-03 2.686e-03 3.467 0.000573 ***
## lstat -5.248e-01 5.072e-02 -10.347 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.745 on 492 degrees of freedom
## Multiple R-squared: 0.7406, Adjusted R-squared: 0.7338
## F-statistic: 108.1 on 13 and 492 DF, p-value: < 2.2e-16
While this bumps the R-squared up to 0.74, that is still a very long way from 1.00; in fact, it is closer to 0.55 than to 1.00. Put another way, low SES explains more of the variation on its own (0.54) than the other 12 variables add on top of it (about 0.20). This, in my experience, is fairly typical: a few variables explain a lot of the variation, a bunch more explain a little bit, and a large chunk remains unexplained entirely.
Here is a compromise model that keeps some of the more useful variables and drops some of the less useful ones. It uses low SES (plus its square, to capture non-linearity), average rooms per dwelling, weighted distance to employment centers, and crime rate. It does not matter whether this is the best model, just that it is a reasonably good one. Its R-squared is also 0.74, differing only in the third decimal place from the model containing all variables. All variables here are very highly significant.
lm.fit3 <- lm(medv ~ lstat + I(lstat^2) + rm + dis + crim, data=BostonHousing2)
summary(lm.fit3)
##
## Call:
## lm(formula = medv ~ lstat + I(lstat^2) + rm + dis + crim, data = BostonHousing2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -19.5418 -3.0427 -0.4683 2.0700 25.3554
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 19.016901 3.203534 5.936 5.46e-09 ***
## lstat -2.059830 0.120876 -17.041 < 2e-16 ***
## I(lstat^2) 0.041461 0.003332 12.445 < 2e-16 ***
## rm 3.907517 0.393341 9.934 < 2e-16 ***
## dis -0.823267 0.120274 -6.845 2.25e-11 ***
## crim -0.166573 0.028238 -5.899 6.75e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.725 on 500 degrees of freedom
## Multiple R-squared: 0.7386, Adjusted R-squared: 0.736
## F-statistic: 282.6 on 5 and 500 DF, p-value: < 2.2e-16
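To back up the claim that the compromise gives up almost nothing, here is a quick side-by-side comparison of the two models. The AIC line is just a rough overall check of my own; the models are not nested, so a formal F-test does not apply.
# R-squared and adjusted R-squared for the full model vs. the compromise model
c(full = summary(lm.fit2)$r.squared,     compromise = summary(lm.fit3)$r.squared)
c(full = summary(lm.fit2)$adj.r.squared, compromise = summary(lm.fit3)$adj.r.squared)
AIC(lm.fit2, lm.fit3)  # lower is better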
An R-squared of 0.74 implies that there are at least some predictions that are pretty far off. Let’s look at the largest residuals. First I attach the predictions to the data, then print out the observations corresponding to the largest prediction errors.
preds <- predict(lm.fit3)
BostonHousing2 <- cbind(BostonHousing2,preds)
BostonHousing2[which.max(lm.fit3$residuals),]
## town tract lon lat medv cmedv crim zn indus chas
## 373 Boston Beacon Hill 203 -71.0397 42.2182 50 50 8.26725 0 18.1 1
## nox rm age dis rad tax ptratio b lstat preds
## 373 0.668 5.875 89.6 1.1296 24 666 20.2 347.88 8.88 24.64458
BostonHousing2[which.min(lm.fit3$residuals),]
## town tract lon lat medv cmedv crim zn indus chas
## 365 Boston Back Bay 101 -71.059 42.2098 21.9 21.9 3.47428 0 18.1 1
## nox rm age dis rad tax ptratio b lstat preds
## 365 0.718 8.78 82.9 1.9047 24 666 20.2 354.55 5.29 41.44185
In Beacon Hill, the median housing price was predicted to be $24,644 (keep in mind this is in 1970 dollars) but the reported value was $50,000; and since the census top-codes this variable at $50,000, the true value was likely even higher. In Back Bay, conversely, the median housing price was predicted to be $41,442 but was in reality $21,900. These are pretty serious errors, off by roughly a factor of two. You obviously would not want to use a model like this to decide how to price your own house for sale. We can also see that the model is accurate to within $5,000 about 78% of the time, that, on average, the prediction is off by $3,441, and that the median relative error is 13%.
BostonHousing2$diff <- abs(BostonHousing2$medv - BostonHousing2$preds)  # absolute prediction error, in $1000s
mean(BostonHousing2$diff <= 5)  # proportion of tracts predicted to within $5,000
## [1] 0.7826087
mean(BostonHousing2$diff)  # average absolute error, in $1000s
## [1] 3.441131
median(BostonHousing2$diff/BostonHousing2$medv)  # median relative error
## [1] 0.1298216
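The same information can be summarized a little more fully by looking at the whole distribution of relative errors; this sketch of mine will also be useful for the thought experiment at the end of the post. Negative values mean the reported price came in below its prediction.
# Signed relative error: negative means the observed value falls below its prediction
rel.err <- (BostonHousing2$medv - BostonHousing2$preds) / BostonHousing2$medv
quantile(rel.err, probs = c(0.05, 0.25, 0.50, 0.75, 0.95))
hist(rel.err, breaks = 30, main = "Relative prediction errors",
     xlab = "(observed - predicted) / observed")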
I think most statisticians would consider this a respectable model, suitable for many kinds of social science research and for use as a textbook example, even if you would never use it for any sort of real estate purposes.
This is, of course, what sites like Zillow are doing when they predict individual home values for the entire country and update them on a continuous basis. They use far more data than the Boston data set and a more sophisticated methodology than a linear regression with a handful of variables, and their results tend to be more accurate. For off-market homes, Zillow reports a median error of 7.5%, versus the 13% we saw in the Boston data. (For on-market homes the number is much smaller, about 2%, but that is because the asking price is known and most homes sell for close to their asking price, so it is a less interesting figure.)
Consider the implications of this: roughly speaking, a huge corporation that employs large numbers of data scientists is only able to cut the uncertainty of housing price predictions roughly in half compared to a simple model that might be assigned as a homework exercise in an introductory statistics course. Even Zillow can’t know everything. There will always be specific, particular factors influencing the price of a house that no model can capture. Suppose the house has asbestos in the basement, or orange shag carpet full of cigarette burns, or was just inherited by a distant relative who wants nothing more than to get rid of it as quickly as possible. None of these downward pressures on the price would be captured by any model.
Suppose you decided, though, that when a house sells for $10,000 less than its predicted value, that is not because of these unknowable factors but is instead a reflection of the quality of the real estate agent. Suppose you built a web site listing all of the real estate agents in some city along with their average sale price relative to their Zillow prediction. (I am pretty sure this would be possible with publicly accessible data.) Would it make sense to use this measure to choose a real estate agent?
Maybe. It is possible, even likely, that through a combination of charm and savvy and marketing there are agents who manage to get above the predicted price more often than others, and vice versa. But it could also have nothing to do with charm; these could simply be the agents who steer clear of orange shag carpet. If your house has orange shag carpet, a real estate agent is not going to save you, assuming she even wants to work with you.
I have no idea if this sort of thing happens in the real estate world, but I can say from firsthand experience that it happens in the public health world. In order to measure the completeness of disease reporting, certain government agencies have determined that when a model underpredicts (that is, when the reported cases exceed what the model says they should be), the model is inaccurate, but when a model overpredicts (the reported cases are lower than what the model says they should be), cases are being underreported and corrective action is called for. The idea that the model can be inaccurate in either direction, as with Beacon Hill and Back Bay, is not appreciated, which I find very strange. In fact, because a least-squares regression forces the residuals to balance out, for every Beacon Hill you would expect a Back Bay.
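That balance is easy to check directly: in an ordinary least-squares fit with an intercept, the residuals sum to zero by construction, so the big positive misses are necessarily offset by negative ones.
sum(residuals(lm.fit3))          # essentially zero, by construction
table(sign(residuals(lm.fit3)))  # how many observations sit above vs. below their predictions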
Here is another way to think about the problem. Suppose we have reason to think that the Boston housing data contains some typos or deliberate falsifications, and it is our job to determine which ones these are. This is exactly what is happening in the public health example above, where we have reason to believe that certain data values are severely underreported and should not be used in national statistics. One might also find an analogy in fraud detection, where the aim is to distinguish transactions that may be fraudulent from those that are merely strange. Suppose in the Boston data there is one erroneous data point, and suppose it is 25% lower than the correct value. What are the chances of spotting this error?
Pretty close to zero, I would say. Back Bay, which comes in about 50% lower than expected, might be your first guess, and it would almost certainly be wrong. Half of the data are above their predictions; take 25% off any of these and they will still be nowhere near as far below prediction as Back Bay. Another quarter of the data are below their predictions by up to 13% (recall that the median relative error is 13%); take 25% off these and they still do not reach Back Bay territory. There are only a handful of observations for which reducing the value by 25% would render them suspicious.
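That back-of-the-envelope argument can be turned into a small simulation. The sketch below (my own, with an arbitrary seed) picks one observation at random, shrinks its reported value by 25%, and asks where the altered point would rank if we flagged the observations sitting furthest below their predictions. I keep the original model fit rather than refitting, which is close enough for this purpose, and the answer will vary from run to run.
set.seed(1)
ratio <- BostonHousing2$medv / BostonHousing2$preds  # observed relative to predicted
i <- sample(nrow(BostonHousing2), 1)                 # the one "falsified" observation
ratio[i] <- 0.75 * ratio[i]                          # its reported value is now 25% too low
rank(ratio)[i]  # a rank of 1 would make it the single most suspicious-looking tract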
In my public health project, the average modeling error is in the Zillow range, around 7.5%, and I am being asked to identify the observations that are 5% to 10% deficient. I think the predictive accuracy of such an endeavor is quite low.