15 October 2017
In the old days, “data mining” used to have a bad reputation because “if you torture data for long enough, they will confess to anything.” Although it is fairly easy to lie with statistics, I would like to point out that it is much easier to lie without them! We have come a long way in data science, and yet there is still lots and lots of ground to cover. One problem that is fairly well understood by now, though, is “overfitting” of models. Not least because so many colleagues (me included) have been stung by it!
Overfitting of models occurs when idiosyncrasies in the training data become part of the model you use in production. Consequently, the final model winds up performing worse than you expected based on its “fit” to the training data. In extreme cases, these models may even perform worse than using no model at all… Predictive modeling can be an amazing tool, but if you point the gun down, it is eminently possible to shoot yourself in the foot. When overfitting causes your model to suggest misleading relations in the data, and you find out only after deployment, money flies out the window. Not only does this damage business objectives, but your stakeholders will lose confidence in the value data science can offer.
In my experience, there are two common reasons why overfitting occurs. One is data that are (too) sparse; the other is leakers (Berry & Linoff, 2004), sometimes called anachronistic variables (Pyle, 2003). These are completely different and unrelated causes of overfitting: the “symptoms” show up very differently, and mitigating the risk for each requires (very) different measures. Some colleagues might not consider leakers a cause of overfitting at all, and that’s a valid perspective (too).
Leakers are “predictive” variables that are in some way causally related to the target variable you are trying to predict. The causal direction, however, runs the wrong way around: they are the result of, rather than a cause of, whatever your target variable represents. Leakers are nasty little creatures that wreak havoc on your project. Because the same leaker variable is present in both the training and test data, no amount of cross-validation between subsamples of your model set will ever surface them. It’s not until you deploy a model with leakers that predictive accuracy plummets, and you as a data scientist wind up with egg on your face. If there is any way to identify leakers up front, it is that they (sometimes) show a conspicuously high correlation with the target variable, and possibly a dramatically higher (univariate) association than any of the other candidate predictors.
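To make that screening heuristic concrete, here is a minimal sketch in pure Python on synthetic data: rank candidate predictors by the absolute value of their univariate correlation with the target and eyeball the outliers. The variable names (`account_closed_flag` and friends) are invented for illustration; the leaker here is, by construction, a near-copy of the target recorded after the fact.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

random.seed(42)
n = 1000
target = [random.randint(0, 1) for _ in range(n)]  # binary outcome

candidates = {
    # Ordinary predictors: genuinely but weakly related to the target.
    "tenure": [t + random.gauss(0, 3) for t in target],
    "spend":  [0.5 * t + random.gauss(0, 2) for t in target],
    # A leaker: recorded *after* the outcome, so it mirrors the target
    # almost perfectly (hypothetical example variable).
    "account_closed_flag": [t if random.random() < 0.98 else 1 - t
                            for t in target],
}

scores = {name: abs(pearson(vals, target))
          for name, vals in candidates.items()}
for name, r in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:22s} |r| = {r:.2f}")
```

The leaker dominates the ranking by a wide margin. A conspicuous gap like this is not proof of leakage, but it is exactly the kind of “too good to be true” signal worth investigating before the variable goes anywhere near a model.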
Overfitting due to data being too sparse is a different beast. Although we live in the age of Big Data, the reality is that in many (if not most) modeling projects you predict an unbalanced target variable, and the minority class typically has too few records to be truly representative of the underlying population distribution. For example, when predicting fraud, the number of fraudulent transactions is much lower than the number of legitimate transactions. When you predict purchases from a webshop, the majority of sessions do not lead to a sale. Most credit card owners do not default; in fact, very few do. In all these cases, the statistical representativeness of the minority class (fraud, online sales, credit default) may be insufficient to build a robust and reliable model. The models that can be built nonetheless offer legitimate business value, but you need to be careful, because you are skating on thin ice.
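It is worth actually counting how thin that ice is before modeling. A minimal sketch, using simulated fraud labels (the 0.5% fraud rate and the 20,000-transaction sample are made-up numbers for illustration): the headline data set may look large, but the minority class is what your model has to learn from, and a split shrinks it further.

```python
import random
from collections import Counter

random.seed(0)
# Simulated transaction labels: roughly 0.5% fraud, the rest legitimate.
labels = ["fraud" if random.random() < 0.005 else "legit"
          for _ in range(20_000)]

counts = Counter(labels)
minority, n_minority = min(counts.items(), key=lambda kv: kv[1])
print(counts)
print(f"Minority class '{minority}': {n_minority} records "
      f"({n_minority / len(labels):.2%} of the data)")

# After a 50/25/25 train/test/validation split, the minority class thins
# out even further -- this is the headcount the model actually learns from.
print(f"Expected minority records in a 50% training split: "
      f"{n_minority // 2}")
```

Twenty thousand rows sounds comfortable; a few dozen fraud cases in the training partition does not. That back-of-the-envelope count is often all it takes to know you should reach for stratified sampling and conservative validation.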
When an algorithm “learns” patterns from a training data set, and those patterns don’t generalize to the population at large, you are overfitting your model to idiosyncrasies of this particular data set. Those patterns are neither a reliable nor a valid reflection of what goes on in the population from which the data were drawn. For this reason, I consider leakers a similar problem, although their behavior is rather different from overfitting due to a lack of data (specifically in the minority class). Leakers work “perfectly” across training and test data, but fail spectacularly once deployed out in the wild. Overfitting due to a shortage of data, by contrast, is usually apparent from cross-validation between subsamples of your model set (training, test, and validation data): the predictive accuracy in your training data is (much) higher than in your test data. That variance should be cause for concern.
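That symptom is easy to reproduce. A minimal sketch in pure Python, on synthetic data with no learnable signal at all: a 1-nearest-neighbour “model” simply memorizes a small, noisy training set, scores perfectly on the data it was fit to, and drops toward chance on held-out data. (The 1-NN memorizer stands in for any overly flexible model; it is not anyone’s recommended algorithm here.)

```python
import random

random.seed(1)

def make_data(n):
    # Feature is pure noise: the label carries no learnable signal.
    return [(random.random(), random.randint(0, 1)) for _ in range(n)]

train, test = make_data(50), make_data(500)

def predict_1nn(x):
    # 1-nearest-neighbour: return the label of the closest training point.
    return min(train, key=lambda p: abs(p[0] - x))[1]

def accuracy(data):
    return sum(predict_1nn(x) == y for x, y in data) / len(data)

print(f"training accuracy: {accuracy(train):.2f}")  # memorized: 1.00
print(f"test accuracy:     {accuracy(test):.2f}")   # near chance
```

The gap between the two numbers is the cross-validation warning sign described above: perfect in-sample fit, coin-flip performance out of sample.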
The reality for many data scientists is that the data at hand, in particular the minority class you are predicting, are almost always in short supply. You would like to have more data, but they simply aren’t available. Still, there may be excellent business value in building the best possible model from these data, as long as you safeguard against overfitting. Happy dredging!