Tom Breur

17 August 2016

Predictive modeling is an exciting, fun task. As I’ve written before (“Predicting is hard”: https://tombreur.wordpress.com/2016/07/15/predicting-is-hard/), when we talk about “predictive modeling”, I loosely use the term the way it usually gets used. We take a sample of historical records for which the outcome is known, for example a randomly targeted direct mailing or the set of all accepted credit card applicants, and then append the outcome (“target field”) we want to predict for new cases coming in. For our direct mail campaign this might be a Boolean flagging whether the prospect responded yes/no, or the amount donated to a charity (a metric variable), etc. For credit cards you are probably going to use an agreed-upon definition for “default”, such as more than 3 months in arrears.

In the next step you set aside a holdout sample, and you begin the fun analytic work: determining which set of input variables, describing the “state” the prospect was in at the *beginning* of the time window, best predicts the output variable as measured *afterwards*. This time shift between input and output variables is crucial to avoid winding up with a bad (invalid) model. Variables from the input array that do not precede the measurement period are called “leakers” (Berry & Linoff, 2011) or “anachronistic variables” (Pyle, 2003); both terms describe the same problem. If you have worked in this field for long enough (like me), the odds are you have been bitten by them, and more than once!
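A minimal sketch of how one might screen for leakers, assuming every input field can be tagged with the date its value became known. The field names, dates, and the `split_leakers` helper are all hypothetical and only illustrate the time-shift rule; they do not come from any particular tool.

```python
from datetime import date


def split_leakers(field_known_on, window_start):
    """Separate usable inputs from leakers.

    field_known_on: dict mapping a field name to the date its value became known.
    window_start:   first day of the period in which the outcome is measured.
    A field is usable only if it was known strictly before window_start.
    """
    usable = sorted(n for n, known in field_known_on.items() if known < window_start)
    leakers = sorted(n for n, known in field_known_on.items() if known >= window_start)
    return usable, leakers


# Hypothetical direct mail example: the mailing goes out on 1 March 2016,
# and responses are measured from that day onwards.
window_start = date(2016, 3, 1)
fields = {
    "age": date(2016, 1, 15),
    "prior_donations": date(2016, 2, 28),
    # Only populated after a prospect has responded, so it "knows" the outcome.
    "thank_you_letter_sent": date(2016, 3, 20),
}

usable, leakers = split_leakers(fields, window_start)
print("usable inputs:", usable)
print("leakers:", leakers)
```

In practice the hard part is knowing when a field was really populated, which is exactly how leakers slip in unnoticed.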

The analytic relation between the selected set of input variables and the output variable is what we call “the model”. It’s a mathematical equation of sorts that we’ll apply to new records coming in (prospects for a future direct mail campaign, new credit card applicants, etc.) to determine their similarity to records that were flagged “1” in the training dataset. The model can be rescaled to generate a value between 0 and 1 that represents the similarity with “1” records in the training set. Obviously a high score is a good thing for the direct marketing campaign, where “1” represents a response. For the credit card data, a “1” means the applicant has defaulted, so there a high score is undesirable, but you get the idea. The process works similarly for a metric output variable (a higher predicted donation is better, of course), with slightly altered math.
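One common way to get such a 0-to-1 score is the logistic function applied to a weighted sum of the inputs. The sketch below assumes that form purely for illustration; the text above does not prescribe an algorithm, and the weights, intercept, and field names are made up.

```python
import math


def score(record, weights, intercept):
    """Return a value between 0 and 1 for one record.

    The weighted sum can be any real number; the logistic function
    squeezes it into the (0, 1) range.
    """
    linear = intercept + sum(weights[name] * record[name] for name in weights)
    return 1.0 / (1.0 + math.exp(-linear))


# Hypothetical weights, as if they had been fitted on the training sample.
weights = {"prior_donations": 0.8, "months_since_last_gift": -0.1}
intercept = -2.0

loyal_donor = {"prior_donations": 5, "months_since_last_gift": 2}
lapsed_donor = {"prior_donations": 1, "months_since_last_gift": 24}

print(round(score(loyal_donor, weights, intercept), 3))
print(round(score(lapsed_donor, weights, intercept), 3))
```

The loyal donor scores higher than the lapsed one, which is all the ranking needs: the scores are used to order new records from most to least similar to the “1” cases.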

As you develop your model, you assess its accuracy in making predictions with a measure called “lift”, which can be compared across the different algorithms you may want to try. A much more descriptive term would be “cumulative gains curve”, but unfortunately the term “lift” seems to be what everyone has been using, so I’ll stick to that, despite my preference for the richer description that is contained in graphs like these:

Especially with data mining suites, you’ll have a wide selection of possible techniques at your disposal, each with its own pros and cons, and each with its own idiosyncratic measure for accuracy. You need to calculate the lift for every model so that you can make a valid comparison across all your brainchildren.
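As a rough sketch of that calculation: rank the holdout records by model score, walk down the ranking, and at each depth compare the share of all “1” outcomes captured with the share of records targeted. The function name, the binning, and the toy scores below are assumptions for illustration only.

```python
def cumulative_gains(scores, outcomes, n_bins=10):
    """Return a list of (fraction_targeted, fraction_captured, lift) tuples.

    scores:   model scores, higher means more likely to be a "1".
    outcomes: the actual 0/1 outcomes for the same holdout records.
    Lift at a given depth is fraction_captured / fraction_targeted,
    so a model that ranks at random has a lift of about 1.
    """
    n = len(scores)
    if n != len(outcomes):
        raise ValueError("scores and outcomes must have the same length")
    if n_bins < 1 or n_bins > n:
        raise ValueError("n_bins must be between 1 and the number of records")
    # Outcomes ordered from the highest score to the lowest.
    ranked = [y for _, y in sorted(zip(scores, outcomes), key=lambda p: p[0], reverse=True)]
    total = sum(ranked)
    if total == 0:
        raise ValueError("the holdout sample contains no positive outcomes")
    rows = []
    for b in range(1, n_bins + 1):
        cutoff = round(n * b / n_bins)
        fraction_targeted = cutoff / n
        fraction_captured = sum(ranked[:cutoff]) / total
        rows.append((fraction_targeted, fraction_captured, fraction_captured / fraction_targeted))
    return rows


# Toy holdout sample: 10 records, 3 responders.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
outcomes = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

rows = cumulative_gains(scores, outcomes, n_bins=5)
for targeted, captured, lift in rows:
    print(f"top {targeted:.0%}: {captured:.0%} of responders, lift {lift:.2f}")
```

Because the calculation only needs a ranking and the actual outcomes, it works the same way whichever technique produced the scores, which is what makes the comparison across models valid.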

As you work to improve these models, the lift goes up. Spend some more time on it, and you are likely to find even more ways to boost the lift of your model. And then some. As usual, the incremental gains tend to go down, and you want to ensure that you calculate what those gains are worth to the business. Here is an interesting quandary that I have learned in the trenches: most of the time, the gains to the business from improving the model will be (significantly) higher than the marginal costs of working to improve the model. My salary has always been much lower than the improved bottom-line revenue from a more accurate model. And I’d still argue you need to call it quits (way) before the gains drop to the level of the costs! Not because I would be “wasting money”, but because the odds are that besides this model, there are lots and lots of other opportunities to build models, and those will be worth more than the gains from tweaking the model at hand. Tempting as it may be to keep fiddling with a pet project that I am enjoying so much…
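The argument is an opportunity-cost comparison, which a few lines can make concrete. All figures below are invented for illustration, and the `next_best_use_of_time` helper is hypothetical: it simply picks whichever option has the highest expected gain for the next period of work.

```python
def next_best_use_of_time(tweak_gain, new_model_gains):
    """Pick the option with the highest expected gain for the next work period.

    tweak_gain:      expected extra value from one more period on the current model.
    new_model_gains: dict mapping another opportunity to the expected value of
                     spending that same period on a first version of it.
    """
    options = {"keep tweaking": tweak_gain}
    options.update(new_model_gains)
    return max(options, key=options.get)


# Made-up numbers: the gain from each extra week of tweaking shrinks,
# yet every one of those gains is still far above the weekly cost.
weekly_cost = 3_000
tweak_gains = [50_000, 30_000, 15_000]
alternatives = {"next model on the backlog": 40_000}

decisions = [next_best_use_of_time(gain, alternatives) for gain in tweak_gains]
print(decisions)
print(all(gain > weekly_cost for gain in tweak_gains))
```

With these numbers a cost-versus-gain test would say “keep tweaking” in all three weeks, while the opportunity-cost test switches to the next model after the first week.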

Given where data science stands today, I’ll venture to say that applying predictive modeling to foster stronger business results is nowhere near the point of maturity. There are still so many potential applications, and we lack both good enough data and enough skilled data scientists to take on all these challenges! In every business I have worked in so far, there were countless new opportunities to show value with predictions, and to help transform the business. Getting to a good (enough) model quickly is hugely valuable, so that you can (re)consider where to proceed: tweaking this model, or starting a new one. You can always improve the existing model, if only a little bit, by spending more time on it. Whether that is worth your while isn’t the “real” business question to ask. The real question is where your contribution will add the most, and asking it is the hallmark of a mature model builder, imho. So when is the model “done”? When there’s a more profitable one to develop!