Tom Breur

18 November 2018

When we talk about “Predictive modeling”, there is an interesting paradox. As I pointed out in a previous blog post (“Predicting is hard”, 15 July 2016), forward-looking models rest on the (big!) assumption that the future will look like the past. Yet everybody knows that *if* there is one thing certain about the future, it’s that it will be different. We just don’t know how, exactly. And we never will, until the future has become the present…

Predictive modeling is a quantitative discipline, yet we don’t have any data on the future. We determine associations in the present by reflecting on the past (!), and then try to extrapolate those patterns to the future. There are two important problems with that. As I just mentioned, these “predictions” rest on an important assumption, one that we even know *not* to be true, that the future will look like the past. Secondly, our predictive model is based on, *and limited to*, a data set that we assembled for the purpose of building this model (the so-called “model set”, see “Mastering Data Mining” by Berry & Linoff, 1999).

This latter limitation implies that we can *only* predict values within the boundaries of a search space that is constrained by our model set. Any combinations of data points outside that hyperspace are not available, and hence we have zero empirical evidence those associations will hold. Regression models are a case in point: they render a mathematical *equation* that can easily be applied *outside* the range of data that was used to build the model.

Let me illustrate this problem with a silly example. Let’s say you have discovered there is a positive relationship between income and age. If you base your linear regression model on data from people between 20 and 65 years old, then you *might* determine that the expected income for someone 200 years old is around $150K. Not a very useful association. Obviously the model has limited validity outside the 20-65 range it was built on, and no validity at all for people older than, say, 150 years: we’ve never seen anyone that old, and therefore we have no data points to make predictions in that range. Yet you can plug *any* age into the linear equation. Regression models are pretty accommodating: they do not protest.
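The silly example above can be sketched in a few lines. The data is fabricated for illustration (a perfectly linear trend of $700 per year of age), not taken from any real survey:

```python
# A minimal sketch of the age/income extrapolation problem.
# The data and the trend are made-up illustrations.

def fit_line(xs, ys):
    """Ordinary least squares for a single predictor: y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

# Hypothetical model set: ages 20-65, income rising ~$700 per year of age.
ages = list(range(20, 66))
incomes = [30_000 + 700 * (age - 20) for age in ages]

a, b = fit_line(ages, incomes)

print(round(a + b * 40))    # → 44000, inside the observed range: plausible
print(round(a + b * 200))   # → 156000, age 200: the equation answers anyway
```

The equation is perfectly happy to produce roughly $150K for a 200-year-old; nothing in the fitted coefficients records that the model set stopped at age 65.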

Note that this same phenomenon played a big role in the collapse of financial markets in 2008. Models for mortgage securitization were built on limited available data. In fact, many analysts objected. But if the money is good, there is (almost) always someone you can find who is willing to build ‘your’ model… Then rates of unemployment and default started occurring that had never been observed when the models were built. We know the end result: a meltdown, because businesses kept relying on models outside their applicable range. All of a sudden, default rates and unemployment were climbing much higher than was reflected in the data initially used to build these securitization models.

Despite these limitations, I would argue that predictive modeling is the bread and butter of data science. Predicting (as opposed to “explaining”, see my earlier blog post on 25 August 2018) requires inferences about future instances that were not, and can never be observed. At least not until the prediction has turned into a description.

So how do we deal with this paradox?

First of all, a great start is to recognize “unsolvable” problems for what they are. Be upfront about possibilities and limitations. As a data scientist you can draw inferences from data that have been made available to you. Communicate clearly what assumptions you made when calculating predictions, i.e., that you assume the future will look like the past, within the available realm of observations. Be clear about which statements you can find support for in the facts and, at least as important, which statements *don’t* have support in the data. The late Jerry Weinberg taught me how incredibly powerful it can be to say: “I don’t know.” All it takes is some courage.

Secondly, there are two perspectives on this dilemma. Neither is “perfect”, by the way, as you should expect from an “unsolvable” paradox. You can either choose to classify the past and extrapolate it to the future, or, you can build models that are (also) based on historical data but that are geared towards revealing outcomes for combinations of input data that have not yet occurred. In this blog post I refer to this second type of predictive modeling as “scenario planning”, which has been rigorously developed in the realm of system dynamics.

These are the two qualitatively different approaches to “predicting the future” that I discern:

1 – “Predicting as classifying”

In Business Modeling and Data Mining (2003) Dorian Pyle points out that what colloquially gets referred to as “predictive modeling” is really a special case of classifying the past. It is crucially important that the array of input variables you use to predict is offset in time from the target variable. This “time shift” between the array of predictive variables and the target variable is explained very clearly in Berry & Linoff’s “Mastering Data Mining” (1999).

This approach to “predictive modeling” often gets referred to as Machine Learning in the data science community: a computer ‘learns’ the pattern of input variables in relation to the labeled output variable. The array of input variables is pulled from data sources that originate *earlier in time* than the target variable we are trying to predict. That offset is necessary (!) to avoid “Leakers” (Berry & Linoff, 1997 & 1999) or “Anachronistic variables” (Pyle, 2003) that appear to be *predicting*, but might in fact be the *result* of your target variable. This need to offset predictor variables from the target is what drives the requirement for temporal data modeling, commonly capturing all input variables using Type II SCDs (Slowly Changing Dimensions).
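The time shift can be made concrete with a toy model set. All field names, months, and customers below are made-up illustrations; the point is that features come strictly from periods *before* the target period:

```python
# Sketch of the time offset between predictors and target.
# monthly_snapshots and all field names are hypothetical.

TARGET_MONTH = "2018-06"
FEATURE_MONTHS = ["2018-03", "2018-04", "2018-05"]  # all strictly earlier

monthly_snapshots = {
    # customer_id -> {month -> spend}
    "cust_1": {"2018-03": 120, "2018-04": 90, "2018-05": 80, "2018-06": 0},
    "cust_2": {"2018-03": 40,  "2018-04": 45, "2018-05": 50, "2018-06": 55},
}
churned_in_target_month = {"cust_1": 1, "cust_2": 0}

model_set = []
for cust, spend in monthly_snapshots.items():
    features = [spend[m] for m in FEATURE_MONTHS]   # past only
    target = churned_in_target_month[cust]          # strictly later in time
    # Including spend[TARGET_MONTH] as a feature would be an anachronistic
    # variable: cust_1's zero spend in June is the *result* of churning,
    # not a predictor of it.
    model_set.append((features, target))

print(model_set)
```

Collapse that offset and you get a “leaker”: a model that looks spectacularly accurate in development and is useless at run time.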

This “Predicting as classifying” type of modeling seeks look-alikes for the people who responded in the past (bought a product, donated to a campaign, clicked through on a banner, etc.) and flags those records in the current database. The end result invariably is some code or analytic function that describes how values of the input variables relate to the odds of belonging to the target class (buy, donate, click through, etc.). Apart from measurement level (which obviously requires different algorithms), there is no substantial difference between categorical (buy or click through yes/no) and continuous target variables (how much donated).

Here is the rub: the relation between input variables and output variable (the “target”) is given by a set of rules or a mathematical equation. Nothing in that code limits the range of input variables that can be entered. That is why you can easily calculate the expected salary for an employee who is 200 years old. By the same token, you could “predict” the odds of someone being pregnant, even when the value for gender was given as “male.” Constraints or restrictions need to be *imposed* on the input data to ensure valid predictions; they are not part of the model itself.
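Imposing such constraints outside the model can be as simple as a guard that compares scoring inputs against the ranges the model set actually covered. The bounds and field names below are made-up illustrations:

```python
# Sketch of input constraints imposed *around* a model, not inside it.
# The observed ranges and field names are hypothetical.

OBSERVED_RANGES = {"age": (20, 65), "income": (10_000, 250_000)}

def check_inputs(record):
    """Flag inputs falling outside the range the model set covered."""
    warnings = []
    for field, (lo, hi) in OBSERVED_RANGES.items():
        if not lo <= record[field] <= hi:
            warnings.append(f"{field}={record[field]} outside observed "
                            f"range [{lo}, {hi}]; prediction unsupported")
    return warnings

print(check_inputs({"age": 40, "income": 50_000}))    # no warnings
print(check_inputs({"age": 200, "income": 150_000}))  # flags the 200-year-old
```

The model itself would happily score both records; only this wrapper knows which predictions have empirical support.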

For this type of predictive modeling you always want to “stress test” models at the fringes, to make sure results are still plausible. The ultimate and fundamental “problem” with these predictive models is that measurement of (past) data is static, while the future is dynamic. Nonetheless, these types of models have certainly proven their tremendous business value, and I consider them the “killer app” for data science.

2 – “Scenario planning”

Scenario planning, or system dynamics modeling, is an approach to predictive modeling that explicitly aims to make predictions about states that were not observed before. Some of the most common examples of real-life applications occur in weather forecasting and climate change models. But these techniques are in no way restricted to those domains.

To my mind, the most comprehensive title in this area is Prof. John Sterman’s “Business Dynamics” (2000), a dense 1000+ page tome. Not for the faint of heart with all the underlying math (differential equations), this epic book features a very wide assortment of examples. System dynamics as a field launched (no pun intended) after the late Prof. Jay Forrester applied these techniques to guided missiles, and he himself was involved in deploying that technology in World War II. Later Forrester used these same techniques for urban planning, for instance.

Feedback systems like the proverbial room thermostat are omnipresent. An example is the limits to growth when entering a market with a new product. For many years, P&G saw a linear relation between investments in marketing for disposable diapers and market share growth, until growth stalled. There comes a point when every family with babies has adopted (disposable) diapers. Nowadays we don’t even *think* of diapers as “disposable” because cotton diapers have just about disappeared. Examples abound where seemingly endless linear growth eventually tapers off, until it stalls.
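A toy feedback loop in the spirit of the diaper example can show that taper. All numbers are made up; what matters is the shape of the curve, where the shrinking pool of non-adopters feeds back and chokes growth off:

```python
# A toy "limits to growth" feedback loop. All numbers are hypothetical.

market_size = 1_000_000   # every family with babies
adopters = 50_000
growth_rate = 0.8         # made-up marketing effectiveness per period

trajectory = []
for period in range(10):
    remaining_fraction = 1 - adopters / market_size   # the feedback term
    adopters += growth_rate * adopters * remaining_fraction
    trajectory.append(round(adopters))

print(trajectory)  # brisk growth at first, then it stalls near saturation
```

Extrapolating the early, near-linear portion of that trajectory in a straight line would badly overshoot the market ceiling, which is exactly what a classify-the-past model built on the growth years would do.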

When relations between observed variables are known to hold *across a range of values*, then you may infer from those constant relations what will happen outside of the observed range in the data set. Note that modeling in this style hinges on the assumption that you have modeled the relations perfectly (a naïve and questionable assumption at best), and appreciate that the data needed for validation do not exist: these are conditions that have not occurred yet. However, if the relations hold within the observed range, there is at least *some* reassurance they will continue to hold beyond the modeling set. You test the validity of these models by running simulations under a wide variety of scenarios. If the behavior of the model remains consistent with observations, we consider it “valid.”
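That style of validation by simulation can be sketched with the smallest possible system dynamics model, a single stock with an inflow and an outflow. The model structure, parameters, and plausibility check are all assumptions for illustration:

```python
# Sketch of "validation by simulation": run one model under many
# scenarios and check its behavior stays consistent with what we know.
# The stock-and-flow model and all parameters are hypothetical.

def simulate(inflow_rate, outflow_rate, stock=100.0, periods=50):
    """Tiny stock-and-flow model; returns the trajectory of the stock."""
    path = [stock]
    for _ in range(periods):
        stock += inflow_rate - outflow_rate * stock
        path.append(stock)
    return path

# Parameter combinations never observed in the historical data:
scenarios = [(10, 0.1), (10, 0.5), (50, 0.1)]
for inflow, outflow in scenarios:
    path = simulate(inflow, outflow)
    plausible = all(s >= 0 for s in path)   # stocks cannot go negative
    print(inflow, outflow, round(path[-1], 1), plausible)
```

If some scenario drove the stock negative, or off to infinity where we know saturation must occur, we would reject (or repair) the model structure even though no data point directly contradicts it.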

System dynamics suffers from two fundamental challenges: there are often multiple models that can be fitted to the same observed data, and there are no conclusive tests to decide which candidate model “fits” the data better than others. This ambiguity is what climate deniers have “leveraged” when they express their skepticism. There is some validity to that argument, even when 95%+ of scientists have concluded that global warming as a result of human actions is “real” and undisputed (it is). Again, imho, as data scientists we still need to acknowledge limitations of the techniques we choose.

To make an analogy: for many years the “establishment” (tobacco industry) disputed the relationship between smoking and (lung) cancer. Although evidence of the relationship was mounting, there were many (paid…) studies that disputed the relation. Also, and of principal importance, it isn’t ethically possible to design a randomized experiment to conclusively test whether smoking induces cancer in humans. Although we may morally object, the scientific arguments brought forward were more or less “valid”, until the overwhelming support for the “smoking-causes-cancer” hypothesis could no longer be refuted. A majority of scientists converged on the same conclusions. In many respects, the debate on global warming is at a similar turning point, where it has become almost impossible for a well-informed scientist to dispute the conclusion that global warming is caused by humans. Still, from a methodological perspective, I would continue to acknowledge shortcomings in the models used (even though personally I am convinced of global warming).

Here is the rub: although system dynamics models can be “tested” for plausibility, those results are never, and could never be, conclusive. If you try to make predictions about conditions that weren’t observed, then how could you test them?!? Also, somewhat counterintuitively, sometimes there can be multiple models that are all built on the same underlying historical data, and that are all “valid” representations of history. You can never do an exhaustive search of multiple possible futures, since the number of scenarios you would need to test is infinite! Although you can argue these are drawbacks of scenario planning models, the inherent “uncertainty” is as much a part of trying to predict an unknown future as anything else.

**Conclusion**

The field of data science seems to have “settled” on predictive modeling as classifying the past. For the most part, the human element has been largely eliminated from the model building *process*. Hence this approach gets referred to as “machine learning.” Although domain expertise may well be crucial in the data extraction and data preparation phase, most machine learning tools do not facilitate inputting subject matter expertise into the actual model development phase, sometimes called “model engineering.” See e.g. these two papers (Part 1 & Part 2) I wrote in 2007 with my colleague Bas van den Berg.

The term “predictive modeling” has effectively become reserved for machine learning, where a flat file (machine learning algorithms can *only* take in flat files) is run through an algorithm (or suite of algorithms) to arrive at the best-fitting equation that describes the relation between an array of input variables and the target variable. At run time that model gets deployed on “fresh” data (the most recent available) to calculate predictions. After the fact you then monitor how well the responses match the performance on the test data that you held out during the model development phase.

As I have tried to explain in this post, the prevailing mode of predictive modeling (“machine learning”) isn’t the *only* one available. But system dynamics does not appear to have been embraced widely in the data science community. The model building process works rather differently with these tools (like Vensim, or Stella/iThink, etc.). The model structure (code, analytical equation) is to a large extent the result of deliberate input by the modeler. In contrast, when you machine-learn a “predict-as-classify” model from data, the modeler merely sets parameters and then lets the algorithm elicit the optimal model automatically.

Neither of these approaches is uniformly better than the other, but they are different in several ways. Which one to choose depends on the business needs. If your goal is to provide predictions *within* the range of the historical data provided, then a “predict-as-classify” (machine learning) predictive model is a tried and tested option. Its main advantage is the relative simplicity of model building and the lesser need for human input to determine “the best” model. Note that these models merely predict; they can make no claim as to causal relations or explanation (see e.g. Galit Shmueli’s excellent paper “To Explain or to Predict?”, 2010).

If you want to be able to provide predictions outside of the historically observed data ranges, then a system dynamics model, or scenario planning approach, might be more appropriate. The transfer functions between the variables in the model need to be specified, and therefore begin to provide an explanation of the model dynamics, as given by the structure of X impacts Y, etc. The overall “behavior” of the model helps explain what might happen under new, never observed conditions. Either one of these two features could be an important reason to choose this style of predictive modeling.

The future is fundamentally unknowable, of course, and therefore data scientists can only “predict” under certain assumptions. “Predict-as-classify” is by far the most common approach to predictive modeling. The abundance of tools and, more importantly, qualified talent makes it likely this will remain the default. Usually no explanation is required (at least for business purposes) and therefore opaque techniques (like Neural Networks or Support Vector Machines) can be used. And even when the model can be explained (which you should always do, imho, see my blog post about explaining models on 25-AUG-2018), there is no claim to causality. Predicting and explaining serve different purposes and different needs, and therefore you’ll likely wind up with different models (see Shmueli’s paper).