23 September 2018
John Tukey (1915-2000) is undoubtedly one of the most influential godfathers of data science. He coined the term “bit” (first used in print by Claude Shannon in his seminal 1948 text “A Mathematical Theory of Communication”), introduced the “box plot”, contributed significantly to jackknife estimation, and has a few statistical tests named after him. Tukey articulated the distinction between exploratory and confirmatory analysis at a time when most statisticians were focused on the latter. He wrote a paper in 1958 that contains the earliest known published use of the term “software”, and he is said to have founded the field of Computer Science. Many more credits are attributed to him.
Personally, I love this profound quote from him:
“Far better an approximate answer to the right question, which is often vague, than an exact answer to the wrong question, which can always be made precise”
(taken from: “The Future of Data Analysis”, 1962). Tukey’s remark arose from concerns that statistics was increasingly relying on mathematical models, rather than focusing on solving real-world problems.
Most statistical tests in common use today rest on assumptions about the distribution of the underlying data, such as the data following a “normal distribution”. Yet people rarely bother to check whether their data actually adhere to those patterns! Although there is genuine beauty in math and closed-form analytical solutions, to tame the “Big Data” monster we need practical approaches that help companies get value from oceans of data, even when those data don’t follow perfect Gaussian distributions.
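To make "checking" a little more concrete, here is a minimal sketch of one such check: the Jarque-Bera statistic, which combines sample skewness and excess kurtosis and is compared against a chi-square distribution with 2 degrees of freedom. This is my own illustration and not something Tukey or Pyle prescribe. The function name `jarque_bera`, the two simulated samples, and the 5% cut-off are all assumptions made for the example, and only the Python standard library is used.

```python
import math
import random


def jarque_bera(values):
    """Return (skewness, excess kurtosis, Jarque-Bera statistic) for a sample."""
    n = len(values)
    mean = sum(values) / n
    # Central moments, dividing by n as the Jarque-Bera statistic does.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    jb = n / 6.0 * (skewness ** 2 + excess_kurtosis ** 2 / 4.0)
    return skewness, excess_kurtosis, jb


# 5% critical value of the chi-square distribution with 2 degrees of freedom.
CHI2_2DF_5PCT = 5.991

# Simulated data, purely hypothetical: one roughly Gaussian sample and one
# log-normal sample standing in for skewed business data such as order sizes.
rng = random.Random(42)
bell_shaped = [rng.gauss(100, 15) for _ in range(2000)]
skewed = [rng.lognormvariate(3, 1) for _ in range(2000)]

for label, sample in [("bell-shaped", bell_shaped), ("skewed", skewed)]:
    s, k, jb = jarque_bera(sample)
    if jb > CHI2_2DF_5PCT:
        verdict = "normality is doubtful"
    else:
        verdict = "no evidence against normality"
    print(f"{label}: skew={s:.2f}, excess kurtosis={k:.2f}, JB={jb:.1f} -> {verdict}")
```

A single statistic like this is no substitute for looking at the data (a histogram or a box plot will often tell you more), but it takes seconds to run and at least forces the question.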
Real-world data are messy in much the same way as business practices are flexible, subject to change, cater to exception scenarios, and reflect imperfections –small or big– in the source systems that generated those data. Now if the corresponding data models do a good (enough) job of capturing all possible scenarios, you can hope to disentangle that messy hairball. But that’s a big “if”…
By the same token, “real” customers behave erratically, don’t all respond in the same way to promotions, are not all affected by the same marketing messages, etc. As a result, systematic patterns in their behavior need to be ferreted out – and that is hard work! What you are looking for is often little more than a faint signal, drowning in a sea of noise.
Surfacing meaningful consumer behavior patterns also means you have to navigate treacherous cliffs because the data don’t always mean what they appear to imply. In his epic book “Business Modeling and Data Mining” (2003), Dorian Pyle wrote (p. 67):
“Data is very fickle stuff. At best, it is but a pale reflection of reality. At anything less than the best, it seems intent on leading the unwary astray. Invariably, the data that a data miner has to use seems particularly well constructed to promote frustration.”
Yet we all know there is value in data. It may be buried deep, but it is there (usually…). When you open a new dataset, or even while you are piecing it together, you start by “profiling.” Personally, I am partial to a more descriptive term for this process that I first read in Dorian Pyle’s “Data Preparation for Data Mining” (1999): “data assay.” The data assay serves to establish the suitability of a dataset for your analytic purposes. It’s a crucially important first step, and you ignore or skip it at your peril. I can understand why few people consider it “sexy” or “cool”, and maybe that is why it receives so little attention, with Pyle’s excellent book being one of the exceptions to that rule.
It is tempting to start analysis around the variables that are conveniently available: it’s “pragmatic”, informative, you get the “low hanging fruit” first, etc. But it’s a death trap, exactly the one Tukey tried to warn against, because reporting on the basis of existing variables may well answer “the wrong question”, no matter how precisely. Instead, Tukey encouraged us to pursue “the right” questions, even if they are somewhat “vague.” The business problem always has to come first. Where does opportunity lie? Where is the company bleeding? Those questions drive the “big” objectives, and notice how they are not defined in any way by specific variables in your data set.
I would argue that the most important variables required to highlight “big” questions are never immediately available. If these key metrics were already captured, why would the business problem continue to exist? By redefining (literally!) how to look at (“measure”) the business, you offer opportunities for change and improvement. The hard work to get at those variables is like the guide leading you through a dark forest. If the path were already clear, nobody would be searching for their way.
Let’s analyze some premises behind Lean, or the closely related Toyota Production System. Historically, when the cost of capital equipment was a major concern, “utilization” was one of the central measures that organizations pursued. The reasoning was that higher utilization implies higher ‘efficiency’ and/or a better return on investment. Hence “utilization” was measured, and efforts were geared towards maximizing it.
Along came “Lean”, which challenged many of these traditional and intuitive notions. With that ‘new’ perspective, new metrics are needed to shed light on the same processes. What if you look at queues and delays, instead of utilization? In traditional measurement, underutilized capacity appears as waste, whereas queues have no apparent cost. If you want to illustrate the economic importance of queues, the quickest way to senior management’s heart is by quantifying the costs of delay and the associated financial losses.
The “cost of delay” is an example of the kind of “hard question” that I suggest you pursue. In practice, many organizations base their decisions on conveniently available proxy variables that stand in for the real economic objectives. Cycle time is an example. Although it seems ‘obvious’ that reducing cycle time is a worthy cause, it remains a mere “proxy variable” until you can quantify the cost of delay in real dollars per day. Only after you have formulated the relationship between this proxy variable and profits can you begin to make rational trade-offs between minimizing cycle time and maximizing economic gains.
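As a back-of-the-envelope sketch of what such a relationship might look like, suppose the cost of delay is a constant dollar amount per day (a simplifying assumption; in reality it is rarely linear). The function name, the two improvement options, and every figure below are hypothetical and chosen only to illustrate the trade-off.

```python
def net_gain(cost_of_delay_per_day, days_saved, cost_of_change):
    """Dollar value of a cycle-time reduction, assuming a constant cost of delay per day."""
    return cost_of_delay_per_day * days_saved - cost_of_change


# Hypothetical figure: each day of delay costs the business $8,000.
COST_OF_DELAY_PER_DAY = 8_000

# Hypothetical improvement options with the days they save and what they cost.
options = {
    "add a second test rig": {"days_saved": 10, "cost_of_change": 120_000},
    "expedite supplier": {"days_saved": 4, "cost_of_change": 15_000},
}

results = {
    name: net_gain(COST_OF_DELAY_PER_DAY, o["days_saved"], o["cost_of_change"])
    for name, o in options.items()
}

for name, gain in results.items():
    print(f"{name}: net gain = ${gain:,}")
# add a second test rig: net gain = $-40,000
# expedite supplier: net gain = $17,000
```

Judged on the proxy alone, the first option wins hands down: it saves ten days rather than four. Judged in dollars, under these made-up numbers, it loses money while the more modest option pays off. That is the kind of trade-off you cannot see until the proxy has been tied to economics.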
Since for-profit organizations are usually focused almost exclusively on financial gains, the relation between operational processes and the bottom line will often be an area to search for “hard questions.” But in many other areas, like medicine, the hard questions might revolve around the number of lives saved, the average extension of life expectancy, or something amorphous like quality of life. Usually, the hard questions are lurking behind the mission of an organization: why are we in business, why do we exist?
There is an old and hackneyed joke about the “streetlight effect” (or drunkard’s search): a drunk is looking for his keys under a lamppost. When someone asks him, “Is this where you lost them?” he replies: “No, but this is where the light is!”