12 March 2018
“Analytics” means different things to different people. And now that “data science” has been dubbed the sexiest job of the 21st century, the term appears to have taken on such a broad meaning that it’s hard to tell what isn’t covered by it… I would like to clarify three terms (OLAP, statistics and data science) in the hope that they will become more useful. I have heard the word “analytics” used to mean all three of them, so my hope is to shed some light on these discussions.
More than anything else, I’d like to point to principled differences between these three, in an attempt to sharpen the distinctions. In practical settings there will often be considerable overlap; please accept that as a given. Statisticians often venture into the data science and OLAP space, data scientists apply statistical principles and also rely on OLAP, etc. This overlap has likely contributed to some of the confusion. So how do they differ?
Statistics is typically run on relatively small datasets, which is one of the key differentiators from data science. OLAP, too, typically involves many, many more records than is customary in statistics projects. So the sheer volume of data (the number of records) typically involved sets statistics apart from both OLAP and data science.
When you work with “Big Data”, statistical testing in the traditional sense isn’t very useful: with millions or even billions of rows, even a tiny effect becomes statistically significant. Statistics is crucial when there is some effect but it isn’t immediately clear whether expected “random” fluctuations might have caused it by chance, or whether “chance” is an unlikely explanation for the observed patterns. How unlikely is expressed by the p-value of a hypothesis test: the probability of seeing an effect at least this large if nothing but chance were at work. In the social sciences, a p-value below a threshold of 0.05 (or, more strictly, 0.01) is usually referred to as “significant”: a pattern this strong would then occur “by chance” alone less than 1 time in 20 (or 1 in 100).
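To make the point about sample size concrete, here is a minimal sketch of my own, not a recipe. It assumes a simple two-sample z-test with two equally sized groups and unit variance; the function name `two_sided_p_value` and the effect size of 0.01 standard deviations are made up for illustration.

```python
import math

def two_sided_p_value(effect_size: float, n_per_group: int) -> float:
    """Two-sided p-value of a two-sample z-test.

    Assumption: two equally sized groups with unit variance whose observed
    means differ by `effect_size` standard deviations.
    """
    # The standard error of the difference in means is sqrt(2 / n),
    # so the test statistic is effect_size / sqrt(2 / n).
    z = effect_size * math.sqrt(n_per_group / 2)
    # For a standard normal Z, P(|Z| > z) equals erfc(z / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))

tiny_effect = 0.01  # hypothetical: a difference of 1% of a standard deviation

for n in (1_000, 100_000, 10_000_000):
    p = two_sided_p_value(tiny_effect, n)
    print(f"n per group = {n:>10,}: p = {p:.3g}")
```

The same negligible effect is nowhere near significant with a thousand rows per group, slips under 0.05 with a hundred thousand, and becomes overwhelmingly “significant” with ten million. The effect did not get any bigger, only the dataset did.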
Another key distinction between statistics and data science is that statistics aims to test hypotheses, whereas the purpose of data science is often to generate them. When you apply statistics to test a hypothesis, there is a preconceived notion you’re attempting to validate. In data science, however, you typically search for patterns in (very) large data sets by means of automated or semi-automated processes. This is what your typical machine learning (ML) algorithms do. Whenever you surface such “unusual” patterns (the effects that stand out), this gives rise to further exploration. This is what I mean when I say that data science generates, rather than tests, hypotheses.
When you compare OLAP with data science, it is apparent that they both “explore” hypotheses, albeit in very different ways. Typical data sources contain dozens, sometimes hundreds, or even thousands of variables. Many people colloquially refer to data exploration in an OLAP cube as “data mining” or data science. However, there is a non-trivial distinction between the “mining” of data in an OLAP cube, relative to the kind of work data scientists undertake.
When data scientists “mine” a large dataset, their machine learning algorithms consider hundreds, sometimes thousands of variables. Each and every one of those variables has a “fair” chance of showing up in the results, whether those results are spurious or not. Conversely, when you “mine” an OLAP cube, you do so within the fixed constraints of variables that were chosen beforehand (!) as dimensions for the cube. You can only “find” something within the realm of variables that were chosen as dimensions and attributes for further description in the OLAP cube. This is why OLAP analysis is sometimes described metaphorically as “talking to the data”, whereas data science is more akin to “listening to the data.”
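The contrast between “talking” and “listening” can be sketched in a few lines. Everything below is hypothetical: the records, the column names, and the helpers `olap_rollup` and `scan_all_variables` are invented for illustration, and a plain correlation scan stands in for a real machine learning algorithm.

```python
from collections import defaultdict
from math import sqrt

# Hypothetical records; every column name and value is invented for illustration.
rows = [
    {"region": "East",  "discount": 5,  "stockouts": 6, "ad_spend": 12, "sales": 70},
    {"region": "East",  "discount": 10, "stockouts": 5, "ad_spend": 9,  "sales": 78},
    {"region": "West",  "discount": 5,  "stockouts": 1, "ad_spend": 11, "sales": 118},
    {"region": "West",  "discount": 10, "stockouts": 2, "ad_spend": 10, "sales": 110},
    {"region": "North", "discount": 5,  "stockouts": 3, "ad_spend": 12, "sales": 98},
    {"region": "North", "discount": 10, "stockouts": 0, "ad_spend": 9,  "sales": 131},
]

# "Talking to the data": only the dimensions chosen beforehand are available.
CUBE_DIMENSIONS = ("region",)

def olap_rollup(rows, dimensions=CUBE_DIMENSIONS, measure="sales"):
    """Sum a measure over the pre-chosen cube dimensions."""
    totals = defaultdict(float)
    for row in rows:
        totals[tuple(row[d] for d in dimensions)] += row[measure]
    return dict(totals)

def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# "Listening to the data": every numeric variable gets a chance to show up.
def scan_all_variables(rows, target="sales"):
    """Rank every numeric variable by the strength of its correlation with the target."""
    candidates = [k for k, v in rows[0].items()
                  if k != target and isinstance(v, (int, float))]
    target_values = [r[target] for r in rows]
    scores = {k: pearson([r[k] for r in rows], target_values) for k in candidates}
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

print(olap_rollup(rows))         # sales per region, and nothing else
print(scan_all_variables(rows))  # all numeric variables, strongest first
```

In this toy data the cube can only tell you that the East sells less. The scan surfaces `stockouts` as the variable most strongly tied to sales, even though nobody thought to make it a dimension. With only six rows such a result could easily be spurious, which is exactly why it counts as a hypothesis rather than a finding.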
In a colloquial sense, you can think of OLAP querying as “hypothesis testing”: when you have theories about the data, slicing and dicing serves to add context and to verify whether your theories are true. For example: suppose a geographic visualization highlights that sales in the East region are declining. I may then relate that data point to the performance of the group of sales managers for that territory, to see if they are underperforming relative to their historic sales rates. Or maybe the “hypothesis” is that the drop in sales is due to unavailability (“out of stock”) of certain products, which negatively impacts sales, etc. Obviously, all of those discoveries are only possible if you chose to include those specific variables in the OLAP cube. And clearly you are not doing any formal hypothesis testing, even though you are searching for outlying data points that suggest “clear” (i.e. non-random) effects; there “just” isn’t any p-value associated with these patterns.
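The East region example might look roughly like this in code. It is a sketch under assumptions: the fact table, the manager names and the helper `slice_and_dice` are all hypothetical, and a real OLAP tool would do this through its own query interface rather than a Python loop.

```python
from collections import defaultdict

# Hypothetical fact table; managers, periods and figures are invented.
facts = [
    {"region": "East", "manager": "A", "period": "historic", "sales": 100},
    {"region": "East", "manager": "A", "period": "current",  "sales": 98},
    {"region": "East", "manager": "B", "period": "historic", "sales": 120},
    {"region": "East", "manager": "B", "period": "current",  "sales": 85},
    {"region": "West", "manager": "C", "period": "historic", "sales": 110},
    {"region": "West", "manager": "C", "period": "current",  "sales": 112},
]

def slice_and_dice(facts, region, by="manager"):
    """Slice on one region, dice by another dimension, and return the
    relative change of current sales versus historic sales per member."""
    totals = defaultdict(lambda: defaultdict(float))
    for fact in facts:
        if fact["region"] == region:  # the "slice"
            totals[fact[by]][fact["period"]] += fact["sales"]  # the "dice"
    return {member: periods["current"] / periods["historic"] - 1.0
            for member, periods in totals.items()}

change = slice_and_dice(facts, "East")
for manager, pct in sorted(change.items(), key=lambda kv: kv[1]):
    print(f"manager {manager}: {pct:+.1%} versus historic sales")
```

Note that the drill-down works only because `manager` and `period` happen to be columns of the fact table. An out-of-stock “hypothesis” could not be checked here at all, since that variable was never included.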
Data science, being an interdisciplinary field, relies heavily on both statistics and OLAP. In the early data exploration stages of a project, data scientists are (very) likely to benefit from access to OLAP. Solid knowledge of statistics comes in handy to get a sense of proportion for the (relative) strengths of effects. All too often, anecdotal business evidence is taken as gospel. Formal statistical tests can then be a compelling way to rule out chance (“coincidence”) and to prioritize the projects you pursue.
Data science is different from OLAP in that you leverage computing power (machine learning algorithms), rather than human inspiration and perspiration, to search for patterns in data. Data science purports to be more “theory free” than statistics, and there is some truth to that: serendipitous discoveries of patterns in the data are expected, and are allowed to drive much of the exploration process. Of course a project journey is never (entirely) “theory free”: there are business facts that trigger the need to launch data science projects. Statistics, however, is clearly theory driven, as the choice of hypotheses to test is governed by the underlying theory. So it is safe to say that data science is less theory driven, rather than theory free.
Now that “everyone” wants to jump on the data science bandwagon, it is more important than ever to get clarity around terms and definitions. It is difficult to say whether it is marketing hype, superficial recruitment strategies, or some other reason that has blurred the boundaries between OLAP, statistics and data science.
The term data science came into vogue some 20 years ago. A Google Trends plot of the number of searches for “data science” shows it continues to gain in popularity:
[Figure: Google Trends searches for “data science”, last 10 years]
Hopefully, this ongoing popularity will contribute to knowledge and understanding of the possibilities and limitations of our wonderful profession!