11 February 2018
“It always takes longer and costs more” – Hofstadter’s law. Data Science tasks are complex in nature, and what makes them so hard to predict is that 80-90% of the time is spent preparing the data, and this initial phase is notoriously difficult to estimate! What is maybe even more intriguing is that the law actually reads: “It always takes longer than you expect, even when you take into account Hofstadter’s Law.”
Let’s face it, data are ‘never’ clean, are they? They are either a little messy, or they are very messy. You start by profiling your data and flagging suspicious or deviant records and fields. After testing how sensitive your results are to handling “bad” data in a number of different ways, you eventually settle on a strategy for suppressing implausible values, imputing missing ones, and mapping input values through a suitable transformation.
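To make those three steps concrete, here is a minimal sketch in plain Python. The function name `prepare`, the plausibility bounds, the choice of the median for imputation, the log transform, and the sample incomes are all assumptions made up for illustration; your own data will call for different choices.

```python
import math
import statistics


def prepare(values, lower, upper):
    """Suppress implausible values, impute missing ones, then transform."""
    # Step 1: suppress. Anything missing or outside [lower, upper] becomes None.
    suppressed = [
        v if v is not None and lower <= v <= upper else None
        for v in values
    ]
    # Step 2: impute. Fill every gap with the median of the remaining values.
    observed = [v for v in suppressed if v is not None]
    fill = statistics.median(observed)
    imputed = [fill if v is None else v for v in suppressed]
    # Step 3: transform. log1p compresses the long right tail.
    return [math.log1p(v) for v in imputed]


# Hypothetical yearly incomes: one missing, one negative, one absurdly large.
raw_incomes = [32000, None, 41000, -5, 38000, 9_999_999]
prepared = prepare(raw_incomes, lower=0, upper=1_000_000)
```

Note that every one of these steps involves a judgment call: where the bounds sit, what replaces a missing value, and which transformation counts as “suitable”.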
One often overlooked aspect of this early-stage work is that there is no way to handle data in a “value free” manner. Whichever way you process the data, you “have” to commit to taking a perspective. Reification is when you make something abstract concrete, when you bring things to life; in computer science the term appears to refer exclusively to data modeling. So, if you wish, data preparation may be seen as a special kind of data modeling.
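A small sketch of what committing to a perspective can look like in practice: the same made-up readings, prepared in three defensible ways, give three different averages. The data and the helper names are hypothetical, and the cutoff of 100 for “implausible” is an arbitrary assumption.

```python
import statistics

# Hypothetical sensor readings with two gaps and one suspicious spike.
readings = [12.0, 14.0, None, 13.0, 250.0, None, 15.0]


def impute_with(values, fill):
    """Replace every missing value with the given fill value."""
    return [fill if v is None else v for v in values]


observed = [v for v in readings if v is not None]

# Perspective 1: the spike is real; fill gaps with the mean.
mean_filled = impute_with(readings, statistics.mean(observed))

# Perspective 2: the spike is real but should not drive the fill value.
median_filled = impute_with(readings, statistics.median(observed))

# Perspective 3: the spike is an error; drop it and the gaps altogether.
trimmed = [v for v in observed if v < 100]

averages = {
    "mean_filled": statistics.mean(mean_filled),
    "median_filled": statistics.mean(median_filled),
    "trimmed": statistics.mean(trimmed),
}
```

None of the three is the “neutral” option; each encodes a belief about what the spike and the gaps mean.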
After you have dealt with your data quality issues, the “fun” part begins. Most people consider the modeling part more exciting, and are tempted to keep refining their models. I can relate to that temptation. But the reality is that in most cases (I would argue the overwhelming majority) a very simple approach, like regression, works remarkably well. From a business standpoint there usually is insufficient justification to keep tinkering with a simple, early version of your model (as I wrote about here).
The inconvenient truth about Data Science is that a fairly “simple” approach works just fine, most of the time. Most of the effort goes towards menial work like data preparation. And although your choices in reifying the data play a largely invisible role, it is my conviction that they weigh heavily on your final results. So much for the sexiest job of the 21st century…