Data Science – an inconvenient truth

Tom Breur

11 February 2018

“It always takes longer and costs more” – Hofstadter’s law. Data Science tasks are complex in nature, and what makes them so hard to predict is that 80-90% of time is spent preparing the data, and this initial phase is a notoriously difficult task to estimate! What is maybe even more intriguing is that the law actually reads: “It always takes longer than you expect, even when you take into account Hofstadter’s Law.”

Let’s face it, data are ‘never’ clean, are they? They are either a little messy, or they are very messy. You start profiling your data, and flag suspicious or deviant records and fields. While you determine the sensitivity of dealing with “bad: data in a number of different ways, eventually you settle on a strategy for suppressing implausible values, imputing missing ones, and generally how to map input values via a suitable transformation.

One of the often overlooked elements about this early stage work is that there is no way to handle data in a “value free” manner. Whichever way you process the data, you “have” to commit to taking a perspective. Reification is when you bring things to life, which in computer science appears to refer exclusively to data modeling. Alas, if you wish, data preparation maybe is a special kind of data modeling.

After you have dealt with your data quality issues, the “fun” part begins. Most people consider the modeling part more exciting. But reality is that in most cases –I would argue the overwhelming majority– a very simple approach (like Regression) works remarkably well. I can relate to that temptation, but from a business standpoint there usually is insufficient justification to tinker with a simple, early version of your model (as I wrote about here).

The inconvenient truth about Data Science is that a fairly “simple” approach works just fine, most of the time. Most of the effort goes towards menial work like data preparation. And although your choices to reify the data play a largely invisible role, it is my conviction that they weigh heavily on your final results. So much for the sexiest job in the 21st century

Advertisements

One comment

  1. > The inconvenient truth about Data Science is that a fairly “simple” approach works just fine, most of the time. Most of the effort goes towards menial work

    This truth is inconvenient but also liberating. Once you realized that in many cases “good enough is good enough”, you can live in peace with this disbalance, and even give presentations about it (https://www.youtube.com/watch?v=UwkNmXhWmfI) 🙂

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s