5 July 2018
Data preparation can easily consume 80% of the work in an analytics project. Those who have spent time in the trenches know that the oft-quoted figure of 80% is actually on the low side: it often feels more like 90%, sometimes even 95% or more. To the outside world, especially to those not very familiar with analytics, that may seem like a lot of overhead.
My take on the cost of data preparation is slightly different. Personally, I like to think of data preparation as a value-creating workstream, even though almost everybody considers it a mere cost. I'm aware this perspective is controversial, so let me elaborate.
Process models like CRISP-DM provision a separate step for "data preparation" that doesn't itself appear to produce any value-added output, which is why the perspective of data preparation as a pure cost in the analytics process seems so intuitive.
In 1979, Philip Crosby published his seminal book "Quality Is Free", built on the premise that doing things right the first time is always cheaper (in the long run…). I am constantly surprised how relevant that insight remains today. In the digital era we have entered, data quality issues by and large originate at the transitions between business silos or between organizations.
For example, if your marketing campaign is less effective because of address inaccuracies, the handoff between upstream data-entry staff and the downstream marketers who use these lists lies at the root of this data quality gap. Of course it only becomes evident when your data scientist surfaces the imperfections in data entry (as manifested by lower response rates or high mail returns). When you join tables and some records get "lost," you usually have a similar data quality issue, perhaps because accounts payable used a different definition of "customer" than operations did.
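As a minimal sketch of how such a join mismatch surfaces in practice (the customer IDs and department names here are invented for illustration), comparing the join keys on both sides before joining makes the "lost" records visible instead of letting them silently disappear:

```python
# Hypothetical example: customer IDs as recorded by two departments that
# use slightly different definitions of "customer". An inner join would
# silently drop the non-matching records; inspecting the key sets first
# turns that silent loss into a quantifiable data quality gap.
accounts_payable = {"C001", "C002", "C003", "C005"}
operations = {"C001", "C002", "C004", "C005"}

matched = accounts_payable & operations        # records an inner join keeps
lost_from_ap = accounts_payable - operations   # AP customers unknown to operations
lost_from_ops = operations - accounts_payable  # operations customers unknown to AP

print(f"matched: {sorted(matched)}")
print(f"lost from accounts payable: {sorted(lost_from_ap)}")
print(f"lost from operations: {sorted(lost_from_ops)}")
```

Counting (and costing) the records in the two "lost" buckets is exactly the kind of quantification that turns a tedious data preparation chore into a business finding.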
During data preparation, many or most of these gaps are brought to light, and you find a pragmatic way to align the various datasets for the purpose of model building. Over the years, I have grown increasingly convinced that these gaps are where enormous amounts of corporate value go to waste. Pinpointing them, and more importantly quantifying their magnitude and associated financial costs, can drive incredibly powerful transformation. All on the back of unglamorous, menial data preparation work…
Let’s return to the title of this post: should data preparation really be considered a cost? In my experience, the lion’s share of grunt work in data preparation is to overcome data quality gaps that are the immediate result of imperfect business process handoffs. If you manage to do a compelling job of quantifying the business costs of these gaps, pre-existing but hitherto largely invisible cracks in the corporate value chain, then the most tedious and unglamorous work of data preparation may well turn out to be the most valuable effort after all!