16 April 2017
Big Data are all around us. One of the newer marketing concepts is the “data lake”, although I don’t believe there is any agreed-upon architectural definition of what a data lake really is. Some people perceive it as the 21st-century version of a data warehouse. Some might even conclude that when it is called a data lake, then surely it must be powered by Hadoop. Others see a data lake “merely” as a contemporary version of either a persistent staging area (PSA) or an operational data store (ODS). The latter term was coined by Bill Inmon (often considered the father of data warehousing) and laid out, along with its various types, in his 1995 book “Building the Operational Data Store”.
One of the recurring themes I hear when people talk about a data lake is that they prefer storing data in its native, original format, rather than having to transform it. This dramatically lowers the cost of storing, because you don’t have to do the upfront work of modeling and ETL. It also lowers the barrier when you are unsure about the business case for storage. You can store far more data “just in case” you need it later, and only incur storage cost, rather than storage + ETL – the latter can be a considerable expense. This, I feel, is a genuine benefit, and I pointed to it here.
This latter point touches on the key benefits of Schema-on-Read, as opposed to the Schema-on-Write approach we have followed for the past decades. Schema-on-Read offers a potential benefit, but it is also a threat, and that is the premise of this article. One reason to “use” your data lake in ways similar to the use case for an ODS is that you need some data integration, but perhaps not as much as you would expect from a full-fledged corporate data warehouse. The need for low-latency data is not met by “traditional” data warehouse solutions. Light, partial integration can therefore meet the need for a subject-specific “complete” data view, without incurring the costs and delays of data-warehouse-style integration.
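To make the Schema-on-Read idea concrete, here is a minimal sketch: raw records land in the lake untouched, and a schema is only imposed at the moment someone reads them. The field names (“customer_id”, “amount”) are illustrative assumptions, not anything from a real system.

```python
import json

# Raw events stored exactly as they arrived - no upfront modeling or ETL.
raw_events = [
    '{"customer_id": 1, "amount": "19.99"}',
    '{"customer_id": 2, "amount": "5.00", "coupon": "SPRING17"}',
]

def read_with_schema(lines):
    """Schema-on-Read: parse and coerce types at read time, not write time."""
    for line in lines:
        record = json.loads(line)
        yield {
            "customer_id": int(record["customer_id"]),
            "amount": float(record["amount"]),  # type coercion deferred until here
        }

rows = list(read_with_schema(raw_events))
```

Note that the second event carries an extra “coupon” field the reader simply ignores: storage didn’t have to anticipate it, which is exactly the flexibility (and the deferred risk) being discussed.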
So far, so good. There is just a slight problem with this approach. As you ingest ever more sources, the amount of Work-in-Progress (WIP) piles up. Data profiling is one of the first steps you would want to take when loading any new data source. A cornerstone of working Agile is that you provide timely feedback, which hinges on both bandwidth and fidelity. High fidelity implies comprehensive profiling (which is difficult and time-consuming), and bandwidth is costly: the whole data lake value proposition revolves around spending minimal time upfront! So you are faced with the law of WIP piling up, which we know from the Theory of Constraints and Little’s Law to be detrimental to throughput. This, surely, is a tricky balance to navigate. WIP doesn’t talk, doesn’t complain; in fact, it is perniciously silent.
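Little’s Law makes this pile-up quantifiable: average WIP equals the arrival rate times the average time an item spends in the system. The numbers below are purely illustrative assumptions, just to show how quickly unprofiled sources accumulate.

```python
# Little's Law: WIP = arrival rate x average time in system.
# Assumed, illustrative figures:
arrival_rate = 2   # new data sources ingested per month
avg_wait = 6       # months a source sits unprofiled and unmodeled

wip = arrival_rate * avg_wait  # average number of sources in limbo
print(wip)  # prints 12
```

Twelve sources of silent WIP, and nothing in the lake itself will ever flag them.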
The purpose of storing data is, obviously, to (be able to) use it later. So unless you do at least some initial data profiling, you are essentially banking on the usability of the data, in good faith. Having worked in the field for a couple of decades, I never take the value of data at face value. I am reminded of a story by a colleague who had been storing 18 months’ worth of JSON files, only to discover that they could not be converted into a relational format: the internal structure was broken. Some data types, like JSON or XML, can be stored essentially “as is”, but that gives zero reassurance that these same data can later be made to surface their assumed structure. My friend got lucky, and was able to go back through an API and recover the same data, this time ensuring the structure would allow it to be parsed properly. The analogy he offered me was: “Hey, in a manufacturing plant, you don’t wait to test product quality until you load it on the truck, do you?!?” So either you do some minimal upfront data quality checking, or you are “just” allowing WIP to pile up.
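The kind of minimal upfront check my colleague wished he had done can be very cheap. A sketch of such a manufacturing-style spot check at ingest time, assuming JSON payloads, might look like this:

```python
import json

def passes_minimal_check(raw: str) -> bool:
    """Cheap ingest-time gate: does the payload even parse as JSON?
    This is not full profiling - just a spot check before the data
    disappears into the lake."""
    try:
        json.loads(raw)
        return True
    except json.JSONDecodeError:
        return False

assert passes_minimal_check('{"id": 1}')        # well-formed payload
assert not passes_minimal_check('{"id": 1')     # truncated payload is rejected
```

A broken payload detected at ingest can be re-fetched the same day; a broken payload detected 18 months later may be gone for good.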
As I like to say: “Don’t put off until tomorrow what you can put off until the day after tomorrow!” I say that tongue-in-cheek, of course. If you think you can put it off, then what is going to stop you from postponing it indefinitely? Have you ever gotten distracted or overwhelmed, and found yourself skipping tasks that you had conveniently parked for the interim? Unless you are actively managing WIP, the odds are you are not providing adequate feedback to upstream data providers to sustain your agile pace. This WIP includes things like the challenging and knowledge-intensive work of data modeling, verification of business rules, and the data cleansing necessary to prevent GIGO in your decision support system. And as you will all have realized by now, it is this same mechanism that might well turn your data lake into a data landfill…