12 August 2018
In the analytics world, data modeling is a bit of an oddball. I remember when the tools were costly and exclusively reserved for the “enlightened” – an elite class of IT specialists who seemed estranged from everyday business life. In the old days, the data modeling team was a station that “had” to be passed; it possessed almighty veto power.
In an earlier post (ETL is dying), I shared my perspective on how moving and transforming data has changed fundamentally. This may have partly been triggered by the “Big Data” movement, but I think there is more. Business Intelligence is increasingly becoming a commodity, table stakes. Nobody stores data for data’s sake. With the growing reliance on data scientists who in many – if not most – professional settings need to do a lot of the upfront ETL work themselves, you need a coherent and enterprise-wide data strategy. Ronald Damhof’s 4-Quadrant model has always served me very well in this respect.
This 2×2 framework distinguishes systematic from opportunistic development along the vertical axis. In most organizations, Quadrants I & II are dominated (“owned”) by IT, whereas Quadrants III & IV are usually managed closer to primary business processes. The horizontal axis can be characterized by push versus pull mechanisms. Another angle on this same dynamic is whether source systems (“supply” – Quadrants I & III) or information consumers (“demand” – Quadrants II & IV) are the driving force.
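To make the two axes concrete, here is a minimal sketch of the quadrant layout as a lookup table. The enum names and string labels are my own shorthand, not Damhof’s terminology:

```python
from enum import Enum

class Development(Enum):
    SYSTEMATIC = "systematic"        # top half: governed, repeatable (Quadrants I & II)
    OPPORTUNISTIC = "opportunistic"  # bottom half: ad hoc, exploratory (Quadrants III & IV)

class Driver(Enum):
    PUSH = "push"  # supply-driven: source systems lead (Quadrants I & III)
    PULL = "pull"  # demand-driven: information consumers lead (Quadrants II & IV)

# Each (development style, driving force) pair maps to exactly one quadrant.
QUADRANTS = {
    (Development.SYSTEMATIC, Driver.PUSH): "I",
    (Development.SYSTEMATIC, Driver.PULL): "II",
    (Development.OPPORTUNISTIC, Driver.PUSH): "III",
    (Development.OPPORTUNISTIC, Driver.PULL): "IV",
}

def quadrant(dev: Development, driver: Driver) -> str:
    """Classify a piece of information work by its position on both axes."""
    return QUADRANTS[(dev, driver)]
```

A data scientist’s sandbox prototype, for instance, is opportunistic and demand-driven: `quadrant(Development.OPPORTUNISTIC, Driver.PULL)` yields Quadrant IV.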
In Damhof’s model, there is explicit recognition of ad hoc data exploration (the proverbial “sandbox”) that may render early versions of information concepts. Typically these become input for Dimensions (in Kimball speak) or KPIs. Once those prototypes are deemed worthwhile to sustain, they migrate from ad hoc to recurring – in Damhof’s terms, these information products promote from the sandbox of Quadrant IV to production status in Quadrant II.
However, moving information products from Quadrant IV to Quadrant II is easier said than done! When a data scientist “prepares data”, every single step in that process needs to be recreated in code that can run autonomously. And that is only the first step! When an algorithm then gets applied to the underlying data, these calculations need to be reproduced, too. For years, efforts to raise PMML – a language to describe predictive models – to a global standard were brave and relentless. However, apparently they never reached “escape velocity”, or the critical mass required to become a de facto standard. PMML has remained – in large part – wishful thinking.
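To illustrate that first step: cleaning done interactively, cell by cell, in a notebook has to be re-expressed as a single deterministic function that a production job can run unattended. The column names and cleaning rules below are invented for the example; the point is that every exploratory step becomes explicit, repeatable code:

```python
from datetime import datetime

def prepare_orders(rows):
    """Recreate, as one autonomous function, prep steps a data scientist
    might have done interactively: drop rows with a missing amount, parse
    types, and deduplicate on order id (keeping the latest record)."""
    latest = {}
    for row in rows:
        if not row.get("amount"):
            continue  # exploration step 1: discard incomplete records
        parsed = {   # exploration step 2: parse strings into real types
            "order_id": row["order_id"],
            "amount": float(row["amount"]),
            "order_date": datetime.strptime(row["order_date"], "%Y-%m-%d").date(),
        }
        # exploration step 3: keep only the most recent record per order id
        seen = latest.get(parsed["order_id"])
        if seen is None or parsed["order_date"] > seen["order_date"]:
            latest[parsed["order_id"]] = parsed
    return list(latest.values())
```

Running this over raw rows always yields the same result, which is exactly what promotion to Quadrant II demands – no human in the loop, no undocumented notebook state.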
Almost every data science platform has its own native deployment engine. And there’s the rub: those data science platforms are rarely if ever seamlessly connected to production data. Handing off the algorithms makes the process brittle, error-prone, and cumbersome to maintain. So not only do you need to overcome technical hurdles, partly around connectivity, but the semantic data modeling challenges are formidable, too!
There are no two ways about it: either work is worthwhile to continue and persist, or it’s not. If information products have proven their value to the organization, the hard work of promoting prototypes to Quadrant II (or Quadrant I) needs to be undertaken. By promoting information products to the top half of Damhof’s 4-quadrant model, you expose that work to corporate governance and control mechanisms – where it belongs.
There are many paths information processes can traverse through this model, and much more could be said about it. An honest assessment of everyday work will often surface some of these movements, which makes this such an incredibly valuable model, imho.
I feel that winners and losers will emerge from structural, systemic differences between those companies who succeed in moving lots of information assets from Quadrant IV to II, and those that seem forever stuck in Quadrant IV. At an organizational level, you “need” the concepts that emerge in Quadrant IV to become common parlance, with unified metrics. These will typically reside with IT but, either way, achieve “production” status as common reporting targets. I like to talk about “corporate sanctioned reporting”: the metrics an organization’s board has agreed to, and that, by earning that status, have become leading metrics for management decision making.
What few people appreciate is that establishing and refining those metrics is the grunt work of data modelers. The nitty-gritty details of how to capture the input data, which exception cases to exclude, and lots of other “details” make the difference between an organization that is truly focused and one that is run by the seat of its pants. In the era of Big Data, big gains can come from careful data management and master-grade data modeling. You just cannot wing this.
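Those “details” are easiest to see in code. Here is a hypothetical corporate-sanctioned metric sketch (the field names, statuses, and exclusion rules are invented for illustration); the headline number means nothing without the small-print rules encoded in it:

```python
def net_sales(transactions):
    """A hypothetical sanctioned metric: net sales, with the exception
    cases the data modeler agreed on made explicit in the logic."""
    total = 0.0
    for t in transactions:
        if t.get("test_account"):
            continue  # exception case: internal test orders never count
        if t["status"] == "cancelled":
            continue  # exception case: cancellations are excluded outright
        if t["status"] == "returned":
            total -= t["amount"]  # returns reduce net sales
        else:
            total += t["amount"]
    return round(total, 2)
```

Change any one of these exclusion rules and every dashboard built on the metric silently changes meaning, which is precisely why this grunt work belongs under governance.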
We have come a long way since Codd, and we don’t have to learn all lessons again. Non-relational systems (like Hadoop, etc.) have truly earned their place (notwithstanding my reservations), but paradoxically, to my mind they have, more than anything else, underscored the importance and value of relational data modeling. Data modeling is dead, long live data modeling!