30 April 2017
Ever since the “big debate” between Inmon and Kimball in the 90’s, dimensional modeling has been a recurring component in contemporary business intelligence (BI) architectures. Inmon and Kimball both agreed that dimensional modeling should be used, they just couldn’t agree exactly how to leverage it. Inmon favored a hub-and-spoke architecture with downstream dependent data marts (presented in dimensional models). Kimball favored a bus-matrix where the entire data warehouse was modeled (and materialized) in dimensional form. This reference paper discusses the Kimball versus Inmon approach, and contains a bizarre (recurring) typo: use of the word “confirmed dimension”, when everybody else I know refers to conformed dimensions, alas.
In his early writings Inmon suggested a 3NF data modeling style for the central hub. That turned out to be a problematic approach (see this paper). Later Inmon updated his vision to recommend Data Vault style modeling for the central hub. Although I personally have always felt the Kimball approach to data warehousing appeared to have gained (far) more popularity, there is no compelling evidence either way as to which is superior.
Many of the contemporary BI solutions are a response to, a mechanism to cope with constraints in existing technology (mostly RDBMS’s). For example: dimensional storage deliberately “uses” redundancy to expedite query performance. Once you historicize data in a star schema (dimensional model) from an underlying 3NF source system, query writing becomes easier (more intuitive, but also less joins), and also a lot faster. So to some extent, these solutions were architected to cope with limitations in RDBMS performance. As another example: the reason BI systems are mostly designed with insert only loading processes has to do with notoriously slow insert-updates. Wherever possible, those are replaced with insert-only procedures. Extensive use of surrogate keys has a similar origin, largely geared towards yesteryear’s hardware needs (indexing strategies).
These dimensional solutions are not without their problems, however. Incremental loading mechanisms in historicized databases can be notoriously complex. This has been one of the drivers behind the rise of data warehouse automation tools: they take away some of the inherently error prone coding by humans. When a tool generates all the loading SQL statements, they still need to be tested, of course. But once the logic has been shown to work, it tends to be more readable, and better maintainable. This code will also contain far less nasty bugs.
Another problem with the Kimball approach to data warehouse architecture is that the Bus matrix, in its entirety, needs to be specified upfront (the dreaded “Big Design Up Front”, BDUF). Making changes later on is rather problematic. If you want to add just a single field in a dimension table, there is now a problem with history: what to do with the fact table records that are associated with this dimension record for the time preceding inclusion of the new field? “Some” history for this attribute may (still) be available, but how to deal with it? If you leave the historical records “NULL”, this encumbers query logic somewhat, and it isn’t true to reality. Setting all the historic records to the current value isn’t pretty, either. Obviously at least some history is usually available, and there is a reason why you want to include this attribute: it matters because it changes! However, in the Kimball architecture, trying to insert the (full) history available basically implies you need to redo the initial load again. This will be a heavy CPU load, potentially fraught with some tricky data logistics challenges because all archived source data files need to be made available, again.
As an alternative to incremental loading, recently a new architecture has seen the light: the so-called Lambda architecture, as described by Marz & Warren. Their core premise for suggesting the Lambda architecture is that complexity in software development patterns is a known threat to the validity of your product: humans can cope with only so much complexity, and after that errors are likely to creep in… Incremental loading patterns and early arriving facts/late arriving dimension records add complexity. Although all of those patterns can be solved, if you try to take on enough of them, you are exponentially more likely to miss a trick, from time to time, and introduce errors.
To the best of my knowledge, few data warehouse teams have embraced the Lambda Architecture so far. The massive recomputation required that replaces incremental loading is meant to “save” on complexity, at the expense of CPU’s and I/O required. Not all hardware currently in use will be up to this task! But if you consider this from a TCO perspective, hardware has become less and less expensive, relative to human resources required to debug and maintain complex systems.
So where does dimensional modeling “still” fit in, then? I would argue that the reasons behind incremental loading in a Bus matrix data warehouse are gradually evaporating, mostly due to technical evolution and price erosion of hardware. However, I assert that there still is, and probably always will be a place for dimensional modeling. I know of no approach that so accurately and validly captures “true” end-user requirements.
Although business users may not be “data modeling savvy”, people quite naturally think in dimensional models! As Ralph Kimball once said, dimensions naturally present attributes that you would drag and drop on a report. This is a very natural way of exploring the data for business users. Corr & Stagnitto’s book (2011) points out that human language can be translated into dimensional structures. I want to see X (Fact Table measures) by Y and Z (dimension Foreign Keys). This relation between human thinking (language) and data model structures is also at the heart of the more generic FCO-IM, for instance.
My perspective is that the Dimensional model as an incrementally loaded physical data warehouse will increasingly get replaced by the output of a function running on “all the staging data”, exactly as proposed in the Lambda Architecture. Their development patterns are simpler, and hence less error prone. And maybe just as important: it moves us further toward continuous integration. Some of the Ensemble Modeling techniques (Anchor Modeling, Data Vault, etc.) bear striking resemblance with the graph approach that Marz & Warren propose, as my esteemed colleague Koos van Strien has pointed out (“Lambda Architecture for the DWH”).
These innovative data warehouse development approaches add a lot of agility to your solution when you need to change a dimension or a fact table. In the “old world”, as your EDW system grows, making changes becomes increasingly complex and cumbersome. There are no two ways around it: this effectively leads to creating new legacy. The major reason that the Data Vault community “rebelled” against bus matrix architectures, if I remember this well, was because everyone who had been managing Kimball style data warehouses for a couple of years would have their own “horror stories” of the maintenance nightmare their rigidity creates.
The Inmon hub-and-spoke architecture that is common when you apply the Data Vault modeling paradigm copes with part of this problem. Dimensional data marts live downstream from the hub, and when you recreate those after a change, all of the source data that were ever available to the data warehouse team, are always “on-line.” So rebuilding a data mart is much, much less onerous. This is how that architecture elegantly addresses the BDUF problem of the Kimball Bus matrix.
One of the prime reasons why the Data Vault community embraced virtual data marts was better agility, at the (slight) expense of just one extra join (the View). Rock star Roelant Vos has constantly been on the cutting edge of that charge, and his blog is well worth the read if these topics interest you. Downstream from the Data Vault hub, a persisting and integrated System-of-Record, enables agility that the Kimball architecture can’t provide due to the problems of handling history after a change in a dimension needs to be implemented, and that I referenced above. Working with Dimensional Views, you can make a tiny update, and with the right permissions, change control in place, etc., push this to production immediately (see this use case I was involved with).
For a while now (5-10 years), the majority of newly built data warehouses in the Netherlands are being built in this fashion (sorry, I have no scientific numbers, but the conference scene converged on a 75% estimate). I consider this evolution progress, but the temporal data loading and query logic is still relatively complex – hard enough for many people to pose serious challenges. Again, this is where solutions grow in complexity beyond what is prudent in light of known human limitations in handling complexity. I would argue that the design pattern of incremental loading is non-essential complexity.
The way forward I see is to replace the convoluted data manipulation of morphing and appending history, by a straightforward and relatively simple change to the code of your dimensional function (Lambda style) that can grow organically. When you then re-execute that function, you have effectively traded in considerable and avoidable complexity for the CPU’s and I/O required to redo your initial load. If you then manage to keep doing that for every single change, I foresee unprecedented Agility.
Special thanks to my friend and thought leader Stephan Deblois for guiding me in these innovative approaches to data warehousing, and also for reviewing and augmenting this blog entry.
Complementary reading on Dimensional Modeling in the age of big data: