13 November 2016
In the 90s, when I got my hands on the first “real” data warehouse I ever worked with, my notion of data latency was bounded by the solutions I had seen until then. I had a completely different mindset and different expectations, because this was a state-of-the-art data warehouse. It never occurred to me that a monthly refresh rate, with the new release becoming available towards the end of the month (sic), would one day be completely unacceptable.
Nowadays, I would say that for the majority of BI users, a daily update rate for an Enterprise Data Warehouse has become the norm. Some solutions schedule a few updates per day, which brings the latency down from at least 24 hours to about 6 or 8 hours. I haven’t seen many analytical platforms that have succeeded in making their data available faster than that. And even when they have the know-how and technology to do so, there often just isn’t a business case to justify it.
The reason refreshing data once a day is mostly sufficient is largely that primary business processes don’t offer enough opportunities to act more quickly on data, even if it were available. One of the questions I like to ask people when we discuss more frequent load schedules is: “What decisions would you make differently if the data were available sooner?” Their answer often shows that the humans involved in making “data” decisions are rarely able to react as fast as the technology might.
Now that is obviously a chicken-and-egg dynamic, because when the data are “late”, you can hardly find ways to act on them sooner. A once-a-day refresh is also a natural limit that we have grown to accept from the Kimball (dimensional) paradigm, and the corresponding approach of physically moving data into a star schema. Loading more often is problematic in part because you would run into disproportionately more early arriving facts, or late arriving dimensions: two sides of the same coin. But what if we simply “dreamed” these latency constraints away, much like I no longer consider monthly updates “normal”?
Yesterday’s relational technology and, in part, the prohibitive cost of scalable hardware created a lower bound of daily, or maybe hourly, updates. As the arrival rates of various interfaces become more variable, and hiccups in source data provisioning become more frequent, a balancing loop prevents us from driving down latency. But many people (like thought leader Roelant Vos, who has been advocating this for a while) now use virtual star schemas as the default mode of presenting data. With those, the limitation of physically loading the data obviously goes away…
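To make the idea of a virtual star schema a bit more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and view names (raw_orders, dim_customer, fact_orders) and the sample rows are made up for illustration, and this is not Roelant Vos' actual design; it only shows the principle that when the dimension and the fact are views over the raw data, a newly arrived row is visible to the star schema without any load step.

```python
import sqlite3

# Hypothetical raw (staging) table plus a star schema defined purely as views.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE raw_orders (
    order_id         INTEGER PRIMARY KEY,
    customer_name    TEXT,
    customer_country TEXT,
    amount           REAL
);

-- Dimension: derived on the fly, nothing is physically loaded.
CREATE VIEW dim_customer AS
SELECT DISTINCT customer_name AS customer_key,
                customer_name,
                customer_country
FROM raw_orders;

-- Fact: also just a projection of the raw data.
CREATE VIEW fact_orders AS
SELECT order_id, customer_name AS customer_key, amount
FROM raw_orders;
""")


def revenue_by_country(connection):
    """Query the virtual star schema the way a BI tool would."""
    rows = connection.execute("""
        SELECT d.customer_country, SUM(f.amount)
        FROM fact_orders AS f
        JOIN dim_customer AS d ON d.customer_key = f.customer_key
        GROUP BY d.customer_country
        ORDER BY d.customer_country
    """).fetchall()
    return dict(rows)


conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?, ?)",
    [(1, "Acme", "NL", 100.0), (2, "Globex", "US", 250.0)],
)
before = revenue_by_country(conn)

# A new raw row arrives; there is no ETL run between this insert and the query.
conn.execute("INSERT INTO raw_orders VALUES (3, 'Acme', 'NL', 50.0)")
after = revenue_by_country(conn)
```

The price you pay is that the join and aggregation work moves from load time to query time, which is exactly why fast storage and parallel processing matter for this approach.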
The management of the raw and business-oriented data can now be done with very fast distributed storage (like HDFS), using in-memory parallel processing tools that enable you to move even further away from the physical star schema. New data architectures (like the Lambda Architecture) combine rich batch data (for example, updated daily) with real-time streaming data (added as it arrives during the day). The combined batch and real-time data are then presented as a virtual star schema to your regular BI tools, in much the same way as in Roelant Vos’ recommended architecture.
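The way a Lambda-style architecture combines the two kinds of data can be sketched in a few lines of Python. This is a deliberately simplified toy, not a reference implementation: the class name LambdaStore, its methods, and the sample events are all hypothetical, and it assumes that every batch run covers all events seen so far, so the real-time view can simply be reset afterwards.

```python
from collections import defaultdict


class LambdaStore:
    """Toy serving layer: answers = batch view + real-time (speed) view."""

    def __init__(self):
        self.batch_view = {}                  # rebuilt by the batch run, e.g. daily
        self.speed_view = defaultdict(float)  # updated per event during the day

    def run_batch(self, master_events):
        """Recompute the batch view from the full event history."""
        totals = defaultdict(float)
        for key, amount in master_events:
            totals[key] += amount
        self.batch_view = dict(totals)
        # Assumption: the batch now covers everything, so the speed view restarts.
        self.speed_view.clear()

    def on_event(self, key, amount):
        """Fold a streaming event into the real-time view as it arrives."""
        self.speed_view[key] += amount

    def query(self, key):
        """Merge both views at query time."""
        return self.batch_view.get(key, 0.0) + self.speed_view.get(key, 0.0)


store = LambdaStore()
store.run_batch([("NL", 100.0), ("US", 250.0)])  # last night's batch
store.on_event("NL", 50.0)                       # arrives during the day
nl_total = store.query("NL")
us_total = store.query("US")
```

In a real deployment the hand-over between the batch and the real-time view is more delicate than a simple reset, because events keep arriving while the batch is running.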
In tomorrow’s solutions, the divide between operational and analytical systems will gradually disappear. In particular, as the volume of data goes up to gargantuan numbers, you simply cannot afford to copy or move data to a dedicated analytical environment. From there, the step to near real-time analytics all of a sudden becomes feasible. As Jay Kreps has argued, there are even genuine architectural benefits to managing the disruptions in data flows across heterogeneous sources that way. Let it flow…