1 October 2017
Over the last year, I have come across ever more alarming messages regarding the health and viability of the Hadoop ecosystem, most recently a post about “Why Hadoop is dying”. When an analyst firm like Gartner warned us years ago to be wary of the data lake fallacy, you would think CIOs would listen up. But more recently, Gartner predicted: “through 2018, 80% of data lakes will not include effective metadata management capabilities, making them inefficient.” Oops. As anybody in the “old school BI world” will tell you: without proper context (metadata), data is pretty useless. Did anyone ever think that would be any different if you store lots and lots of it? Having more data does not make it “speak for itself.” James Kobielus, Big Data evangelist for IBM, was quoted as saying that “Hadoop declined more rapidly in 2016 from the Big Data landscape than I expected.”
Hadoop has its strengths and weaknesses; this year marks its 10-year anniversary. If you are processing huge volumes of data (say, 50-100 TB or more) comprising diverse, multi-structured data types (relational, text, audio/video/images, log files, JSON, XML, etc.), and you are processing all these data in batch mode, then you probably have a good use case for Hadoop. As to the appropriate size, Kashif Saiyed has written on KDnuggets: “You don’t need Hadoop if you don’t really have a problem of huge data volumes in your enterprise, so hundreds of enterprises were hugely disappointed by their useless 2 to 10 TB Hadoop clusters – Hadoop technology just doesn’t shine at that scale.” Earlier this year at the Gartner Data & Analytics Conference (6 March 2017) I recorded Nick Heudecker (Gartner’s main Hadoop analyst) saying: “If you’re thinking about setting up a Hadoop environment and dunking four terabytes of data in it? Don’t bother… The technology and complexity isn’t worth it.”
Many companies adopted Hadoop “because everyone else did” – to keep up with the Joneses, so to speak. Often, however, they simply don’t have enough data to warrant a Hadoop rollout. Now they have their Hadoop cluster up and running, only to discover that data management is really hard in Hadoop. It is not a jack of all trades; this is one of the reasons that over 100 open source projects are currently underway to fill gaps in its functionality. The hype probably played a significant role in overstating Hadoop’s capabilities. Another Gartner prediction: “… through 2018, 70% of Hadoop deployments will fail to meet cost savings and revenue generation objectives due to skills and integration challenges.” Everybody seems to be scrambling for talent, so that skills gap appears pervasive.
Another disconnect is that Hadoop is not really geared to real-time processing. For real-time or near real-time workloads, Spark is typically more suitable: its analytics engine was developed for in-memory parallel processing, which leverages hardware more efficiently for real-time applications. Hadoop is suited to storage and batch processing; low-latency, unpredictable (analytic) queries are not its forte. NoSQL solutions (like Hadoop) are evolving rapidly to support SQL or SQL-like access mechanisms, although those interfaces are known to be remarkably slow. SQL access does, however, allow interoperability with traditional systems and supports governance through the relational model: currently we have no real alternative to SQL for auditability and traceability of data flows.
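To make the batch point concrete, here is a minimal pure-Python sketch (illustrative only, not actual Hadoop code) of the classic MapReduce word-count flow. Every job must complete its map, shuffle, and reduce phases over the entire input before any result is available – which is exactly why low-latency, ad hoc queries are a poor fit for this model.

```python
from collections import defaultdict

def map_phase(lines):
    # Emit (key, 1) for every word -- analogous to Hadoop's Mapper.
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def shuffle_phase(pairs):
    # Group all values by key. Hadoop does this with a full sort/merge
    # across the cluster, so nothing is visible until all mappers finish.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate each key's values -- analogous to Hadoop's Reducer.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data is big", "data lakes hold data"]
counts = reduce_phase(shuffle_phase(map_phase(lines)))
print(counts)
```

Spark follows the same functional style but keeps intermediate results in memory across stages, which is what makes it far better suited to iterative and near real-time workloads.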
Investors must have noticed some of these trends, too, and valuations of “giants” like Cloudera, Hortonworks, and MapR have not been faring well. Cloudera had been expected to go public for a few years, and on 28 April 2017 it finally did.
A historic look at Hortonworks’ stock price isn’t rosy:
Comparing this trend to the NASDAQ makes Hortonworks’ valuation look even more alarming:
Cloudera went public on 28 April 2017, so only time will tell how its valuation develops in the long run. This year it has lagged the NASDAQ by a considerable margin. One press release from last year, however, does shed some light: Boston-based Fidelity, through its Contrafund mutual fund, marked down the value of its stake in Cloudera by almost 37%: http://www.reuters.com/article/us-funds-fidelity-ipo-idUSKCN0WW1EO (news from 30 March 2016). Last month (15 September 2017) Cloudera filed an S-1, in itself not unusual for a company that has recently gone public. Where analysts are wary, though, is further dilution of value through additional share issues, and uncertainty about who is likely to be selling significant volumes. Revenue targets (a $358M sales target for 2017) that were appealing at the previous share volume levels now trigger much less appetite.
The concern in the market seems to be that these start-ups keep raising money (Cloudera raised $1 billion over eight rounds) without meeting their growth targets. If you have been paying attention (but even if you haven’t), it’s not for lack of marketing spend! The hype around Hadoop and Big Data may have tempted some people to ignore that, although there are certainly some valid and impressive use cases, not everybody is ready to analyze petabytes of data – even if they have that much data to begin with. Use cases with petabytes of multi-structured data simply aren’t for everyone. And unless you are managing petabytes of data, Hadoop is certainly not the best solution for enterprise data warehousing, or business intelligence for that matter.
Using Hadoop, data management is a cumbersome affair. Yet many business people refer to Hadoop as their “data warehouse”, which tells us foremost that people usually don’t know what data warehousing really entails. According to Bill Inmon, the godfather of data warehousing, a data warehouse is “a subject-oriented, integrated, time-variant, non-volatile collection of data in support of management’s decision-making process.” Your typical data lake doesn’t exhibit these qualities. And Ralph Kimball’s more pragmatic definition of a data warehouse – “a source of data that is optimized for query access” – surely doesn’t apply to Hadoop implementations, either!
Despite all these problems, I have grown increasingly enamored with the “data lake” concept: by dramatically lowering the cost of storage, and driving down the effort (and hence cost…) of getting data into a very lightly governed environment, organizations now have the option (!) to store data without a clearly defined business case. What that implies is that the price you pay for management and overhead of your Hadoop environment buys you the option to persist data. Effectively, this means that the costs of your traditional Persistent Staging Area (PSA) need to be compared like-for-like with your Hadoop management overhead. Nowadays there are ever more economical cloud alternatives for this. But when Hadoop turns out to have a lower TCO, and if buying this storage option promises sufficient value, then Hadoop might make sense. Obviously, there are other use cases, too.
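That like-for-like comparison can be sketched as a back-of-the-envelope calculation. All figures below are hypothetical placeholders, not vendor quotes or benchmarks – substitute your own storage prices and staffing costs.

```python
def annual_tco(storage_tb: float, cost_per_tb: float,
               admin_fte: float, fte_cost: float) -> float:
    """Crude annual total cost of ownership: storage plus administration."""
    return storage_tb * cost_per_tb + admin_fte * fte_cost

# Hypothetical, illustrative figures only.
hadoop_psa = annual_tco(storage_tb=100, cost_per_tb=250,
                        admin_fte=2.0, fte_cost=120_000)
rdbms_psa = annual_tco(storage_tb=100, cost_per_tb=1_000,
                       admin_fte=0.5, fte_cost=120_000)

print(f"Hadoop PSA: ${hadoop_psa:,.0f}/yr, relational PSA: ${rdbms_psa:,.0f}/yr")
```

The point of the exercise: cheap raw storage can be offset entirely by management overhead (and vice versa), so neither the $/TB figure nor the admin headcount tells the story on its own.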
Needless to say, you don’t want this data lake to turn into a data landfill. If you don’t consider what you eventually plan to do with all these data until you discover you would really have preferred to organize them differently, the ease of getting data in comes around to bite you in the bum. And if you defer source data profiling until you discover fundamental shortcomings that make these data unfit for purpose, you are looking at cheap storage holding a smelly pile of crap.
As an alternative, nobody ever said that this inexpensive landing zone for large volumes of data couldn’t be a relational store… The price of disk space and even SSDs keeps going down, and the innovative business models that traditional (and new) RDBMS vendors now offer to stay competitive allow you to have your cake and eat it, too. In case you hadn’t noticed, there is a price war going on among cloud vendors, and their customers are loving it! Storing data in Hadoop is cheap, but managing those data is not; getting data in may be relatively easy, but getting them out is not. That is where the good old craft of data modeling still holds its own, and where data integration and semantic layers still come into play. Context is king: you just can’t extract any value from your data without that relevant and necessary context.
Until you ascertain that your business goal can, at least in principle, be achieved with your data, the data you are storing are just “Work in Progress” (WIP). And as Throughput Accounting and the Theory of Constraints have taught us: WIP is a liability, not an asset (see this presentation). Data does not generate any value until you get it into the hands of end users; until then, it represents merely a liability.