21 July 2016
“Big Data” seems to be the latest hype. For the time being, it is mostly storage providers who stand to gain from it. They love seeing terabytes or even petabytes being stored in a feast of technological advances. Small wonder they are so eager to embrace this trend.
But besides storage, cloud, and hardware vendors, research firms have also discovered Big Data. Gartner: “Big Data will represent a hugely disruptive force during the next five years, enabling levels of insight that are currently unachievable through any other means”, and “Information is the oil of the 21st century, and analytics is the combustion engine.” McKinsey Global Institute: “Analyzing large data sets —so-called Big Data— will become a key basis of competition, underpinning new waves of productivity growth, innovation, and consumer surplus.” These messages have raised awareness of Big Data, all the way up to the boardroom. Research firm IDC predicts spending on Big Data will have grown to $50 billion by 2019; serious business indeed.
Unfortunately, there is no consensus on what “Big Data” really is… Definitions like the three V’s (Volume, Variety, Velocity) do little if anything to make that clearer. The only common denominators seem to be exorbitant volumes and loosely structured formats. Both unstructured and semi-structured formats refer to incomplete data modeling, which means you defer explicit modeling until you apply the data. Apart from traditional structured data, the lion’s share of growth originates with unstructured or semi-structured data. Think pictures, audio, video, RFID, logs, GPS, clickstream data, social media, etc.
In the past, “traditional BI” leveraged man-machine interaction data: a person does something, and that electronic trail gets recorded. This was the bread and butter of traditional data warehousing and CRM. Big Data is more often driven by machine-machine interactions: RFID chips, mobile phones, etc. create a huge electronic trail without any human ever taking voluntary action (other than, perhaps, agreeing to grant access to location details).
The “Big Data” hype
As with any hype cycle, it starts out with a phase where everyone wants to join in; witness all the storage vendors encouraging us to come on board before this ship leaves the harbor. But do you remember CRM? When Siebel was as big as Oracle is now? After the disillusion, you enter a phase (the trough) where everybody wonders: “How could I be so stupid (again…)? What was I thinking??” I feel that only after that do you begin to see real innovations.
[Figure: Gartner hype cycle]
I’ve been working with data for three decades now, and am often left wondering: “What is so new about all this?!?” Some technology (like NoSQL) is new, cloud storage is getting a lot cheaper which opens up new business cases that were not economically feasible before, but that’s about it.
Think back, for a minute, to the introduction of fiber-optic cable a few decades ago. The promise of an interconnected world: the internet was the best thing since sliced bread, and everyone needed fiber. It was “obvious” how this would lead to commercial success. Then came the 2000 internet bust (oh, and ADSL), and we’ve lived through another cycle of innovations, and we’re still tapping into fiber we put into the ground a few decades ago! “Triple play” (selling TV, phone, and internet as a bundle) seems to be one of the few innovations of that era that has stayed with us.
I am concerned “Big Data” will show similar cyclical patterns. I genuinely believe in the value of data, and have built my career around it. But I am still not convinced that MPP hardware is what you “need” in all cases, and I find a lot of this technology immature (alas, managing some of it has provided me with substantial “job security”…). A Ferrari surely can go faster than a Toyota Corolla, but the latter doesn’t need nearly as much maintenance. And when it comes to everyday commuting traffic, they will both get you to your destination about equally fast. And the Toyota handles speed bumps a lot better!
Big Data brings its own set of maintenance challenges, though. Hadoop –at least for the time being– appears to be the dominant Big Data platform. Together with products like Impala, HBase, Hive, Pig, and ZooKeeper, to mention just a few, it offers a compelling ecosystem for managing and coordinating distributed applications in support of BI. Still, compared to old-school data warehouse tools, the management overhead is much bigger, and finding talent to grow your team is (still) a lot harder! That all adds to indirect costs.
“Big Data” analytics
“Big Data” analytics is the obvious next step in this evolution. For many companies this is a little speck on the horizon, a target that shows up in many polls on this subject. But what are we talking about here? In the “old school” data warehouse we always received more data from operational systems than we stored. So I fail to see any valid connection between “Big Data” and “Volume”, unless you take into account the diversity of data types (“Variety”) that are available in typical Big Data solutions. These semi-structured data, however, are not (yet) amenable to structured analysis with the tools we currently have at our disposal: those all (!) still require the important features to be extracted into some relational format. That is what the current generation of statistical and data mining algorithms all (still) require.
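To make the point about feature extraction concrete, here is a minimal sketch of what turning semi-structured data into the flat, relational layout that mining algorithms expect might look like. The event records and field names are hypothetical, invented purely for illustration; real clickstream data would of course be far messier.

```python
import json

# Hypothetical semi-structured input: clickstream events as JSON lines.
# Note the second event lacks the "ms" field -- typical for loosely
# structured data, where the schema is incomplete by design.
raw_events = [
    '{"user": "u1", "action": "click", "meta": {"page": "/home", "ms": 120}}',
    '{"user": "u2", "action": "view", "meta": {"page": "/cart"}}',
]

def extract_features(line):
    """Flatten one semi-structured event into a fixed-column row."""
    event = json.loads(line)
    meta = event.get("meta", {})
    return {
        "user": event.get("user"),
        "action": event.get("action"),
        "page": meta.get("page"),
        "latency_ms": meta.get("ms"),  # missing field becomes None (NULL)
    }

rows = [extract_features(line) for line in raw_events]
```

The modeling decision (which features matter, how to handle missing fields) is deferred until this extraction step, which is exactly what “incomplete data modeling” means in practice.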
Against that background there are essentially two routes to choose from, two types of architectures to leverage the insights you have gleaned from your data. Either you organize your analytics to leverage the MPP architecture (https://en.wikipedia.org/wiki/Massively_parallel_(computing)), making optimal use of NoSQL capabilities. The alternative is to use the NoSQL source data as input for what is essentially a flat file, probably already aggregated and modeled, that will run in your good old relational database.
The first approach requires specialized, “new-ish” programming skills, because the equivalent of a “Group By” statement in Hadoop comes down to building MapReduce jobs. This can be aided by SQL front-ends (which emulate familiar relational statements) on top of Hadoop, or not. The second option implies that your Hadoop/Hive solution “merely” serves as yet another source system. People call that Big Data analytics, but I take that qualification with a grain of salt. I call it wannabe Big Data analytics :-)
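The correspondence between a SQL “Group By” and a MapReduce job can be sketched in a few lines. This is a toy, single-machine illustration of the map / shuffle-sort / reduce phases, not actual Hadoop code; the sales data is invented for the example.

```python
from itertools import groupby
from operator import itemgetter

# The SQL equivalent: SELECT product, SUM(amount) FROM sales GROUP BY product
sales = [("apples", 3), ("pears", 2), ("apples", 5), ("pears", 1)]

# Map phase: emit (key, value) pairs from each input record.
mapped = [(product, amount) for product, amount in sales]

# Shuffle/sort phase: the framework brings identical keys together
# (here simulated with a plain sort on the key).
mapped.sort(key=itemgetter(0))

# Reduce phase: aggregate all values that share a key.
totals = {key: sum(amount for _, amount in group)
          for key, group in groupby(mapped, key=itemgetter(0))}
```

In a real cluster the map and reduce phases run distributed across many nodes, and the shuffle-sort moves data over the network; that plumbing is precisely what makes hand-written MapReduce jobs so much more laborious than a one-line GROUP BY.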
Analytics as a profession is anything but new. We are still developing and progressing, but a lot has already been achieved. There is little point in investing in Big Data without the corresponding tools and capabilities to leverage all these data. Analytics are obviously the key to this success. “Pure-bred” Big Data analytics are (still) fairly new, and unless your maturity and size justify it, my recommendation is to tread carefully before entering that world. It is not exactly proven technology. Skilled professionals are very hard to come by. I consider productivity of your scarce analytics knowledge workers one of the key focus areas for management. Business cases for Big Data analytics are few and far between.
If you want to avoid falling into the same trap(s) that we’ve seen with earlier hype cycles, analytics will have to prove its value. First. That means solid business cases that compel senior management to invest in pilot projects, because the projected ROI is evident.
ROI could be based on time to break even, or it could be based on quantifying the contribution of a predictive model (see this paper I wrote on that subject: https://tombreur.files.wordpress.com/2016/06/how-to-evaluate-campaign-response-the-relative-contribution-of-data-mining-models-and-marketing-execution-200703.pdf). Sometimes the gains from analytics are intangible, like market innovation or defending one’s reputation. We see that sometimes in social media analytics, but then you still need the same senior management sponsorship.
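A time-to-break-even case is simple to express. The figures below are purely hypothetical, made up for illustration, not drawn from any study:

```python
# Illustrative break-even calculation for an analytics pilot.
# All figures are hypothetical placeholders.
investment = 250_000       # up-front project cost
monthly_gain = 20_000      # incremental monthly contribution from the model

months_to_break_even = investment / monthly_gain  # 12.5 months
```

The hard part, of course, is not the arithmetic but credibly estimating the incremental contribution; that is exactly what the referenced paper addresses.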
“Big Data” are an unmistakable revolution in our profession. Because NoSQL solutions run on relatively inexpensive MPP clusters, IT departments will increasingly consider them an alternative alongside traditional RDBMS solutions. The latter have known limitations in scalability, and can be costly. NoSQL solutions enable elegant linear scaling, so your hardware can grow along with your processing needs.
Data storage is growing fast; research firm IDC estimates some 60-100% per year. It is therefore not a question of if, but rather of when NoSQL (“Big Data”) solutions will replace old-school relational systems. As the market and maturity of these products evolve, and more qualified talent becomes available, the number of compelling business cases will grow.
Maintenance of these NoSQL systems is still a challenge. A lot –if not most– of this software is Open Source, and experience has shown that improvements come slowly but steadily. “Even” Gartner nowadays acknowledges that Open Source is a realistic alternative to “traditional” vendor-supplied products.
The transition from the tried and trusted relational (SQL) systems to the new “Big Data” platforms like Hadoop is a paradigm shift in BI technology. This is one of the reasons why Big Data are considered a disruptive force. Just like the introduction of SQL RDBMS solutions was traumatic at the time (I used to be a COBOL programmer), we will see a caesura in BI history as a result of Big Data. Exciting times!