Big Data and Virtualization

Tom Breur

12 July 2016

Big Data are here to stay. Interestingly, there’s a paradox in that the data itself is changing shape and definition so quickly that most of it has only limited value for permanent, fast-retrievable storage. There’s a reason why all Big Data providers are so keen to provide SQL emulators: relational access is what analysts and their tools already know.

My guess is that the current generation of data virtualization tools are a kind of “half-way house”: they provide a valuable connection mechanism that still relies on good old SQL optimization, to minimize the need for actually moving (and making redundant copies of) the data. NoSQL front-end solutions that provide quasi-SQL functionality are often buggy, and always incomplete: they never provide the full functionality of SQL. But the relational model, a proven mathematical approach to bringing disparate sources together, can’t logically be replaced.

At the moment, we do need to cope with limitations (cumbersome programming languages) and inefficiencies (mostly lack of scalability) in database technologies. My hope is we will wind up somewhere different, while retaining the best of both worlds (SQL and NoSQL). That’s why I see virtualization as a half-way house, a place to stay before we set data analysts free.

The data we hold in our traditional data warehouse are “still” there, and are being merged with ever more sources. This has the net effect of enhancing or enriching existing data sets. Robust and efficient data integration is almost guaranteed to become one of the next frontiers, and business people are beginning to realize it is hard work, a genuinely difficult task. Yet without integration, information silos cannot possibly live up to their potential.

Big Data mostly live in NoSQL solutions, for reasons of cost and performance. Although traditional (SQL) access to data is the bedrock of BI, it has proven inadequate for contemporary data analytics needs: it is just too slow and expensive. Since it is obvious there is no commercially viable way to store all these Big Data in your existing (traditional) data warehouse solution, you can either move your “small data” to a NoSQL environment, or you will need some bridge. That is where data virtualization comes in.

Your data warehouse has already proven its value, and is not going to be replaced, at least not just yet. Moving it to NoSQL doesn’t add value or save costs until you “kill” the old data warehouse. But on top of that, data warehousing in the NoSQL world isn’t nearly mature enough to fulfill that role. And at the current pace, that stage of maturity doesn’t even appear on the horizon. All the reports and datasets that come out of your existing data warehouse, which I refer to as “small data”, guide the organization. Once customer segments have been established, the organization learns to think in terms of these groups as if they really existed, beyond the label we attach to database records.

Big Data are too fickle, too volatile; the business case for permanently storing all of them is too unclear. The interface specifications are uncertain and malleable, all of which makes ETL-type data integration problematic and risky. Calling a development team “Agile” doesn’t solve that problem. At all. I would argue that the reason NoSQL solutions gained their popularity is because “traditional” BI simply couldn’t keep up with the pace of development and innovation that businesses were displaying.

I have written and spoken a lot about data virtualization, which will fill a gap here, but it’s only a technical stopgap, albeit useful and valuable at times. As I wrote a few days ago, Big Data alone won’t get you very far. Big Data alone are “just” another cumbersome silo that requires even more data preparation than we as analysts are already used to. It is commonly estimated that about 90% of analysts’ time is spent merely preparing the data, before any value can be extracted from it. That situation is often even worse with the new-ish, semi-structured Big Data solutions. Not a good use of your expensive and scarce data scientists.

Data virtualization can help to make fickle Big Data, with a relatively short half-life, efficiently accessible for analysis. When there is no use case (yet) to move these data to your “small data” environment (typically persistent, expensive relational storage), then at least you can join data in a flexible and inexpensive way. I use the word “join” here in the common vernacular, not in the technical sense a BI developer might think of. Because without that (potentially virtual) link between Big Data and traditional sources, results are probably lacking essential (business) context.
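To make that idea of a (virtual) join concrete, here is a minimal Python sketch. Everything in it is invented for illustration: an in-memory SQLite table stands in for the “small data” warehouse (a customer-segment table), and a handful of JSON documents stand in for semi-structured Big Data events. The point is that the events are enriched with business context at query time, without ever being copied into the relational store.

```python
import json
import sqlite3

# Hypothetical "small data": a customer-segment table in a relational store.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (customer_id TEXT PRIMARY KEY, segment TEXT)")
warehouse.executemany("INSERT INTO customers VALUES (?, ?)",
                      [("c1", "premium"), ("c2", "basic")])

# Hypothetical "Big Data": semi-structured clickstream events, as they might
# arrive from a NoSQL store, one JSON document per event.
events = [json.loads(line) for line in [
    '{"customer_id": "c1", "action": "view"}',
    '{"customer_id": "c2", "action": "click"}',
    '{"customer_id": "c1", "action": "click"}',
]]

# The "virtual join": look up business context (the segment) for each event
# on the fly, instead of loading the events into the warehouse first.
segments = dict(warehouse.execute("SELECT customer_id, segment FROM customers"))
enriched = [{**event, "segment": segments.get(event["customer_id"])}
            for event in events]
```

A real data virtualization layer would of course push such lookups into an optimized federated query plan rather than pull everything into application memory; the sketch only shows the logical shape of the operation.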


