16 December 2018
“Big Data” are often characterized by the “3 V’s”: Volume, Velocity, and Variety. The three V’s are commonly attributed to Doug Laney and his 2001 paper “3D Data Management.” Later (2014), a fourth, commonly recognized V was added: “Veracity,” which reflects how much uncertainty, noise, or ambiguity there is in the data. Why stop there? A fifth V, “Value,” soon emerged (2014). And before long (2017), someone came up with “7 V’s,” then 10 V’s, and even a satirical “42 V’s of Big Data and Data Science.”
Big Data are a big deal. In 2011, the McKinsey Global Institute dubbed big data the fourth industrial revolution. But although big data can be characterized in terms of the three V’s, they are in no way defined by them. As such, the three V’s are useful for PR and marketing, but they do little to define or delineate big data applications. What do I mean by that? Even when none of the three V’s is present, you can still have a “big data application.”
Do the 3 V’s define big data?
For example, take the development of modern-day dive computers for recreational divers. Nowadays, almost all amateur scuba divers plan their dives with computers. These devices are less error-prone than humans, and they generally allow for considerably longer dives while still staying well within safe boundaries against decompression sickness. Without a computer, you need to plan your dive with a Recreational Dive Planner table and monitor “bottom time” with analog devices like a watch and depth gauge.
It used to be (prior to the 1980s) that scuba divers planned their dives with tables developed from data gathered on US Navy divers. In controlled experiments, subjects were exposed to extreme dive profiles to determine boundaries for safe diving depth and duration. Those tables were initially developed in 1937, and later adjusted for recreational diving and published in 1956.
When recreational scuba diving became ever more popular with the masses, the industry began to realize that the typical US Navy diver could be rather different from the average amateur recreational diver. So new experiments were conducted in 1987 (and extended in 1989) under the supervision of Dr Raymond Rogers, with a group of 911 subjects that more closely resembled “average” divers. Extensive analysis and algorithm testing followed, leading to the current PADI Recreational Dive Planner tables, tested by DSAT (Diving Science and Technology).
Since these data are quite old, Velocity is absent. With only 911 subjects, Volume is very small too, although hundreds of test dives were conducted. Variety is all but absent: almost all data are structured, covering a very limited number of metrics such as dive time, depth, and whether symptoms of decompression sickness occurred or not. Nonetheless, just about anyone in the discipline would consider the work to develop more accurate and valid models for predicting the risk of decompression sickness a worthy application of data science and big data. With zero out of three V’s present. A look at the history of decompression research and development shows how profound and versatile the development and analysis have been: arguably as advanced as in any other discipline, while more and more data are combined and contrasted to keep improving divers’ safety.
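To give a flavor of the modeling involved: decompression algorithms in the classic Haldanean tradition track how fast inert gas loads into hypothetical tissue compartments, each with its own half-time. The sketch below illustrates just that single idea; the half-time and pressure figures are illustrative assumptions on my part, not PADI’s or DSAT’s actual algorithm.

```python
import math

def tissue_pressure(p_start, p_ambient, half_time_min, minutes):
    """Haldanean single-compartment gas loading: the tissue's inert-gas
    pressure approaches ambient pressure exponentially, closing half of
    the remaining gap every half-time."""
    k = math.log(2) / half_time_min          # rate constant per minute
    return p_ambient + (p_start - p_ambient) * math.exp(-k * minutes)

# Illustrative numbers: nitrogen partial pressure at the surface (~0.79 bar)
# versus at 30 m depth (~3.16 bar), for a hypothetical 5-minute compartment.
after_one_half_time = tissue_pressure(0.79, 3.16, 5, 5)
# After exactly one half-time the compartment sits halfway between the
# starting and ambient pressures: (0.79 + 3.16) / 2 = 1.975 bar.
```

Real dive computers track many such compartments at once and compare each against empirically fitted tolerance limits; data from studies like Rogers’ are what such limits are calibrated on.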
Awareness of the 3 V’s
Socializing these “3 V’s” has done a lot to create broad awareness of big data and its importance for driving innovation and competitive advantage. I think this is a good thing, because the emergence of big data largely followed Moore’s Law, and although exponential growth is a phenomenon whose impact typically defies human imagination, it creeps up on us “slowly,” in a gradual, piecemeal fashion. Precisely because the growth rate has been spectacular for over half a century already, its impact is easily overlooked.
When I am having drinks at our neighborhood BBQ and tell people I work in data science, it’s quite surprising how many of them immediately (and unprompted) associate it with big data. All the airline magazines and news flashes must be having some impact. Interestingly, the concept “sticks,” so presumably the three V’s have worked to stir laypeople’s imagination.
Here is my concern with the 3 V’s: they do nothing to define big data. None of the three V’s is a necessary or a sufficient condition for describing, or even pinning down, big data analytics. You can have big data solutions without any large Volumes: in many healthcare applications, medical scientists do amazing work but typically need to rely on bootstrap resampling because there are so few records, especially for rare conditions. Macro-economic models might look at economic cycles measured in decades, which is hardly impressive data Velocity. Nor is Variety required for big data analytics. Think of clickstream analysis, for instance, where you only look at web clicks, albeit billions of them.
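As an aside on the small-Volume point: the bootstrap resamples the few records you do have, with replacement, to estimate the uncertainty of a statistic. A minimal sketch, where the sample values and names are my own illustration rather than any real medical data:

```python
import random

def bootstrap_ci(data, stat, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap: resample the data with replacement many
    times, recompute the statistic each time, and read the confidence
    interval off the empirical distribution of those estimates."""
    rng = random.Random(seed)
    estimates = sorted(
        stat([rng.choice(data) for _ in data]) for _ in range(n_resamples)
    )
    lo = estimates[int((alpha / 2) * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

# Tiny, hypothetical sample of patient measurements; the point is that
# the method still yields an uncertainty estimate with very few records.
sample = [4.1, 5.3, 3.8, 6.0, 5.5, 4.7, 5.1, 4.9]
mean = lambda xs: sum(xs) / len(xs)
low, high = bootstrap_ci(sample, mean)
```

The confidence interval comes straight from the resampled estimates, with no distributional assumptions, which is exactly why the technique is popular when records are scarce.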
Maybe, just maybe, we are ready to enter a new phase with big data. Awareness is considerable, and data science is gaining traction as a discipline. If we continue to confine ourselves to V’s that need not even apply, our progress could stall. Managing expectations is a full-time job, and only with adequate expectations can we hope to break through organizational gridlock and give data science and analytics the place they deserve on the corporate agenda.