Volume, velocity, and variety: Understanding the three V’s of big data
For those struggling to understand big data, there are three key concepts that can help: volume, velocity, and variety. These three vectors describe how big data differs so sharply from old-school data management.
We practitioners of the technological arts have a tendency to use specialized jargon. That’s not unusual. Most guilds, priesthoods, and professions have had their own style of communication, either for convenience or to establish a sense of exclusivity. In technology, we also tend to attach very simple buzzwords to very complex topics, and then expect the rest of the world to go along for the ride.
Take, for example, the tag team of “cloud” and “big data.” The term “cloud” came about because we systems engineers used to draw network diagrams of local area networks. Between the LANs, we’d draw a cloud-like jumble meant to refer to, pretty much, “the undefined stuff in between.” Of course, the Internet became the ultimate undefined stuff in between, and the cloud became The Cloud.
To Mom and Dad and Janice in Accounting, “The Cloud” means the place where you store your photos and other stuff. Many people don’t really know that “cloud” is a shorthand, and the reality of the cloud is the growth of almost unimaginably huge data centers holding vast quantities of information.
Big data is another one of those shorthand words, but this is one that Janice in Accounting and Jack in Marketing and Bob on the board really do need to understand. Not only can big data answer big questions and open new doors to opportunity, but your competitors are also using it for their own competitive advantage.
That, of course, raises the question: what is big data? The answer, like most in tech, depends on your perspective. Here's a good way to think of it: big data is data that's too big for traditional data management to handle. Big, of course, is also subjective. That's why we'll describe it according to three vectors, the three Vs: volume, velocity, and variety.
Volume
Volume is the V most associated with big data because, well, volume can be big. What we’re talking about here is quantities of data that reach almost incomprehensible proportions.
Facebook, for example, stores photographs. That statement doesn’t begin to boggle the mind until you start to realize that Facebook has more users than China has people. Each of those users has stored a whole lot of photographs. Facebook is storing roughly 250 billion images.
Can you imagine? Seriously. Go ahead. Try to wrap your head around 250 billion images.
So, in the world of big data, when we start talking about volume, we’re talking about insanely large amounts of data. As we move forward, we’re going to have more and more huge collections. For example, as we add connected sensors to pretty much everything, all that telemetry data will add up.
Or, consider our new world of connected apps. Everyone is carrying a smartphone. Let’s look at a simple example, a to-do list app. More and more vendors are managing app data in the cloud, so users can access their to-do lists across devices. Since many apps use a freemium model, where a free version is used as a loss-leader for a premium version, SaaS-based app vendors tend to have a lot of data to store.
Todoist, for example (the to-do manager I use), has roughly 10 million active installs, according to Google Play. That's not counting all the installs on the Web and iOS. Each of those users has lists of items, and all that data needs to be stored. Todoist is certainly not Facebook scale, but it still stores vastly more data than almost any application did even a decade ago.
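To get a feel for the arithmetic, here's a rough back-of-envelope sketch in Python. The per-user figures are illustrative assumptions of mine, not Todoist's real numbers:

```python
# Back-of-envelope storage estimate for a cloud-synced to-do app.
# The per-user figures are assumed for illustration, not real Todoist numbers.

users = 10_000_000        # ~10 million active installs, per Google Play
items_per_user = 200      # assumed: tasks accumulated per user over time
bytes_per_item = 500      # assumed: task text plus metadata (dates, labels)

total_bytes = users * items_per_user * bytes_per_item
print(f"~{total_bytes / 1e12:.0f} TB of task data")  # ~1 TB
```

A terabyte of plain text is modest next to Facebook, which is exactly the point: even a "small" cloud app now stores more data than most applications did a decade ago.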
Then, of course, there are all the internal enterprise collections of data, ranging from the energy industry to healthcare to national security. All of these industries are generating and capturing vast amounts of data.
That’s the volume vector.
Velocity
Remember our Facebook example? 250 billion images may seem like a lot. But if you want your mind blown, consider this: Facebook users upload more than 900 million photos a day. A day. So that 250 billion number from last year will seem like a drop in the bucket in a few months.
Velocity is the measure of how fast the data is coming in. Facebook has to handle a tsunami of photographs every day. It has to ingest them all, process them, file them, and somehow, later, be able to retrieve them.
Here’s another example. Let’s say you’re running a presidential campaign and you want to know how the folks “out there” are feeling about your candidate right now. How would you do it? One way would be to license some Twitter data from Gnip (recently acquired by Twitter) to grab a constant stream of tweets, and subject them to sentiment analysis.
That feed of Twitter data is often called “the firehose” because so much data (in the form of tweets) is being produced, it feels like being at the business end of a firehose.
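To make the analysis step concrete, here's a minimal sketch of lexicon-based sentiment scoring in Python. The word lists and sample tweets are invented stand-ins; a real campaign would run a trained sentiment model against the licensed firehose:

```python
# A minimal sketch of lexicon-based sentiment scoring over a tweet stream.
# The word lists and sample feed are invented; real pipelines use trained models.
import re

POSITIVE = {"great", "love", "win", "strong"}
NEGATIVE = {"bad", "hate", "lose", "weak"}

def sentiment(text: str) -> int:
    """Score one tweet: +1 per positive word, -1 per negative word."""
    words = re.findall(r"[a-z]+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)

# Stand-in for a live stream of tweets mentioning the candidate.
feed = [
    "Great speech tonight, love the energy, big win",
    "That answer was weak and the optics were bad",
]

net = sum(sentiment(tweet) for tweet in feed)
print("net sentiment:", net)  # above zero leans positive, below negative
```

The hard part at firehose velocity isn't the scoring, it's keeping up: the pipeline has to ingest and score tweets as fast as they arrive, around the clock.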
Here’s another velocity example: packet analysis for cybersecurity. The Internet sends a vast amount of information across the world every second. For an enterprise IT team, a portion of that flood has to travel through firewalls into a corporate network.
Unfortunately, due to the rise in cyberattacks, cybercrime, and cyberespionage, sinister payloads can be hidden in that flow of data passing through the firewall. To prevent compromise, that flow has to be analyzed for anomalies: patterns of behavior that are red flags. This is getting harder as more and more traffic is encrypted, because the very same encryption that protects legitimate data also lets bad guys hide their malware payloads inside encrypted packets.
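Here's a rough sketch of what flow-level anomaly detection can look like, assuming the only visible signal for an encrypted connection is its metadata, such as bytes transferred. The sample flows and the two-sigma threshold are illustrative assumptions:

```python
# A rough sketch of flow-level anomaly detection on connection metadata.
# Sample sizes and the 2-sigma threshold are assumed for illustration.
from statistics import mean, stdev

def find_anomalies(byte_counts: list[int], threshold: float = 2.0) -> list[int]:
    """Flag flows more than `threshold` standard deviations from the mean."""
    mu, sigma = mean(byte_counts), stdev(byte_counts)
    return [b for b in byte_counts if abs(b - mu) > threshold * sigma]

# Typical small flows, plus one suspiciously large transfer.
flows = [1200, 980, 1430, 1100, 1010, 1350, 250_000]
print(find_anomalies(flows))  # [250000]
```

Real intrusion detection systems use far richer features and models, but even this simple statistical cut shows why velocity matters: the math has to keep pace with the packets.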
Or take sensor data. The more the Internet of Things takes off, the more connected sensors will be out in the world, transmitting tiny bits of data at a near constant rate. As the number of units increases, so does the flow.
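The arithmetic compounds quickly. Here's an illustrative calculation; the fleet size and message size are assumed figures, not any particular deployment:

```python
# Illustrative throughput arithmetic for a sensor fleet; all figures assumed.
sensors = 1_000_000          # assumed fleet size
readings_per_second = 1      # assumed: one small reading per sensor per second
bytes_per_reading = 100      # assumed: value plus device ID and timestamp

per_second = sensors * readings_per_second * bytes_per_reading
per_day = per_second * 86_400  # seconds in a day
print(f"{per_second / 1e6:.0f} MB/s, about {per_day / 1e12:.1f} TB per day")
```

One hundred bytes is nothing; a million sensors sending it every second is roughly 8.6 terabytes a day, every day.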
That flow of data is the velocity vector.
Variety
You may have noticed that I've talked about photographs, sensor data, tweets, encrypted packets, and so on. Each of these is very different from the others. This data isn't the old rows and columns and database joins of our forefathers. It varies enormously from application to application, and much of it is unstructured. That means it doesn't easily fit into fields on a spreadsheet or a database application.
Take, for example, email messages. A legal discovery process might require sifting through thousands or even millions of email messages in a collection. Not one of those messages is going to be exactly like another. Each one will have a sender's address, one or more recipients, and a time stamp, plus human-written text and possibly attachments.
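Here's a small sketch of that split between structure and free text, using only Python's standard library email parser. The message itself is invented for illustration:

```python
# Email is semi-structured: a few machine-readable headers, then free text.
# Parsed with Python's standard library; the message below is invented.
from email import message_from_string

raw = """\
From: alice@example.com
To: bob@example.com
Date: Mon, 14 Mar 2016 09:30:00 -0400
Subject: Q1 forecast

Bob, the numbers look soft. Let's talk before the board call.
"""

msg = message_from_string(raw)
# The structured part: fields that would fit database columns.
print(msg["From"], msg["To"], msg["Date"])
# The unstructured part: free-form human text that needs different tools.
print(msg.get_payload())
```

The headers slot neatly into columns; the body is where the variety lives, and it's the part traditional databases handle worst.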
Photos and videos and audio recordings and email messages and documents and books and presentations and tweets and ECG strips are all data, but they’re generally unstructured, and incredibly varied.
All that data diversity makes up the variety vector of big data.
Managing the three Vs
It would take a library of books to describe all the various methods that big data practitioners use to process the three Vs. For now, though, your big takeaway should be this: once you start talking about data in terms that go beyond basic buckets, once you start talking about epic quantities, insane flow, and wide assortment, you’re talking about big data.
One final thought: there are now ways to sift through all that insanity and glean insights that can be applied to solving problems, discerning patterns, and identifying opportunities. That process is called analytics, and it’s why, when you hear big data discussed, you often hear the term analytics applied in the same sentence.
The three Vs describe the data to be analyzed. Analytics is the process of deriving value from that data. Taken together, there is the potential for amazing insight or worrisome oversight. Like every other great power, big data comes with great promise and great responsibility.
This article was originally published on www.zdnet.com.