Big Data, Big Overhead or Bad Math?

Lance Gutteridge
Published in codeburst · Nov 18, 2018


In 2015 The Economist wrote an article about big data, "Data, data everywhere." Now, if any magazine should be numerate, I would think it would be The Economist (a magazine, I should add, that I subscribe to and normally greatly admire). However, in this case it seems they failed to do the math and got taken in by the big data hype machine.

This article says that Walmart has collected 2.5 petabytes of data on customer transactions to be used for data mining. Gathering large amounts of retail data has been a hot topic for a few years. Everyone talks about the discovery that people who buy diapers often buy beer. (See Beer and Diapers: The Impossible Correlation). This has been cited in thousands of articles as an example of the magical knowledge that one can mine from big data.

The Economist article mentions, with a sense of awe at the scale, that Walmart handles about 1 million transactions an hour. It goes on to say that all of this data is stored in a massive data store of around 2.5 petabytes.

A petabyte is 1,000 terabytes, or 1 million gigabytes.

Now, a million transactions an hour sounds like a lot. The Economist article is fawningly awestruck by the sheer size of the data that is produced and stored.

But what is that data? Does that number of transactions really add up to that much data?

There are 8,760 hours in a year. At 1 million transactions an hour, that is 8.76 billion transactions a year. So let's round up to 10 billion, even though that already assumes a million transactions per hour, 24 hours a day, 365 days a year, and then a major round-up on top of that.

What is a transaction? Well, someone comes into a Walmart store and buys some stuff. So let’s look at what kind of data is in a transaction.

Now, in the following I'm going to assign a number of bytes to certain kinds of information. For example, standard storage for a date/time takes 8 bytes. Numbers like prices can be stored in 4 bytes. A credit card number is usually 16 digits, but some can be up to 19, so we will go with 19 bytes for a credit card.

So, someone goes into a Walmart and buys some stuff. Let’s assume that the average number of items is five. I think that is high, given that I often go into places and buy one item.

A transaction has header information that applies to the whole transaction, and then it has items (we are assuming five).

Sometimes this is a cash sale and not a lot is known about the customer. However, let us assume we have a date/time (8 bytes), a credit card number (19 bytes), an affinity card number (19 bytes), a store ID (say 8 bytes), and information on the particular till taking the transaction (say 8 bytes). That is 62 bytes of header. Each item has a product code (12 bytes), a price (4 bytes), and a quantity (4 bytes), which gives us 20 bytes per item, or 100 bytes for 5 items. That means the transaction is around 162 bytes.
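
To make the arithmetic explicit, here is a quick back-of-envelope sketch in Python. The field names and byte sizes are the assumptions made above, not Walmart's actual schema.

```python
# Back-of-envelope estimate of one transaction's size in bytes.
# Field sizes are the assumptions from the text, not Walmart's real schema.
header_fields = {
    "date_time": 8,
    "credit_card": 19,
    "affinity_card": 19,
    "store_id": 8,
    "till_id": 8,
}
item_fields = {
    "product_code": 12,
    "price": 4,
    "quantity": 4,
}

ITEMS_PER_TRANSACTION = 5  # assumed average basket size

header_bytes = sum(header_fields.values())                          # 62
item_bytes = sum(item_fields.values())                              # 20 per item
transaction_bytes = header_bytes + item_bytes * ITEMS_PER_TRANSACTION

print(header_bytes, item_bytes, transaction_bytes)                  # 62 20 162
```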

We may have missed some things here. For example, there could be more than one product code on an item (external and internal). There could be some discount codes for special offers. So, to be safe, let's greatly blow up the size of the transaction. Instead of 162 bytes we will say 1,000 bytes, which is ridiculously large for a transaction and should more than cover any extra information they are carrying that we haven't counted.

So at our greatly inflated 10 billion transactions a year, and our gross overestimate of 1,000 bytes per transaction, we have 10,000 billion bytes per year, which is 10 terabytes.

Now let’s assume they are keeping 10 years of data. That would be 100 terabytes of data.

A petabyte is 1,000 terabytes. According to the article, Walmart was keeping 2.5 petabytes, which is 2,500 terabytes. That is 25 times more data than they could possibly collect over 10 years, even allowing for ridiculously large estimates of data sizes.
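
Carrying the same deliberately inflated assumptions through to the totals (again a sketch, using the rounded figures above):

```python
# Carrying the deliberately inflated assumptions through to the totals.
TRANSACTIONS_PER_YEAR = 10_000_000_000      # rounded up from 8.76 billion
BYTES_PER_TRANSACTION = 1_000               # inflated from the 162-byte estimate
YEARS_RETAINED = 10

terabytes_per_year = TRANSACTIONS_PER_YEAR * BYTES_PER_TRANSACTION / 1e12
terabytes_retained = terabytes_per_year * YEARS_RETAINED     # 100 TB over a decade

CLAIMED_TERABYTES = 2_500                   # the 2.5 petabytes from the article
print(terabytes_per_year, terabytes_retained)                # 10.0 100.0
print(CLAIMED_TERABYTES / terabytes_retained)                # 25.0 -- 25x the estimate
print(terabytes_retained / CLAIMED_TERABYTES)                # 0.04 -- about 4% real data
```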

So what is all that data? If that number is at all real, then the 2.5 petabytes would be mostly overhead. It would mean that all that data spinning around on massive farms of disks is about 96% overhead and 4% real data. In actuality, the number is probably closer to 99% overhead and 1% real data.

Any competent programmer would think that they could store that data in a much better way and reduce the size by a factor of close to a hundred. But that would remove the need for specialized software and the massive investment in storage hardware. So be careful suggesting that to big data startups or other companies in that space.

If you do the math on almost all big data numbers, you will find this giant discrepancy between the actual amount of data and what is being stored.

A lot of this is because the data is being stored in relational databases, which are incredibly inefficient when it comes to data storage. Some of it is just manufacturers trying to pad the data sizes to justify software and hardware sales. Or it just might be that they aren't good at multiplication.

But there are articles that seem even more out of touch with numeric reality than The Economist. Forbes had an article titled "Really Big Data At Walmart: Real-Time Insights From Their 40+ Petabyte Data Cloud."

In this article Walmart is cited as gathering 2.5 petabytes an hour. Now, we saw above that at 1 million transactions per hour it would take 250 years to come close to collecting 2.5 petabytes. So gathering that much in an hour is … well, what can I say … unbelievable.

Consider that the entire YouTube upload is about 1 petabyte a day. So this article is saying that Walmart adds about 60 times as much data a day as the entire YouTube network.
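
As a sanity check on those Forbes-sized numbers, here is the same style of sketch; the transaction rate, transaction size, and the rough 1-petabyte-a-day YouTube figure are the assumptions already used in this article.

```python
# Sanity-checking the "2.5 petabytes an hour" claim with this article's assumptions.
PB = 1e15                                       # bytes in a petabyte

# Generous estimate from earlier: 10 billion transactions/year at 1,000 bytes each.
BYTES_PER_YEAR = 10_000_000_000 * 1_000         # 10 terabytes per year

years_to_collect = 2.5 * PB / BYTES_PER_YEAR
print(years_to_collect)                         # 250.0 years to reach 2.5 PB

# Compare the claimed hourly rate with YouTube's rough daily upload volume.
claimed_pb_per_day = 2.5 * 24                   # 60 PB/day if 2.5 PB/hour were true
YOUTUBE_PB_PER_DAY = 1                          # rough figure cited in the text
print(claimed_pb_per_day / YOUTUBE_PB_PER_DAY)  # 60.0 times YouTube's daily uploads
```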

Does anyone edit this stuff? Does anyone ever do the math?

There is, unfortunately, a deep level of innumeracy in our technical press, in corporate management, and in the investment community. So numbers like this just go unchecked.

So the next time you read about big data, or see a startup trying to ride the big data wave, think about whether that is really big data, or big overhead, or just bad math.


Dr. Lance Gutteridge has a PhD in computability theory. He is presently CTO of Formever Inc. (www.formever.com), where he architects ERP authoring software.