When is ‘Big Data’ Really ‘Big’?

Opinion | Data Science | Connected Sensors | Florian Douetteau

As I’m involved in the startup ecosystem, I meet a lot of startups that work with data. Every week, people ask me the same question: “Is the volume of data I’m dealing with big enough to be called ‘Big Data’?”

I understand their point perfectly: a startup that starts a data-oriented project always wants to put a “Big Data” label on its slides. It catches the data-focused VC’s eyes and the public’s attention. However, at some point, for instance before the startup has to pitch its product in front of an investor, people start wondering: “Am I really doing Big Data?”


Big Data: from one tera to one peta

Big Data can indeed mean different things. Let’s take two different cases.

Case #1: you work in a relatively large economy like France and you make around 1 billion in revenue. You deal with a few million individuals, each generating a few contact points a month. The data generated is both machine- and human-related, and the total amount you have to deal with would be around a few terabytes.

Case #2: you are a huge digital stakeholder like Facebook, which stores all its users’ pictures, or a network of satellites that photographs the whole planet. The amount of data you have to deal with would be in the hundreds of petabytes.

The difference between the two cases is a factor of about 100,000. Does that mean the former doesn’t deserve the title “Big Data”?
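To make the gap concrete, here is a back-of-envelope sketch in Python. The per-contact size, retention period, and storage totals are purely illustrative assumptions chosen to match the orders of magnitude above, not real measurements:

```python
# Back-of-envelope comparison of the two cases above.
# All per-user and per-contact figures are illustrative assumptions.

TB = 10**12  # bytes in a terabyte (decimal)
PB = 10**15  # bytes in a petabyte

# Case #1: a few million individuals, a few contact points a month,
# a couple of years of history, a few kilobytes per contact point.
individuals = 5_000_000
contacts_per_month = 5
months_retained = 24
bytes_per_contact = 5_000

case1_bytes = individuals * contacts_per_month * months_retained * bytes_per_contact
print(f"Case #1: {case1_bytes / TB:.0f} TB")  # a few terabytes

# Case #2: a global photo store or satellite network (assumed total).
case2_bytes = 300 * PB
print(f"Case #2: {case2_bytes / PB:.0f} PB")  # hundreds of petabytes

print(f"Ratio: {case2_bytes / case1_bytes:,.0f}x")  # roughly 100,000x
```

With these assumptions, Case #1 lands at about 3 TB and Case #2 at 300 PB, which is where the factor of roughly 100,000 comes from.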

Big Data is bigger in volume and bigger in reach


The main issue with Big Data is that it lacks a single, global definition. Looking back in history, players like credit card providers and retailers were already dealing with a few terabytes of data in the 1990s. Then two distinct waves occurred. First, the volume of data skyrocketed, driven by the emergence of machine-to-machine data, the installation of connected sensors, and the systematic generation of media data. Second, the reach broadened: the need to deal with large volumes of data has shifted from being confined to a handful of global companies in the 1990s to being a daily issue for a million companies today.

One of the main challenges of Big Data today is to offer all those companies the same level of relevance in analyzing their data as the first 10 or 20 world leaders enjoy.

It’s a matter of nameplate

There are as many differences between what you can do with satellite data, online advertising data, and retail consumer behaviour data as there are between a plastic surgeon and a cardiac surgeon. You don’t go to see a plastic surgeon for a cardiac disease – unless you want to play with fire! Apply that logic to Big Data. Or not.

I will conclude with a piece of advice: just as it helps a doctor to put a nameplate on the door to get more patients, you should not hesitate to add the “B” word to your slides to impress investors. Only your results will determine whether you deserve the label you claim.

PS: This presentation by O. Grisel about Predictive Analytics and the usage of the term "big data" gives an interesting perspective on what’s big, and what’s small but nonetheless important.
