We live in the world of big data. We all know this. I do, you do, heck, I even picked up Cosmopolitan the other day and it had a definition of big data in it!
However, just because something is mainstream does not mean it makes sense to everyone. Quite the opposite, really. So many people have talked about, written about, and tried to define big data that anyone who isn’t a data scientist or data engineer no longer knows where to start to get big-data-ready. Right?
So I’m not going to follow in the footsteps of big data experts and esteemed colleagues worldwide and try to further explain big data. I’m just going to discuss what changes in your job when you shift from small data to big data.
From Small Data Pipelines to Big Data Storage
First off, we should try to understand what small data even is. Well, small data describes processes built to handle smaller volumes of data than what we’re starting to deal with today. This data is traditionally stored in basic SQL relational databases. It’s pre-processed to make sure the information is stored in a structured, reliable way and can be accessed easily. Analysts then get that structured data every day or week with tools like Oracle, SAS, or ETL platforms. They export reports to Excel and build more reports daily, weekly, and monthly.
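To make that concrete, here is a minimal sketch of a small data pipeline, using Python's built-in sqlite3 module as a stand-in for a relational database. The `sales` table, its columns, and the sample rows are all made up for illustration; the point is only that the structure is fixed before any data goes in, and the report is a simple aggregate on top of it.

```python
import sqlite3


def build_weekly_report(rows):
    """Load already-structured rows into a relational table and aggregate them."""
    conn = sqlite3.connect(":memory:")  # in-memory database, nothing touches disk
    # The schema is decided up front: every row must fit these three columns.
    conn.execute(
        "CREATE TABLE sales ("
        "week INTEGER NOT NULL, region TEXT NOT NULL, amount REAL NOT NULL)"
    )
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)
    # The "report" an analyst would export to Excel: total sales per week and region.
    report = conn.execute(
        "SELECT week, region, SUM(amount) FROM sales "
        "GROUP BY week, region ORDER BY week, region"
    ).fetchall()
    conn.close()
    return report


# Hypothetical, pre-cleaned sample data (week number, region, amount).
sample_rows = [
    (1, "north", 120.0),
    (1, "north", 80.0),
    (1, "south", 50.0),
    (2, "north", 200.0),
]
weekly_report = build_weekly_report(sample_rows)
for week, region, total in weekly_report:
    print(week, region, total)
```

Anything that doesn't fit those three columns simply can't be loaded, which is fine as long as the data stays small and predictable.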
If you look at how big data affects this, it all comes down to one thing: size (that’s what she said). Yes indeed: big data is just too big to fit into these traditional pipelines. Much more data is collected, from lots of different sources (transactional data, but also logs from every site or connected object, CRM data from emails, sales information from several tools, social media data, open data used for enrichment, and so on). That volume of data has to be handled by new systems that keep it accessible and reliable: distributed systems. Hadoop is the most famous one (yep, that cute little elephant guy).
From Rigid ETLs to Going Straight To the Data Sources
That data also comes in lots of different formats, so it doesn’t fit into the strict pipelines and manipulation processes set up by ETLs. More importantly, there’s so much information in that raw data that you don’t know what’s in it until you start digging. So you don’t want to pre-process it and risk losing important bits; you want to store as much as possible and drop the strict processes.
This also means that you can’t rely on interfaces to spit out pre-built reports anymore to get valuable insights; you want to get to the source, that raw data, and get as much as you can out of it. So yes, big data means walking away from Excel for a little while. It means new tools, whether languages and frameworks like SQL, Python, R, or even Hive, Pig, MapReduce, Scala, or C++, or platforms that allow analysts to manipulate the data without code (pause for sigh of relief).
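The "store it raw, dig in later" idea can be sketched in a few lines of Python. This is only a toy illustration under loud assumptions: a plain list stands in for a distributed store like Hadoop, and the record fields (`user`, `page`, `temp_c`, and so on) are invented. What it shows is that nothing is thrown away or forced into a schema at ingestion; structure is only applied when you ask a question.

```python
import json

# A plain list stands in for a distributed store; this is an assumption
# made for illustration, not how a real cluster is accessed.
raw_store = []


def ingest(raw_line):
    """Keep the record exactly as it arrived: no parsing, no schema, no filtering."""
    raw_store.append(raw_line)


def query(field):
    """Interpret the raw records at read time and collect one field's values."""
    values = []
    for line in raw_store:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # unreadable for this question, but still kept in the store
        if isinstance(record, dict) and field in record:
            values.append(record[field])
    return values


# Hypothetical records from different sources, each with its own shape.
ingest('{"source": "web_log", "user": "a1", "page": "/pricing"}')
ingest('{"source": "crm", "user": "a1", "email_opened": true}')
ingest('{"source": "sensor", "device": "d9", "temp_c": 21.5}')
ingest("not json at all")

users = query("user")
temps = query("temp_c")
print(users, temps)
```

Compare that with a strict ETL: the sensor reading and the malformed line would have been rejected on the way in, and you would never get the chance to find out whether they mattered.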
More than anything though, big data means lots of extra value. It means new jobs created, in data management and storage, in analytics, data science and algorithm development. It also means new business models and ways for companies to become much more efficient, or even reinvent themselves.
If you want to know more about how to go from Excel to a Big Data Analytics Mindset, look no further.