Everyone knows the oh-so-popular statement that data scientists spend 50 to 80% of their time cleaning and preparing data before they even start looking for insights in it. Everyone’s been talking about this since 2014, and yet it hasn’t changed.
Data Wrangling Is Inherent to Big Data
What the articles about this phenomenon don't tend to touch on is the fact that data preparation is inherent to big data. The more data you bring in to train your model, the better your model tends to be, and the dirtier that data tends to be.
When you’re bringing in not just pre-formatted weblogs but data from documents, sensors, CRM tools, and social connectors, it all arrives in different formats. It needs to be cleaned and unified before you can use it even for basic business insights. At the same time, you don't want a data infrastructure that pre-processes everything too rigidly and puts you at risk of losing the valuable information hidden in the raw data sources.
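To make the problem concrete, here's a minimal pandas sketch of what that unification step can look like. The sources, column names, and formats below are hypothetical stand-ins for a weblog export and a CRM export that describe the same customers differently:

```python
import io
import pandas as pd

# "Weblog" source: timestamps as epoch seconds, user ids as plain strings.
weblog_csv = io.StringIO(
    "user_id,ts,page\n"
    "42,1417392000,/pricing\n"
    "57,1417395600,/docs\n"
)
weblog = pd.read_csv(weblog_csv, dtype={"user_id": str})
weblog["ts"] = pd.to_datetime(weblog["ts"], unit="s")

# "CRM" source: the same users, but different column names,
# zero-padded ids, and ISO-formatted dates.
crm_records = [
    {"customer": "0042", "signup_date": "2014-11-28", "plan": "pro"},
    {"customer": "0057", "signup_date": "2014-12-01", "plan": "free"},
]
crm = pd.DataFrame(crm_records)
crm["customer"] = crm["customer"].str.lstrip("0")  # align the id formats
crm["signup_date"] = pd.to_datetime(crm["signup_date"])

# Only after unifying the schemas can any analysis happen.
unified = weblog.merge(crm, left_on="user_id", right_on="customer", how="left")
print(unified)
```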
Of course, there are many tools available that try to make this easier, but a lot of data scientists keep coming back to the tools they know best: coding in Python, SQL, and even R. Why? Because those are the most flexible and let data scientists manipulate data easily and naturally. Also because data scientists need to think about their future, and they don’t want to become specialists in a skill that will be restrictive when they look for their next job.
Data Wrangling in Code Is Costly
In any case, data scientists end up spending most of their time digging into the data: finding where the problems are, replacing incorrect values or formats, correcting anomalies, finding and checking keys to join on, performing multiple splits to generate new features, and so on, all the while going back and forth to test whether those new features actually improve algorithmic performance. And each time they start on a new dataset, they have to start over from scratch and recode everything.
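For a flavor of what that loop looks like in practice, here's a hedged pandas sketch of one typical cleanup pass. Every column name and cleaning rule here is a made-up example, not a prescription:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "order_id": ["A-001", "A-002", "A-003", "A-003"],
    "amount": ["19.99", "N/A", "-5.00", "-5.00"],
    "shipped": ["2015/01/03", "03-01-2015", "2015-01-04", "2015-01-04"],
})

# 1. Find the problems: duplicate rows, sentinel values, mixed date formats.
df = df.drop_duplicates(subset="order_id")

# 2. Replace incorrect values and formats.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")  # "N/A" -> NaN
df.loc[df["amount"] < 0, "amount"] = np.nan                  # impossible values

# 3. Normalize a date column despite inconsistent source formats
#    (format="mixed" requires pandas >= 2.0).
df["shipped"] = pd.to_datetime(df["shipped"], format="mixed", dayfirst=False)

# 4. Check the join key before trusting it.
assert df["order_id"].is_unique

# 5. Split a field to generate new features.
df[["region", "order_num"]] = df["order_id"].str.split("-", expand=True)
print(df)
```

And the next dataset will need a slightly different version of every one of those steps.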
This approach gets even more complicated once more than one data scientist is involved. Re-reading someone else’s code to figure out where the schema got messed up, while your whole pipeline is down, is no fun at all.
How to Pimp Your Data Wrangling
There isn’t a miracle solution that ends all data preparation hassle, of course; every job has its annoying tasks. And, like it or not, data munging is useful to data science. But we’ve found that there is a way to make it easier: Visual Preparation!
OK, now just hear me out. I know it’s not a new idea, and plenty of tools out there offer visual cleaning interfaces that data scientists reject. That’s not what I’m referring to.
A visual interface can never be as flexible as a human writing code; that’s a fact. But what if you could move back and forth seamlessly, in one interface, between visual accelerators and coding in your favorite languages?
So instead of rewriting code for annoying little standard operations like date parsing, geocoding, text processing, or various folds and unfolds, you can just use a visual editor, see the effect each step has on your data, and go back to tweak your features at any time. And when you want to do more advanced cleaning and enriching, or when some operations are simply easier for you in code, you can switch to writing SQL, R, Python, or anything in shell (and all the equivalents for distributed databases), all in the same interface.
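To show the kind of throwaway code a date-parsing processor saves you from rewriting, here's a minimal hand-rolled parser of the sort that gets recoded for every new dataset; the candidate formats are hypothetical:

```python
from datetime import datetime

# The list of formats grows with every new data source you encounter.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y", "%Y%m%d"]

def parse_date(raw: str):
    """Try each known format in turn; return None if nothing matches."""
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt)
        except ValueError:
            continue
    return None

print(parse_date("2015-03-14"))    # 2015-03-14 00:00:00
print(parse_date("Mar 14, 2015"))  # 2015-03-14 00:00:00
print(parse_date("not a date"))    # None
```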
At Dataiku, that’s what we call visual interactive data preparation: it comes with over 80 pre-coded processors designed by our own data scientists to make their work more efficient.