The term “data scientist” was coined in 2008 by two LinkedIn analysts to describe their work deriving business value from the masses of data being generated by their website. Since then, people haven't stopped wondering what a data scientist really is (even some companies looking to hire them aren't so sure). If you're seeking an intrepid data scientist to lead your company on the path to Enterprise AI, what skill sets should you be looking for?
To start with, I recommend reading James Kobielus' hilarious Data Scientists: Myths and Mathemagical Superpowers, where he will disabuse you of any preconceived notions: the data scientist is neither unicorn nor trumped-up BI analyst, and he came down from the ivory tower a while ago. In short, data science is a very real profession.
If you drill down more on the technical side, you can get an entertaining, but stressful, map of the many tasks and tools to master, as shown in Becoming a Data Scientist – Curriculum via Metromap.
So now that you've ascertained that "data scientist" is a legitimate job, and that there are an awful lot of technical buzzwords that go with it, how do you separate the wheat from the chaff when looking at candidates?
It's not easy, because as the years have passed, the different breeds of data scientists out there have evolved. But it really boils down to these 6 core skills:
A Good Data Scientist Communicates Effectively
The harsh reality is that statistics are complex. A data scientist has no hope of enlightening the average business user with an Excel file. To let the data tell a story, a data scientist needs to have a veritable Swiss Army knife of presentation skills to convey their results persuasively, to anyone. This can range from the most mundane (a PowerPoint presentation) to the most exotic (multimedia storytelling using interactive JavaScript visualizations built with the latest version of the D3 library).
What to look for: Can create a PowerPoint presentation as powerful as your Marketing VP's.
How to test: When presenting his results, does your data scientist remember to highlight the hypothesis to explore in green, and the one to reject in red?
A Good Data Scientist Knows Your Business
A data scientist needs to have an overall understanding of the key challenges in your industry, and consequently, your business. For example, (s)he must be familiar with the industry's financial ratios to rapidly assess whether there is a potential gain, its order of magnitude, and then find inspiration before making the next move.
Another characteristic of a true data scientist is that he or she is fascinated by the subjects that will have the greatest impact, not the problems in and of themselves. A data scientist is not a scientist in the traditional sense; it’s not the quest for truth that should drive them, but the process to uncover it.
What to look for: Goes for the highest stakes instead of the most complete.
How to test: Does your data scientist prefer to deal with a safe bet of $100k or a risky endeavor worth $1M?
A Good Data Scientist Understands Statistical Phenomena
Data scientists must be able to correctly interpret statistics: is a result representative or not? This takes an understanding of statistics that allows the data scientist to assert, with authority, why 3% is statistically significant for certain cases, but means nothing for others.
This skill is key, since the majority of stats we analyze contain statistical bias that needs correcting.
What to look for: Can understand what is statistically significant.
How to test: Does your data scientist go bug-eyed when you ask him to calculate a confidence threshold for his assessment?
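To make the "3% can be significant or meaningless" point concrete, here is a minimal sketch using a two-proportion z-test with a normal approximation. The function name and the conversion counts are hypothetical, chosen only for illustration, and the 3% is read here as a 3-percentage-point gap (10% versus 13%); a real analysis would also need to check the bias issues mentioned above.

```python
import math


def two_proportion_p_value(successes_a, n_a, successes_b, n_b):
    """Two-sided p-value for the gap between two observed rates.

    Uses the pooled two-proportion z-test (normal approximation),
    which is reasonable when both samples are fairly large.
    """
    rate_a = successes_a / n_a
    rate_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Two-sided tail probability of a standard normal variable
    return math.erfc(abs(z) / math.sqrt(2))


# Hypothetical data: the same 10% vs 13% gap at two sample sizes
small_sample_p = two_proportion_p_value(20, 200, 26, 200)
large_sample_p = two_proportion_p_value(2000, 20000, 2600, 20000)

print(f"200 users per group:    p = {small_sample_p:.3f}")
print(f"20,000 users per group: p = {large_sample_p:.3g}")
```

With a couple of hundred observations per group, the gap is well within what chance alone could produce; with tens of thousands, the same gap is very unlikely to be noise. That is the kind of reasoning a candidate should be able to walk through without hesitation.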
A Good Data Scientist Makes Efficient Predictions
The data scientist must have a broad knowledge of algorithms to select the right one, and moreover, know which features to adjust to best feed the model. There is often a certain degree of creativity involved here; as a painter uses color to convey depth, a data scientist must know how to combine different data so they complement each other.
What to look for: Instinctively knows which features to add to the model.
How to test: Has your data scientist already tried all the variables or derivatives that you can possibly think of?
A Good Data Scientist Provides Production-Ready Solutions
Today's data scientists need to provide services that can run daily on live data. In other words, operationalization needs to be a regular part of their vocabulary.
What's new here is that historically, back-office models built by BI or data mining teams were often rewritten by technical teams for real-time production environments. Nowadays, a recommender system cannot withstand a rewrite before being put online.
What to look for: Knows how to deliver production-ready solutions.
How to test: Does your data scientist run the other way when asked to code his algorithm in Java?
A Good Data Scientist Can Work at Enterprise Scale
Data scientists must know how to handle multi-terabyte datasets to build a robust model that holds up in production. They must not be afraid of datasets with a 12-digit file size. In practice, this means having a good idea of computation time, and of what can be done in memory versus what requires Hadoop and MapReduce.
What to look for: Someone not afraid of big datasets.
How to test: Does the prospect of reconciling several customer datasets of a few million lines apiece make your data scientist break into a cold sweat?
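The "what fits in memory" judgment can start from a back-of-the-envelope estimate like the sketch below. The function names, the 8-bytes-per-value figure (64-bit numbers), and the overhead factor of 3 for intermediate copies are all assumptions made for illustration, not fixed rules; real footprints depend on data types and tooling.

```python
def estimated_bytes(n_rows, n_columns, bytes_per_value=8):
    """Rough size of a fully numeric table held in memory.

    Assumes every value takes `bytes_per_value` bytes (8 for 64-bit
    numbers); text columns usually take more.
    """
    return n_rows * n_columns * bytes_per_value


def fits_in_memory(n_rows, n_columns, ram_bytes, overhead_factor=3.0):
    """True if the table, plus working copies, plausibly fits in RAM.

    `overhead_factor` is an assumed safety margin for the temporary
    copies that joins and transformations tend to create.
    """
    return estimated_bytes(n_rows, n_columns) * overhead_factor <= ram_bytes


GB = 10**9
ram = 16 * GB  # hypothetical workstation

# A few million customer rows: comfortable on a single machine
print(fits_in_memory(5_000_000, 40, ram))

# Tens of billions of rows: time for a distributed approach
print(fits_in_memory(50_000_000_000, 40, ram))
```

A candidate who can do this arithmetic in their head will know that reconciling datasets of a few million lines is a single-machine job, and will save the cluster for the datasets that truly need it.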