The term “data scientist” was coined in 2008 by DJ Patil and Jeff Hammerbacher, then leading the data teams at LinkedIn and Facebook, to describe their work deriving business value from the masses of data being generated by their websites.
It was the dawn of the Big Data era: organizations with a few talented individuals, who had the right combination of skills, were consistently outperforming the collective brainpower of their competitors.
How long can it last? Who knows? At the dawn of civil aviation, a pilot was equal parts adventurer, cartographer and mechanic, as well as someone who could fly a plane.
The imagination and creativity that pushed these early pilots to explore the skies even inspired a few to write famous children's books, such as The Little Prince by Antoine de Saint-Exupéry and James and the Giant Peach by Roald Dahl.
To start with, read James Kobielus' hilarious Data Scientists: Myths and Mathemagical Superpowers, in which he disabuses you of any preconceived notions: the data scientist is neither unicorn nor trumped-up BI analyst, and came down from the ivory tower a while ago. In short, it's a very real profession.
Then, drilling down on the technical side, you can get an entertaining, if stressful, map of the many tasks and tools to master in Becoming a Data Scientist – Curriculum via Metromap.
So now you've ascertained that "data scientist" is a legitimate job, and that there are an awful lot of technical buzzwords that go with it. How do you separate the wheat from the chaff when looking at candidates?
It really boils down to these six core skills:
1. Communicates effectively to business users
What to look for: Can create a PowerPoint presentation as powerful as your Marketing VP's.
How to test: When presenting his results, does your data scientist remember to highlight the hypothesis to explore in green, and the one to reject in red?
2. Knows your business
A data scientist needs to have an overall understanding of the key challenges in your industry, and consequently, your business. She must be familiar with the industry's financial ratios to rapidly assess whether there is a potential gain, its order of magnitude, and then find inspiration before taking her next breath.
Another characteristic of a true data scientist is that she's fascinated by the subjects that will have the greatest impact, not the problems in themselves. A data scientist is not a scientist in the traditional sense; it’s not the quest for truth that drives her, but the process to uncover it.
What to look for: Goes for the highest stakes rather than the most complete analysis.
How to test: Does your data scientist prefer to deal with a safe bet of $100k or a risky endeavor worth $1M?
3. Understands statistical phenomena
Data scientists must be able to correctly interpret statistics: is a result representative or not?
This takes an understanding of statistics that allows the data scientist to assert, with authority, why a 3% difference is statistically significant in some cases but means nothing in others.
This skill is key, since most of the data we analyze contains statistical bias that needs correcting.
What to look for: Can understand what is statistically significant.
How to test: Does your data scientist go bug-eyed when you ask him to calculate a confidence interval for his assessment?
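To make the 3% point concrete, here is a minimal sketch of the kind of check a candidate should be able to produce on the spot. It assumes a simple comparison between two conversion rates and uses a normal-approximation (Wald) interval; the function names and the visitor counts are invented for illustration, and a real analysis might well call for a different test.

```python
import math
from statistics import NormalDist


def lift_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald confidence interval for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    diff = p_b - p_a
    # Standard error of the difference between two independent proportions.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff - z * se, diff + z * se


def is_significant(interval):
    """The difference is significant if the interval excludes zero."""
    low, high = interval
    return low > 0 or high < 0


# Hypothetical data: the same 10% vs 13% conversion rates at two sample sizes.
small = lift_confidence_interval(10, 100, 13, 100)
large = lift_confidence_interval(10_000, 100_000, 13_000, 100_000)

print("100 visitors per group:    ", small, is_significant(small))
print("100,000 visitors per group:", large, is_significant(large))
```

With 100 visitors per group the interval straddles zero, so the 3-point gap could easily be noise; with 100,000 per group the interval sits comfortably above zero. Same 3%, opposite conclusions.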
4. Makes efficient predictions
The data scientist must have a broad knowledge of algorithms to select the right one, and moreover, know which features to adjust to best feed the model. There is often a certain degree of creativity involved here; as a painter uses color to convey depth, a data scientist must know how to combine different data so they complement each other.
What to look for: Instinctively knows which features to add to the model.
How to test: Has your data scientist already tried all the variables or derivatives that you can possibly think of?
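As a rough illustration of what "variables or derivatives" can mean in practice, the sketch below mechanically derives logs, differences and ratios from a handful of raw fields. The field names and the `derive_features` helper are hypothetical, and blindly generating every combination is only a starting point: the skill lies in knowing which of these actually mean something for the business.

```python
import math
from itertools import combinations


def derive_features(row):
    """Return the raw numeric fields plus simple derived ones."""
    features = dict(row)
    # Log transforms tame heavily skewed quantities such as spend.
    for name, value in row.items():
        if value > 0:
            features[f"log_{name}"] = math.log(value)
    # Pairwise differences and ratios combine fields that complement each other.
    for (a, value_a), (b, value_b) in combinations(row.items(), 2):
        features[f"{a}_minus_{b}"] = value_a - value_b
        if value_b != 0:
            features[f"{a}_per_{b}"] = value_a / value_b
    return features


# Hypothetical customer record.
customer = {"total_spend": 240.0, "orders": 8, "days_since_signup": 120}
features = derive_features(customer)

for name, value in features.items():
    print(f"{name}: {value:.3f}")
```

Here `total_spend_per_orders` is the average basket, a feature that often says more than either raw field on its own.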
5. Provides production solutions
Today's data scientists need to provide services that can run daily, on live data.
This is a change. Historically, back-office models built by BI or data mining teams were often rewritten by technical teams for real-time production environments. Nowadays, a recommender system cannot afford a rewrite before being put online.
What to look for: Knows how to deliver production-ready solutions.
How to test: Does your data scientist run the other way when asked to code his algorithm in Java?
6. Can work on a mass scale
A data scientist must know how to handle multi-terabyte datasets to build a robust model that holds up in production. He must not be afraid of datasets with a 13-digit file size in bytes. In practice this means that he needs to have a good idea of computation time, what can be done in memory and what, on the other hand, requires Hadoop and MapReduce.
What to look for: Someone not afraid of big datasets.
How to test: Does the prospect of reconciling several customer datasets of a few million lines apiece make your data scientist break into a cold sweat?
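The "in memory or not" question is mostly back-of-the-envelope arithmetic, and a good candidate will do it without being asked. The sketch below is one way to frame it; the 8 bytes per value, the 3x overhead multiplier for working copies, the 16 GiB machine and the row counts are all assumptions chosen for illustration, not measured figures.

```python
GIB = 1024 ** 3


def estimated_bytes(rows, columns, bytes_per_value=8):
    """Raw size of a dense numeric table, ignoring any container overhead."""
    return rows * columns * bytes_per_value


def fits_in_memory(rows, columns, ram_bytes, overhead=3.0, bytes_per_value=8):
    """Rough check: does the table, plus assumed working copies, fit in RAM?"""
    needed = estimated_bytes(rows, columns, bytes_per_value) * overhead
    return needed <= ram_bytes


ram = 16 * GIB  # assumed workstation

# A customer file of a few million lines: comfortably in memory.
customers = fits_in_memory(5_000_000, 20, ram)

# A hypothetical clickstream of ten billion lines: time for a cluster.
clickstream = fits_in_memory(10_000_000_000, 20, ram)

print("customer file fits in memory:", customers)
print("clickstream fits in memory:  ", clickstream)
```

Seen this way, reconciling a few customer datasets of several million lines each is a laptop job, which is exactly why it should not cause a cold sweat.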