At Dataiku we want to shorten the time from the analysis of the data to the effective deployment of predictive applications.
We therefore need a better understanding of data scientists' daily tasks (data exploration, modeling, and deployment of useful algorithms across organizations) and of where friction occurs, preventing relevant analytics from being delivered smoothly to production.
Some years ago, industry leaders, with input from more than 200 data mining users and providers of data mining tools and services, tried to model best practices for getting better, faster results from data.
They came up with a methodology known as CRISP-DM, the Cross Industry Standard Process for Data Mining, an industry-, tool-, and application-neutral model. CRISP-DM formally defines six phases, from business understanding through deployment; for our purposes they fall into two broad groups: an experimental phase and a deployment phase.
The whirlpool diagram above shows how the CRISP phases work together. There are two overlapping paths, and data sits at the center of both. The inner crescent-shaped loop, to the right of the data, reflects the experimental phase, during which analysts think about and model various hypotheses. The arrows leading to deployment stand for the process in which satisfactory experiments and models are deployed to production. The outer loop, which completes the moon shape, represents the overall process, in which analysts and others working with data repeat this cycle of experimenting and deploying over and over again.
Interestingly, even though CRISP-DM was conceived in 1996, no tools had managed to offer a design consistent with the core concepts of this methodology. Most data tools on the market tended not to separate the experiment phase from the deployment phase.
Specifically, two different concepts are often mixed: a workflow and a data pipeline.
A workflow is how a person maps her own work and thought processes onto abstract operations on the data. For example, she might decide that she needs to clean the data, then check the distributions of various aspects of the data with some graphs, before deciding on some statistical tests. She then writes rough drafts of code to develop her thinking and test her ideas. This is the experiment level within the CRISP-DM methodology.
A data pipeline is very much like an ETL (Extract, Transform, Load) process: it is mainly composed of data transformations and their dependencies (for example, dataset A holds statistics computed from datasets B and C). These transformations can be static code or generated code (such as machine learning models). This is the deployment level.
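To make the idea of "transformations and their dependencies" concrete, here is a minimal sketch in Python. It is not how Data Science Studio is implemented; the `PIPELINE` dictionary, the dataset names, and the `run_pipeline` helper are assumptions made up for this illustration. Each dataset declares which datasets it depends on, and the pipeline is replayed in dependency order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical pipeline: dataset name -> (dependencies, transformation).
# "A" is a statistic computed from the source datasets "B" and "C".
PIPELINE = {
    "B": ((), lambda: [1, 2, 3]),
    "C": ((), lambda: [10, 20, 30]),
    "A": (("B", "C"), lambda b, c: sum(b) + sum(c)),
}


def run_pipeline(pipeline):
    """Build every dataset in dependency order and return all results."""
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {name: set(deps) for name, (deps, _) in pipeline.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        deps, transform = pipeline[name]
        # Every dependency has already been built at this point.
        results[name] = transform(*(results[d] for d in deps))
    return results


results = run_pipeline(PIPELINE)
print(results["A"])
```

Because the dependencies are explicit, the whole pipeline can be replayed at any time without a human remembering the order of the steps, which is what distinguishes it from a workflow.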
Mixing these two concepts is what blurs the deployment and experiment phases. Experiments very often pollute the final data pipeline with unnecessary nodes accumulated over successive iterations. This causes two difficulties:
- It is hard to follow a data project (what has been validated? what is still in progress?), which makes it difficult to work collaboratively on the same data project.
- It is unclear what the delivered product of a data analysis is (is it some charts on scoring, the workflow that produces those charts, or a replayable sub-pipeline of the entire data pipeline?). At a higher level, people who are not hands-on with the data are not involved at every stage of the experiment validation process, because they are only given reports that live outside the data science tool.
Because of this, the pace of iterations (the sequence of experiments, some that fail and others that succeed) slows down drastically.
Data Science Studio aims to solve these issues by clearly separating the experiment tasks and the deployment tasks into different locations within the same tool.
In Data Science Studio, successful experiments are deployed in the flow. This is a data pipeline that looks like the diagram below:
Data Flow in DSS
Each node in the flow contains a transformation (created with code or with visual tools) or a model that has been validated during dedicated prior experiments. The flow can be replayed thanks to our smart running engine, which optimizes data loading and processing. Models in the flow can be retrained, and all their versions are kept so that the user can decide which one is best suited for scoring. In other words, the flow is always ready for production.
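The idea of keeping every retrained version of a model while scoring with only one of them can be sketched as follows. This is an illustration under assumptions, not the Data Science Studio API: the `VersionedModel` class and its method names are hypothetical, and the "models" are plain functions standing in for trained predictors.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class VersionedModel:
    """Hypothetical container: keeps every trained version, scores with one."""

    versions: List[Callable[[float], float]] = field(default_factory=list)
    active: Optional[int] = None

    def add_version(self, predict: Callable[[float], float]) -> int:
        """Store a newly trained version and return its index."""
        self.versions.append(predict)
        if self.active is None:
            # The first version becomes active by default.
            self.active = 0
        return len(self.versions) - 1

    def activate(self, index: int) -> None:
        """Choose which stored version is used for scoring."""
        if not 0 <= index < len(self.versions):
            raise IndexError(f"no model version {index}")
        self.active = index

    def score(self, x: float) -> float:
        """Score one input with the currently active version."""
        if self.active is None:
            raise RuntimeError("no model version has been trained yet")
        return self.versions[self.active](x)


model = VersionedModel()
v1 = model.add_version(lambda x: 2 * x)  # stand-in for a first training run
v2 = model.add_version(lambda x: 3 * x)  # stand-in for a retrained model
first_score = model.score(10)            # scored with the first version
model.activate(v2)
second_score = model.score(10)           # scored with the retrained version
```

Retraining never overwrites an earlier version here, so switching back is a single `activate` call rather than a new training run.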
Where are experiments performed in Data Science Studio?
For code lovers, experiments can be done in interactive notebooks (Python, R, SQL), a comfortable drafting environment. Satisfactory code is then included in a node in the flow.
For non-coders, experiments are performed in a purely visual notebook called the analysis module, where common data tasks can be carried out:
- assessing the quality of the raw data
- cleansing and visualization
- creating new features & modeling
- assessing the ML model and visualizing the predictions
The analysis module provides more than 60 built-in, highly customizable processors to apply to data. It also includes 15 different chart types. The machine learning options are very rich, including powerful feature engineering, easy and customizable cross-validation and hyperparameter optimization, various objective metrics, threshold optimization, and built-in charts dedicated to assessing models.
Machine Learning in DSS
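For readers new to the terms, the cross-validation and hyperparameter optimization mentioned above can be sketched in a few lines of plain Python. This is a generic illustration, not the code the analysis module runs: the toy data, the one-parameter ridge model, the candidate grid, and all function names are assumptions chosen to keep the example self-contained.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k contiguous folds over n rows."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size


def fit_ridge_1d(xs, ys, lam):
    """Closed-form slope of y ~ w * x with an L2 penalty of strength lam."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def cv_mse(xs, ys, lam, k=3):
    """Mean squared error of the model, averaged over k held-out folds."""
    errors = []
    for train, test in k_fold_indices(len(xs), k):
        w = fit_ridge_1d([xs[i] for i in train], [ys[i] for i in train], lam)
        errors.append(mean((ys[i] - w * xs[i]) ** 2 for i in test))
    return mean(errors)


def grid_search(xs, ys, grid, k=3):
    """Return the penalty with the lowest cross-validated error, plus all scores."""
    scores = {lam: cv_mse(xs, ys, lam, k) for lam in grid}
    best = min(scores, key=scores.get)
    return best, scores


# Toy data, roughly y = 2x (made up for this illustration).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]
best_lam, scores = grid_search(xs, ys, grid=[0.0, 0.1, 1.0, 10.0])
print(best_lam, scores[best_lam])
```

The point of the design is that each candidate setting is judged only on rows it was not trained on, so the chosen hyperparameter reflects expected performance on new data rather than fit to the training set.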
Even though this tool sounds complicated, it has been designed for beginners as well as advanced users who are in the very early stages of thinking about their data analysis experiments. It has in-place documentation and intuitive navigation to help everyone master the key concepts of data science.
Model Analysis in DSS
Advanced data scientists who like to code may also use the analysis module as a powerful starting point. They can then export a model as readable Python code that they can play with as they wish.
Now, when an experiment is satisfactory, a data scientist or an analyst delivers the relevant content to the flow. This can be a transformation of the original data, a model (to be retrained or not), a chart, or some code involving any of the above.
The CRISP-DM loop is closed, and rapid iteration is now within reach for data teams!
Data Science Studio is available for free in its Community Edition or Enterprise Trial.