Did MLflow Kill the Data Science Star?

Dataiku Product Iulia Feroli

Picture this. It’s 2012 and you’re a data scientist. You have the “sexiest job of the 21st century.” Your boss is impressed with the high-performing algorithms you’ve developed, and other departments are in awe of your ability to turn seemingly nonsensical data into actionable business insights. They think you’re a genius at the very least, if not a magician. Life is good.

And yet, as time goes by, expectations for your field keep shifting. The standards we expect of data science projects have risen to include concepts like scalability, repeatability, governance, and “enterprise grade robustness.” Meanwhile, the profile of a “data scientist” continues to evolve: if you ask 100 companies what their data scientists do today, you may get 100 different job descriptions. The role has broadened in scope and in the educational and professional backgrounds it draws on, and the number of people involved in data science projects has grown as well.

With all the extra expectations, are data scientists still working on the “sexy” parts of a project? Do they still speak the same language, and can they collaborate when their ways of working vary so much?

How can this balance exist, where data scientists come from diverse backgrounds and skill sets, yet the projects they produce face increasingly high requirements and thresholds for success?

Good Automation

The answer to this question turned out to be the very thing we feared advanced AI would be used for: job automation. Not robots taking over data science, of course, but rather a step back to evaluate and proactively re-engineer the data scientist’s job. Anyone looking at the day-to-day job can identify components that require creative human solutions, as well as more mundane tasks like debugging a colleague’s code, managing dependencies, or hacking together glue code to connect various tools.

Data scientists often end up spending too much of their time on these mundane tasks. Not only are they not fun, they also take away from the time a data scientist can spend exploring use cases with their unique expertise and vision. To make optimal use of resources and get successful data science projects into production, businesses need to shift this paradigm, automate the mundane, and introduce standardization.

Enter MLflow

MLflow has introduced an open source standard for this way of working, supporting model standardization, experiment tracking, and more streamlined model workflows overall. You, along with your diverse array of data science colleagues, can develop models for different types of projects or industries using consistent formats and methodologies. You can easily pick up and understand existing or old projects, or sort through experiment runs to analyze past model metrics, without spending valuable time on menial tasks. And this is only step one.
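To make the idea of sorting through experiment runs a bit more concrete, here is a minimal sketch in plain Python. It is not the MLflow API: the `Run` class, the `best_runs` helper, and every number in it are hypothetical and exist only to illustrate what it means to record parameters and metrics per run in one consistent shape, then rank the runs by a metric.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """One experiment run: a name, its hyperparameters, and its metrics."""
    name: str
    params: dict
    metrics: dict = field(default_factory=dict)


def best_runs(runs, metric, higher_is_better=True):
    """Return the runs that logged `metric`, best first."""
    scored = [r for r in runs if metric in r.metrics]
    return sorted(scored, key=lambda r: r.metrics[metric], reverse=higher_is_better)


# Hypothetical runs with made-up numbers, purely for illustration.
runs = [
    Run("baseline", {"max_depth": 2}, {"auc": 0.71}),
    Run("deeper", {"max_depth": 6}, {"auc": 0.78}),
    Run("too-deep", {"max_depth": 12}, {"auc": 0.74}),
    Run("crashed", {"max_depth": 20}),  # this run never logged a metric
]

ranked = best_runs(runs, "auc")
for run in ranked:
    print(run.name, run.params, run.metrics["auc"])
```

Because every run has the same shape, comparing a colleague’s old experiments with your own becomes a one-line query rather than an archaeology project. A tracking tool applies the same principle, with persistence and a UI on top.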

Get More Out of Your Models With MLflow + Dataiku

Over the last few years and releases, Dataiku has continuously improved its integration with MLflow. If you have already adopted the MLflow framework, you can keep its familiar benefits and gain the added value of native Dataiku functionalities. After benefitting from MLflow’s automation, standardization, and model tracking, you can import those models into Dataiku and further benefit from easy model interpretability, deployment, and monitoring.

Dataiku’s interpretability features allow you to understand your models better at a glance. You can use partial dependence to see how your model is influenced by the values of each variable, subpopulation analysis to track any potential bias on subsets of data, and individual explanations to dive deep into probability extremes. This provides you with a more comprehensive view of your MLflow models, allowing you to iterate upon them and quickly draw meaningful insights for your business. Additionally, these features help you explore a model’s fairness and satisfy governance policies, which are important points for auditors.
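For readers who haven’t met partial dependence before, the general technique fits in a few lines of plain Python. This is a sketch of the textbook idea, not Dataiku’s implementation: the `partial_dependence` function, the `toy_model` stand-in, and the feature names are all hypothetical. For each candidate value of one feature, that feature is overwritten in every row and the model’s predictions are averaged.

```python
def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each value in `grid`."""
    curve = []
    for value in grid:
        # Overwrite one feature in every row, leave the others untouched.
        preds = [predict({**row, feature: value}) for row in rows]
        curve.append((value, sum(preds) / len(preds)))
    return curve


# Hypothetical stand-in for a trained model: any callable that maps a
# feature dict to a number would work here.
def toy_model(row):
    return 2.0 * row["tenure"] + 0.5 * row["age"]


# Made-up rows, purely for illustration.
rows = [
    {"tenure": 1, "age": 20},
    {"tenure": 3, "age": 40},
]

curve = partial_dependence(toy_model, rows, "tenure", grid=[0, 1, 2])
for value, avg_prediction in curve:
    print(value, avg_prediction)
```

Plotting the resulting pairs gives the familiar partial dependence curve: a rising line means the model’s output tends to increase with that variable, all else held as observed.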

As you develop and iterate upon models in Dataiku, you can track your experiments in an MLflow-compatible format, compare models using a variety of metrics through interactive dashboards, and finally choose the best versions to deploy. With just a few clicks you can deploy a real-time endpoint hosted on an automatically scalable Kubernetes cluster, or create an automated pipeline that refreshes batch deployments if the underlying data quality declines. You can also export models out of Dataiku for use in any MLflow-compatible scoring system that supports the “python_function” flavor of MLflow.

Dataiku helps to simplify and automate the lifecycle of the model, with out-of-the-box functionalities that take over many of the mundane tasks typically delegated to data scientists. Instead, you can focus on more exciting work, like model development and improvement, or even move on to new use cases.

Taking Back Data Science

As we transition from notebooks and ML scripts that were run manually to highly available hosted solutions and an increasingly cloud-focused, hyperscale world, the data scientist role and its day-to-day focus will keep shifting.

To keep the focus on the important human-centered decisions that make data scientists so valuable, teams can adopt standards, frameworks, and tools that automate the mundane aspects of the job. In an ideal setup, teams wouldn’t need to actively think about standardized code, experiment tracking, interpretability, model versioning, deployment, or monitoring, and could focus instead on new business pursuits.

Read more about Dataiku’s vision for achieving Everyday AI and the latest features from the newest release on our blog.
