Easily Compare and Evaluate Models in Dataiku

Use Cases & Projects, Dataiku Product | Joy Looney

Evaluating different models is a critical step in both design and production, especially when you want to run Champion/Challenger or A/B testing strategies. The model evaluation and comparison capabilities of Dataiku make this crucial step even easier for practitioners. 

In a recent Dataiku Product Days session, Dimitri Labouesse, Senior Solutions Engineer at Dataiku, walked us through these features. This blog recaps the session and hits the highlights to help you understand the hype around them.

→ Watch the Full Dataiku Product Days Session

Let’s get right to it. Why do these features matter? When you want to deploy advanced MLOps strategies such as Champion/Challenger or A/B testing, you need a way to directly compare new models against older ones, both to see which delivers better results on your business objectives and to weigh the advantages and disadvantages of replacing a model currently in production.

With these model evaluation and comparison capabilities in Dataiku, you can make informed decisions about the right time to launch new models and inspect the reasons behind degrading model performance. These capabilities are specifically designed to make the whole process effective and easy.

[Image: model comparison]

Use Case 1: Comparing Models to Deploy the Best Possible One 

Let’s say you are designing models and would like to compare multiple trained models to understand how different modeling settings ultimately impact performance. What are the next steps? After you train your models in Dataiku, a simple click on the “compare” action evaluates models trained with different settings or hyperparameters against one another.

[Image: multiclass model comparison]

The comparison feature displays the performance of all of your models side by side, and you can check out performance visualizations, including a decision chart, lift chart, calibration curve, ROC curve, and density chart, for different angles on the comparison. You can also quickly investigate what led to better performance for one model versus the others by clicking into the model information and diving into the features section.

[Image: model comparison graphs]
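For readers who like to see the mechanics, the sketch below shows what this kind of side-by-side comparison boils down to, using scikit-learn on synthetic data. It is an illustration of the general idea only, not the Dataiku API; Dataiku computes and visualizes these metrics and charts for you.

```python
# Illustrative only: a bare-bones side-by-side comparison of two candidate
# models on the same hold-out set (synthetic data, not the Dataiku API).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, model in candidates.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    # Every candidate is scored on the same hold-out set so metrics are comparable.
    print(f"{name}: ROC AUC = {roc_auc_score(y_test, proba):.3f}, "
          f"log loss = {log_loss(y_test, proba):.3f}")
```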

Use Case 2: Automated Input Data Drift Detection 

After a model has been deployed, the API audit logs in Dataiku let you receive, in real time, all of the data that your model scores in production, so you can track changes in that data live as it arrives. To check that the live data still resembles the data on which your model was trained, you can use a simple evaluation recipe in Dataiku to identify and measure data drift.

[Image: data input]

In the model evaluation store, you can find a score for the data drift identified for each model you have evaluated. At any time, you can take a closer look at the scope and impact of the differences observed between your datasets.

[Image: data drift input]

The global drift score gives you an idea of how well a secondary classifier model can distinguish between the current evaluation data and the data originally used to validate the model. In other words, is there a notable difference in the profile or composition of today’s data versus our baseline reference data? Plain English explanations help you understand how to interpret the results of these model drift tests.
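To give a sense of the mechanics, here is a small sketch of the general “secondary classifier” idea described above (an illustration of the technique, not Dataiku’s implementation): label reference rows and current rows, train a classifier to tell them apart, and read the cross-validated AUC as a drift signal.

```python
# Sketch of a domain-classifier drift score (illustrative, not Dataiku's
# implementation). Assumes both dataframes hold the same numeric feature columns.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def drift_score(reference: pd.DataFrame, current: pd.DataFrame) -> float:
    X = pd.concat([reference, current], ignore_index=True)
    # 0 = row comes from the reference (training-time) data, 1 = live data
    y = np.r_[np.zeros(len(reference)), np.ones(len(current))]
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # AUC around 0.5: the classifier cannot separate the samples (little drift);
    # AUC near 1.0: the samples are easy to tell apart (strong drift).
    return cross_val_score(clf, X, y, cv=3, scoring="roc_auc").mean()
```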

You can also investigate further to see which particular variables in your data may have drifted. Finally, a scatter plot displays the feature importance of the original model versus the feature importance of the drift model, helping you assess whether the features that have drifted actually matter to the prediction and are therefore likely to change model behavior.

[Image: data drift input]

[Image: feature drift importance]
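As a rough illustration of the per-feature view (again outside Dataiku, with hypothetical dataframes), a two-sample statistical test per column is one common way to flag which variables have moved:

```python
# Illustrative per-feature drift check: a two-sample Kolmogorov-Smirnov test
# per numeric column (not the Dataiku API).
import pandas as pd
from scipy.stats import ks_2samp

def feature_drift(reference: pd.DataFrame, current: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for col in reference.columns:
        stat, p_value = ks_2samp(reference[col].dropna(), current[col].dropna())
        rows.append({"feature": col, "ks_statistic": stat, "p_value": p_value})
    # Features with large statistics / small p-values have likely drifted; cross-check
    # them against the model's feature importances before acting on them.
    return pd.DataFrame(rows).sort_values("ks_statistic", ascending=False)
```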

In addition to exploring and evaluating input data drift, Dataiku also has separate tools to help you examine prediction drift and performance drift. For example, a fugacity table and density chart enable you to inspect the values and distribution of the expected vs observed predictions.

[Image: evaluation store]
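The same idea can be sketched for prediction drift: compare the mix of predicted classes on the reference data with the mix on live data. The snippet below is a simplified stand-in for what a fugacity-style comparison conveys, not Dataiku’s implementation; `model`, `reference_X`, and `current_X` are hypothetical placeholders.

```python
# Simplified prediction-drift check (illustrative only): compare the share of
# each predicted class on reference data vs. live data.
import pandas as pd

def predicted_class_shares(model, reference_X, current_X) -> pd.DataFrame:
    expected = pd.Series(model.predict(reference_X)).value_counts(normalize=True)
    observed = pd.Series(model.predict(current_X)).value_counts(normalize=True)
    # Large gaps between the two columns suggest the model is predicting a
    # noticeably different mix of classes on live data.
    return pd.DataFrame({"expected": expected, "observed": observed}).fillna(0.0)
```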

Interactively evaluating models is useful, but tracking the performance of a production model requires some level of automation. In Dataiku, you can monitor model performance on live data and automate that monitoring with checks, which lets you track and verify the stability of model performance over time. This can help you make key decisions along the way about the timing of new launches and flag when your model needs attention.

[Image: input data drift]

[Image: evaluation store]
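As a toy illustration of what such an automated check amounts to in plain Python (Dataiku provides this through its own metrics, checks, and scenarios rather than hand-written code), the pattern is simply a monitored metric compared against a threshold:

```python
# Toy monitoring check (illustrative only, not Dataiku's metric/check syntax).
def check_metric(name: str, value: float, threshold: float) -> bool:
    ok = value <= threshold
    print(f"[{'OK' if ok else 'ALERT'}] {name} = {value:.3f} (threshold {threshold})")
    # In a real pipeline, an alert could notify the team or trigger retraining.
    return ok

check_metric("data_drift_score", 0.52, threshold=0.60)  # prints [OK] ...
check_metric("data_drift_score", 0.78, threshold=0.60)  # prints [ALERT] ...
```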

If you are eager to learn more and would like to follow along with a more in-depth walkthrough of these newer features in Dataiku, be sure to check out the full Product Days session.
