2020 was the first full year of operation for the Dataiku AI Lab, and over that year the team firmly embedded Dataiku in the machine learning (ML) research community. Here’s what we were up to in 2020, how this research might affect the broader world of ML, and what it means for the direction of Dataiku’s offerings to come.
Back Up: What’s the AI Lab?
The Dataiku AI Lab is a research team that seeks to contribute to the academic ML community as well as develop tools to assist everyone on their data journey. Dataiku researchers' interests are very broad, from active learning and semi-supervised learning to AutoML and meta ML.
Beyond its research mission, the AI Lab also brings its expertise into the real world, working hand-in-hand with Dataiku data scientists and architects to solve enterprise data challenges.
Don't forget to sign up for the 2021 edition of the ML trends webinar to hear from the Dataiku AI Lab team on what's to come!
Active Learning
We have been very active working on active learning (get it?!). We’ve open-sourced our work on active learning in the cardinal package and presented it at the latest PyData Global conference and at an ICDM workshop.
Alexandre Abraham went in-depth on the topic in February on Data from the Trenches, giving a high-level view of existing active learning packages and how they compare, then diving into some specific features of those packages. He covered the simplest approaches available in each package, reviewed some aspects of the code, and presented the most cutting-edge methods proposed.
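For readers new to the topic, here is a minimal sketch of one of the simplest active learning strategies, least-confidence uncertainty sampling. It is written against plain scikit-learn rather than cardinal or any of the packages reviewed in the post, and the helper name `least_confident_indices`, the synthetic dataset, and the batch sizes are all assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def least_confident_indices(model, X_pool, batch_size):
    """Return positions of the pool samples the model is least sure about."""
    proba = model.predict_proba(X_pool)
    confidence = proba.max(axis=1)  # probability of the predicted class
    return np.argsort(confidence)[:batch_size]


# Synthetic data standing in for a real, mostly unlabeled dataset.
X, y = make_classification(n_samples=600, n_features=10, random_state=0)
labeled = np.arange(20)        # pretend only these rows are labeled at the start
pool = np.arange(20, len(X))   # indices of the unlabeled pool

for _ in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    picked = pool[least_confident_indices(model, X[pool], batch_size=10)]
    # In practice a human annotator would supply the labels for `picked` here.
    labeled = np.concatenate([labeled, picked])
    pool = np.setdiff1d(pool, picked)
```

Each round retrains the model, asks for labels on the ten samples it is least confident about, and moves them from the pool to the labeled set. More advanced strategies mostly change how the batch is picked, not the shape of this loop.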
MLOps
As organizations worldwide struggled to adjust to changing human behavior, there was perhaps no bigger topic in 2020 than MLOps. In step with that trend, we worked hard to understand how well different drift detection techniques perform and which of them would best fit the needs of Dataiku users.
MLOps systems are designed to trigger alerts when drift is detected, but in order to make decisions about the strategy to follow next, it’s also important to understand what is actually changing in that data and what kind of abnormality the model is facing.
Simona Maggio explored the topic at length, including a post on Data from the Trenches (Why Is My Data Drifting?) that describes how to leverage a domain-discriminative classifier to identify the most atypical features and samples, plus how to use SHapley Additive exPlanations (SHAP) to boost the analysis of the data corruption.
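The domain-discriminative classifier idea can be sketched in a few lines: label the reference data 0 and the new data 1, train a classifier to tell them apart, and read drift off how well it succeeds. The sketch below is an illustration under stated assumptions, not the code from the post. The function name `domain_classifier_drift`, the random forest, and the synthetic data are assumptions, and impurity-based feature importances stand in for the SHAP analysis the post describes.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict


def domain_classifier_drift(X_ref, X_new, random_state=0):
    """Train a classifier to tell reference rows from new rows.

    Returns the cross-validated ROC AUC (close to 0.5 means the two samples
    are hard to tell apart), the classifier's feature importances, and an
    atypicality score for each new row.
    """
    X = np.vstack([X_ref, X_new])
    # Domain labels: 0 = reference data, 1 = new data.
    y = np.concatenate([np.zeros(len(X_ref)), np.ones(len(X_new))])

    clf = RandomForestClassifier(n_estimators=100, random_state=random_state)
    # Out-of-fold probabilities, so the AUC is not inflated by overfitting.
    proba = cross_val_predict(clf, X, y, cv=5, method="predict_proba")[:, 1]
    auc = roc_auc_score(y, proba)

    # Refit on all rows to see which features separate the two domains.
    clf.fit(X, y)
    return auc, clf.feature_importances_, proba[len(X_ref):]


# Synthetic example: feature 2 is shifted in the new data.
rng = np.random.default_rng(0)
X_ref = rng.normal(size=(500, 4))
X_new = rng.normal(size=(500, 4))
X_new[:, 2] += 2.0

auc, importances, new_scores = domain_classifier_drift(X_ref, X_new)
print(f"domain AUC: {auc:.2f}")
print("most drifted feature:", int(np.argmax(importances)))
```

An AUC well above 0.5 signals that the new data is distinguishable from the reference data, the importances point to the features driving that difference, and the highest values in `new_scores` flag the most atypical new samples.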
AutoML
Another important task we took on this year was to support the AutoML effort by running large-scale experiments to better tune the hyperparameters of each algorithm available in Dataiku DSS. We are very happy that all of our work was made available to everyone through our blog, Data from the Trenches, and the work on AutoML is no exception. Aimee Coelho published Hunting for the Optimal AutoML Library to highlight this work.
What Else & What’s Next
In addition to all these topics, as part of our engagement in the open source community, we were happy to be part of the scikit-learn consortium for a second year in 2020. We also had the chance to host a scikit-learn sprint (before lockdown, of course), with many Dataikers actively contributing.
What will 2021 hold for ML research and for the Dataiku AI Lab? We’re hosting a webinar on January 20 to share our thoughts and focus areas for the year ahead, so don’t forget to sign up, and we’ll see you there to continue the discussion!