First, Who Is the Lab?
The Lab is a key part of Dataiku’s research and development processes. From research in causal inference, dataset shift, and active learning to hands-on projects with data scientists and engineers, the Lab is hard at work establishing eminence in the machine learning community and helping position Dataiku at the forefront of the field.
The 3 Main Roles of the Lab:
- Understand: The Lab reduces the scientific uncertainty of ML topics to help craft generic and robust features.
- Connect: The Lab builds bridges between the research community and Dataiku customers on their path to becoming truly AI enterprises. Additionally, the Lab produces open-source software for ML practitioners (for example, Cardinal and Mealy).
- Be curious: The Lab never stops asking questions to discover and push forward new technologies!
Does the Lab Contribute to the Dataiku Platform?
Yes!
The Lab’s work applies to Dataiku in a multitude of ways. The Lab works at the root of the system, finding new use cases, exploring various techniques and technologies, and then identifying which of these discoveries could be beneficial to implement in Dataiku. Many of the features that make Dataiku agile and accessible for all users in the enterprise are curated by the Lab, including an ML-assisted labeling plugin, a model drift monitoring plugin, and model error analysis plugins. In the latest, soon-to-be-released version of Dataiku, the Lab has backed core/native drift monitoring in the model evaluation store, what-if accelerators, model stress tests, and more.
Double duty: in addition to exploration and identification, the Lab also consistently takes on consultancy and collaboration for projects, working with Dataiku developers to orient feature decisions (both core and plugin) around key goals and to provide insights into the complexity of algorithms. The Lab adapts to each project, sometimes working on small parts of features and sometimes owning the whole creation process end to end.
What’s All the Talk of Published Papers About?
The Lab had two papers published this year through ECML PKDD and the co-located IAL workshop. Publishing papers helps the Lab build recognition and acceptance within the data science community.
In addition, getting published means receiving critical feedback from others in the field, as the process involves multiple rounds of peer review. Peer review is an engaged and intense process that culminates in the scoring and ranking of the submitted papers, and only the best make the cut to be published at the end of the day! Fun fact: the Lab also peer reviews papers from others in the community.
It isn’t just for the researchers, though: the findings revealed in these papers have implications for real-world practitioners, including Dataiku users.
The findings of Ensembling Shift Detectors: an Extensive Empirical Evaluation concern dataset shift (when the data used to train a machine learning model diverge from the data the model sees in operation). Specifically, the work shows that ensembling different shift detectors and adapting the strategy to the data at hand allows for more robust drift identification across a large spectrum of shift types. Proper use of shift detectors can improve both current and future models, including in Dataiku. For a more in-depth look at the topic, check out the paper or this blog.
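To make the ensembling idea concrete, here is a minimal sketch in Python. It is not the paper’s protocol: the two detectors (a per-feature Kolmogorov-Smirnov test and a domain classifier) and the majority-vote aggregation are illustrative assumptions built on scipy and scikit-learn.

```python
# A minimal sketch of ensembling shift detectors (not the paper's exact protocol).
# Each detector votes "shift" / "no shift"; the ensemble flags drift on a majority vote.
import numpy as np
from scipy.stats import ks_2samp
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split


def ks_detector(reference, production, alpha=0.05):
    """Univariate Kolmogorov-Smirnov test per feature; flag shift if any feature drifts."""
    p_values = [ks_2samp(reference[:, j], production[:, j]).pvalue
                for j in range(reference.shape[1])]
    # Bonferroni correction because many features are tested at once
    return min(p_values) < alpha / reference.shape[1]


def domain_classifier_detector(reference, production, auc_threshold=0.6):
    """Train a classifier to tell reference from production data; if it succeeds
    (AUC well above 0.5), the two distributions differ."""
    X = np.vstack([reference, production])
    y = np.concatenate([np.zeros(len(reference)), np.ones(len(production))])
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    return auc > auc_threshold


def ensemble_shift_detector(reference, production, detectors):
    """Majority vote over the individual detectors."""
    votes = [detector(reference, production) for detector in detectors]
    return sum(votes) > len(votes) / 2


# Toy usage: the production data has a shifted first feature (covariate shift).
rng = np.random.default_rng(0)
reference = rng.normal(size=(500, 4))
production = rng.normal(size=(500, 4))
production[:, 0] += 1.0
print(ensemble_shift_detector(reference, production,
                              [ks_detector, domain_classifier_detector]))
```

The point of the ensemble is that no single detector covers every shift type, so combining detectors with different strengths and adapting the mix to the data makes detection more robust.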
The next published paper, Sample Noise Impact on Active Learning, investigates the issue of noise in samples during labeling and proposes new methodologies to mitigate this problem. A research protocol is proposed to detect and evaluate the impact of noisy samples on active learning, along with a method that is more robust to them. This work paves the way toward better detection of noisy samples in real-life experiments. If you want a deeper dive into active learning, take a look at this overview or this blog, both of which relate to the Lab’s work in this area.
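For readers unfamiliar with the setup, here is a minimal sketch of an uncertainty-sampling active learning loop with a simulated noisy annotator. It illustrates the problem the paper studies rather than the paper’s proposed method; the dataset, model, and noise rate are all assumptions made for the sake of the example.

```python
# A minimal uncertainty-sampling loop with simulated label noise (an illustration
# of the problem studied in the paper, not its proposed method).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_pool, y_pool = X[:1500], y[:1500]
X_test, y_test = X[1500:], y[1500:]

noise_rate = 0.2          # probability that the annotator returns a wrong label
labeled = list(rng.choice(len(X_pool), size=20, replace=False))
unlabeled = [i for i in range(len(X_pool)) if i not in labeled]
observed_labels = {i: y_pool[i] for i in labeled}   # assume the seed set is clean

model = LogisticRegression(max_iter=1000)
for _ in range(10):                                  # 10 active learning iterations
    model.fit(X_pool[labeled], [observed_labels[i] for i in labeled])

    # Uncertainty sampling: query the points whose predicted probability is closest to 0.5
    proba = model.predict_proba(X_pool[unlabeled])[:, 1]
    uncertainty = 1 - np.abs(proba - 0.5) * 2
    queried = [unlabeled[i] for i in np.argsort(uncertainty)[-10:]]

    # The simulated annotator flips some labels: this is the "sample noise"
    # whose impact on active learning the paper measures.
    for i in queried:
        flip = rng.random() < noise_rate
        observed_labels[i] = 1 - y_pool[i] if flip else y_pool[i]
        labeled.append(i)
        unlabeled.remove(i)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

Because active learning deliberately queries the most ambiguous points, noisy answers on exactly those points can hurt more than they would in random labeling, which is why detecting and handling them matters.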
What’s Rising to the Surface?
When we sat down to chat with team members from the Lab, we were curious to find out what trends they see rising in the data science and machine learning community, as well as to hear what the Lab is currently working on that we can look forward to. Here is what they shared with us:
When following trends in the community, there are obviously many topics that come up, but you might want to keep a special eye out for anything related to these three lines of research:
- Data-centric AI rather than pure ML: many organizations are focused on improving data rather than improving code, highlighting how good, consistent data is more important than big data.
- Learning without human-labeled data: research in self-supervised learning, especially for tabular data, and unsupervised pre-training strategies is still very active, as are studies on the importance of relevant data augmentation for generalization.
- Trustworthy ML: focuses on how to build explainable, fair, privacy-preserving, causal, and robust models. In particular, there is growing attention on dataset shift and on strategies to improve robustness by learning from multiple domains. As companies increase their AI maturity and work to manage their AI models, these push-to-production topics become more and more the focus of the conversation.