To Annotate or Not? Predicting Performance Drop Under Domain Shift

Tech Blog Bhawna Krishnan

Dataiku’s AI Lab is dedicated to contributing to the academic machine learning (ML) community and developing tools to assist everyone on their data journey. Beyond its research mission, the Lab also brings its expertise into the real world, working hand-in-hand with Dataiku data scientists and architects to solve enterprise data challenges. Dataiku researchers' interests are very broad, from MLOps, active learning, and semi-supervised learning to AutoML and meta ML.

In this ML Research in Practice session, we talk to Matthias Gallé and Adrian Mos from Naver Labs Europe. Naver Labs Europe is a research and development center for Naver Corporation, a South Korean technology company. It is located in Grenoble, France, and focuses on NLP, computer vision, and ML. Naver Labs Europe aims to develop cutting-edge technologies that can be applied to Naver's products and services, such as its search engine and AI-powered personal assistant. The center conducts research in collaboration with European universities and research institutions.

Let’s dig deeper into what Naver Labs Europe does and learn more about their recent paper on predicting performance drop under domain shift.


Performance decline caused by domain shift is a common challenge for NLP models in production. Measuring that decline usually means regularly collecting and annotating fresh evaluation datasets, a process that is both costly and time-consuming.

This is the first of many technical webinar recaps that we will be providing. In this first installment, we discussed how the metrics used to predict the effect of domain shift depend on both the model and the task. Naver Labs Europe demonstrated the effectiveness of its method on sentiment classification and sequence labeling tasks, showing that the approach is both efficient and cost-effective. The method can predict performance drops with an error as low as 2.15% for sentiment classification and 0.89% for part-of-speech (POS) tagging.
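
To make the idea of predicting a performance drop a bit more concrete, here is a minimal sketch of one way it could be set up. It is not the method from the paper or the webinar. It assumes we already have a single numeric domain-shift score for a handful of target domains where annotated data exists, fits a plain least-squares line from that score to the observed accuracy drop, and then applies the line to a new, unannotated domain. All names (`fit_line`, `predict_drop`, `shift_scores`, `observed_drops`) and all numbers are hypothetical and chosen only for illustration.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def predict_drop(shift_score, slope, intercept):
    """Predicted performance drop (in accuracy points) for one shift score."""
    return slope * shift_score + intercept


def mean_absolute_error(y_true, y_pred):
    """Average absolute gap between observed and predicted drops."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))


# Hypothetical toy values, NOT results from the paper:
# one domain-shift score per annotated target domain, and the
# accuracy drop (in points) actually measured on that domain.
shift_scores = [0.10, 0.25, 0.40, 0.55, 0.70]
observed_drops = [1.2, 2.3, 4.1, 5.4, 7.0]

slope, intercept = fit_line(shift_scores, observed_drops)

# In-sample error of the fitted line on the annotated domains.
fitted = [predict_drop(s, slope, intercept) for s in shift_scores]
mae = mean_absolute_error(observed_drops, fitted)

# Estimated drop for a new domain where no annotations exist yet.
new_domain_drop = predict_drop(0.50, slope, intercept)
print(f"in-sample MAE: {mae:.2f} points, predicted drop: {new_domain_drop:.2f} points")
```

In a setup like this, the predicted drop could inform whether annotating a new evaluation set is worth the cost. Note that the error here is computed on the same domains used for fitting; a held-out or leave-one-out estimate would be a more honest measure.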

Keep an eye out for future recaps on technical webinars, where we will share more insights and information on cutting-edge developments in the field.
