Next up in our ML Research in Practice series, we’ll discover another way that the Dataiku AI Lab supports the academic machine learning (ML) community. Check out our previous blogs on predicting performance drop under domain shift and predicting GDP with Google Trends if you haven’t already!
The Dataiku AI Lab recently met with Nathan Noiry, COO, co-founder, and head of research at Althiqa (and his co-authors), whose research paper “Beyond Mahalanobis-Based Scores for Textual OOD Detection” was accepted at NeurIPS 2022. They discussed the challenges of detecting distributional shift in the context of textual data.
Training ML algorithms requires a diverse dataset that reasonably represents the target population. However, it is impossible to predict how the data will change over time, and when the distribution of incoming data deviates significantly from the training distribution, the model's predictions degrade. There are two approaches to tackling this problem:
- Distribution vs. distribution: Collect incoming data at regular intervals (say, every five months), then compare the distribution observed over each interval to the training distribution. This amounts to computing a distance between two probability distributions. However, waiting for enough data to accumulate may not be practical in real-life scenarios.
- Point vs. distribution: Ideally, instead of waiting for distributional information, an efficient system would recognize unusual incoming data points on the fly and prevent inaccurate predictions. The goal is to design out-of-distribution detectors that can decide, for a new data point x, whether it falls inside or outside the expected distribution.
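The two settings above can be sketched on toy one-dimensional data. This is a generic illustration, not the paper's method: the batch comparison uses a two-sample Kolmogorov-Smirnov test, and the single-point check uses a simple z-score; the data, thresholds, and variable names are all arbitrary choices for the example.

```python
# Illustrative sketch of the two drift-detection settings (not the paper's
# method). Data, thresholds, and names are arbitrary assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
train = rng.normal(loc=0.0, scale=1.0, size=5000)   # training distribution
recent = rng.normal(loc=0.8, scale=1.0, size=5000)  # drifted batch collected later

# 1) Distribution vs. distribution: compare a whole batch of new data
#    to the training sample with a two-sample Kolmogorov-Smirnov test.
ks = stats.ks_2samp(train, recent)
batch_drifted = ks.pvalue < 0.01

# 2) Point vs. distribution: score a single incoming point x against the
#    training distribution (here, a crude z-score threshold).
x = 4.2
z = abs(x - train.mean()) / train.std()
point_is_ood = z > 3.0

print(batch_drifted, point_is_ood)
```

The batch test needs many samples before it can fire, which is exactly the practicality problem noted above; the point-wise score reacts immediately, which is why OOD detectors target that setting.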
Standard detection methods handle textual and other high-dimensional data poorly because they were not designed for it: examining one feature in isolation ignores its associations with the other features, leading to incorrect conclusions.
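A small sketch of why correlations matter, using the Mahalanobis distance named in the paper's title as the classical correlation-aware baseline. This is an illustrative example with made-up data, not the authors' detector: a point can look unremarkable on each feature separately while its combination of values breaks the correlation structure learned from training data.

```python
# Minimal sketch (illustrative, not the paper's method): per-feature checks
# miss an outlier that a Mahalanobis-style score catches, because the latter
# accounts for correlations between features.
import numpy as np

rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.95], [0.95, 1.0]])  # two strongly correlated features
train = rng.multivariate_normal([0.0, 0.0], cov, size=10000)
mu = train.mean(axis=0)
inv_cov = np.linalg.inv(np.cov(train, rowvar=False))

# Each coordinate of x is unremarkable in isolation (~1.5 std devs),
# but the combination contradicts the positive correlation in training data.
x = np.array([1.5, -1.5])

per_feature_z = np.abs((x - mu) / train.std(axis=0))  # small: looks normal
d = x - mu
mahalanobis = np.sqrt(d @ inv_cov @ d)                # large: flags the point

print(per_feature_z, mahalanobis)
```

Mahalanobis-based scores extend this idea to the high-dimensional embeddings of text classifiers, and they are the baseline the paper sets out to move beyond.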
That’s why Nathan Noiry and his team introduce TRUSTED, a new out-of-distribution (OOD) detector for classifiers based on Transformer architectures that meets operational requirements: it is both unsupervised and fast to compute.
What's coming up next? In the series finale of ML Research in Practice, we’ll delve into IGN’s periodic production of national fine-scaled land cover maps and see how deep learning methods are most helpful in this setup.