A Primer on Data Drift

Tech Blog Du Phan

When a machine learning model is deployed in production, the main concern of data scientists is whether the model remains pertinent over time. Is the model still capturing the patterns of new incoming data, and is it still performing as well as it did during its design phase?

Just like with cars, model drift is a real issue

Monitoring model performance drift is a crucial step in production ML; however, in practice, it proves challenging for many reasons, one of which is the delay in retrieving the labels of new data. Without ground truth labels, drift detection techniques based on the model’s accuracy are off the table.

In this article, we will take a brief but deep dive into the underlying logic behind drift and present a real-life use case with different scenarios that can cause models and data to drift.


Measuring Data Drift: A Practical Example

Let’s say we want to predict the quality of the Bordeaux bottles at the wine shop near our place. This will help us save some money (and avoid some not-worth-it hangovers).

To do this, we will use the UCI Wine Quality dataset as training data, which contains information about red and white variants of the Portuguese “Vinho Verde” along with a quality score varying between 0 and 10.
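
If you want to reproduce the setup outside Dataiku, a minimal pandas sketch for assembling the dataset could look like this (it assumes the standard semicolon-separated CSV layout of the UCI files; the URLs and column names may change over time):

```python
import pandas as pd

# Red and white variants are distributed as separate, semicolon-separated CSVs.
RED_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
WHITE_URL = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"

red = pd.read_csv(RED_URL, sep=";")
white = pd.read_csv(WHITE_URL, sep=";")

# Keep track of the wine type, then stack both variants into a single table.
red["type"] = "red"
white["type"] = "white"
wine = pd.concat([red, white], ignore_index=True)
```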

The following features are provided for each wine: type, fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol rate.

This is a snapshot of the dataset in Dataiku (you can get Dataiku Free Edition here if you want to explore the data yourself); for an in-depth analysis of the dataset, you can also check out this Kaggle kernel.

To simplify the modeling problem, let’s say that a good wine is one with a quality score equal to or greater than 7. The goal is thus to build a binary model that predicts this label from the wine’s attributes.
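
In code, the label can be derived in one line (a sketch, assuming the combined wine dataframe from above with its quality column):

```python
# A wine is labeled "good" (1) when its quality score is 7 or higher.
wine["good"] = (wine["quality"] >= 7).astype(int)
```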

For purposes of demonstrating data drift, we explicitly split the original dataset into two:

  • The first one contains all wines with an alcohol rate above 11%, wine_alcohol_above_11.
  • The second one contains those with an alcohol rate below 11%, wine_alcohol_below_11.

Such clear differences between the two datasets will be formalized as data drift. We split wine_alcohol_above_11 to train and score our model, and the second dataset wine_alcohol_below_11 will be considered as new incoming data that needs to be scored once the model has been deployed.
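
In pandas, the split could be sketched as follows (dataframe and column names follow the assumptions above; the boundary case of exactly 11% is assigned to the first dataset here, which is an arbitrary choice):

```python
# Training-time data: wines with an alcohol rate of 11% and above.
wine_alcohol_above_11 = wine[wine["alcohol"] >= 11].copy()

# "New incoming data" to score after deployment: wines below 11%.
wine_alcohol_below_11 = wine[wine["alcohol"] < 11].copy()
```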

At this point, you would naturally expect a discrepancy between the scores. Why is that? Because, as you'll see in the next section, this setup clearly violates the underlying IID assumption.


A Small Detour Through Statistical Learning

Let’s consider a classical supervised learning task with training set S = {(xᵢ, yᵢ)}, with xᵢ ∈ X and yᵢ ∈ Y, and a risk function r: Y × Y → ℝ. We wish to learn a predictive function f : X → Y that minimizes not only the empirical risk (on the set S) but also the generalization risk (on unseen data). Under the statistical learning framework, the generalization risk can be faithfully estimated from the empirical risk. This result, from PAC (Probably Approximately Correct) learning theory, is based on Hoeffding’s inequality, which relies on the infamous Independent and Identically Distributed (IID) assumption.

 
Generalization Bounds Theorem
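
For reference, one standard form of such a bound (a sketch for a finite hypothesis class ℋ, a loss bounded in [0, 1], and a sample of size m; exact constants vary across formulations) states that, with probability at least 1 − δ over the draw of S, for every f in ℋ:

$$
R(f) \;\le\; \widehat{R}_S(f) + \sqrt{\frac{\ln|\mathcal{H}| + \ln(2/\delta)}{2m}}
$$

where R(f) is the generalization risk and R̂_S(f) the empirical risk on S. The gap shrinks as m grows, but the derivation only goes through under the assumption discussed next.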

Under this assumption, each pair of observations (x, y) is drawn independently from the same joint distribution p(X,Y). Intuitively, this means that the training set S is a reasonable proxy through which we can get partial information about the underlying true relationship between X and Y.

If the above assumption does not hold, the empirical risk on the training set is no longer a robust estimate of the expected risk on new, unseen data. In other words, the test error is no longer a good estimation of future errors.

With that, we will now investigate different scenarios where this assumption fails and causes data drift.

What Is Data Drift?

In the literature, several terms and formulations are used to describe the problem of data drift; here, we employ the framework presented by Quiñonero Candela.


Back to Our Example…

To keep the problem simple, we employ a univariate point of view, focusing on the feature alcohol. However, the reasoning can be generalized to multiple dimensions, which is usually the case in practice (several features drifting at the same time).

The question we want to address here is:

Will the performance of a model trained on my training set (wines with an alcohol level above 11%) change drastically when scoring new bottles (whose alcohol level is below 11%)?

One important detail to note is that the assessment is about the comparative performance of the model between original and new data, and not about the absolute performance of the model.

If we have the ground truth labels of the new data, one straightforward approach is to score the new dataset and then compare the performance metrics between the original training set and the new dataset. However, in real life, acquiring ground truth labels for new datasets is usually delayed. In our case, we would have to buy and drink all the bottles available, which is a tempting choice… but probably not a wise one.

Therefore, in order to react in a timely manner, we will need to assess performance based solely on the features of the incoming data. The logic is that if the data distribution diverges between the training phase and the testing phase, it is a strong signal that the model’s performance won’t be the same.

With that being said, the question above can be answered by checking data drift between the original training set and the incoming test set.
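
As a concrete illustration of such a univariate check, here is a minimal sketch using a two-sample Kolmogorov–Smirnov test on the alcohol feature (dataframe names follow the assumptions above; in practice you would run a test of this kind per feature, or rely on a dedicated drift-detection tool):

```python
from scipy.stats import ks_2samp

# Compare the training-time distribution of "alcohol" with the incoming data.
statistic, p_value = ks_2samp(
    wine_alcohol_above_11["alcohol"],
    wine_alcohol_below_11["alcohol"],
)

# A small p-value (e.g., below 0.05) is strong evidence that the two samples
# come from different distributions, i.e., that this feature has drifted.
if p_value < 0.05:
    print(f"Drift suspected on 'alcohol' (KS statistic={statistic:.3f}, p-value={p_value:.3g})")
```

By construction of our example, such a test will flag the alcohol feature, since the two datasets do not even overlap on it.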

The Drift Zoology

Now let’s get back to our wines and take a closer look at different situations where data drift occurs. Here is how our model is trained and originally tested throughout this section.

We randomly split the wine_alcohol_above_11 dataset into two:

  • The first one, denoted alcohol_above_11_train, is used to train the model and will be further split into training and validation sets.
  • The other one, called alcohol_above_11_test, is used for testing the model.

We fit a Random Forest model on alcohol_above_11_train. The model achieves an F1 score of 0.709 on the hold-out set.
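
A minimal scikit-learn sketch of this step (hyperparameters, split sizes, and random seeds are illustrative, not the exact settings behind the 0.709 figure):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Use the numeric wine attributes only; drop the target, the raw quality
# score, and the categorical type column.
feature_cols = [c for c in wine_alcohol_above_11.columns if c not in ("quality", "good", "type")]

# Hold out part of wine_alcohol_above_11 as the original test set.
alcohol_above_11_train, alcohol_above_11_test = train_test_split(
    wine_alcohol_above_11, test_size=0.2, random_state=42
)

# Further split the training part into training and validation (hold-out) sets.
train_df, valid_df = train_test_split(alcohol_above_11_train, test_size=0.2, random_state=42)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(train_df[feature_cols], train_df["good"])

# Score the hold-out set; the exact F1 value depends on the split and settings.
preds = model.predict(valid_df[feature_cols])
print("F1 on hold-out:", f1_score(valid_df["good"], preds))
```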