Applied Data Science: Optimizing Grocery Store Checkout Times

My neighborhood mini-market has a very clever system for optimizing wait times at the checkout: there are always between two and five people working in the store, working continuously either at the checkout or stocking shelves.

woman with basket walking through grocery store

When the checkout line gets too long, a call goes out immediately over the loudspeaker, and someone stocking shelves will come and open a new checkout line in less than a minute. In short, everything is done with agility to reduce the waiting time for customers, who are (admittedly) overly spoiled.

This agility is possible because the store operates on a small scale and with a relatively high profit margin; such an agile system is not possible in a mega-market: at scale, it must be able to accurately predict the number of people needed in the store in advance and on top of that, to manage everyone's schedules. So how does a big box store optimize checkout times?

Defining the Problem

To make its schedule, a mega-market generally performs an optimization every week or two to establish each employee's schedule. They most often use a so-called operational market research system (e.g., ILOG by IBM or the open source software listed here) which takes into account the scheduling constraints for each employee, regulations, the store's logistical constraints, etc.

The planning system, in fact, has a dominant input variable: predicting busy periods. Precisely: How many customers are likely to go through the checkout during each quarter hour over the two weeks? To make this prediction, the store manager has access to over ten years' worth of historical data. A gold mine! But it's still necessary to be able to exploit the vein of ore.

Time Series Models

To address this problem of evolution over time, two time models are commonly used: so-called "ARIMA" models and triple exponential smoothing models.

These models have the common trait of viewing a time series as a combination of several factors:

Autocorrelation: The coming rush depends on the rush in the previous days.
A trend: The rush can broadly increase or decrease over time.
Seasonality: The rush may correspond to seasonal variations, potentially with several periods (week, month, year).

In the 1960s, these models were the subject of extensive research, especially because of their applications in optimizing financial markets. However, in the case of problems of the "physical world" — such as the rush in a grocery store — considering external data is most important.

For example, it is important to be able to mark holidays in order to account for their effect on the rush. This is exactly the kind of problem that made us develop a companion feature in Dataiku, which makes it possible to automatically populate a time series with detailed information on holidays, regardless of the country in question. Adding this additional data to influence the model is the data scientist's job.

What Additional Data? Unexpected Discoveries

While working on traffic data, we discovered some amusing phenomena:

Our first model predicted plateaus... because indeed, according to the historical data, the store was often "saturated," i.e., the maximum number of customers was reached and limited by the number of checkouts open.
School vacations have significant effects, especially in dense urban areas, and are not on fixed dates.
Paydays are important! The first Saturday after payday is the most important factor, and not all the countries in Europe have the same policy on paydays (monthly, bimonthly).
Near a stadium, knowing the game times is important!
The inclusion of public holidays must be based not only on the day itself, but also in advance (last Saturday before Christmas).
Obviously it is not just a matter of predicting the number of customers, but the checkout time. The average shopping baskets are much bigger on Saturdays.
Don't forget sales or special promotional days!

Bonus: Predicting Confidence Intervals & Calculating a Business Risk

In practice, the system must not only predict sales volume; it must be able to make good decisions. An initial refinement would be to consider confidence intervals.

For example: next Saturday, between 10 a.m. and 10:15 a.m., it is estimated that there will be 140 shoppers who want to go through the checkout, and there will be between 112 and 170 shoppers, with a probability of 90%.

A confidence interval allows the schedule creation system time to work in worst/best case scenarios. To go even further, you can even model an economic gain for each probabilistic scenario. To do this, we must make several economic models:

Modeling costs: For example, a supermarket employee costs the supermarket (including charges and structural costs) $3 per quarter hour.
Modeling profit: For example, the supermarket has a gross profit margin of around 10%, the average basket is $ 30, and a cashier handles about 6 tickets maximum per quarter hour at full throttle.
Analyzing an effective saturation level from the data, and estimating the number of lost sales when the checkout lines become too long.
Estimating the marginal cost of underutilization (a cashier handles one customer or less over a quarter hour).

Now imagine the situation on Saturday at 10 a.m. A rush model is able to extract the probabilities for each scenario, and to derive an economic risk from them with tables in this form:

Number of open checkouts	23	24	25	26	27	28
Probability of saturation causing the loss of 5 sales	91%	74%	54%	32%	30%	28%
Probability of underutilization	7%	14%	15%	19%	31%	48%
Economic risk of loss during each quarter hour	$13.86	$11.52	$8.55	$5.37	$5.43	$5.64

Just for this quarter hour taken individually, having 26 open checkouts would be optimal. However, getting this exact number might not possible for scheduling reasons. But with this detailed information, the operational research system can find the schedule that effectively minimizes the risk of economic losses.

Applied Data Science: Optimizing Grocery Store Checkout Times

Defining the Problem

Time Series Models

What Additional Data? Unexpected Discoveries

Bonus: Predicting Confidence Intervals & Calculating a Business Risk

You May Also Like

Everything to Know: AI Agents for Supplier Risk Assessment

Building AI Agents for Life Sciences: From Silos to Synthesis

Scaling GenAI in Financial Services With Dataiku and NVIDIA

How Databricks & Dataiku Embed Governance Into AI Workflows

Applied Data Science: Optimizing Grocery Store Checkout Times

Defining the Problem

Time Series Models

What Additional Data? Unexpected Discoveries

Bonus: Predicting Confidence Intervals & Calculating a Business Risk

Learn More About Staffing Optimization

Subscribe to the Dataiku Blog

You May Also Like

Everything to Know: AI Agents for Supplier Risk Assessment

Building AI Agents for Life Sciences: From Silos to Synthesis

Scaling GenAI in Financial Services With Dataiku and NVIDIA

How Databricks & Dataiku Embed Governance Into AI Workflows