Automation Scenarios: Another Step Towards Successful Model Deployment

Data Basics Margot

I — probably like many before me — like to think of data science (and more generally of big data) as a process of a final outcome: deployment into production. The thing is, putting a model to production is often more complicated than it seems; setting up automation scenarios is a way for IT teams to make this last step both easier and trackable.

How I Picture the Data Science Process

The Data Science Process A (very) simplified diagram of a classic churn project

It all begins with a very basic and simple question: what are you trying to predict? Do you want to estimate the amount your new customers are going to spend on your website? Do you want to predict which of your users are about to churn? And how is this data going to help you get there? So, the first step is clearly defining your goal.

Once you know what you’re aiming at, it’s time to dig in and get your hands data dirty. You first want to collect all the data (logs, CRM data, transactional data, social data, etc.) that can help you — one way or another — manage your final prediction. Then, you need to prepare and to clean all that raw data to make it completely homogeneous and therefore easier to analyze. Seems quite easy, right? Well, this whole data munging step is typically the most time-consuming one. It often represents up to 80% of the time data scientists spend on a project.

jean luc picard gif When data scientists are done cleaning their data

Then enters machine learning. This is the part when you get to build, to train, and to test several models to determine which one is the most efficient and likely to actually produce accurate predictions.

When you’ve tested, iterated, tested, iterated, and finally arrived to a model that is fit for production, voila you’re done! Except that that is never the case. Deploying a model is a lot more complicated than a “voila, let’s do it scenario.” Why? Let’s find out.

Why Is Putting Projects in Production So Tough?

The reasons for this are multiple. One of the core problems revolves around the technological inconsistency between what we could call the prototyping and the production environment. As the two infrastructures are different and as the people working on them have different trainings and skill sets, both environments evolve with different languages (for instance R for prototypes vs Java, C++ for production), which means putting a project in production sometimes requires the production team (usually IT) to recode the whole model into what is deemed the appropriate language.

burger sauteurThis is a warrior data scientist trying to deploy his model to production

Another big issue data scientists have to face when putting their models into production is recomputing. And why is that? Because they’re operating on data, and data definitely isn’t something that’s frozen in time. Data is more of a flow that gets refreshed and enriched by interactions between humans and machines. 

And it’s precisely because data is regularly updated, that what results from data needs to be frequently updated as well. That’s why if they want to have valuable and updated data-generated insights, data scientists need to regularly recompute their datasets, their models, and so on.

This is even more true when your project requires that you use your model in real-time. Let’s take a tangible example. Picture yourself as an application developer working for a bank. You’re responsible for developing a mobile app allowing customers to apply for a loan just by making them fill in a form with their requested loan amount and their name. The bank’s data team has provided you with a great model able to predict whether or not the customer is likely to refund the loan.

Once the customer has filled in the form, the app should ask servers about the customer’s data, use the model, and automatically return a screen telling the customer whether or not the bank has accepted their loan request. Great process right? But before deploying your model into production, you want to make absolutely sure your models and your data are regularly updated. Otherwise your app might for instance accept a loan that would not have been accepted by a classic bank advisor, which could get you into trouble.

The thing is, to really be up-to-date, insights need to be recomputed very frequently (sometimes up to several times a day) from many different sources and recomputing often involves lots of processing steps.

How Scheduling Is Probably Going to Save Your Life (Or at Least a Bunch of Working Hours)

There’s a way to automate this whole (and fairly unrewarding) recomputing process, and it’s called scheduling. So, what is scheduling really? And how does it work? Scheduling is basically a way to automatically and easily refresh your data.

What you first need to do to set up a schedule is describe your data pipeline as dependencies between datasets (from your very first ones with raw data to the final ones containing your predictions) and tasks (cleaning, joining, etc.).

WorkflowsHow the dependency structure of a project is represented in the flow tab of Dataiku DSS

Then, you have to define what we call an automation scenario. An automation scenario is a sequence of several actions you want to perform on your different “data objects” (datasets, models, folders, etc.) to keep your insights updated. There are three different tasks you need to complete to set up a scenario.

First, the trigger: you have to define what is going to launch the recomputation. There are several ways to characterize a trigger: it can for instance be time-based but it can also follow a dataset or queries results changes.

triggersThe trigger setup in Dataiku DSS

Once your trigger has been defined, you have to list the different actions that need to be done in order to update your insights. You basically have to create a sequence of steps such as checking or rebuilding a given dataset, computing metrics, refreshing insights, and so on.

The reporter setup in Dataiku DSS
The implementation of the steps sequence in Dataiku DSS

Finally, you want to add reporters to get notified when a scenario is done running (or is beginning). You can define the notification channel that suits you best (Hipchat, Slack, emails, etc.), the running conditions of the reporter (when do you want to get a message: if a model has changed of more than 10%, if there’s been an error during the recomputing process, etc.) and the type of message you’re willing to receive (text message, status, etc.).

The reporter setup in Dataiku DSSThe reporter setup in Dataiku DSS

And there you have it: your very first automation scenario! Combined with the dependency structure of the project, the automation scenarios allow a comprehensive description of your data pipeline autonomous recomputation.

The great thing in Dataiku DSS is that you can both run several scenarios from different projects at the same time and monitor the whole thing on a single screen called “scenarios monitoring”.

The Scenarios Monitoring screen on Dataiku DSS
The Scenarios Monitoring screen on Dataiku DSS

So you got it, scheduling is basically an empowerment of our data science process as it makes its last step — the production deployment — way easier… and trackable. That’s part of why we use the term “end to end” when we describe what Dataiku DSS enables data teams to do in terms of data products: it allows them to go from start (data preparation, etc.) to finish (data visualization, production, etc.) on a single platform all the while letting them leverage the tools, languages, and models they know best.

You May Also Like

How to Build Trust in Data Analytics Projects

Read More

AI Isn't Just for the Super Technical

Read More

Ada Lovelace Day: Celebrating the Visionary Who Saw Beyond Numbers

Read More