At a basic level, a machine learning model is the output of a machine learning task. The purpose of machine learning is to take the time-consuming work of identifying patterns correlated with some outcome and expedite it through semi-automated processing. This makes it possible to identify correlations, or predict an outcome, from a mixture of potentially complex inputs.
By generating a model, we can complete the machine learning task over and over on updated data sources, without having to build the task from scratch. This enables further automation of data science projects.
On a technical note, a machine learning model is trained to recognize specific patterns in a dataset and then used with new data to identify similar patterns and/or make predictions for analysis. The machine learning model acts as a blueprint for future analysis tasks using machine learning.
Generally speaking, in data science there are two common types of machine learning models:
Prediction
Prediction relates to learning problems where the variable to predict is available in a labeled training dataset. Prediction models are trained on past examples for which the actual values (the target column) are known. The nature of the target variable will drive the kind of prediction task.
For example, let’s say we want to build a prediction model with the goal of predicting whether or not a patient will be readmitted to the hospital within a specific time span. We can train the model on historical data with “hospital readmission” selected as the prediction target. During the training cycle, the model will identify the variables that serve as the strongest predictors of our prediction target. At the end of training, the machine learning model can be deployed into our data analysis workflow. This model can then be used to predict which patients are likely to be readmitted in the future.
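To make this concrete, here is a minimal sketch of the idea in Python. The dataset, feature, and single-threshold “model” are all invented for illustration; a real prediction model would use many features and a proper learning algorithm:

```python
# Hypothetical labeled history: (days_in_hospital, readmitted_within_window).
history = [
    (2, False), (3, False), (4, False), (5, True),
    (6, True), (7, True), (8, True), (3, True),
]

def train_threshold_model(records):
    """Pick the stay-length cut-off that best separates readmissions."""
    best_threshold, best_correct = None, -1
    for candidate in sorted({days for days, _ in records}):
        correct = sum((days >= candidate) == label for days, label in records)
        if correct > best_correct:
            best_threshold, best_correct = candidate, correct
    return best_threshold

threshold = train_threshold_model(history)

def predict(days):
    """Apply the trained 'model' to a new patient."""
    return days >= threshold

print(threshold)   # -> 5 (learned cut-off)
print(predict(9))  # -> True (long stay predicts readmission)
```

The point is the workflow, not the algorithm: known outcomes train the model once, and the resulting artifact can then score new patients without retraining.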
Clustering
Clustering refers to unsupervised learning problems where the target is unknown, and you’re looking to find patterns and similarities in your data points. Clustering models infer a function that describes hidden structure in “unlabeled” data. These unsupervised learning algorithms group similar rows based on their features.
Using the same hospital example, we might want to see if there are other strongly correlated items in our patient database, or items that are not independently tracked but have a distinct pattern in the data. We could use a clustering model because it can surface correlated items we didn’t know to look for.
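As an illustration of the clustering idea, here is a bare-bones k-means written from scratch on an invented two-feature patient dataset (age, prior visits); in practice you would use a mature library implementation rather than this sketch:

```python
import random

# Hypothetical patients described by (age, prior_visits).
patients = [
    (25, 1), (30, 2), (28, 1),   # younger, few visits
    (70, 8), (75, 9), (68, 7),   # older, frequent visits
]

def kmeans(points, k=2, iterations=10, seed=0):
    """Minimal k-means: assign each point to the nearest center, re-average."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            groups[nearest].append(p)
        # Move each center to the mean of its group (keep it if the group is empty).
        centers = [
            tuple(sum(vals) / len(g) for vals in zip(*g)) if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return groups

clusters = kmeans(patients)
for group in clusters:
    print(group)  # the two natural groups emerge without any labels
```

No target column was provided; the grouping falls out of the feature similarities alone, which is exactly what makes clustering useful for patterns you did not know to look for.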
What Is a Machine Learning Model Lifecycle?
For those familiar with a formal Software Development Life Cycle, it won’t come as any surprise that machine learning models have a lifecycle too. The typical machine learning model lifecycle can be summarized in four steps:
1. A model is born through the data science process. This involves the design, training, exploration, and selection of the best model.
2. The model is deployed to production. (This is like making it to the big time in the model world).
3. The model has a life with multiple retraining updates*.
4. Finally, the model is retired in favor of a newer, better model, perhaps based on a new technique or on data that was previously unavailable.
*In Dataiku, a saved model is deployed together with a training recipe that allows you to retrain it with the same model settings, but possibly with new training data.
Step 1. Design, Train, and Explore
As previously mentioned, machine learning models are typically used for prediction or clustering. In prediction, we aim to predict a known, measurable outcome. In clustering, we look for structure: things we might later predict, or simply things that appear to be correlated.
To use a machine learning model for prediction, the data must be cleaned and prepared just as it would be for other data analytics tasks. The machine learning technique is selected and then the model can be trained. A machine learning model is trained to identify the specific patterns using a reference set of data. This training results in a machine learning model that can be used on future data sets to identify similar patterns and make predictions.
There are various approaches to “training” the model. A worst-case, slowest version would be a brute-force one-to-one comparison of every input against the outcome of interest, with a ranking of how strongly each input correlates with that outcome. This is often how children approach a correlation problem: try everything, then decide what worked best. Today, machine learning models can be trained much more efficiently using prebuilt training optimization algorithms and search patterns. We can even specify the range or binning of the expected outcome.
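The brute-force approach described above can be sketched as scoring every input column by its correlation with the outcome and ranking the results. The columns and values here are invented purely for illustration:

```python
from statistics import mean

# Hypothetical input columns and a binary outcome (1 = readmitted).
data = {
    "days_in_hospital": [2, 3, 4, 5, 6, 7],
    "age":              [40, 35, 50, 45, 60, 55],
    "room_number":      [101, 204, 115, 310, 122, 208],
}
outcome = [0, 0, 0, 1, 1, 1]

def correlation(xs, ys):
    """Pearson correlation coefficient, computed from scratch."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sd_x = sum((x - mx) ** 2 for x in xs) ** 0.5
    sd_y = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

# Rank every input by the strength of its correlation with the outcome.
ranking = sorted(
    ((name, abs(correlation(col, outcome))) for name, col in data.items()),
    key=lambda item: item[1],
    reverse=True,
)
for name, score in ranking:
    print(f"{name}: {score:.2f}")  # days_in_hospital ranks highest here
```

Real training algorithms avoid this exhaustive scan, but the ranking intuition is the same one they optimize.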
Analysts typically build many models and tweak the input features, techniques, parameters, etc. until a best performing model is identified. Exploring during the model development step is essential to producing high quality models.
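The “build many models, keep the best” loop can itself be sketched as evaluating candidates on held-out data. The candidate “models” below are deliberately trivial scoring rules over an invented validation set:

```python
# Hypothetical held-out (days_in_hospital, readmitted) pairs.
validation = [(2, 0), (4, 0), (6, 1), (8, 1)]

# Candidate "models" to compare; real candidates would differ in
# features, techniques, and parameters, not just a threshold.
candidates = {
    "threshold_5": lambda days: days >= 5,
    "threshold_3": lambda days: days >= 3,
    "always_yes":  lambda days: True,
}

def accuracy(model, records):
    """Fraction of records where the model's guess matches the label."""
    return sum(model(x) == bool(y) for x, y in records) / len(records)

best_name = max(candidates, key=lambda name: accuracy(candidates[name], validation))
print(best_name)  # -> threshold_5
```

AutoML tools automate exactly this loop, generating and scoring many candidates so the analyst only has to review the leaderboard.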
In Dataiku, data analysts can select the AutoML feature to build machine learning models, which has the advantage of trying a variety of different techniques and parameters automatically. AutoML makes machine learning more accessible and can help data analysts discover better techniques that they may not have considered using.
Step 2. Deployment
In this step, the winning model is deployed to production. This will typically involve moving the model to a new system running on production hardware, instrumenting the model for data drift and other monitoring, and then testing the model to ensure it works under production conditions before making it available for production use.
Step 3. Retraining
Predictive models will need retraining when the data in the real world differs from what the model was trained on. This data drift happens over time and can degrade prediction accuracy, creating the need to retrain the model on newer data.
In the hospital scenario, as staff learn what causes readmissions and adjust processes to reduce them, the top correlated items predicting readmission will change. A good model workflow will support retraining without having to start over with a model from scratch. Retraining returns us to step 1 of the model lifecycle!
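A simple drift check can be sketched as comparing a feature’s distribution between the training data and incoming data, and flagging retraining when the shift is large. The feature, values, and threshold below are illustrative assumptions, not a production monitoring rule:

```python
from statistics import mean, stdev

# Hypothetical patient ages: the training sample vs. newer arrivals.
training_ages = [40, 45, 50, 55, 60, 65]
incoming_ages = [70, 75, 72, 78, 74, 76]   # newer patients skew older

def needs_retraining(train_values, new_values, z_threshold=2.0):
    """Flag drift when the new mean sits far from the training mean,
    measured in training standard deviations."""
    shift = abs(mean(new_values) - mean(train_values))
    return shift / stdev(train_values) > z_threshold

print(needs_retraining(training_ages, incoming_ages))  # -> True
```

Production monitoring typically tracks many features with more robust statistical tests, but the principle is the same: detect when reality has moved away from the training data, then retrain.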
Impact of Machine Learning Models
AI is the hot topic of today, but beyond its hype, machine learning has real value to organizations looking to improve their processes and understand complex external factors that impact growth. Machine learning models can drastically reduce the labor of data analysis and data research by automating a key element of creating business outcome predictors.
Dataiku helps everyone in an organization — from technical teams to business leaders — use data to make better decisions and drive business value on a daily basis. Organizations that use Dataiku enable their people to be extraordinary, creating the AI that will power their company into the future.