Key Steps Involved in the Machine Learning Process: A Primer

Data Basics, Scaling AI Marie Merveilleux du Vignaux

Machine learning has become steadily more accessible in the last few years. Thanks to advancements in automated machine learning (AutoML), collaborative AI, and machine learning platforms (like Dataiku), the use of data, including for predictive modeling, is on the rise among people in all kinds of roles. To be involved in the machine learning process, you don’t have to be an expert coder, data scientist, or engineer anymore. You can build your own ML model by following these five machine learning steps.

5 Key Machine Learning Steps:

1. Define the Goal: The first step in the machine learning process is defining the business objective of your machine learning project as concretely as possible. This step is key to the success of your model. To have the motivation, direction, and purpose to build a machine learning model from start to finish, you need a clear objective: what you want to do with the data and the model, and how that will improve your current processes or your performance at a given task. In short, without a clear goal, your model will probably not make it to production, so make sure you start here!

To identify a business problem, start by looking at the different types of prediction and thinking about what exactly it is that you would like to predict. There are two main types of supervised machine learning:

  • Classification: Do you want to predict which category something belongs to (for example, whether it is one thing or another)?

  • Regression: Do you want to predict a numeric quantity, such as an amount or a count?

Once you have identified your business problem and the type of supervised machine learning you will be using, you are ready to move on to the next step of the machine learning process.
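To make the distinction concrete, here is a minimal sketch that guesses the prediction type from a sample of target values. The function name infer_prediction_type, the max_classes cutoff, and the sample data are all made up for illustration. This is a rough heuristic, not how any particular platform decides.

```python
from numbers import Number


def infer_prediction_type(target_values, max_classes=10):
    """Guess the supervised learning type from a sample of target values.

    Assumption (illustrative only): a numeric target with more than
    `max_classes` distinct values is treated as a regression target;
    anything else is treated as a classification target.
    """
    distinct = set(target_values)
    # bool is a subclass of int in Python, so exclude it explicitly.
    all_numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in distinct
    )
    if all_numeric and len(distinct) > max_classes:
        return "regression"
    return "classification"


# Hypothetical targets: a yes/no label and a continuous amount.
churned = ["yes", "no", "no", "yes"]
monthly_spend = [12.5, 80.0, 43.2, 19.9, 55.1, 61.0, 7.3, 33.3, 90.4, 21.7, 48.6]

print(infer_prediction_type(churned))
print(infer_prediction_type(monthly_spend))
```

In practice the business question should drive this choice. A numeric column such as a 1-to-5 rating could reasonably be modeled either way.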

2. Prepare Data for ML: Preparing data for machine learning (in other words, making sure data is consistent, clean, and usable overall) is the most time-consuming part of the machine learning process and can take up to 80% of the time of the entire data project. So let’s separate this step into four sub-steps to make the process clear:

  • Getting the Data: Mixing and merging data from many different data sources can take a machine learning project to the next level. There are a few ways to get usable data: connecting to a database, using APIs, and/or looking for open data on the web.

  • Analyze, Explore, and Clean the Data: Exploring and cleaning the data helps ensure better results and helps avoid serious issues later on. Start digging to see what data you’re dealing with and ask questions to understand what each variable means. Keep an eye out for data quality issues, such as missing values or inconsistent data. Variables with too many missing or invalid values are unlikely to have much predictive power for your model.

  • Feature Selection: Select the features — also known as independent variables — you’ll use to train your model. By ensuring the right types of features are selected, you can reduce complexity and overfitting.

  • Feature Handling & Engineering: Feature engineering means building new features from the existing dataset or transforming existing features into more meaningful representations. The goal of this step is to transform features so the model can use them more effectively, which in turn improves its performance.
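The cleaning, selection, and engineering sub-steps above can be sketched in a few lines of plain Python. Everything here is an assumption for illustration: the column names, the toy rows, the 50% missing-value threshold, and the choice of median imputation. A real project would typically use a data preparation tool or library instead.

```python
from statistics import median

# Hypothetical raw data: None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "visits": 4, "fax": None},
    {"age": None, "income": 61000, "visits": 2, "fax": None},
    {"age": 45, "income": None, "visits": 8, "fax": None},
    {"age": 29, "income": 38000, "visits": 1, "fax": "555-0100"},
]


def missing_ratio(rows, column):
    """Explore: share of rows where the column is missing."""
    return sum(r[column] is None for r in rows) / len(rows)


def drop_sparse_columns(rows, threshold=0.5):
    """Select: keep only columns whose missing ratio is at most `threshold`."""
    keep = [c for c in rows[0] if missing_ratio(rows, c) <= threshold]
    return [{c: r[c] for c in keep} for r in rows]


def impute_median(rows, column):
    """Clean: fill missing numeric values with the column median."""
    fill = median(r[column] for r in rows if r[column] is not None)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]


def add_income_per_visit(rows):
    """Engineer: derive a new feature from two existing ones."""
    return [{**r, "income_per_visit": r["income"] / r["visits"]} for r in rows]


prepared = drop_sparse_columns(rows)      # "fax" is 75% missing, so it is dropped
for col in ("age", "income"):
    prepared = impute_median(prepared, col)
prepared = add_income_per_visit(prepared)

for row in prepared:
    print(row)
```

Each function returns new rows rather than modifying the input, which keeps the raw data available if a preparation choice needs to be revisited.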

Now you’re ready to get into what will occupy the remaining 20% of your work on this model!

3. Build the Model: You can build your model very simply by using Dataiku AutoML. AutoML is a tool that automates the process of applying machine learning and can make quick, baseline modeling simple. Even experienced data scientists use AutoML to accelerate their work. The process consists of four simple steps:

  • Building a baseline: This is a model that is straightforward but with a good chance of providing decent results, through quick modeling.

  • Designing the model: This includes selecting a target variable and prediction type.

  • Training the model: The model is trained on one subset of the data, then evaluated on data held out from training to see how well it maps inputs to outputs and makes accurate predictions.

  • Selecting the algorithm and hyperparameters: Decide which algorithm to use for your model based on your business goals and priorities.
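As a rough picture of what "building a baseline" and "training on a subset" mean, here is a hand-rolled sketch. It is not what Dataiku AutoML does internally. The helper names, the toy dataset, and the 25% hold-out fraction are assumptions, and the baseline simply predicts the most common class seen in training.

```python
import random
from collections import Counter


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold out a fraction for evaluation."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_majority_baseline(train_rows, target):
    """Baseline model: always predict the most frequent target value."""
    most_common = Counter(r[target] for r in train_rows).most_common(1)[0][0]
    return lambda row: most_common


# Hypothetical dataset: customers with few visits are labeled as churned.
data = [{"visits": v, "churned": "yes" if v < 3 else "no"} for v in range(1, 21)]

train, test = train_test_split(data)
baseline = fit_majority_baseline(train, target="churned")

# Evaluate only on rows the baseline never saw during training.
hits = sum(baseline(r) == r["churned"] for r in test)
print(f"baseline accuracy on held-out rows: {hits / len(test):.2f}")
```

Any model you build afterward should beat this kind of baseline on the held-out rows. If it does not, the added complexity is not paying off.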

4. Tune the Model: How do you know if your model is any good? That’s the part of the machine learning process where tracking and comparing model performance across different algorithms comes in.

  • Evaluate metrics and optimize: For regression models, you want to look at mean squared error and R-squared (R²). For classification models, you can start by looking at the simplest metric for evaluating that type of model: accuracy.

  • Check for overfitting and apply regularization: Regularization simplifies your model and makes it less specialized to the training data, which helps remedy overfitting.
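The metrics named above are simple enough to write out by hand, which can help demystify them. The sketch below uses standard textbook definitions. The ridge_slope function is an illustrative assumption: it shows L2 regularization on the simplest possible model (one feature, no intercept), where a larger penalty alpha shrinks the learned slope toward zero.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r_squared(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


def accuracy(y_true, y_pred):
    """Share of predictions that exactly match the actual class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def ridge_slope(xs, ys, alpha=0.0):
    """Slope w of the model y = w * x fitted with an L2 penalty.

    Minimizes sum((y - w * x) ** 2) + alpha * w ** 2, which has the
    closed-form solution w = sum(x * y) / (sum(x * x) + alpha).
    """
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


print(mean_squared_error([3, 5, 7], [2, 5, 9]))
print(r_squared([3, 5, 7], [2, 5, 9]))
print(accuracy(["a", "b", "a", "a"], ["a", "a", "a", "b"]))
print(ridge_slope([1, 2, 3], [2, 4, 6], alpha=0.0))
print(ridge_slope([1, 2, 3], [2, 4, 6], alpha=14.0))
```

A common way to check for overfitting is to compute the same metric on the training data and on held-out data. A large gap between the two suggests the model has memorized the training set.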

5. Model Interpretation: This is the degree to which models — and their outcomes — can be understood by humans. Below we will quickly outline three techniques to interpret results and review performance.

  • Partial dependence plots: These help explain the effect an individual feature has on model predictions.

  • Subpopulation analysis: This investigates whether a model performs identically across different subpopulations. If the model is better at predicting outcomes for one group over another, it can lead to biased outcomes and unintended consequences when it is put into production.

  • Individual prediction explanations: Partial dependence plots and subpopulation analyses look at features more broadly, but they don’t provide insight into the factors behind each specific prediction that a model outputs — that’s where individual prediction explanations come in. The explanations are useful for understanding the prediction of an individual row and how certain features impact it.
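Two of these techniques can be sketched by hand to show the underlying idea. The toy model, the feature names, and the rows below are invented for illustration. Real tools compute these on trained models with many more rows, and individual prediction explanations need more machinery than fits here.

```python
from collections import defaultdict


def partial_dependence(predict, rows, feature, grid):
    """Average prediction when `feature` is forced to each grid value in every row."""
    curve = {}
    for value in grid:
        preds = [predict({**r, feature: value}) for r in rows]
        curve[value] = sum(preds) / len(preds)
    return curve


def accuracy_by_group(predict, rows, target, group_key):
    """Subpopulation analysis: accuracy computed separately for each group."""
    hits, totals = defaultdict(int), defaultdict(int)
    for r in rows:
        totals[r[group_key]] += 1
        hits[r[group_key]] += int(predict(r) == r[target])
    return {g: hits[g] / totals[g] for g in totals}


def score(row):
    """Toy stand-in for a trained model's numeric output (hypothetical)."""
    return row["visits"] + (5 if row["plan"] == "basic" else 0)


def classify(row):
    """Toy classifier: threshold the score to get a yes/no label."""
    return "yes" if score(row) >= 6 else "no"


rows = [
    {"visits": 1, "plan": "basic", "label": "yes"},
    {"visits": 5, "plan": "basic", "label": "yes"},
    {"visits": 2, "plan": "pro", "label": "no"},
    {"visits": 8, "plan": "pro", "label": "no"},
]

pd_curve = partial_dependence(score, rows, feature="visits", grid=[0, 10])
group_acc = accuracy_by_group(classify, rows, target="label", group_key="plan")

print(pd_curve)
print(group_acc)
```

In this toy data the classifier is right on every "basic" row but only half of the "pro" rows, which is exactly the kind of gap a subpopulation analysis is meant to surface before a model goes to production.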

And that’s how you build a machine learning model! See? It wasn’t that difficult! Alright, I admit we did simplify things a bit to make the machine learning process more approachable, but if you take the time to dive deeper into the details, you will see it’s nothing that you can’t follow.
