Here at Dataiku, we frequently stress the importance of collaboration in building a successful data team. In short, successful data science and analytics are just as much about creativity as they are about crunching numbers, and creativity flourishes in a collaborative environment. One key to a collaborative environment is having a shared set of terms and concepts. Even if you aren’t working in data science per se, it’s still useful to familiarize yourself with these concepts -- if you’re not already incorporating predictive analytics into your everyday work, you probably will be doing so soon!
The terms we’ve chosen to define here are commonly used in machine learning, and they’re essential to learning the basics of data science. (And many thanks to our own Ulysse Mizrahi, who helped us in compiling the list and writing the definitions!) Whether you’re working on a project that involves machine learning, or you’re learning about data science, or even if you’re just curious about what’s going on in this part of the data world, we hope you’ll find these definitions clear and helpful.
(This blog post and infographic are part of our Machine Learning Basics illustrated guidebook, which is available as a free copy.)
Also, you can check out our introduction to machine learning algorithms.
Model: a mathematical representation of a real-world process; a predictive model forecasts a future outcome based on past behaviors.
Algorithm: a set of rules used to make a calculation or solve a problem.
Training: the process of creating a model from the training data. The data is fed into the training algorithm, which learns a representation for the problem and produces a model. Also called “learning.”
Regression: a prediction method whose output is a real number, that is, a value that represents a quantity along a line. Example: predicting the temperature of an engine or the revenue of a company.
Classification: a prediction method that assigns each data point to a predefined category, e.g., a type of operating system.
Target: the variable you wish to predict, which is also the output of the model; in statistics, it is called the dependent variable.
Training set: a dataset used to find potentially predictive relationships that will be used to create a model.
Test set: a dataset, separate from the training set but with the same structure, used to measure and benchmark the performance of various models.
Feature: also known as an independent variable or a predictor variable, a feature is an observable quantity that is recorded and used by a prediction model. You can also engineer features by combining them or adding new information to them.
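Overfitting: a situation in which a model that is too complex for the data has been trained to predict the target. This leads to an overly specialized model, which makes predictions that do not reflect the reality of the underlying relationship between the features and target. An overfit model typically looks accurate on the training set but performs noticeably worse on data it has not seen, such as the test set.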
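To make a few of these terms concrete, here is a minimal sketch in plain Python that ties together features, target, training set, test set, training, regression, and overfitting. Everything in it is an assumption made for illustration: the toy data (a made-up "engine load" feature and "engine temperature" target), the every-other-row split, and the helper names train_linear, train_interpolator, and mean_squared_error. It covers regression only, not classification, and it is not how any particular tool implements these ideas.

```python
# Toy data, invented for illustration: one feature (engine load) and a
# numeric target (engine temperature), so this is a regression problem.
features = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
noise = [1, -1, -1, 1, 1, -1, -1, 1, 1, -1]  # fixed "measurement noise"
targets = [2.0 * x + 1.0 + n for x, n in zip(features, noise)]

# Training set: used to fit the models. Test set: held back to benchmark them.
# Taking every other row is an arbitrary choice made for this sketch.
train_x, train_y = features[0::2], targets[0::2]
test_x, test_y = features[1::2], targets[1::2]


def train_linear(xs, ys):
    """Training algorithm: least-squares fit of a straight line."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept  # the trained model


def train_interpolator(xs, ys):
    """Training algorithm: a polynomial that passes through every training point."""

    def model(x):
        # Lagrange interpolation: a weighted sum of the training targets.
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            weight = 1.0
            for j, xj in enumerate(xs):
                if j != i:
                    weight *= (x - xj) / (xi - xj)
            total += yi * weight
        return total

    return model


def mean_squared_error(model, xs, ys):
    """Average squared gap between the model's predictions and the targets."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


linear = train_linear(train_x, train_y)
wiggly = train_interpolator(train_x, train_y)

for name, model in [("linear", linear), ("interpolating polynomial", wiggly)]:
    print(
        name,
        "| train MSE:", round(mean_squared_error(model, train_x, train_y), 2),
        "| test MSE:", round(mean_squared_error(model, test_x, test_y), 2),
    )
```

On this toy data, the interpolating polynomial reproduces the training set exactly, yet its error on the test set is larger than that of the simple straight line. That gap between training and test performance is the signature of overfitting, and it is the reason the test set is kept separate from the training set.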
If you're interested in learning more, check out our illustrated guidebook on the basics of machine learning. Enjoy, and keep in touch!