Here at Dataiku, we frequently stress the importance of collaboration in building a successful data team. In short, successful data science and analytics are just as much about creativity as they are about crunching numbers, and creativity flourishes in a collaborative environment. One key to a collaborative environment is having a shared set of terms and concepts.
Even if you aren’t working in data science per se, it’s still useful to familiarize yourself with these concepts: if you’re not already incorporating predictive analytics into your everyday work, you probably will be soon. This data science glossary will equip you and your team with the fundamental terms to know, since a shared vocabulary helps organizations educate themselves and establish a common mission and vision for scaling AI and becoming truly data-driven.
The data science concepts we’ve chosen to define here are commonly used in machine learning, and they’re essential to learning the basics of data science. Whether you’re working on a machine learning project, studying data science, or simply curious about what’s going on in this part of the data world, we hope you’ll find these definitions clear and helpful.
Key Data Science Concepts
Data Science: Data science, which is frequently lumped together with machine learning, is a field that uses processes, scientific methodologies, algorithms, and systems to extract knowledge and insights from structured and unstructured data. The definition can vary widely based on business function and role.
Machine Learning: Machine learning is a subset of AI in which systems learn from data to perform a specific task, without having to code explicit rule-based instructions.
Deep Learning: Deep learning is a subset of machine learning where systems can learn hidden patterns from data by themselves, combine them, and build much more efficient decision rules.
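For a concrete (if tiny) illustration, here is a sketch of a small feed-forward neural network using scikit-learn’s MLPClassifier on its bundled digits dataset; the layer sizes and other settings are illustrative assumptions rather than a recommended configuration.

```python
# A small multilayer network learns intermediate representations
# ("hidden patterns") from raw pixel inputs, then combines them
# into a decision rule for classifying handwritten digits.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Hidden layer sizes here are arbitrary illustrative choices.
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500, random_state=0)
net.fit(X_train, y_train)
print(f"Test accuracy: {net.score(X_test, y_test):.2f}")
```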
Model: A model is a mathematical representation of a real-world process. A predictive model forecasts a future outcome based on past behaviors.
Algorithm: An algorithm is a set of rules used to make a calculation or solve a problem.
Training: Training is the process of creating a model from the training data. The data is fed into the training algorithm, which learns a representation for the problem and produces a model. The concept is also called “learning.”
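As a minimal sketch of the training step (assuming scikit-learn and its bundled iris dataset), the training algorithm consumes labeled data and hands back a fitted model:

```python
# Training: feed the training data to a training algorithm and get
# back a model; scikit-learn exposes this "learning" step as fit().
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)                      # features and targets
model = DecisionTreeClassifier(max_depth=3).fit(X, y)  # the learning step
print(model.predict(X[:2]))                            # the model can now predict
```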
Regression: Regression is a prediction method whose output is a real number, that is, a value that represents a quantity along a continuous scale, such as predicting the temperature of an engine or the revenue of a company.
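A minimal regression sketch, assuming scikit-learn; the ad-spend and revenue figures below are invented purely for illustration:

```python
# Regression: predict a real-valued quantity (here, revenue) from a
# feature (here, ad spend). All numbers are made up.
import numpy as np
from sklearn.linear_model import LinearRegression

ad_spend = np.array([[10.0], [20.0], [30.0], [40.0]])  # feature
revenue = np.array([103.0, 198.0, 305.0, 402.0])       # real-valued target

reg = LinearRegression().fit(ad_spend, revenue)
print(reg.predict([[25.0]]))  # output is a real number, not a category
```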
Classification: Classification is a prediction method that assigns each data point to a predefined category, e.g., a type of operating system. Essentially, it refers to predicting categorical values.
Classification task: A classification task is the process of predicting the class of a given unlabeled item, where the class must be selected from a set of predefined classes.
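To make the last two entries concrete, here is a sketch using scikit-learn’s LogisticRegression on the iris dataset, where the three species play the role of the predefined classes:

```python
# Classification: assign each data point to one of a fixed set of
# predefined classes (the three iris species).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:3]))        # predicted class labels (categorical values)
print(clf.predict_proba(X[:1]))  # per-class probabilities for one item
```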
Target: In statistics, the target is called the dependent variable; it is the output of the model or the variable you wish to predict.
Training set: A training set is a dataset used to find potentially predictive relationships that will be used to create a model.
Test set: A test set is a dataset, separate from the training set but with the same structure, used to measure and benchmark the performance of various models.
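A common way to keep the two datasets separate is scikit-learn’s train_test_split; the split ratio and random seed below are arbitrary choices:

```python
# Fit on the training set only; measure performance on the held-out
# test set, which the model has never seen.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Held-out accuracy: {clf.score(X_test, y_test):.2f}")
```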
Feature: Also known as an independent variable or a predictor variable, a feature is an observable quantity that is recorded and used by a prediction model. You can also engineer features by combining them or adding new information to them.
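A small feature-engineering sketch with pandas; the housing columns and the derived features are hypothetical examples:

```python
# Feature engineering: derive new features by combining existing
# ones (price per square foot) or adding information (building age).
import pandas as pd

df = pd.DataFrame({
    "price": [250_000, 410_000, 180_000],
    "area_sqft": [1_200, 2_000, 950],
    "year_built": [1999, 2015, 1987],
})

df["price_per_sqft"] = df["price"] / df["area_sqft"]  # combined feature
df["age"] = 2024 - df["year_built"]                   # derived feature
print(df)
```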
Overfitting: Overfitting occurs when a model that is too complex for the data has been trained to predict the target. The result is an overly specialized model whose predictions do not reflect the true underlying relationship between the features and the target.
Regularization: Regularization, the remedy for overfitting, is the process of simplifying your model or making it less specialized.
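To see regularization act as that remedy, here is a sketch (assuming scikit-learn) that fits the same overly complex polynomial model with and without an L2 penalty; the polynomial degree and alpha value are arbitrary illustrative choices:

```python
# Regularization: penalize complexity. Ridge (L2) shrinks the
# polynomial's coefficients, making the model less specialized.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel() + rng.normal(0, 0.3, 20)  # noisy signal

overfit = make_pipeline(PolynomialFeatures(15), LinearRegression()).fit(X, y)
regularized = make_pipeline(PolynomialFeatures(15), Ridge(alpha=1.0)).fit(X, y)

# The regularized model's coefficients are far smaller, i.e., simpler.
print(abs(overfit.named_steps["linearregression"].coef_).max())
print(abs(regularized.named_steps["ridge"].coef_).max())
```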