Jason Brownlee, Ph.D., is a machine learning specialist and founding researcher at Machine Learning Mastery. According to Dr. Brownlee, “feature engineering is the process of transforming raw data into features that better represent the underlying problem to the predictive models, resulting in improved model accuracy on unseen data.” From his perspective, it comes down to getting the most out of your data for predictive modeling.
Put another way, feature engineering is the process of using domain knowledge to transform raw data into a form that gives the model better or new signals. It involves creating new variables (known as features) and adding them to the dataset at hand in order to improve model performance. It’s an important concept because it helps make the input dataset compatible with the requirements of machine learning (ML) algorithms and, ultimately, helps facilitate the entire ML process.
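To make the idea concrete, here is a minimal Python sketch of turning one raw record into derived features. The record layout (`timestamp`, `total`, `quantity`) and the function name `engineer_features` are hypothetical, chosen only for illustration; which features are worth deriving depends on the domain.

```python
from datetime import datetime

def engineer_features(record):
    """Derive model-friendly features from one raw transaction record.

    The field names used here are illustrative assumptions, not a standard schema.
    """
    ts = datetime.fromisoformat(record["timestamp"])
    return {
        # Time of day often carries signal that a raw timestamp string hides.
        "hour": ts.hour,
        # weekday() returns 0 for Monday through 6 for Sunday.
        "is_weekend": int(ts.weekday() >= 5),
        # A ratio feature built from two raw columns (assumes quantity > 0).
        "price_per_item": record["total"] / record["quantity"],
    }

raw = {"timestamp": "2023-07-01T14:30:00", "total": 59.97, "quantity": 3}
features = engineer_features(raw)
print(features)
```

Each derived value encodes a piece of domain knowledge (for example, that weekend purchases may behave differently) in a form a model can use directly.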
Breaking Down Feature Engineering
A typical machine learning workflow includes gathering data, cleaning data, feature engineering, defining the model, training and testing the model, and predicting the output. Widely referred to in the industry as an art, feature engineering is often the differentiator between a good model and a not-so-good one. While feature engineering is vital to the overall machine learning process, it isn’t one-size-fits-all: the steps involved vary depending on factors such as the business problem, the type of model, and the industry. Some of the steps involved in feature engineering, though, may include:
- Pre-feature engineering data prep and exploratory data analysis
- Brainstorming/testing features and choosing which features to create
- Creating features
- Checking how the features work with the model (i.e., testing the impact)
- Optimizing the features if needed and repeating until the features work effectively
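The create, test, and repeat loop in the steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the data is synthetic (price is driven entirely by area), the "model" is a one-variable least-squares line standing in for whatever model is actually used, and the helper names `fit_line` and `mse` are hypothetical.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    return slope, mean_y - slope * mean_x

def mse(xs, ys, slope, intercept):
    """Mean squared error of the fitted line on the given data."""
    return sum((slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic raw data: the target depends on area, which is not a raw column.
lengths = [10, 20, 5, 8, 15, 12]
widths = [5, 4, 20, 10, 2, 6]
prices = [100 * l * w for l, w in zip(lengths, widths)]

# Candidate features: two raw columns and one engineered column.
candidates = {
    "length": lengths,
    "width": widths,
    "area": [l * w for l, w in zip(lengths, widths)],
}

# Fit on the first four rows, measure the impact on the last two (held out).
split = 4
scores = {}
for name, values in candidates.items():
    slope, intercept = fit_line(values[:split], prices[:split])
    scores[name] = mse(values[split:], prices[split:], slope, intercept)

best = min(scores, key=scores.get)
print(scores, best)
```

Here the engineered `area` feature yields the lowest held-out error. In a real project the same loop would retrain the actual model and compare its validation metric with and without each candidate feature.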
Feature Engineering and Dataiku
To support the initial portion of the feature engineering process, which is typically more manual, Dataiku lets users transform or create new features using formulas, code, or built-in visual recipes. When developing the model itself, Dataiku AutoML accelerates feature pre-processing and handling by automatically filling missing values and converting non-numeric data into numerical values using well-established encoding techniques.
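For readers unfamiliar with these pre-processing steps, the following Python sketch shows two generic, widely used techniques: mean imputation for missing values and one-hot encoding for categorical values. It is not Dataiku's implementation, and the function names `impute_mean` and `one_hot` are hypothetical; it only illustrates what filling missing values and encoding non-numeric data can look like.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)  # assumes at least one observed value
    return [mean if v is None else v for v in values]

def one_hot(values):
    """Encode each category as a 0/1 indicator vector, in sorted category order."""
    categories = sorted(set(values))
    rows = [[int(v == c) for c in categories] for v in values]
    return categories, rows

ages = [30, None, 50]
colors = ["red", "blue", "red"]

filled = impute_mean(ages)
categories, encoded = one_hot(colors)
print(filled)               # missing age replaced by the mean of 30 and 50
print(categories, encoded)  # one indicator column per distinct color
```

Mean imputation and one-hot encoding are only two of several common options; median imputation or ordinal encoding, for instance, may suit other data better.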
Additionally, built-in feature generation can programmatically create new pairwise combinations of features to provide additional inputs to the model, or users may instead choose to apply common feature reduction techniques. Once features are created, Dataiku documents the settings and stores the feature engineering steps in recipes for reuse in scoring and model retraining.
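As a rough illustration of pairwise feature generation, the Python sketch below adds one product feature for every pair of numeric columns in a row. Multiplication is an assumption made here for simplicity; the text does not specify which combinations Dataiku generates, and the function name `pairwise_products` is hypothetical.

```python
from itertools import combinations

def pairwise_products(row):
    """Return the original features plus one product feature per pair of columns."""
    expanded = dict(row)
    # Sort the column names so the generated feature names are deterministic.
    for a, b in combinations(sorted(row), 2):
        expanded[f"{a}*{b}"] = row[a] * row[b]
    return expanded

row = {"height": 2.0, "width": 3.0, "depth": 4.0}
expanded = pairwise_products(row)
print(expanded)
```

With n original columns this adds n(n-1)/2 new ones, so the feature count grows quickly. That growth is one reason feature reduction techniques are often considered alongside feature generation.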