The art of feature engineering in data science is often a balance between extracting new features and removing irrelevant or noisy ones: too many or too few, and your model underperforms, becomes difficult to interpret, or overfits. Feature engineering is the iterative process of transforming raw data into a form that provides optimal signal to models, all with the goal of creating simpler models with better results.
Common Techniques for Feature Engineering
Below are a few common feature engineering techniques, each illustrated with a minimal code sketch after the list:
- Missing Data Imputation: With imputation, you're replacing missing values in your dataset with estimated values. Methods include random sampling, mean/median/mode replacement, and replacement with an arbitrary value, each with its own pros and cons. For example, replacing missing values with the mean or median alters the distribution of the data, which may add bias to your model.
- Categorical Encoding: This is transforming strings or categorical variables (like Blue, Red, Green in a dataset about t-shirt colors, for example) into numerical values. Here, too, there are several methods, like one-hot, frequency, datetime, or target (also called mean) encoding.
- Scaling: When variables differ widely in absolute magnitude, scaling normalizes or standardizes them onto a common range, such as limiting values to between 0 and 1, so that no variable dominates simply because of its units.
- Discretization: With this technique, you sort continuous values into bins (called binning) or intervals, altering the data distribution. Many models don't perform well with heavily skewed or otherwise non-standard distributions, so binning can improve model performance.
- Feature Extraction: With feature extraction, you derive model-ready numerical features from text, images, and other unstructured data. For example, in NLP projects you may want to extract values related to customer sentiment to better predict behavior such as likelihood to churn.
- Existing Feature Interactions: Sometimes it may be useful to include interaction terms between existing variables in your dataset. This can be helpful for uncovering signals that may be non-linear.
- Principal Component Analysis: This common method of feature engineering can help reduce the dimensionality of very large datasets while retaining as much of the variance in the data as possible.
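To make imputation concrete, here is a minimal sketch using scikit-learn's SimpleImputer on a toy DataFrame (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data with missing values (columns are illustrative)
df = pd.DataFrame({
    "age": [25, np.nan, 47, 31],
    "income": [52000, 61000, np.nan, 58000],
})

# Median imputation is less sensitive to outliers than the mean,
# but both shrink the variance of the imputed column
imputer = SimpleImputer(strategy="median")
df[["age", "income"]] = imputer.fit_transform(df[["age", "income"]])
print(df)
```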
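For categorical encoding, the sketch below shows one-hot encoding with pandas plus a simple target (mean) encoding, using the t-shirt color example; the binary target column is invented for illustration:

```python
import pandas as pd

# Illustrative t-shirt color data with an invented binary target
df = pd.DataFrame({
    "color": ["Blue", "Red", "Green", "Blue"],
    "churned": [1, 0, 1, 1],
})

# One-hot encoding: one binary column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Target (mean) encoding: replace each category with the mean of the target
df["color_target_enc"] = df.groupby("color")["churned"].transform("mean")

print(pd.concat([df, one_hot], axis=1))
```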
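A minimal scaling sketch, assuming scikit-learn's MinMaxScaler and illustrative columns, that maps each variable onto the [0, 1] range:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Two variables with very different absolute magnitudes (illustrative)
df = pd.DataFrame({
    "income": [52000, 61000, 75000, 58000],
    "age": [25, 38, 47, 31],
})

# Min-max scaling maps every column onto [0, 1]
scaled = pd.DataFrame(MinMaxScaler().fit_transform(df), columns=df.columns)
print(scaled)
```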
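Discretization can be as simple as pandas' cut for equal-width bins (qcut would give equal-frequency bins instead); the ages and labels below are illustrative:

```python
import pandas as pd

ages = pd.Series([22, 35, 58, 41, 67, 29])

# Equal-width binning into three intervals with readable labels
age_bins = pd.cut(ages, bins=3, labels=["young", "middle", "senior"])
print(age_bins.value_counts())
```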
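For feature extraction from text, one common generic approach (a stand-in here, not a full sentiment pipeline) is TF-IDF, sketched with invented customer comments:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented customer comments standing in for real text data
reviews = [
    "great product, will buy again",
    "terrible support, thinking of cancelling",
    "happy with the product and the service",
]

# TF-IDF turns free text into a numeric matrix a model can consume
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(reviews)
print(X.shape)
print(vectorizer.get_feature_names_out())
```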
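A sketch of interaction terms using scikit-learn's PolynomialFeatures with interaction_only=True, so only pairwise products are added (the columns are illustrative):

```python
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures

df = pd.DataFrame({"price": [10, 12, 9], "quantity": [3, 1, 4]})

# interaction_only=True adds the pairwise product (price * quantity)
# without also adding squared terms
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
interactions = poly.fit_transform(df)
print(poly.get_feature_names_out(df.columns))
print(interactions)
```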
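And a minimal PCA sketch with scikit-learn, keeping enough components to explain 95% of the variance (the data here is random, purely for illustration):

```python
import numpy as np
from sklearn.decomposition import PCA

# Random stand-in for a wide dataset: 100 rows, 20 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))

# A float n_components keeps enough components to explain
# that fraction of the total variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)
print(X.shape, "->", X_reduced.shape)
print(round(pca.explained_variance_ratio_.sum(), 3))
```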
So, how do you pick the best feature engineering techniques for your data? While some may be obvious (e.g., if you're using a modeling algorithm that only works with numerical variables or can't handle missing values), below are a few guidelines that can help you decide where to start.
General Feature Engineering Tips
There are many ways to engineer features, and each data scientist's preferred process and techniques can vary. Here are some general best practices gathered from experts to keep in mind when deciding which techniques to use:
- Spend time doing exploratory analysis: Use data visualizations and statistical analyses to better understand the relationships within your data (see the sketch after this list). This can help you identify the right features for the model and the most appropriate manipulations to make before starting.
- Dig into domain knowledge: Whether you gather it from business stakeholders or from other sources on your own, business context can help you determine the importance of certain data variables, their relationships with one another, and their typical behavior. For example, knowledge of typical fraud patterns in retail can help you understand which combination of features matters when predicting fraudulent transactions.
- Think back from the objective and goals for your features: When determining which feature engineering techniques to employ, keep in mind the model type you've chosen and optimize your features for that model. Also, when creating new features, make sure they have the potential to be predictive and aren't just created for the sake of it.
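As a lightweight starting point for exploratory analysis, here is a quick pandas sketch (the DataFrame is illustrative; in practice you would load your own data):

```python
import pandas as pd

# Illustrative DataFrame; in practice, load your own data here
df = pd.DataFrame({
    "age": [25, 38, 47, 31, 52],
    "income": [52000, 61000, 75000, 58000, 69000],
})

# Summary statistics and pairwise correlations are a fast first pass
print(df.describe())
print(df.corr(numeric_only=True))
```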
Feature Engineering in Dataiku
Dataiku includes several tools and built-in features that speed up the process of feature engineering:
- Quick iteration through automated feature engineering: Quickly and automatically apply common techniques like datetime encoding, target encoding, scaling, and imputation.
- Reuse common features with a feature store: Easily reuse features that are most relevant to your organization and establish best practices with a feature store.
- Better collaboration and distillation of domain knowledge: By working in a platform usable by business analysts (without sacrificing full-code flexibility), you can easily collaborate with SMEs and establish shared understanding through visual flows that show the full history of data transformations and project wikis that outline key business goals.
- Built-in exploratory data analysis: Dataiku features built-in statistical analysis and visualizations to quickly test out relationships between data, including Principal Component Analysis (PCA).