
06. Feature Engineering Practical Guide

Encoding, Scaling, Missing Value Handling, Derived Variables


Learning Objectives

After completing this tutorial, you will be able to:

  • Categorical Encoding: Understand the differences between One-Hot, Label, Target, and Frequency Encoding and choose the right one for the situation
  • Numerical Transformation: Apply log transformation, binning, and various scalers to improve data distributions
  • Missing Value Handling: Use SimpleImputer, KNNImputer, IterativeImputer, and a Missing Indicator
  • Derived Variable Creation: Create new features using domain knowledge
  • Feature Selection: Select important variables using correlation, F-Score, and model-based methods
  • Pipeline Construction: Design safe preprocessing pipelines that prevent Data Leakage

Key Concepts

Why is Feature Engineering Important?

Factors affecting machine learning performance:

Factor | Impact
Data Quality | Most important - "Garbage in, garbage out"
Feature Engineering | Core factor - makes the performance difference with the same data
Algorithm Selection | Important, but less impactful than features
Hyperparameters | Final fine-tuning stage

A common trait of top Kaggle solutions: most use similar algorithms (XGBoost, LightGBM) but differ in feature engineering. As Andrew Ng put it, "Applied machine learning is basically feature engineering."


1. Categorical Encoding

Method | Description | Suitable Situation
One-Hot | Binary vector | Few categories, linear models
Label | Integer mapping | Tree models
Target | Target mean | High cardinality
Frequency | Frequency count | Leakage prevention

One-Hot Encoding

Converts categorical variables to binary vectors. Use when there's no order between categories.

# pandas method
df_encoded = pd.get_dummies(df, columns=['sex'], drop_first=True)
 
# sklearn method
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded = encoder.fit_transform(df[['sex']])
print(f'Feature names: {encoder.get_feature_names_out()}')
⚠️

With many categories, One-Hot Encoding leads to the curse of dimensionality. Consider Target or Frequency Encoding when a column has more than 8 unique values.

Label Encoding

Maps categories to integers. Effective for tree-based models.

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['sex_le'] = le.fit_transform(df['sex'])
⚠️

Label Encoding implies an ordinal relationship that may not exist. This can mislead linear models, but tree-based models split on thresholds and are largely unaffected.
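
If you need to encode several feature columns at once, a minimal sketch using sklearn's OrdinalEncoder (the feature-oriented counterpart of LabelEncoder, which is intended for targets) might look like this; the 'Unknown' fill value and the output column names are illustrative choices, not part of the tutorial:

from sklearn.preprocessing import OrdinalEncoder

cols = ['sex', 'embarked']
# OrdinalEncoder works on 2D input, so several columns are encoded in one pass;
# unseen categories at transform time are mapped to -1
oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
# fill the few missing 'embarked' values first so every entry is a string
encoded = oe.fit_transform(df[cols].fillna('Unknown'))
df['sex_oe'], df['embarked_oe'] = encoded[:, 0], encoded[:, 1]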

Target Encoding (with Smoothing)

Encodes categories with target mean. Effective for high cardinality categories.

def target_encode_smoothed(df, col, target, m=10):
    """Prevent overfitting with Smoothing"""
    global_mean = df[target].mean()
    agg = df.groupby(col)[target].agg(['mean', 'count'])
    smoothed = (agg['count'] * agg['mean'] + m * global_mean) / (agg['count'] + m)
    return df[col].map(smoothed)
 
# Usage example
df['deck_smoothed'] = target_encode_smoothed(df, 'deck', 'survived', m=10)

The smoothing parameter m pulls categories with few samples toward the global mean; m=10 is a common starting point.
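
Because Target Encoding uses the target itself, the mapping should be fit on the training split only and then applied to the test split. A minimal sketch of the same smoothed encoding, assuming the data has already been split into train_df and test_df (hypothetical names):

# fit the smoothed mapping on the training split only
global_mean = train_df['survived'].mean()
agg = train_df.groupby('deck')['survived'].agg(['mean', 'count'])
smoothed = (agg['count'] * agg['mean'] + 10 * global_mean) / (agg['count'] + 10)

train_df['deck_smoothed'] = train_df['deck'].map(smoothed)
# categories unseen in training fall back to the global mean
test_df['deck_smoothed'] = test_df['deck'].map(smoothed).fillna(global_mean)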

Frequency Encoding

Encodes each category by its frequency of occurrence. Because the target is not used, there is no risk of leakage.

freq_map = df['deck'].value_counts() / len(df)
df['deck_freq'] = df['deck'].map(freq_map)

2. Numerical Transformation

Scaler Comparison

Scaler | Characteristics | When to Use
StandardScaler | Mean = 0, Std = 1 | General cases
MinMaxScaler | [0, 1] range | Neural networks, distance-based models
RobustScaler | Uses median and IQR | When outliers exist

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
 
# If outliers exist, use RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)

Log Transformation

Transforms skewed distribution closer to normal distribution.

# Check Skewness
print(f'Original Skewness: {df["fare"].skew():.2f}')
 
# Log transformation (use log1p if 0 is included)
df['fare_log'] = np.log1p(df['fare'])
print(f'Transformed Skewness: {df["fare_log"].skew():.2f}')

Consider a log transformation when skewness falls outside the -1 to 1 range. np.log1p() computes log(1+x), which safely handles zero values.
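
If you later need values back on the original scale (for example, after log-transforming a regression target), np.expm1() is the exact inverse of np.log1p(). A quick check on the feature above; fare_restored is just an illustrative column name:

# np.expm1 undoes np.log1p, recovering the original scale
df['fare_restored'] = np.expm1(df['fare_log'])
print(df[['fare', 'fare_restored']].head())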

Binning

Converts continuous variables to categorical.

# Equal Width
pd.cut(df['age'], bins=5)
 
# Equal Frequency
pd.qcut(df['age'], q=5)
 
# Domain-based custom bins
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 12, 18, 35, 60, 100],
                         labels=['Child', 'Teen', 'Young', 'Adult', 'Senior'])
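
If binning needs to live inside a Pipeline (see section 6), sklearn's KBinsDiscretizer plays the role of pd.cut/pd.qcut: strategy='uniform' roughly matches equal width and strategy='quantile' matches equal frequency. A minimal sketch; the median fill for missing ages is an assumption for illustration:

from sklearn.preprocessing import KBinsDiscretizer

# 5 equal-frequency bins, returned as ordinal codes instead of one-hot columns
binner = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
df['age_bin'] = binner.fit_transform(df[['age']].fillna(df['age'].median())).ravel()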

3. Missing Value Handling

Method | Pros | Cons
Mean/Median | Fast, simple | Reduces variance
KNN Imputer | Considers variable relationships | High computational cost
Iterative Imputer | Most sophisticated | Complex, slow

from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 - required to activate IterativeImputer
from sklearn.impute import IterativeImputer
 
# Simple imputation
simple_imp = SimpleImputer(strategy='median')
 
# KNN-based (considers variable relationships)
knn_imp = KNNImputer(n_neighbors=5)
 
# Iterative (most sophisticated)
iter_imp = IterativeImputer(max_iter=10, random_state=42)
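
To choose between them empirically, you can wrap each imputer in a small pipeline and compare cross-validated scores, so the imputer is fit only on each training fold. A minimal sketch, assuming a numeric-only feature matrix X and target y, with LogisticRegression as a stand-in estimator:

from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

for name, imputer in [('simple', simple_imp), ('knn', knn_imp), ('iterative', iter_imp)]:
    # the imputer is re-fit on each fold's training portion, avoiding leakage
    pipe = Pipeline([('impute', imputer), ('model', LogisticRegression(max_iter=1000))])
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f'{name}: {scores.mean():.3f}')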

Missing Indicator

Missing values themselves can be informative. Add missing status as a separate feature.

# Indicate missing status
df['age_missing'] = df['age'].isna().astype(int)
 
# Impute missing values
df['age'] = df['age'].fillna(df['age'].median())  # avoid inplace fillna on a column (deprecated pattern)
 
# Analyze target by missing status
print(df.groupby('age_missing')['survived'].mean())

Missing Indicator can help prediction if missingness is not random (e.g., missing ages for elderly passengers).
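
The same idea is available in pipeline form: SimpleImputer(add_indicator=True) appends a binary indicator column for every feature that had missing values when it was fit (there is also a standalone MissingIndicator transformer). A minimal sketch, applied to the raw age column before it has been imputed:

from sklearn.impute import SimpleImputer

# median-imputes and appends a 0/1 column per feature that contained missing values at fit time
imp = SimpleImputer(strategy='median', add_indicator=True)
age_imputed = imp.fit_transform(df[['age']])  # here: column 0 = imputed age, column 1 = indicator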


4. Derived Variable Creation

Create new features using domain knowledge.

Titanic Example

# Family size
df['family_size'] = df['sibsp'] + df['parch'] + 1
 
# Traveling alone
df['is_alone'] = (df['family_size'] == 1).astype(int)
 
# Family size group
df['family_group'] = pd.cut(df['family_size'],
                            bins=[0, 1, 4, 11],
                            labels=['Alone', 'Small', 'Large'])
 
# Fare per person
df['fare_per_person'] = df['fare'] / df['family_size']
 
# Age group
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 12, 18, 35, 60, 100],
                         labels=['Child', 'Teen', 'Young', 'Adult', 'Senior'])

The effectiveness of derived variables depends on the model. Linear models benefit greatly from derived variables that express non-linear relationships, while tree models can already capture non-linearity, so the gain may be smaller.
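
One generic way to hand a linear model non-linear and interaction terms is PolynomialFeatures. This is a minimal sketch, not a replacement for the domain-driven features above; the column selection and the fillna(0) are illustrative assumptions (PolynomialFeatures does not accept NaN):

from sklearn.preprocessing import PolynomialFeatures

# degree-2 interaction terms (e.g., fare x family_size) without squared terms or a bias column
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_num = df[['age', 'fare', 'family_size']].fillna(0)
X_poly = poly.fit_transform(X_num)
print(poly.get_feature_names_out())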


5. Feature Selection

Comparing Various Methods

from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier
 
# 1. Correlation-based
corr = X.corrwith(y).abs().sort_values(ascending=False)
print('Correlation ranking:', corr.head())
 
# 2. F-Score (Statistical significance)
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
 
# 3. RFE (Recursive Feature Elimination)
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)
 
# 4. Model-based (Feature Importance)
rf = RandomForestClassifier(n_estimators=100).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
⚠️

Each method measures importance from different perspectives. Correlation looks at linear relationships only, F-Score at statistical significance, RF Importance at prediction contribution. Use multiple methods together!
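
One simple way to combine them is to rank the features under each method and average the ranks. A minimal sketch reusing corr, scores, and importances from the code above:

# rank features under each method (1 = most important) and average the ranks
ranking = pd.DataFrame({
    'corr_rank': corr.rank(ascending=False),
    'f_score_rank': scores.rank(ascending=False),
    'rf_rank': importances.rank(ascending=False),
})
ranking['mean_rank'] = ranking.mean(axis=1)
print(ranking.sort_values('mean_rank').head(10))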


6. Pipeline Construction (Leakage Prevention)

🚫

Data Leakage Warning!

Wrong: scale the entire dataset and then split into train/test (the scaler has already seen test-set statistics).

Correct: split first, fit the scaler on the training set only, then transform both sets.

Using a Pipeline prevents this mistake automatically.
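
As a concrete illustration of the difference, here is a minimal numeric-only sketch, assuming a purely numeric X and target y (the full pipeline below works on the raw mixed-type columns instead):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Wrong: the scaler would see test-set statistics before the split
# X_scaled = StandardScaler().fit_transform(X)

# Correct: split first, fit the scaler on the training set only, then transform both
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

The Pipeline below packages this fit-on-train-only behavior so it cannot be forgotten.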

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
 
num_features = ['age', 'sibsp', 'parch', 'fare']
cat_features = ['pclass', 'sex', 'embarked']
 
# Numerical preprocessing
num_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
 
# Categorical preprocessing
cat_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
 
# Combine
preprocessor = ColumnTransformer([
    ('num', num_transformer, num_features),
    ('cat', cat_transformer, cat_features)
])
 
# Full pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier())
])
 
# Usage
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
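
Because all preprocessing lives inside the pipeline, the whole object can also be passed to cross-validation (or a grid search) without leaking information across folds; a short sketch:

from sklearn.model_selection import cross_val_score

# each fold re-fits the imputers, scaler, and encoder on that fold's training portion only
cv_scores = cross_val_score(pipeline, X_train, y_train, cv=5)
print(f'CV accuracy: {cv_scores.mean():.3f} +/- {cv_scores.std():.3f}')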

Recommendations Summary

Situation | Recommended Method
Few categories + linear model | One-Hot
Tree models | Label/Ordinal
High cardinality | Target/Frequency
Skewed distribution | Log transformation
Outliers | RobustScaler
Missing values related to other variables | KNN Imputer
Leakage prevention | Pipeline (essential!)

Interview Questions Preview

  1. What are the pros, cons, and considerations of Target Encoding?
  2. When is Feature Scaling necessary and when is it not?
  3. Why use a Pipeline?

Check out more interview questions at Premium Interviews.


Practice Notebook

The notebook additionally covers:

  • Practical examples using Titanic and California Housing datasets
  • Performance comparison experiments by encoding method (One-Hot vs Label vs Target)
  • Accuracy comparison by Imputation method
  • Measuring model performance change from derived variables
  • Log transformation effect comparison in regression problems
  • Comparing multiple models (LogReg, RF, GBM, SVM) with Pipeline

Previous: 05. Ensemble Methods | Next: 07. Clustering