06. Feature Engineering Practical Guide
Encoding, Scaling, Missing Value Handling, Derived Variables
Learning Objectives
After completing this tutorial, you will be able to:
- Categorical Encoding: Understand differences between One-Hot, Label, Target, Frequency Encoding and choose appropriately
- Numerical Transformation: Apply Log transformation, Binning, various scalers to improve data distribution
- Missing Value Handling: Utilize Simple, KNN, Iterative Imputer and Missing Indicator
- Derived Variable Creation: Create new features using domain knowledge
- Feature Selection: Select important variables using correlation, F-Score, and model-based methods
- Pipeline Construction: Design safe preprocessing pipelines that prevent Data Leakage
Key Concepts
Why is Feature Engineering Important?
Factors affecting machine learning performance:
| Factor | Impact |
|---|---|
| Data Quality | Most important - "Garbage in, garbage out" |
| Feature Engineering | Core factor - Makes performance difference with same data |
| Algorithm Selection | Important but less impactful than features |
| Hyperparameters | Final fine-tuning stage |
Common trait of top Kaggle solutions: Most use similar algorithms (XGBoost, LightGBM), but differ in feature engineering. Andrew Ng: "Applied ML is basically feature engineering."
1. Categorical Encoding
| Method | Description | Suitable Situation |
|---|---|---|
| One-Hot | Binary vector | Few categories, Linear models |
| Label | Integer mapping | Tree models |
| Target | Target mean | High cardinality |
| Frequency | Frequency count | Leakage prevention |
One-Hot Encoding
Converts categorical variables to binary vectors. Use when there's no order between categories.
# pandas method
df_encoded = pd.get_dummies(df, columns=['sex'], drop_first=True)
# sklearn method
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, drop='first')
encoded = encoder.fit_transform(df[['sex']])
print(f'Feature names: {encoder.get_feature_names_out()}')
Curse of Dimensionality occurs with many categories! Consider Target or Frequency Encoding if there are more than 8 unique values.
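To decide between One-Hot and the higher-cardinality encoders, it helps to check each column's cardinality first. A minimal sketch, assuming df is the same DataFrame as above (the names cat_cols and high_card_cols are illustrative):
# Check the cardinality of each categorical column before choosing an encoder
cat_cols = df.select_dtypes(include=['object', 'category']).columns
cardinality = df[cat_cols].nunique().sort_values(ascending=False)
print(cardinality)
# Columns with more than ~8 unique values are candidates for Target or Frequency Encoding
high_card_cols = cardinality[cardinality > 8].index.tolist()
print(f'High-cardinality columns: {high_card_cols}')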
Label Encoding
Maps categories to integers. Effective for tree-based models.
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
df['sex_le'] = le.fit_transform(df['sex'])
Label Encoding may imply an ordinal relationship. Caution needed for linear models, but no problem for tree models.
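Note that sklearn's LabelEncoder is intended for target labels (one 1-D array at a time). For encoding feature columns, OrdinalEncoder is usually more convenient, since it accepts 2-D input and can handle categories unseen at fit time; a minimal sketch:
from sklearn.preprocessing import OrdinalEncoder
# Unlike LabelEncoder, OrdinalEncoder works on multiple columns at once
# and can map unseen test-time categories to a sentinel value
oe = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
df['sex_oe'] = oe.fit_transform(df[['sex']]).ravel()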
Target Encoding (with Smoothing)
Encodes categories with target mean. Effective for high cardinality categories.
def target_encode_smoothed(df, col, target, m=10):
    """Prevent overfitting with Smoothing"""
    global_mean = df[target].mean()
    agg = df.groupby(col)[target].agg(['mean', 'count'])
    smoothed = (agg['count'] * agg['mean'] + m * global_mean) / (agg['count'] + m)
    return df[col].map(smoothed)
# Usage example
df['deck_smoothed'] = target_encode_smoothed(df, 'deck', 'survived', m=10)
Smoothing parameter m pulls categories with few samples toward the global mean. m=10 is a common starting point.
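In practice, the encoding map should be computed on the training split only and then applied to the validation/test split, falling back to the global mean for unseen categories. A minimal sketch, assuming hypothetical train_df and test_df splits:
# Fit the smoothed mapping on the training data only to avoid target leakage
global_mean = train_df['survived'].mean()
agg = train_df.groupby('deck')['survived'].agg(['mean', 'count'])
mapping = (agg['count'] * agg['mean'] + 10 * global_mean) / (agg['count'] + 10)
# Apply to both splits; categories unseen in training fall back to the global mean
train_df['deck_te'] = train_df['deck'].map(mapping)
test_df['deck_te'] = test_df['deck'].map(mapping).fillna(global_mean)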
Frequency Encoding
Encodes by frequency of occurrence. No Leakage since Target information is not used.
freq_map = df['deck'].value_counts() / len(df)
df['deck_freq'] = df['deck'].map(freq_map)
2. Numerical Transformation
Scaler Comparison
| Scaler | Characteristics | When to Use |
|---|---|---|
| StandardScaler | Mean=0, Std=1 | General cases |
| MinMaxScaler | [0,1] range | Neural networks, Distance-based models |
| RobustScaler | Uses median, IQR | When outliers exist |
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# If outliers exist, use RobustScaler
scaler = RobustScaler()
X_scaled = scaler.fit_transform(X)
Log Transformation
Transforms skewed distribution closer to normal distribution.
# Check Skewness
print(f'Original Skewness: {df["fare"].skew():.2f}')
# Log transformation (use log1p if 0 is included)
df['fare_log'] = np.log1p(df['fare'])
print(f'Transformed Skewness: {df["fare_log"].skew():.2f}')
Consider Log transformation if Skewness is outside the -1 to 1 range. np.log1p() calculates log(1+x) to safely handle 0 values.
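If the log transform is applied to a regression target rather than a feature, the predictions must be mapped back with np.expm1 (the inverse of np.log1p) before evaluating them on the original scale. A minimal sketch, where y_train, X_train, X_test, and model are assumed to exist:
# Train on the log-transformed target, then invert predictions with expm1
y_train_log = np.log1p(y_train)          # y_train: original-scale target (assumption)
model.fit(X_train, y_train_log)          # model: any regressor (assumption)
pred = np.expm1(model.predict(X_test))   # back to the original scale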
Binning
Converts continuous variables to categorical.
# Equal Width
pd.cut(df['age'], bins=5)
# Equal Frequency
pd.qcut(df['age'], q=5)
# Domain-based custom bins
df['age_group'] = pd.cut(df['age'],
bins=[0, 12, 18, 35, 60, 100],
labels=['Child', 'Teen', 'Young', 'Adult', 'Senior'])
3. Missing Value Handling
| Method | Pros | Cons |
|---|---|---|
| Mean/Median | Fast, Simple | Variance reduction |
| KNN Imputer | Considers variable relationships | High computational cost |
| Iterative Imputer | Most sophisticated | Complex, Slow |
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
# Simple imputation
simple_imp = SimpleImputer(strategy='median')
# KNN-based (considers variable relationships)
knn_imp = KNNImputer(n_neighbors=5)
# Iterative (most sophisticated)
iter_imp = IterativeImputer(max_iter=10, random_state=42)
Missing Indicator
Missing values themselves can be informative. Add missing status as a separate feature.
# Indicate missing status
df['age_missing'] = df['age'].isna().astype(int)
# Impute missing values
df['age'] = df['age'].fillna(df['age'].median())
# Analyze target by missing status
print(df.groupby('age_missing')['survived'].mean())
A Missing Indicator can help prediction if missingness is not random (e.g., missing ages for elderly passengers).
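The same pattern is available directly in sklearn: SimpleImputer(add_indicator=True) appends a binary missing-indicator column for each feature that contained missing values, which keeps the step inside a Pipeline. A minimal sketch:
from sklearn.impute import SimpleImputer
# add_indicator=True appends one binary column per feature that had missing values
imp = SimpleImputer(strategy='median', add_indicator=True)
age_imputed = imp.fit_transform(df[['age']])
print(age_imputed.shape)  # (n_samples, 2): imputed 'age' + its missing indicator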
4. Derived Variable Creation
Create new features using domain knowledge.
Titanic Example
# Family size
df['family_size'] = df['sibsp'] + df['parch'] + 1
# Traveling alone
df['is_alone'] = (df['family_size'] == 1).astype(int)
# Family size group
df['family_group'] = pd.cut(df['family_size'],
bins=[0, 1, 4, 11],
labels=['Alone', 'Small', 'Large'])
# Fare per person
df['fare_per_person'] = df['fare'] / df['family_size']
# Age group
df['age_group'] = pd.cut(df['age'],
bins=[0, 12, 18, 35, 60, 100],
labels=['Child', 'Teen', 'Young', 'Adult', 'Senior'])
Derived variable effectiveness varies by model. Linear models benefit greatly from derived variables expressing non-linear relationships, while tree models already handle non-linearity, so the effect may be smaller.
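For linear models in particular, interaction and polynomial terms are a generic way to inject non-linearity, and sklearn's PolynomialFeatures automates this. A minimal sketch over two numeric columns (fillna(0) is only there because PolynomialFeatures cannot handle missing values):
from sklearn.preprocessing import PolynomialFeatures
# degree=2 adds squared terms and pairwise interactions (e.g., age*fare)
poly = PolynomialFeatures(degree=2, include_bias=False)
poly_features = poly.fit_transform(df[['age', 'fare']].fillna(0))
print(poly.get_feature_names_out())  # ['age', 'fare', 'age^2', 'age fare', 'fare^2']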
5. Feature Selection
Comparing Various Methods
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier
# 1. Correlation-based
corr = X.corrwith(y).abs().sort_values(ascending=False)
print('Correlation ranking:', corr.head())
# 2. F-Score (Statistical significance)
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)
scores = pd.Series(selector.scores_, index=X.columns).sort_values(ascending=False)
# 3. RFE (Recursive Feature Elimination)
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=10)
X_rfe = rfe.fit_transform(X, y)
# 4. Model-based (Feature Importance)
rf = RandomForestClassifier(n_estimators=100).fit(X, y)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
Each method measures importance from a different perspective. Correlation looks at linear relationships only, F-Score at statistical significance, and RF Importance at prediction contribution. Use multiple methods together!
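Impurity-based RandomForest importances can be biased toward high-cardinality features, so permutation importance (the score drop when a feature is randomly shuffled) is a useful cross-check. A minimal sketch reusing the fitted rf above; ideally it is computed on a held-out set:
from sklearn.inspection import permutation_importance
# Measures how much the score drops when each feature is shuffled
result = permutation_importance(rf, X, y, n_repeats=10, random_state=42)
perm_imp = pd.Series(result.importances_mean, index=X.columns).sort_values(ascending=False)
print(perm_imp.head())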
6. Pipeline Construction (Leakage Prevention)
Data Leakage Warning!
Wrong: Scale entire data then split train/test
Correct: Split first, fit only on train, then transform each
Using Pipeline prevents this automatically!
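For comparison, the manual version of the correct order looks like this (a minimal sketch assuming numeric X and target y); the Pipeline below makes the same thing automatic and less error-prone:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit statistics on train only
X_test_scaled = scaler.transform(X_test)        # reuse train statistics on test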
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
num_features = ['age', 'sibsp', 'parch', 'fare']
cat_features = ['pclass', 'sex', 'embarked']
# Numerical preprocessing
num_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Categorical preprocessing
cat_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Combine
preprocessor = ColumnTransformer([
('num', num_transformer, num_features),
('cat', cat_transformer, cat_features)
])
# Full pipeline
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier())
])
# Usage
pipeline.fit(X_train, y_train)
score = pipeline.score(X_test, y_test)
Recommendations Summary
| Situation | Recommended Method |
|---|---|
| Few categories + Linear model | One-Hot |
| Tree models | Label/Ordinal |
| High cardinality | Target/Frequency |
| Skewed distribution | Log transformation |
| Outliers | RobustScaler |
| Important relationships in missing values | KNN Imputer |
| Leakage prevention | Pipeline essential! |
Interview Questions Preview
- What are the pros, cons, and considerations of Target Encoding?
- When is Feature Scaling necessary and when is it not?
- Why use a Pipeline?
Check out more interview questions at Premium Interviews.
Practice Notebook
The notebook additionally covers:
- Practical examples using Titanic and California Housing datasets
- Performance comparison experiments by encoding method (One-Hot vs Label vs Target)
- Accuracy comparison by Imputation method
- Measuring model performance change from derived variables
- Log transformation effect comparison in regression problems
- Comparing multiple models (LogReg, RF, GBM, SVM) with Pipeline
Previous: 05. Ensemble Methods | Next: 07. Clustering