10. Mastering Imbalanced Data
SMOTE, Class Weight, Evaluation Metric Selection
Learning Objectives
After completing this tutorial, you will be able to:
- Understand Imbalanced Data definition and real-world problem cases
- Implement and understand Oversampling techniques (SMOTE, ADASYN)
- Implement Undersampling techniques (Random, Tomek Links)
- Perform cost-sensitive learning through Class Weight adjustment
- Select and interpret appropriate evaluation metrics (F1, ROC-AUC, PR-AUC)
- Optimize performance through Threshold adjustment
Key Concepts
1. What is Imbalanced Data?
Data where there's a significant difference in sample count between classes.
| Real Case | Majority Class | Minority Class (Ratio) |
|---|---|---|
| Fraud Detection | Normal transactions | Fraud 0.1% |
| Medical Diagnosis | Normal patients | Rare disease 1% |
| Manufacturing Defects | Normal products | Defects 2% |
| Customer Churn | Retained customers | Churned 5% |
Why is it a Problem?
Example: 1000 data points (990 normal, 10 fraud)
Model: Predicts "all transactions are normal"
→ Accuracy: 99% (looks good but...)
→ Fraud detection rate: 0% (completely useless!)

Even with high Accuracy, the model may not detect the minority class at all. Don't trust Accuracy on imbalanced data.
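To see this concretely, here is a minimal sketch (hypothetical data) of the "all normal" baseline using scikit-learn's DummyClassifier:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 1000 samples: 990 normal (0), 10 fraud (1)
X = np.zeros((1000, 1))
y = np.array([0] * 990 + [1] * 10)

# Baseline that always predicts the majority class ("all normal")
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = baseline.predict(X)
print(accuracy_score(y, y_pred))                 # 0.99
print(recall_score(y, y_pred, zero_division=0))  # 0.0 (no fraud detected)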
2. Solution Categories
| Category | Method | Description |
|---|---|---|
| Data Level | Over/Under Sampling | Adjust sample count |
| Algorithm Level | Class Weight, Cost-Sensitive | Adjust loss function |
| Evaluation Level | Appropriate metric selection | F1, AUC, etc. |
| Threshold Level | Threshold adjustment | Change probability cutoff |
3. Oversampling
Increases minority class sample count to balance classes.
Random Oversampling
Randomly duplicates minority class samples.
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

SMOTE (Synthetic Minority Over-sampling Technique)
Generates synthetic samples of minority class.
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

SMOTE Principle:
- Select a minority class sample
- Pick one of its k nearest minority-class neighbors
- Generate a new sample by linear interpolation between the two (see the sketch below)
SMOTE generates new synthetic samples rather than simple duplication, increasing diversity and reducing overfitting risk.
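A minimal NumPy sketch of the interpolation step (illustrative values; in practice use imblearn's SMOTE):

import numpy as np

rng = np.random.default_rng(42)
x = np.array([1.0, 2.0])              # a minority class sample
neighbor = np.array([2.0, 3.0])       # one of its k nearest minority neighbors
lam = rng.uniform(0.0, 1.0)           # random interpolation factor in [0, 1]
synthetic = x + lam * (neighbor - x)  # new point on the segment between them
print(synthetic)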
ADASYN (Adaptive Synthetic Sampling)
A SMOTE variant that focuses on harder-to-learn samples.
from imblearn.over_sampling import ADASYN
adasyn = ADASYN(random_state=42)
X_resampled, y_resampled = adasyn.fit_resample(X, y)

4. Undersampling
Reduces majority class sample count to balance classes.
Random Undersampling
Randomly removes samples from majority class.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

Random Undersampling can cause significant information loss. Example: 8000 → 800 (90% loss)
Tomek Links
Removes ambiguous samples at class boundaries to make the decision boundary clearer.
from imblearn.under_sampling import TomekLinks
tomek = TomekLinks()
X_resampled, y_resampled = tomek.fit_resample(X, y)

Tomek Link: a pair of samples from different classes that are each other's nearest neighbors
- Remove the majority-class sample of each Tomek Link → clearer class boundary (see the sketch below)
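A quick sketch of the effect on class counts, using a toy dataset generated with make_classification (numbers are illustrative):

from collections import Counter
from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification

# Toy imbalanced dataset (~90:10)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           n_clusters_per_class=1, random_state=42)
print(Counter(y))      # roughly Counter({0: 900, 1: 100})
X_res, y_res = TomekLinks().fit_resample(X, y)
print(Counter(y_res))  # the majority count drops only slightly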
5. Combined Techniques (Over + Under)
Combines oversampling and undersampling to capture the advantages of both.
from imblearn.combine import SMOTETomek
smote_tomek = SMOTETomek(random_state=42)
X_resampled, y_resampled = smote_tomek.fit_resample(X, y)

SMOTETomek Principle:
- Synthesize minority class with SMOTE
- Remove noise with Tomek Links
6. Class Weight
Assigns a higher weight to the minority class in the loss function.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Auto-calculate (inverse of class frequency)
model = LogisticRegression(class_weight='balanced')
# Or specify directly
model = RandomForestClassifier(class_weight={0: 1, 1: 10})

balanced formula: weight = n_samples / (n_classes * n_samples_class)
Weight Calculation Example
from sklearn.utils.class_weight import compute_class_weight

# Example: 7600 vs 400 (19:1 ratio), n_samples = 8000
weights = compute_class_weight('balanced', classes=[0, 1], y=y_train)
# Class 0: 8000 / (2 * 7600) = 0.5263
# Class 1: 8000 / (2 * 400)  = 10.0000 (19x higher weight)

Class Weight doesn't modify the data itself, so it can correct the imbalance while preserving the original data distribution.
Random Forest's balanced_subsample
rf_balanced = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced_subsample',  # weights recomputed on each tree's bootstrap sample
    random_state=42
)

7. Evaluation Metric Selection
Why Accuracy is Inappropriate
With 99% normal, 1% fraud data:
→ Predicting all as normal gives Accuracy = 99%
→ But fraud detection completely failed!

Recommended Metrics
| Metric | Formula | Meaning | When to Use |
|---|---|---|---|
| Precision | TP/(TP+FP) | Ratio of actual positives among positive predictions | When FP cost is high |
| Recall | TP/(TP+FN) | Ratio of detections among actual positives | When FN cost is high |
| F1-Score | 2×P×R/(P+R) | Harmonic mean of Precision and Recall | When balance needed |
| PR-AUC | PR curve area | Comprehensive evaluation sensitive to imbalance | When severely imbalanced |
| ROC-AUC | ROC curve area | Comprehensive classification ability evaluation | General |
ROC-AUC vs PR-AUC: PR-AUC is more sensitive to imbalance. ROC-AUC can look deceptively high when true negatives dominate, so PR-AUC is recommended under extreme imbalance.
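A short sketch computing all three metrics, assuming a fitted model and held-out X_test/y_test; average_precision_score is a common way to estimate PR-AUC:

from sklearn.metrics import f1_score, roc_auc_score, average_precision_score

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]   # positive-class probability

print(f'F1:      {f1_score(y_test, y_pred):.3f}')
print(f'ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}')
print(f'PR-AUC:  {average_precision_score(y_test, y_proba):.3f}')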
8. Threshold Adjustment
Adjust based on business requirements instead of default threshold (0.5).
import numpy as np
from sklearn.metrics import precision_recall_curve

# Calculate Precision/Recall per threshold (y_proba = positive-class probabilities)
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)

# Find the highest threshold that still guarantees the target Recall.
# recall is sorted in decreasing order, so take the last index where it holds
# (recall has one more element than thresholds, hence the [:-1]).
target_recall = 0.9
idx = np.where(recall[:-1] >= target_recall)[0][-1]
optimal_threshold = thresholds[idx]

# Apply the chosen threshold
y_pred_optimal = (y_proba >= optimal_threshold).astype(int)

Lowering the threshold increases Recall and decreases Precision; raising it does the opposite. Choose based on business costs.
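Continuing from the snippet above, a quick comparison of the default 0.5 cutoff against the tuned threshold:

from sklearn.metrics import precision_score, recall_score

for name, thr in [('default 0.5', 0.5), ('tuned', optimal_threshold)]:
    pred = (y_proba >= thr).astype(int)
    print(f'{name}: precision={precision_score(y_test, pred):.3f}, '
          f'recall={recall_score(y_test, pred):.3f}')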
Code Summary
SMOTE + Model Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
# SMOTE + Model Pipeline
pipeline = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Evaluation
print(classification_report(y_test, y_pred))

Class Weight Method
model_weighted = RandomForestClassifier(
    class_weight='balanced',
    random_state=42
)
model_weighted.fit(X_train, y_train)

Correct Cross-Validation with SMOTE
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

# imblearn Pipeline (SMOTE + model)
pipeline = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

# Stratified K-Fold (SMOTE applied within each fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_train, y_train, cv=skf, scoring='f1')
print(f'Mean F1: {scores.mean():.3f} (±{scores.std():.3f})')

Important: resampling should only be applied to the Train data; keep the Test data as-is. Using imblearn's Pipeline in cross-validation applies SMOTE only to the training portion of each fold.
Method Comparison
| Method | Pros | Cons |
|---|---|---|
| Random Undersampling | Fast, Reduces training time | Large information loss |
| Random Oversampling | Preserves information | Overfitting risk (simple duplication) |
| SMOTE | Increases diversity, Generates new samples | May create noise |
| ADASYN | Focuses on difficult samples | May create more noise than SMOTE |
| Tomek Links | Clarifies boundaries | Only removes small amount |
| SMOTETomek | Combines advantages | Increased computational cost |
| Class Weight | No data modification | Model dependent |
Practical Guide
| Situation | Recommended Method |
|---|---|
| Sufficient data | Undersampling or Class Weight |
| Insufficient data | SMOTE or ADASYN |
| Tree-based models | Class Weight (balanced/balanced_subsample) |
| Noisy data | SMOTETomek (noise removal) |
| Severe imbalance (100:1+) | SMOTE + Class Weight combination (sketch below) |
| Real-time prediction needed | Class Weight (shorter training time) |
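A sketch of the SMOTE + Class Weight combination from the table above; sampling_strategy=0.5 (raising the minority class to half the majority count) is an illustrative choice to tune, not a fixed rule:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier

pipeline = ImbPipeline([
    # Partially oversample: raise the minority class to 50% of the majority
    ('smote', SMOTE(sampling_strategy=0.5, random_state=42)),
    # Let class_weight compensate for the remaining imbalance
    ('classifier', RandomForestClassifier(class_weight='balanced',
                                          random_state=42))
])
pipeline.fit(X_train, y_train)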
Best Practices
- Always use stratified sampling: split Train/Test with stratify=y (see the sketch after this list)
- Resample only Train: keep Test data as original
- Cross-Validation caution: Use StratifiedKFold, resample within each fold
- Metric selection: Use F1, ROC-AUC, PR-AUC instead of Accuracy
- Consider Threshold adjustment: Optimize according to business requirements
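For the first practice above, a minimal stratified-split sketch (assuming a feature matrix X and labels y):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,        # preserve the class ratio in both splits
    random_state=42
)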
Interview Questions Preview
- Why is Accuracy inappropriate for imbalanced data?
- What are SMOTE's principles and limitations?
- Should you prioritize Precision or Recall?
- What's the difference between ROC-AUC and PR-AUC?
- How is Class Weight calculated?
Check out more interview questions at Premium Interviews.
Practice Notebook
Practice all imbalanced data handling techniques:
The notebook additionally covers:
- Imbalanced data generation and visualization (PCA 2D projection)
- Performance comparison experiments of all methods (6 methods)
- ROC Curve vs PR Curve comparison analysis
- Performance change visualization by Threshold
- Random Forest balanced_subsample usage
- Practice problems (extreme imbalance, cost-sensitive learning, multi-class)