10. Mastering Imbalanced Data
SMOTE, Class Weight, Evaluation Metric Selection
Learning Objectives
After completing this tutorial, you will be able to:
- Understand Imbalanced Data definition and real-world problem cases
- Implement and understand Oversampling techniques (SMOTE, ADASYN)
- Implement Undersampling techniques (Random, Tomek Links)
- Perform cost-sensitive learning through Class Weight adjustment
- Select and interpret appropriate evaluation metrics (F1, ROC-AUC, PR-AUC)
- Optimize performance through Threshold adjustment
Key Concepts
1. What is Imbalanced Data?
Data where there's a significant difference in sample count between classes.
| Real Case | Majority Class | Minority Class (Ratio) |
|---|---|---|
| Fraud Detection | Normal transactions | Fraud 0.1% |
| Medical Diagnosis | Normal patients | Rare disease 1% |
| Manufacturing Defects | Normal products | Defects 2% |
| Customer Churn | Retained customers | Churned 5% |
Why is it a Problem?
Example: 1000 data points (990 normal, 10 fraud)
Model: Predicts "all transactions are normal"
→ Accuracy: 99% (looks good but...)
→ Fraud detection rate: 0% (completely useless!)

Even with high Accuracy, the model may not detect the minority class at all. Don't trust Accuracy on imbalanced data.
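To see this concretely, here is a minimal sketch (hypothetical data) of the "all normal" baseline using scikit-learn's DummyClassifier:

import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# 1000 samples: 990 normal (0), 10 fraud (1)
X = np.zeros((1000, 1))
y = np.array([0] * 990 + [1] * 10)

# Baseline that always predicts the majority class ("all normal")
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
y_pred = baseline.predict(X)
print(accuracy_score(y, y_pred))                 # 0.99
print(recall_score(y, y_pred, zero_division=0))  # 0.0 (no fraud detected)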
2. Solution Categories
| Category | Method | Description |
|---|---|---|
| Data Level | Over/Under Sampling | Adjust sample count |
| Algorithm Level | Class Weight, Cost-Sensitive | Adjust loss function |
| Evaluation Level | Appropriate metric selection | F1, AUC, etc. |
| Threshold Level | Threshold adjustment | Change probability cutoff |
3. Oversampling
Increases minority class sample count to balance classes.
Random Oversampling
Randomly duplicates minority class samples.
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

SMOTE (Synthetic Minority Over-sampling Technique)
Generates synthetic samples of minority class.
from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

SMOTE Principle:
- Select a minority class sample
- Pick one of its k nearest minority-class neighbors
- Generate a new sample by linear interpolation between the two (see the sketch below)
SMOTE generates new synthetic samples rather than simple duplication, increasing diversity and reducing overfitting risk.
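A minimal NumPy sketch of the interpolation step (illustrative values; in practice use imblearn's SMOTE):

import numpy as np

rng = np.random.default_rng(42)
x = np.array([1.0, 2.0])              # a minority class sample
neighbor = np.array([2.0, 3.0])       # one of its k nearest minority neighbors
lam = rng.uniform(0.0, 1.0)           # random interpolation factor in [0, 1]
synthetic = x + lam * (neighbor - x)  # new point on the segment between them
print(synthetic)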
ADASYN (Adaptive Synthetic Sampling)
A SMOTE variant that focuses on harder-to-learn samples.
from imblearn.over_sampling import ADASYN
adasyn = ADASYN(random_state=42)
X_resampled, y_resampled = adasyn.fit_resample(X, y)

4. Undersampling
Reduces majority class sample count to balance classes.
Random Undersampling
Randomly removes samples from majority class.
from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)

Random Undersampling can cause significant information loss. Example: 8000 → 800 (90% loss)
Tomek Links
Removes ambiguous samples at class boundaries to make the decision boundary clearer.
from imblearn.under_sampling import TomekLinks
tomek = TomekLinks()
X_resampled, y_resampled = tomek.fit_resample(X, y)

Tomek Link: a pair of samples from different classes that are each other's nearest neighbors
- Remove the majority-class sample of each Tomek Link → clearer class boundary (see the sketch below)
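A quick sketch of the effect on class counts, using a toy dataset generated with make_classification (numbers are illustrative):

from collections import Counter
from imblearn.under_sampling import TomekLinks
from sklearn.datasets import make_classification

# Toy imbalanced dataset (~90:10)
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           n_clusters_per_class=1, random_state=42)
print(Counter(y))      # roughly Counter({0: 900, 1: 100})
X_res, y_res = TomekLinks().fit_resample(X, y)
print(Counter(y_res))  # the majority count drops only slightly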
5. Combined Techniques (Over + Under)
Combines oversampling and undersampling to capture the advantages of both.
from imblearn.combine import SMOTETomek
smote_tomek = SMOTETomek(random_state=42)
X_resampled, y_resampled = smote_tomek.fit_resample(X, y)

SMOTETomek Principle:
- Synthesize minority class with SMOTE
- Remove noise with Tomek Links
6. Class Weight
Assigns a higher weight to the minority class in the loss function.
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
# Auto-calculate (inverse of class frequency)
model = LogisticRegression(class_weight='balanced')
# Or specify directly
model = RandomForestClassifier(class_weight={0: 1, 1: 10})

balanced formula: weight = n_samples / (n_classes * n_samples_class)
Weight Calculation Example
from sklearn.utils.class_weight import compute_class_weight

# Example: 7600 vs 400 (19:1 ratio), n_samples = 8000
weights = compute_class_weight('balanced', classes=[0, 1], y=y_train)
# Class 0: 8000 / (2 * 7600) = 0.5263
# Class 1: 8000 / (2 * 400)  = 10.0000 (19x higher weight)

Class Weight doesn't modify the data itself, so it can correct the imbalance while preserving the original data distribution.
Random Forest's balanced_subsample
rf_balanced = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced_subsample',  # weights recomputed on each tree's bootstrap sample
    random_state=42
)

7. Evaluation Metric Selection
Why Accuracy is Inappropriate
With 99% normal, 1% fraud data:
→ Predicting all as normal gives Accuracy = 99%
→ But fraud detection completely failed!

Recommended Metrics
| Metric | Formula | Meaning | When to Use |
|---|---|---|---|
| Precision | TP/(TP+FP) | Ratio of actual positives among positive predictions | When FP cost is high |
| Recall | TP/(TP+FN) | Ratio of detections among actual positives | When FN cost is high |
| F1-Score | 2×P×R/(P+R) | Harmonic mean of Precision and Recall | When balance needed |
| PR-AUC | PR curve area | Comprehensive evaluation sensitive to imbalance | When severely imbalanced |
| ROC-AUC | ROC curve area | Comprehensive classification ability evaluation | General |
ROC-AUC vs PR-AUC: PR-AUC is more sensitive to imbalance. ROC-AUC can look deceptively high when true negatives dominate, so PR-AUC is recommended under extreme imbalance.
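A short sketch computing all three metrics, assuming a fitted model and held-out X_test/y_test; average_precision_score is a common way to estimate PR-AUC:

from sklearn.metrics import f1_score, roc_auc_score, average_precision_score

y_pred = model.predict(X_test)
y_proba = model.predict_proba(X_test)[:, 1]   # positive-class probability

print(f'F1:      {f1_score(y_test, y_pred):.3f}')
print(f'ROC-AUC: {roc_auc_score(y_test, y_proba):.3f}')
print(f'PR-AUC:  {average_precision_score(y_test, y_proba):.3f}')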
8. Threshold Adjustment
Adjust based on business requirements instead of default threshold (0.5).
import numpy as np
from sklearn.metrics import precision_recall_curve

# Calculate Precision/Recall per threshold (y_proba = positive-class probabilities)
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)

# Find the highest threshold that still guarantees the target Recall.
# recall is sorted in decreasing order, so take the last index where it holds
# (recall has one more element than thresholds, hence the [:-1]).
target_recall = 0.9
idx = np.where(recall[:-1] >= target_recall)[0][-1]
optimal_threshold = thresholds[idx]

# Apply the chosen threshold
y_pred_optimal = (y_proba >= optimal_threshold).astype(int)

Lowering the threshold increases Recall and decreases Precision; raising it does the opposite. Choose based on business costs.
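Continuing from the snippet above, a quick comparison of the default 0.5 cutoff against the tuned threshold:

from sklearn.metrics import precision_score, recall_score

for name, thr in [('default 0.5', 0.5), ('tuned', optimal_threshold)]:
    pred = (y_proba >= thr).astype(int)
    print(f'{name}: precision={precision_score(y_test, pred):.3f}, '
          f'recall={recall_score(y_test, pred):.3f}')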
Code Summary
SMOTE + Model Pipeline
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
# SMOTE + Model Pipeline
pipeline = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Evaluation
print(classification_report(y_test, y_pred))

Class Weight Method
model_weighted = RandomForestClassifier(
    class_weight='balanced',
    random_state=42
)
model_weighted.fit(X_train, y_train)

Correct Cross-Validation with SMOTE
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold

# imblearn Pipeline (SMOTE + model)
pipeline = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])

# Stratified K-Fold (SMOTE applied within each fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_train, y_train, cv=skf, scoring='f1')
print(f'Mean F1: {scores.mean():.3f} (±{scores.std():.3f})')

Important: resampling should only be applied to the Train data; keep the Test data as-is. Using imblearn's Pipeline in cross-validation applies SMOTE only to the training portion of each fold.
Method Comparison
| Method | Pros | Cons |
|---|---|---|
| Random Undersampling | Fast, Reduces training time | Large information loss |
| Random Oversampling | Preserves information | Overfitting risk (simple duplication) |
| SMOTE | Increases diversity, Generates new samples | May create noise |
| ADASYN | Focuses on difficult samples | May create more noise than SMOTE |
| Tomek Links | Clarifies boundaries | Only removes small amount |
| SMOTETomek | Combines advantages | Increased computational cost |
| Class Weight | No data modification | Model dependent |
Practical Guide
| Situation | Recommended Method |
|---|---|
| Sufficient data | Undersampling or Class Weight |
| Insufficient data | SMOTE or ADASYN |
| Tree-based models | Class Weight (balanced/balanced_subsample) |
| Noisy data | SMOTETomek (noise removal) |
| Severe imbalance (100:1+) | SMOTE + Class Weight combination (sketch below) |
| Real-time prediction needed | Class Weight (shorter training time) |
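A sketch of the SMOTE + Class Weight combination from the table above; sampling_strategy=0.5 (raising the minority class to half the majority count) is an illustrative choice to tune, not a fixed rule:

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier

pipeline = ImbPipeline([
    # Partially oversample: raise the minority class to 50% of the majority
    ('smote', SMOTE(sampling_strategy=0.5, random_state=42)),
    # Let class_weight compensate for the remaining imbalance
    ('classifier', RandomForestClassifier(class_weight='balanced',
                                          random_state=42))
])
pipeline.fit(X_train, y_train)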
Best Practices
- Always use stratified sampling: split Train/Test with stratify=y (see the sketch after this list)
- Resample only Train: keep Test data as original
- Cross-Validation caution: Use StratifiedKFold, resample within each fold
- Metric selection: Use F1, ROC-AUC, PR-AUC instead of Accuracy
- Consider Threshold adjustment: Optimize according to business requirements
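For the first practice above, a minimal stratified-split sketch (assuming a feature matrix X and labels y):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    stratify=y,        # preserve the class ratio in both splits
    random_state=42
)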
Interview Questions Preview
- Why is Accuracy inappropriate for imbalanced data?
- What are SMOTE's principles and limitations?
- Should you prioritize Precision or Recall?
- What's the difference between ROC-AUC and PR-AUC?
- How is Class Weight calculated?
Check out more interview questions at Premium Interviews.
Practice Notebook
Practice all imbalanced data handling techniques:
The notebook additionally covers:
- Imbalanced data generation and visualization (PCA 2D projection)
- Performance comparison experiments of all methods (6 methods)
- ROC Curve vs PR Curve comparison analysis
- Performance change visualization by Threshold
- Random Forest balanced_subsample usage
- Practice problems (extreme imbalance, cost-sensitive learning, multi-class)