
10. Mastering Imbalanced Data

SMOTE, Class Weight, Evaluation Metric Selection


Learning Objectives

After completing this tutorial, you will be able to:

  1. Define imbalanced data and recognize real-world cases where it causes problems
  2. Implement and understand Oversampling techniques (SMOTE, ADASYN)
  3. Implement Undersampling techniques (Random, Tomek Links)
  4. Perform cost-sensitive learning through Class Weight adjustment
  5. Select and interpret appropriate evaluation metrics (F1, ROC-AUC, PR-AUC)
  6. Optimize performance through Threshold adjustment

Key Concepts

1. What is Imbalanced Data?

Data where there's a significant difference in sample count between classes.

Real Case | Majority Class | Minority Class | Ratio
Fraud Detection | Normal transactions | Fraud | 0.1%
Medical Diagnosis | Normal patients | Rare disease | 1%
Manufacturing Defects | Normal products | Defects | 2%
Customer Churn | Retained customers | Churned | 5%

Why is it a Problem?

Example: 1000 data points (990 normal, 10 fraud)

Model: Predicts "all transactions are normal"
→ Accuracy: 99%  (looks good but...)
→ Fraud detection rate: 0%  (completely useless!)
⚠️ Even with high Accuracy, a model may fail to detect the minority class at all. Do not rely on Accuracy alone with imbalanced data.
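
The arithmetic in the example above can be reproduced in a few lines. This is a minimal sketch in plain Python; the label encoding (1 = fraud, 0 = normal) and the always-normal "model" are illustrative assumptions, not part of any library.

```python
# Illustrative data: 990 normal (0) and 10 fraud (1) transactions.
y_true = [0] * 990 + [1] * 10

# A "model" that predicts every transaction as normal.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall on the fraud class: detected fraud / actual fraud.
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
fraud_recall = true_positives / sum(y_true)

print(f"Accuracy: {accuracy:.2%}")          # Accuracy: 99.00%
print(f"Fraud recall: {fraud_recall:.2%}")  # Fraud recall: 0.00%
```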


2. Solution Categories

Category | Method | Description
Data Level | Over/Under Sampling | Adjust sample count
Algorithm Level | Class Weight, Cost-Sensitive | Adjust loss function
Evaluation Level | Appropriate metric selection | F1, AUC, etc.
Threshold Level | Threshold adjustment | Change probability cutoff

3. Oversampling

Increases minority class sample count to balance classes.

Random Oversampling

Randomly duplicates minority class samples.

from imblearn.over_sampling import RandomOverSampler
 
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, y)

SMOTE (Synthetic Minority Over-sampling Technique)

Generates synthetic samples of minority class.

from imblearn.over_sampling import SMOTE
 
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)

SMOTE Principle:

  1. Select minority class sample
  2. Select one of k-nearest neighbors
  3. Generate new sample by linear interpolation between two samples

Because SMOTE generates new synthetic samples instead of duplicating existing ones, it adds diversity and reduces the overfitting risk that comes with random oversampling.
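
The three steps of the SMOTE principle can be sketched with NumPy. This is a simplified illustration, not imblearn's implementation: the function name `smote_sketch` and its parameters are hypothetical, and neighbors are found by brute-force distance computation.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=42):
    """Generate n_new synthetic points from minority samples X_min (simplified)."""
    rng = np.random.default_rng(seed)
    n = len(X_min)
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbor
    k = min(k, n - 1)
    neighbors = np.argsort(dists, axis=1)[:, :k]  # k nearest neighbors per sample

    synthetic = np.empty((n_new, X_min.shape[1]))
    for i in range(n_new):
        base = rng.integers(n)               # 1. select a minority sample
        neigh = rng.choice(neighbors[base])  # 2. select one of its k nearest neighbors
        gap = rng.random()                   # 3. interpolate linearly between the two
        synthetic[i] = X_min[base] + gap * (X_min[neigh] - X_min[base])
    return synthetic

X_min = np.array([[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]])
X_new = smote_sketch(X_min, n_new=6, k=2)
print(X_new.shape)  # (6, 2)
```

Every synthetic point lies on the line segment between a minority sample and one of its neighbors, which is why SMOTE can also place points in regions that overlap the majority class.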

ADASYN (Adaptive Synthetic Sampling)

A SMOTE variant that focuses on harder-to-learn samples.

from imblearn.over_sampling import ADASYN
 
adasyn = ADASYN(random_state=42)
X_resampled, y_resampled = adasyn.fit_resample(X, y)

4. Undersampling

Reduces majority class sample count to balance classes.

Random Undersampling

Randomly removes samples from majority class.

from imblearn.under_sampling import RandomUnderSampler
 
rus = RandomUnderSampler(random_state=42)
X_resampled, y_resampled = rus.fit_resample(X, y)
⚠️ Random Undersampling can cause significant information loss. Example: 8,000 samples → 800 samples (90% loss)

Tomek Links

Removes ambiguous samples at class boundaries to clarify boundaries.

from imblearn.under_sampling import TomekLinks
 
tomek = TomekLinks()
X_resampled, y_resampled = tomek.fit_resample(X, y)

Tomek Link: a pair of samples from different classes that are each other's nearest neighbors

  • Removing the majority class sample of each Tomek Link clarifies the boundary
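
To make the definition concrete, here is a minimal NumPy sketch that finds Tomek Links by brute force and drops only their majority class member. The helper `find_tomek_links` and the toy data are hypothetical; imblearn's `TomekLinks` should be preferred in practice.

```python
import numpy as np

def find_tomek_links(X, y):
    """Return index pairs (i, j): mutual nearest neighbors with different labels."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)
    nearest = dists.argmin(axis=1)
    links = []
    for i, j in enumerate(nearest):
        if i < j and nearest[j] == i and y[i] != y[j]:
            links.append((i, int(j)))
    return links

# 1-D toy data: majority class 0 on the left, minority class 1 on the right.
X = np.array([[0.0], [1.0], [2.5], [2.9], [3.1], [4.5]])
y = np.array([0, 0, 0, 0, 1, 1])

links = find_tomek_links(X, y)
print(links)  # [(3, 4)]

# Drop only the majority class member of each link.
majority_in_links = [i for pair in links for i in pair if y[i] == 0]
keep = np.setdiff1d(np.arange(len(X)), majority_in_links)
X_clean, y_clean = X[keep], y[keep]
print(len(X_clean))  # 5
```

Samples 0 and 1 are also mutual nearest neighbors, but they share a label, so they do not form a Tomek Link.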

5. Combined Techniques (Over + Under)

Combines oversampling and undersampling to get the benefits of both.

from imblearn.combine import SMOTETomek
 
smote_tomek = SMOTETomek(random_state=42)
X_resampled, y_resampled = smote_tomek.fit_resample(X, y)

SMOTETomek Principle:

  1. Synthesize minority class with SMOTE
  2. Remove noise with Tomek Links

6. Class Weight

Assigns higher weight to minority class in loss function.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
 
# Auto-calculate (inverse of class frequency)
model = LogisticRegression(class_weight='balanced')
 
# Or specify directly
model = RandomForestClassifier(class_weight={0: 1, 1: 10})

balanced formula: weight = n_samples / (n_classes * n_samples_class)

Weight Calculation Example

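import numpy as np
from sklearn.utils.class_weight import compute_class_weight
 
# Example: y_train holds 7600 samples of class 0 and 400 of class 1 (19:1 ratio)
# classes is documented as a NumPy array, so pass np.unique(y_train) rather than a list
weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
# Class 0: 8000 / (2 * 7600) = 0.5263
# Class 1: 8000 / (2 * 400)  = 10.0000  (19x higher weight)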

Class Weight doesn't modify data itself, so it can correct imbalance while maintaining original data distribution.
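
How a class weight enters the loss can be sketched in plain Python. This is an illustrative weighted binary cross-entropy, assuming each sample's loss is simply multiplied by the weight of its true class; the function `weighted_log_loss` is hypothetical, and real libraries may normalize differently.

```python
import math

def weighted_log_loss(y_true, p_pred, class_weight):
    """Mean binary cross-entropy, each sample scaled by the weight of its true class."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        sample_loss = -(y * math.log(p) + (1 - y) * math.log(1 - p))
        total += class_weight[y] * sample_loss
    return total / len(y_true)

# 'balanced' weights for the 7600 vs 400 example: n_samples / (n_classes * count).
counts = {0: 7600, 1: 400}
n_samples = sum(counts.values())
balanced = {c: n_samples / (len(counts) * n) for c, n in counts.items()}
print(balanced)  # roughly {0: 0.526, 1: 10.0}

# The same-sized mistake on each class: 0.2 for a positive, 0.8 for a negative.
miss_minority = weighted_log_loss([1], [0.2], balanced)
miss_majority = weighted_log_loss([0], [0.8], balanced)
print(miss_minority / miss_majority)  # close to 19
```

With these weights, an equally confident mistake on a minority sample costs about 19 times more than the same mistake on a majority sample, which pushes the model to pay attention to the rare class.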

Random Forest's balanced_subsample

rf_balanced = RandomForestClassifier(
    n_estimators=100,
    class_weight='balanced_subsample',  # Weights recomputed from each tree's bootstrap sample
    random_state=42
)

7. Evaluation Metric Selection

Why Accuracy is Inappropriate

With 99% normal, 1% fraud data:
→ Predicting all as normal gives Accuracy = 99%
→ But fraud detection completely failed!

Recommended Metrics

Metric | Formula | Meaning | When to Use
Precision | TP/(TP+FP) | Ratio of actual positives among positive predictions | When FP cost is high
Recall | TP/(TP+FN) | Ratio of detections among actual positives | When FN cost is high
F1-Score | 2×P×R/(P+R) | Harmonic mean of Precision and Recall | When balance needed
PR-AUC | PR curve area | Comprehensive evaluation sensitive to imbalance | When severely imbalanced
ROC-AUC | ROC curve area | Comprehensive classification ability evaluation | General

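ROC-AUC vs PR-AUC: PR-AUC is more sensitive to performance on the minority class. The false positive rate used by the ROC curve, FP/(FP+TN), stays small when TN is large, so ROC-AUC can look high even when many positive predictions are wrong. PR-AUC is therefore recommended for extreme imbalance.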
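
A small worked example shows why a large TN count flatters ROC-based metrics. The confusion-matrix counts below are hypothetical, chosen only to illustrate the formulas in the table.

```python
# Hypothetical confusion-matrix counts for 1,000 samples with 10 positives.
tp, fn, fp, tn = 8, 2, 40, 950

precision = tp / (tp + fp)
recall = tp / (tp + fn)   # also the true positive rate (ROC y-axis)
f1 = 2 * precision * recall / (precision + recall)
fpr = fp / (fp + tn)      # false positive rate (ROC x-axis)

print(f"Precision: {precision:.3f}")  # Precision: 0.167
print(f"Recall:    {recall:.3f}")     # Recall:    0.800
print(f"F1:        {f1:.3f}")         # F1:        0.276
print(f"FPR:       {fpr:.3f}")        # FPR:       0.040
```

A false positive rate of 4% looks excellent on a ROC curve, yet only one in six positive predictions is correct. Precision, and therefore the PR curve, exposes that weakness.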


8. Threshold Adjustment

Choose the decision threshold from business requirements instead of using the default of 0.5.

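import numpy as np
from sklearn.metrics import precision_recall_curve
 
# y_proba: predicted probability of the positive class,
# e.g. model.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, y_proba)
 
# Find the highest threshold that still reaches the target Recall.
# recall has one more entry than thresholds and decreases as the threshold rises,
# so drop the last entry and take the last index that meets the target.
target_recall = 0.9
idx = np.where(recall[:-1] >= target_recall)[0][-1]
optimal_threshold = thresholds[idx]
 
# Apply the chosen threshold
y_pred_optimal = (y_proba >= optimal_threshold).astype(int)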

Lowering the threshold increases Recall and decreases Precision; raising it does the opposite. Choose the threshold based on business costs.
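
One way to turn business costs into a threshold is to score each candidate cutoff by its total misclassification cost. The sketch below assumes hypothetical costs (a missed positive is 10 times worse than a false alarm) and toy probabilities; none of these values come from a real model.

```python
import numpy as np

# Toy labels and predicted probabilities (illustrative values only).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_proba = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.45, 0.40, 0.60, 0.70, 0.90])

# Hypothetical business costs per error type.
COST_FN, COST_FP = 10.0, 1.0

def total_cost(threshold):
    """Total cost of the false negatives and false positives at this cutoff."""
    y_pred = (y_proba >= threshold).astype(int)
    fn = np.sum((y_true == 1) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return float(COST_FN * fn + COST_FP * fp)

candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
costs = [total_cost(t) for t in candidates]
best_threshold = candidates[int(np.argmin(costs))]

print(best_threshold)              # 0.4
print(total_cost(0.5))             # 11.0 (default cutoff misses one positive)
print(total_cost(best_threshold))  # 2.0
```

In practice the threshold should be chosen on a validation set, not on the test set used for the final evaluation.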


Code Summary

SMOTE + Model Pipeline

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
 
# SMOTE + Model Pipeline
pipeline = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', RandomForestClassifier(random_state=42))
])
 
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
 
# Evaluation
print(classification_report(y_test, y_pred))

Class Weight Method

model_weighted = RandomForestClassifier(
    class_weight='balanced',
    random_state=42
)
model_weighted.fit(X_train, y_train)

Correct Cross-Validation with SMOTE

from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, StratifiedKFold
 
# imblearn Pipeline (SMOTE + model)
pipeline = ImbPipeline([
    ('smote', SMOTE(random_state=42)),
    ('classifier', LogisticRegression(max_iter=1000, random_state=42))
])
 
# Stratified K-Fold (SMOTE applied within each fold)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipeline, X_train, y_train, cv=skf, scoring='f1')
 
print(f'Mean F1: {scores.mean():.3f} (+/- {scores.std():.3f})')
🚫 Important: Apply resampling only to the Train data and keep the Test data in its original distribution. Putting SMOTE inside imblearn's Pipeline ensures that, during CV, it is fitted and applied only to the training portion of each fold.
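
The split-first, resample-second order can be illustrated with NumPy alone. This sketch uses plain random oversampling as a stand-in for SMOTE and a hand-made stratified split; the data and variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy dataset: 95 majority (0) and 5 minority (1) samples, one feature.
X = rng.normal(size=(100, 1))
y = np.array([0] * 95 + [1] * 5)

# Hold out the test set FIRST (manual stratified split: 20% of each class).
test_idx = np.concatenate([np.where(y == 0)[0][:19], np.where(y == 1)[0][:1]])
train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Randomly duplicate minority rows in the TRAIN split only.
minority = np.where(y_train == 1)[0]
n_extra = int(np.sum(y_train == 0)) - len(minority)
extra = rng.choice(minority, size=n_extra, replace=True)
X_train_res = np.vstack([X_train, X_train[extra]])
y_train_res = np.concatenate([y_train, y_train[extra]])

print(np.bincount(y_train_res))  # [76 76]
print(np.bincount(y_test))       # [19  1]
```

The training split is balanced, while the test split keeps the original class ratio, so the evaluation reflects the data the model will see in production.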


Method Comparison

Method | Pros | Cons
Random Undersampling | Fast, Reduces training time | Large information loss
Random Oversampling | Preserves information | Overfitting risk (simple duplication)
SMOTE | Increases diversity, Generates new samples | May create noise
ADASYN | Focuses on difficult samples | May create more noise than SMOTE
Tomek Links | Clarifies boundaries | Only removes small amount
SMOTETomek | Combines advantages | Increased computational cost
Class Weight | No data modification | Model dependent

Practical Guide

Situation | Recommended Method
Sufficient data | Undersampling or Class Weight
Insufficient data | SMOTE or ADASYN
Tree-based models | Class Weight (balanced/balanced_subsample)
Noisy data | SMOTETomek (noise removal)
Severe imbalance (100:1+) | SMOTE + Class Weight combination
Real-time prediction needed | Class Weight (shorter training time)

Best Practices

  1. Always use stratified sampling: Split Train/Test with stratify=y
  2. Resample only Train: Keep Test data as original
  3. Cross-Validation caution: Use StratifiedKFold, resample within each fold
  4. Metric selection: Use F1, ROC-AUC, PR-AUC instead of Accuracy
  5. Consider Threshold adjustment: Optimize according to business requirements
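
The first practice can be sketched with scikit-learn's `train_test_split`. The toy dataset below (950 majority, 50 minority samples) is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy dataset: 950 majority (0) and 50 minority (1) samples.
X = np.arange(1000).reshape(-1, 1)
y = np.array([0] * 950 + [1] * 50)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Both splits keep the original 5% minority share.
print(y_train.mean(), y_test.mean())  # 0.05 0.05
```

Without `stratify=y`, a random split of a very rare class can leave the test set with too few positives for a reliable evaluation.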

Interview Questions Preview

  1. Why is Accuracy inappropriate for imbalanced data?
  2. What are SMOTE's principles and limitations?
  3. Should you prioritize Precision or Recall?
  4. What's the difference between ROC-AUC and PR-AUC?
  5. How is Class Weight calculated?



Practice Notebook

The practice notebook walks through all the imbalanced data handling techniques above and additionally covers:

  • Imbalanced data generation and visualization (PCA 2D projection)
  • Performance comparison experiments of all methods (6 methods)
  • ROC Curve vs PR Curve comparison analysis
  • Performance change visualization by Threshold
  • Random Forest balanced_subsample usage
  • Practice problems (extreme imbalance, cost-sensitive learning, multi-class)
