05. Ensemble Methods

Bagging, Random Forest, Boosting, XGBoost, LightGBM


Learning Objectives

After completing this tutorial, you will be able to:

  1. Understand the key differences between Bagging and Boosting
  2. Understand Random Forest operation principles and experiment with main parameters
  3. Understand the mathematical principles of Gradient Boosting and XGBoost implementation
  4. Compare performance of XGBoost, LightGBM, CatBoost
  5. Perform hyperparameter tuning with GridSearch, RandomSearch, Optuna
  6. Analyze and interpret Feature Importance

Key Concepts

1. What is Ensemble?

Ensemble learning is a technique that combines multiple weak learners to create a strong learner.

"The wisdom of three people is better than one genius" - Collective Intelligence
          ┌─────────────────────────────────────┐
          │         Ensemble Learning           │
          └─────────────────────────────────────┘

          ┌───────────────┴───────────────┐
          ▼                               ▼
  ┌───────────────┐               ┌───────────────┐
  │    Bagging    │               │   Boosting    │
  │  (Parallel)   │               │  (Sequential) │
  └───────────────┘               └───────────────┘
          │                               │
          ▼                               ▼
  Random Forest               AdaBoost, GBM, XGBoost

2. Ensemble Types

| Method   | Characteristics                        | Representative Models |
|----------|----------------------------------------|-----------------------|
| Bagging  | Parallel training, variance reduction  | Random Forest         |
| Boosting | Sequential training, bias reduction    | XGBoost, LightGBM     |
| Stacking | Meta-model learning                    | StackingClassifier    |
| Voting   | Majority vote / average                | VotingClassifier      |

3. Bagging (Bootstrap Aggregating)

Principle:

  1. Bootstrap sampling from original data (sampling with replacement)
  2. Train independent models on each sample (parallelizable)
  3. Combine predictions through voting (classification) or averaging (regression)

Mathematical Expression:

\hat{f}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)

Advantages:

  • Variance reduction → Overfitting prevention
  • Fast training through parallel processing
  • Validation possible with OOB (Out-of-Bag) samples

OOB (Out-of-Bag) Insight: About 36.8% of the original data is left out of each bootstrap sample, because the chance that a given row is never drawn in n draws with replacement is (1 − 1/n)^n ≈ e^{−1} ≈ 0.368. Evaluating on these OOB samples gives a performance estimate without a separate validation set!

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
 
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    bootstrap=True,
    oob_score=True,  # Calculate OOB score
    random_state=42
)
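
The ~36.8% figure in the OOB note above can be checked empirically. A minimal sketch assuming only NumPy (the sample size n = 1000 is arbitrary):

import numpy as np
 
# Draw one bootstrap sample and count how many original rows never appear
n = 1000
rng = np.random.default_rng(42)
sample = rng.integers(0, n, size=n)            # sampling with replacement
oob_fraction = 1 - len(np.unique(sample)) / n
print(f"OOB fraction: {oob_fraction:.3f}")     # ≈ 0.368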

4. Random Forest

Random Forest = Bagging + Feature Randomness

  • Uses randomly selected feature subset when training each tree
  • Reduces correlation between trees → Maximizes ensemble effect

Main Parameters:

| Parameter         | Description                                                | Default |
|-------------------|------------------------------------------------------------|---------|
| n_estimators      | Number of trees (more is better, but diminishing returns)  | 100     |
| max_features      | Number of features considered at each split                | sqrt(n) |
| max_depth         | Maximum tree depth                                         | None    |
| min_samples_split | Minimum samples required to split a node                   | 2       |
| min_samples_leaf  | Minimum samples required in a leaf node                    | 1       |

from sklearn.ensemble import RandomForestClassifier
 
rf = RandomForestClassifier(
    n_estimators=100,      # Number of trees
    max_depth=10,          # Tree depth
    max_features='sqrt',   # Random features
    min_samples_leaf=2,
    n_jobs=-1,             # Parallel processing
    random_state=42
)
⚠️ n_estimators Selection Guide: Performance improves as more trees are added, but with diminishing returns. Monitor the OOB score to find where the improvement levels off; because the OOB score closely tracks test performance, model selection is possible without a separate validation set.

OOB (Out-of-Bag) Score

Validation using samples not included in bootstrap:

rf = RandomForestClassifier(oob_score=True)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_}")

5. Boosting

A method that sequentially corrects errors of previous models.

Principle:

  1. Train first model
  2. Train next model focusing on previous model's errors (Residual)
  3. Weighted sum of all model predictions

Mathematical Expression:

\hat{f}(x) = \sum_{m=1}^{M}\gamma_m h_m(x)

Gradient Boosting

Reduces error using the gradient of the loss function:

F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)

Where:

  • F_m(x): ensemble prediction at step m
  • h_m(x): m-th weak learner (fit to the residuals)
  • \eta: learning rate (shrinkage factor)

from sklearn.ensemble import GradientBoostingClassifier
 
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
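
The update rule F_m(x) = F_{m-1}(x) + η·h_m(x) can be made concrete with a tiny hand-rolled boosting loop for squared-error regression, where each weak learner is fit to the current residuals. This is a minimal sketch on made-up toy data, not the library implementation:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
 
# Toy 1-D regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)
 
eta, M = 0.1, 100                      # learning rate and number of rounds
F = np.full_like(y, y.mean())          # F_0: constant initial prediction
for m in range(M):
    residuals = y - F                  # negative gradient of squared error
    h = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    F = F + eta * h.predict(X)         # F_m = F_{m-1} + eta * h_m
print(f"Training MSE after {M} rounds: {np.mean((y - F) ** 2):.4f}")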

XGBoost

from xgboost import XGBClassifier
 
xgb = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    n_jobs=-1,
    tree_method='hist',  # Fast training
    random_state=42
)

Early Stopping Usage: XGBoost allows setting early stopping with the early_stopping_rounds parameter. This prevents overfitting and automatically finds the optimal number of iterations.

# Early Stopping example
xgb_es = XGBClassifier(
    n_estimators=1000,  # Set large value
    learning_rate=0.1,
    early_stopping_rounds=20,
)
 
xgb_es.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)
print(f"Optimal iterations: {xgb_es.best_iteration}")

LightGBM

from lightgbm import LGBMClassifier
 
lgbm = LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=-1,  # No limit
    num_leaves=31,
    n_jobs=-1,
    random_state=42
)

6. XGBoost vs LightGBM vs CatBoost Comparison

| Feature                | XGBoost           | LightGBM  | CatBoost               |
|------------------------|-------------------|-----------|------------------------|
| Tree Growth            | Level-wise        | Leaf-wise | Symmetric (level-wise) |
| Speed                  | Fast              | Very fast | Moderate               |
| Categorical Handling   | Encoding required | Supported | Strong support         |
| Missing Value Handling | Automatic         | Automatic | Automatic              |
| GPU Support            | Yes               | Yes       | Yes                    |

| Model         | Pros                                              | Cons                           |
|---------------|---------------------------------------------------|--------------------------------|
| Random Forest | Parallelizable, resists overfitting, easy to tune | Slow prediction                |
| XGBoost       | High performance, built-in regularization         | Memory usage, tuning-sensitive |
| LightGBM      | Very fast, scales to large data                   | Leaf-wise growth can overfit   |
| CatBoost      | Native categorical handling                       | Slow training                  |
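
To make the categorical-handling row concrete: LightGBM consumes pandas 'category' columns natively, while CatBoost expects the categorical columns to be named via cat_features. A minimal sketch on a made-up 'city'/'income' frame, assuming lightgbm and catboost are installed:

import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
 
# Toy data with one categorical feature (illustrative only)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'city': pd.Categorical(rng.choice(['NY', 'SF', 'LA'], size=200)),
    'income': rng.normal(50, 10, size=200),
})
y = (rng.random(200) > 0.5).astype(int)
 
# LightGBM: columns with 'category' dtype are used as categorical automatically
LGBMClassifier(n_estimators=50).fit(df, y)
 
# CatBoost: pass categorical column names (or indices) explicitly
df_cb = df.assign(city=df['city'].astype(str))
CatBoostClassifier(iterations=50, verbose=0).fit(df_cb, y, cat_features=['city'])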

7. Feature Importance

# Random Forest
importances = rf.feature_importances_
 
# XGBoost (several importance types are available)
xgb.feature_importances_  # gain-based by default
 
from xgboost import plot_importance
plot_importance(xgb, importance_type='weight')  # split count
plot_importance(xgb, importance_type='gain')    # information gain
plot_importance(xgb, importance_type='cover')   # coverage
💡 Feature Importance Interpretation Tip: Each model computes feature importance differently. Combining importances from several models gives a more reliable interpretation.
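
As a model-agnostic cross-check on these built-in importances, scikit-learn's permutation_importance shuffles one feature at a time on held-out data and measures the score drop. A minimal sketch, assuming rf has been fit and that X_test/y_test denote a held-out split (those names are not defined earlier in this tutorial):

from sklearn.inspection import permutation_importance
 
# Score drop when each feature is shuffled on held-out data
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=42, n_jobs=-1)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")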


Code Summary

from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    VotingClassifier
)
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
 
# Models
models = {
    'RF': RandomForestClassifier(n_estimators=100, random_state=42),
    'GB': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'XGB': XGBClassifier(n_estimators=100, random_state=42),
    'LGBM': LGBMClassifier(n_estimators=100, random_state=42)
}
 
# Performance comparison
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.4f} (+/-{scores.std():.4f})")

Hyperparameter Tuning

GridSearchCV

from sklearn.model_selection import GridSearchCV
 
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}
 
grid_search = GridSearchCV(
    XGBClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")

RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
 
param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(2, 10),
    'learning_rate': uniform(0.01, 0.3),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4)
}
 
random_search = RandomizedSearchCV(
    XGBClassifier(random_state=42),
    param_distributions,
    n_iter=50,
    cv=5,
    n_jobs=-1
)
random_search.fit(X_train, y_train)

Optuna Recommendation: For more efficient hyperparameter tuning, try Optuna. It uses TPE (Tree-structured Parzen Estimator) sampler to explore the search space more effectively.
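
A minimal Optuna sketch for XGBoost, assuming X_train and y_train from earlier; the search ranges mirror the RandomizedSearchCV example above, and TPE is Optuna's default sampler:

import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
 
def objective(trial):
    # Search space roughly matching the RandomizedSearchCV ranges above
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 2, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
    }
    model = XGBClassifier(random_state=42, **params)
    return cross_val_score(model, X_train, y_train, cv=5).mean()
 
study = optuna.create_study(direction='maximize')  # TPE sampler by default
study.optimize(objective, n_trials=50)
print(f"Best parameters: {study.best_params}")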


Voting Ensemble

Combine multiple models for additional performance improvement:

from sklearn.ensemble import VotingClassifier
 
voting_clf = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
        ('xgb', XGBClassifier(n_estimators=200, random_state=42)),
        ('lgb', LGBMClassifier(n_estimators=200, random_state=42))
    ],
    voting='soft'  # Probability-based voting
)
 
voting_clf.fit(X_train, y_train)
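
Stacking, listed in the ensemble-types table at the top, goes one step further than voting: a meta-model is trained on the base models' out-of-fold predictions. A minimal sketch with scikit-learn's StackingClassifier (LogisticRegression as the meta-model is an arbitrary choice):

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
 
stacking_clf = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
        ('xgb', XGBClassifier(n_estimators=200, random_state=42)),
        ('lgb', LGBMClassifier(n_estimators=200, random_state=42))
    ],
    final_estimator=LogisticRegression(),  # meta-model trained on out-of-fold predictions
    cv=5
)
stacking_clf.fit(X_train, y_train)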

Ensemble Methods Checklist

| Item               | Bagging (RF)           | Boosting (XGB)            |
|--------------------|------------------------|---------------------------|
| Training Method    | Parallel (independent) | Sequential (dependent)    |
| Goal               | Variance reduction     | Bias + variance reduction |
| Overfitting Risk   | Low                    | High (needs caution)      |
| Training Speed     | Fast (parallelizable)  | Slower (sequential)       |
| Tuning Sensitivity | Low                    | High                      |

Interview Questions Preview

  1. What's the difference between Bagging and Boosting?
  2. What is the role of max_features in Random Forest?
  3. What are the differences between XGBoost and LightGBM?

Check out more interview questions at Premium Interviews.


Practice Notebook

💡 Additional notebook content: Bootstrap sampling visualization, Bagging vs. single-tree variability comparison, n_estimators / max_features effect experiments, step-by-step visualization of the boosting process, California Housing regression performance comparison, Optuna hyperparameter tuning, and practice problems (House Prices regression, Stacking ensemble, SHAP analysis).


Previous: 04. Decision Tree | Next: 06. Feature Engineering