05. Ensemble Methods

Bagging, Random Forest, Boosting, XGBoost, LightGBM


Learning Objectives

After completing this tutorial, you will be able to:

  1. Understand the key differences between Bagging and Boosting
  2. Understand Random Forest operation principles and experiment with main parameters
  3. Understand the mathematical principles of Gradient Boosting and XGBoost implementation
  4. Compare performance of XGBoost, LightGBM, CatBoost
  5. Perform hyperparameter tuning with GridSearch, RandomSearch, Optuna
  6. Analyze and interpret Feature Importance

Key Concepts

1. What is Ensemble?

Ensemble learning is a technique that combines multiple weak learners to create a strong learner.

"The wisdom of three people is better than one genius" - Collective Intelligence
          ┌─────────────────────────────────────┐
          │         Ensemble Learning           │
          └─────────────────────────────────────┘

          ┌───────────────┴───────────────┐
          ▼                               ▼
  ┌───────────────┐               ┌───────────────┐
  │    Bagging    │               │   Boosting    │
  │  (Parallel)   │               │  (Sequential) │
  └───────────────┘               └───────────────┘
          │                               │
          ▼                               ▼
  Random Forest               AdaBoost, GBM, XGBoost

2. Ensemble Types

| Method   | Characteristics                        | Representative Models |
|----------|----------------------------------------|-----------------------|
| Bagging  | Parallel training, variance reduction  | Random Forest         |
| Boosting | Sequential training, bias reduction    | XGBoost, LightGBM     |
| Stacking | Meta-model learning                    | StackingClassifier    |
| Voting   | Majority vote / average                | VotingClassifier      |

3. Bagging (Bootstrap Aggregating)

Principle:

  1. Bootstrap sampling from original data (sampling with replacement)
  2. Train independent models on each sample (parallelizable)
  3. Combine predictions through voting (classification) or averaging (regression)

Mathematical Expression:

\hat{f}_{\text{bag}}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}_b(x)

Advantages:

  • Variance reduction → Overfitting prevention
  • Fast training through parallel processing
  • Validation possible with OOB (Out-of-Bag) samples

OOB (Out-of-Bag) Insight: About 36.8% of the original data is left out of each bootstrap sample, because the chance that a given row is never drawn in n draws with replacement is (1 − 1/n)^n ≈ e^{−1} ≈ 0.368. Evaluating on these OOB samples gives a performance estimate without a separate validation set!

from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
 
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),
    n_estimators=100,
    max_samples=0.8,
    bootstrap=True,
    oob_score=True,  # Calculate OOB score
    random_state=42
)
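
The ~36.8% figure in the OOB note above can be checked empirically. A minimal sketch assuming only NumPy (the sample size n = 1000 is arbitrary):

import numpy as np
 
# Draw one bootstrap sample and count how many original rows never appear
n = 1000
rng = np.random.default_rng(42)
sample = rng.integers(0, n, size=n)            # sampling with replacement
oob_fraction = 1 - len(np.unique(sample)) / n
print(f"OOB fraction: {oob_fraction:.3f}")     # ≈ 0.368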

4. Random Forest

Random Forest = Bagging + Feature Randomness

  • Uses randomly selected feature subset when training each tree
  • Reduces correlation between trees → Maximizes ensemble effect

Main Parameters:

| Parameter         | Description                                                | Default |
|-------------------|------------------------------------------------------------|---------|
| n_estimators      | Number of trees (more is better, but diminishing returns)  | 100     |
| max_features      | Number of features considered at each split                | sqrt(n) |
| max_depth         | Maximum tree depth                                         | None    |
| min_samples_split | Minimum samples required to split a node                   | 2       |
| min_samples_leaf  | Minimum samples required in a leaf node                    | 1       |

from sklearn.ensemble import RandomForestClassifier
 
rf = RandomForestClassifier(
    n_estimators=100,      # Number of trees
    max_depth=10,          # Tree depth
    max_features='sqrt',   # Random features
    min_samples_leaf=2,
    n_jobs=-1,             # Parallel processing
    random_state=42
)
⚠️ n_estimators Selection Guide: Performance improves as more trees are added, but with diminishing returns. Monitor the OOB score to find where the improvement levels off; because the OOB score closely tracks test performance, model selection is possible without a separate validation set.

OOB (Out-of-Bag) Score

Validation using samples not included in bootstrap:

rf = RandomForestClassifier(oob_score=True)
rf.fit(X_train, y_train)
print(f"OOB Score: {rf.oob_score_}")

5. Boosting

A method that sequentially corrects errors of previous models.

Principle:

  1. Train first model
  2. Train next model focusing on previous model's errors (Residual)
  3. Weighted sum of all model predictions

Mathematical Expression:

\hat{f}(x) = \sum_{m=1}^{M}\gamma_m h_m(x)

Gradient Boosting

Reduces error using the gradient of the loss function:

F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)

Where:

  • F_m(x): ensemble prediction at step m
  • h_m(x): m-th weak learner (fit to the residuals)
  • \eta: learning rate (shrinkage factor)

from sklearn.ensemble import GradientBoostingClassifier
 
gb = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    random_state=42
)
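
The update rule F_m(x) = F_{m-1}(x) + η·h_m(x) can be made concrete with a tiny hand-rolled boosting loop for squared-error regression, where each weak learner is fit to the current residuals. This is a minimal sketch on made-up toy data, not the library implementation:

import numpy as np
from sklearn.tree import DecisionTreeRegressor
 
# Toy 1-D regression data (illustrative only)
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)
 
eta, M = 0.1, 100                      # learning rate and number of rounds
F = np.full_like(y, y.mean())          # F_0: constant initial prediction
for m in range(M):
    residuals = y - F                  # negative gradient of squared error
    h = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    F = F + eta * h.predict(X)         # F_m = F_{m-1} + eta * h_m
print(f"Training MSE after {M} rounds: {np.mean((y - F) ** 2):.4f}")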

XGBoost

from xgboost import XGBClassifier
 
xgb = XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    n_jobs=-1,
    tree_method='hist',  # Fast training
    random_state=42
)

Early Stopping Usage: XGBoost allows setting early stopping with the early_stopping_rounds parameter. This prevents overfitting and automatically finds the optimal number of iterations.

# Early Stopping example
xgb_es = XGBClassifier(
    n_estimators=1000,  # Set large value
    learning_rate=0.1,
    early_stopping_rounds=20,
)
 
xgb_es.fit(
    X_train, y_train,
    eval_set=[(X_val, y_val)],
    verbose=False
)
print(f"Optimal iterations: {xgb_es.best_iteration}")

LightGBM

from lightgbm import LGBMClassifier
 
lgbm = LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=-1,  # No limit
    num_leaves=31,
    n_jobs=-1,
    random_state=42
)

6. XGBoost vs LightGBM vs CatBoost Comparison

| Feature                | XGBoost           | LightGBM  | CatBoost               |
|------------------------|-------------------|-----------|------------------------|
| Tree Growth            | Level-wise        | Leaf-wise | Symmetric (level-wise) |
| Speed                  | Fast              | Very fast | Moderate               |
| Categorical Handling   | Encoding required | Supported | Strong support         |
| Missing Value Handling | Automatic         | Automatic | Automatic              |
| GPU Support            | Yes               | Yes       | Yes                    |

| Model         | Pros                                              | Cons                           |
|---------------|---------------------------------------------------|--------------------------------|
| Random Forest | Parallelizable, resists overfitting, easy to tune | Slow prediction                |
| XGBoost       | High performance, built-in regularization         | Memory usage, tuning-sensitive |
| LightGBM      | Very fast, scales to large data                   | Leaf-wise growth can overfit   |
| CatBoost      | Native categorical handling                       | Slow training                  |
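
To make the categorical-handling row concrete: LightGBM consumes pandas 'category' columns natively, while CatBoost expects the categorical columns to be named via cat_features. A minimal sketch on a made-up 'city'/'income' frame, assuming lightgbm and catboost are installed:

import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
 
# Toy data with one categorical feature (illustrative only)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'city': pd.Categorical(rng.choice(['NY', 'SF', 'LA'], size=200)),
    'income': rng.normal(50, 10, size=200),
})
y = (rng.random(200) > 0.5).astype(int)
 
# LightGBM: columns with 'category' dtype are used as categorical automatically
LGBMClassifier(n_estimators=50).fit(df, y)
 
# CatBoost: pass categorical column names (or indices) explicitly
df_cb = df.assign(city=df['city'].astype(str))
CatBoostClassifier(iterations=50, verbose=0).fit(df_cb, y, cat_features=['city'])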

7. Feature Importance

# Random Forest
importances = rf.feature_importances_
 
# XGBoost (several importance types are available)
xgb.feature_importances_  # gain-based by default
 
from xgboost import plot_importance
plot_importance(xgb, importance_type='weight')  # split count
plot_importance(xgb, importance_type='gain')    # information gain
plot_importance(xgb, importance_type='cover')   # coverage
💡 Feature Importance Interpretation Tip: Each model computes feature importance differently. Combining importances from several models gives a more reliable interpretation.
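
As a model-agnostic cross-check on these built-in importances, scikit-learn's permutation_importance shuffles one feature at a time on held-out data and measures the score drop. A minimal sketch, assuming rf has been fit and that X_test/y_test denote a held-out split (those names are not defined earlier in this tutorial):

from sklearn.inspection import permutation_importance
 
# Score drop when each feature is shuffled on held-out data
result = permutation_importance(rf, X_test, y_test, n_repeats=10,
                                random_state=42, n_jobs=-1)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.4f} "
          f"+/- {result.importances_std[i]:.4f}")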


Code Summary

from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    VotingClassifier
)
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
 
# Models
models = {
    'RF': RandomForestClassifier(n_estimators=100, random_state=42),
    'GB': GradientBoostingClassifier(n_estimators=100, random_state=42),
    'XGB': XGBClassifier(n_estimators=100, random_state=42),
    'LGBM': LGBMClassifier(n_estimators=100, random_state=42)
}
 
# Performance comparison
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.4f} (+/-{scores.std():.4f})")

Hyperparameter Tuning

GridSearchCV

from sklearn.model_selection import GridSearchCV
 
param_grid = {
    'n_estimators': [100, 200],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0]
}
 
grid_search = GridSearchCV(
    XGBClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best parameters: {grid_search.best_params_}")

RandomizedSearchCV

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
 
param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': randint(2, 10),
    'learning_rate': uniform(0.01, 0.3),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.6, 0.4)
}
 
random_search = RandomizedSearchCV(
    XGBClassifier(random_state=42),
    param_distributions,
    n_iter=50,
    cv=5,
    n_jobs=-1
)
random_search.fit(X_train, y_train)

Optuna Recommendation: For more efficient hyperparameter tuning, try Optuna. It uses TPE (Tree-structured Parzen Estimator) sampler to explore the search space more effectively.
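
A minimal Optuna sketch for XGBoost, assuming X_train and y_train from earlier; the search ranges mirror the RandomizedSearchCV example above, and TPE is Optuna's default sampler:

import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier
 
def objective(trial):
    # Search space roughly matching the RandomizedSearchCV ranges above
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 50, 500),
        'max_depth': trial.suggest_int('max_depth', 2, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.6, 1.0),
    }
    model = XGBClassifier(random_state=42, **params)
    return cross_val_score(model, X_train, y_train, cv=5).mean()
 
study = optuna.create_study(direction='maximize')  # TPE sampler by default
study.optimize(objective, n_trials=50)
print(f"Best parameters: {study.best_params}")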


Voting Ensemble

Combine multiple models for additional performance improvement:

from sklearn.ensemble import VotingClassifier
 
voting_clf = VotingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
        ('xgb', XGBClassifier(n_estimators=200, random_state=42)),
        ('lgb', LGBMClassifier(n_estimators=200, random_state=42))
    ],
    voting='soft'  # Probability-based voting
)
 
voting_clf.fit(X_train, y_train)
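
Stacking, listed in the ensemble-types table at the top, goes one step further than voting: a meta-model is trained on the base models' out-of-fold predictions. A minimal sketch with scikit-learn's StackingClassifier (LogisticRegression as the meta-model is an arbitrary choice):

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
 
stacking_clf = StackingClassifier(
    estimators=[
        ('rf', RandomForestClassifier(n_estimators=200, random_state=42)),
        ('xgb', XGBClassifier(n_estimators=200, random_state=42)),
        ('lgb', LGBMClassifier(n_estimators=200, random_state=42))
    ],
    final_estimator=LogisticRegression(),  # meta-model trained on out-of-fold predictions
    cv=5
)
stacking_clf.fit(X_train, y_train)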

Ensemble Methods Checklist

| Item               | Bagging (RF)           | Boosting (XGB)            |
|--------------------|------------------------|---------------------------|
| Training Method    | Parallel (independent) | Sequential (dependent)    |
| Goal               | Variance reduction     | Bias + variance reduction |
| Overfitting Risk   | Low                    | High (needs caution)      |
| Training Speed     | Fast (parallelizable)  | Slower (sequential)       |
| Tuning Sensitivity | Low                    | High                      |

Interview Questions Preview

  1. What's the difference between Bagging and Boosting?
  2. What is the role of max_features in Random Forest?
  3. What are the differences between XGBoost and LightGBM?

Check out more interview questions at Premium Interviews.


Practice Notebook

💡 Additional notebook content: Bootstrap sampling visualization, Bagging vs. single-tree variability comparison, n_estimators / max_features effect experiments, step-by-step visualization of the boosting process, California Housing regression performance comparison, Optuna hyperparameter tuning, and practice problems (House Prices regression, Stacking ensemble, SHAP analysis).


Previous: 04. Decision Tree | Next: 06. Feature Engineering