
04. Mastering Decision Trees

Gini, Entropy, Pruning, Feature Importance


Learning Objectives

After completing this tutorial, you will be able to:

  1. Understand the mathematical principles and differences between Gini Impurity and Entropy
  2. Calculate the splitting criteria selection process through Information Gain step by step
  3. Visualize and interpret tree structure using Graphviz
  4. Prevent overfitting through Pruning techniques
  5. Analyze and interpret Feature Importance
  6. Search for optimal tree structure through hyperparameter tuning

Key Concepts

1. What is a Decision Tree?

A Decision Tree is an algorithm that classifies data through a sequence of if-else rules. It's a tree-structured model that divides data by questions (conditions) to make predictions.

          [Root Node]
         /           \
    [Internal]    [Internal]
      /    \        /    \
   [Leaf] [Leaf] [Leaf] [Leaf]

Decision Tree is a White-box model, making it easy to visualize and interpret the decision-making process and explain the prediction rationale.


2. Impurity Metrics

Decision Trees split data in the direction that minimizes impurity.

Gini Impurity

Gini = 1 - Σ(pᵢ)²
  • pᵢ: Proportion of class i
  • Range: [0, 1 - 1/C]
  • 0: Perfectly pure (only one class)
  • 0.5: Maximum impurity for binary classification (two classes in equal proportion)

Example (Binary Classification):

  • Node with A:50, B:50 → Gini = 1 - (0.5² + 0.5²) = 0.5
  • Node with A:90, B:10 → Gini = 1 - (0.9² + 0.1²) = 0.18
  • Node with A:100, B:0 → Gini = 1 - (1.0² + 0.0²) = 0.0
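
A minimal sketch (the gini helper name is ours) that reproduces the three node examples above from raw class counts:

import numpy as np

def gini(counts):
    """Gini impurity from raw class counts: 1 - Σ pᵢ²."""
    p = np.asarray(counts, dtype=float) / sum(counts)
    return 1.0 - np.sum(p ** 2)

print(gini([50, 50]))    # 0.5
print(gini([90, 10]))    # ≈ 0.18
print(gini([100, 0]))    # 0.0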

Entropy

Entropy = -Σ pᵢ log₂(pᵢ)
  • Uncertainty measure derived from information theory
  • Range: [0, log₂C]
  • Perfectly pure: Entropy = 0
  • Perfectly impure: Entropy = log₂C

Example (Binary Classification):

  • A:50, B:50 → Entropy = 1.0
  • A:90, B:10 → Entropy ≈ 0.469
  • A:100, B:0 → Entropy = 0
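
A matching sketch for Entropy (again, the helper name is ours):

import numpy as np

def entropy(counts):
    """Shannon entropy from raw class counts: -Σ pᵢ log₂(pᵢ)."""
    p = np.asarray(counts, dtype=float) / sum(counts)
    p = p[p > 0]                      # treat 0 · log₂(0) as 0
    return -np.sum(p * np.log2(p))

print(entropy([50, 50]))    # 1.0
print(entropy([90, 10]))    # ≈ 0.469
print(entropy([100, 0]))    # 0.0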
⚠️

Gini vs Entropy Selection: The performance difference is minimal in practice, but Gini is faster to compute (no log needed), while Entropy is more sensitive to changes in class proportions.

Information Gain

Measures the reduction in impurity before and after splitting.

IG = Impurity(parent) - Σ (Nᵢ / N) × Impurity(childᵢ)

where Nᵢ / N is the fraction of the parent's N samples that falls into child i.

Decision Tree selects the split that maximizes Information Gain.
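
As a worked example (reusing the gini helper sketched above, with counts chosen purely for illustration), splitting a 50/50 parent into children of 40/10 and 10/40 gives IG = 0.5 - 0.32 = 0.18:

def information_gain(parent_counts, children_counts, impurity=gini):
    """Parent impurity minus the sample-weighted impurity of the children."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * impurity(c) for c in children_counts)
    return impurity(parent_counts) - weighted

# Parent A:50, B:50 split into (A:40, B:10) and (A:10, B:40)
print(information_gain([50, 50], [[40, 10], [10, 40]]))    # ≈ 0.18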


3. Splitting Process

  1. Calculate impurity reduction for all features and split points
  2. Select the split with the largest reduction
  3. Repeat recursively
  4. Create Leaf nodes when termination conditions are met
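
The greedy search in steps 1-2 can be sketched for a single numeric feature as follows (the toy data and helper names are ours, not from the tutorial):

import numpy as np

# Toy data: one numeric feature, binary labels
x = np.array([0.5, 1.0, 2.0, 3.0, 3.5, 4.2, 4.8, 5.1])
y = np.array([0,   0,   0,   0,   0,   1,   1,   1])

def weighted_child_gini(y_left, y_right):
    """Sample-weighted Gini impurity of a two-way split."""
    def g(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return 1.0 - np.sum(p ** 2)
    n = len(y_left) + len(y_right)
    return len(y_left) / n * g(y_left) + len(y_right) / n * g(y_right)

# Candidate split points: midpoints between consecutive sorted feature values
xs = np.sort(np.unique(x))
thresholds = (xs[:-1] + xs[1:]) / 2

# Keep the threshold with the lowest weighted child impurity
# (equivalent to maximizing Information Gain, since the parent impurity is fixed)
best = min(thresholds, key=lambda t: weighted_child_gini(y[x <= t], y[x > t]))
print(best)    # 3.85 - the split that separates the two classes perfectly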

4. Pruning

Limits tree growth to prevent overfitting.

Pre-Pruning

Parameter         | Description
------------------|-------------------------------------------
max_depth         | Maximum tree depth
min_samples_split | Minimum samples required to split a node
min_samples_leaf  | Minimum samples required in a Leaf node
max_leaf_nodes    | Maximum number of Leaf nodes

Post-Pruning

# Cost Complexity Pruning (ccp_alpha)
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

tree = DecisionTreeClassifier(random_state=42)
path = tree.cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas

# Find optimal alpha with CV
cv_scores = []
for alpha in alphas:
    clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    cv_scores.append(cross_val_score(clf, X_train, y_train, cv=5).mean())
best_alpha = alphas[np.argmax(cv_scores)]

Post-pruning tip: Calculate the ccp_alpha path from a fully grown tree, then find the optimal alpha value through Cross-Validation.
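
As an optional follow-up sketch (assuming the imports and the alphas, cv_scores, and best_alpha variables from the snippet above), you can inspect the accuracy-vs-alpha curve and refit the pruned tree:

import matplotlib.pyplot as plt

# Accuracy typically rises, plateaus, then drops as pruning becomes more aggressive
plt.plot(alphas, cv_scores, marker='o')
plt.xlabel('ccp_alpha')
plt.ylabel('Mean CV accuracy')
plt.show()

# Refit the final model with the alpha chosen by Cross-Validation
final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=42)
final_tree.fit(X_train, y_train)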


5. Feature Importance

# Gini-based importance (from a fitted tree; values sum to 1 across features)
import pandas as pd
import matplotlib.pyplot as plt

importances = model.feature_importances_

# Visualization
pd.Series(importances, index=feature_names).sort_values().plot(kind='barh')
plt.show()

Feature Importance indicates the degree to which a feature contributes to classification and can be used for Feature Selection.
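
One way to act on this, sketched with scikit-learn's SelectFromModel (the 0.05 threshold is an arbitrary assumption for illustration):

from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Keep only the features whose importance exceeds the threshold
selector = SelectFromModel(
    DecisionTreeClassifier(max_depth=5, random_state=42),
    threshold=0.05
)
X_train_selected = selector.fit_transform(X_train, y_train)
print(selector.get_support())    # boolean mask of the retained features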


Code Summary

from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.model_selection import cross_val_score, GridSearchCV
 
# Basic model
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
 
# Overfitting prevention (Pre-pruning)
tree_pruned = DecisionTreeClassifier(
    max_depth=5,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42
)
tree_pruned.fit(X_train, y_train)
 
# Tree structure visualization
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 10))
plot_tree(tree_pruned, feature_names=feature_names,
          class_names=class_names, filled=True, rounded=True)
plt.show()
 
# Print tree rules as text
print(export_text(tree_pruned, feature_names=list(feature_names)))

Hyperparameter Tuning with GridSearchCV

from sklearn.model_selection import GridSearchCV
 
param_grid = {
    'criterion': ['gini', 'entropy'],
    'max_depth': [None, 3, 5, 7, 10],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'max_features': ['sqrt', 'log2', None]
}
 
grid = GridSearchCV(
    DecisionTreeClassifier(random_state=42),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1
)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best CV Score: {grid.best_score_:.4f}")

Pros and Cons

Pros                                            | Cons
------------------------------------------------|------------------------------------------------------
Easy to interpret (White-box model)             | Prone to overfitting
No scaling required                             | Unstable (sensitive to data changes, High Variance)
Learns non-linear relationships                 | Limited predictive performance as a standalone model
Handles both categorical and numerical features | Creates only axis-perpendicular boundaries
⚠️

Overfitting warning: Deep trees easily memorize the training data. If Train Accuracy approaches 1.0 while validation accuracy lags behind, that is an overfitting signal. Always apply Pruning!
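
A quick way to check for this signal, sketched with the tree and tree_pruned models from the Code Summary above (assuming a held-out X_test / y_test split):

# A large train-test gap indicates overfitting; pruning should narrow it
print(f"Unpruned - train: {tree.score(X_train, y_train):.3f}, test: {tree.score(X_test, y_test):.3f}")
print(f"Pruned   - train: {tree_pruned.score(X_train, y_train):.3f}, test: {tree_pruned.score(X_test, y_test):.3f}")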


Regression vs Classification

Problem        | Split Criterion | Leaf Prediction
---------------|-----------------|----------------
Classification | Gini / Entropy  | Majority class
Regression     | MSE / MAE       | Mean value

from sklearn.tree import DecisionTreeRegressor
 
reg_tree = DecisionTreeRegressor(max_depth=5)
reg_tree.fit(X_train, y_train)

Key Summary Checklist

Item                   | Description
-----------------------|----------------------------------------------------------------------------
Impurity Metric        | Gini (fast) vs Entropy (sensitive); minimal performance difference
Information Gain       | Select the feature/value that maximizes impurity reduction after the split
Overfitting Prevention | Adjust max_depth, min_samples_split, min_samples_leaf
Post-pruning           | Cost-complexity pruning with ccp_alpha
Feature Importance     | Degree of contribution to classification; use for Feature Selection

Interview Questions Preview

  1. What's the difference between Gini and Entropy?
  2. How do you prevent overfitting in Decision Trees?
  3. How is Feature Importance calculated?
  4. Why does Decision Tree only create axis-perpendicular boundaries?

Check out more interview questions at Premium Interviews.


Practice Notebook

The notebook additionally covers:

  • Manual implementation and visualization of Gini Impurity and Entropy
  • Step-by-step manual calculation of Information Gain
  • Exercises on both Iris and Wine datasets
  • Decision Boundary visualization (comparison by max_depth)
  • Detailed analysis of Cost Complexity Pruning
  • Model diagnosis through Learning Curve
  • Feature Selection effect analysis

Previous: 03. Logistic Regression | Next: 05. Ensemble Methods