04. Mastering Decision Trees
Gini, Entropy, Pruning, Feature Importance
Learning Objectives
After completing this tutorial, you will be able to:
- Understand the mathematical principles and differences between Gini Impurity and Entropy
- Calculate the splitting criteria selection process through Information Gain step by step
- Visualize and interpret tree structure using Graphviz
- Prevent overfitting through Pruning techniques
- Analyze and interpret Feature Importance
- Search for optimal tree structure through hyperparameter tuning
Key Concepts
1. What is a Decision Tree?
A Decision Tree is an algorithm that classifies data through a sequence of if-else rules. It's a tree-structured model that divides data by questions (conditions) to make predictions.
              [Root Node]
             /            \
     [Internal]          [Internal]
      /       \           /       \
  [Leaf]    [Leaf]    [Leaf]    [Leaf]

A Decision Tree is a White-box model, making it easy to visualize and interpret the decision-making process and to explain the rationale behind its predictions.
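To make the "sequence of if-else rules" concrete, this is all a small learned tree amounts to when written out by hand. The feature names and thresholds below are illustrative only (loosely inspired by the Iris dataset used in the practice notebook), not output from a real model:

def predict_species(petal_length, petal_width):
    # Each if/else corresponds to one internal node of the tree
    if petal_length <= 2.45:
        return "setosa"        # leaf node
    elif petal_width <= 1.75:
        return "versicolor"    # leaf node
    else:
        return "virginica"     # leaf node

print(predict_species(1.4, 0.2))  # "setosa"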
2. Impurity Metrics
Decision Trees split data in the direction that minimizes impurity.
Gini Impurity
Gini = 1 - Σᵢ pᵢ²

- pᵢ: proportion of class i in the node
- Range: [0, 1 - 1/C], where C is the number of classes
- 0: Perfectly pure (only one class)
- 0.5: Maximum impurity (equal ratio of 2 classes)
Example (Binary Classification):
- Node with A:50, B:50 → Gini = 1 - (0.5² + 0.5²) = 0.5
- Node with A:90, B:10 → Gini = 1 - (0.9² + 0.1²) = 0.18
- Node with A:100, B:0 → Gini = 1 - (1.0² + 0.0²) = 0.0
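A minimal sketch that reproduces these numbers from raw class counts:

def gini(counts):
    # counts: number of samples of each class in the node
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

for counts in ([50, 50], [90, 10], [100, 0]):
    print(counts, "->", round(gini(counts), 3))   # matches 0.5, 0.18, 0.0 above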
Entropy
Entropy = -Σᵢ pᵢ log₂(pᵢ)

- Uncertainty measure derived from information theory
- Range: [0, log₂C]
- Perfectly pure: Entropy = 0
- Perfectly impure: Entropy = log₂C
Example (Binary Classification):
- A:50, B:50 → Entropy = 1.0
- A:90, B:10 → Entropy ≈ 0.469
- A:100, B:0 → Entropy = 0
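The same check for Entropy, again just to reproduce the numbers above:

from math import log2

def entropy(counts):
    total = sum(counts)
    # p·log₂(1/p) equals -p·log₂(p); classes with zero samples contribute 0
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

for counts in ([50, 50], [90, 10], [100, 0]):
    print(counts, "->", round(entropy(counts), 3))   # matches the values listed above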
Gini vs Entropy Selection: The performance difference is minimal in practice, but Gini is slightly faster because it needs no log computation, while Entropy reacts somewhat more sensitively to changes in class proportions.
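One way to confirm that the choice rarely matters much is to fit the same tree with both criteria and compare cross-validated accuracy; a minimal sketch, assuming X_train and y_train already exist:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

for criterion in ["gini", "entropy"]:
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=5, random_state=42)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f"{criterion}: mean CV accuracy = {scores.mean():.4f}")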
Information Gain
Measures the reduction in impurity before and after splitting.
IG = Impurity(parent) - Σₖ (nₖ / n) · Impurity(childₖ)

where nₖ is the number of samples in child k and n is the number of samples in the parent. The Decision Tree selects the split that maximizes Information Gain.
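A small worked example with a hypothetical split of a 100-sample parent node (A:50, B:50) into children (A:40, B:10) and (A:10, B:40), using the Gini formula from above:

parent = 1 - (0.5**2 + 0.5**2)      # Gini of (A:50, B:50)  = 0.5
left   = 1 - (0.8**2 + 0.2**2)      # Gini of (A:40, B:10)  ≈ 0.32
right  = 1 - (0.2**2 + 0.8**2)      # Gini of (A:10, B:40)  ≈ 0.32
# Each child holds 50 of the parent's 100 samples, so both weights are 0.5
ig = parent - 0.5 * left - 0.5 * right
print(round(ig, 2))                 # 0.18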
3. Splitting Process
- Calculate the impurity reduction (Information Gain) for every feature and candidate split point (see the sketch after this list)
- Select the split with the largest reduction
- Repeat recursively
- Create Leaf nodes when termination conditions are met
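A brute-force sketch of one level of this search for a single numeric feature (illustrative only; scikit-learn's implementation is more elaborate and scans all features at once):

import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def best_split(x, y):
    # Candidate thresholds: midpoints between consecutive unique feature values
    values = np.unique(x)
    best_gain, best_thr = 0.0, None
    for thr in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= thr], y[x > thr]
        gain = gini(y) - len(left) / len(y) * gini(left) - len(right) / len(y) * gini(right)
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain

x = np.array([1.0, 1.5, 2.0, 3.0, 3.5, 4.0])   # one toy feature
y = np.array([0, 0, 0, 1, 1, 1])               # class labels
print(best_split(x, y))   # threshold 2.5 separates the classes perfectly (gain 0.5)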
4. Pruning
Limits tree growth to prevent overfitting.
Pre-Pruning
| Parameter | Description |
|---|---|
| max_depth | Maximum tree depth |
| min_samples_split | Minimum samples required to split a node |
| min_samples_leaf | Minimum samples in a Leaf node |
| max_leaf_nodes | Maximum number of Leaf nodes |
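A quick way to see what these limits actually do is to compare the size of an unrestricted tree with a pre-pruned one; a minimal sketch, assuming X_train and y_train already exist:

from sklearn.tree import DecisionTreeClassifier

full    = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
limited = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                 random_state=42).fit(X_train, y_train)

# get_depth() / get_n_leaves() report the complexity of the grown tree
print(f"Unrestricted: depth={full.get_depth()}, leaves={full.get_n_leaves()}")
print(f"Pre-pruned:   depth={limited.get_depth()}, leaves={limited.get_n_leaves()}")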
Post-Pruning
# Cost Complexity Pruning (ccp_alpha)
# tree: a DecisionTreeClassifier instance (see the Code Summary below)
path = tree.cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas

# Find the optimal alpha with CV
cv_scores = []
for alpha in alphas:
    clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    cv_scores.append(scores.mean())
best_alpha = alphas[cv_scores.index(max(cv_scores))]

Post-pruning tip: Calculate the ccp_alpha path from a fully grown tree, then find the optimal alpha value through Cross-Validation.
5. Feature Importance
# Gini-based importance
# model: a fitted DecisionTreeClassifier
importances = model.feature_importances_

# Visualization
import pandas as pd
pd.Series(importances, index=feature_names).sort_values().plot(kind='barh')

Feature Importance measures how much each feature contributes to reducing impurity across all of the tree's splits (the values are normalized to sum to 1) and can be used for Feature Selection.
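Since importances can drive Feature Selection, here is a minimal sketch using scikit-learn's SelectFromModel (assuming X_train, y_train, and feature_names already exist):

from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Keep only the features whose importance exceeds the median importance
selector = SelectFromModel(DecisionTreeClassifier(random_state=42), threshold="median")
selector.fit(X_train, y_train)
X_train_selected = selector.transform(X_train)
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print("Selected features:", selected)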
Code Summary
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.model_selection import cross_val_score, GridSearchCV
# Basic model
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
# Overfitting prevention (Pre-pruning)
tree_pruned = DecisionTreeClassifier(
max_depth=5,
min_samples_split=10,
min_samples_leaf=5,
random_state=42
)
tree_pruned.fit(X_train, y_train)
# Tree structure visualization
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 10))
plot_tree(tree_pruned, feature_names=feature_names,
class_names=class_names, filled=True, rounded=True)
plt.show()
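# Alternative: render the tree with Graphviz (mentioned in the learning objectives).
# Assumes the 'graphviz' Python package and the Graphviz binaries are installed.
from sklearn.tree import export_graphviz
import graphviz
dot_data = export_graphviz(tree_pruned, out_file=None,
                           feature_names=feature_names, class_names=class_names,
                           filled=True, rounded=True)
graphviz.Source(dot_data).render("decision_tree", format="png")  # writes decision_tree.png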
# Print tree rules as text
print(export_text(tree_pruned, feature_names=list(feature_names)))

Hyperparameter Tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV
param_grid = {
'criterion': ['gini', 'entropy'],
'max_depth': [None, 3, 5, 7, 10],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['sqrt', 'log2', None]
}
grid = GridSearchCV(
DecisionTreeClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best CV Score: {grid.best_score_:.4f}")Pros and Cons
| Pros | Cons |
|---|---|
| Easy to interpret (White-box model) | Prone to overfitting |
| No scaling required | Unstable (sensitive to data changes, High Variance) |
| Learns non-linear relationships | A single tree usually underperforms ensemble methods |
| Handles both categorical and numerical features | Creates only axis-aligned (axis-perpendicular) decision boundaries |
Overfitting warning: Deep trees easily memorize the training data. If training accuracy approaches 1.0 while validation/test accuracy lags behind, that is an overfitting signal. Always apply Pruning!
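A quick check is to compare train and test accuracy for an unrestricted tree and a depth-limited one; a minimal sketch, assuming X_train, X_test, y_train, y_test are already split:

from sklearn.tree import DecisionTreeClassifier

for name, model in [("unrestricted", DecisionTreeClassifier(random_state=42)),
                    ("max_depth=3", DecisionTreeClassifier(max_depth=3, random_state=42))]:
    model.fit(X_train, y_train)
    # A large gap between train and test accuracy is the overfitting signal
    print(f"{name}: train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")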
Regression vs Classification
| Problem | Split Criterion | Leaf Prediction |
|---|---|---|
| Classification | Gini / Entropy | Majority class |
| Regression | MSE / MAE | Mean value |
from sklearn.tree import DecisionTreeRegressor
reg_tree = DecisionTreeRegressor(max_depth=5)
reg_tree.fit(X_train, y_train)

Key Summary Checklist
| Item | Description |
|---|---|
| Impurity Metric | Gini (fast) vs Entropy (sensitive) - minimal performance difference |
| Information Gain | Select Feature/value that maximizes impurity reduction after split |
| Overfitting Prevention | Adjust max_depth, min_samples_split, min_samples_leaf |
| Post-pruning | Cost-complexity pruning with ccp_alpha |
| Feature Importance | Degree of contribution to classification, use for Feature Selection |
Interview Questions Preview
- What's the difference between Gini and Entropy?
- How do you prevent overfitting in Decision Trees?
- How is Feature Importance calculated?
- Why do Decision Trees only create axis-aligned (axis-perpendicular) decision boundaries?
Check out more interview questions at Premium Interviews.
Practice Notebook
The notebook additionally covers:
- Manual implementation and visualization of Gini Impurity and Entropy
- Step-by-step manual calculation of Information Gain
- Exercises on both Iris and Wine datasets
- Decision Boundary visualization (comparison by max_depth)
- Detailed analysis of Cost Complexity Pruning
- Model diagnosis through Learning Curve
- Feature Selection effect analysis
Previous: 03. Logistic Regression | Next: 05. Ensemble Methods