04. Mastering Decision Trees
Gini, Entropy, Pruning, Feature Importance
Learning Objectives
After completing this tutorial, you will be able to:
- Understand the mathematical principles and differences between Gini Impurity and Entropy
- Calculate the splitting criteria selection process through Information Gain step by step
- Visualize and interpret tree structure using Graphviz
- Prevent overfitting through Pruning techniques
- Analyze and interpret Feature Importance
- Search for optimal tree structure through hyperparameter tuning
Key Concepts
1. What is a Decision Tree?
A Decision Tree is an algorithm that classifies data through a sequence of if-else rules. It's a tree-structured model that divides data by questions (conditions) to make predictions.
              [Root Node]
             /            \
     [Internal]          [Internal]
      /       \           /       \
  [Leaf]    [Leaf]    [Leaf]    [Leaf]

A Decision Tree is a White-box model, making it easy to visualize and interpret the decision-making process and to explain the rationale behind its predictions.
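To make the "sequence of if-else rules" concrete, this is all a small learned tree amounts to when written out by hand. The feature names and thresholds below are illustrative only (loosely inspired by the Iris dataset used in the practice notebook), not output from a real model:

def predict_species(petal_length, petal_width):
    # Each if/else corresponds to one internal node of the tree
    if petal_length <= 2.45:
        return "setosa"        # leaf node
    elif petal_width <= 1.75:
        return "versicolor"    # leaf node
    else:
        return "virginica"     # leaf node

print(predict_species(1.4, 0.2))  # "setosa"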
2. Impurity Metrics
Decision Trees split data in the direction that minimizes impurity.
Gini Impurity
Gini = 1 - Σᵢ pᵢ²

- pᵢ: proportion of class i in the node
- Range: [0, 1 - 1/C], where C is the number of classes
- 0: Perfectly pure (only one class)
- 0.5: Maximum impurity (equal ratio of 2 classes)
Example (Binary Classification):
- Node with A:50, B:50 → Gini = 1 - (0.5² + 0.5²) = 0.5
- Node with A:90, B:10 → Gini = 1 - (0.9² + 0.1²) = 0.18
- Node with A:100, B:0 → Gini = 1 - (1.0² + 0.0²) = 0.0
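A minimal sketch that reproduces these numbers from raw class counts:

def gini(counts):
    # counts: number of samples of each class in the node
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

for counts in ([50, 50], [90, 10], [100, 0]):
    print(counts, "->", round(gini(counts), 3))   # matches 0.5, 0.18, 0.0 above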
Entropy
Entropy = -Σᵢ pᵢ log₂(pᵢ)

- Uncertainty measure derived from information theory
- Range: [0, log₂C]
- Perfectly pure: Entropy = 0
- Perfectly impure: Entropy = log₂C
Example (Binary Classification):
- A:50, B:50 → Entropy = 1.0
- A:90, B:10 → Entropy ≈ 0.469
- A:100, B:0 → Entropy = 0
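The same check for Entropy, again just to reproduce the numbers above:

from math import log2

def entropy(counts):
    total = sum(counts)
    # p·log₂(1/p) equals -p·log₂(p); classes with zero samples contribute 0
    return sum((c / total) * log2(total / c) for c in counts if c > 0)

for counts in ([50, 50], [90, 10], [100, 0]):
    print(counts, "->", round(entropy(counts), 3))   # matches the values listed above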
Gini vs Entropy Selection: The performance difference is minimal in practice, but Gini is slightly faster because it needs no log computation, while Entropy reacts somewhat more sensitively to changes in class proportions.
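One way to confirm that the choice rarely matters much is to fit the same tree with both criteria and compare cross-validated accuracy; a minimal sketch, assuming X_train and y_train already exist:

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

for criterion in ["gini", "entropy"]:
    clf = DecisionTreeClassifier(criterion=criterion, max_depth=5, random_state=42)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    print(f"{criterion}: mean CV accuracy = {scores.mean():.4f}")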
Information Gain
Measures the reduction in impurity before and after splitting.
IG = Impurity(parent) - Σₖ (nₖ / n) · Impurity(childₖ)

where nₖ is the number of samples in child k and n is the number of samples in the parent. The Decision Tree selects the split that maximizes Information Gain.
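A small worked example with a hypothetical split of a 100-sample parent node (A:50, B:50) into children (A:40, B:10) and (A:10, B:40), using the Gini formula from above:

parent = 1 - (0.5**2 + 0.5**2)      # Gini of (A:50, B:50)  = 0.5
left   = 1 - (0.8**2 + 0.2**2)      # Gini of (A:40, B:10)  ≈ 0.32
right  = 1 - (0.2**2 + 0.8**2)      # Gini of (A:10, B:40)  ≈ 0.32
# Each child holds 50 of the parent's 100 samples, so both weights are 0.5
ig = parent - 0.5 * left - 0.5 * right
print(round(ig, 2))                 # 0.18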
3. Splitting Process
- Calculate the impurity reduction (Information Gain) for every feature and candidate split point (see the sketch after this list)
- Select the split with the largest reduction
- Repeat recursively
- Create Leaf nodes when termination conditions are met
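A brute-force sketch of one level of this search for a single numeric feature (illustrative only; scikit-learn's implementation is more elaborate and scans all features at once):

import numpy as np

def gini(y):
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

def best_split(x, y):
    # Candidate thresholds: midpoints between consecutive unique feature values
    values = np.unique(x)
    best_gain, best_thr = 0.0, None
    for thr in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= thr], y[x > thr]
        gain = gini(y) - len(left) / len(y) * gini(left) - len(right) / len(y) * gini(right)
        if gain > best_gain:
            best_gain, best_thr = gain, thr
    return best_thr, best_gain

x = np.array([1.0, 1.5, 2.0, 3.0, 3.5, 4.0])   # one toy feature
y = np.array([0, 0, 0, 1, 1, 1])               # class labels
print(best_split(x, y))   # threshold 2.5 separates the classes perfectly (gain 0.5)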
4. Pruning
Limits tree growth to prevent overfitting.
Pre-Pruning
| Parameter | Description |
|---|---|
| max_depth | Maximum tree depth |
| min_samples_split | Minimum samples required to split a node |
| min_samples_leaf | Minimum samples in a Leaf node |
| max_leaf_nodes | Maximum number of Leaf nodes |
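A quick way to see what these limits actually do is to compare the size of an unrestricted tree with a pre-pruned one; a minimal sketch, assuming X_train and y_train already exist:

from sklearn.tree import DecisionTreeClassifier

full    = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
limited = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                 random_state=42).fit(X_train, y_train)

# get_depth() / get_n_leaves() report the complexity of the grown tree
print(f"Unrestricted: depth={full.get_depth()}, leaves={full.get_n_leaves()}")
print(f"Pre-pruned:   depth={limited.get_depth()}, leaves={limited.get_n_leaves()}")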
Post-Pruning
# Cost Complexity Pruning (ccp_alpha)
# tree: a DecisionTreeClassifier instance (see the Code Summary below)
path = tree.cost_complexity_pruning_path(X_train, y_train)
alphas = path.ccp_alphas

# Find the optimal alpha with CV
cv_scores = []
for alpha in alphas:
    clf = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    scores = cross_val_score(clf, X_train, y_train, cv=5)
    cv_scores.append(scores.mean())
best_alpha = alphas[cv_scores.index(max(cv_scores))]

Post-pruning tip: Calculate the ccp_alpha path from a fully grown tree, then find the optimal alpha value through Cross-Validation.
5. Feature Importance
# Gini-based importance
# model: a fitted DecisionTreeClassifier
importances = model.feature_importances_

# Visualization
import pandas as pd
pd.Series(importances, index=feature_names).sort_values().plot(kind='barh')

Feature Importance measures how much each feature contributes to reducing impurity across all of the tree's splits (the values are normalized to sum to 1) and can be used for Feature Selection.
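Since importances can drive Feature Selection, here is a minimal sketch using scikit-learn's SelectFromModel (assuming X_train, y_train, and feature_names already exist):

from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

# Keep only the features whose importance exceeds the median importance
selector = SelectFromModel(DecisionTreeClassifier(random_state=42), threshold="median")
selector.fit(X_train, y_train)
X_train_selected = selector.transform(X_train)
selected = [name for name, keep in zip(feature_names, selector.get_support()) if keep]
print("Selected features:", selected)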
Code Summary
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text
from sklearn.model_selection import cross_val_score, GridSearchCV
# Basic model
tree = DecisionTreeClassifier(random_state=42)
tree.fit(X_train, y_train)
# Overfitting prevention (Pre-pruning)
tree_pruned = DecisionTreeClassifier(
max_depth=5,
min_samples_split=10,
min_samples_leaf=5,
random_state=42
)
tree_pruned.fit(X_train, y_train)
# Tree structure visualization
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 10))
plot_tree(tree_pruned, feature_names=feature_names,
class_names=class_names, filled=True, rounded=True)
plt.show()
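# Alternative: render the tree with Graphviz (mentioned in the learning objectives).
# Assumes the 'graphviz' Python package and the Graphviz binaries are installed.
from sklearn.tree import export_graphviz
import graphviz
dot_data = export_graphviz(tree_pruned, out_file=None,
                           feature_names=feature_names, class_names=class_names,
                           filled=True, rounded=True)
graphviz.Source(dot_data).render("decision_tree", format="png")  # writes decision_tree.png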
# Print tree rules as text
print(export_text(tree_pruned, feature_names=list(feature_names)))

Hyperparameter Tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV
param_grid = {
'criterion': ['gini', 'entropy'],
'max_depth': [None, 3, 5, 7, 10],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4],
'max_features': ['sqrt', 'log2', None]
}
grid = GridSearchCV(
DecisionTreeClassifier(random_state=42),
param_grid,
cv=5,
scoring='accuracy',
n_jobs=-1
)
grid.fit(X_train, y_train)
print(f"Best params: {grid.best_params_}")
print(f"Best CV Score: {grid.best_score_:.4f}")Pros and Cons
| Pros | Cons |
|---|---|
| Easy to interpret (White-box model) | Prone to overfitting |
| No scaling required | Unstable (sensitive to data changes, High Variance) |
| Learns non-linear relationships | A single tree usually underperforms ensemble methods |
| Handles both categorical and numerical features | Creates only axis-aligned (axis-perpendicular) decision boundaries |
Overfitting warning: Deep trees easily memorize the training data. If training accuracy approaches 1.0 while validation/test accuracy lags behind, that is an overfitting signal. Always apply Pruning!
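A quick check is to compare train and test accuracy for an unrestricted tree and a depth-limited one; a minimal sketch, assuming X_train, X_test, y_train, y_test are already split:

from sklearn.tree import DecisionTreeClassifier

for name, model in [("unrestricted", DecisionTreeClassifier(random_state=42)),
                    ("max_depth=3", DecisionTreeClassifier(max_depth=3, random_state=42))]:
    model.fit(X_train, y_train)
    # A large gap between train and test accuracy is the overfitting signal
    print(f"{name}: train={model.score(X_train, y_train):.3f}, "
          f"test={model.score(X_test, y_test):.3f}")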
Regression vs Classification
| Problem | Split Criterion | Leaf Prediction |
|---|---|---|
| Classification | Gini / Entropy | Majority class |
| Regression | MSE / MAE | Mean value |
from sklearn.tree import DecisionTreeRegressor
reg_tree = DecisionTreeRegressor(max_depth=5)
reg_tree.fit(X_train, y_train)

Key Summary Checklist
| Item | Description |
|---|---|
| Impurity Metric | Gini (fast) vs Entropy (sensitive) - minimal performance difference |
| Information Gain | Select Feature/value that maximizes impurity reduction after split |
| Overfitting Prevention | Adjust max_depth, min_samples_split, min_samples_leaf |
| Post-pruning | Cost-complexity pruning with ccp_alpha |
| Feature Importance | Degree of contribution to classification, use for Feature Selection |
Interview Questions Preview
- What's the difference between Gini and Entropy?
- How do you prevent overfitting in Decision Trees?
- How is Feature Importance calculated?
- Why do Decision Trees only create axis-aligned (axis-perpendicular) decision boundaries?
Check out more interview questions at Premium Interviews.
Practice Notebook
The notebook additionally covers:
- Manual implementation and visualization of Gini Impurity and Entropy
- Step-by-step manual calculation of Information Gain
- Exercises on both Iris and Wine datasets
- Decision Boundary visualization (comparison by max_depth)
- Detailed analysis of Cost Complexity Pruning
- Model diagnosis through Learning Curve
- Feature Selection effect analysis
Previous: 03. Logistic Regression | Next: 05. Ensemble Methods