09. Anomaly Detection
Isolation Forest, LOF, One-Class SVM
Learning Objectives
After completing this tutorial, you will be able to:
- Understand anomaly detection concepts and the main anomaly types (Point, Contextual, Collective)
- Detect simple anomalies with statistical methods (Z-Score, IQR)
- Implement and compare machine learning methods (Isolation Forest, LOF, One-Class SVM)
- Apply anomaly detection to practical problems (e.g., credit card fraud detection)
- Select the most suitable algorithm for a given situation
Key Concepts
1. What is an Anomaly?
An anomaly is a data point or pattern that deviates significantly from the behavior of normal data.
| Type | Description | Example |
|---|---|---|
| Point Anomaly | A single data point is anomalous on its own | A suddenly large transaction amount |
| Contextual Anomaly | Anomalous only in a specific context | A heating bill spike in summer |
| Collective Anomaly | A group of points is anomalous as a whole | A run of consecutive abnormal heartbeats |
Major Use Cases
- Finance: Credit card fraud detection
- Manufacturing: Defect detection
- Security: Network intrusion detection
- Medical: Diagnosis of abnormal conditions
- IoT: Sensor anomaly detection
2. Statistical Methods
Z-Score
Calculates how many standard deviations each data point is from the mean.
from scipy import stats
import numpy as np
# X: array of shape (n_samples, n_features)
# Calculate the Z-score of each feature
z_scores = np.abs(stats.zscore(X))
# Flag as anomaly if any feature exceeds the threshold (typically |Z| > 3)
threshold = 3
anomalies = (z_scores > threshold).any(axis=1)
Z-Score Threshold: 3 is the common default, but values between 2 and 4 may be appropriate depending on the data. The method assumes an approximately normal distribution, so use caution with skewed or heavy-tailed data.
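A quick self-contained check of the snippet above, using synthetic data (the generated points and injected outliers are illustrative assumptions, not from the original):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 300 normally distributed points; shift the first 5 rows far from the mean
X = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
X[:5] += 8  # injected outliers (illustrative)

z_scores = np.abs(stats.zscore(X))
anomalies = (z_scores > 3).any(axis=1)
print(f"Flagged {anomalies.sum()} of {len(X)} points")  # expect ~5, plus the odd tail point
```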
IQR (Interquartile Range)
Detects anomalies using the interquartile range; because it relies on quartiles rather than the mean, it is more robust to outliers.
# X: pandas DataFrame
Q1 = X.quantile(0.25)
Q3 = X.quantile(0.75)
IQR = Q3 - Q1
# Points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as anomalies
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
anomalies = ((X < lower) | (X > upper)).any(axis=1)
3. Isolation Forest
Core Idea: Anomalies can be isolated with only a few splits
- Select a random feature and a random split point
- Split recursively until each point is isolated
- Measure the path length needed to isolate a point
- Shorter path = more anomalous
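For reference, the original paper (Liu et al., 2008) turns this path-length intuition into an anomaly score:

$$
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}
$$

where $h(x)$ is the number of splits needed to isolate $x$ in one tree, $E[h(x)]$ is its average over all trees, and $c(n)$ normalizes by the average path length of an unsuccessful search in a binary search tree of $n$ points. Scores close to 1 indicate anomalies; scores well below 0.5 indicate normal points. sklearn's score_samples and decision_function are sign-flipped and shifted versions of this quantity, which is why lower values mean more anomalous.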
from sklearn.ensemble import IsolationForest
iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.1,  # Expected anomaly ratio
    random_state=42
)
labels = iso_forest.fit_predict(X)        # 1: normal, -1: anomaly
scores = iso_forest.decision_function(X)  # Score (lower = more anomalous)
contamination Parameter: Sets the expected anomaly ratio and should be close to the actual ratio for good performance. Too high, and normal points get flagged as anomalies; too low, and real anomalies get missed.
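A minimal sketch of contamination's effect, assuming an illustrative synthetic dataset (the make_blobs setup and injected outliers are not from the original):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

# One dense cluster plus 10 uniformly scattered outliers
X_normal, _ = make_blobs(n_samples=290, centers=1, cluster_std=1.0, random_state=0)
X_outliers = np.random.RandomState(0).uniform(low=-8, high=8, size=(10, 2))
X_demo = np.vstack([X_normal, X_outliers])

# Same data, different contamination: the score threshold shifts,
# so the number of flagged points tracks the assumed anomaly ratio
for c in (0.01, 0.05, 0.2):
    labels = IsolationForest(contamination=c, random_state=42).fit_predict(X_demo)
    print(f"contamination={c}: {(labels == -1).sum()} points flagged")
```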
Advantages:
- Effective in high dimensions
- Fast training/prediction
- Memory efficient
4. LOF (Local Outlier Factor)
Core Idea: Compare local densities
- A point is anomalous if its density is lower than that of its neighbors
- Effective for detecting local anomalies
- LOF ≈ 1: Normal (density similar to neighbors)
- LOF > 1: Anomalous (density lower than neighbors)
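For reference, the LOF score (Breunig et al., 2000) makes "lower density than neighbors" precise as a ratio of local reachability densities ($\mathrm{lrd}$):

$$
\mathrm{LOF}_k(A) = \frac{1}{|N_k(A)|} \sum_{B \in N_k(A)} \frac{\mathrm{lrd}_k(B)}{\mathrm{lrd}_k(A)}
$$

where $N_k(A)$ is the set of $k$ nearest neighbors of $A$ and $\mathrm{lrd}_k$ is the inverse of the average reachability distance from a point to its neighbors. The ratio exceeds 1 precisely when $A$'s density is lower than the average density of its neighborhood.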
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.1
)
labels = lof.fit_predict(X)             # 1: normal, -1: anomaly
scores = -lof.negative_outlier_factor_  # Higher = more anomalous
n_neighbors Parameter: The number of neighbors used for the local density estimate. Small values are sensitive to local anomalies; large values capture more global patterns. The sklearn default is 20; adjust it based on dataset size and structure.
LOF Limitation: By default only fit_predict is available, so scoring new data after training is not supported. Set novelty=True if you need to predict on unseen data.
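A minimal sketch of the novelty=True workflow (X_train and X_new are placeholders): in novelty mode you fit on training data and score unseen points, while fit_predict becomes unavailable:

```python
from sklearn.neighbors import LocalOutlierFactor

# Novelty mode: fit on (mostly normal) training data...
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof_novelty.fit(X_train)

# ...then score previously unseen points
new_labels = lof_novelty.predict(X_new)            # 1: normal, -1: anomaly
new_scores = lof_novelty.decision_function(X_new)  # Lower = more anomalous
```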
5. One-Class SVM
Learns a boundary around normal data only; with the kernel trick it can learn non-linear boundaries.
from sklearn.svm import OneClassSVM
ocsvm = OneClassSVM(
    kernel='rbf',
    nu=0.1,  # Upper bound on the anomaly ratio
    gamma='scale'
)
ocsvm.fit(X_train_normal)       # Train on normal data only
labels = ocsvm.predict(X_test)  # 1: normal, -1: anomaly
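Because One-Class SVM is distance-based (see the scaling note below), it is commonly wrapped in a pipeline with a scaler so the same transformation is applied at fit and predict time; a minimal sketch (the pipeline is an illustrative pattern, not from the original):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Scaler + One-Class SVM as a single estimator
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel='rbf', nu=0.1, gamma='scale')
)
model.fit(X_train_normal)       # Normal data only
labels = model.predict(X_test)  # 1: normal, -1: anomaly
```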
6. Algorithm Comparison
| Algorithm | Pros | Cons | Suitable Situation |
|---|---|---|---|
| Z-Score | Fast, Easy to interpret | Assumes normal distribution | Univariate, Normal distribution |
| IQR | Robust to outliers | Simplistic | Univariate, Outliers exist |
| Isolation Forest | Fast, High-dimensional | Parameter sensitive | Large-scale, High-dimensional |
| LOF | Local anomaly detection | Slow, Density assumption | Cluster boundary anomalies |
| One-Class SVM | Complex boundaries | Slow, Scaling required | Only normal data available |
Code Summary
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
# Scaling (required!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Isolation Forest
iso = IsolationForest(contamination=0.1, random_state=42)
iso_labels = iso.fit_predict(X_scaled)
# LOF
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
lof_labels = lof.fit_predict(X_scaled)
# Check anomaly counts
print(f"Isolation Forest anomalies: {(iso_labels == -1).sum()}")
print(f"LOF anomalies: {(lof_labels == -1).sum()}")
Scaling Required: One-Class SVM and LOF in particular are distance-based, so scaling is essential. Use RobustScaler for extra robustness when outliers are present.
Evaluation Methods
When labels are available:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
# Convert labels: -1 → 1 (anomaly = 1, normal = 0)
y_pred = (labels == -1).astype(int)
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall: {recall_score(y_true, y_pred):.4f}")
print(f"F1: {f1_score(y_true, y_pred):.4f}")
# Score-based evaluation (Isolation Forest)
scores = -iso_forest.decision_function(X)  # Higher = more anomalous
print(f"ROC-AUC: {roc_auc_score(y_true, scores):.4f}")
Precision vs Recall Trade-off: Prioritize Recall when missing anomalies is unacceptable (as in fraud detection), and Precision when false positives are costly. Adjust the threshold according to business requirements.
Best Practices
1. Utilize Domain Knowledge
   - Understand which features indicate anomalies
   - Estimate the contamination ratio beforehand
2. Ensemble Multiple Methods
   - Combine results from several detectors (voting or score averaging), as in the sketch after this list
   - Classify a point as an anomaly if 2+ methods flag it
3. Scaling
   - Use StandardScaler or RobustScaler
   - RobustScaler: robust to outliers
4. Evaluation Metrics
   - Consider the Precision/Recall trade-off
   - Utilize ROC-AUC and PR-AUC
5. Threshold Adjustment
   - Tune the contamination and nu parameters
   - Adjust according to business requirements
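A minimal sketch of the 2-of-3 voting idea from item 2, assuming X is the feature matrix used throughout (the detector settings are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

X_scaled = StandardScaler().fit_transform(X)

# Each detector votes: convert {1, -1} labels into {0, 1} anomaly flags
detectors = [
    IsolationForest(contamination=0.1, random_state=42),
    LocalOutlierFactor(n_neighbors=20, contamination=0.1),
    OneClassSVM(kernel='rbf', nu=0.1, gamma='scale'),
]
votes = np.column_stack([
    (det.fit_predict(X_scaled) == -1).astype(int) for det in detectors
])

# Flag as anomaly when at least 2 of the 3 methods agree
ensemble_anomalies = votes.sum(axis=1) >= 2
print(f"Ensemble anomalies: {ensemble_anomalies.sum()}")
```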
Selection Guide
| Situation | Recommended Algorithm |
|---|---|
| Large-scale, High-dimensional | Isolation Forest |
| Local anomalies | LOF |
| Only normal data available | One-Class SVM |
| Univariate, Normal distribution | Z-Score |
| Univariate, Outliers exist | IQR |
| Labels available | Supervised classifier |
Interview Questions Preview
- What is the principle of Isolation Forest?
- What's the difference between LOF and global anomaly detection?
- How do you set the contamination parameter?
Check out more interview questions at Premium Interviews.
Practice Notebook
The practice notebook additionally covers synthetic data generation and visualization, parameter impact analysis for each algorithm, a credit card fraud detection simulation, ROC curve and Confusion Matrix analysis, and ensemble anomaly detection practice problems.
Previous: 08. Dimensionality Reduction | Next: 10. Imbalanced Data