09. Anomaly Detection
Isolation Forest, LOF, One-Class SVM
Learning Objectives
After completing this tutorial, you will be able to:
- Understand anomaly detection concepts and the main anomaly types (Point, Contextual, Collective)
- Detect simple anomalies with statistical methods (Z-Score, IQR)
- Implement and compare machine learning methods (Isolation Forest, LOF, One-Class SVM)
- Apply anomaly detection to practical problems (e.g., credit card fraud detection)
- Select the most suitable algorithm for a given situation
Key Concepts
1. What is an Anomaly?
An anomaly is a data point or pattern that deviates significantly from the behavior of normal data.
| Type | Description | Example |
|---|---|---|
| Point Anomaly | A single data point is anomalous on its own | A suddenly large transaction amount |
| Contextual Anomaly | Anomalous only in a specific context | A heating bill spike in summer |
| Collective Anomaly | A group of points is anomalous as a whole | A run of consecutive abnormal heartbeats |
Major Use Cases
- Finance: Credit card fraud detection
- Manufacturing: Defect detection
- Security: Network intrusion detection
- Medical: Diagnosis of abnormal conditions
- IoT: Sensor anomaly detection
2. Statistical Methods
Z-Score
Calculates how many standard deviations each data point is from the mean.
from scipy import stats
import numpy as np
# X: array of shape (n_samples, n_features)
# Calculate the Z-score of each feature
z_scores = np.abs(stats.zscore(X))
# Flag as anomaly if any feature exceeds the threshold (typically |Z| > 3)
threshold = 3
anomalies = (z_scores > threshold).any(axis=1)
Z-Score Threshold: 3 is the common default, but values between 2 and 4 may be appropriate depending on the data. The method assumes an approximately normal distribution, so use caution with skewed or heavy-tailed data.
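A quick self-contained check of the snippet above, using synthetic data (the generated points and injected outliers are illustrative assumptions, not from the original):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# 300 normally distributed points; shift the first 5 rows far from the mean
X = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
X[:5] += 8  # injected outliers (illustrative)

z_scores = np.abs(stats.zscore(X))
anomalies = (z_scores > 3).any(axis=1)
print(f"Flagged {anomalies.sum()} of {len(X)} points")  # expect ~5, plus the odd tail point
```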
IQR (Interquartile Range)
Detects anomalies using the interquartile range; because it relies on quartiles rather than the mean, it is more robust to outliers.
# X: pandas DataFrame
Q1 = X.quantile(0.25)
Q3 = X.quantile(0.75)
IQR = Q3 - Q1
# Points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are flagged as anomalies
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
anomalies = ((X < lower) | (X > upper)).any(axis=1)
3. Isolation Forest
Core Idea: Anomalies can be isolated with only a few splits
- Select a random feature and a random split point
- Split recursively until each point is isolated
- Measure the path length needed to isolate a point
- Shorter path = more anomalous
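For reference, the original paper (Liu et al., 2008) turns this path-length intuition into an anomaly score:

$$
s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}
$$

where $h(x)$ is the number of splits needed to isolate $x$ in one tree, $E[h(x)]$ is its average over all trees, and $c(n)$ normalizes by the average path length of an unsuccessful search in a binary search tree of $n$ points. Scores close to 1 indicate anomalies; scores well below 0.5 indicate normal points. sklearn's score_samples and decision_function are sign-flipped and shifted versions of this quantity, which is why lower values mean more anomalous.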
from sklearn.ensemble import IsolationForest
iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.1,  # Expected anomaly ratio
    random_state=42
)
labels = iso_forest.fit_predict(X)        # 1: normal, -1: anomaly
scores = iso_forest.decision_function(X)  # Score (lower = more anomalous)
contamination Parameter: Sets the expected anomaly ratio and should be close to the actual ratio for good performance. Too high, and normal points get flagged as anomalies; too low, and real anomalies get missed.
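A minimal sketch of contamination's effect, assuming an illustrative synthetic dataset (the make_blobs setup and injected outliers are not from the original):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.ensemble import IsolationForest

# One dense cluster plus 10 uniformly scattered outliers
X_normal, _ = make_blobs(n_samples=290, centers=1, cluster_std=1.0, random_state=0)
X_outliers = np.random.RandomState(0).uniform(low=-8, high=8, size=(10, 2))
X_demo = np.vstack([X_normal, X_outliers])

# Same data, different contamination: the score threshold shifts,
# so the number of flagged points tracks the assumed anomaly ratio
for c in (0.01, 0.05, 0.2):
    labels = IsolationForest(contamination=c, random_state=42).fit_predict(X_demo)
    print(f"contamination={c}: {(labels == -1).sum()} points flagged")
```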
Advantages:
- Effective in high dimensions
- Fast training/prediction
- Memory efficient
4. LOF (Local Outlier Factor)
Core Idea: Compare local densities
- A point is anomalous if its density is lower than that of its neighbors
- Effective for detecting local anomalies
- LOF ≈ 1: Normal (density similar to neighbors)
- LOF > 1: Anomalous (density lower than neighbors)
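For reference, the LOF score (Breunig et al., 2000) makes "lower density than neighbors" precise as a ratio of local reachability densities ($\mathrm{lrd}$):

$$
\mathrm{LOF}_k(A) = \frac{1}{|N_k(A)|} \sum_{B \in N_k(A)} \frac{\mathrm{lrd}_k(B)}{\mathrm{lrd}_k(A)}
$$

where $N_k(A)$ is the set of $k$ nearest neighbors of $A$ and $\mathrm{lrd}_k$ is the inverse of the average reachability distance from a point to its neighbors. The ratio exceeds 1 precisely when $A$'s density is lower than the average density of its neighborhood.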
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.1
)
labels = lof.fit_predict(X)             # 1: normal, -1: anomaly
scores = -lof.negative_outlier_factor_  # Higher = more anomalous
n_neighbors Parameter: The number of neighbors used for the local density estimate. Small values are sensitive to local anomalies; large values capture more global patterns. The sklearn default is 20; adjust it based on dataset size and structure.
LOF Limitation: By default only fit_predict is available, so scoring new data after training is not supported. Set novelty=True if you need to predict on unseen data.
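A minimal sketch of the novelty=True workflow (X_train and X_new are placeholders): in novelty mode you fit on training data and score unseen points, while fit_predict becomes unavailable:

```python
from sklearn.neighbors import LocalOutlierFactor

# Novelty mode: fit on (mostly normal) training data...
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof_novelty.fit(X_train)

# ...then score previously unseen points
new_labels = lof_novelty.predict(X_new)            # 1: normal, -1: anomaly
new_scores = lof_novelty.decision_function(X_new)  # Lower = more anomalous
```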
5. One-Class SVM
Learns a boundary around normal data only; with the kernel trick it can learn non-linear boundaries.
from sklearn.svm import OneClassSVM
ocsvm = OneClassSVM(
    kernel='rbf',
    nu=0.1,  # Upper bound on the anomaly ratio
    gamma='scale'
)
ocsvm.fit(X_train_normal)       # Train on normal data only
labels = ocsvm.predict(X_test)  # 1: normal, -1: anomaly
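Because One-Class SVM is distance-based (see the scaling note below), it is commonly wrapped in a pipeline with a scaler so the same transformation is applied at fit and predict time; a minimal sketch (the pipeline is an illustrative pattern, not from the original):

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

# Scaler + One-Class SVM as a single estimator
model = make_pipeline(
    StandardScaler(),
    OneClassSVM(kernel='rbf', nu=0.1, gamma='scale')
)
model.fit(X_train_normal)       # Normal data only
labels = model.predict(X_test)  # 1: normal, -1: anomaly
```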
6. Algorithm Comparison
| Algorithm | Pros | Cons | Suitable Situation |
|---|---|---|---|
| Z-Score | Fast, Easy to interpret | Assumes normal distribution | Univariate, Normal distribution |
| IQR | Robust to outliers | Simplistic | Univariate, Outliers exist |
| Isolation Forest | Fast, High-dimensional | Parameter sensitive | Large-scale, High-dimensional |
| LOF | Local anomaly detection | Slow, Density assumption | Cluster boundary anomalies |
| One-Class SVM | Complex boundaries | Slow, Scaling required | Only normal data available |
Code Summary
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
# Scaling (required!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Isolation Forest
iso = IsolationForest(contamination=0.1, random_state=42)
iso_labels = iso.fit_predict(X_scaled)
# LOF
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
lof_labels = lof.fit_predict(X_scaled)
# Check anomaly counts
print(f"Isolation Forest anomalies: {(iso_labels == -1).sum()}")
print(f"LOF anomalies: {(lof_labels == -1).sum()}")
Scaling Required: One-Class SVM and LOF in particular are distance-based, so scaling is essential. Use RobustScaler for extra robustness when outliers are present.
Evaluation Methods
When labels are available:
from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
# Convert labels: -1 → 1 (anomaly = 1, normal = 0)
y_pred = (labels == -1).astype(int)
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall: {recall_score(y_true, y_pred):.4f}")
print(f"F1: {f1_score(y_true, y_pred):.4f}")
# Score-based evaluation (Isolation Forest)
scores = -iso_forest.decision_function(X)  # Higher = more anomalous
print(f"ROC-AUC: {roc_auc_score(y_true, scores):.4f}")
Precision vs Recall Trade-off: Prioritize Recall when missing anomalies is unacceptable (as in fraud detection), and Precision when false positives are costly. Adjust the threshold according to business requirements.
Best Practices
1. Utilize Domain Knowledge
   - Understand which features indicate anomalies
   - Estimate the contamination ratio beforehand
2. Ensemble Multiple Methods
   - Combine results from several detectors (voting or score averaging), as in the sketch after this list
   - Classify a point as an anomaly if 2+ methods flag it
3. Scaling
   - Use StandardScaler or RobustScaler
   - RobustScaler: robust to outliers
4. Evaluation Metrics
   - Consider the Precision/Recall trade-off
   - Utilize ROC-AUC and PR-AUC
5. Threshold Adjustment
   - Tune the contamination and nu parameters
   - Adjust according to business requirements
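A minimal sketch of the 2-of-3 voting idea from item 2, assuming X is the feature matrix used throughout (the detector settings are illustrative):

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

X_scaled = StandardScaler().fit_transform(X)

# Each detector votes: convert {1, -1} labels into {0, 1} anomaly flags
detectors = [
    IsolationForest(contamination=0.1, random_state=42),
    LocalOutlierFactor(n_neighbors=20, contamination=0.1),
    OneClassSVM(kernel='rbf', nu=0.1, gamma='scale'),
]
votes = np.column_stack([
    (det.fit_predict(X_scaled) == -1).astype(int) for det in detectors
])

# Flag as anomaly when at least 2 of the 3 methods agree
ensemble_anomalies = votes.sum(axis=1) >= 2
print(f"Ensemble anomalies: {ensemble_anomalies.sum()}")
```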
Selection Guide
| Situation | Recommended Algorithm |
|---|---|
| Large-scale, High-dimensional | Isolation Forest |
| Local anomalies | LOF |
| Only normal data available | One-Class SVM |
| Univariate, Normal distribution | Z-Score |
| Univariate, Outliers exist | IQR |
| Labels available | Supervised classifier |
Interview Questions Preview
- What is the principle of Isolation Forest?
- What's the difference between LOF and global anomaly detection?
- How do you set the contamination parameter?
Check out more interview questions at Premium Interviews.
Practice Notebook
The practice notebook additionally covers synthetic data generation and visualization, parameter impact analysis for each algorithm, a credit card fraud detection simulation, ROC curve and Confusion Matrix analysis, and ensemble anomaly detection practice problems.
Previous: 08. Dimensionality Reduction | Next: 10. Imbalanced Data