
09. Anomaly Detection

Isolation Forest, LOF, One-Class SVM


Learning Objectives

After completing this tutorial, you will be able to:

  • Understand anomaly detection concepts and various types (Point, Contextual, Collective)
  • Detect simple anomalies with statistical methods (Z-Score, IQR)
  • Implement and compare machine learning methods (Isolation Forest, LOF, One-Class SVM)
  • Apply anomaly detection to practical data (credit card fraud detection, etc.)
  • Select optimal algorithm for different situations

Key Concepts

1. What is an Anomaly?

An anomaly is a data point (or group of points) whose pattern differs significantly from the rest of the data.

| Type | Description | Example |
| --- | --- | --- |
| Point Anomaly | An individual data point is anomalous | Suddenly high transaction amount |
| Contextual Anomaly | Anomalous only in a specific context | Heating bill spike in summer |
| Collective Anomaly | A group of points is anomalous together | Consecutive abnormal heartbeats |

Major Use Cases

  • Finance: Credit card fraud detection
  • Manufacturing: Defect detection
  • Security: Network intrusion detection
  • Medical: Abnormal diagnosis
  • IoT: Sensor anomaly detection

2. Statistical Methods

Z-Score

Calculates how many standard deviations each data point is from the mean.

$$Z = \frac{x - \mu}{\sigma}$$

from scipy import stats
import numpy as np
 
# Calculate Z-score for each feature
z_scores = np.abs(stats.zscore(X))
 
# Flag as anomaly if exceeds threshold (typically |Z| > 3)
threshold = 3
anomalies = (z_scores > threshold).any(axis=1)

Z-Score Threshold: 3 is commonly used, but it can be adjusted between 2 and 4 depending on data characteristics. The method assumes a normal distribution, so use caution with skewed or heavy-tailed data.

IQR (Interquartile Range)

Detects anomalies using the interquartile range. Because it relies on quartiles rather than the mean and standard deviation, it is more robust to extreme values.

# X is assumed to be a pandas DataFrame
Q1 = X.quantile(0.25)
Q3 = X.quantile(0.75)
IQR = Q3 - Q1
 
# 1.5 * IQR rule: points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] are anomalies
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
anomalies = ((X < lower) | (X > upper)).any(axis=1)

3. Isolation Forest

Core Idea: Anomalies are isolated with few splits

  1. Select random feature, random split point
  2. Split recursively
  3. Measure path length to isolation
  4. Shorter path = more anomalous
from sklearn.ensemble import IsolationForest
 
iso_forest = IsolationForest(
    n_estimators=100,
    contamination=0.1,  # Expected anomaly ratio
    random_state=42
)
labels = iso_forest.fit_predict(X)  # 1: normal, -1: anomaly
scores = iso_forest.decision_function(X)  # Score (lower = more anomalous)
⚠️ contamination Parameter: Sets the expected anomaly ratio. It should be set close to the actual anomaly ratio for good performance: too high and normal points get flagged as anomalies, too low and real anomalies are missed.
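
To see what contamination actually does, you can inspect the fitted offset_ attribute (a minimal sketch, assuming iso_forest has been fit as above). In scikit-learn, decision_function(X) equals score_samples(X) - offset_, so points whose raw score falls below offset_ are labeled -1:

scores = iso_forest.score_samples(X)
print(f"Learned score threshold (offset_): {iso_forest.offset_:.4f}")
print(f"Flagged ratio: {(scores < iso_forest.offset_).mean():.2%}")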

Advantages:

  • Effective in high dimensions
  • Fast training/prediction
  • Memory efficient

4. LOF (Local Outlier Factor)

Core Idea: Compare local density

  • A point is anomalous if its local density is much lower than that of its neighbors
  • Effective for detecting local anomalies
  • LOF ≈ 1: Normal (density similar to neighbors)
  • LOF > 1: Anomaly (density lower than neighbors)
from sklearn.neighbors import LocalOutlierFactor
 
lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.1
)
labels = lof.fit_predict(X)  # 1: normal, -1: anomaly
scores = -lof.negative_outlier_factor_  # Higher = more anomalous

n_neighbors Parameter: Number of neighbors used for the local density calculation. Small values are sensitive to local anomalies; large values capture more global patterns. The scikit-learn default is 20; adjust it based on dataset size and characteristics.

⚠️ LOF Limitation: By default only fit_predict is available, so the model cannot score new data after training. Set novelty=True if you need predictions on unseen data; in that mode you fit on training data and then call predict or decision_function on new samples (and fit_predict becomes unavailable).
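
A minimal sketch of novelty mode (X_train and X_new are placeholder names for mostly-normal training data and unseen data):

from sklearn.neighbors import LocalOutlierFactor
 
# novelty=True enables predict/decision_function on unseen data
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof_novelty.fit(X_train)  # fit on (mostly normal) training data
 
new_labels = lof_novelty.predict(X_new)            # 1: normal, -1: anomaly
new_scores = lof_novelty.decision_function(X_new)  # lower = more anomalous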


5. One-Class SVM

Learns boundary using only normal data. Can learn non-linear boundaries with kernel trick.

from sklearn.svm import OneClassSVM
 
ocsvm = OneClassSVM(
    kernel='rbf',
    nu=0.1,  # Upper bound on the fraction of training errors (≈ expected anomaly ratio)
    gamma='scale'
)
ocsvm.fit(X_train_normal)  # Normal data only
labels = ocsvm.predict(X_test)  # 1: normal, -1: anomaly

6. Algorithm Comparison

| Algorithm | Pros | Cons | Suitable Situation |
| --- | --- | --- | --- |
| Z-Score | Fast, easy to interpret | Assumes normal distribution | Univariate, normal distribution |
| IQR | Robust to outliers | Too simple for complex patterns | Univariate, outliers present |
| Isolation Forest | Fast, handles high dimensions | Sensitive to parameters | Large-scale, high-dimensional |
| LOF | Detects local anomalies | Slow, relies on density assumptions | Anomalies near cluster boundaries |
| One-Class SVM | Learns complex boundaries | Slow, requires scaling | Only normal data available |

Code Summary

from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.preprocessing import StandardScaler
 
# Scaling (required!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
 
# Isolation Forest
iso = IsolationForest(contamination=0.1, random_state=42)
iso_labels = iso.fit_predict(X_scaled)
 
# LOF
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
lof_labels = lof.fit_predict(X_scaled)
 
# Check anomalies
print(f"Isolation Forest anomalies: {(iso_labels == -1).sum()}")
print(f"LOF anomalies: {(lof_labels == -1).sum()}")
⚠️ Scaling Required: One-Class SVM and LOF in particular are distance-based, so scaling is essential. Use RobustScaler for extra robustness when outliers are present.
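
Swapping in RobustScaler is a one-line change (a sketch, reusing X from above):

from sklearn.preprocessing import RobustScaler
 
# RobustScaler centers on the median and scales by the IQR,
# so extreme outliers distort the scaling far less than mean/std scaling
robust_scaler = RobustScaler()
X_robust = robust_scaler.fit_transform(X)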


Evaluation Methods

When labels are available:

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score
 
# Map detector labels to binary: anomaly (-1) → 1, normal (1) → 0
y_pred = (labels == -1).astype(int)
 
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall: {recall_score(y_true, y_pred):.4f}")
print(f"F1: {f1_score(y_true, y_pred):.4f}")
 
# Score-based evaluation (Isolation Forest)
scores = -iso_forest.decision_function(X)  # Higher = more anomalous
print(f"ROC-AUC: {roc_auc_score(y_true, scores):.4f}")

Precision vs Recall Trade-off: Prioritize recall when missing anomalies is unacceptable (as in fraud detection), and precision when false positives are costly. Adjust the threshold according to business requirements, as sketched below.
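
One way to pick an operating point (a sketch, reusing scores and y_true from the evaluation code above) is to choose the threshold from the precision-recall curve instead of relying on contamination:

from sklearn.metrics import precision_recall_curve
 
precision, recall, thresholds = precision_recall_curve(y_true, scores)
 
# Example policy: the highest threshold that still achieves recall >= 0.9
# (recall[:-1] aligns with thresholds, which are sorted in ascending order)
target_recall = 0.9
ok = recall[:-1] >= target_recall
chosen = thresholds[ok][-1]
y_pred_custom = (scores >= chosen).astype(int)
print(f"Chosen threshold: {chosen:.4f}")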


Best Practices

  1. Utilize Domain Knowledge

    • Understand which features indicate anomalies
    • Estimate contamination ratio beforehand
  2. Ensemble Multiple Methods

    • Combine results from multiple methods (voting or score averaging)
    • Classify as anomaly if 2+ methods flag it (see the voting sketch after this list)
  3. Scaling

    • Use StandardScaler or RobustScaler
    • RobustScaler: Robust to outliers
  4. Evaluation Metrics

    • Consider Precision/Recall trade-off
    • Utilize ROC-AUC, PR-AUC
  5. Threshold Adjustment

    • Tune contamination, nu parameters
    • Adjust according to business requirements
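
A minimal voting sketch for practice 2 above (iso_labels and lof_labels come from the code summary; svm_labels is a hypothetical One-Class SVM result, all coded 1/-1):

# Count how many detectors flag each point as an anomaly (-1)
votes = ((iso_labels == -1).astype(int)
         + (lof_labels == -1).astype(int)
         + (svm_labels == -1).astype(int))
 
# Majority rule: classify as anomaly if 2+ of the 3 methods agree
ensemble_anomaly = votes >= 2
print(f"Ensemble anomalies: {ensemble_anomaly.sum()}")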

Selection Guide

| Situation | Recommended Algorithm |
| --- | --- |
| Large-scale, high-dimensional | Isolation Forest |
| Local anomalies | LOF |
| Only normal data available | One-Class SVM |
| Univariate, normal distribution | Z-Score |
| Univariate, outliers present | IQR |
| Labels available | Supervised classifier |

Interview Questions Preview

  1. What is the principle of Isolation Forest?
  2. What's the difference between LOF and global anomaly detection?
  3. How do you set the contamination parameter?

Check out more interview questions at Premium Interviews.


Practice Notebook

Additional notebook content: The practice notebook covers synthetic data generation and visualization, parameter impact analysis for each algorithm, credit card fraud detection simulation, ROC curve and Confusion Matrix analysis, and ensemble anomaly detection practice problems.


Previous: 08. Dimensionality Reduction | Next: 10. Imbalanced Data