
01. Mastering ML Pipeline

Train/Val/Test Split, Cross-Validation, Preventing Data Leakage


Learning Objectives

After completing this tutorial, you will be able to:

  1. Understand the principles and proper methods of Train/Validation/Test Split
  2. Implement and utilize Cross-Validation (K-Fold, Stratified K-Fold)
  3. Understand the risks of Data Leakage and build Pipelines to prevent it
  4. Compare and analyze Feature Scaling methods (Standard, MinMax, Robust)
  5. Build reproducible ML workflows using sklearn.pipeline.Pipeline and ColumnTransformer

Key Concepts

1. What is a Machine Learning Pipeline?

An ML Pipeline is a workflow that sequentially connects all steps from data preprocessing to model training.

Raw Data → Preprocessing → Feature Engineering → Model Training → Prediction

Without a Pipeline:

  • Manual management of preprocessing steps
  • Risk of applying different transformations to Train/Test data
  • Increased possibility of Data Leakage

2. Train/Validation/Test Split

Why is a 3-way Split necessary?

Set        | Purpose                                   | Ratio
-----------|-------------------------------------------|-------
Train      | Model training                            | 60-70%
Validation | Hyperparameter tuning                     | 15-20%
Test       | Final performance evaluation (only once!) | 15-20%
⚠️ The Test Set should be used only once, for the final evaluation. Using it multiple times causes overfitting to the Test Set.

Wrong approach: Simple Random Split

from sklearn.model_selection import train_test_split

# Splitting without stratify → class ratios can drift between Train and Test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Correct approach: Stratified Split

from sklearn.model_selection import train_test_split
 
# Split while maintaining target distribution
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
# Result: Train 60%, Val 20%, Test 20% (all with same class ratio)
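A quick sanity check (a small sketch, assuming y is array-like) confirms that all three sets keep the same class ratio:

import pandas as pd

# Class proportions should be (nearly) identical across Train/Val/Test
for name, target in [('Train', y_train), ('Val', y_val), ('Test', y_test)]:
    print(name, pd.Series(target).value_counts(normalize=True).round(3).to_dict())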

3. Data Leakage

Data Leakage occurs when information that should be unavailable at training time (e.g., from the test set) leaks into the training process.

Main Causes

  1. Preprocessing Leakage: Calculating mean/std from entire data before scaling
  2. Target Leakage: Using features derived from the target
  3. Time Leakage: Using future data to predict the past (time series; see the sketch at the end of this section)

Wrong Example

# ❌ Scale entire data → Split (Leakage!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Uses mean/std of entire data
X_train, X_test = train_test_split(X_scaled, ...)
 
# Problem: Test data information leaks into Train

Correct Example

# ✅ Split → Fit on Train only → Transform each
X_train, X_test = train_test_split(X, ...)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Train only!
X_test_scaled = scaler.transform(X_test)  # transform only (no fit)
🚫 Even with leakage, test performance may appear good. In actual production, however, performance degrades significantly!
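For the third leakage cause above, time leakage, an ordered splitter keeps every training fold strictly in the past relative to its validation fold. A minimal sketch using scikit-learn's TimeSeriesSplit (assuming the rows of X are sorted by time):

from sklearn.model_selection import TimeSeriesSplit

# Each training fold contains only samples that precede its validation fold
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    print(f"train ends at {train_idx[-1]}, validation covers {val_idx[0]}-{val_idx[-1]}")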


4. Cross-Validation

Evaluating on several different splits of the data overcomes the limitations of a single split.

K-Fold CV

Divide data into K parts, use each fold once as validation:

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)  # any estimator works here
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f"Mean: {scores.mean():.4f} (±{scores.std():.4f})")
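When preprocessing is involved, pass a Pipeline to cross_val_score rather than pre-scaled data: the scaler is then re-fit on each fold's training portion, so no validation fold ever influences the scaling statistics. A minimal sketch (KNN is chosen here because it is scale-sensitive):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# fit_transform happens inside each fold, on that fold's training portion only
leak_free = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
scores = cross_val_score(leak_free, X, y, cv=cv, scoring='accuracy')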

K-Fold vs Stratified K-Fold

Method            | Characteristics                                    | When to Use
------------------|----------------------------------------------------|---------------------------------------------
K-Fold            | Simply divides the data into K parts               | Regression problems
Stratified K-Fold | Divides into K parts while preserving class ratios | Classification (especially imbalanced data)

CV with Multiple Metrics

from sklearn.model_selection import cross_validate
 
scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1',
    'roc_auc': 'roc_auc'
}
 
results = cross_validate(model, X, y, cv=cv, scoring=scoring)
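cross_validate returns a dictionary with one 'test_<metric>' array per scorer (plus fit and score times), so each metric can be summarized separately:

# Summarize every metric across the folds
for metric in scoring:
    fold_scores = results[f'test_{metric}']
    print(f"{metric}: {fold_scores.mean():.4f} (±{fold_scores.std():.4f})")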

5. Feature Scaling Comparison

Scaler         | Formula                 | Characteristics     | When to Use
---------------|-------------------------|---------------------|-------------------------
StandardScaler | (x - μ) / σ             | Mean 0, variance 1  | General cases
MinMaxScaler   | (x - min) / (max - min) | [0, 1] range        | Neural networks, images
RobustScaler   | (x - Q2) / (Q3 - Q1)    | Uses median and IQR | Data with many outliers
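The difference is easiest to see on data containing an outlier. A small sketch with toy numbers (purely illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# The extreme value 1000 drags the mean and max, but barely moves the median/IQR
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
for scaler in [StandardScaler(), MinMaxScaler(), RobustScaler()]:
    print(scaler.__class__.__name__, scaler.fit_transform(X_toy).ravel().round(2))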

Models that Require Scaling vs. Models that Don't

Scaling Required    | Scaling Not Required
--------------------|---------------------
Logistic Regression | Decision Tree
SVM                 | Random Forest
KNN                 | XGBoost, LightGBM
Neural Network      |
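The right column follows from how trees work: splits are thresholds on raw feature values, so any monotonic rescaling moves the thresholds along with the data. A quick empirical check (a sketch, assuming the numeric X_train/X_test split from earlier):

from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

# The same tree trained on raw vs. scaled features should score identically
scaler = StandardScaler().fit(X_train)
acc_raw = DecisionTreeClassifier(random_state=42).fit(X_train, y_train).score(X_test, y_test)
acc_scaled = DecisionTreeClassifier(random_state=42).fit(
    scaler.transform(X_train), y_train
).score(scaler.transform(X_test), y_test)
print(acc_raw, acc_scaled)  # expected: equal (up to floating-point ties)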

Building a Pipeline

Basic Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
 
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])
 
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
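A Pipeline also plugs directly into hyperparameter search: parameters inside a step are addressed as '<step name>__<parameter>'. A minimal sketch (the grid values are illustrative):

from sklearn.model_selection import GridSearchCV

# 'classifier__' targets the RandomForestClassifier step of the pipeline
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [None, 10]
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)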

Separate Processing for Numeric/Categorical with ColumnTransformer

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
 
# Numeric preprocessing
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
 
# Categorical preprocessing
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
 
# Combine
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, ['age', 'fare', 'sibsp', 'parch']),
    ('cat', categorical_transformer, ['pclass', 'sex', 'embarked'])
])
 
# Full pipeline (preprocessing + model)
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100))
])
 
# Train & Predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
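After fitting, the preprocessor can report the names of the expanded columns (one-hot encoding creates several per categorical feature), which helps when reading feature importances. Available in scikit-learn 1.0 and later:

# Names of the transformed columns, e.g. 'cat__sex_female'
feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()
print(len(feature_names), feature_names[:5])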

Learning Curve Analysis

Diagnose Overfitting/Underfitting through changes in Train/Validation performance:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
 
train_sizes, train_scores, val_scores = learning_curve(
    pipeline, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='accuracy'
)
 
# Visualization
plt.plot(train_sizes, train_scores.mean(axis=1), label='Train')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.legend()

Interpretation:

  • Both curves converging → Good sign
  • Large gap → Overfitting (need regularization or simpler model)
  • Both curves low → Underfitting (need more complex model)
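To judge the gap more reliably, you can shade one standard deviation around each mean curve, a small extension of the snippet above:

# Shade ±1 std around each curve to visualize fold-to-fold variance
plt.fill_between(train_sizes,
                 train_scores.mean(axis=1) - train_scores.std(axis=1),
                 train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.2)
plt.fill_between(train_sizes,
                 val_scores.mean(axis=1) - val_scores.std(axis=1),
                 val_scores.mean(axis=1) + val_scores.std(axis=1), alpha=0.2)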

Checklist

Data Split
  ☐ Use Stratified Split
  ☐ Train/Val/Test 3-way Split
  ☐ Use the Test Set only for final evaluation

Preprocessing
  ☐ Preprocess after splitting
  ☐ Use a Pipeline to prevent Leakage
  ☐ fit_transform on Train only

Validation
  ☐ Use Cross-Validation
  ☐ Evaluate with multiple metrics
  ☐ Check for overfitting with a Learning Curve

Interview Questions Preview

  1. What is Data Leakage and how do you prevent it?
  2. What's the difference between K-Fold CV and Hold-out?
  3. When do you use Stratified Split?
  4. What's the difference between StandardScaler and RobustScaler?
  5. Why use a Pipeline?

Check out more interview questions at Premium Interviews.


Practice Notebook

Practice the above concepts with the Titanic dataset.

The notebook additionally covers:

  • EDA (Exploratory Data Analysis) and visualization
  • Performance comparison experiments with/without Leakage
  • Model performance comparison across different Scalers
  • Feature Importance analysis
  • Practice problems

Next: 02. Linear Regression