
01. Mastering ML Pipeline

Train/Val/Test Split, Cross-Validation, Preventing Data Leakage


Learning Objectives

After completing this tutorial, you will be able to:

  1. Understand the principles and proper methods of Train/Validation/Test Split
  2. Implement and utilize Cross-Validation (K-Fold, Stratified K-Fold)
  3. Understand the risks of Data Leakage and build Pipelines to prevent it
  4. Compare and analyze Feature Scaling methods (Standard, MinMax, Robust)
  5. Build reproducible ML workflows using sklearn.pipeline.Pipeline and ColumnTransformer

Key Concepts

1. What is a Machine Learning Pipeline?

An ML Pipeline is a workflow that sequentially connects all steps from data preprocessing to model training.

Raw Data → Preprocessing → Feature Engineering → Model Training → Prediction

Without a Pipeline:

  • Manual management of preprocessing steps
  • Risk of applying different transformations to Train/Test data
  • Increased possibility of Data Leakage

2. Train/Validation/Test Split

Why is a 3-way Split necessary?

Set        | Purpose                                   | Ratio
-----------|-------------------------------------------|-------
Train      | Model training                            | 60-70%
Validation | Hyperparameter tuning                     | 15-20%
Test       | Final performance evaluation (only once!) | 15-20%
⚠️ The Test Set should be used only once, for the final evaluation. Using it multiple times causes overfitting to the Test Set.

Wrong approach: Simple Random Split

from sklearn.model_selection import train_test_split

# Splitting without stratify → class ratios can drift between Train and Test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

Correct approach: Stratified Split

from sklearn.model_selection import train_test_split
 
# Split while maintaining target distribution
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
# Result: Train 60%, Val 20%, Test 20% (all with same class ratio)
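A quick sanity check (a small sketch, assuming y is array-like) confirms that all three sets keep the same class ratio:

import pandas as pd

# Class proportions should be (nearly) identical across Train/Val/Test
for name, target in [('Train', y_train), ('Val', y_val), ('Test', y_test)]:
    print(name, pd.Series(target).value_counts(normalize=True).round(3).to_dict())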

3. Data Leakage

Data Leakage occurs when information that should be unavailable at training time (e.g., from the test set) leaks into the training process.

Main Causes

  1. Preprocessing Leakage: Calculating mean/std from entire data before scaling
  2. Target Leakage: Using features derived from the target
  3. Time Leakage: Using future data to predict the past (time series; see the sketch at the end of this section)

Wrong Example

# ❌ Scale entire data → Split (Leakage!)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Uses mean/std of entire data
X_train, X_test = train_test_split(X_scaled, ...)
 
# Problem: Test data information leaks into Train

Correct Example

# ✅ Split → Fit on Train only → Transform each
X_train, X_test = train_test_split(X, ...)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # Train only!
X_test_scaled = scaler.transform(X_test)  # transform only (no fit)
🚫 Even with leakage, test performance may appear good. In actual production, however, performance degrades significantly!
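For the third leakage cause above, time leakage, an ordered splitter keeps every training fold strictly in the past relative to its validation fold. A minimal sketch using scikit-learn's TimeSeriesSplit (assuming the rows of X are sorted by time):

from sklearn.model_selection import TimeSeriesSplit

# Each training fold contains only samples that precede its validation fold
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, val_idx in tscv.split(X):
    print(f"train ends at {train_idx[-1]}, validation covers {val_idx[0]}-{val_idx[-1]}")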


4. Cross-Validation

Evaluating on several different splits of the data overcomes the limitations of a single split.

K-Fold CV

Divide data into K parts, use each fold once as validation:

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=42)  # any estimator works here
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f"Mean: {scores.mean():.4f} (±{scores.std():.4f})")
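When preprocessing is involved, pass a Pipeline to cross_val_score rather than pre-scaled data: the scaler is then re-fit on each fold's training portion, so no validation fold ever influences the scaling statistics. A minimal sketch (KNN is chosen here because it is scale-sensitive):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# fit_transform happens inside each fold, on that fold's training portion only
leak_free = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier())
])
scores = cross_val_score(leak_free, X, y, cv=cv, scoring='accuracy')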

K-Fold vs Stratified K-Fold

Method            | Characteristics                                    | When to Use
------------------|----------------------------------------------------|---------------------------------------------
K-Fold            | Simply divides the data into K parts               | Regression problems
Stratified K-Fold | Divides into K parts while preserving class ratios | Classification (especially imbalanced data)

CV with Multiple Metrics

from sklearn.model_selection import cross_validate
 
scoring = {
    'accuracy': 'accuracy',
    'precision': 'precision',
    'recall': 'recall',
    'f1': 'f1',
    'roc_auc': 'roc_auc'
}
 
results = cross_validate(model, X, y, cv=cv, scoring=scoring)
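cross_validate returns a dictionary with one 'test_<metric>' array per scorer (plus fit and score times), so each metric can be summarized separately:

# Summarize every metric across the folds
for metric in scoring:
    fold_scores = results[f'test_{metric}']
    print(f"{metric}: {fold_scores.mean():.4f} (±{fold_scores.std():.4f})")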

5. Feature Scaling Comparison

Scaler         | Formula                 | Characteristics     | When to Use
---------------|-------------------------|---------------------|-------------------------
StandardScaler | (x - μ) / σ             | Mean 0, variance 1  | General cases
MinMaxScaler   | (x - min) / (max - min) | [0, 1] range        | Neural networks, images
RobustScaler   | (x - Q2) / (Q3 - Q1)    | Uses median and IQR | Data with many outliers
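The difference is easiest to see on data containing an outlier. A small sketch with toy numbers (purely illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler

# The extreme value 1000 drags the mean and max, but barely moves the median/IQR
X_toy = np.array([[1.0], [2.0], [3.0], [4.0], [1000.0]])
for scaler in [StandardScaler(), MinMaxScaler(), RobustScaler()]:
    print(scaler.__class__.__name__, scaler.fit_transform(X_toy).ravel().round(2))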

Models that Require Scaling vs. Models that Don't

Scaling Required    | Scaling Not Required
--------------------|---------------------
Logistic Regression | Decision Tree
SVM                 | Random Forest
KNN                 | XGBoost, LightGBM
Neural Network      |
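The right column follows from how trees work: splits are thresholds on raw feature values, so any monotonic rescaling moves the thresholds along with the data. A quick empirical check (a sketch, assuming the numeric X_train/X_test split from earlier):

from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import StandardScaler

# The same tree trained on raw vs. scaled features should score identically
scaler = StandardScaler().fit(X_train)
acc_raw = DecisionTreeClassifier(random_state=42).fit(X_train, y_train).score(X_test, y_test)
acc_scaled = DecisionTreeClassifier(random_state=42).fit(
    scaler.transform(X_train), y_train
).score(scaler.transform(X_test), y_test)
print(acc_raw, acc_scaled)  # expected: equal (up to floating-point ties)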

Building a Pipeline

Basic Pipeline

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
 
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', RandomForestClassifier())
])
 
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
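A Pipeline also plugs directly into hyperparameter search: parameters inside a step are addressed as '<step name>__<parameter>'. A minimal sketch (the grid values are illustrative):

from sklearn.model_selection import GridSearchCV

# 'classifier__' targets the RandomForestClassifier step of the pipeline
param_grid = {
    'classifier__n_estimators': [100, 200],
    'classifier__max_depth': [None, 10]
}
search = GridSearchCV(pipeline, param_grid, cv=5, scoring='accuracy')
search.fit(X_train, y_train)
print(search.best_params_, search.best_score_)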

Separate Processing for Numeric/Categorical with ColumnTransformer

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
 
# Numeric preprocessing
numeric_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
 
# Categorical preprocessing
categorical_transformer = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))
])
 
# Combine
preprocessor = ColumnTransformer([
    ('num', numeric_transformer, ['age', 'fare', 'sibsp', 'parch']),
    ('cat', categorical_transformer, ['pclass', 'sex', 'embarked'])
])
 
# Full pipeline (preprocessing + model)
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', RandomForestClassifier(n_estimators=100))
])
 
# Train & Predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
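After fitting, the preprocessor can report the names of the expanded columns (one-hot encoding creates several per categorical feature), which helps when reading feature importances. Available in scikit-learn 1.0 and later:

# Names of the transformed columns, e.g. 'cat__sex_female'
feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()
print(len(feature_names), feature_names[:5])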

Learning Curve Analysis

Diagnose Overfitting/Underfitting through changes in Train/Validation performance:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
 
train_sizes, train_scores, val_scores = learning_curve(
    pipeline, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring='accuracy'
)
 
# Visualization
plt.plot(train_sizes, train_scores.mean(axis=1), label='Train')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.legend()

Interpretation:

  • Both curves converging → Good sign
  • Large gap → Overfitting (need regularization or simpler model)
  • Both curves low → Underfitting (need more complex model)
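To judge the gap more reliably, you can shade one standard deviation around each mean curve, a small extension of the snippet above:

# Shade ±1 std around each curve to visualize fold-to-fold variance
plt.fill_between(train_sizes,
                 train_scores.mean(axis=1) - train_scores.std(axis=1),
                 train_scores.mean(axis=1) + train_scores.std(axis=1), alpha=0.2)
plt.fill_between(train_sizes,
                 val_scores.mean(axis=1) - val_scores.std(axis=1),
                 val_scores.mean(axis=1) + val_scores.std(axis=1), alpha=0.2)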

Checklist

Data Split
  ☐ Use Stratified Split
  ☐ Train/Val/Test 3-way Split
  ☐ Use the Test Set only for final evaluation

Preprocessing
  ☐ Preprocess after splitting
  ☐ Use a Pipeline to prevent Leakage
  ☐ fit_transform on Train only

Validation
  ☐ Use Cross-Validation
  ☐ Evaluate with multiple metrics
  ☐ Check for overfitting with a Learning Curve

Interview Questions Preview

  1. What is Data Leakage and how do you prevent it?
  2. What's the difference between K-Fold CV and Hold-out?
  3. When do you use Stratified Split?
  4. What's the difference between StandardScaler and RobustScaler?
  5. Why use a Pipeline?

Check out more interview questions at Premium Interviews.


Practice Notebook

Practice the above concepts with the Titanic dataset.

The notebook additionally covers:

  • EDA (Exploratory Data Analysis) and visualization
  • Performance comparison experiments with/without Leakage
  • Model performance comparison across different Scalers
  • Feature Importance analysis
  • Practice problems

Next: 02. Linear Regression