01. Mastering ML Pipeline
Train/Val/Test Split, Cross-Validation, Preventing Data Leakage
Learning Objectives
After completing this tutorial, you will be able to:
- Understand the principles and proper methods of Train/Validation/Test Split
- Implement and utilize Cross-Validation (K-Fold, Stratified K-Fold)
- Understand the risks of Data Leakage and build Pipelines to prevent it
- Compare and analyze Feature Scaling methods (Standard, MinMax, Robust)
- Build reproducible ML workflows using sklearn.pipeline.Pipeline and ColumnTransformer
Key Concepts
1. What is a Machine Learning Pipeline?
An ML Pipeline is a workflow that sequentially connects all steps from data preprocessing to model training.
Raw Data → Preprocessing → Feature Engineering → Model Training → Prediction
Without a Pipeline:
- Manual management of preprocessing steps
- Risk of applying different transformations to Train/Test data
- Increased possibility of Data Leakage
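For contrast, here is the same workflow written both ways. This is a minimal sketch assuming X_train, X_test, and y_train are already defined and using LogisticRegression as a stand-in estimator; the Pipeline API itself is covered in detail below:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# Manual: two objects whose fit/transform calls you must keep in sync yourself
scaler = StandardScaler()
model = LogisticRegression()
model.fit(scaler.fit_transform(X_train), y_train)
pred_manual = model.predict(scaler.transform(X_test))
# Pipeline: one object, one fit, no chance of mismatched transformations
pipe = Pipeline([('scaler', StandardScaler()), ('clf', LogisticRegression())])
pipe.fit(X_train, y_train)
pred_pipe = pipe.predict(X_test)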
2. Train/Validation/Test Split
Why is 3-way Split necessary?
| Set | Purpose | Ratio |
|---|---|---|
| Train | Model training | 60-70% |
| Validation | Hyperparameter tuning | 15-20% |
| Test | Final performance evaluation (only once!) | 15-20% |
The Test Set should be used only once for final evaluation. Using it multiple times causes overfitting to the Test Set.
Wrong approach: Simple Random Split
# Splitting without stratify → Train/Test class ratios may drift from the original data
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
Correct approach: Stratified Split
from sklearn.model_selection import train_test_split
# Split while maintaining target distribution
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, random_state=42, stratify=y_temp
)
# Result: Train 60%, Val 20%, Test 20% (all with same class ratio)
3. Data Leakage
Data Leakage occurs when information from outside the training set (e.g., the test data) leaks into the training process.
Main Causes
- Preprocessing Leakage: Computing statistics (e.g., mean/std for scaling) on the entire dataset before splitting
- Target Leakage: Using features derived from the target
- Time Leakage: Using future information to predict the past (time series; see the TimeSeriesSplit sketch at the end of this section)
Wrong Example
# ❌ Scale entire data → Split (Leakage!)
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # Uses mean/std of the entire data, including test rows
X_train, X_test = train_test_split(X_scaled, ...)
# Problem: Test data statistics leak into Train
Correct Example
# ✅ Split → Fit on Train only → Transform each
X_train, X_test = train_test_split(X, ...)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train) # Train only!
X_test_scaled = scaler.transform(X_test)  # transform only (no fit)
Even with leakage, test performance may appear good. However, performance significantly degrades in actual production!
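For the time-leakage case, scikit-learn's TimeSeriesSplit provides an ordered alternative to random splitting: every validation fold comes strictly after its training fold, so the model never sees the future. A minimal sketch (the toy arrays are assumptions for illustration):
import numpy as np
from sklearn.model_selection import TimeSeriesSplit
# Toy time-ordered data: 10 observations
X = np.arange(10).reshape(-1, 1)
y = np.arange(10)
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, val_idx in tscv.split(X):
    # Training indices always precede validation indices
    print(f"train={train_idx} val={val_idx}")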
4. Cross-Validation
Cross-validation evaluates the model on multiple splits of the data, overcoming the limitations of a single hold-out split.
K-Fold CV
Divide data into K parts, use each fold once as validation:
from sklearn.model_selection import cross_val_score, StratifiedKFold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
print(f"Mean: {scores.mean():.4f} (±{scores.std():.4f})")K-Fold vs Stratified K-Fold
| Method | Characteristics | When to Use |
|---|---|---|
| K-Fold | Simply divides into K parts | Regression problems |
| Stratified K-Fold | Divides into K parts while maintaining class ratio | Classification (especially imbalanced data) |
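The difference is easy to see on an imbalanced target: plain KFold can produce folds whose class ratio drifts away from the original, while StratifiedKFold keeps it fixed. A minimal sketch (the toy target is an assumption for illustration):
import numpy as np
from sklearn.model_selection import KFold, StratifiedKFold
# Toy imbalanced target: 90 negatives, 10 positives
X = np.zeros((100, 1))
y = np.array([0] * 90 + [1] * 10)
for name, cv in [('KFold', KFold(n_splits=5, shuffle=True, random_state=42)),
                 ('StratifiedKFold', StratifiedKFold(n_splits=5, shuffle=True, random_state=42))]:
    # Positive-class ratio in each validation fold
    ratios = [y[val_idx].mean() for _, val_idx in cv.split(X, y)]
    print(name, [round(r, 2) for r in ratios])
# StratifiedKFold holds every fold at 0.10; plain KFold drifts above and below it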
CV with Multiple Metrics
from sklearn.model_selection import cross_validate
scoring = {
'accuracy': 'accuracy',
'precision': 'precision',
'recall': 'recall',
'f1': 'f1',
'roc_auc': 'roc_auc'
}
results = cross_validate(pipeline, X, y, cv=cv, scoring=scoring)
print(f"F1: {results['test_f1'].mean():.4f} (±{results['test_f1'].std():.4f})")  # scores live under 'test_<metric>' keys
5. Feature Scaling Comparison
| Scaler | Formula | Characteristics | When to Use |
|---|---|---|---|
| StandardScaler | (x - μ) / σ | Mean 0, Variance 1 | General cases |
| MinMaxScaler | (x - min) / (max - min) | [0, 1] range | Neural networks, images |
| RobustScaler | (x - Q2) / (Q3 - Q1) | Uses median, IQR | When many outliers |
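To see how the three scalers react to outliers, compare them on a toy feature with one extreme value (an assumption for illustration):
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
# Toy feature: four normal values and one extreme outlier
X = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])
for scaler in [StandardScaler(), MinMaxScaler(), RobustScaler()]:
    print(scaler.__class__.__name__, scaler.fit_transform(X).ravel().round(2))
# The outlier squeezes the normal values together under Standard/MinMax scaling;
# RobustScaler (median and IQR) keeps their spread intact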
Models that Require Scaling vs. Models that Don't
| Scaling Required | Scaling Not Required |
|---|---|
| Logistic Regression | Decision Tree |
| SVM | Random Forest |
| KNN | XGBoost, LightGBM |
| Neural Network | |
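A quick demonstration: distance-based KNN is sensitive to feature scale, while a decision tree, which only compares per-feature thresholds, is not. A minimal sketch on sklearn's built-in breast cancer dataset (chosen here only for illustration):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)
scaler = StandardScaler().fit(X_tr)  # fit on Train only, as always
for model in [KNeighborsClassifier(), DecisionTreeClassifier(random_state=42)]:
    raw = model.fit(X_tr, y_tr).score(X_te, y_te)
    scaled = model.fit(scaler.transform(X_tr), y_tr).score(scaler.transform(X_te), y_te)
    print(model.__class__.__name__, f"raw={raw:.3f}, scaled={scaled:.3f}")
# KNN typically gains accuracy after scaling; the tree's score barely moves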
Building a Pipeline
Basic Pipeline
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', RandomForestClassifier())
])
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
Separate Processing for Numeric/Categorical with ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
# Numeric preprocessing
numeric_transformer = Pipeline([
('imputer', SimpleImputer(strategy='median')),
('scaler', StandardScaler())
])
# Categorical preprocessing
categorical_transformer = Pipeline([
('imputer', SimpleImputer(strategy='most_frequent')),
('encoder', OneHotEncoder(handle_unknown='ignore'))
])
# Combine
preprocessor = ColumnTransformer([
('num', numeric_transformer, ['age', 'fare', 'sibsp', 'parch']),
('cat', categorical_transformer, ['pclass', 'sex', 'embarked'])
])
# Full pipeline (preprocessing + model)
pipeline = Pipeline([
('preprocessor', preprocessor),
('classifier', RandomForestClassifier(n_estimators=100))
])
# Train & Predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
Learning Curve Analysis
Diagnose overfitting/underfitting by watching how Train and Validation performance change as the training set grows:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import learning_curve
train_sizes, train_scores, val_scores = learning_curve(
pipeline, X, y,
train_sizes=np.linspace(0.1, 1.0, 10),
cv=5, scoring='accuracy'
)
# Visualization
plt.plot(train_sizes, train_scores.mean(axis=1), label='Train')
plt.plot(train_sizes, val_scores.mean(axis=1), label='Validation')
plt.xlabel('Training Set Size')
plt.ylabel('Accuracy')
plt.legend()
Interpretation:
- Both curves converging → Good sign
- Large gap → Overfitting (need regularization or simpler model)
- Both curves low → Underfitting (need more complex model)
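One final note that ties the sections together: pass the full pipeline, not a bare model, to cross_val_score and learning_curve. Scikit-learn then re-fits the preprocessing inside each fold, so no fold's statistics leak into another. A minimal sketch, assuming X and y are the raw (unscaled) data:
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
# The scaler is re-fit on each fold's training portion only
pipe = Pipeline([('scaler', StandardScaler()),
                 ('clf', LogisticRegression(max_iter=1000))])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(pipe, X, y, cv=cv, scoring='accuracy')
print(f"Leakage-free CV accuracy: {scores.mean():.4f}")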
Checklist
| Step | Check Items |
|---|---|
| Data Split | ☐ Use Stratified Split |
| | ☐ Train/Val/Test 3-way Split |
| | ☐ Use Test Set only for final evaluation |
| Preprocessing | ☐ Preprocess after splitting |
| | ☐ Use Pipeline to prevent Leakage |
| | ☐ fit_transform only on Train |
| Validation | ☐ Use Cross-Validation |
| | ☐ Use multiple metrics |
| | ☐ Check overfitting with Learning Curve |
Interview Questions Preview
- What is Data Leakage and how do you prevent it?
- What's the difference between K-Fold CV and Hold-out?
- When do you use Stratified Split?
- What's the difference between StandardScaler and RobustScaler?
- Why use a Pipeline?
Check out more interview questions at Premium Interviews.
Practice Notebook
Practice the above concepts with the Titanic dataset.
The notebook additionally covers:
- EDA (Exploratory Data Analysis) and visualization
- Performance comparison experiments with/without Leakage
- Model performance comparison across different Scalers
- Feature Importance analysis
- Practice problems
Next: 02. Linear Regression