
02. Mastering Linear Regression

OLS, Gradient Descent, L1/L2 Regularization, VIF, Residual Analysis


Learning Objectives

After completing this tutorial, you will be able to:

  1. Understand the OLS (Ordinary Least Squares) formulation and implement it using the normal equation
  2. Implement the Gradient Descent algorithm and understand how convergence depends on the learning rate
  3. Understand the principles of L1 (Lasso) and L2 (Ridge) regularization and compare their effects
  4. Understand the multicollinearity problem and calculate the VIF (Variance Inflation Factor)
  5. Verify the linear regression assumptions through residual analysis

Key Concepts

1. Linear Regression Model

Linear regression models a linear relationship between the input features X and the output variable y:

ŷ = w₀ + w₁x₁ + w₂x₂ + ... + wₚxₚ
  • w₀: Intercept (bias)
  • w₁...wₚ: Weights for each feature (coefficients)

"Linear" in linear regression means linear with respect to parameters (weights). Even if you transform input features (e.g., x²), it's still linear regression as long as weights are linearly combined.


2. Loss Function (MSE)

We minimize Mean Squared Error (MSE):

L(w) = (1/n) Σ(yᵢ - ŷᵢ)² = (1/n)||y - Xw||²

Why use MSE:

  • Differentiable, easy to optimize
  • Larger penalty for larger errors
  • Closed-form solution exists
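
As a quick sanity check, the two forms of the loss written above give the same number; a minimal numpy sketch with made-up values:

import numpy as np

y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.5, 6.0])

mse_sum = np.mean((y - y_hat) ** 2)                    # (1/n) Σ (yᵢ - ŷᵢ)²
mse_norm = np.linalg.norm(y - y_hat) ** 2 / len(y)     # (1/n) ||y - ŷ||²
print(mse_sum, mse_norm)  # both 0.5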

3. Normal Equation

Differentiating the loss function with respect to the weights and setting the gradient to zero gives the closed-form solution:

w = (XᵀX)⁻¹Xᵀy

From Scratch Implementation

import numpy as np

class LinearRegressionScratch:
    def fit(self, X, y):
        # Add bias term (prepend a column of ones to X)
        X_b = np.c_[np.ones((len(X), 1)), X]
        # Normal equation: w = (XᵀX)⁻¹Xᵀy
        w = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
        self.bias = w[0]
        self.weights = w[1:]
        return self

    def predict(self, X):
        return X @ self.weights + self.bias
🚫

Watch out for a missing intercept! Without the column of ones, the regression line is forced to pass through the origin. In experiments, the R² score dropped from 0.58 to -2.71 without the intercept.
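
A quick way to validate the scratch implementation is to compare it with scikit-learn on synthetic data (the data below is illustrative, not from the tutorial); both should recover essentially the same intercept and weights:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 4.0 + X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

scratch = LinearRegressionScratch().fit(X, y)
sk = LinearRegression().fit(X, y)

print(np.allclose(scratch.weights, sk.coef_))    # True
print(np.allclose(scratch.bias, sk.intercept_))  # True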

Pros and Cons of Normal Equation

| Pros | Cons |
| --- | --- |
| Closed-form solution (computed in one step) | O(n³) complexity (matrix inversion) |
| No learning rate tuning needed | Slow with many features |
| Guaranteed convergence | XᵀX may be non-invertible |

4. Gradient Descent

For many features or large datasets, use Gradient Descent, which iteratively moves the weights in the direction opposite to the gradient:

w_{t+1} = w_t - α * ∇L(w_t)

Where the gradient is:

∇L(w) = -(2/n) Xᵀ(y - Xw)

From Scratch Implementation

import numpy as np

class LinearRegressionGD:
    def fit(self, X, y, lr=0.01, n_iter=1000):
        self.weights = np.zeros(X.shape[1])
        self.bias = 0
        self.loss_history = []

        for _ in range(n_iter):
            y_pred = X @ self.weights + self.bias
            error = y_pred - y

            # Compute gradients (of the ½·MSE loss, so the factor of 2 cancels)
            dw = (1/len(y)) * (X.T @ error)
            db = (1/len(y)) * np.sum(error)

            # Update weights in the direction opposite to the gradient
            self.weights -= lr * dw
            self.bias -= lr * db

            # Record loss (½·MSE convention)
            loss = (1/(2*len(y))) * np.sum((y - y_pred)**2)
            self.loss_history.append(loss)

        return self

Importance of Learning Rate Selection

| Learning Rate | Result | Example Convergence |
| --- | --- | --- |
| Too small (0.001) | Slow convergence | Still not converged after 1000 iterations |
| Appropriate (0.1) | Fast and stable convergence | Converges in 100-200 iterations |
| Too large (1.0+) | Divergence or oscillation | Loss increases to infinity |
⚠️

When using Gradient Descent, you must apply Feature Scaling. Features with different scales distort the loss function contours and hinder convergence.
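
A minimal sketch of how the learning rate affects convergence, using the LinearRegressionGD class above on standardized synthetic data (the data and learning rates are illustrative, not the tutorial's experiments):

import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=500)
X_scaled = StandardScaler().fit_transform(X)  # scaling keeps the loss contours well-conditioned

for lr in [0.001, 0.1, 2.5]:
    model = LinearRegressionGD().fit(X_scaled, y, lr=lr, n_iter=200)
    print(f"lr={lr}: loss after 200 iterations = {model.loss_history[-1]:.4f}")
# Expected pattern: lr=0.001 is still far from converged, lr=0.1 reaches the
# noise floor, and lr=2.5 diverges (the loss grows without bound on this data).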

OLS vs Gradient Descent Comparison

| Aspect | Normal Equation (OLS) | Gradient Descent |
| --- | --- | --- |
| Computational Complexity | O(n³) | O(n·k·iter) |
| Number of Features | Recommended under 10,000 | Scales to very large feature counts |
| Memory | Must store XᵀX | Only the current batch |
| Hyperparameters | None | lr and n_iter must be tuned |
| Convergence Guarantee | Always reaches the optimum | Depends on lr |

5. Regularization

To prevent overfitting and control weights, add a penalty term to the loss function:

| Method | Loss Function | Effect | When to Use |
| --- | --- | --- | --- |
| Ridge (L2) | L(w) + λ‖w‖₂² | Keeps weights small | When multicollinearity exists |
| Lasso (L1) | L(w) + λ‖w‖₁ | Some weights become exactly 0 | When feature selection is needed |
| Elastic Net | L(w) + λ₁‖w‖₁ + λ₂‖w‖₂² | L1 + L2 combined | When many correlated features exist |

Geometric Interpretation of Ridge vs Lasso

L1 (Lasso): Diamond-shaped constraint region
  → Touches the loss contours at a corner → some coefficients become exactly 0

L2 (Ridge): Circular constraint region
  → Touches the loss contours on its smooth boundary → all coefficients shrink, but not to exactly 0
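
A small sketch that makes this geometric picture concrete: on standardized data where only a few features matter, Lasso zeroes out the irrelevant coefficients while Ridge only shrinks them (the synthetic data and alpha values below are illustrative):

import numpy as np
from sklearn.linear_model import Ridge, Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))
# Only the first three features actually influence y
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 * X[:, 2] + rng.normal(scale=0.5, size=300)

X_scaled = StandardScaler().fit_transform(X)
ridge = Ridge(alpha=10.0).fit(X_scaled, y)
lasso = Lasso(alpha=0.1).fit(X_scaled, y)

print("Ridge coefficients equal to 0:", np.sum(ridge.coef_ == 0))  # typically 0
print("Lasso coefficients equal to 0:", np.sum(lasso.coef_ == 0))  # typically several (the noise features)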

Finding Optimal Alpha (Cross-Validation)

import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

# Auto-search for the optimal alpha with cross-validation
ridge_cv = RidgeCV(alphas=np.logspace(-4, 4, 50), cv=5)
ridge_cv.fit(X_train_scaled, y_train)
print(f"Best alpha: {ridge_cv.alpha_}")

lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, max_iter=10000)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Best alpha: {lasso_cv.alpha_}")

In experiments with the California Housing dataset, the R² scores of all three models (OLS, Ridge, Lasso) were nearly identical (~0.576). This is a case where basic linear regression is sufficient; regularization is more useful when overfitting is suspected or multicollinearity exists.


6. Multicollinearity (VIF)

Multicollinearity occurs when independent variables have high correlation with each other.

VIF (Variance Inflation Factor) Calculation

Regress each feature on the remaining features and use the resulting R²:

VIF = 1 / (1 - R²)

| VIF Value | Interpretation |
| --- | --- |
| 1 | No multicollinearity |
| 1-5 | Generally acceptable |
| 5-10 | Caution needed |
| > 10 | Severe multicollinearity → consider removing the variable |

VIF Calculation Implementation

import pandas as pd
from sklearn.linear_model import LinearRegression

def calculate_vif(X):
    vif_data = []
    for col in X.columns:
        # Regress this column on all remaining columns
        X_temp = X.drop(columns=[col])
        r2 = LinearRegression().fit(X_temp, X[col]).score(X_temp, X[col])
        vif = 1 / (1 - r2) if r2 < 1 else float('inf')
        vif_data.append({'Feature': col, 'VIF': vif})
    return pd.DataFrame(vif_data)
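
As a quick sanity check of the function above, here is a small synthetic DataFrame with one deliberately collinear column (the column names are illustrative):

import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
df = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
df["x3"] = 0.9 * df["x1"] + rng.normal(scale=0.1, size=200)  # nearly collinear with x1

print(calculate_vif(df))  # x1 and x3 should show large VIFs; x2 stays near 1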

Problems with Multicollinearity

  1. Coefficient estimation instability: Coefficients change significantly with small data changes
  2. Difficulty interpreting coefficients: Hard to separate individual variable effects
  3. Increased standard errors: Statistical significance tests become unreliable
⚠️

In the California Housing dataset, Latitude (VIF 9.2) and Longitude (VIF 8.9) had high VIF because the two variables are strongly geographically related. Solution: remove one of the variables or use Ridge regularization.
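
The coefficient instability described in point 1 above can be made visible with a quick bootstrap sketch (the practice notebook visualizes this in more detail); the synthetic data and resampling setup here are illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(3)
n = 300
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + rng.normal(scale=0.1, size=n)  # highly correlated with x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.5, size=n)

coefs = []
for _ in range(200):
    idx = rng.integers(0, n, size=n)            # bootstrap resample with replacement
    coefs.append(LinearRegression().fit(X[idx], y[idx]).coef_)
coefs = np.array(coefs)

# With x1 and x2 nearly collinear, the coefficients swing widely across resamples;
# with independent features the spread would be far smaller.
print("Coefficient std across bootstrap samples:", coefs.std(axis=0))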


7. Residual Analysis

Analyze residuals (residual = actual - predicted) to verify linear regression assumptions:

4 Main Assumptions

| Assumption | Verification Method | Action if Violated |
| --- | --- | --- |
| Linearity | No pattern in the residuals vs. predicted plot | Nonlinear transformation, polynomial regression |
| Normality | Q-Q plot, Shapiro-Wilk test | Target transformation (log, etc.) |
| Homoscedasticity | Constant variance in the residuals vs. predicted plot | Weighted least squares |
| Independence | Durbin-Watson test | Consider a time series model |

from scipy import stats
 
residuals = y_test - y_pred
 
# Basic statistics
print(f"Mean: {residuals.mean():.6f}")  # Should be close to 0
print(f"Skewness: {stats.skew(residuals):.4f}")  # Should be close to 0
print(f"Kurtosis: {stats.kurtosis(residuals):.4f}")  # Should be close to 0
 
# Normality test (Shapiro-Wilk is most reliable for n <= 5000, hence the slice)
_, p_value = stats.shapiro(residuals[:5000])
print(f"Shapiro-Wilk p-value: {p_value:.6f}")

Code Summary

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso, RidgeCV
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error

# Scaling (required for GD and regularization!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train and compare models
models = {
    'OLS': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1)
}

for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    print(f"{name}: R²={r2_score(y_test, y_pred):.4f}, "
          f"RMSE={np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

Checklist

Data Preprocessing
  ☐ Handle missing values
  ☐ Feature Scaling (StandardScaler recommended)
  ☐ Check and handle outliers

Model Selection
  ☐ Check multicollinearity (calculate VIF)
  ☐ Remove variables with VIF > 10 or use Ridge
  ☐ Use Lasso when Feature Selection is needed

Validation
  ☐ Residual analysis (linearity, normality, homoscedasticity)
  ☐ Tune alpha with Cross-Validation
  ☐ Final performance evaluation on the Test Set

Interview Questions Preview

  1. What's the difference between OLS and Gradient Descent?
  2. What's the difference between Ridge and Lasso? When do you use each?
  3. Why is multicollinearity a problem? How do you solve it?
  4. What are the assumptions of linear regression and how do you verify them?
  5. What happens when alpha (lambda) in regularization is too large?

Check out more interview questions at Premium Interviews.


Practice Notebook

Practice the above concepts with the California Housing dataset:

The notebook additionally covers:

  • Detailed EDA (Exploratory Data Analysis) and visualization
  • Gradient Descent convergence animation
  • Convergence speed comparison by learning rate
  • Visualizing coefficient instability from multicollinearity using Bootstrap
  • 4-panel residual analysis plot
  • Practice problems (Mini-batch GD, Elastic Net, Polynomial Regression)
