02. Mastering Linear Regression
OLS, Gradient Descent, L1/L2 Regularization, VIF, Residual Analysis
Learning Objectives
After completing this tutorial, you will be able to:
- Understand the OLS (Ordinary Least Squares) formulation and implement it using the normal equation
- Implement the Gradient Descent algorithm and understand how convergence depends on the learning rate
- Understand the principles of L1 (Lasso) and L2 (Ridge) regularization and compare their effects
- Understand the multicollinearity problem and calculate the VIF (Variance Inflation Factor)
- Verify the linear regression assumptions through residual analysis
Key Concepts
1. Linear Regression Model
Linear regression models the linear relationship between input variable X and output variable y:
ŷ = w₀ + w₁x₁ + w₂x₂ + ... + wₚxₚ
- w₀: Intercept (bias)
- w₁...wₚ: Weights (coefficients) for each feature
"Linear" in linear regression means linear with respect to parameters (weights). Even if you transform input features (e.g., x²), it's still linear regression as long as weights are linearly combined.
2. Loss Function (MSE)
We minimize Mean Squared Error (MSE):
L(w) = (1/n) Σ(yᵢ - ŷᵢ)² = (1/n)||y - Xw||²
Why use MSE:
- Differentiable, easy to optimize
- Larger penalty for larger errors
- Closed-form solution exists
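A quick numeric check (made-up numbers) that the element-wise and vectorized forms of the MSE above agree:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.5, 6.0])

mse_sum = np.mean((y - y_hat) ** 2)                  # (1/n) Σ (yᵢ - ŷᵢ)²
mse_norm = np.linalg.norm(y - y_hat) ** 2 / len(y)   # (1/n) ||y - ŷ||²
print(mse_sum, mse_norm)  # both 0.5
```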
3. Normal Equation
Differentiating the loss function with respect to weights and setting to 0 gives the closed-form solution:
w = (XᵀX)⁻¹Xᵀy
From Scratch Implementation
```python
import numpy as np

class LinearRegressionScratch:
    def fit(self, X, y):
        # Add the bias term (column of ones) to X
        X_b = np.c_[np.ones((len(X), 1)), X]
        # Normal equation: w = (XᵀX)⁻¹Xᵀy
        w = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
        self.bias = w[0]
        self.weights = w[1:]
        return self

    def predict(self, X):
        return X @ self.weights + self.bias
```
Watch out for a missing intercept! Without the column of ones, the regression line is forced to pass through the origin. In experiments, the R² score can drop from 0.58 to -2.71 without the intercept.
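A minimal usage sketch on synthetic data (hypothetical numbers; scikit-learn's LinearRegression is used only to cross-check the result):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
true_w = np.array([1.5, -2.0, 0.7])
y = 4.0 + X @ true_w + rng.normal(scale=0.1, size=500)

scratch = LinearRegressionScratch().fit(X, y)
sk = LinearRegression().fit(X, y)

print(scratch.bias, scratch.weights)  # ≈ 4.0 and [1.5, -2.0, 0.7]
print(sk.intercept_, sk.coef_)        # should match closely
```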
Pros and Cons of Normal Equation
| Pros | Cons |
|---|---|
| Closed-form solution (computed at once) | O(n³) complexity (matrix inversion) |
| No learning rate tuning needed | Slow with many features |
| Guaranteed convergence | XᵀX may be non-invertible (see the sketch below) |
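For the non-invertible XᵀX case flagged in the table, a common workaround is to solve the least-squares problem with np.linalg.lstsq (or np.linalg.pinv) rather than an explicit inverse. A hedged sketch with a deliberately duplicated feature:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X = np.c_[X, X[:, 0]]  # duplicate a column -> XᵀX is singular
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

X_b = np.c_[np.ones((len(X), 1)), X]
# np.linalg.inv(X_b.T @ X_b) would fail or be numerically unstable here;
# lstsq returns a minimum-norm least-squares solution instead.
w, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(w)
```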
4. Gradient Descent
For many features or large data, use Gradient Descent which iteratively moves in the opposite direction of the gradient:
w_{t+1} = w_t - α * ∇L(w_t)
Where the gradient is:
∇L(w) = -(2/n) Xᵀ(y - Xw)
From Scratch Implementation
```python
import numpy as np

class LinearRegressionGD:
    def fit(self, X, y, lr=0.01, n_iter=1000):
        self.weights = np.zeros(X.shape[1])
        self.bias = 0.0
        self.loss_history = []
        n = len(y)
        for _ in range(n_iter):
            y_pred = X @ self.weights + self.bias
            error = y_pred - y
            # Compute gradients (this implementation uses the 1/(2n) loss below,
            # so the gradient factor is 1/n rather than 2/n)
            dw = (1 / n) * (X.T @ error)
            db = (1 / n) * np.sum(error)
            # Update weights in the opposite direction of the gradient
            self.weights -= lr * dw
            self.bias -= lr * db
            # Record loss
            loss = (1 / (2 * n)) * np.sum((y - y_pred) ** 2)
            self.loss_history.append(loss)
        return self

    def predict(self, X):
        return X @ self.weights + self.bias
```
Importance of Learning Rate Selection
| Learning Rate | Result | Example Convergence |
|---|---|---|
| Too small (0.001) | Slow convergence | Still not converged after 1000 iterations |
| Appropriate (0.1) | Fast and stable convergence | Converges in 100-200 iterations |
| Too large (1.0+) | Divergence or oscillation | Loss increases to infinity |
When using Gradient Descent, you must apply Feature Scaling. Features with different scales distort the loss function contours and hinder convergence.
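A hedged sketch combining both points (it assumes the LinearRegressionGD class above plus hypothetical X_train / y_train arrays):

```python
from sklearn.preprocessing import StandardScaler

# Scale features first: GD converges poorly on unscaled data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Compare convergence behavior for different learning rates
for lr in [0.001, 0.1, 1.5]:
    model = LinearRegressionGD().fit(X_scaled, y_train, lr=lr, n_iter=1000)
    print(f"lr={lr}: final loss = {model.loss_history[-1]:.4f}")
```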
OLS vs Gradient Descent Comparison
| Aspect | Normal Equation (OLS) | Gradient Descent |
|---|---|---|
| Computational Complexity | O(n³) | O(n·k·iter) |
| Number of Features | Recommended under 10,000 | Scalable to large |
| Memory | Need to store XᵀX | Only batch size |
| Hyperparameters | None | Need to tune lr, n_iter |
| Convergence Guarantee | Always optimal | Depends on lr |
5. Regularization
To prevent overfitting and control weights, add a penalty term to the loss function:
| Method | Loss Function | Effect | When to Use |
|---|---|---|---|
| Ridge (L2) | L(w) + λ‖w‖₂² | Keep weights small | When multicollinearity exists |
| Lasso (L1) | L(w) + λ‖w‖₁ | Some weights become 0 | When Feature Selection is needed |
| Elastic Net | L(w) + λ₁‖w‖₁ + λ₂‖w‖₂² | L1 + L2 combined | When many correlated features |
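As a quick illustration of the table (a hedged sketch on synthetic data where only the first two features matter), Lasso drives irrelevant coefficients exactly to 0 while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Only the first two features matter; the rest are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("Ridge:", np.round(ridge.coef_, 3))  # shrunk toward 0, but none exactly 0
print("Lasso:", np.round(lasso.coef_, 3))  # noise-feature coefficients exactly 0
```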
Geometric Interpretation of Ridge vs Lasso
L1 (Lasso): Diamond-shaped constraint
→ Contacts loss function contour at vertex → Some coefficients exactly 0
L2 (Ridge): Circular constraint
→ Contacts loss function contour on the curve → All coefficients shrink but do not reach 0
Finding Optimal Alpha (Cross-Validation)
```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

# Auto-search the optimal alpha with cross-validation
ridge_cv = RidgeCV(alphas=np.logspace(-4, 4, 50), cv=5)
ridge_cv.fit(X_train_scaled, y_train)
print(f"Best alpha: {ridge_cv.alpha_}")

lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, max_iter=10000)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Best alpha: {lasso_cv.alpha_}")
```
In experiments with the California Housing dataset, the R² scores of all three models (OLS, Ridge, Lasso) were nearly identical (~0.576). This is a case where basic linear regression is sufficient. Regularization is more useful when overfitting is suspected or multicollinearity exists.
6. Multicollinearity (VIF)
Multicollinearity occurs when independent variables have high correlation with each other.
VIF (Variance Inflation Factor) Calculation
Regress each feature on other features to calculate R²:
VIF = 1 / (1 - R²)
| VIF Value | Interpretation |
|---|---|
| 1 | No multicollinearity |
| 1-5 | Generally acceptable |
| 5-10 | Caution needed |
| > 10 | Severe multicollinearity → Consider removing variable |
VIF Calculation Implementation
```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def calculate_vif(X):
    """Compute the VIF for each column of a feature DataFrame."""
    vif_data = []
    for col in X.columns:
        # Regress the current feature on all the other features
        X_temp = X.drop(columns=[col])
        r2 = LinearRegression().fit(X_temp, X[col]).score(X_temp, X[col])
        vif = 1 / (1 - r2) if r2 < 1 else float('inf')
        vif_data.append({'Feature': col, 'VIF': vif})
    return pd.DataFrame(vif_data)
```
Problems with Multicollinearity
- Coefficient estimation instability: Coefficients change significantly with small data changes
- Difficulty interpreting coefficients: Hard to separate individual variable effects
- Increased standard errors: Statistical significance tests become unreliable
In the California Housing dataset, Latitude (VIF 9.2) and Longitude (VIF 8.9) had high VIF because the two variables are strongly related geographically. Solution: remove one of the variables or use Ridge regularization.
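A minimal usage sketch of calculate_vif (assuming scikit-learn's California Housing loader; exact VIF values depend on preprocessing and may differ from the numbers above):

```python
from sklearn.datasets import fetch_california_housing

# Feature DataFrame from the California Housing dataset (downloads on first use)
X = fetch_california_housing(as_frame=True).data

vif_table = calculate_vif(X)
print(vif_table.sort_values('VIF', ascending=False))
```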
7. Residual Analysis
Analyze residuals (residual = actual - predicted) to verify linear regression assumptions:
4 Main Assumptions
| Assumption | Verification Method | Action if Violated |
|---|---|---|
| Linearity | No pattern in residual vs predicted plot | Nonlinear transformation, polynomial regression |
| Normality | Q-Q Plot, Shapiro-Wilk test | Target transformation (log, etc.) |
| Homoscedasticity | Constant variance in residual vs predicted | Weighted least squares |
| Independence | Durbin-Watson test | Consider time series model |
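For the independence row in the table, a hedged sketch of the Durbin-Watson check using statsmodels (assumes y_test and y_pred from a fitted model; values near 2 suggest little autocorrelation):

```python
from statsmodels.stats.stattools import durbin_watson

residuals = y_test - y_pred  # assumes y_test / y_pred from a fitted model
print(f"Durbin-Watson: {durbin_watson(residuals):.3f}")  # ≈ 2 → little autocorrelation
```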
Basic residual statistics and a normality check:
```python
from scipy import stats

residuals = y_test - y_pred

# Basic statistics
print(f"Mean: {residuals.mean():.6f}")               # Should be close to 0
print(f"Skewness: {stats.skew(residuals):.4f}")      # Should be close to 0
print(f"Kurtosis: {stats.kurtosis(residuals):.4f}")  # Excess kurtosis; should be close to 0

# Normality test (p-value accuracy degrades for large N, hence the slice)
_, p_value = stats.shapiro(residuals[:5000])
print(f"Shapiro-Wilk p-value: {p_value:.6f}")
```
Code Summary
```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error

# Scaling (required for GD and regularization!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train and compare models
models = {
    'OLS': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1)
}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    print(f"{name}: R²={r2_score(y_test, y_pred):.4f}, "
          f"RMSE={np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
```
Checklist
| Step | Check Items |
|---|---|
| Data Preprocessing | ☐ Handle missing values |
| | ☐ Feature Scaling (StandardScaler recommended) |
| | ☐ Check and handle outliers |
| Model Selection | ☐ Check multicollinearity (calculate VIF) |
| | ☐ Remove variables with VIF > 10 or use Ridge |
| | ☐ Use Lasso when Feature Selection needed |
| Validation | ☐ Residual analysis (linearity, normality, homoscedasticity) |
| | ☐ Tune alpha with Cross-Validation |
| | ☐ Final performance evaluation on Test Set |
Interview Questions Preview
- What's the difference between OLS and Gradient Descent?
- What's the difference between Ridge and Lasso? When do you use each?
- Why is multicollinearity a problem? How do you solve it?
- What are the assumptions of linear regression and how do you verify them?
- What happens when alpha (lambda) in regularization is too large?
Check out more interview questions at Premium Interviews.
Practice Notebook
Practice the above concepts with the California Housing dataset.
The notebook additionally covers:
- Detailed EDA (Exploratory Data Analysis) and visualization
- Gradient Descent convergence animation
- Convergence speed comparison by learning rate
- Visualizing coefficient instability from multicollinearity using Bootstrap
- 4-panel residual analysis plot
- Practice problems (Mini-batch GD, Elastic Net, Polynomial Regression)
Previous: 01. ML Pipeline | Next: 03. Logistic Regression