02. Mastering Linear Regression
OLS, Gradient Descent, L1/L2 Regularization, VIF, Residual Analysis
Learning Objectives
After completing this tutorial, you will be able to:
- Understand the OLS (Ordinary Least Squares) formulation and implement it using the normal equation
- Implement the Gradient Descent algorithm and understand how convergence depends on the learning rate
- Understand the principles of L1 (Lasso) and L2 (Ridge) regularization and compare their effects
- Understand the multicollinearity problem and calculate the VIF (Variance Inflation Factor)
- Verify the linear regression assumptions through residual analysis
Key Concepts
1. Linear Regression Model
Linear regression models the linear relationship between input variable X and output variable y:
ŷ = w₀ + w₁x₁ + w₂x₂ + ... + wₚxₚ
- w₀: Intercept (bias)
- w₁...wₚ: Weights (coefficients) for each feature
"Linear" in linear regression means linear with respect to parameters (weights). Even if you transform input features (e.g., x²), it's still linear regression as long as weights are linearly combined.
2. Loss Function (MSE)
We minimize Mean Squared Error (MSE):
L(w) = (1/n) Σ(yᵢ - ŷᵢ)² = (1/n)||y - Xw||²
Why use MSE:
- Differentiable, easy to optimize
- Larger penalty for larger errors
- Closed-form solution exists
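A quick numeric check (made-up numbers) that the element-wise and vectorized forms of the MSE above agree:

```python
import numpy as np

y = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.5, 5.5, 6.0])

mse_sum = np.mean((y - y_hat) ** 2)                  # (1/n) Σ (yᵢ - ŷᵢ)²
mse_norm = np.linalg.norm(y - y_hat) ** 2 / len(y)   # (1/n) ||y - ŷ||²
print(mse_sum, mse_norm)  # both 0.5
```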
3. Normal Equation
Differentiating the loss function with respect to weights and setting to 0 gives the closed-form solution:
w = (XᵀX)⁻¹Xᵀy
From Scratch Implementation
```python
import numpy as np

class LinearRegressionScratch:
    def fit(self, X, y):
        # Add the bias term (column of ones) to X
        X_b = np.c_[np.ones((len(X), 1)), X]
        # Normal equation: w = (XᵀX)⁻¹Xᵀy
        w = np.linalg.inv(X_b.T @ X_b) @ X_b.T @ y
        self.bias = w[0]
        self.weights = w[1:]
        return self

    def predict(self, X):
        return X @ self.weights + self.bias
```
Watch out for a missing intercept! Without the column of ones, the regression line is forced to pass through the origin. In experiments, the R² score can drop from 0.58 to -2.71 without the intercept.
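A minimal usage sketch on synthetic data (hypothetical numbers; scikit-learn's LinearRegression is used only to cross-check the result):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
true_w = np.array([1.5, -2.0, 0.7])
y = 4.0 + X @ true_w + rng.normal(scale=0.1, size=500)

scratch = LinearRegressionScratch().fit(X, y)
sk = LinearRegression().fit(X, y)

print(scratch.bias, scratch.weights)  # ≈ 4.0 and [1.5, -2.0, 0.7]
print(sk.intercept_, sk.coef_)        # should match closely
```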
Pros and Cons of Normal Equation
| Pros | Cons |
|---|---|
| Closed-form solution (computed at once) | O(n³) complexity (matrix inversion) |
| No learning rate tuning needed | Slow with many features |
| Guaranteed convergence | XᵀX may be non-invertible (see the sketch below) |
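For the non-invertible XᵀX case flagged in the table, a common workaround is to solve the least-squares problem with np.linalg.lstsq (or np.linalg.pinv) rather than an explicit inverse. A hedged sketch with a deliberately duplicated feature:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
X = np.c_[X, X[:, 0]]  # duplicate a column -> XᵀX is singular
y = X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

X_b = np.c_[np.ones((len(X), 1)), X]
# np.linalg.inv(X_b.T @ X_b) would fail or be numerically unstable here;
# lstsq returns a minimum-norm least-squares solution instead.
w, *_ = np.linalg.lstsq(X_b, y, rcond=None)
print(w)
```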
4. Gradient Descent
For many features or large data, use Gradient Descent which iteratively moves in the opposite direction of the gradient:
w_{t+1} = w_t - α * ∇L(w_t)
Where the gradient is:
∇L(w) = -(2/n) Xᵀ(y - Xw)
From Scratch Implementation
```python
import numpy as np

class LinearRegressionGD:
    def fit(self, X, y, lr=0.01, n_iter=1000):
        self.weights = np.zeros(X.shape[1])
        self.bias = 0.0
        self.loss_history = []
        n = len(y)
        for _ in range(n_iter):
            y_pred = X @ self.weights + self.bias
            error = y_pred - y
            # Compute gradients (this implementation uses the 1/(2n) loss below,
            # so the gradient factor is 1/n rather than 2/n)
            dw = (1 / n) * (X.T @ error)
            db = (1 / n) * np.sum(error)
            # Update weights in the opposite direction of the gradient
            self.weights -= lr * dw
            self.bias -= lr * db
            # Record loss
            loss = (1 / (2 * n)) * np.sum((y - y_pred) ** 2)
            self.loss_history.append(loss)
        return self

    def predict(self, X):
        return X @ self.weights + self.bias
```
Importance of Learning Rate Selection
| Learning Rate | Result | Example Convergence |
|---|---|---|
| Too small (0.001) | Slow convergence | Still not converged after 1000 iterations |
| Appropriate (0.1) | Fast and stable convergence | Converges in 100-200 iterations |
| Too large (1.0+) | Divergence or oscillation | Loss increases to infinity |
When using Gradient Descent, you must apply Feature Scaling. Features with different scales distort the loss function contours and hinder convergence.
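A hedged sketch combining both points (it assumes the LinearRegressionGD class above plus hypothetical X_train / y_train arrays):

```python
from sklearn.preprocessing import StandardScaler

# Scale features first: GD converges poorly on unscaled data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_train)

# Compare convergence behavior for different learning rates
for lr in [0.001, 0.1, 1.5]:
    model = LinearRegressionGD().fit(X_scaled, y_train, lr=lr, n_iter=1000)
    print(f"lr={lr}: final loss = {model.loss_history[-1]:.4f}")
```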
OLS vs Gradient Descent Comparison
| Aspect | Normal Equation (OLS) | Gradient Descent |
|---|---|---|
| Computational Complexity | O(n³) | O(n·k·iter) |
| Number of Features | Recommended under 10,000 | Scalable to large |
| Memory | Need to store XᵀX | Only batch size |
| Hyperparameters | None | Need to tune lr, n_iter |
| Convergence Guarantee | Always optimal | Depends on lr |
5. Regularization
To prevent overfitting and control weights, add a penalty term to the loss function:
| Method | Loss Function | Effect | When to Use |
|---|---|---|---|
| Ridge (L2) | L(w) + λ‖w‖₂² | Keep weights small | When multicollinearity exists |
| Lasso (L1) | L(w) + λ‖w‖₁ | Some weights become 0 | When Feature Selection is needed |
| Elastic Net | L(w) + λ₁‖w‖₁ + λ₂‖w‖₂² | L1 + L2 combined | When many correlated features |
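As a quick illustration of the table (a hedged sketch on synthetic data where only the first two features matter), Lasso drives irrelevant coefficients exactly to 0 while Ridge only shrinks them:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))
# Only the first two features matter; the rest are noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=300)

ridge = Ridge(alpha=10.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print("Ridge:", np.round(ridge.coef_, 3))  # shrunk toward 0, but none exactly 0
print("Lasso:", np.round(lasso.coef_, 3))  # noise-feature coefficients exactly 0
```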
Geometric Interpretation of Ridge vs Lasso
L1 (Lasso): Diamond-shaped constraint
→ Contacts loss function contour at vertex → Some coefficients exactly 0
L2 (Ridge): Circular constraint
→ Contacts loss function contour on the curve → All coefficients shrink but do not reach 0
Finding Optimal Alpha (Cross-Validation)
```python
import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV

# Auto-search the optimal alpha with cross-validation
ridge_cv = RidgeCV(alphas=np.logspace(-4, 4, 50), cv=5)
ridge_cv.fit(X_train_scaled, y_train)
print(f"Best alpha: {ridge_cv.alpha_}")

lasso_cv = LassoCV(alphas=np.logspace(-4, 1, 50), cv=5, max_iter=10000)
lasso_cv.fit(X_train_scaled, y_train)
print(f"Best alpha: {lasso_cv.alpha_}")
```
In experiments with the California Housing dataset, the R² scores of all three models (OLS, Ridge, Lasso) were nearly identical (~0.576). This is a case where basic linear regression is sufficient. Regularization is more useful when overfitting is suspected or multicollinearity exists.
6. Multicollinearity (VIF)
Multicollinearity occurs when independent variables have high correlation with each other.
VIF (Variance Inflation Factor) Calculation
Regress each feature on other features to calculate R²:
VIF = 1 / (1 - R²)
| VIF Value | Interpretation |
|---|---|
| 1 | No multicollinearity |
| 1-5 | Generally acceptable |
| 5-10 | Caution needed |
| > 10 | Severe multicollinearity → Consider removing variable |
VIF Calculation Implementation
```python
import pandas as pd
from sklearn.linear_model import LinearRegression

def calculate_vif(X):
    """Compute the VIF for each column of a feature DataFrame."""
    vif_data = []
    for col in X.columns:
        # Regress the current feature on all the other features
        X_temp = X.drop(columns=[col])
        r2 = LinearRegression().fit(X_temp, X[col]).score(X_temp, X[col])
        vif = 1 / (1 - r2) if r2 < 1 else float('inf')
        vif_data.append({'Feature': col, 'VIF': vif})
    return pd.DataFrame(vif_data)
```
Problems with Multicollinearity
- Coefficient estimation instability: Coefficients change significantly with small data changes
- Difficulty interpreting coefficients: Hard to separate individual variable effects
- Increased standard errors: Statistical significance tests become unreliable
In the California Housing dataset, Latitude (VIF 9.2) and Longitude (VIF 8.9) had high VIF because the two variables are strongly related geographically. Solution: remove one of the variables or use Ridge regularization.
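A minimal usage sketch of calculate_vif (assuming scikit-learn's California Housing loader; exact VIF values depend on preprocessing and may differ from the numbers above):

```python
from sklearn.datasets import fetch_california_housing

# Feature DataFrame from the California Housing dataset (downloads on first use)
X = fetch_california_housing(as_frame=True).data

vif_table = calculate_vif(X)
print(vif_table.sort_values('VIF', ascending=False))
```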
7. Residual Analysis
Analyze residuals (residual = actual - predicted) to verify linear regression assumptions:
4 Main Assumptions
| Assumption | Verification Method | Action if Violated |
|---|---|---|
| Linearity | No pattern in residual vs predicted plot | Nonlinear transformation, polynomial regression |
| Normality | Q-Q Plot, Shapiro-Wilk test | Target transformation (log, etc.) |
| Homoscedasticity | Constant variance in residual vs predicted | Weighted least squares |
| Independence | Durbin-Watson test | Consider time series model |
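For the independence row in the table, a hedged sketch of the Durbin-Watson check using statsmodels (assumes y_test and y_pred from a fitted model; values near 2 suggest little autocorrelation):

```python
from statsmodels.stats.stattools import durbin_watson

residuals = y_test - y_pred  # assumes y_test / y_pred from a fitted model
print(f"Durbin-Watson: {durbin_watson(residuals):.3f}")  # ≈ 2 → little autocorrelation
```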
Basic residual statistics and a normality check:
```python
from scipy import stats

residuals = y_test - y_pred

# Basic statistics
print(f"Mean: {residuals.mean():.6f}")               # Should be close to 0
print(f"Skewness: {stats.skew(residuals):.4f}")      # Should be close to 0
print(f"Kurtosis: {stats.kurtosis(residuals):.4f}")  # Excess kurtosis; should be close to 0

# Normality test (p-value accuracy degrades for large N, hence the slice)
_, p_value = stats.shapiro(residuals[:5000])
print(f"Shapiro-Wilk p-value: {p_value:.6f}")
```
Code Summary
```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score, mean_squared_error

# Scaling (required for GD and regularization!)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train and compare models
models = {
    'OLS': LinearRegression(),
    'Ridge': Ridge(alpha=1.0),
    'Lasso': Lasso(alpha=0.1)
}
for name, model in models.items():
    model.fit(X_train_scaled, y_train)
    y_pred = model.predict(X_test_scaled)
    print(f"{name}: R²={r2_score(y_test, y_pred):.4f}, "
          f"RMSE={np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")
```
Checklist
| Step | Check Items |
|---|---|
| Data Preprocessing | ☐ Handle missing values |
| | ☐ Feature Scaling (StandardScaler recommended) |
| | ☐ Check and handle outliers |
| Model Selection | ☐ Check multicollinearity (calculate VIF) |
| | ☐ Remove variables with VIF > 10 or use Ridge |
| | ☐ Use Lasso when Feature Selection needed |
| Validation | ☐ Residual analysis (linearity, normality, homoscedasticity) |
| | ☐ Tune alpha with Cross-Validation |
| | ☐ Final performance evaluation on Test Set |
Interview Questions Preview
- What's the difference between OLS and Gradient Descent?
- What's the difference between Ridge and Lasso? When do you use each?
- Why is multicollinearity a problem? How do you solve it?
- What are the assumptions of linear regression and how do you verify them?
- What happens when alpha (lambda) in regularization is too large?
Check out more interview questions at Premium Interviews.
Practice Notebook
Practice the above concepts with the California Housing dataset.
The notebook additionally covers:
- Detailed EDA (Exploratory Data Analysis) and visualization
- Gradient Descent convergence animation
- Convergence speed comparison by learning rate
- Visualizing coefficient instability from multicollinearity using Bootstrap
- 4-panel residual analysis plot
- Practice problems (Mini-batch GD, Elastic Net, Polynomial Regression)
Previous: 01. ML Pipeline | Next: 03. Logistic Regression