11. Time Series Analysis
ARIMA, Seasonality, Forecasting
Learning Objectives
After completing this tutorial, you will be able to:
- Understand time series data characteristics (Trend, Seasonality, Stationarity)
- Perform Time Series Decomposition
- Execute and interpret Stationarity tests (ADF Test)
- Understand and apply ARIMA models
- Select parameters using ACF/PACF
- Evaluate forecasting performance (RMSE, MAE, MAPE)
Key Concepts
1. What is Time Series Data?
Data ordered by time, whose temporal structure sets it apart from ordinary cross-sectional data.
| Characteristic | Description |
|---|---|
| Order matters | Observations are time-indexed; shuffling them destroys information |
| Autocorrelation | Past values are correlated with, and help predict, future values |
3 Components of Time Series
| Component | Description |
|---|---|
| Trend | Long-term increase/decrease pattern |
| Seasonality | Periodic repeating pattern |
| Noise | Irregular variation |
```python
# Time series data creation example
import numpy as np
import pandas as pd

np.random.seed(42)
dates = pd.date_range(start='2020-01-01', periods=730, freq='D')

# Components
trend = np.linspace(100, 200, 730)                           # Linear trend
seasonality = 30 * np.sin(2 * np.pi * np.arange(730) / 365)  # Annual seasonality
weekly = 10 * np.sin(2 * np.pi * np.arange(730) / 7)         # Weekly pattern
noise = np.random.normal(0, 10, 730)                         # Noise

# Final time series
values = trend + seasonality + weekly + noise
df = pd.DataFrame({'date': dates, 'sales': values})
df.set_index('date', inplace=True)
```
2. Time Series Decomposition
Decompose time series into trend, seasonality, and residuals to analyze each component.
Additive vs Multiplicative Model
| Model | Formula | When to Use |
|---|---|---|
| Additive | Yt = Tt + St + Rt | When seasonal variation is constant |
| Multiplicative | Yt = Tt × St × Rt | When seasonal variation grows with trend |
- T: Trend
- S: Seasonal
- R: Residual
```python
from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose with additive model
decomposition = seasonal_decompose(df['sales'], model='additive', period=365)

# Access each component
print(f'Trend start: {decomposition.trend.dropna().iloc[0]:.2f}')
print(f'Seasonality range: [{decomposition.seasonal.min():.2f}, {decomposition.seasonal.max():.2f}]')
print(f'Residual std: {decomposition.resid.std():.2f}')
```
Use the multiplicative model for data like airline passengers, where seasonal variation grows as the trend increases.
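As a quick illustration, here is a minimal sketch on synthetic monthly data (not the tutorial's dataset) showing that a multiplicative decomposition returns seasonal factors oscillating around 1.0 rather than around 0:

```python
import numpy as np
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose

# Synthetic monthly series whose seasonal swing grows with the trend
np.random.seed(0)
t = np.arange(96)
values = (100 + 2 * t) * (1 + 0.2 * np.sin(2 * np.pi * t / 12)) + np.random.normal(0, 3, 96)
s = pd.Series(values, index=pd.date_range('2015-01-01', periods=96, freq='MS'))

decomposition = seasonal_decompose(s, model='multiplicative', period=12)
# Multiplicative seasonal factors oscillate around 1.0 (additive: around 0)
print(decomposition.seasonal.iloc[:12].round(3))
```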
3. Stationarity
Core assumption of time series analysis: Statistical properties are constant over time
| Condition | Description |
|---|---|
| Constant mean | E[Yt] = μ |
| Constant variance | Var(Yt) = σ² |
| Autocovariance | Cov(Yt, Yt-k) = γk (depends only on lag) |
Why is Stationarity Important? Most time series models (like ARIMA) assume stationarity. Non-stationary series require transformation.
Stationarity Test (ADF Test)
Use ADF (Augmented Dickey-Fuller) test to check stationarity.
```python
from statsmodels.tsa.stattools import adfuller

def adf_test(series, name=''):
    result = adfuller(series.dropna(), autolag='AIC')
    print(f'=== ADF Test: {name} ===')
    print(f'ADF Statistic: {result[0]:.4f}')
    print(f'p-value: {result[1]:.4f}')
    print('Critical Values:')
    for key, value in result[4].items():
        print(f'  {key}: {value:.4f}')
    if result[1] < 0.05:
        print('\n→ Stationary')
    else:
        print('\n→ Non-stationary')

adf_test(df['sales'], 'Original Series')
# Interpretation: p-value < 0.05 → stationary
```
Non-stationary → Stationary Transformation
```python
# `series` stands for any pandas Series, e.g. df['sales']

# First differencing
series_diff = series.diff().dropna()

# Log transform + differencing (when variance grows with the level)
series_log_diff = np.log(series).diff().dropna()
```
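To verify that the transformation worked, a natural follow-up is to re-run the ADF test on the differenced series; this sketch reuses `df` and the `adf_test` helper defined above:

```python
# Re-test stationarity after first differencing
sales_diff = df['sales'].diff().dropna()
adf_test(sales_diff, 'First Difference')
# Differencing removes the linear trend, so the p-value should now be
# well below 0.05 → stationary
```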
4. ACF and PACF
Key tools for determining ARIMA model parameters.
| Metric | Meaning | Formula | ARIMA Application |
|---|---|---|---|
| ACF | Correlation by lag | Corr(Yt, Yt-k) | Determines MA(q) order |
| PACF | Pure lag correlation | Corr(Yt, Yt-k) with intermediate lags 1..k-1 removed | Determines AR(p) order |
```python
import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(1, 2, figsize=(12, 4))
plot_acf(series_diff, ax=axes[0], lags=50, alpha=0.05)
plot_pacf(series_diff, ax=axes[1], lags=50, alpha=0.05)
plt.tight_layout()
plt.show()
```
ACF/PACF Interpretation Guide
| Pattern | ACF | PACF | Model |
|---|---|---|---|
| AR(p) | Exponential decay | Cut off at p | ARIMA(p,d,0) |
| MA(q) | Cut off at q | Exponential decay | ARIMA(0,d,q) |
| ARMA | Exponential decay | Exponential decay | ARIMA(p,d,q) |
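The same information can be read off numerically. A small sketch (reusing `series_diff` from the differencing step) prints the first ten ACF/PACF values and flags those outside the approximate 95% confidence band ±1.96/√N:

```python
import numpy as np
from statsmodels.tsa.stattools import acf, pacf

# Numeric counterparts of the plots above
acf_vals = acf(series_diff, nlags=10)
pacf_vals = pacf(series_diff, nlags=10)
band = 1.96 / np.sqrt(len(series_diff))  # approximate 95% significance band
for lag in range(1, 11):
    mark = '*' if abs(acf_vals[lag]) > band else ' '
    print(f'lag {lag:2d}: ACF={acf_vals[lag]:+.3f}{mark} PACF={pacf_vals[lag]:+.3f}')
```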
5. ARIMA Model
AutoRegressive Integrated Moving Average
ARIMA(p, d, q):
- p: AR (AutoRegressive) order - Determined by PACF
- d: Differencing order - Number of differences needed for stationarity
- q: MA (Moving Average) order - Determined by ACF
```python
from statsmodels.tsa.arima.model import ARIMA

# Train/Test split (maintain time order!)
train_size = int(len(df) * 0.8)
train = df['sales'][:train_size]
test = df['sales'][train_size:]

# Model creation and training
model = ARIMA(train, order=(2, 1, 2))
model_fit = model.fit()

# Forecast
forecast = model_fit.forecast(steps=len(test))

# Summary
print(model_fit.summary())
```
Caution: Time series data must be split in time order. Random splitting leaks future information into training (data leakage)!
Automatic Parameter Selection (Grid Search)
```python
from itertools import product

p_values = range(0, 4)
d_values = range(0, 2)
q_values = range(0, 4)

results = []
for p, d, q in product(p_values, d_values, q_values):
    try:
        model = ARIMA(train, order=(p, d, q))
        model_fit = model.fit()
        results.append({'Order': f'({p},{d},{q})', 'AIC': model_fit.aic})
    except Exception:
        continue

results_df = pd.DataFrame(results)
print(results_df.nsmallest(5, 'AIC'))  # Lower AIC is better
```
6. Seasonal ARIMA (SARIMA)
Use SARIMA for data with seasonality.
SARIMA(p, d, q)(P, D, Q, s):
- (p, d, q): Non-seasonal parameters
- (P, D, Q, s): Seasonal parameters, s=period
```python
from statsmodels.tsa.statespace.sarimax import SARIMAX

model = SARIMAX(train,
                order=(1, 1, 1),
                seasonal_order=(1, 1, 1, 12))  # s=12: monthly data, yearly seasonality
model_fit = model.fit(disp=False)
```
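A sketch of forecasting with the fitted model: statsmodels results objects expose `get_forecast`, which also returns confidence intervals (the 12-step horizon here is an illustrative choice):

```python
# Forecast from the fitted SARIMA with confidence intervals
forecast_res = model_fit.get_forecast(steps=12)
print(forecast_res.predicted_mean)       # point forecasts
print(forecast_res.conf_int(alpha=0.05))  # 95% interval bounds
```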
7. Auto ARIMA
Automatically find optimal parameters with the pmdarima library.
```python
from pmdarima import auto_arima

auto_model = auto_arima(
    train,
    seasonal=True,
    m=12,  # Seasonal period
    trace=True,
    error_action='ignore',
    suppress_warnings=True
)
print(auto_model.summary())
```
8. Residual Diagnostics
Good models should have residuals that are white noise.
```python
from statsmodels.stats.diagnostic import acorr_ljungbox

residuals = model_fit.resid

# Ljung-Box test
lb_result = acorr_ljungbox(residuals, lags=[10, 20, 30], return_df=True)
print(lb_result)
# p-value > 0.05 → no autocorrelation left in the residuals (good!)
```
Residual diagnostic checklist:
- Residuals randomly distributed around 0
- Residual ACF not significant
- Residuals follow a normal distribution (check the Q-Q plot; see the sketch below)
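One convenient way to run this checklist at once, assuming `model_fit` is the fitted ARIMA/SARIMA result from above, is the built-in diagnostics plot:

```python
import matplotlib.pyplot as plt

# Standardized residuals, histogram + density, Q-Q plot, and residual
# correlogram in a single figure
model_fit.plot_diagnostics(figsize=(12, 8))
plt.tight_layout()
plt.show()
```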
9. Moving Average Based Forecasting
Simple but effective baseline model.
```python
# Simple Moving Average (SMA)
df['SMA_7'] = df['sales'].rolling(window=7).mean()
df['SMA_30'] = df['sales'].rolling(window=30).mean()

# Exponential Moving Average (EMA) - more weight on recent values
df['EMA_7'] = df['sales'].ewm(span=7).mean()
df['EMA_30'] = df['sales'].ewm(span=30).mean()
```
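As a sketch of how such a baseline is used, the mean of the last few training observations can serve as a flat forecast over the test horizon; this reuses `train` and `test` from the ARIMA section (the 7-day window is an illustrative choice):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Flat baseline: repeat the mean of the last 7 training values over
# the test horizon, then compare its RMSE with the ARIMA forecast
sma_forecast = np.repeat(train.iloc[-7:].mean(), len(test))
baseline_rmse = np.sqrt(mean_squared_error(test, sma_forecast))
print(f'SMA(7) baseline RMSE: {baseline_rmse:.2f}')
```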
10. Time Series Cross-Validation
```python
from sklearn.model_selection import TimeSeriesSplit

# X, y: a feature matrix and target array (placeholders here); every
# split trains only on observations that precede the test fold
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
```
Code Summary
```python
import pandas as pd
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import adfuller
from sklearn.metrics import mean_squared_error, mean_absolute_error

# 1. Stationarity test
result = adfuller(series.dropna())
print(f'ADF p-value: {result[1]:.4f}')

# 2. Difference if needed
series_diff = series.diff().dropna()

# 3. Data split (maintain time order)
train_size = int(len(series) * 0.8)
train, test = series[:train_size], series[train_size:]

# 4. ARIMA model
model = ARIMA(train, order=(2, 1, 2))
model_fit = model.fit()

# 5. Forecast
forecast = model_fit.forecast(steps=len(test))

# 6. Evaluation
rmse = np.sqrt(mean_squared_error(test, forecast))
mae = mean_absolute_error(test, forecast)
mape = np.mean(np.abs((test - forecast) / test)) * 100
print(f"RMSE: {rmse:.4f}")
print(f"MAE: {mae:.4f}")
print(f"MAPE: {mape:.2f}%")
```
Evaluation Metrics
| Metric | Formula | Characteristics |
|---|---|---|
| RMSE | √(Mean(error²)) | Sensitive to large errors |
| MAE | Mean(\|error\|) | Robust to outliers; same units as the data |
| MAPE | Mean(\|error / y\|) × 100 | Scale-free percentage; undefined when y = 0 |
Time Series Forecasting Best Practices
Checklist
- Data Exploration: Check trend, seasonality, outliers and decompose time series
- Ensure Stationarity: Difference/log transform if needed after ADF test
- Model Selection: ACF/PACF analysis, AIC/BIC-based parameter selection
- Residual Diagnostics: Check residual autocorrelation, normality test
- Forecast Evaluation: Time-based split, rolling window cross-validation (see the sketch after this list)
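For the last checklist item, here is a minimal walk-forward (rolling-origin) sketch reusing `df['sales']`; the 80% starting point, one-step horizon, and order=(2,1,2) are illustrative choices, and refitting at every step is slow but shows the idea:

```python
import numpy as np
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_squared_error

# Walk-forward validation: refit on an expanding window, forecast one
# step ahead, then move the origin forward
series = df['sales']
start = int(len(series) * 0.8)
preds, actuals = [], []
for i in range(start, len(series)):
    fit = ARIMA(series.iloc[:i], order=(2, 1, 2)).fit()
    preds.append(fit.forecast(steps=1).iloc[0])
    actuals.append(series.iloc[i])

print(f'Walk-forward RMSE: {np.sqrt(mean_squared_error(actuals, preds)):.2f}')
```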
Common Mistakes
| Mistake | Correct Approach |
|---|---|
| Random Train/Test split | Split by time order |
| Using future information | Use only past data |
| Skipping stationarity test | ADF test required |
| Single evaluation metric | Comprehensive evaluation with RMSE, MAE, MAPE |
Interview Questions Preview
- What is stationarity and why is it important?
- How do you determine p, d, q for ARIMA?
- What are the considerations for Train/Test Split with time series data?
- What's the difference between ACF and PACF?
- When do you use additive vs multiplicative models?
Check out more interview questions at Premium Interviews.
Practice Notebook
The notebook additionally covers practice with synthetic data and real airline passenger data, various moving average comparisons, residual diagnostic visualization, and practice problems.