12. Neural Network Fundamentals
Perceptron, Backpropagation, Activation Functions
Learning Objectives
After completing this tutorial, you will understand:
- The basic structure of a neural network (neurons, layers)
- Role and characteristics of activation functions (Sigmoid, ReLU, Tanh, Softmax)
- Forward Propagation process
- Backpropagation and Chain Rule
- Neural network training through gradient descent
- Non-linear classification by solving the XOR problem
Key Concepts
1. What is a Neural Network?
A machine learning model inspired by the structure of neurons in the brain. It is composed of an input layer, hidden layers, and an output layer; neurons in adjacent layers are connected, which lets the network learn complex patterns.
```
Input Layer       Hidden Layer       Output Layer

  x1 ────┬───►  h1 ────┬───►  y
         │             │
  x2 ────┼───►  h2 ────┤
         │             │
         └───►  h3 ────┘
```

2. Perceptron (Single Neuron)
The perceptron is the simplest form of neural network: it multiplies the inputs by weights, adds a bias, and then applies an activation function.
z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
y = activation(z)

```python
import numpy as np

class Perceptron:
    def __init__(self, n_features, learning_rate=0.1):
        self.weights = np.random.randn(n_features) * 0.1
        self.bias = 0
        self.lr = learning_rate

    def step_function(self, z):
        return (z >= 0).astype(int)

    def predict(self, X):
        z = np.dot(X, self.weights) + self.bias
        return self.step_function(z)

    def fit(self, X, y, epochs=100):
        for epoch in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                prediction = self.predict(xi.reshape(1, -1))[0]
                error = yi - prediction
                # Weight update (perceptron learning rule)
                self.weights += self.lr * error * xi
                self.bias += self.lr * error
                errors += abs(error)
            if errors == 0:
                print(f'Epoch {epoch+1}: Converged!')
                break
        return self
```

A single perceptron can only solve linearly separable problems: the AND and OR gates are learnable, but the XOR problem is not.
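As a quick usage sketch (not from the original tutorial), training the class above on the AND gate shows convergence; running the same code on XOR labels never reaches zero errors:

```python
# Usage sketch: the AND gate is linearly separable, so the perceptron converges
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])

p = Perceptron(n_features=2, learning_rate=0.1)
p.fit(X_and, y_and, epochs=100)
print(p.predict(X_and))  # expected: [0 0 0 1]

# Training on XOR labels [0, 1, 1, 0] with the same code never converges
```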
3. Activation Functions
Activation functions transform neuron output non-linearly, enabling neural networks to learn complex patterns.
| Function | Formula | Range | Characteristics |
|---|---|---|---|
| Sigmoid | 1/(1+e^(-z)) | (0, 1) | Probability output, Vanishing Gradient |
| Tanh | (e^z - e^(-z))/(e^z + e^(-z)) | (-1, 1) | Zero-centered, Vanishing Gradient |
| ReLU | max(0, z) | [0, ∞) | Default choice, Dead ReLU problem |
| Leaky ReLU | max(0.01z, z) | (-∞, ∞) | Solves Dead ReLU |
| Softmax | e^zᵢ / Σe^zⱼ | (0, 1) | Multi-class output |
```python
def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # Subtract max for numerical stability
    return exp_z / exp_z.sum()
```

Activation Function Selection Guide:
- Hidden layers: ReLU (default), Leaky ReLU (prevent Dead neurons)
- Binary classification output: Sigmoid
- Multi-class output: Softmax
- Regression output: Linear (no activation)
4. Forward Propagation
The process of sequentially computing from input to output. Calculates weighted sum and applies activation function at each layer.
```python
def forward(X, W1, b1, W2, b2):
    # Hidden layer
    z1 = X @ W1 + b1
    a1 = relu(z1)
    # Output layer
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)
    return a2
```

5. Backpropagation
Computes gradients backward from output to input. Uses Chain Rule to calculate the gradient of loss function with respect to each weight.
Chain Rule:
∂L/∂w = ∂L/∂y × ∂y/∂z × ∂z/∂w

2-layer Neural Network Example:
x → [W1] → z1 → [σ] → a1 → [W2] → z2 → [σ] → a2 → L
```python
def backprop(X, y, a1, a2, W2):
    m = len(y)
    # Output layer gradient (sigmoid output + binary cross-entropy: dL/dz2 = a2 - y)
    dz2 = a2 - y
    dW2 = (1/m) * a1.T @ dz2
    db2 = (1/m) * np.sum(dz2, axis=0)
    # Hidden layer gradient (ReLU derivative is 1 where the activation is positive)
    dz1 = (dz2 @ W2.T) * relu_derivative(a1)
    dW1 = (1/m) * X.T @ dz1
    db1 = (1/m) * np.sum(dz1, axis=0)
    return dW1, db1, dW2, db2
```

6. Multi-Layer Perceptron (MLP) Implementation
Hidden layers are needed to solve the XOR problem. Here's a multi-layer neural network implementation:
```python
class NeuralNetwork:
    def __init__(self, layer_sizes, learning_rate=0.1, activation='sigmoid'):
        """
        layer_sizes: [input, hidden1, hidden2, ..., output]
        Example: [2, 4, 1] = 2 inputs, 4 hidden units, 1 output
        """
        self.layer_sizes = layer_sizes
        self.lr = learning_rate
        self.n_layers = len(layer_sizes)

        # He initialization (scale by sqrt(2 / n_in), suited to ReLU hidden layers)
        self.weights = []
        self.biases = []
        for i in range(self.n_layers - 1):
            w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros((1, layer_sizes[i+1]))
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, X):
        """Forward propagation"""
        self.activations = [X]
        self.z_values = []
        A = X
        for i in range(self.n_layers - 1):
            Z = np.dot(A, self.weights[i]) + self.biases[i]
            self.z_values.append(Z)
            if i == self.n_layers - 2:
                A = sigmoid(Z)  # Output layer
            else:
                A = relu(Z)     # Hidden layer
            self.activations.append(A)
        return A

    def fit(self, X, y, epochs=1000):
        """Training"""
        for epoch in range(epochs):
            output = self.forward(X)
            # Binary cross-entropy loss (the 1e-8 term avoids log(0))
            loss = -np.mean(y * np.log(output + 1e-8) + (1 - y) * np.log(1 - output + 1e-8))
            self.backward(X, y)
            if (epoch + 1) % 100 == 0:
                print(f'Epoch {epoch+1}, Loss: {loss:.4f}')
        return self
```

Xavier/He Initialization: Initializing weights with an appropriate variance enables stable training. He initialization, which scales by np.sqrt(2.0 / n_input), works well with ReLU.
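The fit method above calls self.backward, which is not shown in this section. Below is a minimal sketch of what it could look like, consistent with the forward pass (ReLU hidden layers, sigmoid output, binary cross-entropy loss), followed by a hypothetical XOR usage; treat both as illustrations rather than the tutorial's exact code:

```python
def backward(self, X, y):
    """Hypothetical backward pass matching NeuralNetwork.forward above."""
    m = X.shape[0]
    # Sigmoid output + binary cross-entropy simplify to dL/dz = a - y
    dZ = self.activations[-1] - y.reshape(self.activations[-1].shape)
    for i in reversed(range(self.n_layers - 1)):
        dW = (1 / m) * self.activations[i].T @ dZ
        db = (1 / m) * np.sum(dZ, axis=0, keepdims=True)
        if i > 0:
            # Propagate the gradient through the previous ReLU hidden layer
            dZ = (dZ @ self.weights[i].T) * relu_derivative(self.z_values[i - 1])
        # Gradient descent update
        self.weights[i] -= self.lr * dW
        self.biases[i] -= self.lr * db

NeuralNetwork.backward = backward  # attach the sketch to the class

# XOR usage sketch: results vary with the random initialization
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([[0], [1], [1], [0]])
nn = NeuralNetwork([2, 4, 1], learning_rate=0.5).fit(X_xor, y_xor, epochs=2000)
print(nn.forward(X_xor).round().ravel())  # typically [0. 1. 1. 0.]
```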
7. Loss Functions
| Problem | Loss Function |
|---|---|
| Binary Classification | Binary Cross-Entropy |
| Multi-class Classification | Categorical Cross-Entropy |
| Regression | MSE |
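Only the binary cross-entropy case is implemented in this section. As a hedged supplement (not from the original tutorial), minimal NumPy sketches of the other two rows might look like this, assuming y_true is one-hot encoded and y_pred holds softmax probabilities for the categorical case:

```python
import numpy as np

# Categorical Cross-Entropy (assumes one-hot y_true, softmax probabilities in y_pred)
def cce_loss(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Mean Squared Error (regression)
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)
```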
```python
# Binary Cross-Entropy
def bce_loss(y_true, y_pred):
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```

8. Optimizer
| Optimizer | Characteristics |
|---|---|
| SGD | Basic, Slow |
| Momentum | Adds inertia, Improves convergence speed |
| Adam | Adaptive + Momentum (default choice) |
| RMSprop | Adaptive learning rate |
Learning Rate Setting Caution:
- Too small: Very slow learning
- Too large: Divergence or oscillation
- Recommended: Start with 0.001 ~ 0.01 and adjust
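As an illustrative sketch (not part of the original tutorial), the momentum idea can be seen on a toy quadratic loss; the objective and values below are made up purely for demonstration:

```python
# Toy objective f(w) = (w - 3)^2 with gradient 2(w - 3); illustrative only
def grad(w):
    return 2 * (w - 3.0)

lr, beta = 0.01, 0.9          # small learning rate; beta is the momentum coefficient
w_sgd, w_mom, velocity = 0.0, 0.0, 0.0

for step in range(100):
    w_sgd -= lr * grad(w_sgd)                      # plain SGD: step along the gradient
    velocity = beta * velocity - lr * grad(w_mom)  # momentum: accumulate past gradients
    w_mom += velocity

print(round(w_sgd, 3), round(w_mom, 3))  # momentum gets closer to 3.0 in the same steps
```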
9. Vanishing Gradient Problem
In deep neural networks, Sigmoid or Tanh activations cause gradients to shrink repeatedly as they are propagated backward, so the earliest layers barely learn.
Vanishing Gradient: Maximum gradient of Sigmoid is 0.25. Passing through 5 layers reduces gradient to 0.25^5 ≈ 0.001.
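A quick illustrative calculation of that bound (not from the original):

```python
# Upper bound on the gradient factor contributed by stacked sigmoid layers:
# sigmoid'(z) = s(1 - s) peaks at 0.25 (at z = 0)
max_grad = 0.25
for n_layers in [1, 3, 5, 10]:
    print(f'{n_layers} layers: at most {max_grad ** n_layers:.6f}')
# 5 layers -> at most 0.000977, so early layers receive almost no gradient signal
```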
Solutions:
- Use ReLU activation function
- Batch Normalization
- Residual Connection (Skip Connection)
- Proper weight initialization
Keras/TensorFlow Code
```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Model definition
model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(n_features,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

# Compile
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Training
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.2,
    verbose=1
)

# Evaluation
loss, accuracy = model.evaluate(X_test, y_test)
```

Overfitting Prevention
Dropout
Randomly deactivates neurons during training to prevent overfitting.
```python
model = models.Sequential([
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),  # Deactivate 30% of neurons during training
    layers.Dense(1, activation='sigmoid')
])
```

Early Stopping
Stops training early when validation loss no longer improves.
```python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)
# Monitoring val_loss requires validation data, e.g. via validation_split
model.fit(X_train, y_train, validation_split=0.2, callbacks=[early_stop])
```

L2 Regularization
Prevents overfitting by limiting weight magnitude.
```python
from tensorflow.keras import regularizers

layers.Dense(64, activation='relu',
             kernel_regularizer=regularizers.l2(0.01))
```

Hyperparameter Guide
| Parameter | Recommended | Description |
|---|---|---|
| Hidden units | 32, 64, 128, 256 | Adjust based on problem complexity |
| Learning rate | 0.001, 0.01 | Adam default 0.001 recommended |
| Batch size | 32, 64, 128 | Consider memory and convergence speed |
| Dropout | 0.2 ~ 0.5 | Adjust based on overfitting degree |
| Epochs | 100 ~ 1000 | Use with Early Stopping |
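As a hedged example tying the table together (settings are illustrative, and the names n_features, X_train, and early_stop are reused from the Keras code above), one possible configuration:

```python
# Illustrative configuration following the table above (values are not prescriptive)
model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(n_features,)),
    layers.Dropout(0.3),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=1000, batch_size=32,
          validation_split=0.2, callbacks=[early_stop], verbose=0)
```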
Interview Questions Preview
- What is the principle of Backpropagation?
- What are the pros and cons of ReLU?
- What are Vanishing/Exploding Gradient problems and solutions?
- Why can't single perceptron solve XOR problem?
- Why use Xavier/He initialization?
Check out more interview questions at Premium Interviews.
Practice Notebook
The practice notebook covers AND/OR/XOR gate experiments, Moon dataset classification, performance comparisons across learning rates and hidden layer sizes, a comparison with Scikit-learn's MLPClassifier, and decision boundary visualization.