12. Neural Network Fundamentals
Perceptron, Backpropagation, Activation Functions
Learning Objectives
After completing this tutorial, you will understand:
- The basic structure of a neural network (neurons, layers)
- Role and characteristics of activation functions (Sigmoid, ReLU, Tanh, Softmax)
- Forward Propagation process
- Backpropagation and Chain Rule
- Neural network training through gradient descent
- Non-linear classification by solving the XOR problem
Key Concepts
1. What is a Neural Network?
A machine learning model inspired by the structure of neurons in the brain. It is composed of an input layer, hidden layers, and an output layer; neurons in adjacent layers are connected, which lets the network learn complex patterns.
```
Input Layer       Hidden Layer       Output Layer

  x1 ────┬───►  h1 ────┬───►  y
         │             │
  x2 ────┼───►  h2 ────┤
         │             │
         └───►  h3 ────┘
```

2. Perceptron (Single Neuron)
The perceptron is the simplest form of neural network: it multiplies the inputs by weights, adds a bias, and then applies an activation function.
z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
y = activation(z)

```python
import numpy as np

class Perceptron:
    def __init__(self, n_features, learning_rate=0.1):
        self.weights = np.random.randn(n_features) * 0.1
        self.bias = 0
        self.lr = learning_rate

    def step_function(self, z):
        return (z >= 0).astype(int)

    def predict(self, X):
        z = np.dot(X, self.weights) + self.bias
        return self.step_function(z)

    def fit(self, X, y, epochs=100):
        for epoch in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                prediction = self.predict(xi.reshape(1, -1))[0]
                error = yi - prediction
                # Weight update (perceptron learning rule)
                self.weights += self.lr * error * xi
                self.bias += self.lr * error
                errors += abs(error)
            if errors == 0:
                print(f'Epoch {epoch+1}: Converged!')
                break
        return self
```

A single perceptron can only solve linearly separable problems: the AND and OR gates are learnable, but the XOR problem is not.
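As a quick usage sketch (not from the original tutorial), training the class above on the AND gate shows convergence; running the same code on XOR labels never reaches zero errors:

```python
# Usage sketch: the AND gate is linearly separable, so the perceptron converges
X_and = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_and = np.array([0, 0, 0, 1])

p = Perceptron(n_features=2, learning_rate=0.1)
p.fit(X_and, y_and, epochs=100)
print(p.predict(X_and))  # expected: [0 0 0 1]

# Training on XOR labels [0, 1, 1, 0] with the same code never converges
```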
3. Activation Functions
Activation functions transform neuron output non-linearly, enabling neural networks to learn complex patterns.
| Function | Formula | Range | Characteristics |
|---|---|---|---|
| Sigmoid | 1/(1+e^(-z)) | (0, 1) | Probability output, Vanishing Gradient |
| Tanh | (e^z - e^(-z))/(e^z + e^(-z)) | (-1, 1) | Zero-centered, Vanishing Gradient |
| ReLU | max(0, z) | [0, ∞) | Default choice, Dead ReLU problem |
| Leaky ReLU | max(0.01z, z) | (-∞, ∞) | Solves Dead ReLU |
| Softmax | e^zᵢ / Σe^zⱼ | (0, 1) | Multi-class output |
```python
def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

def softmax(z):
    exp_z = np.exp(z - np.max(z))  # Subtract max for numerical stability
    return exp_z / exp_z.sum()
```

Activation Function Selection Guide:
- Hidden layers: ReLU (default), Leaky ReLU (prevent Dead neurons)
- Binary classification output: Sigmoid
- Multi-class output: Softmax
- Regression output: Linear (no activation)
4. Forward Propagation
The process of sequentially computing from input to output. Calculates weighted sum and applies activation function at each layer.
```python
def forward(X, W1, b1, W2, b2):
    # Hidden layer
    z1 = X @ W1 + b1
    a1 = relu(z1)
    # Output layer
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)
    return a2
```

5. Backpropagation
Computes gradients backward from output to input. Uses Chain Rule to calculate the gradient of loss function with respect to each weight.
Chain Rule:
∂L/∂w = ∂L/∂y × ∂y/∂z × ∂z/∂w

2-layer Neural Network Example:
x → [W1] → z1 → [σ] → a1 → [W2] → z2 → [σ] → a2 → L
```python
def backprop(X, y, a1, a2, W2):
    m = len(y)
    # Output layer gradient (sigmoid output + binary cross-entropy: dL/dz2 = a2 - y)
    dz2 = a2 - y
    dW2 = (1/m) * a1.T @ dz2
    db2 = (1/m) * np.sum(dz2, axis=0)
    # Hidden layer gradient (ReLU derivative is 1 where the activation is positive)
    dz1 = (dz2 @ W2.T) * relu_derivative(a1)
    dW1 = (1/m) * X.T @ dz1
    db1 = (1/m) * np.sum(dz1, axis=0)
    return dW1, db1, dW2, db2
```

6. Multi-Layer Perceptron (MLP) Implementation
Hidden layers are needed to solve the XOR problem. Here's a multi-layer neural network implementation:
```python
class NeuralNetwork:
    def __init__(self, layer_sizes, learning_rate=0.1, activation='sigmoid'):
        """
        layer_sizes: [input, hidden1, hidden2, ..., output]
        Example: [2, 4, 1] = 2 inputs, 4 hidden units, 1 output
        """
        self.layer_sizes = layer_sizes
        self.lr = learning_rate
        self.n_layers = len(layer_sizes)

        # He initialization (scale by sqrt(2 / n_in), suited to ReLU hidden layers)
        self.weights = []
        self.biases = []
        for i in range(self.n_layers - 1):
            w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros((1, layer_sizes[i+1]))
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, X):
        """Forward propagation"""
        self.activations = [X]
        self.z_values = []
        A = X
        for i in range(self.n_layers - 1):
            Z = np.dot(A, self.weights[i]) + self.biases[i]
            self.z_values.append(Z)
            if i == self.n_layers - 2:
                A = sigmoid(Z)  # Output layer
            else:
                A = relu(Z)     # Hidden layer
            self.activations.append(A)
        return A

    def fit(self, X, y, epochs=1000):
        """Training"""
        for epoch in range(epochs):
            output = self.forward(X)
            # Binary cross-entropy loss (the 1e-8 term avoids log(0))
            loss = -np.mean(y * np.log(output + 1e-8) + (1 - y) * np.log(1 - output + 1e-8))
            self.backward(X, y)
            if (epoch + 1) % 100 == 0:
                print(f'Epoch {epoch+1}, Loss: {loss:.4f}')
        return self
```

Xavier/He Initialization: Initializing weights with an appropriate variance enables stable training. He initialization, which scales by np.sqrt(2.0 / n_input), works well with ReLU.
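The fit method above calls self.backward, which is not shown in this section. Below is a minimal sketch of what it could look like, consistent with the forward pass (ReLU hidden layers, sigmoid output, binary cross-entropy loss), followed by a hypothetical XOR usage; treat both as illustrations rather than the tutorial's exact code:

```python
def backward(self, X, y):
    """Hypothetical backward pass matching NeuralNetwork.forward above."""
    m = X.shape[0]
    # Sigmoid output + binary cross-entropy simplify to dL/dz = a - y
    dZ = self.activations[-1] - y.reshape(self.activations[-1].shape)
    for i in reversed(range(self.n_layers - 1)):
        dW = (1 / m) * self.activations[i].T @ dZ
        db = (1 / m) * np.sum(dZ, axis=0, keepdims=True)
        if i > 0:
            # Propagate the gradient through the previous ReLU hidden layer
            dZ = (dZ @ self.weights[i].T) * relu_derivative(self.z_values[i - 1])
        # Gradient descent update
        self.weights[i] -= self.lr * dW
        self.biases[i] -= self.lr * db

NeuralNetwork.backward = backward  # attach the sketch to the class

# XOR usage sketch: results vary with the random initialization
X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([[0], [1], [1], [0]])
nn = NeuralNetwork([2, 4, 1], learning_rate=0.5).fit(X_xor, y_xor, epochs=2000)
print(nn.forward(X_xor).round().ravel())  # typically [0. 1. 1. 0.]
```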
7. Loss Functions
| Problem | Loss Function |
|---|---|
| Binary Classification | Binary Cross-Entropy |
| Multi-class Classification | Categorical Cross-Entropy |
| Regression | MSE |
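Only the binary cross-entropy case is implemented in this section. As a hedged supplement (not from the original tutorial), minimal NumPy sketches of the other two rows might look like this, assuming y_true is one-hot encoded and y_pred holds softmax probabilities for the categorical case:

```python
import numpy as np

# Categorical Cross-Entropy (assumes one-hot y_true, softmax probabilities in y_pred)
def cce_loss(y_true, y_pred):
    y_pred = np.clip(y_pred, 1e-15, 1 - 1e-15)  # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Mean Squared Error (regression)
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)
```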
```python
# Binary Cross-Entropy
def bce_loss(y_true, y_pred):
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
```

8. Optimizer
| Optimizer | Characteristics |
|---|---|
| SGD | Basic, Slow |
| Momentum | Adds inertia, Improves convergence speed |
| Adam | Adaptive + Momentum (default choice) |
| RMSprop | Adaptive learning rate |
Learning Rate Setting Caution:
- Too small: Very slow learning
- Too large: Divergence or oscillation
- Recommended: Start with 0.001 ~ 0.01 and adjust
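As an illustrative sketch (not part of the original tutorial), the momentum idea can be seen on a toy quadratic loss; the objective and values below are made up purely for demonstration:

```python
# Toy objective f(w) = (w - 3)^2 with gradient 2(w - 3); illustrative only
def grad(w):
    return 2 * (w - 3.0)

lr, beta = 0.01, 0.9          # small learning rate; beta is the momentum coefficient
w_sgd, w_mom, velocity = 0.0, 0.0, 0.0

for step in range(100):
    w_sgd -= lr * grad(w_sgd)                      # plain SGD: step along the gradient
    velocity = beta * velocity - lr * grad(w_mom)  # momentum: accumulate past gradients
    w_mom += velocity

print(round(w_sgd, 3), round(w_mom, 3))  # momentum gets closer to 3.0 in the same steps
```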
9. Vanishing Gradient Problem
In deep neural networks, Sigmoid or Tanh activations cause gradients to shrink repeatedly as they are propagated backward, so the earliest layers barely learn.
Vanishing Gradient: Maximum gradient of Sigmoid is 0.25. Passing through 5 layers reduces gradient to 0.25^5 ≈ 0.001.
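A quick illustrative calculation of that bound (not from the original):

```python
# Upper bound on the gradient factor contributed by stacked sigmoid layers:
# sigmoid'(z) = s(1 - s) peaks at 0.25 (at z = 0)
max_grad = 0.25
for n_layers in [1, 3, 5, 10]:
    print(f'{n_layers} layers: at most {max_grad ** n_layers:.6f}')
# 5 layers -> at most 0.000977, so early layers receive almost no gradient signal
```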
Solutions:
- Use ReLU activation function
- Batch Normalization
- Residual Connection (Skip Connection)
- Proper weight initialization
Keras/TensorFlow Code
```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Model definition
model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(n_features,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])

# Compile
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)

# Training
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.2,
    verbose=1
)

# Evaluation
loss, accuracy = model.evaluate(X_test, y_test)
```

Overfitting Prevention
Dropout
Randomly deactivates neurons during training to prevent overfitting.
```python
model = models.Sequential([
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),  # Deactivate 30% of neurons during training
    layers.Dense(1, activation='sigmoid')
])
```

Early Stopping
Stops training early when validation loss no longer improves.
```python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)
# Monitoring val_loss requires validation data, e.g. via validation_split
model.fit(X_train, y_train, validation_split=0.2, callbacks=[early_stop])
```

L2 Regularization
Prevents overfitting by limiting weight magnitude.
```python
from tensorflow.keras import regularizers

layers.Dense(64, activation='relu',
             kernel_regularizer=regularizers.l2(0.01))
```

Hyperparameter Guide
| Parameter | Recommended | Description |
|---|---|---|
| Hidden units | 32, 64, 128, 256 | Adjust based on problem complexity |
| Learning rate | 0.001, 0.01 | Adam default 0.001 recommended |
| Batch size | 32, 64, 128 | Consider memory and convergence speed |
| Dropout | 0.2 ~ 0.5 | Adjust based on overfitting degree |
| Epochs | 100 ~ 1000 | Use with Early Stopping |
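As a hedged example tying the table together (settings are illustrative, and the names n_features, X_train, and early_stop are reused from the Keras code above), one possible configuration:

```python
# Illustrative configuration following the table above (values are not prescriptive)
model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(n_features,)),
    layers.Dropout(0.3),
    layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=1000, batch_size=32,
          validation_split=0.2, callbacks=[early_stop], verbose=0)
```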
Interview Questions Preview
- What is the principle of Backpropagation?
- What are the pros and cons of ReLU?
- What are Vanishing/Exploding Gradient problems and solutions?
- Why can't single perceptron solve XOR problem?
- Why use Xavier/He initialization?
Check out more interview questions at Premium Interviews.
Practice Notebook
The practice notebook covers AND/OR/XOR gate experiments, Moon dataset classification, performance comparisons across learning rates and hidden layer sizes, a comparison with Scikit-learn's MLPClassifier, and decision boundary visualization.