
12. Neural Network Fundamentals

Perceptron, Backpropagation, Activation Functions


Learning Objectives

After completing this tutorial, you will understand:

  • The basic structure of a neural network (neurons, layers)
  • The role and characteristics of activation functions (Sigmoid, ReLU, Tanh, Softmax)
  • The Forward Propagation process
  • Backpropagation and the Chain Rule
  • How a neural network is trained with gradient descent
  • Non-linear classification by solving the XOR problem

Key Concepts

1. What is a Neural Network?

A neural network is a machine learning model inspired by the structure of neurons in the brain. It is composed of an input layer, hidden layers, and an output layer, and the neurons in adjacent layers are connected so the network can learn complex patterns.

Input Layer    Hidden Layer    Output Layer

  x1 ────┬───► h1 ────┬───► y
         │           │
  x2 ────┼───► h2 ────┤
         │           │
         └───► h3 ────┘

2. Perceptron (Single Neuron)

The perceptron is the simplest form of neural network: it multiplies the inputs by weights, adds a bias, and then applies an activation function.

z = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
y = activation(z)

import numpy as np

class Perceptron:
    def __init__(self, n_features, learning_rate=0.1):
        self.weights = np.random.randn(n_features) * 0.1
        self.bias = 0
        self.lr = learning_rate
 
    def step_function(self, z):
        return (z >= 0).astype(int)
 
    def predict(self, X):
        z = np.dot(X, self.weights) + self.bias
        return self.step_function(z)
 
    def fit(self, X, y, epochs=100):
        for epoch in range(epochs):
            errors = 0
            for xi, yi in zip(X, y):
                prediction = self.predict(xi.reshape(1, -1))[0]
                error = yi - prediction
 
                # Weight update
                self.weights += self.lr * error * xi
                self.bias += self.lr * error
                errors += abs(error)
 
            if errors == 0:
                print(f'Epoch {epoch+1}: Converged!')
                break
        return self
⚠️

A single perceptron can only solve linearly separable problems: the AND and OR gates are learnable, but the XOR problem cannot be solved (demonstrated in the sketch below).
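To see this in practice, here is a quick sketch (assuming the Perceptron class and the numpy import above) that trains on the AND gate and then attempts XOR:

# Truth-table inputs shared by both gates
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

# AND gate: linearly separable, so training typically converges within a few epochs
p_and = Perceptron(n_features=2).fit(X, np.array([0, 0, 0, 1]))
print(p_and.predict(X))  # expected: [0 0 0 1]

# XOR gate: not linearly separable, so no epoch ever reaches zero errors
p_xor = Perceptron(n_features=2).fit(X, np.array([0, 1, 1, 0]))
print(p_xor.predict(X))  # never matches [0 1 1 0] on all four inputs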


3. Activation Functions

Activation functions transform neuron output non-linearly, enabling neural networks to learn complex patterns.

Function     Formula                         Range      Characteristics
Sigmoid      1/(1+e^(-z))                    (0, 1)     Probability output, Vanishing Gradient
Tanh         (e^z - e^(-z))/(e^z + e^(-z))   (-1, 1)    Zero-centered, Vanishing Gradient
ReLU         max(0, z)                       [0, ∞)     Default choice, Dead ReLU problem
Leaky ReLU   max(0.01z, z)                   (-∞, ∞)    Solves Dead ReLU
Softmax      e^zᵢ / Σe^zⱼ                    (0, 1)     Multi-class output

def sigmoid(x):
    return 1 / (1 + np.exp(-np.clip(x, -500, 500)))
 
def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)
 
def relu(x):
    return np.maximum(0, x)
 
def relu_derivative(x):
    return (x > 0).astype(float)
 
def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)
 
def softmax(z):
    exp_z = np.exp(z - np.max(z))  # Numerical stability
    return exp_z / exp_z.sum()
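
Tanh appears in the table above but is not in the snippet; for completeness, a minimal sketch built on np.tanh:

def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    # d/dx tanh(x) = 1 - tanh(x)^2
    return 1 - np.tanh(x) ** 2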

Activation Function Selection Guide:

  • Hidden layers: ReLU (default), Leaky ReLU (prevent Dead neurons)
  • Binary classification output: Sigmoid
  • Multi-class output: Softmax
  • Regression output: Linear (no activation)

4. Forward Propagation

Forward propagation is the process of computing sequentially from input to output: at each layer, the weighted sum is calculated and an activation function is applied.

def forward(X, W1, b1, W2, b2):
    # Hidden layer
    z1 = X @ W1 + b1
    a1 = relu(z1)
 
    # Output layer
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)
 
    return a2
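
As a quick sanity check, here is a hypothetical call with a 2-4-1 architecture; the shapes below are assumptions chosen for illustration:

X = np.random.randn(5, 2)          # 5 samples, 2 features
W1 = np.random.randn(2, 4) * 0.1   # input -> hidden (4 units)
b1 = np.zeros(4)
W2 = np.random.randn(4, 1) * 0.1   # hidden -> output (1 unit)
b2 = np.zeros(1)

print(forward(X, W1, b1, W2, b2).shape)  # (5, 1)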

5. Backpropagation

Backpropagation computes gradients backward from the output to the input, using the Chain Rule to calculate the gradient of the loss function with respect to each weight.

Chain Rule:

∂L/∂w = ∂L/∂y × ∂y/∂z × ∂z/∂w

2-layer Neural Network Example:

x → [W1] → z1 → [σ] → a1 → [W2] → z2 → [σ] → a2 → L

∂L/∂W₂ = ∂L/∂a₂ × ∂a₂/∂z₂ × ∂z₂/∂W₂

∂L/∂W₁ = ∂L/∂a₂ × ∂a₂/∂z₂ × ∂z₂/∂a₁ × ∂a₁/∂z₁ × ∂z₁/∂W₁

def backprop(X, y, a1, a2, W2):
    m = len(y)
 
    # Output layer gradient (sigmoid output + binary cross-entropy simplifies to a2 - y)
    dz2 = a2 - y
    dW2 = (1/m) * a1.T @ dz2
    db2 = (1/m) * np.sum(dz2, axis=0)

    # Hidden layer gradient
    # relu_derivative(a1) equals relu_derivative(z1) here, since a1 > 0 exactly where z1 > 0
    dz1 = (dz2 @ W2.T) * relu_derivative(a1)
    dW1 = (1/m) * X.T @ dz1
    db1 = (1/m) * np.sum(dz1, axis=0)
 
    return dW1, db1, dW2, db2
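
Putting forward and backprop together, a minimal full-batch gradient descent loop might look like this (a sketch that assumes X of shape (m, n_features), y of shape (m, 1), and parameters shaped as in the forward example above):

lr = 0.1  # assumed learning rate

for epoch in range(1000):
    # Forward pass (keep the hidden activation for the backward pass)
    a1 = relu(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)

    # Gradients via backpropagation, then a gradient descent step
    dW1, db1, dW2, db2 = backprop(X, y, a1, a2, W2)
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2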

6. Multi-Layer Perceptron (MLP) Implementation

Hidden layers are needed to solve the XOR problem. Here's a complete multi-layer neural network implementation:

class NeuralNetwork:
    def __init__(self, layer_sizes, learning_rate=0.1):
        """
        layer_sizes: [input, hidden1, hidden2, ..., output]
        Example: [2, 4, 1] = 2 input, 4 hidden, 1 output
        """
        self.layer_sizes = layer_sizes
        self.lr = learning_rate
        self.n_layers = len(layer_sizes)

        # He initialization (variance 2/n_input, suited to ReLU hidden layers)
        self.weights = []
        self.biases = []

        for i in range(self.n_layers - 1):
            w = np.random.randn(layer_sizes[i], layer_sizes[i+1]) * np.sqrt(2.0 / layer_sizes[i])
            b = np.zeros((1, layer_sizes[i+1]))
            self.weights.append(w)
            self.biases.append(b)

    def forward(self, X):
        """Forward propagation"""
        self.activations = [X]
        self.z_values = []

        A = X
        for i in range(self.n_layers - 1):
            Z = np.dot(A, self.weights[i]) + self.biases[i]
            self.z_values.append(Z)

            if i == self.n_layers - 2:
                A = sigmoid(Z)  # Output layer
            else:
                A = relu(Z)  # Hidden layer

            self.activations.append(A)

        return A

    def backward(self, X, y):
        """Backpropagation and gradient descent update"""
        m = X.shape[0]
        y = y.reshape(-1, 1)

        # Output layer: sigmoid + binary cross-entropy simplifies to (a - y)
        dZ = self.activations[-1] - y

        for i in range(self.n_layers - 2, -1, -1):
            dW = (1/m) * self.activations[i].T @ dZ
            db = (1/m) * np.sum(dZ, axis=0, keepdims=True)

            if i > 0:
                # Propagate the gradient through the ReLU hidden layer below
                dZ = (dZ @ self.weights[i].T) * relu_derivative(self.z_values[i-1])

            self.weights[i] -= self.lr * dW
            self.biases[i] -= self.lr * db

    def fit(self, X, y, epochs=1000):
        """Training with binary cross-entropy loss"""
        y = y.reshape(-1, 1)
        for epoch in range(epochs):
            output = self.forward(X)
            loss = -np.mean(y * np.log(output + 1e-8) + (1 - y) * np.log(1 - output + 1e-8))
            self.backward(X, y)

            if (epoch + 1) % 100 == 0:
                print(f'Epoch {epoch+1}, Loss: {loss:.4f}')
        return self

Xavier/He Initialization: Initializing weights with appropriate variance allows stable training. He initialization using np.sqrt(2.0 / n_input) works well with ReLU.
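
With a hidden layer in place, the XOR problem from the perceptron section becomes learnable. A quick sketch (the hidden size, learning rate, and epoch count are arbitrary choices, and results vary with the random initialization):

X_xor = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_xor = np.array([0, 1, 1, 0])

nn = NeuralNetwork([2, 4, 1], learning_rate=0.5)
nn.fit(X_xor, y_xor, epochs=2000)

print(nn.forward(X_xor).round(3))  # predictions should approach 0, 1, 1, 0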


7. Loss Functions

Problem                      Loss Function
Binary Classification        Binary Cross-Entropy
Multi-class Classification   Categorical Cross-Entropy
Regression                   MSE

# Binary Cross-Entropy
def bce_loss(y_true, y_pred):
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
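
For completeness, minimal sketches of the other two losses from the table (assuming y_true is one-hot encoded for the categorical case):

# Categorical Cross-Entropy (y_true one-hot, y_pred from softmax)
def cce_loss(y_true, y_pred):
    epsilon = 1e-15
    y_pred = np.clip(y_pred, epsilon, 1 - epsilon)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

# Mean Squared Error (regression)
def mse_loss(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)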

8. Optimizer

Optimizer   Characteristics
SGD         Basic, slow
Momentum    Adds inertia, improves convergence speed
Adam        Adaptive + Momentum (default choice)
RMSprop     Adaptive learning rate

⚠️

Learning Rate Setting Caution:

  • Too small: Very slow learning
  • Too large: Divergence or oscillation
  • Recommended: Start with 0.001 ~ 0.01 and adjust
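
To make the difference between SGD and Momentum from the table above concrete, here is a minimal sketch of the two update rules; the function names and default values are illustrative, not from any library:

def sgd_update(w, grad, lr=0.01):
    # Plain SGD: step directly against the gradient
    return w - lr * grad

def momentum_update(w, grad, velocity, lr=0.01, beta=0.9):
    # Momentum: accumulate an exponentially decaying velocity, then step
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

In a training loop, velocity starts at zero (e.g. np.zeros_like(w)) and is carried over between iterations, which is what smooths the update direction.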

9. Vanishing Gradient Problem

In deep neural networks, using Sigmoid or Tanh causes gradients to shrink at each layer during backpropagation, making it difficult for the early layers to learn.

🚫

Vanishing Gradient: The maximum gradient of Sigmoid is 0.25, so passing through 5 such layers scales the gradient by at most 0.25^5 ≈ 0.001.
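
A quick back-of-the-envelope check of the factor quoted above (best case per layer, ignoring the weights):

max_sigmoid_grad = 0.25  # sigmoid'(0) = 0.25 is the largest possible value
for depth in (1, 5, 10):
    print(depth, max_sigmoid_grad ** depth)
# 1  0.25
# 5  0.0009765625
# 10 9.5367431640625e-07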

Solutions:

  • Use ReLU activation function
  • Batch Normalization
  • Residual Connection (Skip Connection)
  • Proper weight initialization

Keras/TensorFlow Code

import tensorflow as tf
from tensorflow.keras import layers, models
 
# Model definition
model = models.Sequential([
    layers.Dense(64, activation='relu', input_shape=(n_features,)),
    layers.Dense(32, activation='relu'),
    layers.Dense(1, activation='sigmoid')
])
 
# Compile
model.compile(
    optimizer='adam',
    loss='binary_crossentropy',
    metrics=['accuracy']
)
 
# Training
history = model.fit(
    X_train, y_train,
    epochs=100,
    batch_size=32,
    validation_split=0.2,
    verbose=1
)
 
# Evaluation
loss, accuracy = model.evaluate(X_test, y_test)

Overfitting Prevention

Dropout

Randomly deactivates neurons during training to prevent overfitting.

model = models.Sequential([
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.3),  # Deactivate 30% of neurons
    layers.Dense(1, activation='sigmoid')
])

Early Stopping

Stops training early when validation loss no longer improves.

from tensorflow.keras.callbacks import EarlyStopping
 
early_stop = EarlyStopping(
    monitor='val_loss',
    patience=10,
    restore_best_weights=True
)
 
model.fit(X_train, y_train, callbacks=[early_stop])

L2 Regularization

Prevents overfitting by limiting weight magnitude.

from tensorflow.keras import regularizers
 
layers.Dense(64, activation='relu',
             kernel_regularizer=regularizers.l2(0.01))

Hyperparameter Guide

Parameter       Recommended          Description
Hidden units    32, 64, 128, 256     Adjust based on problem complexity
Learning rate   0.001, 0.01          Adam default 0.001 recommended
Batch size      32, 64, 128          Consider memory and convergence speed
Dropout         0.2 ~ 0.5            Adjust based on degree of overfitting
Epochs          100 ~ 1000           Use together with Early Stopping

Interview Questions Preview

  1. What is the principle behind Backpropagation?
  2. What are the pros and cons of ReLU?
  3. What are the Vanishing/Exploding Gradient problems, and how are they addressed?
  4. Why can't a single perceptron solve the XOR problem?
  5. Why use Xavier/He initialization?

Check out more interview questions at Premium Interviews.


Practice Notebook

Additional notebook content: The practice notebook covers AND/OR/XOR gate experiments, Moon dataset classification, performance comparison by learning rate and hidden layer size, comparison with Scikit-learn MLPClassifier, and decision boundary visualization.

