13. CNN (Convolutional Neural Network)
Convolution, Pooling, Transfer Learning
Learning Objectives
After completing this tutorial, you will understand:
- CNN core concepts (Convolution, Pooling) and operation principles
- Role of filters/kernels and feature extraction process
- CNN architecture structure and parameter calculation
- Transfer Learning concepts and application strategies
- Characteristics of major CNN models (VGG, ResNet, EfficientNet)
1. Why CNN?
1.1 Limitations of Traditional Neural Networks
Problems when processing images with fully connected (Dense/FC) layers:
28x28 image = 784 pixels
Input(784) → Hidden(512) → Output(10)
Weights: 784 x 512 + 512 x 10 = 401,408 + 5,120 = 406,528
For a larger image (224x224x3 = 150,528 inputs)?
First layer alone: 150,528 x 512 = 77,070,336!
Additional problems with fully connected layers:
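- Loss of spatial structure: flattening the image discards which pixels are neighbors
- Sensitivity to position: the same cat on the left or right of the image produces a completely different input vector, so it has to be learned separately for each position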
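As a quick check of the arithmetic above, the following sketch counts the weights of a fully connected network from its layer sizes. The helper name `dense_weight_count` is made up for this illustration, and biases are left out to match the counts in the text.

```python
def dense_weight_count(layer_sizes):
    """Count weights (no biases) in a fully connected network.

    layer_sizes lists the number of units per layer, input first.
    """
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# 28x28 grayscale image → 512 hidden units → 10 outputs
mnist_weights = dense_weight_count([28 * 28, 512, 10])

# First layer only, for a 224x224 RGB image
large_first_layer = dense_weight_count([224 * 224 * 3, 512])

print(mnist_weights)       # 784*512 + 512*10
print(large_first_layer)   # 150,528 * 512
```

The count grows with the number of input pixels, which is what makes fully connected layers impractical on large images.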
1.2 CNN Core Ideas
| Concept | Description |
|---|---|
| Local Connectivity | Connect only small regions, not entire image → Drastically reduces parameters |
| Weight Sharing | Apply the same filter across the entire image → Far fewer parameters, and a feature is detected wherever it appears (translation equivariance) |
| Hierarchical Feature Learning | Lower layers: edges, corners / Higher layers: complex patterns, objects |
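Local connectivity and weight sharing can be seen in a small 1D sketch (1D is an assumption made here to keep the example short; images work the same way in 2D). Each output value depends on only 3 neighboring inputs, and the same 3 weights are reused at every position. Strictly speaking, this makes the output shift along with the input; tolerance to small shifts comes later from pooling.

```python
import numpy as np

signal = np.array([0, 0, 1, 2, 1, 0, 0, 0])
kernel = np.array([1, 0, -1])          # one shared set of 3 weights

# Slide the same kernel over every position (no padding).
response = np.correlate(signal, kernel, mode="valid")

# Shift the input one step to the right and apply the same kernel again.
shifted_signal = np.roll(signal, 1)
shifted_response = np.correlate(shifted_signal, kernel, mode="valid")

print(response)
print(shifted_response)

# The response pattern is unchanged, only moved one step to the right.
same_pattern = np.array_equal(shifted_response[1:], response[:-1])
print(same_pattern)
```

A fully connected layer would need a separate weight for every input-output pair here, while the shared kernel needs 3 weights in total.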
Image → [Conv] → [Pool] → [Conv] → [Pool] → [FC] → Output
↓ ↓ ↓ ↓
Edges Reduce Patterns Reduce → Classify
2. Convolution Layer
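A filter (kernel) slides across the image and extracts features
2.1 What is a Filter (Kernel)?
A filter is a small weight matrix. At each position it is multiplied element-wise with the image patch beneath it, and the products are summed into one output value: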
Image (5x5) Filter (3x3) Output (3x3)
┌─────────────────┐ ┌───────────┐ ┌───────────┐
│ 1 0 1 0 1 │ │ 1 0 1 │ │ ? ? ? │
│ 0 1 0 1 0 │ * │ 0 1 0 │ = │ ? ? ? │
│ 1 0 1 0 1 │ │ 1 0 1 │ │ ? ? ? │
│ 0 1 0 1 0 │ └───────────┘ └───────────┘
│ 1 0 1 0 1 │
└─────────────────┘
Output[i,j] = Σ(image region x filter) (sum of element-wise products)
2.2 Output Size Calculation
Input: (H, W, C)
Filter: (K, K, C)
Output: (H', W', F)
H' = floor((H - K + 2P) / S) + 1
W' = floor((W - K + 2P) / S) + 1   (P: padding, S: stride)
| Parameter | Description |
|---|---|
| Kernel Size | Filter size (3x3, 5x5) |
| Stride | Step size of the filter between positions |
| Padding | Pixels added around the border ('same' keeps the size at stride 1, 'valid' adds none) |
| Filters | Number of filters = Output channels |
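To make kernel size, stride, and padding concrete, here is a minimal NumPy sketch of the sliding-window sum, applied to the 5x5 image and 3x3 filter from 2.1. The function name `conv2d_single` is hypothetical, and the sketch assumes a single channel, a single square filter, and zero padding; a real Conv2D layer also sums over input channels and adds a bias.

```python
import numpy as np


def conv2d_single(image, kernel, stride=1, padding=0):
    """Naive single-channel 2D convolution (cross-correlation, as in CNN layers)."""
    if padding > 0:
        image = np.pad(image, padding)  # zero padding on every border
    k = kernel.shape[0]                 # square kernel assumed
    out_h = (image.shape[0] - k) // stride + 1
    out_w = (image.shape[1] - k) // stride + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            top, left = i * stride, j * stride
            patch = image[top:top + k, left:left + k]
            output[i, j] = np.sum(patch * kernel)  # sum of element-wise products
    return output


image = np.array([[1, 0, 1, 0, 1],
                  [0, 1, 0, 1, 0],
                  [1, 0, 1, 0, 1],
                  [0, 1, 0, 1, 0],
                  [1, 0, 1, 0, 1]])
kernel = np.array([[1, 0, 1],
                   [0, 1, 0],
                   [1, 0, 1]])

valid_out = conv2d_single(image, kernel)               # no padding, stride 1
same_out = conv2d_single(image, kernel, padding=1)     # padding 1 keeps 5x5
strided_out = conv2d_single(image, kernel, stride=2)   # stride 2 skips positions

print(valid_out)
print(same_out.shape, strided_out.shape)
```

With no padding the output is 3x3 and holds 5 wherever the patch matches the filter's X pattern and 0 elsewhere, which fills in the "?" entries of the diagram in 2.1.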
Output size calculation examples:
| Input | Kernel | Padding | Stride | Output |
|---|---|---|---|---|
| 28 | 3 | 0 | 1 | 26 |
| 28 | 3 | 1 | 1 | 28 (same) |
| 28 | 3 | 0 | 2 | 13 |
| 224 | 7 | 3 | 2 | 112 |
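The rows of the table above can be checked with a one-line helper. The name `conv_output_size` is an assumption of this sketch; integer division plays the role of the floor, which matters whenever the stride does not divide evenly (the last two rows).

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    """Output height (or width) of a convolution: floor((H - K + 2P) / S) + 1."""
    return (size - kernel + 2 * padding) // stride + 1


# (input, kernel, padding, stride) for each row of the table
rows = [(28, 3, 0, 1), (28, 3, 1, 1), (28, 3, 0, 2), (224, 7, 3, 2)]
sizes = [conv_output_size(*row) for row in rows]
print(sizes)  # [26, 28, 13, 112]
```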
from tensorflow.keras.layers import Conv2D
# 32 filters of size 3x3; 'same' padding keeps height and width at stride 1
Conv2D(filters=32, kernel_size=(3, 3), strides=1,
       padding='same', activation='relu')
3. Pooling Layer
Reduces spatial size and decreases computation
3.1 Max Pooling
Selects the maximum value in each region, keeping the strongest activation while reducing size:
Input (4x4) Max Pool (2x2, stride=2) Output (2x2)
┌─────────────────┐ ┌─────────┐
│ 1 3 │ 2 4 │ │ 6 8 │
│ 5 [6]│ 7 [8]│ → max of each region → │ 14 16 │
├────────┼────────┤ └─────────┘
│ 9 11 │ 10 12 │
│ 13 [14]│ 15 [16]│
└─────────────────┘
| Type | Description |
|---|---|
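| Max Pooling | Takes the maximum value in each window (most common) |
| Average Pooling | Takes the average value in each window |
| Global Average Pooling | Averages each feature map down to a single value (often replaces Flatten + FC) |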
Pooling Advantages:
- No parameters (no learning needed)
- Size reduction → Computation reduction
- Invariance to small position changes (Translation Invariance)
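The 4x4 example from 3.1 can be reproduced with a short NumPy sketch. The helper name `max_pool_2x2` is hypothetical, and it assumes a single-channel map with even height and width; global average pooling of one channel is simply its mean.

```python
import numpy as np


def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a single-channel map (even H and W assumed)."""
    h, w = feature_map.shape
    # Split into non-overlapping 2x2 blocks, then take the max inside each block.
    blocks = feature_map.reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3))


feature_map = np.array([[1, 3, 2, 4],
                        [5, 6, 7, 8],
                        [9, 11, 10, 12],
                        [13, 14, 15, 16]])

pooled = max_pool_2x2(feature_map)   # 4x4 → 2x2
global_avg = feature_map.mean()      # global average pooling: one number per channel

print(pooled)
print(global_avg)
```

Neither operation has weights to learn, which is why pooling layers add no parameters to the model.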
from tensorflow.keras.layers import MaxPooling2D, GlobalAveragePooling2D
MaxPooling2D(pool_size=(2, 2))   # Halves height and width (stride defaults to pool_size)
GlobalAveragePooling2D()         # (H, W, C) → (C,); can replace Flatten
4. CNN Basic Structure
4.1 Typical CNN Structure
Input Image (28x28x1)
↓
┌──────────────────┐
│ Conv (32 filters)│ → Feature extraction
│ ReLU │ → Non-linearity
│ MaxPool (2x2) │ → Size reduction (14x14x32)
└──────────────────┘
↓
┌──────────────────┐
│ Conv (64 filters)│ → More complex features
│ ReLU │
│ MaxPool (2x2) │ → (7x7x64)
└──────────────────┘
↓
┌──────────────────┐
│ Flatten │ → 7x7x64 = 3136
│ Dense (128) │ → Prepare for classification
│ ReLU │
│ Dense (10) │ → Number of classes
│ Softmax │ → Probability output
└──────────────────┘
from tensorflow.keras import models, layers
# Note: this variant uses the default 'valid' padding and a third Conv block,
# so its shapes differ from the diagram above (which assumes 'same' padding).
model = models.Sequential([
    # Conv Block 1
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # 26x26x32
    layers.MaxPooling2D((2, 2)),                                            # 13x13x32
    # Conv Block 2
    layers.Conv2D(64, (3, 3), activation='relu'),                           # 11x11x64
    layers.MaxPooling2D((2, 2)),                                            # 5x5x64
    # Conv Block 3
    layers.Conv2D(64, (3, 3), activation='relu'),                           # 3x3x64
    # Classifier
    layers.Flatten(),                                                       # 3x3x64 = 576
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])
4.2 Hierarchical Feature Learning
Lower Layers (Conv1) Middle Layers (Conv2-3) Higher Layers (Conv4-5)
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ ─ │ \ │ │ Eye │ Nose │ │ Cat │
│ | │ / │ → │ Mouth│ Ear │ → │ Dog │
│Edges│Corner │ │ Patterns │ │ Full Object │
└─────────────┘ └─────────────┘ └─────────────┘
5. Parameter Calculation
Conv2D(32, (3,3), input_shape=(28,28,1))
Parameters = (3 x 3 x 1 + 1) x 32 = 320
(kernel x channels + bias) x filters
CNN Parameter Calculation Example (MNIST, using the Conv → Pool → Conv → Pool → FC structure from 4.1):
| Layer | Calculation | Parameters |
|---|---|---|
| Conv1 (3x3, 1→32) | (3x3x1+1) x 32 | 320 |
| Conv2 (3x3, 32→64) | (3x3x32+1) x 64 | 18,496 |
| FC1 (3136→128) | 3136 x 128 + 128 | 401,536 |
| FC2 (128→10) | 128 x 10 + 10 | 1,290 |
| Total | | 421,642 |
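Comparison: a Dense-only network needs 784 x 512 = 401,408 weights in its first layer alone, while the two convolutional layers here need only 320 + 18,496 = 18,816 parameters thanks to weight sharing. Almost all of this CNN's parameters (401,536) sit in FC1, which is one reason Global Average Pooling is often used in place of Flatten + Dense.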
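The parameter table can be reproduced with two small helpers. The function names are hypothetical, and the sketch assumes 'same' padding so that the feature maps shrink 28x28 → 14x14 → 7x7 through the two pooling layers, as in the diagram in 4.1.

```python
def conv_params(kernel, in_channels, filters):
    """(kernel x kernel x in_channels weights + 1 bias) per filter."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(n_in, n_out):
    """One weight per input-output pair plus one bias per output."""
    return n_in * n_out + n_out


# Conv(32) → Pool → Conv(64) → Pool → Flatten → Dense(128) → Dense(10)
flat = 7 * 7 * 64  # size of the flattened feature map
counts = {
    "Conv1": conv_params(3, 1, 32),
    "Conv2": conv_params(3, 32, 64),
    "FC1": dense_params(flat, 128),
    "FC2": dense_params(128, 10),
}
total = sum(counts.values())
conv_share = (counts["Conv1"] + counts["Conv2"]) / total

print(counts)
print(total)
print(f"Conv layers hold {conv_share:.1%} of all parameters")
```

Note that the convolution counts do not depend on the image size at all, only on the kernel size and channel counts; the input resolution affects the total only through the flattened size feeding FC1.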
6. Representative Models
| Model | Year | Parameters | Key Idea |
|---|---|---|---|
| LeNet-5 | 1998 | 60K | Early CNN, handwritten digit recognition |
| AlexNet | 2012 | 60M | ImageNet winner, GPU training, ReLU/Dropout |
| VGG16 | 2014 | 138M | 3x3 filters only, deep network |
| GoogLeNet | 2014 | 6.8M | Inception module, 1x1 Conv |
| ResNet (ResNet-50) | 2015 | 25M | Skip connections, makes very deep networks trainable |
| EfficientNet (B0) | 2019 | 5.3M | Compound scaling of depth, width, and resolution; state-of-the-art accuracy at release |
| MobileNet | - | Lightweight | For mobile |
6.1 ResNet Skip Connection
x ──────────────┐
│ │
↓ │
┌─────┐ │
│Conv │ │
│ReLU │ │
│Conv │ │
└──┬──┘ │
│ │
↓ │
(+) ←───────────┘ F(x) + x
│
↓
ReLU
Skip connections mitigate the vanishing gradient problem by giving gradients a direct path around each block, which enables training of networks with 100+ layers.
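The block in the diagram can be sketched in a few lines of NumPy. As an assumption to keep the example short, plain matrix multiplications stand in for the two Conv layers, and the names `residual_block`, `w1`, and `w2` are illustrative only.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0)


def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x), with F(x) = w2 @ ReLU(w1 @ x)."""
    fx = w2 @ relu(w1 @ x)   # the residual branch F(x)
    return relu(fx + x)      # add the skip path, then the final ReLU


x = np.array([1.0, 2.0, 3.0])

# With all-zero weights the residual branch contributes nothing,
# so the block passes its (non-negative) input straight through.
zeros = np.zeros((3, 3))
identity_out = residual_block(x, zeros, zeros)

# With non-zero weights the block computes a correction on top of x.
rng = np.random.default_rng(0)
w1, w2 = rng.normal(size=(3, 3)), rng.normal(size=(3, 3))
out = residual_block(x, w1, w2)

print(identity_out)
print(out)
```

Because the output is F(x) + x, its derivative with respect to x contains an identity term in addition to the derivative of F, so the gradient reaching earlier layers does not have to pass through the block's weights.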
from tensorflow.keras.applications import (
    VGG16, ResNet50, InceptionV3, EfficientNetB0, MobileNetV2
)
7. Transfer Learning
Utilizing pre-trained models
| Strategy | Situation |
|---|---|
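| Feature Extraction | Small dataset: freeze the pre-trained base and train only the new classifier |
| Fine-tuning | Sufficient dataset: also unfreeze some (usually upper) base layers and train them with a low learning rate |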
from tensorflow.keras import models, layers
from tensorflow.keras.applications import VGG16
num_classes = 10  # placeholder: set to the number of classes in your dataset
# Load pre-trained model (ImageNet)
base_model = VGG16(weights='imagenet', include_top=False,
                   input_shape=(224, 224, 3))
# Feature Extraction: freeze the convolutional base
base_model.trainable = False
# Add new classifier
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation='softmax')
])
8. Data Augmentation
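Randomly transform the training images so the model sees more variation, which reduces overfitting
from tensorflow.keras.preprocessing.image import ImageDataGenerator
# Note: ImageDataGenerator is deprecated in recent TensorFlow releases;
# Keras preprocessing layers (e.g. RandomFlip, RandomRotation) are the suggested replacement.
datagen = ImageDataGenerator(
    rotation_range=20,        # rotate by up to 20 degrees
    width_shift_range=0.2,    # shift horizontally by up to 20% of the width
    height_shift_range=0.2,   # shift vertically by up to 20% of the height
    horizontal_flip=True,     # randomly mirror left-right
    zoom_range=0.2            # zoom in or out by up to 20%
)
# Train with augmented data (X_train, y_train, X_val, y_val are assumed to be prepared arrays)
model.fit(datagen.flow(X_train, y_train, batch_size=32),
          epochs=50, validation_data=(X_val, y_val))
9. CNN Design Guide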
9.1 Common Patterns
1. Filter count: Increase gradually
32 → 64 → 128 → 256
2. Spatial size: Decrease gradually
28 → 14 → 7 (via Pooling)
3. Filter size: 3x3 recommended (standard since VGG)
4. Activation: ReLU (after Conv), Softmax (output layer, for multi-class classification)
5. Regularization:
- BatchNorm: After Conv
- Dropout: FC layers (0.5)
9.2 Performance Improvement Tips
| Problem | Solution |
|---|---|
| Overfitting | Dropout, Data Augmentation, Early Stopping |
| Slow convergence | BatchNorm, Learning Rate Scheduler |
| Vanishing gradient | ReLU, Skip Connection (ResNet) |
| Memory shortage | Reduce batch size, use a lighter model (e.g. MobileNet) |
| Insufficient data | Transfer Learning, Data Augmentation |
Code Summary
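from tensorflow.keras import models, layers
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator
num_classes = 10  # placeholder: set to the number of classes in your dataset
# Transfer Learning with MobileNetV2
base_model = MobileNetV2(weights='imagenet', include_top=False,
                         input_shape=(224, 224, 3))
base_model.trainable = False  # freeze the pre-trained base
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(num_classes, activation='softmax')
])
model.compile(optimizer='adam',
              loss='categorical_crossentropy',  # expects one-hot encoded labels
              metrics=['accuracy'])
# Data Augmentation
datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,  # MobileNetV2 expects inputs scaled to [-1, 1], not [0, 1]
    rotation_range=20,
    horizontal_flip=True,
    validation_split=0.2
)
# Train (X_train: raw 0-255 images of shape (N, 224, 224, 3); y_train: one-hot labels)
# Note: the validation subset comes from the same generator, so it is augmented too
history = model.fit(
    datagen.flow(X_train, y_train, subset='training'),
    validation_data=datagen.flow(X_train, y_train, subset='validation'),
    epochs=30
)
Interview Questions Preview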
- What are the advantages of Convolution operation?
- When do you use Transfer Learning?
- What are the roles of Stride and Padding?
Check out more interview questions at Premium Interviews.
Practice Notebook
Additional notebook content:
- Manual implementation and visualization of Convolution operation
- Effects of various filters (vertical edge, horizontal edge, blur, sharpen)
- Max Pooling visualization and size change verification
- PyTorch code structure example