13. CNN (Convolutional Neural Network)

Convolution, Pooling, Transfer Learning

Learning Objectives

After completing this tutorial, you will understand:

CNN core concepts (Convolution, Pooling) and operation principles
Role of filters/kernels and feature extraction process
CNN architecture structure and parameter calculation
Transfer Learning concepts and application strategies
Characteristics of major CNN models (VGG, ResNet, EfficientNet)

1. Why CNN?

1.1 Limitations of Traditional Neural Networks

Problems when processing images with fully connected (Dense/FC) layers:

28x28 image = 784 pixels

Input(784) → Hidden(512) → Output(10)
Parameters: 784 x 512 + 512 x 10 = 406,528 (weights only, biases excluded)

For a larger image (224x224x3 = 150,528 inputs)?
First layer alone: 150,528 x 512 = 77,070,336 weights!
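
The counts above can be reproduced with a few lines of arithmetic. This is a minimal sketch: `dense_params` is a hypothetical helper introduced here, and its `bias` flag exists only because the figures above count weights without biases.

```python
def dense_params(n_in, n_out, bias=True):
    """Trainable parameters in one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


# 28x28 grayscale image, flattened: Input(784) -> Hidden(512) -> Output(10)
mnist_weights = (dense_params(784, 512, bias=False)
                 + dense_params(512, 10, bias=False))
mnist_total = dense_params(784, 512) + dense_params(512, 10)

# 224x224 RGB image: first hidden layer only
large_inputs = 224 * 224 * 3
large_first_layer = dense_params(large_inputs, 512, bias=False)

print(mnist_weights)      # 406528
print(mnist_total)        # 407050 (with biases)
print(large_inputs)       # 150528
print(large_first_layer)  # 77070336
```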

⚠️

Additional problems with fully connected layers:

Loss of spatial structure: Flattening discards which pixels are neighbors
Sensitivity to position: The same cat on the left or on the right produces a completely different input vector

1.2 CNN Core Ideas

Concept	Description
Local Connectivity	Connect only small regions, not entire image → Drastically reduces parameters
Weight Sharing	Apply same filter across entire image → Same feature is detected at any position (translation equivariance)
Hierarchical Feature Learning	Lower layers: edges, corners / Higher layers: complex patterns, objects

Image → [Conv] → [Pool] → [Conv] → [Pool] → [FC] → Output
          ↓          ↓          ↓         ↓
        Edges     Reduce    Patterns   Reduce    → Classify
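
Weight sharing can be illustrated with a small 1D sketch. The helper `conv1d_valid` and the toy signals are illustrative assumptions, not part of the tutorial's Keras code. The same two-weight filter produces the same response wherever the pattern appears, just shifted by the same amount.

```python
def conv1d_valid(signal, kernel):
    """Slide `kernel` over `signal` (stride 1, no padding), summing products."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


edge_filter = [-1, 1]                      # responds to a rising step
pattern_left = [0, 0, 1, 1, 0, 0, 0, 0]
pattern_right = [0, 0, 0, 0, 1, 1, 0, 0]   # same pattern, shifted right by 2

out_left = conv1d_valid(pattern_left, edge_filter)
out_right = conv1d_valid(pattern_right, edge_filter)

print(out_left)   # [0, 1, 0, -1, 0, 0, 0]
print(out_right)  # [0, 0, 0, 1, 0, -1, 0]
```

The filter has two weights no matter how long the signal is, and taking the maximum over each output (as pooling does) gives the same value for both inputs.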

2. Convolution Layer

A filter (kernel) slides across the image and extracts features

2.1 What is a Filter (Kernel)?

A filter is a small matrix of learned weights that slides over the image, producing one output value per position:

Image (5x5)             Filter (3x3)         Output (3x3)
┌─────────────────┐    ┌───────────┐    ┌───────────┐
│ 1  0  1  0  1 │    │ 1  0  1 │    │ ?  ?  ? │
│ 0  1  0  1  0 │ *  │ 0  1  0 │ =  │ ?  ?  ? │
│ 1  0  1  0  1 │    │ 1  0  1 │    │ ?  ?  ? │
│ 0  1  0  1  0 │    └───────────┘    └───────────┘
│ 1  0  1  0  1 │
└─────────────────┘

Output[i,j] = Σ(image region x filter)  (sum of element-wise product)
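
The operation can be worked through by hand for the 5x5 image and 3x3 filter shown above. This is a minimal pure-Python sketch; `conv2d_valid` is a name chosen here for illustration. As in most deep learning libraries, the kernel is not flipped, so strictly speaking this is cross-correlation.

```python
def conv2d_valid(image, kernel):
    """2D convolution with stride 1 and no padding on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


image = [
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

output = conv2d_valid(image, kernel)
for row in output:
    print(row)
# [5, 0, 5]
# [0, 5, 0]
# [5, 0, 5]
```

The output is large exactly where the image patch matches the filter's X-shaped pattern and zero where it does not.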

2.2 Output Size Calculation

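Input: (H, W, C)
Filter: (K, K, C), F filters in total
Output: (H', W', F)

H' = floor((H - K + 2P) / S) + 1
W' = floor((W - K + 2P) / S) + 1

where K = kernel size, P = padding, S = stride.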

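Parameter	Description
Kernel Size (K)	Filter height and width (e.g. 3x3, 5x5)
Stride (S)	Step size of the filter between positions
Padding (P)	Pixels added around the border ('valid': none, 'same': enough to keep the size at stride 1)
Filters (F)	Number of filters = Output channels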

Output size calculation examples:

Input	Kernel	Padding	Stride	Output
28	3	0	1	26
28	3	1	1	28 (same)
28	3	0	2	13
224	7	3	2	112
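
The rows of this table can be checked with a short function. This is a sketch, with `conv_output_size` as a hypothetical helper name. Integer division is used because a filter that does not fit completely at the border is dropped.

```python
def conv_output_size(size, kernel, padding=0, stride=1):
    """Output height or width of a convolution along one dimension."""
    return (size - kernel + 2 * padding) // stride + 1


# (input, kernel, padding, stride) for each row of the table
rows = [(28, 3, 0, 1), (28, 3, 1, 1), (28, 3, 0, 2), (224, 7, 3, 2)]
sizes = [conv_output_size(*row) for row in rows]

print(sizes)  # [26, 28, 13, 112]
```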

from tensorflow.keras.layers import Conv2D
 
Conv2D(filters=32, kernel_size=(3, 3), strides=1,
       padding='same', activation='relu')

3. Pooling Layer

Reduces spatial size, decreases computation

3.1 Max Pooling

Selects the maximum value in each region, keeping the strongest activation while reducing size:

Input (4x4)              Max Pool (2x2, stride=2)    Output (2x2)
┌─────────────────┐                               ┌─────────┐
│  1   3 │  2   4 │                               │  6   8  │
│  5  [6]│  7  [8]│     →   max of each region →  │ 14  16  │
├────────┼────────┤                               └─────────┘
│  9  11 │ 10  12 │
│ 13 [14]│ 15 [16]│
└─────────────────┘

Type	Description
Max Pooling	Maximum value in each window (most common)
Average Pooling	Mean value in each window
Global Average Pooling	Mean over each entire feature map, one value per channel (often replaces Flatten + FC)
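
The three pooling types can be compared on the 4x4 input from the diagram. This is a pure-Python sketch for a single channel; `pool2d`, `mean`, and the `reducer` argument are names introduced here for illustration.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def pool2d(feature_map, size=2, stride=2, reducer=max):
    """Apply `reducer` to each size x size window of a single-channel map."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [feature_map[i * stride + a][j * stride + b]
                      for a in range(size) for b in range(size)]
            row.append(reducer(window))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [9, 11, 10, 12],
    [13, 14, 15, 16],
]

max_pooled = pool2d(feature_map, reducer=max)
avg_pooled = pool2d(feature_map, reducer=mean)
global_avg = mean([v for row in feature_map for v in row])

print(max_pooled)  # [[6, 8], [14, 16]]
print(avg_pooled)  # [[3.75, 5.25], [11.75, 13.25]]
print(global_avg)  # 8.5
```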

Pooling Advantages:

No parameters (no learning needed)
Size reduction → Computation reduction
Invariance to small position changes (Translation Invariance)

from tensorflow.keras.layers import MaxPooling2D, GlobalAveragePooling2D
 
MaxPooling2D(pool_size=(2, 2))  # Halves height and width
GlobalAveragePooling2D()  # Averages each channel: (H, W, C) → (C,), used in place of Flatten

4. CNN Basic Structure

4.1 Typical CNN Structure

Input Image (28x28x1)
       ↓
┌──────────────────┐
│ Conv (32 filters)│ → Feature extraction (3x3, 'same' padding: 28x28x32)
│ ReLU             │ → Non-linearity
│ MaxPool (2x2)    │ → Size reduction (14x14x32)
└──────────────────┘
       ↓
┌──────────────────┐
│ Conv (64 filters)│ → More complex features (3x3, 'same' padding: 14x14x64)
│ ReLU             │
│ MaxPool (2x2)    │ → (7x7x64)
└──────────────────┘
       ↓
┌──────────────────┐
│ Flatten          │ → 7x7x64 = 3136
│ Dense (128)      │ → Prepare for classification
│ ReLU             │
│ Dense (10)       │ → Number of classes
│ Softmax          │ → Probability output
└──────────────────┘

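from tensorflow.keras import models, layers

# Variant of the diagram above: default padding='valid' and a third Conv layer,
# so the feature maps shrink faster (output shape shown per layer)
model = models.Sequential([
    # Conv Block 1
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # (26, 26, 32)
    layers.MaxPooling2D((2, 2)),                                            # (13, 13, 32)

    # Conv Block 2
    layers.Conv2D(64, (3, 3), activation='relu'),  # (11, 11, 64)
    layers.MaxPooling2D((2, 2)),                   # (5, 5, 64)

    # Conv Block 3
    layers.Conv2D(64, (3, 3), activation='relu'),  # (3, 3, 64)

    # Classifier
    layers.Flatten(),                        # 3 x 3 x 64 = 576
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),                     # Regularization
    layers.Dense(10, activation='softmax')   # One probability per class
])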
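
The diagram in 4.1 and the Keras model above end with different feature-map sizes, because the diagram's sizes assume 'same' padding while the code uses Keras's default 'valid' padding and adds a third Conv layer. The sketch below traces both; `conv_out`, `pool_out`, `trace_shapes`, and `flattened` are hypothetical helpers, and a 3x3 kernel with stride 1 is assumed throughout.

```python
def conv_out(size, kernel=3, padding=0, stride=1):
    """Spatial size after a convolution."""
    return (size - kernel + 2 * padding) // stride + 1


def pool_out(size, pool=2):
    """Spatial size after pooling with stride equal to the pool size."""
    return size // pool


def trace_shapes(input_size, stack, padding):
    """Return the (H, W, C) shape after each ("conv", filters) or ("pool", None)."""
    size, channels, shapes = input_size, 1, []
    for kind, filters in stack:
        if kind == "conv":
            size, channels = conv_out(size, padding=padding), filters
        else:
            size = pool_out(size)
        shapes.append((size, size, channels))
    return shapes


def flattened(shape):
    """Number of values after Flatten."""
    h, w, c = shape
    return h * w * c


# Diagram in 4.1: two conv blocks, 'same' padding (P=1 for a 3x3 kernel)
diagram = trace_shapes(28, [("conv", 32), ("pool", None),
                            ("conv", 64), ("pool", None)], padding=1)

# Keras code above: three conv layers, default 'valid' padding (P=0)
keras_model = trace_shapes(28, [("conv", 32), ("pool", None),
                                ("conv", 64), ("pool", None),
                                ("conv", 64)], padding=0)

print(diagram[-1], flattened(diagram[-1]))          # (7, 7, 64) 3136
print(keras_model[-1], flattened(keras_model[-1]))  # (3, 3, 64) 576
```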

4.2 Hierarchical Feature Learning

Lower Layers (Conv1)    Middle Layers (Conv2-3)    Higher Layers (Conv4-5)
┌─────────────┐          ┌─────────────┐          ┌─────────────┐
│  ─  │  \   │          │ Eye │ Nose  │          │    Cat      │
│  |  │  /   │    →     │ Mouth│ Ear  │    →     │    Dog      │
│Edges│Corner │          │ Patterns    │          │ Full Object │
└─────────────┘          └─────────────┘          └─────────────┘

5. Parameter Calculation

Conv2D(32, (3,3), input_shape=(28,28,1))
Parameters = (3 x 3 x 1 + 1) x 32 = 320
             (kernel x channels + bias) x filters

CNN Parameter Calculation Example (MNIST):

Layer	Calculation	Parameters
Conv1 (3x3, 1→32)	(3x3x1+1) x 32	320
Conv2 (3x3, 32→64)	(3x3x32+1) x 64	18,496
FC1 (3136→128)	3136 x 128 + 128	401,536
FC2 (128→10)	128 x 10 + 10	1,290
Total		421,642
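
The table can be recomputed programmatically. This is a sketch with two hypothetical helpers, `conv2d_params` and `dense_params`. The last lines show, as an illustration only, how small FC1 would become if Global Average Pooling replaced Flatten.

```python
def conv2d_params(kernel, in_channels, filters):
    """(kernel x kernel x channels + bias) x filters."""
    return (kernel * kernel * in_channels + 1) * filters


def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out


counts = {
    "Conv1": conv2d_params(3, 1, 32),
    "Conv2": conv2d_params(3, 32, 64),
    "FC1": dense_params(7 * 7 * 64, 128),
    "FC2": dense_params(128, 10),
}
total = sum(counts.values())
conv_only = counts["Conv1"] + counts["Conv2"]

# Illustration: Global Average Pooling feeds 64 values (one per channel) into FC1
fc1_with_gap = dense_params(64, 128)

print(counts)        # {'Conv1': 320, 'Conv2': 18496, 'FC1': 401536, 'FC2': 1290}
print(total)         # 421642
print(conv_only)     # 18816
print(fc1_with_gap)  # 8320
```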

💡

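Comparison: A single Dense layer from 784 inputs to 512 units already needs 784 x 512 = 401,408 weights, while the two convolutional layers above need only 320 + 18,496 = 18,816 parameters thanks to weight sharing. Most of this CNN's parameters (401,536) sit in FC1 after Flatten, which is one reason Global Average Pooling is often used instead.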

6. Representative Models

Model	Year	Parameters	Key Idea
LeNet-5	1998	60K	Early CNN for handwritten digit recognition
AlexNet	2012	60M	ImageNet winner, GPU training, ReLU/Dropout
VGG16	2014	138M	3x3 filters only, deep network
GoogLeNet	2014	6.8M	Inception module, 1x1 Conv
ResNet-50	2015	25M	Skip connections make very deep networks trainable
EfficientNet-B0	2019	5.3M	Compound scaling of depth, width, and resolution; SOTA at release
MobileNet (v1)	2017	4.2M	Lightweight, depthwise separable convolutions, for mobile

6.1 ResNet Skip Connection

  x ──────────────┐
  │               │
  ↓               │
┌─────┐           │
│Conv │           │
│ReLU │           │
│Conv │           │
└──┬──┘           │
   │              │
   ↓              │
  (+) ←───────────┘   F(x) + x
   │
   ↓
 ReLU

Skip connections give gradients a direct identity path around each block, which mitigates the vanishing gradient problem and makes networks with 100+ layers trainable.
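
Why the identity path helps can be seen in a deliberately simplified scalar sketch. Assume every block's transformation F has local derivative 0.1 (a made-up value). By the chain rule, a plain stack multiplies these derivatives, while a residual block y = F(x) + x contributes F'(x) + 1 instead. Real networks involve Jacobian matrices and nonlinearities, so this shows the trend rather than a model of ResNet itself.

```python
def plain_gradient(layer_grads):
    """d(output)/d(input) for stacked layers y = F(x): product of F'(x)."""
    grad = 1.0
    for d in layer_grads:
        grad *= d
    return grad


def residual_gradient(layer_grads):
    """Same, for residual blocks y = F(x) + x: product of (F'(x) + 1)."""
    grad = 1.0
    for d in layer_grads:
        grad *= d + 1.0
    return grad


layer_grads = [0.1] * 20  # assumed local derivative of F in each of 20 blocks

plain = plain_gradient(layer_grads)
residual = residual_gradient(layer_grads)

print(plain)     # about 1e-20: the signal has effectively vanished
print(residual)  # about 6.7: the identity path keeps the gradient alive
```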

from tensorflow.keras.applications import (
    VGG16, ResNet50, InceptionV3, EfficientNetB0, MobileNetV2
)

7. Transfer Learning

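Reusing models pre-trained on a large dataset (e.g. ImageNet)

Strategy	What is trained	Situation
Feature Extraction	Only the new classifier, base model frozen	Small dataset
Fine-tuning	New classifier plus some or all base layers, at a low learning rate	Sufficient dataset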

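from tensorflow.keras import models, layers
from tensorflow.keras.applications import VGG16

num_classes = 10  # placeholder: set to the number of classes in your dataset

# Load pre-trained model (ImageNet) without its original classifier
base_model = VGG16(weights='imagenet', include_top=False,
                   input_shape=(224, 224, 3))

# Feature Extraction: freeze the pre-trained convolutional base
base_model.trainable = False

# Add new classifier
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation='softmax')
])

# Fine-tuning (when data is sufficient): after training the new classifier,
# set base_model.trainable = True, optionally re-freeze the early layers,
# and recompile with a low learning rate before continuing training.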

8. Data Augmentation

Generate randomly transformed copies of the training images to reduce overfitting

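# Note: ImageDataGenerator is deprecated in recent TensorFlow releases;
# Keras preprocessing layers (e.g. RandomFlip, RandomRotation) are the suggested replacement
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,       # Random rotation up to ±20 degrees
    width_shift_range=0.2,   # Horizontal shift up to 20% of the width
    height_shift_range=0.2,  # Vertical shift up to 20% of the height
    horizontal_flip=True,    # Random left-right flip
    zoom_range=0.2           # Random zoom between 0.8x and 1.2x
)

# Train with augmented data
# (model, X_train, y_train, X_val, y_val are assumed to exist already)
model.fit(datagen.flow(X_train, y_train, batch_size=32),
          epochs=50, validation_data=(X_val, y_val))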
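
What two of these transformations do to the pixel grid can be sketched without Keras. `horizontal_flip` and `shift_right` are hypothetical helpers operating on a nested list. As a simplifying assumption, the shift fills vacated pixels with a constant, whereas ImageDataGenerator's default fill mode repeats the nearest pixel.

```python
def horizontal_flip(image):
    """Mirror each row left to right."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Shift the image right by `pixels`, filling the gap with `fill`."""
    width = len(image[0])
    return [[fill] * pixels + row[:width - pixels] for row in image]


image = [
    [1, 2, 3],
    [4, 5, 6],
]

flipped = horizontal_flip(image)
shifted = shift_right(image, 1)

print(flipped)  # [[3, 2, 1], [6, 5, 4]]
print(shifted)  # [[0, 1, 2], [0, 4, 5]]
```

The label stays the same for every transformed copy, which is why the model sees more variety without any new annotation work.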

9. CNN Design Guide

9.1 Common Patterns

1. Filter count: Increase gradually
   32 → 64 → 128 → 256

2. Spatial size: Decrease gradually
   28 → 14 → 7 (via Pooling)

3. Filter size: 3x3 recommended (standard since VGG)

4. Activation: ReLU (after Conv), Softmax (output layer for multi-class classification)

5. Regularization:
   - BatchNorm: After Conv
   - Dropout: FC layers (0.5)

9.2 Performance Improvement Tips

Problem	Solution
Overfitting	Dropout, Data Augmentation, Early Stopping
Slow convergence	BatchNorm, Learning Rate Scheduler
Vanishing gradient	ReLU, Skip Connection (ResNet)
Out of memory	Reduce batch size, use a lighter model (e.g. MobileNet)
Insufficient data	Transfer Learning, Data Augmentation

Code Summary

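from tensorflow.keras import models, layers
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.applications.mobilenet_v2 import preprocess_input
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Assumed to exist: X_train of shape (N, 224, 224, 3) with pixel values in 0-255,
# and y_train one-hot encoded with shape (N, num_classes)
num_classes = 10  # placeholder: set to the number of classes in your dataset

# Transfer Learning with MobileNetV2
base_model = MobileNetV2(weights='imagenet', include_top=False,
                         input_shape=(224, 224, 3))
base_model.trainable = False  # Feature extraction: freeze pre-trained weights

model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(num_classes, activation='softmax')
])

model.compile(optimizer='adam',
              loss='categorical_crossentropy',  # expects one-hot labels
              metrics=['accuracy'])

# Data Augmentation
# MobileNetV2's ImageNet weights expect inputs scaled to [-1, 1],
# so its preprocess_input is used here instead of rescale=1./255
datagen = ImageDataGenerator(
    preprocessing_function=preprocess_input,
    rotation_range=20,
    horizontal_flip=True,
    validation_split=0.2
)

# Train (note: the validation subset passes through the same augmentation)
history = model.fit(
    datagen.flow(X_train, y_train, subset='training'),
    validation_data=datagen.flow(X_train, y_train, subset='validation'),
    epochs=30
)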

Interview Questions Preview

What are the advantages of Convolution operation?
When do you use Transfer Learning?
What are the roles of Stride and Padding?

Practice Notebook

💡

Additional notebook content:

Manual implementation and visualization of Convolution operation
Effects of various filters (vertical edge, horizontal edge, blur, sharpen)
Max Pooling visualization and size change verification
PyTorch code structure example
