13. CNN (Convolutional Neural Network)

Convolution, Pooling, Transfer Learning


Learning Objectives

After completing this tutorial, you will understand:

  • CNN core concepts (Convolution, Pooling) and operation principles
  • Role of filters/kernels and feature extraction process
  • CNN architecture structure and parameter calculation
  • Transfer Learning concepts and application strategies
  • Characteristics of major CNN models (VGG, ResNet, EfficientNet)

1. Why CNN?

1.1 Limitations of Traditional Neural Networks

Problems when processing images with fully connected (Dense/FC) layers:

28x28 image = 784 pixels

Input(784) → Hidden(512) → Output(10)
Parameters: 784 x 512 + 512 x 10 = 406,528 (weights only, biases excluded)

For a larger image (224x224x3)?
150,528 x 512 = 77,070,336 weights in the first layer alone!

Additional problems with fully connected layers:

  • Loss of spatial structure: flattening the image discards which pixels are neighbors
  • Sensitivity to position: the same cat on the left or on the right looks like an unrelated input
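
The parameter counts in this section can be reproduced with a few lines of arithmetic. The sketch below is only an illustration; `dense_params` is a hypothetical helper name, not part of any library.

```python
def dense_params(n_in, n_out, bias=True):
    """Weights (n_in * n_out) plus optional biases (n_out) of one Dense layer."""
    return n_in * n_out + (n_out if bias else 0)

# 28x28 grayscale image flattened to 784 inputs: 784 -> 512 -> 10
mnist_weights = dense_params(784, 512, bias=False) + dense_params(512, 10, bias=False)
mnist_total = dense_params(784, 512) + dense_params(512, 10)

# 224x224 RGB image flattened to 150,528 inputs (first layer weights only)
large_first_layer = dense_params(224 * 224 * 3, 512, bias=False)

print(mnist_weights)      # 406528
print(mnist_total)        # 407050
print(large_first_layer)  # 77070336
```

The first-layer cost grows linearly with the number of pixels, which is why a fully connected first layer becomes impractical for large images.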

1.2 CNN Core Ideas

| Concept | Description |
|---|---|
| Local Connectivity | Connect only small regions, not the entire image → drastically reduces parameters |
| Weight Sharing | Apply the same filter across the entire image → the same feature is detected wherever it appears (translation equivariance) |
| Hierarchical Feature Learning | Lower layers: edges, corners / Higher layers: complex patterns, objects |

Image → [Conv] → [Pool] → [Conv] → [Pool] → [FC] → Output
          ↓          ↓          ↓         ↓
        Edges     Reduce    Patterns   Reduce    → Classify
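
Local connectivity and weight sharing can be illustrated in one dimension. The following is a minimal sketch, assuming a hypothetical two-value edge kernel and a toy signal (neither comes from the tutorial): one small kernel is reused at every position, so moving the pattern in the input simply moves the response in the output.

```python
def correlate1d(signal, kernel):
    """Slide one shared kernel over the signal (valid positions only)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

kernel = [-1, 1]                      # responds to a rising or falling step (an "edge")
signal = [0, 0, 1, 1, 1, 0, 0, 0]
shifted = [0, 0, 0, 0, 1, 1, 1, 0]    # same pattern moved 2 positions to the right

out = correlate1d(signal, kernel)
out_shifted = correlate1d(shifted, kernel)

print(out)          # [0, 1, 0, 0, -1, 0, 0]
print(out_shifted)  # [0, 0, 0, 1, 0, 0, -1]

# Parameter comparison for mapping 8 inputs to 7 outputs
shared_weights = len(kernel)               # 2 weights, reused at every position
dense_weights = len(signal) * len(out)     # 56 weights for a fully connected mapping
print(shared_weights, dense_weights)       # 2 56
```

The response pattern is identical in both outputs, only shifted by two positions, and the layer needs 2 weights instead of 56.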

2. Convolution Layer

Filter (kernel) slides across image extracting features

2.1 What is a Filter (Kernel)?

A filter is a small weight matrix that slides over the image; at each position it produces one output value:

Image (5x5)             Filter (3x3)         Output (3x3)
┌─────────────────┐    ┌───────────┐    ┌───────────┐
│ 1  0  1  0  1 │    │ 1  0  1 │    │ ?  ?  ? │
│ 0  1  0  1  0 │ *  │ 0  1  0 │ =  │ ?  ?  ? │
│ 1  0  1  0  1 │    │ 1  0  1 │    │ ?  ?  ? │
│ 0  1  0  1  0 │    └───────────┘    └───────────┘
│ 1  0  1  0  1 │
└─────────────────┘

Output[i,j] = Σ(image region x filter)  (sum of element-wise product)
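
The question marks in the diagram can be filled in by applying the formula directly. The sketch below is a plain-Python illustration (no padding, stride 1); `conv2d_valid` is a hypothetical helper name. Like deep learning frameworks, it does not flip the kernel, so strictly speaking it computes a cross-correlation.

```python
def conv2d_valid(image, kernel):
    """Sum of element-wise products of the kernel and each image region."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(kh) for dj in range(kw))
             for j in range(out_w)]
            for i in range(out_h)]

image = [
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [1, 0, 1, 0, 1],
]
kernel = [
    [1, 0, 1],
    [0, 1, 0],
    [1, 0, 1],
]

output = conv2d_valid(image, kernel)
for row in output:
    print(row)
# [5, 0, 5]
# [0, 5, 0]
# [5, 0, 5]
```

The output is large (5) wherever the image region matches the filter pattern and 0 where it does not, which is what "feature extraction" means in practice.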

2.2 Output Size Calculation

Input: (H, W, C)
Filter: (K, K, C)
Output: (H', W', F)

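H' = floor((H - K + 2P) / S) + 1
W' = floor((W - K + 2P) / S) + 1

| Parameter | Description |
|---|---|
| Kernel Size (K) | Filter size (3x3, 5x5) |
| Stride (S) | Step size of the filter as it moves |
| Padding (P) | Border padding ('valid' = none, 'same' = output size matches input at stride 1) |
| Filters (F) | Number of filters = output channels |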

Output size calculation examples:

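| Input | Kernel | Padding | Stride | Output |
|---|---|---|---|---|
| 28 | 3 | 0 | 1 | 26 |
| 28 | 3 | 1 | 1 | 28 (same) |
| 28 | 3 | 0 | 2 | 13 |
| 224 | 7 | 3 | 2 | 112 |
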
from tensorflow.keras.layers import Conv2D
 
Conv2D(filters=32, kernel_size=(3, 3), strides=1,
       padding='same', activation='relu')
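
The output size formula and the example table can be checked with a short function. This is a sketch with a hypothetical helper name (`conv_output_size`); it rounds down when the division is not exact, which is what the stride-2 rows of the table require.

```python
def conv_output_size(n, k, p=0, s=1):
    """Output length along one dimension: floor((n - k + 2p) / s) + 1."""
    return (n - k + 2 * p) // s + 1

# (input, kernel, padding, stride) for each row of the examples table
examples = [(28, 3, 0, 1), (28, 3, 1, 1), (28, 3, 0, 2), (224, 7, 3, 2)]
sizes = [conv_output_size(*e) for e in examples]

print(sizes)  # [26, 28, 13, 112]
```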

3. Pooling Layer

Reduces spatial size, decreases computation

3.1 Max Pooling

Selects maximum value in region to emphasize features + reduce size:

Input (4x4)              Max Pool (2x2, stride=2)    Output (2x2)
┌─────────────────┐                               ┌─────────┐
│  1   3 │  2   4 │                               │  6   8  │
│  5  [6]│  7  [8]│     →   max of each region →  │ 14  16  │
├────────┼────────┤                               └─────────┘
│  9  11 │ 10  12 │
│ 13 [14]│ 15 [16]│
└─────────────────┘

| Type | Description |
|---|---|
| Max Pooling | Select max value (common) |
| Average Pooling | Average value |
| Global Average Pooling | Global average (replaces FC) |
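
The 4x4 max pooling example above can be reproduced in plain Python. This is a minimal sketch with a hypothetical helper name (`max_pool2d`), not how a framework implements pooling.

```python
def max_pool2d(x, size=2, stride=2):
    """Take the maximum of each size x size window, moving by `stride`."""
    out_h = (len(x) - size) // stride + 1
    out_w = (len(x[0]) - size) // stride + 1
    return [[max(x[i * stride + di][j * stride + dj]
                 for di in range(size) for dj in range(size))
             for j in range(out_w)]
            for i in range(out_h)]

x = [
    [1, 3, 2, 4],
    [5, 6, 7, 8],
    [9, 11, 10, 12],
    [13, 14, 15, 16],
]

pooled = max_pool2d(x)
print(pooled)  # [[6, 8], [14, 16]]
```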

Pooling Advantages:

  • No parameters (no learning needed)
  • Size reduction → Computation reduction
  • Invariance to small position changes (Translation Invariance)

from tensorflow.keras.layers import MaxPooling2D, GlobalAveragePooling2D

MaxPooling2D(pool_size=(2, 2))  # Halves height and width
GlobalAveragePooling2D()        # Averages each feature map to one value; can replace Flatten
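
To see why Global Average Pooling is often used instead of Flatten, it helps to compare the size of the classifier that follows. The sketch below is an illustration under assumptions: the feature maps are stored channels-first as nested lists for readability (Keras defaults to channels-last), and the 7x7x64 feature map and 128-unit Dense layer are example sizes.

```python
def global_average_pool(feature_maps):
    """feature_maps: list of C channels, each an H x W nested list -> C averages."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
            for ch in feature_maps]

# Two tiny 2x2 feature maps
maps = [
    [[1, 3], [5, 7]],
    [[2, 2], [2, 2]],
]
gap = global_average_pool(maps)
print(gap)  # [4.0, 2.0]

# Parameters of a Dense(128) layer placed after a 7x7x64 feature map
h, w, c, units = 7, 7, 64, 128
after_flatten = h * w * c * units + units   # Flatten gives 3136 inputs
after_gap = c * units + units               # GAP gives 64 inputs
print(after_flatten, after_gap)  # 401536 8320
```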

4. CNN Basic Structure

4.1 Typical CNN Structure

Input Image (28x28x1)

┌──────────────────┐
│ Conv (32 filters)│ → Feature extraction
│ ReLU             │ → Non-linearity
│ MaxPool (2x2)    │ → Size reduction (14x14x32)
└──────────────────┘

┌──────────────────┐
│ Conv (64 filters)│ → More complex features
│ ReLU             │
│ MaxPool (2x2)    │ → (7x7x64)
└──────────────────┘

┌──────────────────┐
│ Flatten          │ → 7x7x64 = 3136
│ Dense (128)      │ → Prepare for classification
│ ReLU             │
│ Dense (10)       │ → Number of classes
│ Softmax          │ → Probability output
└──────────────────┘

from tensorflow.keras import models, layers

# Note: this variant uses the default 'valid' padding and a third Conv block,
# so its feature-map sizes differ from the diagram above.
model = models.Sequential([
    # Conv Block 1
    layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)),  # → 26x26x32
    layers.MaxPooling2D((2, 2)),                                            # → 13x13x32

    # Conv Block 2
    layers.Conv2D(64, (3, 3), activation='relu'),                           # → 11x11x64
    layers.MaxPooling2D((2, 2)),                                            # → 5x5x64

    # Conv Block 3
    layers.Conv2D(64, (3, 3), activation='relu'),                           # → 3x3x64

    # Classifier
    layers.Flatten(),                                                       # → 576
    layers.Dense(64, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(10, activation='softmax')
])

4.2 Hierarchical Feature Learning

Lower Layers (Conv1)    Middle Layers (Conv2-3)    Higher Layers (Conv4-5)
┌─────────────┐          ┌─────────────┐          ┌─────────────┐
│  ─  │  \   │          │ Eye │ Nose  │          │    Cat      │
│  |  │  /   │    →     │ Mouth│ Ear  │    →     │    Dog      │
│Edges│Corner │          │ Patterns    │          │ Full Object │
└─────────────┘          └─────────────┘          └─────────────┘

5. Parameter Calculation

Conv2D(32, (3,3), input_shape=(28,28,1))
Parameters = (3 x 3 x 1 + 1) x 32 = 320
             (kernel x channels + bias) x filters

CNN Parameter Calculation Example (MNIST):

| Layer | Calculation | Parameters |
|---|---|---|
| Conv1 (3x3, 1→32) | (3x3x1+1) x 32 | 320 |
| Conv2 (3x3, 32→64) | (3x3x32+1) x 64 | 18,496 |
| FC1 (3136→128) | 3136 x 128 + 128 | 401,536 |
| FC2 (128→10) | 128 x 10 + 10 | 1,290 |
| Total | | 421,642 |

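Comparison: a single Dense layer from 784 inputs to 512 units already needs 784 x 512 = 401,408 weights. Thanks to weight sharing, the two Conv layers above need only 320 + 18,496 = 18,816 parameters; most of this CNN's parameters sit in FC1.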
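
The table above can be verified by tracing shapes and counting parameters layer by layer. The sketch below assumes the network from the diagram in section 4.1: two 3x3 Conv layers with 'same' padding, each followed by 2x2 max pooling, then Dense(128) and Dense(10). The helper names are hypothetical.

```python
def conv_params(k, c_in, filters):
    """(kernel x kernel x input channels + bias) x filters."""
    return (k * k * c_in + 1) * filters

def dense_params(n_in, n_out):
    """Weights plus one bias per output unit."""
    return n_in * n_out + n_out

shape = (28, 28, 1)   # (height, width, channels)
counts = []
for n_filters in (32, 64):
    h, w, c = shape
    counts.append(conv_params(3, c, n_filters))  # 'same' padding keeps h and w
    shape = (h // 2, w // 2, n_filters)          # 2x2 max pooling halves h and w

flat = shape[0] * shape[1] * shape[2]            # Flatten
counts.append(dense_params(flat, 128))
counts.append(dense_params(128, 10))
total = sum(counts)

print(shape, flat)  # (7, 7, 64) 3136
print(counts)       # [320, 18496, 401536, 1290]
print(total)        # 421642
```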


6. Representative Models

| Model | Year | Parameters | Key Idea |
|---|---|---|---|
| LeNet-5 | 1998 | 60K | CNN beginning, Handwriting recognition |
| AlexNet | 2012 | 60M | ImageNet winner, GPU training, ReLU/Dropout |
| VGG16 | 2014 | 138M | 3x3 filters only, Deep network |
| GoogLeNet | 2014 | 6.8M | Inception module, 1x1 Conv |
| ResNet | 2015 | 25M | Skip Connection, Very deep learning possible |
| EfficientNet | 2019 | 5.3M | Efficient scaling, SOTA performance |
| MobileNet | - | Lightweight | For mobile |

6.1 ResNet Skip Connection

  x ──────────────┐
  │               │
  ↓               │
┌─────┐           │
│Conv │           │
│ReLU │           │
│Conv │           │
└──┬──┘           │
   │              │
   ↓              │
  (+) ←───────────┘   F(x) + x
   │
   ↓
  ReLU

Skip connections mitigate the vanishing gradient problem: the identity path lets gradients flow directly to earlier layers, which makes it possible to train networks with 100+ layers.
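
The F(x) + x computation in the diagram can be sketched without any framework. The example below is a simplification under assumptions: the two Conv layers are replaced by small bias-free linear maps on a vector, and all names are hypothetical. It shows one useful property of the design: when F(x) outputs zero, the block passes its (non-negative) input through unchanged, so adding a block does not have to make the network worse.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def linear(v, weights):
    """Multiply a vector by a square weight matrix (nested lists, no bias)."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    fx = linear(relu(linear(x, w1)), w2)        # F(x): layer -> ReLU -> layer
    return relu([f + a for f, a in zip(fx, x)])  # ReLU(F(x) + x)

x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

out = residual_block(x, zeros, zeros)
print(out)  # [1.0, 2.0, 3.0]
```

Because the output is F(x) + x, its derivative with respect to x contains an identity term in addition to the derivative of F, which is the path gradients use to reach earlier layers.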

from tensorflow.keras.applications import (
    VGG16, ResNet50, InceptionV3, EfficientNetB0, MobileNetV2
)

7. Transfer Learning

Utilizing pre-trained models

| Strategy | Situation |
|---|---|
| Feature Extraction | Small dataset |
| Fine-tuning | Sufficient dataset |

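from tensorflow.keras import layers, models
from tensorflow.keras.applications import VGG16

num_classes = 10  # placeholder: set to the number of classes in your dataset

# Load pre-trained model (ImageNet weights, without the original classifier head)
base_model = VGG16(weights='imagenet', include_top=False,
                   input_shape=(224, 224, 3))

# Feature Extraction: freeze the convolutional base so its weights are not updated
base_model.trainable = False

# Add new classifier
model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(256, activation='relu'),
    layers.Dropout(0.5),
    layers.Dense(num_classes, activation='softmax')
])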
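
The code above covers the Feature Extraction strategy. For Fine-tuning, one common approach is to unfreeze only the last few layers of the base and train with a small learning rate. The following is a sketch under assumptions, not a prescribed recipe: `build_finetune_model`, the choice of 4 unfrozen layers (the last VGG16 block), and the learning rate of 1e-5 are all illustrative.

```python
from tensorflow.keras import layers, models, optimizers
from tensorflow.keras.applications import VGG16

def build_finetune_model(num_classes, n_unfrozen=4, weights="imagenet"):
    """Return (base, model) with only the last `n_unfrozen` base layers trainable."""
    base = VGG16(weights=weights, include_top=False, input_shape=(224, 224, 3))

    # Make everything trainable first, then freeze all but the last layers
    base.trainable = True
    for layer in base.layers[:len(base.layers) - n_unfrozen]:
        layer.trainable = False

    model = models.Sequential([
        base,
        layers.GlobalAveragePooling2D(),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ])

    # A small learning rate keeps the pre-trained weights from changing too fast
    model.compile(optimizer=optimizers.Adam(learning_rate=1e-5),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return base, model

# Example usage (downloads the ImageNet weights on first call):
# base, model = build_finetune_model(num_classes=10)
```

In practice, fine-tuning is usually started after the new classifier has first been trained with the base frozen, so that large early gradients do not disturb the pre-trained features.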

8. Data Augmentation

Augment training data to prevent overfitting

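from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=20,        # random rotation of up to 20 degrees
    width_shift_range=0.2,    # random horizontal shift, up to 20% of the width
    height_shift_range=0.2,   # random vertical shift, up to 20% of the height
    horizontal_flip=True,     # random left-right flip
    zoom_range=0.2            # random zoom in the range [0.8, 1.2]
)

# Train with augmented data (X_train, y_train, X_val, y_val are your own arrays)
model.fit(datagen.flow(X_train, y_train, batch_size=32),
          epochs=50, validation_data=(X_val, y_val))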

9. CNN Design Guide

9.1 Common Patterns

1. Filter count: Increase gradually
   32 → 64 → 128 → 256

2. Spatial size: Decrease gradually
   28 → 14 → 7 (via Pooling)

3. Filter size: 3x3 recommended (standard since VGG)

4. Activation: ReLU (after Conv), Softmax (last)

5. Regularization:
   - BatchNorm: After Conv
   - Dropout: FC layers (0.5)

9.2 Performance Improvement Tips

| Problem | Solution |
|---|---|
| Overfitting | Dropout, Data Augmentation, Early Stopping |
| Slow convergence | BatchNorm, Learning Rate Scheduler |
| Vanishing gradient | ReLU, Skip Connection (ResNet) |
| Memory shortage | Reduce Batch Size, Model lightweighting |
| Insufficient data | Transfer Learning, Data Augmentation |

Code Summary

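from tensorflow.keras import models, layers
from tensorflow.keras.applications import MobileNetV2
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Placeholders: X_train holds images of shape (N, 224, 224, 3) and
# y_train holds one-hot labels of shape (N, num_classes).
num_classes = 10

# Transfer Learning with MobileNetV2
base_model = MobileNetV2(weights='imagenet', include_top=False,
                         input_shape=(224, 224, 3))
base_model.trainable = False  # freeze the pre-trained base

model = models.Sequential([
    base_model,
    layers.GlobalAveragePooling2D(),
    layers.Dense(128, activation='relu'),
    layers.Dropout(0.3),
    layers.Dense(num_classes, activation='softmax')
])

# categorical_crossentropy expects one-hot labels
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Data Augmentation
# Note: MobileNetV2's ImageNet weights were trained on inputs scaled to [-1, 1]
# (see mobilenet_v2.preprocess_input); rescale=1./255 gives [0, 1] instead.
datagen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=20,
    horizontal_flip=True,
    validation_split=0.2   # hold out 20% of the data for validation
)

# Train
# Note: the validation subset comes from the same generator, so it is augmented too.
history = model.fit(
    datagen.flow(X_train, y_train, subset='training'),
    validation_data=datagen.flow(X_train, y_train, subset='validation'),
    epochs=30
)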

Interview Questions Preview

  1. What are the advantages of Convolution operation?
  2. When do you use Transfer Learning?
  3. What are the roles of Stride and Padding?



Practice Notebook

Additional notebook content:

  • Manual implementation and visualization of Convolution operation
  • Effects of various filters (vertical edge, horizontal edge, blur, sharpen)
  • Max Pooling visualization and size change verification
  • PyTorch code structure example
