Introduction to Convolutional Neural Networks (CNNs)¶

A Convolutional Neural Network (CNN) is a specialized type of artificial neural network designed to process and analyze visual data like images and videos. It is structured to automatically and adaptively learn spatial hierarchies of features through multiple layers. CNNs are widely used in tasks such as image classification, object detection, and facial recognition.

How a CNN Processes an Image¶

An image is essentially a matrix of numbers. For a grayscale image, each pixel holds a single value representing intensity. For a color image, each pixel has three values representing the intensities of red, green, and blue. A 100x100 color image would be represented as a 100x100x3 array, totaling 30,000 numbers.
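This representation is easy to see with NumPy. Below is a minimal sketch: the image contents are random placeholder values, but the shape and count match the 100x100 color image described above.

```python
import numpy as np

# A hypothetical 100x100 RGB image: one intensity per channel, values 0-255
image = np.random.randint(0, 256, size=(100, 100, 3), dtype=np.uint8)

print(image.shape)   # (100, 100, 3)
print(image.size)    # 30000 numbers in total
```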

Convolution Layer¶

The convolution layer is the first layer that processes the image. It uses filters (also called kernels), which are small matrices (e.g., 3x3 or 5x5) that slide across the image. This sliding process is known as convolution. At each position, the filter performs element-wise multiplication with the part of the image it overlaps and sums the result to produce a single number. This process captures specific patterns like edges or corners. Each filter learns to detect a different feature. The result is a feature map that highlights the presence and location of the learned feature in the input image.
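The sliding multiply-and-sum described above can be written out by hand in a few lines of NumPy. The 5x5 "image" and the vertical-edge kernel below are illustrative values chosen so the edge between the dark and bright columns shows up clearly in the feature map; real filters are learned during training, not hand-picked.

```python
import numpy as np

# Toy 5x5 grayscale image: dark columns on the left, bright on the right
image = np.array([
    [0, 0, 10, 10, 10],
    [0, 0, 10, 10, 10],
    [0, 0, 10, 10, 10],
    [0, 0, 10, 10, 10],
    [0, 0, 10, 10, 10],
], dtype=float)

# Hand-crafted 3x3 vertical-edge filter (for illustration only)
kernel = np.array([
    [1, 0, -1],
    [1, 0, -1],
    [1, 0, -1],
], dtype=float)

h, w = image.shape
kh, kw = kernel.shape
feature_map = np.zeros((h - kh + 1, w - kw + 1))

# Slide the kernel over every valid position
for i in range(feature_map.shape[0]):
    for j in range(feature_map.shape[1]):
        patch = image[i:i + kh, j:j + kw]
        feature_map[i, j] = np.sum(patch * kernel)  # element-wise multiply, then sum

print(feature_map)
```

The output is large in magnitude wherever the kernel straddles the dark-to-bright boundary and zero over flat regions, which is exactly the "highlights the presence and location of the feature" behavior described above.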

Activation Layer (ReLU)¶

After the convolution operation, the feature map passes through an activation function, typically ReLU (Rectified Linear Unit). ReLU replaces all negative values with zero and keeps the positive ones unchanged. This introduces non-linearity, enabling the network to learn more complex patterns.
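ReLU is a one-liner in NumPy; the feature-map values below are made up to show the effect:

```python
import numpy as np

feature_map = np.array([[-30.0, 5.0],
                        [0.0, -2.0]])

activated = np.maximum(feature_map, 0)  # ReLU: negatives become 0, positives pass through
print(activated)  # [[0. 5.]
                  #  [0. 0.]]
```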

Pooling Layer¶

Pooling layers reduce the spatial dimensions (height and width) of the feature maps. The most common method is max pooling, where a small window (e.g., 2x2) slides over the feature map and outputs the maximum value within that window. Pooling reduces the amount of computation in later layers, speeds up training, and makes the model more robust to small shifts and distortions in the input.
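A 2x2 max pool with stride 2 can be sketched with a reshape trick; the 4x4 feature map below is made-up example data:

```python
import numpy as np

feature_map = np.array([
    [1, 3, 2, 0],
    [4, 6, 1, 2],
    [7, 2, 9, 5],
    [3, 1, 4, 8],
], dtype=float)

# 2x2 max pooling, stride 2: split into 2x2 blocks, keep each block's maximum
pooled = feature_map.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(pooled)  # [[6. 2.]
               #  [7. 9.]]
```

The 4x4 map shrinks to 2x2, but each surviving value is the strongest response in its neighborhood, so the key features are preserved.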

Repeating Layers¶

The combination of convolution, activation, and pooling layers is often repeated several times. Each repetition allows the network to learn increasingly abstract features. Initial layers might detect lines or edges, intermediate layers recognize parts of objects, and deeper layers understand complex concepts like faces or animals.

Flattening¶

Once enough features are extracted and the spatial dimensions are sufficiently reduced, the multi-dimensional data is flattened into a one-dimensional vector. This step prepares the data for the fully connected layers, where each value becomes an input node.

Fully Connected (Dense) Layers¶

These layers are traditional neural network layers in which each node is connected to every node in the previous layer. They combine all the features detected by the convolutional layers and make the final decision about what the image represents.
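A dense layer is a matrix-vector product plus a bias, followed by an activation. The sketch below uses random placeholder weights and the 1600-to-128 sizes from the model later in this notebook:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1600)          # flattened feature vector
W = rng.normal(size=(128, 1600))   # weights: each of 128 nodes sees all 1600 inputs
b = np.zeros(128)                  # biases

out = np.maximum(W @ x + b, 0)     # dense layer with ReLU activation
print(out.shape)  # (128,)
```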

Output Layer¶

The final layer provides the prediction. For classification tasks, it uses functions like softmax or sigmoid to convert raw scores into probabilities. The class with the highest probability is chosen as the output prediction.
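Softmax turns raw scores into a probability distribution. The three scores below are arbitrary example values:

```python
import numpy as np

scores = np.array([2.0, 1.0, 0.1])       # raw class scores (logits)
exp = np.exp(scores - scores.max())      # subtract the max for numerical stability
probs = exp / exp.sum()                  # softmax: non-negative, sums to 1

print(probs.round(3))          # [0.659 0.242 0.099]
print(int(np.argmax(probs)))   # predicted class: 0
```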

Learning and Training¶

The CNN learns through a process called backpropagation. During training, it compares the predicted output to the actual label using a loss function. The gradients of the loss are propagated backward through the network, and gradient descent adjusts the filters and weights to reduce the error. This process is repeated over many examples, gradually improving accuracy.
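The core idea of the update loop can be shown on a toy problem. This is not the CNN's actual update (that happens over millions of weights), just a one-weight miniature: fit `w` so that `w * x` matches `y` using the gradient of a squared-error loss.

```python
# Toy illustration of training: fit a single weight w so that w * x ~ y
x, y = 2.0, 6.0   # one training example; the ideal w is 3
w = 0.0           # initial weight
lr = 0.1          # learning rate

for _ in range(100):
    pred = w * x                  # forward pass
    loss = (pred - y) ** 2        # squared-error loss
    grad = 2 * (pred - y) * x     # d(loss)/dw via the chain rule
    w -= lr * grad                # gradient-descent update

print(round(w, 4))  # 3.0
```

Backpropagation in a real CNN computes exactly this kind of gradient for every filter value and weight at once, using the chain rule layer by layer.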

Conceptual Analogy¶

You can think of a CNN as a step-by-step analyzer. First, it scans the image for tiny patterns, then gradually combines those patterns to form bigger ideas. Finally, it uses all that information to make a decision about what the image most likely contains.

Summary¶

| Step | Description |
|---|---|
| Input Layer | Accepts the image as a grid of pixel values |
| Convolution | Applies filters to detect patterns |
| ReLU Activation | Keeps positive signals, removes negatives |
| Pooling | Reduces image size while keeping key features |
| Flatten | Converts 2D feature maps to a 1D list |
| Dense Layers | Uses features to classify the image |
| Output | Provides the final prediction |


In [1]:
import numpy as np
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Dropout
from tensorflow.keras.datasets import fashion_mnist
from tensorflow.keras.utils import to_categorical

https://www.kaggle.com/datasets/zalando-research/fashionmnist

In [2]:
# Load data from Keras
(X_train, y_train), (X_test, y_test) = fashion_mnist.load_data()

# Print shapes
print("Training data shape:", X_train.shape)
print("Testing data shape:", X_test.shape)

# Show a few images
class_names = ["T-shirt", "Trouser", "Pullover", "Dress", "Coat", 
               "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]

plt.figure(figsize=(10, 5))
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(X_train[i], cmap="gray")
    plt.title(class_names[y_train[i]])
    plt.axis("off")
plt.tight_layout()
plt.show()
Training data shape: (60000, 28, 28)
Testing data shape: (10000, 28, 28)
(figure: ten sample Fashion-MNIST training images with their class labels)
In [3]:
# Reshape to add channel dimension and normalize pixel values
X_train = X_train.reshape(-1, 28, 28, 1).astype("float32") / 255.0
X_test = X_test.reshape(-1, 28, 28, 1).astype("float32") / 255.0

# One-hot encode the labels
y_train_cat = to_categorical(y_train, 10)
y_test_cat = to_categorical(y_test, 10)
In [4]:
model = Sequential([
    Input(shape=(28, 28, 1)),                     # Input Layer
    Conv2D(32, (3, 3), activation='relu'),        # Convolution + ReLU
    MaxPooling2D(pool_size=(2, 2)),               # Pooling Layer
    
    Conv2D(64, (3, 3), activation='relu'),        # Another Convolution
    MaxPooling2D(pool_size=(2, 2)),               # Pooling Again

    Flatten(),                                    # Flatten the 2D to 1D
    Dense(128, activation='relu'),                # Fully Connected Layer
    Dropout(0.3),                                 # Dropout for regularization
    Dense(10, activation='softmax')               # Output Layer (10 classes)
])
In [5]:
model.compile(
    optimizer='adam',
    loss='categorical_crossentropy',
    metrics=['accuracy']
)

# View summary
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ conv2d (Conv2D)                 │ (None, 26, 26, 32)     │           320 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d (MaxPooling2D)    │ (None, 13, 13, 32)     │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ conv2d_1 (Conv2D)               │ (None, 11, 11, 64)     │        18,496 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ max_pooling2d_1 (MaxPooling2D)  │ (None, 5, 5, 64)       │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ flatten (Flatten)               │ (None, 1600)           │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense (Dense)                   │ (None, 128)            │       204,928 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dropout (Dropout)               │ (None, 128)            │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 10)             │         1,290 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 225,034 (879.04 KB)
 Trainable params: 225,034 (879.04 KB)
 Non-trainable params: 0 (0.00 B)
In [6]:
history = model.fit(
    X_train, y_train_cat,
    epochs=10,
    batch_size=64,
    validation_data=(X_test, y_test_cat),
    verbose=2
)
Epoch 1/10
938/938 - 11s - 12ms/step - accuracy: 0.7996 - loss: 0.5498 - val_accuracy: 0.8636 - val_loss: 0.3782
Epoch 2/10
938/938 - 11s - 11ms/step - accuracy: 0.8681 - loss: 0.3620 - val_accuracy: 0.8820 - val_loss: 0.3279
Epoch 3/10
938/938 - 11s - 12ms/step - accuracy: 0.8835 - loss: 0.3136 - val_accuracy: 0.8929 - val_loss: 0.2981
Epoch 4/10
938/938 - 18s - 19ms/step - accuracy: 0.8970 - loss: 0.2800 - val_accuracy: 0.8989 - val_loss: 0.2793
Epoch 5/10
938/938 - 18s - 20ms/step - accuracy: 0.9052 - loss: 0.2587 - val_accuracy: 0.9012 - val_loss: 0.2638
Epoch 6/10
938/938 - 19s - 20ms/step - accuracy: 0.9134 - loss: 0.2360 - val_accuracy: 0.9081 - val_loss: 0.2527
Epoch 7/10
938/938 - 22s - 23ms/step - accuracy: 0.9202 - loss: 0.2179 - val_accuracy: 0.9056 - val_loss: 0.2560
Epoch 8/10
938/938 - 21s - 22ms/step - accuracy: 0.9230 - loss: 0.2049 - val_accuracy: 0.9106 - val_loss: 0.2478
Epoch 9/10
938/938 - 13s - 14ms/step - accuracy: 0.9290 - loss: 0.1886 - val_accuracy: 0.9124 - val_loss: 0.2505
Epoch 10/10
938/938 - 13s - 14ms/step - accuracy: 0.9336 - loss: 0.1755 - val_accuracy: 0.9124 - val_loss: 0.2567
In [7]:
test_loss, test_acc = model.evaluate(X_test, y_test_cat)
print(f"Test Accuracy: {test_acc:.4f}")
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step - accuracy: 0.9131 - loss: 0.2647
Test Accuracy: 0.9124
In [8]:
predictions = model.predict(X_test)

plt.figure(figsize=(10, 5))
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(X_test[i].reshape(28, 28), cmap='gray')
    pred_label = class_names[np.argmax(predictions[i])]
    true_label = class_names[y_test[i]]
    plt.title(f"Pred: {pred_label}\nTrue: {true_label}")
    plt.axis("off")
plt.tight_layout()
plt.show()
313/313 ━━━━━━━━━━━━━━━━━━━━ 1s 3ms/step
(figure: ten test images with predicted and true labels)