Understanding CNNs — convolutions, feature maps, and pooling
In the previous articles, we trained a neural network on MNIST using dense layers — where every neuron in one layer connects to every neuron in the next, with weights updated via backpropagation. To work with dense layers, we had to flatten each 28×28 image into a 784-element vector and feed it into a layer of 128 neurons. That first layer alone had 784 × 128 weights + 128 biases = 100,480 parameters — every input pixel connects to every neuron. It worked — we hit 97% accuracy — but this approach has two fundamental problems:
1. No spatial awareness. A dense layer treats all 784 inputs as an unordered list — pixel (5, 10) and pixel (5, 11) are neighbors in the image, but nothing in the layer’s structure uses that fact. The flattening threw away all spatial structure — pixel (0, 0) and pixel (27, 27) are just two numbers in a vector. If the digit shifts a few pixels to the right, it looks completely different to a dense network — the same pixel values now land on different input indices, activating different weights. The network has to learn the same pattern at every possible position independently.
2. Too many parameters. With a 28×28 image, 100k parameters is manageable. But real images are much larger. A 224×224 RGB image (a common input size) has 224 × 224 pixels × 3 color channels = 150,528 values. A single dense layer with 512 neurons would need 150,528 × 512 ≈ 77 million parameters — just for one layer. That’s expensive to train and easy to overfit.
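A quick sanity check on those counts, as plain arithmetic:

```python
# Dense layer parameters = inputs x neurons (weights) + neurons (biases)
mnist_first_layer = 28 * 28 * 128 + 128   # 100,480 parameters for the MNIST model
rgb_inputs = 224 * 224 * 3                # 150,528 values per 224x224 RGB image
big_layer = rgb_inputs * 512 + 512        # ~77.1 million parameters for one layer
```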
Why does having too many parameters lead to overfitting?
More parameters means more capacity to memorize. With 77 million weights and only, say, 50k training images, the network has far more parameters than data points. Instead of learning general patterns like “loops mean 0, straight lines mean 1,” it can essentially store each training image verbatim — achieving perfect training accuracy while failing on new images it hasn’t seen before.
Convolutional Neural Networks (CNNs) solve both problems at once. Instead of looking at all pixels at once, a CNN focuses on small local regions — like a 3×3 patch — and learns to recognize patterns within them. The fundamental difference: dense layers learn global patterns involving all pixels at once, whereas convolutional layers learn local patterns — and reuse the same detector at every position across the image.
The diagram below demonstrates this. Imagine we need to tell squares from circles on a 28×28 grid. For that, we would need to detect local differences — at a corner of the square you see a sharp right angle, while at the circle’s edge you see a gradual curve. These local fragments are exactly what a CNN learns to detect:
Using CNNs allows us to address each problem directly:
- Spatial awareness: because the filter operates on a local 2D region, it naturally understands that neighboring pixels are related. It detects patterns like edges and curves based on how pixels are arranged spatially — not just their position in a flat list.
- Far fewer parameters: a 3×3 filter has just 9 parameters (plus a bias), and the same filter is reused at every position across the image. This is called weight sharing. If the filter learns to detect a straight edge, it detects that edge whether it appears on the left side or the right side — the dense network would need separate weights for each position.
CNNs are the architecture that made modern computer vision possible — they enabled breakthroughs in image search, phone camera scene detection, OCR, and video analysis, and remain widely used in autonomous driving, medical imaging, and robotics (though cutting-edge systems increasingly use Vision Transformers (ViTs) or hybrid approaches). For example, PaddleOCR — one of the most popular open-source OCR systems — uses CNNs as a core part of its architecture for text detection and recognition.
Beyond images, the same idea applies to 1D data — sliding filters across waveforms for speech recognition (wav2vec 2.0), music generation (WaveNet), and audio classification. Even Liquid AI’s LFM2 language model uses 1D convolutions as a core building block — 10 of its 16 blocks are short-range convolution layers for local token mixing.
Convolution in neural networks
To understand a CNN, you really only need to grasp two operations: convolution (slide filters across the image to detect patterns) and pooling (shrink the result to keep the strongest signals). These two are applied in alternation — conv, pool — and then the familiar dense layers handle the final classification. Here’s what the complete model could look like in Keras:
model = keras.Sequential([
    Input(...),          # input image
    Conv2D(...),         # convolution
    MaxPooling2D(...),   # pooling
    Conv2D(...),         # convolution
    MaxPooling2D(...),   # pooling
    Flatten(...),        # flatten to 1D
    Dense(...),          # classification
])
Convolution detects patterns, pooling compresses the result, and a dense layer makes the final classification. Here’s what the complete CNN pipeline looks like — from pixels to prediction:
The convolution and pool layers together are the feature extractor — they distill the image into a compact representation where squares and circles look different. Then the dense layer at the end does the easy part: drawing a decision boundary in that feature space.
The convolution step detects local patterns — it slides learned filters across the image and produces feature maps that highlight where each pattern appears. It doesn’t classify anything — it transforms the raw pixel grid into a richer representation. This is feature extraction, the same idea as feature engineering in traditional ML. Instead of feeding raw data into a classifier, you first transform it into more useful representations. The difference is that CNNs learn which features to extract automatically — the kernel values are discovered during training, not designed by hand.
The pooling step compresses each feature map spatially — it shrinks the width and height while keeping the strongest signals. This reduces the number of values the dense layer needs to process (from 28×28×2 = 1,568 down to 14×14×2 = 392 in our model) and makes the features more robust to small shifts in position.
By the end of this article, every line will make sense and you’ll see the actual filter values this network learned. Let’s start with convolution.
Convolution as image transformation
Convolution is not an ML invention — it’s a fundamental technique from signal processing and image processing that existed long before deep learning. If you’ve ever applied a blur, sharpen, or edge detection filter in image editing software — that’s convolution. Under the hood, each filter is a small grid of numbers (called a kernel) that slides across the image. At each position, every kernel value is multiplied by the overlapping pixel, the products are summed, and the result is written into the output.
Try switching between the different filters above. Watch two things as you switch: the kernel values in the middle (the 3×3 grid of numbers), and how the output on the right changes. Different filters produce different transformations:
Notice how the horizontal edge detector highlights the top and bottom edges of the shape — this is achieved by having negative values in the top row and positive in the bottom row of the kernel, so it responds where brightness changes vertically. The sharpen filter has a large positive center value and negative neighbors — it amplifies the difference between a pixel and its surroundings, making edges crisper.
These are classical, well-established kernels from image processing — hand-designed decades before deep learning existed. The description below explains what each kernel computes and why the output looks the way it does.
| Filter | How it works |
|---|---|
| Blur | Averages all 9 pixels equally — smooths out differences. Uniform area stays the same, but sharp transitions get diluted |
| Sharpen | Amplifies the center pixel relative to its neighbors. In smooth areas the neighbors cancel out; at edges the mismatch gets exaggerated |
| Horizontal edges | Subtracts top pixels from bottom — if they’re similar, the result is near zero. Large output only where brightness changes vertically, i.e. a horizontal edge |
| Vertical edges | Same idea rotated: subtracts left from right. Output is strong only where brightness varies horizontally, i.e. a vertical edge |
Same sliding operation, same input — different results. The 9 numbers in the kernel completely determine what the output looks like. For an excellent visual explanation of how convolutions work, see this video by 3Blue1Brown.
Convolution as pattern matching
There’s another way to think about convolution. Instead of using a hand-designed kernel to detect edges or blur, what if the kernel is a patch from the image itself? We can take a small fragment of the image and ask: where else does something similar appear? That’s pattern matching — convolution finds where a local pattern appears across the entire image.
Click on any part of the shape below to cut out a 3×3 patch and use it as the kernel. For example, click on a vertical edge of the square to see it highlight all vertical edges, or click on the circle’s rounding to see where similar curves appear. The widget slides the kernel across the entire image, and the output lights up where similar patterns are found. You can also click Play to see the sliding mechanics step by step:
Try clicking on different parts — notice how each produces a different pattern of matches. That’s convolution as a spatial pattern detector. The kernel defines what to look for, and the output tells you where it was found. This is the property that CNNs actually use — not blurring or sharpening, but the ability to detect learned patterns across the image. The blur and edge filters from the previous section were useful for building intuition, but in a CNN the kernel values are learned automatically, and they act as pattern detectors. Everything else about CNNs builds on this one idea.
From hand-designed to learned filters
Both uses of convolution — image transformation and pattern matching — rely on the kernel values being chosen well. Image processing has used hand-designed kernels for decades (the blur, sharpen, and Sobel filters in the demo above). They’re useful, but choosing them by hand is tedious and limited.
The core idea of CNNs is: instead of hand-designing these values, let the network learn them automatically for the task at hand. The filter starts with random numbers and gets updated by gradient descent during training — the same way weights in dense layers are learned. The network discovers which patterns matter for the task. For our square vs circle task, it might learn filters that respond differently to straight edges vs curved edges — whatever helps separate the two classes.
The convolution operation
We’ve seen above how convolutions can help with image transformation and pattern matching. Now let’s look at the actual mechanics: what exactly happens at each position, and how does the output — called a feature map — get built up?
Why "feature map"?
The name uses “feature” in the same sense as in ML generally — a measurable property useful for the task. Just as a tabular model might use features like “age” or “income”, a CNN’s feature is a visual pattern like “vertical edge” or “curve”. The feature map is a spatial grid showing where that feature was detected across the image — high values where the pattern is present, low where it isn’t.
The core operation is straightforward: take a small matrix (the kernel), slide it across the input, and at each position compute the dot product — multiply every kernel value by the overlapping pixel, sum the products — and write the result into the feature map. The kernel can be any size (1×1, 5×5, 7×7) — older networks like AlexNet used 11×11 kernels — but 3×3 is the most common choice in modern architectures and what we’ll use throughout this article.
Let’s see it in practice. Say you click on the left edge of the square in the demo above — the kernel becomes a 3×3 patch with black (0) as background and white (255) where the edge line is. When this kernel slides to another left edge, the image patch looks the same — every position matches, so the products are large:
Now let’s apply the same kernel to the top edge. The white pixels are in different positions — the kernel’s white column overlaps mostly with black pixels:
Over a flat black region, every pixel is 0, so every product is 0 — no match at all. The output is only strong where the kernel and the image patch agree. Let’s unwrap what’s happening: we flatten both the 3×3 kernel and the 3×3 image region into 9-element vectors and compute their dot product. The more similar the two vectors, the larger the result — that’s why matching regions produce high values and mismatched ones don’t.
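A minimal NumPy sketch of this dot-product view, using toy values rather than the widget's actual pixels:

```python
import numpy as np

# One convolution position as a dot product: flatten the kernel and the
# overlapping patch into 9-element vectors, multiply element-wise, and sum.
kernel = np.array([[0, 255, 0],
                   [0, 255, 0],
                   [0, 255, 0]])   # a patch cut from a vertical edge
matching = kernel.copy()           # the same pattern found elsewhere in the image
flat_black = np.zeros((3, 3))      # a flat background region

strong = np.dot(kernel.ravel(), matching.ravel())    # large: patterns agree
nothing = np.dot(kernel.ravel(), flat_black.ravel()) # 0: no overlap at all
```

The element-wise multiply-and-sum and the flattened dot product are the same number, which is why "similar vectors produce high values" describes convolution exactly.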
The same math explains blur too. The blur kernel has every value set to ⅑, so the weighted sum is just the average of the 9 pixels. On a left edge patch:
At this one position, the output pixel is ~85 — a gray value between black (0) and white (255). The kernel keeps moving, processing the image position by position, row by row. Every position near an edge gets a similar mixed value, softening the sharp 0-to-255 jump into a gradual transition. On a flat region (all black or all white), every pixel in the patch is the same, so the average equals the original — nothing changes there.
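The same computation with the blur kernel, sketched with toy patches and the 0–255 grayscale range used above:

```python
import numpy as np

# Blur: every kernel value is 1/9, so the output is the mean of the 9 pixels.
blur = np.full((3, 3), 1 / 9)
edge_patch = np.array([[0, 0, 255],
                       [0, 0, 255],
                       [0, 0, 255]])  # left-edge patch: one white column, two black
flat_patch = np.full((3, 3), 255.0)   # flat white region

edge_out = np.sum(blur * edge_patch)  # ~85: a mid gray between 0 and 255
flat_out = np.sum(blur * flat_patch)  # ~255: the average equals the original
```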
Output size
When a filter slides across the input without any padding, the output becomes smaller than the input — this is called the border effect. Consider a 5×5 input with a 3×3 kernel. The kernel starts at the top-left corner, covering rows 0-2 and columns 0-2. It slides right one pixel at a time — but it can only go to column 0, 1, or 2. Starting at column 3 would make the kernel cover columns 3-5, and column 5 doesn’t exist. Same vertically. So there are only 3 valid positions per axis, producing a 3×3 output from a 5×5 input:
Remember, at each position the 3×3 patch (9 values) gets compressed into a single output value via the weighted sum. That always happens. The output shrinks not because of that compression, but because there are fewer valid positions than input pixels — the kernel simply can’t start where it would hang off the edge. The general formula is: output size = input size − kernel size + 1. For our 28×28 shapes (and MNIST digits, which are the same size): 28 − 3 + 1 = 26, so the feature map is 26×26.
Padding
To counteract the border effect, you can add rows and columns of zeros around the input — this is called padding. For a 3×3 kernel, you add 1 pixel of zeros on each side, so a 28×28 input becomes a 30×30 padded input, and the output is back to 28×28. For a 5×5 kernel, you’d add 2 pixels on each side.
We add zeros, and not say 1, because zero times any kernel value is zero — the padded pixels contribute nothing to the dot product. Only the real pixels that overlap with the kernel affect the result. It’s like saying “there’s nothing outside the image boundary.”
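In NumPy terms, padding is just `np.pad` with zeros — a sketch:

```python
import numpy as np

img = np.zeros((28, 28))
padded = np.pad(img, 1, mode='constant')  # 1 pixel of zeros on each side
# padded.shape is (30, 30)
# With a 3x3 kernel: 30 - 3 + 1 = 28, so the output matches the input size again
```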
Stride
So far we’ve assumed the kernel moves one pixel at a time — stride 1. But the distance between successive kernel positions is a parameter you can change. For example, with stride 2, the kernel jumps 2 pixels at a time, skipping every other position — both horizontally and vertically. A larger stride means fewer positions, which means a smaller output.
Consider the same 5×5 input from before, but now with stride 2. The kernel can only be placed at positions (0,0), (0,2), (2,0), and (2,2) — just 4 positions instead of 9, producing a 2×2 output. The width and height are each downsampled by a factor of 2:
Striding achieves spatial downsampling — reducing the width and height of feature maps. This is essential because it forces the network to distill local patterns into more compact representations, reduces the number of parameters in later layers, and gives deeper layers a wider “field of view” over the original image. It does this by simply skipping positions during convolution — the learned kernel does double duty as both pattern detector and downsampler.
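The size arithmetic for both stride 1 and larger strides fits in a small helper (the function name is ours, not a library API):

```python
# Output size without padding: one value per valid kernel position.
def conv_output_size(n, kernel, stride=1):
    return (n - kernel) // stride + 1

conv_output_size(5, 3)             # 3 -- the 5x5 example, stride 1
conv_output_size(5, 3, stride=2)   # 2 -- same input, stride 2
conv_output_size(28, 3)            # 26 -- 28x28 input, no padding
```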
However, there’s another technique used more commonly in classification networks: max-pooling. It also achieves spatial downsampling, but through a separate, parameter-free operation — instead of skipping positions during convolution, it applies a fixed rule (take the maximum) to shrink the feature map after convolution is done. We’ll cover it in the pooling section below.
Inside a convolutional layer
So far we’ve looked at convolution as a standalone operation — one kernel sliding across an image, producing one feature map. Now let’s look at how this operation is packaged into a layer that the network can train. In Keras, a convolutional layer is defined like this:
layers.Conv2D(2, kernel_size=3, activation='relu', padding='same')
Although we showed both kernels and output feature maps in the widgets above, the layer actually only stores the kernel values and biases — not the feature maps. The feature maps are computed on the fly during the forward pass by sliding the stored kernels across whatever input comes in. Feature maps are outputs, similar to how activations are outputs of dense layers — which store only weight matrices, not the activations themselves.
Both dense and convolutional layers take the entire input, but process it differently. Take a simplified 3×3 grid with a 2×2 kernel. A dense neuron has a unique weight per pixel (9 weights for our simple 3×3 grid, or 784 weights for a 28×28 image), computes one weighted sum, and produces one value.
A conv kernel has only a few shared weights, but slides across every position in the image, producing one value per position — an entire feature map. At each position, the layer extracts a local patch from the input, multiplies it element-wise with the kernel, sums all the products, and adds a bias.
The dense neuron sees all 9 pixels at once with 9 unique weights, producing one output. The conv kernel sees 4 pixels at a time (2×2), but visits every position — same 4 weights, reused everywhere. Fewer parameters, but the full image is still covered.
Here’s a from-scratch implementation of the convolution layer that shows the full operation:
class Conv2D:
    def __init__(self, num_filters, kernel_size, padding='same'):
        self.kernels = np.random.randn(num_filters, kernel_size, kernel_size) * 0.1
        self.biases = np.zeros(num_filters)
        self.padding = (kernel_size - 1) // 2 if padding == 'same' else 0

    def forward(self, input):
        # Pad input with zeros if padding='same'
        if self.padding > 0:
            p = self.padding
            input = np.pad(input, ((p, p), (p, p)), mode='constant')
        self.input = input  # save padded input for backward
        h, w = input.shape
        k = self.kernels.shape[1]
        out_h, out_w = h - k + 1, w - k + 1
        self.z = np.zeros((self.kernels.shape[0], out_h, out_w))
        for f in range(len(self.kernels)):       # each filter
            for r in range(out_h):               # each row
                for c in range(out_w):           # each column
                    patch = input[r:r+k, c:c+k]  # extract local patch
                    self.z[f, r, c] = np.sum(patch * self.kernels[f]) + self.biases[f]
        self.out = np.maximum(0, self.z)         # ReLU activation
        return self.out                          # shape: (num_filters, out_h, out_w)
The weighted sum at each position, plus a bias term — that’s the entire operation. The bias (one per kernel) shifts the output up or down, giving the kernel a threshold for how strongly a pattern needs to match before it activates. It works exactly like a bias in a dense layer.
How the kernels learn: backpropagation in CNNs
The kernel values start random and get updated by backpropagation, just like in dense networks. The key difference is weight sharing. In a dense layer, each weight connects to one specific input and is used once per forward pass — so each weight’s gradient comes from a single input × error pair. In a conv layer, the same 9 weights are reused at every position across the image — so during backprop, each position contributes a gradient for the same weights. The following snippet conveys the idea:
# Dense layer: one weight, used once
dw = input * error                       # one gradient from one computation

# Conv layer: one weight, used at every position
dw[kr][kc] = sum over all (r, c):
    input[r+kr][c+kc] * error[r][c]      # same structure, summed across all positions
Here error is a grid of gradient values computed by the layers above during the backward pass — its exact values depend on the loss function and the layers in between, which we’ll cover in following articles.
But the structure is the same in both cases: input × error. The difference is that the dense weight sees one such pair, while the conv weight accumulates them from every position where the kernel was applied.
Here’s how that accumulation looks in our Conv2D class:
class Conv2D:
    # ... __init__ and forward from above ...

    def backward(self, upstream_gradient):
        # ReLU backward: zero gradient where activation was ≤ 0
        grad = upstream_gradient * (self.z > 0)
        k = self.kernels.shape[1]
        out_h, out_w = grad.shape[1], grad.shape[2]
        self.grad_kernels = np.zeros_like(self.kernels)
        self.grad_biases = grad.sum(axis=(1, 2))  # bias gradient: total error per filter
        for f in range(len(self.kernels)):
            for kr in range(k):
                for kc in range(k):
                    for r in range(out_h):        # sum over every position
                        for c in range(out_w):
                            self.grad_kernels[f][kr][kc] += self.input[r+kr][c+kc] * grad[f][r][c]

    def update(self, lr):
        self.kernels -= lr * self.grad_kernels
        self.biases -= lr * self.grad_biases
Compare this to the forward method: the same nested loops, but instead of computing output values, we’re accumulating gradients. Every position where the kernel was applied contributes to the same gradient — that’s how 9 weights can learn from thousands of positions across the image.
Multiple kernels, multiple feature maps
A single kernel can only detect one type of pattern. A horizontal edge detector finds horizontal edges, but misses vertical edges, corners, and curves. To capture multiple aspects of the input, you need multiple kernels working in parallel. Here’s the same square processed by two different kernels — each produces a different feature map highlighting different structure:
Each convolutional layer therefore holds multiple kernels and produces one feature map per kernel, each detecting a different pattern — 2 in our case, which is intentionally minimal.
Typical networks use 32 or 64 kernels in the first layer. Each kernel is a separate 3×3 grid (configurable) of learned values, and they’re all applied independently to the same input in parallel: kernel 1 slides across the whole image producing feature map 1, kernel 2 slides across producing feature map 2, and so on. Each kernel learns to detect a different pattern — this parallelism is how the network captures multiple aspects of the input simultaneously, rather than looking for just one thing at a time.
Here’s what we got after training our 2-filter CNN — the kernel values it converged on produce visibly different feature maps for squares vs circles. These are the feature maps from our single convolutional layer — in a deeper network, each layer would produce its own set of feature maps, with deeper layers capturing increasingly complex patterns:
Each filter contributes a different aspect to the final classification. Filter 1 (shown on the left) acts like a broad edge detector — for the square, it produces sharp bright and dark bands along each side; for the circle, a smooth gradient around the circumference. Filter 2 (shown on the right) responds to corners and direction changes — on the square, it fires at the four corners; on the circle, it produces an alternating pattern where the curve bends.
Together, the two filters create a distinctive signature for each shape: the square’s feature maps have sharp, localized activations at edges and corners, while the circle’s are smooth and evenly distributed around the ring. The dense layer at the end learns to read these signatures — “sharp corners” means square, “smooth ring” means circle.
What’s amazing is that we didn’t tell the filters what to detect — both started with different random values, and gradient descent pushed each toward whatever reduced the loss the most. They end up specializing because if both filters learned the same pattern, one would be redundant and wouldn’t help reduce the loss further. The pressure to minimize loss naturally drives the filters apart, each capturing different aspects of the input.
Stacking layers: from edges to shapes
Our square vs circle task is simple enough that a single convolutional layer is sufficient — 2 filters can separate the two classes. But for harder tasks like digit recognition or face detection, you need more depth.
The real power of CNNs comes from stacking convolutional layers. To see why, think about how you’d decompose a recognition problem by hand. If someone asked you to detect the digit “0”, you might break it into sub-problems: is there a top curve? A bottom curve? A left edge? A right edge? Each sub-network answers one question, and a final layer combines their outputs:
Each of those sub-problems can be decomposed further. “Top curve?” breaks down into: is there a top-left arc? A top-right arc? A horizontal edge at the peak? Each question gets simpler, closer to raw visual features:
This is exactly what stacked convolutional layers learn to do automatically — hierarchical feature extraction. The first layer finds edges, the second combines edges into curves and corners, deeper layers combine those into parts and shapes. The network doesn’t need to be told what features to look for — it learns the entire hierarchy from data.
And at each level, multiple kernels work in parallel — one kernel might learn to detect horizontal edges while another detects vertical edges, one might find right-angle corners while another finds smooth curves. This is why each layer has many kernels — the network needs to track many different patterns simultaneously at each stage of the hierarchy.
Multiple input channels
Everything above describes the first convolutional layer in the hierarchy, where each kernel receives a single 2D image as input. But deeper layers in the stack work with something different. The first convolutional layer takes a grayscale image — a single 2D grid 28×28. But the second convolutional layer’s input isn’t a flat 2D image anymore — it’s the output of the first layer, a stack of 2D feature maps, which makes it a 3D volume (28×28×2 in our case, or 28×28×32 in a typical network):
Those stacked frames are often referred to as channels. Each channel is one kernel’s projection of the layer above — for example, channel 1 might be the “edge” view of the input while channel 2 is the “corner” view. So “channel” and “feature map” are the same thing, just viewed from different perspectives: the conv layer outputs feature maps, and the next layer receives them as input channels. The word “channel” is used because it implies no ordering or hierarchy — they’re parallel components of the same data, each describing a different aspect of the same spatial region.
Since the input now has multiple stacked feature maps (multiple channels), each kernel needs to process them all. A layer still can — and usually does — have multiple kernels, but each kernel now becomes 3D: its depth is inferred automatically to match the number of input channels.
With 2 input channels, each kernel becomes a stack of two 2×2 matrices, one per channel. At each position, the first matrix multiplies with channel 1, the second with channel 2, and all products get summed into a single output value per position. Sliding this 3D kernel across all positions produces just one feature map as the output of one kernel — one value per position.
The diagram below shows two positions to illustrate how the same kernel produces different values that all contribute to that one feature map:
The operation is the same weighted sum we already know — just run once per channel, then sum the results. The diagram shows a 2×2 kernel sliding over a 3×3 input with 2 channels. At each position, we compute the weighted sum for each channel separately, then add the results:
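Here is that per-channel sum sketched in NumPy, with random toy values and shapes matching the diagram (2-channel 3×3 input, 2×2 kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 3))        # 2-channel 3x3 input
kernel = rng.standard_normal((2, 2, 2))   # one 3D kernel: depth matches channels

patch = x[:, 0:2, 0:2]                    # a 2x2 patch across BOTH channels
value = np.sum(patch * kernel)            # one scalar, not one value per channel

# Equivalent: weighted sum per channel, then add the results
per_channel = [np.sum(patch[c] * kernel[c]) for c in range(2)]
```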
As the network goes deeper, each kernel gets larger — a first-layer kernel is 3×3×1 (9 weights), but a second-layer kernel with 2 input channels is 3×3×2 (18 weights). The number of channels typically grows through the network (1 → 2 in our toy model; 32 → 64 → 128 in typical networks), so kernels get deeper too — while the spatial size (3×3) usually stays the same.
Multi-channel input gives deeper layers the ability to combine patterns from the layer above. Think back to our model’s feature maps: filter 1 detected edges, filter 2 detected corners. If we had a second convolutional layer, its kernels would look at both feature maps at each position simultaneously. A position where filter 1 found a vertical edge and filter 2 found a horizontal edge — that’s a corner. The second layer’s kernel learns to combine these signals: “here’s a vertical edge, and there’s also a horizontal edge at the same spot — that must be a corner.” This is how deeper layers build complex features from simpler ones.
Pooling
After a convolutional layer detects features, we want to shrink each feature map — like downscaling an image, but instead of averaging pixels, we keep only the strongest signals. Each feature map is shrunk independently — pooling doesn’t merge channels. The depth stays the same: 28×28×2 becomes 14×14×2, not 14×14×1.
The pooling layer receives the stack of feature maps from the convolutional layer, shrinks each one spatially, and passes the same number of channels through to the next layer. No merging, no learning — just spatial compression. This serves two purposes: fewer values means fewer computations in later layers, and it gives the network spatial invariance — a feature detected at pixel (10, 12) and one at pixel (11, 12) both survive as the same pooled value, making the network robust to small shifts.
There are different pooling strategies — max, average, and others — but the most common is max pooling: take a 2×2 window and slide it across the feature map with stride 2 — since the stride matches the window size, each window covers a fresh region with no overlap. At each position, keep only the maximum value.
Pooling might sound like strided convolution — both slide a window and downsample. But they differ in two important ways. First, nothing is learned — pooling has no weights, no kernel, no parameters that get updated during training. Second, the operation itself is different — instead of a weighted sum (multiply-and-sum), pooling simply picks the max.
So pooling is purely a transformation — a fixed rule applied to the data, with nothing to learn. Unlike convolutional and dense layers, pooling layers store no weights, no biases, no parameters at all. That’s why they show 0 in the model’s parameter count. You can think of them as a “compression step” that sits between trainable layers, reducing the spatial size while preserving the important signals.
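A quick tally for our toy model confirms this, assuming the 2 filters of 3×3 described above and the single output neuron described below:

```python
# Parameter bookkeeping per layer:
conv_params  = 2 * (3 * 3 * 1 + 1)   # 2 kernels x (9 weights + 1 bias) = 20
pool_params  = 0                     # max pooling: nothing to learn
dense_params = 14 * 14 * 2 + 1       # 392 weights + 1 bias = 393
```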
For our model, pooling takes the 28×28 feature map down to 14×14:
Max pooling keeps the strongest activation in each region. If an edge was detected somewhere in a 2×2 patch, the max keeps that signal — it doesn’t matter exactly which of the 4 pixels had the strongest response. This gives the network tolerance to small translations: if the shape shifts by one pixel, the same pooled region still captures the same feature. The tradeoff is that you lose exact position information — after pooling, you know a feature was detected somewhere in that 2×2 region, but not exactly which pixel.
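A tiny sketch of that shift tolerance — hypothetical activation values, and `maxpool2x2` is our own helper rather than a library call:

```python
import numpy as np

def maxpool2x2(x):
    # Split into non-overlapping 2x2 windows, take the max of each
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.zeros((4, 4))
fmap[1, 1] = 9.0      # strong activation at (1, 1)

shifted = np.zeros((4, 4))
shifted[0, 0] = 9.0   # same feature shifted a pixel, still in the same 2x2 window

# Both pool to the same 2x2 output: the max survives, exact position is lost
same = np.array_equal(maxpool2x2(fmap), maxpool2x2(shifted))
```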
Average pooling smooths instead of sharpening — it takes the mean of all values in the window. This produces gentler output but can dilute strong signals — a single strong edge activation gets averaged with weaker neighbors, reducing its impact. Max pooling is more common in early and middle layers where preserving strong responses matters. Average pooling sometimes appears at the very end of a network as global average pooling, which averages each entire feature map down to a single number — replacing the flatten + dense step entirely.
Here’s our MaxPool2D class — note it has no update method because there are no parameters to learn:
class MaxPool2D:
    def __init__(self, size=2):
        self.size = size

    def forward(self, x):
        self.input = x  # save for backward
        s = self.size
        channels, h, w = x.shape
        out = np.zeros((channels, h // s, w // s))
        for c in range(channels):
            for r in range(0, h, s):
                for col in range(0, w, s):
                    out[c, r//s, col//s] = np.max(x[c, r:r+s, col:col+s])
        return out

    def backward(self, gradient):
        s = self.size
        out = np.zeros_like(self.input)
        channels, h, w = self.input.shape
        for c in range(channels):
            for r in range(0, h, s):
                for col in range(0, w, s):
                    patch = self.input[c, r:r+s, col:col+s]
                    max_idx = np.unravel_index(np.argmax(patch), patch.shape)
                    out[c, r+max_idx[0], col+max_idx[1]] = gradient[c, r//s, col//s]
        return out

    # No update() — pooling has no parameters
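The backward pass's gradient routing is worth seeing in isolation. This self-contained sketch mirrors the inner loop of the class on a single 4×4 channel: the upstream gradient lands only at the position that held each window's max, and every other pixel gets zero.

```python
import numpy as np

x = np.array([[1, 3, 2, 0],
              [4, 2, 1, 5],
              [0, 1, 3, 2],
              [2, 6, 1, 1]], dtype=float)

grad_in = np.zeros_like(x)
upstream = np.ones((2, 2))  # pretend gradient arriving from the next layer

for r in range(0, 4, 2):
    for c in range(0, 4, 2):
        patch = x[r:r+2, c:c+2]
        mr, mc = np.unravel_index(np.argmax(patch), patch.shape)
        grad_in[r + mr, c + mc] = upstream[r // 2, c // 2]

print(grad_in.astype(int))
# [[0 0 0 0]
#  [1 0 0 1]
#  [0 0 1 0]
#  [0 1 0 0]]
```

This is exactly why max pooling is differentiable despite having no parameters: the max is locally just an identity function for the winning pixel, so the gradient flows through it untouched.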
Flatten and classify
Notice how the feature maps stay separate through convolution and pooling — each is shrunk independently, never merged. The convolutional layers extract features, the pooling layers compress them, but at the end we still have a 3D volume (14×14×2 in our model). A dense layer can’t work with that — it needs a flat 1D vector.
That’s what the Flatten layer does: it takes every value from every channel and lines them up into one long vector — all 14×14 = 196 values from channel 1, then all 196 from channel 2, giving a single 392-element vector. No computation, no learning — just reshaping.
This is the same reshape we did in the dense-only MNIST model when we flattened 28×28 into 784. Flatten itself doesn’t know or care what it’s reshaping — the operation is identical. The difference is what the layers before it have done: in the dense-only approach, there were no preceding layers, so Flatten received raw pixels. Here, convolution and pooling have already extracted and compressed spatial patterns, so the dense layer receives meaningful features rather than raw pixel values.
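To make the ordering concrete, here's a tiny sketch of what `reshape(-1)` does with a channel-first volume, shown on 2×2 maps instead of 14×14:

```python
import numpy as np

# Two 2×2 feature maps, channel-first — a stand-in for our (2, 14, 14) volume
pooled = np.array([[[1, 2],
                    [3, 4]],    # channel 1
                   [[5, 6],
                    [7, 8]]])   # channel 2

flat = pooled.reshape(-1)
print(flat)  # [1 2 3 4 5 6 7 8] — all of channel 1, then all of channel 2
```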
Then the Dense layer acts as the classifier. Since we only have two classes (square vs circle), this is binary classification — we need just one neuron with sigmoid activation. It connects to all 392 values, multiplies by learned weights, sums, adds a bias, and outputs a single number between 0 and 1: the probability that the input is a circle. Close to 0 = square, close to 1 = circle. For multi-class problems like MNIST (10 digits), you’d use 10 neurons with softmax instead — one per class.
This is where the network makes its final decision. The convolutional layers did the hard work — transforming raw pixels into meaningful features like “edge here,” “corner there.” The dense layer just draws a decision boundary in that 392-dimensional feature space, separating square-like patterns from circle-like ones.
Here’s the DenseLayer class — the same structure as the HiddenLayer from the previous article:
class DenseLayer:
    def __init__(self, n_inputs, n_outputs):
        self.W = np.random.randn(n_outputs, n_inputs) * np.sqrt(2.0 / n_inputs)
        self.b = np.zeros(n_outputs)

    def forward(self, x):
        self.x = x
        return self.W @ x + self.b

    def backward(self, grad_output):
        self.grad_W = np.outer(grad_output, self.x)
        self.grad_b = grad_output
        return self.W.T @ grad_output

    def update(self, lr):
        self.W -= lr * self.grad_W
        self.b -= lr * self.grad_b
Building a CNN model for shapes
Now that we’ve covered each component, let’s build the model and training loop — the same way we did for the dense MNIST model, but now with convolutional layers.
We instantiate the layers we built throughout this article:
# 28×28 grayscale → 2 feature maps → pooling → flatten → 1 output
conv = Conv2D(num_filters=2, kernel_size=3) # 20 parameters (2×9 weights + 2 biases)
pool = MaxPool2D(size=2) # 0 parameters
dense = DenseLayer(n_inputs=392, n_outputs=1) # 393 parameters (392 weights + 1 bias)
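The parameter counts in the comments are easy to verify by hand: each 3×3 filter has 9 weights plus 1 bias, and the single dense neuron has one weight per input plus a bias. A quick arithmetic check:

```python
# Mirrors the counts in the comments above
conv_params = 2 * (3 * 3) + 2   # 2 filters × 9 weights, plus 2 biases = 20
pool_params = 0                 # pooling learns nothing
dense_params = 392 * 1 + 1      # 392 weights + 1 bias = 393

total = conv_params + pool_params + dense_params
print(total)  # 413
```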
Preparing the data
We generate 4,000 shapes (2,000 squares + 2,000 circles) with varying sizes and positions:
SIZE = 28
LINE_WIDTH = 2.0
rr, cc = np.mgrid[0:SIZE, 0:SIZE] + 0.5 # grid of pixel centers
def make_square():
    img = np.zeros((SIZE, SIZE), dtype=np.float32)
    s = np.random.randint(8, 22)            # random side length
    r = np.random.randint(0, SIZE - s + 1)  # random position
    c = np.random.randint(0, SIZE - s + 1)
    for t in range(int(LINE_WIDTH)):        # draw 2px thick edges
        img[r+t, c:c+s] = 255; img[r+s-1-t, c:c+s] = 255
        img[r:r+s, c+t] = 255; img[r:r+s, c+s-1-t] = 255
    return img

def make_circle():
    img = np.zeros((SIZE, SIZE), dtype=np.float32)
    radius = np.random.uniform(4.0, 10.0)
    margin = radius + 2.0
    cy = np.random.uniform(margin, SIZE - margin)
    cx = np.random.uniform(margin, SIZE - margin)
    dist = np.sqrt((rr - cy)**2 + (cc - cx)**2)  # vectorized distance
    ring_dist = np.abs(dist - radius)
    img[ring_dist <= LINE_WIDTH / 2] = 255       # solid ring
    aa = (ring_dist > LINE_WIDTH / 2) & (ring_dist <= LINE_WIDTH / 2 + 0.5)
    img[aa] = 255 * (LINE_WIDTH / 2 + 0.5 - ring_dist[aa]) / 0.5  # anti-aliasing
    return img
N = 2000
squares = np.array([make_square() for _ in range(N)])
circles = np.array([make_circle() for _ in range(N)])
Why np.mgrid makes circle generation fast
Notice make_circle uses np.mgrid instead of a nested Python loop over all 28×28 pixels. np.mgrid creates a grid of all pixel coordinates at once, then NumPy computes the distance from the center for all 784 pixels in a single vectorized call — no Python loop at all. For 2,000 circles, this is the difference between seconds and minutes.
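Here's the same idea isolated on a tiny 3×3 grid, using the same pixel-center convention as above: one `np.mgrid` call produces the coordinates of every pixel, and one vectorized expression computes every distance at once.

```python
import numpy as np

rr, cc = np.mgrid[0:3, 0:3] + 0.5              # pixel-center coordinates
dist = np.sqrt((rr - 1.5)**2 + (cc - 1.5)**2)  # distance of every pixel from (1.5, 1.5)

print(dist.round(2))
# [[1.41 1.   1.41]
#  [1.   0.   1.  ]
#  [1.41 1.   1.41]]
```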
Then normalize and split:
X = np.concatenate([squares, circles]) / 255.0  # normalize to [0, 1]
y = np.concatenate([np.zeros(N), np.ones(N)])   # 0 = square, 1 = circle

# Shuffle before splitting — otherwise the first 3,200 samples would be
# all 2,000 squares plus 1,200 circles, and the test set would be circles only
perm = np.random.permutation(len(X))
X, y = X[perm], y[perm]

X_train, X_test = X[:3200], X[3200:]
y_train, y_test = y[:3200], y[3200:]
Notice we don’t flatten the images — unlike the dense MNIST model where we reshaped to 784 vectors, the conv layer needs the 2D spatial structure intact.
The output activation
Our model outputs a single number — the probability that the input is a circle. To convert the dense layer’s raw output (which can be any value) into a probability between 0 and 1, we use sigmoid:
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
This is the same function we’d use in logistic regression — it squashes any value into the (0, 1) range. For multi-class problems like MNIST, we’d use softmax instead.
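A few values build the intuition — zero maps exactly to 0.5, large positive inputs approach 1, large negative inputs approach 0, and the function is symmetric around that midpoint:

```python
import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

print(sigmoid(0.0))   # 0.5 — no evidence either way
print(sigmoid(5.0))   # ≈ 0.9933 — strongly "circle"
print(sigmoid(-5.0))  # ≈ 0.0067 — strongly "square"
```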
The loss function
To measure how wrong the prediction is, we use binary cross-entropy — the same idea as the cross-entropy loss from the MNIST article, but adapted for two classes instead of ten:
def binary_cross_entropy(predicted, label):
    # label is 0 (square) or 1 (circle)
    # predicted is the sigmoid output (probability of circle)
    # clip to avoid log(0) when the model is fully confident
    predicted = np.clip(predicted, 1e-7, 1 - 1e-7)
    return -label * np.log(predicted) - (1 - label) * np.log(1 - predicted)
It measures how far the predicted probability is from the true label: if the model says 0.95 for a circle and it is a circle, the loss is small; if it says 0.3, the loss is large.
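The numbers from that example, computed directly with a standalone copy of the same formula:

```python
import numpy as np

def binary_cross_entropy(predicted, label):
    return -label * np.log(predicted) - (1 - label) * np.log(1 - predicted)

# True label: circle (1)
loss_confident = binary_cross_entropy(0.95, 1)  # -ln(0.95) ≈ 0.051 — small loss
loss_wrong = binary_cross_entropy(0.30, 1)      # -ln(0.30) ≈ 1.204 — large loss

print(loss_confident, loss_wrong)
```

The log makes the penalty grow sharply as the prediction drifts toward the wrong class — a confident wrong answer costs far more than an unsure one.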
The training loop
The training loop follows the same 4-step process from the previous articles — forward pass, loss, backpropagation, gradient descent. The forward pass chains all our layers together:
def forward(image):
    activated = conv.forward(image)        # 28×28 → 2×28×28 (conv + ReLU)
    pooled = pool.forward(activated)       # 2×28×28 → 2×14×14
    flat = pooled.reshape(-1)              # 2×14×14 → 392
    output = sigmoid(dense.forward(flat))  # 392 → 1
    return output
The backward pass mirrors it in reverse — from the loss gradient back through dense, un-flatten, pooling, and conv. We’ll dive deep into why output - label is the correct starting gradient for sigmoid + binary cross-entropy in an upcoming article:
def backward(output, label):
    grad = output - label           # sigmoid + BCE gradient
    grad = dense.backward(grad)     # dense layer
    grad = grad.reshape(2, 14, 14)  # un-flatten back to the pooled shape
    grad = pool.backward(grad)      # pooling: route to max positions
    conv.backward(grad)             # conv: ReLU + accumulate across positions
With forward and backward defined, the training loop is straightforward:
def train(X_train, y_train, epochs=20, lr=0.25, batch_size=32):
    for epoch in range(epochs):
        indices = np.random.permutation(len(X_train))
        for i in range(0, len(X_train), batch_size):
            batch_idx = indices[i:i+batch_size]
            bs = len(batch_idx)

            # Accumulate gradients over the mini-batch
            acc_conv_k = np.zeros_like(conv.kernels)
            acc_conv_b = np.zeros_like(conv.biases)
            acc_dense_W = np.zeros_like(dense.W)
            acc_dense_b = np.zeros_like(dense.b)

            for idx in batch_idx:
                # 1. Forward pass
                output = forward(X_train[idx])
                # 2. Loss
                loss = binary_cross_entropy(output, y_train[idx])
                # 3. Backpropagation (computes per-sample gradients)
                backward(output, y_train[idx])
                # Accumulate
                acc_conv_k += conv.grad_kernels
                acc_conv_b += conv.grad_biases
                acc_dense_W += dense.grad_W
                acc_dense_b += dense.grad_b

            # 4. Average gradients and update weights
            conv.grad_kernels = acc_conv_k / bs
            conv.grad_biases = acc_conv_b / bs
            dense.grad_W = acc_dense_W / bs
            dense.grad_b = acc_dense_b / bs
            conv.update(lr)
            dense.update(lr)

Training results
We trained on 4,000 generated shapes (2,000 squares + 2,000 circles). Here’s how training progresses over 20 epochs — loss drops and accuracy climbs on both training and validation sets:
The model reaches 100% test accuracy — and the learned filters are the ones you saw in the feature map demo above.
For reference, here’s the equivalent in Keras — the entire model and training loop above collapses into a few lines:
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(2, kernel_size=3, activation='relu', padding='same'),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),
    layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
model.fit(X_train, y_train, epochs=20, batch_size=32, validation_split=0.1)
Scaling up: from shapes to MNIST
Our shape classifier has just 413 parameters because the task is simple. Real image classification needs more capacity. Here’s a typical CNN for MNIST digit recognition — the same principles, just bigger:
model = keras.Sequential([
    layers.Input(shape=(28, 28, 1)),
    layers.Conv2D(32, kernel_size=3, activation='relu'),  # 28×28×1 → 26×26×32
    layers.MaxPooling2D(pool_size=2),                     # 26×26×32 → 13×13×32
    layers.Conv2D(64, kernel_size=3, activation='relu'),  # 13×13×32 → 11×11×64
    layers.MaxPooling2D(pool_size=2),                     # 11×11×64 → 5×5×64
    layers.Flatten(),                                     # 5×5×64 → 1600
    layers.Dense(128, activation='relu'),                 # 1600 → 128
    layers.Dense(10, activation='softmax'),               # 128 → 10
])

The differences from our toy model:
- Same 28×28 grayscale input size — but handwritten digits instead of simple geometric shapes
- 32 and 64 filters instead of 2 — many more patterns to detect
- Two conv+pool blocks instead of one — hierarchical feature extraction
- 10-class softmax instead of binary sigmoid — 10 digits to distinguish
- ~225,000 parameters instead of 413 — much more capacity
But the building blocks are identical: slide small filters to build feature maps, pool to reduce size, flatten, classify. Here’s how training progresses over 5 epochs:
This CNN reaches ~99% test accuracy on MNIST in 5 epochs — a clear improvement over the ~97% we got with dense layers alone.