Training a real network on MNIST
In the previous article, we looked at every concept behind training a neural network — forward passes, loss functions, backpropagation, gradient descent, learning rates, mini-batches, vanishing gradients. We used a toy example of fitting a line with 2 parameters to explain the mechanics — given an input x, predict y = wx + b. That was regression: the network produces a single number that can take any value on a continuous scale — like a price or a temperature — and we measured how far off it was with Mean Squared Error (the loss function we picked for that task).
Here, we’ll do something different — classification. Instead of outputting one number, the network will output a probability for each category and pick the highest one: look at a handwritten digit and decide whether it’s a 0, 1, or 2. We’ll use the classic MNIST dataset to put all of the previous article’s concepts into practice — building a real network, training it on real data, and seeing the theory come alive.
We’ll use Keras as the primary framework to build and train our networks. Keras is a high-level API that lets you define and train models in a few lines of code, and it can run on top of all the popular backends — PyTorch, TensorFlow, and JAX. We’ll also provide NumPy implementations alongside it — writing the forward pass, loss, backpropagation, and gradient descent by hand — so you can see exactly what Keras is doing under the hood. NumPy is a popular Python library for numerical computation — it lacks GPU and autodifferentiation support, but all the major frameworks borrow its API style and interoperate with it, and datasets like MNIST typically come as NumPy arrays.
There are also two ways to follow along:
- Read the code and results — every code block has its output right below it. You can read through the article and see what each experiment produces without running anything.
- Train live — run real training on a cloud GPU and watch the charts update with actual results. Click the ☁ GPU button in the bottom-right corner to set up a cloud GPU.
The dataset
MNIST is a collection of 70,000 handwritten digits (0–9) — 60,000 for training and 10,000 for testing. Each image is 28×28 pixels, grayscale. To keep things simple and focused on the training concepts, we’ll work with just three digits: 0, 1, and 2. The full dataset has roughly 6,000 training images per digit, so keeping only three gives us about 18,600 training images and 3,100 test images — plenty to train a network, and every widget and code example stays compact.
You’ll notice that accuracy on 3 digits is significantly higher than on the full 10 — our network hits ~99% on our 3-digit subset while the same architecture reaches ~97% on all 10 digits. This isn’t surprising: with fewer classes the problem is easier (random guessing is 33% vs 10%), the digits we picked look quite distinct from each other (compare that to telling apart 3 and 8, or 4 and 9), and the network’s capacity is spread across fewer output neurons. If you’d like to see the same experiments on the full 10-digit MNIST, there’s a companion notebook that runs through all of them.
Here are a few examples:
The images and labels are stored as NumPy arrays:
Let’s look at the raw data.
# 18,623 images, each a matrix of 28×28 pixels
>>> train_images.shape
(18623, 28, 28)
# 18,623 labels: first image is "0", second is "1", ...
>>> train_labels
array([0, 1, 2, ..., 0, 2, 1], dtype=uint8)
# 3,147 test images, same 28×28 matrices
>>> test_images.shape
(3147, 28, 28)
# 3,147 labels for test images
>>> test_labels
array([2, 1, 0, ..., 0, 1, 2], dtype=uint8)
Each value in the 28×28 grid is a pixel intensity from 0 (black) to 255 (white). The grid is the image — each number maps directly to a shade of gray. You can see this by editing individual pixels — click any pixel on the image or change the row/column indices, then type a new value and watch it change:
Try setting a pixel to 255 (white in the raw data, which renders as black ink) or 0 (black in the raw data, which renders as white background). The image is nothing more than these 784 numbers — change a number, change a pixel.
Why 0–255? Each pixel is stored as a single byte — 8 bits — which can represent 2⁸ = 256 distinct values. So 0 means black, 255 means white, and everything in between is a shade of gray. This isn’t specific to MNIST — it’s how virtually all digital images store pixel intensity. (Color images use 3 bytes per pixel — one each for red, green, and blue — but MNIST is grayscale, so one byte is enough.)
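The byte arithmetic is easy to check in NumPy, which stores MNIST pixels as unsigned 8-bit integers:

```python
import numpy as np

# One byte = 8 bits, so it can hold 2^8 = 256 distinct values: 0 through 255
print(2 ** 8)  # 256

# MNIST pixel arrays use dtype uint8 — an unsigned 8-bit integer per pixel
info = np.iinfo(np.uint8)
print(info.min, info.max)  # 0 255
```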
Each image comes with a label — the digit it represents (0, 1, or 2). Together, these two arrays — images and labels — are everything the network will learn from. The task is simple: given a 28×28 image, predict which digit it is. That’s a classification problem with 3 classes — the network we’ll build is called a multiclass classifier. In classification, a class is one of the possible labels — one category the model can choose from. For our subset, there are 3 classes (digits 0, 1, 2) with roughly 6,000 examples per digit.
To feed an image into our network, we flatten the 28×28 grid into a single vector of 784 numbers and normalize the pixel values from [0, 255] to [0, 1]:
import numpy as np
# Flatten 28×28 grid into a single vector
image.reshape(784) # [0, 0, 0, ..., 156, 252, 128, ..., 0, 0] — 784 values
# Normalize pixel values from [0, 255] to [0, 1]
image.reshape(784) / 255.0 # [0.0, 0.0, 0.0, ..., 0.61, 0.99, 0.50, ..., 0.0, 0.0]
Normalization isn’t specific to MNIST and is very common in machine learning — it aligns input values with the scale of the weights. Weights are typically initialized to small random numbers around 0, so if inputs go up to 255, the weighted sums become huge, leading to large activations, large gradients, and unstable training. Dividing by 255 puts everything in [0, 1] — a range where small random weights produce reasonable activations from the start.
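You can see the scale mismatch directly. This sketch uses a made-up "image" (half the pixels at full intensity) and weights drawn the same way we'll initialize them later:

```python
import numpy as np

rng = np.random.default_rng(0)

# 784 small random weights, roughly as at initialization
w = rng.normal(0, np.sqrt(2.0 / 784), size=784)

# A hypothetical image: half the pixels at full intensity, half at zero
raw = np.zeros(784)
raw[:392] = 255.0
normalized = raw / 255.0

print(w @ raw)         # 255x larger than the normalized sum — huge activation
print(w @ normalized)  # modest weighted sum, roughly order 1
```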
Let’s load the dataset. MNIST is so standard that Keras includes it as a built-in dataset:
import keras
import numpy as np
(train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()
# Filter to digits 0, 1, 2 only
train_mask = np.isin(train_labels, [0, 1, 2])
test_mask = np.isin(test_labels, [0, 1, 2])
# Flatten 28×28 → 784 and normalize to [0, 1]
X_train = train_images[train_mask].reshape(-1, 784).astype("float32") / 255.0
X_test = test_images[test_mask].reshape(-1, 784).astype("float32") / 255.0
y_train = train_labels[train_mask]
y_test = test_labels[test_mask]
print(f"Training: {X_train.shape[0]} images, {X_train.shape[1]} pixels each")
print(f"Test: {X_test.shape[0]} images")
# Training: 18623 images, 784 pixels each
# Test: 3147 images
We now have 18,623 training images (each a vector of 784 values between 0 and 1) and their labels (each 0, 1, or 2). The test set is separate — we’ll use it only to check how well the model generalizes to digits it has never seen during training.
From regression to classification
In regression, the network outputs a single number. In classification, it needs to choose from a set of categories — in our case, digits 0, 1, or 2. The network needs to say “I think this is a 2” with some level of confidence.
We handle this by giving the output layer 3 neurons — one per digit. Each neuron produces a score called a logit (the name comes from “log-odds” in logistic regression, but in practice it just means “raw score before softmax”) indicating how strongly the network believes the input is that digit. Higher score = more confidence. For a given input image, the output might look like:
The image goes in, the network processes it through its layers, and out come 3 numbers — one per digit. Digit 2 got the highest score (4.2), so that’s the network’s prediction. In code:
logits = [1.5, -0.3, 4.2]
# 0 1 2
# The network's guess: digit 2 (highest logit = 4.2)
prediction = np.argmax(logits) # 2
But raw logits aren’t probabilities — they can be any number, positive or negative, and they don’t sum to 1. To turn them into a proper probability distribution, we use the softmax function — one of the most common operations in deep learning, used everywhere from image classifiers to LLMs like ChatGPT (where it converts raw scores over vocabulary words into the probability of the next token):
Softmax works in three steps:
- Raise e to the power of each logit — e^1.5 ≈ 4.48, e^−0.3 ≈ 0.74, e^4.2 ≈ 66.69. Since e^z > 0 for any z, this turns every value — even negative ones — into a positive number.
- Sum all the positive values — 4.48 + 0.74 + 66.69 = 71.91
- Divide each value by that sum — 4.48 / 71.91 ≈ 0.062, 0.74 / 71.91 ≈ 0.010, 66.69 / 71.91 ≈ 0.927
This normalizes the list so it adds up to exactly 1 — a proper probability distribution.
Notice that the original logits (1.5, -0.3, 4.2) differ by just a few points. But after exponentiation, the gaps become massive: e^4.2 ≈ 66.69, while e^1.5 ≈ 4.48 and e^−0.3 ≈ 0.74. A logit difference of 2.7 (between 4.2 and 1.5) turns into a 15× difference in the exponentiated values (e^2.7 ≈ 14.9). After dividing by the sum, digit 2 ends up with 92.7% of the probability — the exponential effectively turns “a bit higher” into “almost certain.”
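You can verify these numbers directly:

```python
import numpy as np

logits = np.array([1.5, -0.3, 4.2])

exps = np.exp(logits)
print(exps.round(2))                 # [ 4.48  0.74 66.69]
print((exps[2] / exps[0]).round(1))  # 14.9 — the ~15x gap from a 2.7 logit difference

probs = exps / exps.sum()
print(probs.round(3))                # [0.062 0.01  0.927]
```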
Why is it called “softmax”? Because it’s a softer version of max. A hard max would just pick the largest value and ignore everything else — [0, 0, 1]. Softmax does almost the same thing when one logit dominates (like our 4.2 above — 92.7% is close to 100%). But when the logits are similarly large, the other values get meaningful weight too. Compare:
| logits | hard max | softmax |
|---|---|---|
| [1.0, 8.5, 1.0] | [0, 1, 0] | [0.1%, 99.8%, 0.1%] |
| [3.0, 3.5, 3.0] | [0, 1, 0] | [24.4%, 51.2%, 24.4%] |
| [3.0, 3.0, 3.0] | undefined (tie) | [33.3%, 33.3%, 33.3%] |
Hard max gives a binary answer — winner takes all. Softmax gives a smooth distribution that changes continuously as you vary the inputs. When the model is uncertain (logits close together), softmax reflects that uncertainty. When the model is confident (one logit much larger), softmax approaches hard max. That smooth behavior is also what makes backpropagation work — you can compute gradients through softmax because it’s differentiable everywhere.
You can control just how soft softmax is using a temperature parameter T. The formula becomes:
softmax(zᵢ) = e^(zᵢ / T) / Σⱼ e^(zⱼ / T)
Dividing by T before exponentiating scales the differences between logits. Try dragging the T slider:
- T → 0 (low temperature) — the differences get amplified, the distribution becomes sharp. The highest logit gets nearly 100%. This approaches hard max.
- T = 1 — standard softmax. Dividing by 1 changes nothing, so this is equivalent to not having T at all.
- T → ∞ (high temperature) — the differences get squashed, all logits become similar, and the distribution approaches uniform (33.3% each).
If you’ve used ChatGPT or other LLMs, you’ve seen this parameter — it’s the same “temperature” slider. LLMs use softmax to turn their output logits into a probability distribution over the next word. Low temperature makes the model pick the most likely word almost every time (deterministic, repetitive). High temperature spreads probability across many words (creative, sometimes nonsensical). The temperature parameter controls the tradeoff.
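Temperature is a one-line change to softmax: divide the logits by T before exponentiating. A minimal sketch:

```python
import numpy as np

def softmax_t(logits, T=1.0):
    # Divide by temperature first, then apply the usual (stable) softmax
    z = np.asarray(logits, dtype=float) / T
    exp = np.exp(z - np.max(z))
    return exp / np.sum(exp)

logits = [1.5, -0.3, 4.2]
print(softmax_t(logits, T=0.1).round(3))    # sharp, essentially [0, 0, 1] — near hard max
print(softmax_t(logits, T=1.0).round(3))    # [0.062 0.01  0.927] — standard softmax
print(softmax_t(logits, T=100.0).round(3))  # close to uniform, roughly a third each
```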
This is how we would implement softmax in Python:
def softmax(logits):
# Subtract max for numerical stability (doesn't change the result,
# but prevents overflow when computing e^z for large values)
exp = np.exp(logits - np.max(logits))
return exp / np.sum(exp)
probs = softmax(logits)
# [0.063, 0.010, 0.927]
# 0 1 2
# The highest probability is 92.7% for digit 2
Why subtract the max?
Softmax computes e^(zᵢ) / Σⱼ e^(zⱼ), and the exponential function grows extremely fast. If a logit is 1000, then e^1000 overflows to infinity in floating point. Subtracting the maximum logit m from all values before exponentiating shifts the largest value to 0, so e^0 = 1 is the largest exponential we compute. The math works out to the same probabilities:
e^(zᵢ − m) / Σⱼ e^(zⱼ − m) = (e^(zᵢ) · e^(−m)) / (Σⱼ e^(zⱼ) · e^(−m)) = e^(zᵢ) / Σⱼ e^(zⱼ)
The e^(−m) factor cancels out, but we’ve avoided computing any dangerously large exponentials. This is a standard numerical stability trick you’ll see in every softmax implementation.
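A sketch of the failure mode and the fix, side by side:

```python
import numpy as np

def softmax_naive(logits):
    exp = np.exp(logits)  # overflows to inf for large logits
    return exp / np.sum(exp)

def softmax_stable(logits):
    exp = np.exp(logits - np.max(logits))  # largest exponent is e^0 = 1
    return exp / np.sum(exp)

big = np.array([1000.0, 1001.0, 1002.0])

with np.errstate(over="ignore", invalid="ignore"):
    print(softmax_naive(big))  # [nan nan nan] — inf / inf
print(softmax_stable(big))     # well-defined probabilities that sum to 1
```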
The loss function
Now we need a loss function. In the previous article, we used Mean Squared Error — the squared difference between prediction and target, averaged over all data points. That made sense for regression: the prediction was one number, the target was one number, and we wanted them to be close.
For classification, we use cross-entropy loss. The idea is simple: look at the probability the model assigned to the correct class, and penalize it for being low. If the true label is 2 and the model says “60% chance it’s a 2,” the loss is −log(0.6) ≈ 0.51. If the model were more confident — say 95% — the loss would be −log(0.95) ≈ 0.05.
If the model assigned only 1% to the correct class, the loss would be −log(0.01) ≈ 4.6 — much higher.
In Python that looks like this:
def cross_entropy_loss(probs, label):
return -np.log(probs[label])
# Model is confident and correct → low loss
cross_entropy_loss([0.05, 0.01, 0.94], label=2) # 0.062
# Model is uncertain → moderate loss
cross_entropy_loss([0.30, 0.30, 0.40], label=2) # 0.916
# Model is confident but wrong → high loss
cross_entropy_loss([0.80, 0.15, 0.05], label=2) # 2.996
Why −log? Because the logarithm converts probabilities (which are between 0 and 1) into a nice loss scale: −log(1) = 0 (perfect prediction, zero loss), and −log(p) approaches infinity as p approaches 0 (terrible prediction, huge loss). The negative sign flips it so the loss is positive. Drag the sliders below to see how the loss changes as you adjust the probabilities — watch the yellow dot move along the curve:
Try dragging digit 2’s slider (the true class) to the right — as its probability approaches 100%, the loss drops to nearly zero. Drag it to the left — as the probability drops, the loss shoots up steeply. That steepness means the gradient is also large — so the model gets a strong signal to correct itself when it’s confidently wrong, and a gentle nudge when it’s already close to the right answer.
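That steepness is easy to quantify: the derivative of −log(p) is −1/p, so the gradient magnitude explodes as the probability of the correct class shrinks. A quick numeric check:

```python
import numpy as np

# d/dp [-log(p)] = -1/p: loss and gradient magnitude at different confidences
for p in [0.99, 0.5, 0.1, 0.01]:
    print(f"p = {p:4}: loss = {-np.log(p):.3f}, |gradient| = {1 / p:.1f}")

# p near 1 -> tiny loss, gentle gradient (about 1)
# p near 0 -> huge loss, huge gradient (100 at p = 0.01)
```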
The gradient
In the previous article, we saw that training requires computing the gradient of the loss with respect to each parameter — the derivative that tells us which direction to nudge each weight to reduce the error. For our line-fitting example, the gradient of MSE was dw = 2 * mean(error * x). Now we need the gradient of cross-entropy loss with respect to the logits — how should each logit change to reduce the loss?
The gradient of cross-entropy combined with softmax has a remarkably clean form. For each output neuron i:
∂L/∂zᵢ = pᵢ − yᵢ
Or more compactly: ∇L = p − y, where y is the one-hot vector (all zeros except a 1 at the correct class). This is as simple as it gets — the gradient is just the difference between what the model predicted and what the answer was.
def cross_entropy_gradient(probs, label):
grad = probs.copy()
grad[label] -= 1 # subtract 1 from the correct class
return grad
# If probs = [0.06, 0.01, 0.93] and label = 2:
# grad = [0.06, 0.01, -0.07]
# ↑ 0.93 - 1 = -0.07
The gradient tells each output neuron which way to adjust. The correct class gets a negative gradient (push its probability up), and every wrong class gets a positive gradient (push its probability down). The magnitudes are proportional to how wrong each probability is — a confident wrong answer gets a large correction.
Derivation of the softmax + cross-entropy gradient
Let’s trace through the calculus to see where p − y comes from.
The loss for a single example is L = −log(p_c), where c is the correct class and p_c = e^(z_c) / Σⱼ e^(zⱼ).
We want ∂L/∂zᵢ — how the loss changes when we nudge logit zᵢ. This requires the chain rule through softmax.
Case 1: i = c (the correct class)
∂L/∂z_c = −(1/p_c) · ∂p_c/∂z_c = −(1/p_c) · p_c(1 − p_c) = p_c − 1
where we used the fact that ∂p_c/∂z_c = p_c(1 − p_c) — the derivative of softmax with respect to its own logit.
Case 2: i ≠ c (a wrong class)
∂L/∂zᵢ = −(1/p_c) · ∂p_c/∂zᵢ = −(1/p_c) · (−p_c · pᵢ) = pᵢ
where we used ∂p_c/∂zᵢ = −p_c · pᵢ for i ≠ c — the cross-derivative of softmax.
Combining both cases: ∂L/∂zᵢ = pᵢ − yᵢ, where yᵢ = 1 if i = c and 0 otherwise. That’s the one-hot encoding of the label.
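We can also sanity-check the result numerically: nudge each logit by a tiny amount, measure how the loss changes, and compare against the analytic p − y gradient:

```python
import numpy as np

def softmax(z):
    exp = np.exp(z - np.max(z))
    return exp / np.sum(exp)

def loss(z, label):
    return -np.log(softmax(z)[label])

logits = np.array([1.5, -0.3, 4.2])
label = 2

# Analytic gradient: p - y
analytic = softmax(logits).copy()
analytic[label] -= 1

# Numerical gradient: central finite differences on each logit
eps = 1e-6
numerical = np.zeros_like(logits)
for i in range(len(logits)):
    zp, zm = logits.copy(), logits.copy()
    zp[i] += eps
    zm[i] -= eps
    numerical[i] = (loss(zp, label) - loss(zm, label)) / (2 * eps)

print(analytic.round(4))
print(numerical.round(4))  # matches the analytic gradient
```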
Mini-batches and epochs
In the previous article, we introduced stochastic gradient descent (SGD): instead of computing the gradient over the entire dataset before making one update, we split the data into small mini-batches and update the weights after each one. With 5 data points and a batch size of 2, we got 3 updates per pass through the data.
Now we have 18,623 images. With a batch size of 32, that’s ⌈18,623 / 32⌉ = 582 mini-batches — meaning the weights get updated 582 times per pass through the data. Each pass through the full dataset is called an epoch. At the start of each epoch, we shuffle the data so the mini-batches are different every time.
The mechanics are identical to what we covered in the previous article — just more data, more batches, more updates. The same tradeoffs apply: smaller batches give noisier but more frequent updates; larger batches give smoother but fewer updates.
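The bookkeeping in a few lines, assuming the final partial batch (18,623 isn’t divisible by 32) still triggers an update:

```python
import math

n_examples = 18_623
batch_size = 32

# ceil: the last, smaller batch still counts as one update
updates_per_epoch = math.ceil(n_examples / batch_size)
print(updates_per_epoch)  # 582

epochs = 5
print(updates_per_epoch * epochs)  # 2910 weight updates over the whole run
```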
Activation function
In the previous article, we saw that without an activation function, stacking layers is pointless — a chain of linear operations is still just one linear operation. The activation function introduces nonlinearity, which is what lets a network learn curves and complex patterns instead of just straight lines.
We’ll use ReLU (Rectified Linear Unit) for our hidden layers: . It passes positive values through unchanged and clips negatives to zero. ReLU is the default choice in modern networks because its derivative is either 0 or 1, which keeps gradients healthy as they flow backward through many layers. The output layer will use softmax instead — we’ve already covered how that converts logits into probabilities.
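ReLU and its derivative each take one line of NumPy:

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])

relu = np.maximum(0, z)            # pass positives through, clip negatives to zero
relu_grad = (z > 0).astype(float)  # derivative: 1 where z > 0, else 0

print(relu)       # [0.  0.  0.  0.5 2. ]
print(relu_grad)  # [0. 0. 0. 1. 1.]
```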
Building the network
Let’s build a neural network that takes 784 inputs (pixels) and outputs 3 class probabilities (digits 0, 1, 2). We’ll use the same Layer structure from the previous article — a weight matrix W, a bias vector b, and an activation function:
But now we need layers that can also do the backward pass — computing gradients and passing them to the previous layer. In the previous article, we derived the chain rule for a single neuron. For a full layer, the same principle applies, just in matrix form:
class Layer:
def __init__(self, n_inputs, n_neurons):
# He initialization: scale weights by sqrt(2/n_inputs)
# This keeps activations in a reasonable range with ReLU
self.W = np.random.randn(n_neurons, n_inputs) * np.sqrt(2.0 / n_inputs)
self.b = np.zeros(n_neurons)
def forward(self, x):
self.x = x # save for backward pass
self.z = self.W @ x + self.b # weighted sum
self.out = np.maximum(0, self.z) # ReLU activation
return self.out
def backward(self, grad_output, lr):
# ReLU derivative: 1 if z > 0, else 0
grad_z = grad_output * (self.z > 0)
# Gradients for this layer's parameters (chain rule)
grad_W = np.outer(grad_z, self.x) # ∂loss/∂W = grad_z ⊗ x
grad_b = grad_z # ∂loss/∂b = grad_z
# Gradient to pass to the previous layer (chain rule continues)
grad_input = self.W.T @ grad_z # ∂loss/∂x = Wᵀ · grad_z
# Gradient descent: update parameters
self.W -= lr * grad_W
self.b -= lr * grad_b
return grad_input
This is the same 4-step training loop from the previous article, embedded into the layer:
- Forward pass: compute z = Wx + b, apply ReLU
- Backward pass: receive the gradient from the next layer, compute local gradients using the chain rule
- Parameter gradients: grad_W = grad_z ⊗ x (outer product) and grad_b = grad_z
- Gradient descent: subtract lr × gradient from each parameter
The grad_input at the end is the gradient signal passed backward to the previous layer — this is how the chain rule “flows” through the network, exactly like the diagram from the previous article:
Forward: x ──▶ Layer 1 ──▶ Layer 2 ──▶ Output ──▶ Loss
Backward: x ◀── Layer 1 ◀── Layer 2 ◀── Output ◀── ∂L
Each layer receives a gradient from the right, computes its own parameter gradients (to update W and b), and passes the remaining gradient to the left.
Where do the matrix gradient formulas come from?
In the previous article, we derived the gradient for a single neuron: ∂L/∂w = ∂L/∂z · x. For a full layer, we have a weight matrix W and an input vector x, and we need the gradient of the loss with respect to every element of W.
The forward pass computes z = Wx + b. Each element zᵢ = Σⱼ Wᵢⱼ xⱼ + bᵢ.
By the chain rule:
∂L/∂Wᵢⱼ = ∂L/∂zᵢ · ∂zᵢ/∂Wᵢⱼ = grad_zᵢ · xⱼ
This is exactly the outer product grad_z ⊗ x — a matrix where element (i, j) is grad_zᵢ · xⱼ.
For the bias: ∂zᵢ/∂bᵢ = 1, so grad_b = grad_z.
For the input (to pass backward):
∂L/∂xⱼ = Σᵢ ∂L/∂zᵢ · Wᵢⱼ
That’s the matrix-vector product Wᵀ · grad_z — the gradient signal passed to the previous layer.
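A quick numeric check of these formulas on a tiny layer (3 neurons, 4 inputs), with a made-up incoming gradient:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))  # a tiny layer: 3 neurons, 4 inputs
x = rng.normal(size=4)
grad_z = rng.normal(size=3)  # pretend gradient arriving from the next layer

grad_W = np.outer(grad_z, x)  # element (i, j) = grad_z[i] * x[j]
grad_b = grad_z               # bias gradient passes straight through
grad_x = W.T @ grad_z         # sum over neurons for each input

print(grad_W.shape)  # (3, 4) — same shape as W, one gradient per weight
print(grad_x.shape)  # (4,) — one gradient per input, passed backward
```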
Now let’s also create an output layer that uses softmax instead of ReLU (since the last layer needs to produce probabilities, not ReLU activations):
class OutputLayer:
def __init__(self, n_inputs, n_classes):
self.W = np.random.randn(n_classes, n_inputs) * np.sqrt(2.0 / n_inputs)
self.b = np.zeros(n_classes)
def forward(self, x):
self.x = x
self.z = self.W @ x + self.b
# Softmax instead of ReLU
exp = np.exp(self.z - np.max(self.z))
self.probs = exp / np.sum(exp)
return self.probs
def backward(self, label, lr):
# Combined softmax + cross-entropy gradient: p - y
grad_z = self.probs.copy()
grad_z[label] -= 1
# Same gradient formulas as Layer
grad_W = np.outer(grad_z, self.x)
grad_b = grad_z
grad_input = self.W.T @ grad_z
self.W -= lr * grad_W
self.b -= lr * grad_b
return grad_input
Now we can assemble a network. Let’s start with a simple architecture — one hidden layer of 128 neurons:
# 784 inputs → 128 hidden neurons → 3 output classes
layer1 = Layer(784, 128)
output = OutputLayer(128, 3)
That’s 784 × 128 + 128 = 100,480 parameters in the first layer and 128 × 3 + 3 = 387 in the output layer — 100,867 parameters total. Compared to the 2 parameters (w and b) in our line-fitting example from the previous article, this is 50,000 times more. Yet the training algorithm is identical.
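The count spelled out in code, using the rule that a dense layer has one weight per (neuron, input) pair plus one bias per neuron:

```python
def dense_params(n_inputs, n_neurons):
    # Weight matrix (n_neurons x n_inputs) plus one bias per neuron
    return n_neurons * n_inputs + n_neurons

hidden = dense_params(784, 128)  # 784*128 + 128 = 100,480
out = dense_params(128, 3)       # 128*3 + 3 = 387
print(hidden, out, hidden + out)  # 100480 387 100867
```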
The same network in Keras
The Layer class we just wrote is Dense in Keras, and our OutputLayer with softmax is Dense with activation="softmax":
model = keras.Sequential([
keras.layers.Dense(128, activation="relu", input_shape=(784,)), # our Layer(784, 128)
keras.layers.Dense(3, activation="softmax"), # our OutputLayer(128, 3)
])
model.summary()
# Total params: 100,867 — same number we counted by hand
Every Dense layer does the exact same computation as our Layer class: . Keras just handles the weight initialization, forward pass, gradient computation, and parameter updates automatically.
Training
The training loop is the same 4-step process from the previous article — forward pass, loss, backpropagation, gradient descent — repeated for every image in every mini-batch in every epoch:
def train(layers, output_layer, X_train, y_train, X_test, y_test,
epochs=5, lr=0.1, batch_size=32):
n = X_train.shape[0]
history = {"train_loss": [], "train_acc": [], "test_acc": []}
for epoch in range(epochs):
# Shuffle training data at the start of each epoch
indices = np.random.permutation(n)
X_shuffled = X_train[indices]
y_shuffled = y_train[indices]
epoch_loss = 0.0
correct = 0
for i in range(0, n, batch_size):
X_batch = X_shuffled[i:i+batch_size]
y_batch = y_shuffled[i:i+batch_size]
for x, label in zip(X_batch, y_batch):
# 1. Forward pass
h = x
for layer in layers:
h = layer.forward(h)
probs = output_layer.forward(h)
# 2. Loss computation
loss = -np.log(probs[label] + 1e-10)
epoch_loss += loss
correct += (np.argmax(probs) == label)
# 3-4. Backpropagation + gradient descent
grad = output_layer.backward(label, lr)
for layer in reversed(layers):
grad = layer.backward(grad, lr)
# Track metrics
train_loss = epoch_loss / n
train_acc = correct / n
test_acc = evaluate(layers, output_layer, X_test, y_test)
history["train_loss"].append(train_loss)
history["train_acc"].append(train_acc)
history["test_acc"].append(test_acc)
print(f"Epoch {epoch+1}/{epochs} — loss: {train_loss:.4f}, "
f"train acc: {train_acc:.3%}, test acc: {test_acc:.3%}")
return history
def evaluate(layers, output_layer, X, y):
correct = 0
for x, label in zip(X, y):
h = x
for layer in layers:
h = layer.forward(h)
probs = output_layer.forward(h)
correct += (np.argmax(probs) == label)
return correct / len(y)
This is the same SGD loop from the previous article: shuffle the data, split into mini-batches, and for each example run the forward pass, compute the loss, backpropagate the gradients, and update the weights. The only difference is that instead of 5 data points and 2 parameters, we have 18,623 images and 100,867 parameters.
Let’s train our network:
layer1 = Layer(784, 128)
output = OutputLayer(128, 3)
history = train([layer1], output, X_train, y_train, X_test, y_test,
epochs=5, lr=0.1, batch_size=32)
# Epoch 1/5 — loss: 0.4612, train acc: 87.28%, test acc: 93.18%
# Epoch 2/5 — loss: 0.2070, train acc: 94.08%, test acc: 95.11%
# Epoch 3/5 — loss: 0.1543, train acc: 95.53%, test acc: 96.08%
# Epoch 4/5 — loss: 0.1229, train acc: 96.40%, test acc: 96.59%
# Epoch 5/5 — loss: 0.1012, train acc: 97.03%, test acc: 96.85%
Here’s the same thing in Keras — the entire training loop above collapses into three lines:
model.compile(
optimizer=keras.optimizers.SGD(learning_rate=0.1), # gradient descent with lr=0.1
loss="sparse_categorical_crossentropy", # cross-entropy loss
metrics=["accuracy"],
)
history = model.fit(
X_train, y_train,
epochs=5,
batch_size=32,
validation_data=(X_test, y_test),
)
# Epoch 1/5 — loss: 0.4577 — accuracy: 0.8732 — val_accuracy: 0.9312
# Epoch 2/5 — loss: 0.2058 — accuracy: 0.9411 — val_accuracy: 0.9518
# Epoch 3/5 — loss: 0.1530 — accuracy: 0.9558 — val_accuracy: 0.9601
# Epoch 4/5 — loss: 0.1218 — accuracy: 0.9645 — val_accuracy: 0.9662
# Epoch 5/5 — loss: 0.1004 — accuracy: 0.9708 — val_accuracy: 0.9688
The numbers match because the algorithm is the same — compile sets up the loss function and optimizer (steps 2–4 of our loop), and fit runs the forward pass, loss, backpropagation, and gradient descent for every mini-batch in every epoch (the full loop). "sparse_categorical_crossentropy" is the cross-entropy loss we derived above — “sparse” because we pass integer labels (2) instead of one-hot vectors ([0, 0, 1]). SGD(learning_rate=0.1) is the same plain gradient descent update: w = w - lr * dw.
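“Sparse” versus one-hot is only a difference in how the label is written down — both produce the same loss. A NumPy sketch:

```python
import numpy as np

probs = np.array([0.05, 0.01, 0.94])  # softmax output for one example

# Sparse label: the class index itself
sparse_label = 2
loss_sparse = -np.log(probs[sparse_label])

# One-hot label: a vector with a 1 at the class index
one_hot = np.array([0.0, 0.0, 1.0])
loss_onehot = -np.sum(one_hot * np.log(probs))

print(loss_sparse.round(4), loss_onehot.round(4))  # 0.0619 0.0619 — identical
```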
Here’s what the training looks like over 5 epochs — the left chart shows the loss dropping as the model learns, the right chart shows accuracy climbing. The green dots are test accuracy, measured at the end of each epoch on digits the model has never seen:
Nearly 97% test accuracy after 5 epochs — the network correctly classifies almost all of the 3,147 test digits it has never seen during training. Not bad for a simple 2-layer network.
What happens during training
When we call model.fit(), what actually happens? The model takes each image from the training set, one at a time, and runs it through the same 4 stages we described in the previous article:
- Forward pass — the image (784 pixel values) flows through the network’s layers. Each layer computes z = Wx + b — multiply by weights, add bias, apply ReLU. The final layer applies softmax and outputs 3 probabilities — one per digit.
- Loss — we look at the probability the model assigned to the correct digit and compute −log(p). If the model is confident and right (probability near 1.0), the loss is small. If the model is wrong or uncertain, the loss is large.
- Backpropagation — starting from the loss, we compute the gradient for every weight in the network. The output gradient is simply p − y (probabilities minus the one-hot label). This gradient flows backward through each layer using the chain rule, telling each weight which direction to move.
- Weight update — each weight is nudged by lr × gradient. The learning rate controls how big the nudge is. After this step, the model is slightly better at classifying this particular image.
Then we move to the next image and repeat. After going through all ~18,600 images once, that’s one epoch. We typically train for several epochs — each pass through the data makes the model a little better.
In practice, we don’t process one image at a time — we process a mini-batch of 32 images together and average their gradients before updating the weights (as we discussed in the mini-batches section). But the concept is the same: forward pass → loss → backprop → update, repeated for every batch.
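Averaging per-example gradients over a batch is just a mean along the batch axis. A sketch with made-up output gradients (each row is one example's p − y, so each row sums to zero):

```python
import numpy as np

# Per-example output gradients (p - y) for a hypothetical batch of 3 examples
grads = np.array([
    [ 0.10,  0.05, -0.15],
    [-0.60,  0.30,  0.30],
    [ 0.20, -0.70,  0.50],
])

# One averaged gradient drives one weight update, instead of three separate updates
batch_grad = grads.mean(axis=0)
print(batch_grad.round(4))  # the column-wise average of the three rows
```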
The widget below lets you watch this happen for individual images. Each step shows one image going through the full pipeline — you can see the model’s prediction, the loss, the gradients, and how the weights change. At step 0, the model is random and mostly wrong. As you step forward, it starts getting more predictions right and the loss drops:
Notice how the loss starts high (the model is guessing randomly) and drops quickly in the first few steps — the gradients are large because the model is very wrong, so the weight updates are large. As the model improves, the loss decreases, the gradients shrink, and the updates get smaller. This is the same self-regulating behavior we saw in the previous article: gradient descent automatically takes big steps when it matters most and small steps when it’s close to the answer.
What we’ve covered
We’ve taken every concept from the previous article and applied it to a real problem — classifying handwritten digits with 100,000+ parameters instead of fitting a line with 2:
- Classification vs regression: softmax converts raw logits into probabilities, cross-entropy measures how wrong those probabilities are
- Forward pass: input flows through layers, each computing z = Wx + b — same formula, now in matrix form
- Backpropagation: the gradient of softmax + cross-entropy simplifies to p − y, and the chain rule flows backward through layers exactly as we derived by hand
- Training loop: shuffle, split into mini-batches, forward pass → loss → backprop → update — repeated for every batch in every epoch
The algorithm is identical to what we built in the previous article. Keras automates it, but under the hood it’s the same computation our NumPy code performs step by step.
But we used the default settings throughout — one hidden layer of 128 neurons, learning rate 0.1, batch size 32 — and only 3 digit classes. What happens if we change them? In the next article, we’ll systematically vary each piece — learning rate, batch size, network depth, activation functions — and scale up to all 10 digits. We’ll also explore overfitting, early stopping, and learn how to diagnose when training goes wrong.
Stay up to date
Get notified when I publish new deep dives.