How neural networks learn: a deep dive into backpropagation and gradient descent

A neural network is just a pile of numbers — millions or billions of parameters (weights and biases) organized into layers. Different arrangements of layers give you different architectures (CNNs, transformers, etc.), but at the core they’re all parameters that determine what the network does. An untrained network produces random noise. A trained one recognizes faces, translates language, or writes code. The difference is the values of those parameters.

Training is the process of finding the right values for those parameters. The network makes a prediction, measures how wrong it is, and then adjusts its parameters to be less wrong next time. The algorithm that figures out which direction to adjust each parameter is called backpropagation, and the algorithm that actually applies the adjustment is called gradient descent. Together, they’re the engine behind virtually all modern deep learning.

Here’s what that looks like on the simplest possible example — a model learning to fit a line. Click Step a few times and watch:

step 0 · loss = 28.00

Each click runs one round of the same algorithm that trains every neural network — from the smallest classroom example to GPT. The model makes a prediction, measures how wrong it is (the red dashed lines), computes which direction to adjust its parameters, and nudges them a little. That’s it. Repeat enough times and the line locks onto the data.

This article is about understanding exactly what happens inside each of those steps. We’ll start from a single neuron and build all the way up to multi-layer networks — with interactive widgets and Python code at every step so you can experiment with the concepts directly.

This is a long article — it’s not meant to be read in one sitting. Take it in sections: start with the basics (what a neuron is, how loss works), play with the widgets until they click, then come back for the calculus and chain rule when you’re ready. Each section builds on the previous one, so if something feels unclear, it’s worth going back and re-reading the section before it.

A neural network is just parameters

A neural network is built from layers, and each layer is built from neurons. A neuron is the smallest unit of computation — it takes some inputs, does a simple calculation, and produces one output. Stack many neurons side by side and you get a layer. Stack several layers end to end — where the output of one feeds into the input of the next — and you get a neural network. The entire network, no matter how large, is just these same small pieces repeated and connected. So to understand the whole thing, we can start by understanding one neuron.

Let’s use the widget below to play with this and build intuition. It’s a single neuron with 3 inputs. On the left you control the input values, on the right you set the neuron’s parameters — its weight vector, bias, and activation function. The diagram updates live as you change anything.

w₁x₁ + w₂x₂ + w₃x₃ + b → activation(sum) → output
0.5·1 + 0.5·0 + 0.5·1 + 0.0 = 1.0 → perceptron(1.0) = 1
[interactive widget — inputs: 1.0, 0.0, 1.0 · neuron parameters: w = (0.5, 0.5, 0.5), b = 0.0]

What you can see on the widget is a neuron multiplying each input by a corresponding weight, adding them up with a bias, and passing the result through an activation function.

So mathematically, we’re doing this:

\text{output} = f(w_1 x_1 + w_2 x_2 + w_3 x_3 + b)

where f is the activation function. Each input xᵢ gets multiplied by its corresponding weight wᵢ, the products are summed together with a bias b, and the result is passed through the activation function f. That’s all a neuron does — multiply, sum, activate.

Try a few configurations to build intuition:

  • A weight controls how much an input matters. Set x₁ = 1.0 and the rest to 0. Now drag w₁ — the output responds directly. Set w₁ = 0 and that input is completely ignored, no matter its value.
  • Negative weights are inhibitory. Set x₁ = 1.0, w₁ = -1.5, everything else to 0. The weighted sum goes negative — with the perceptron activation, the output is 0. The neuron is actively suppressing that input.
  • The bias shifts the decision boundary. With all inputs at 0, only the bias determines the sum. A positive bias means the neuron fires even with no input. A negative bias means the inputs have to “overcome” it before the neuron activates.
  • The activation function shapes the output. Switch from Perceptron (hard 0/1) to Sigmoid — now the output is a smooth value between 0 and 1. Try ReLU — it passes positive values through unchanged and clips negatives to zero.

Why do activation functions matter? Without them, every neuron is just a linear function (multiply and add), and stacking linear functions together still gives you a linear function — no matter how many layers you add. Activation functions introduce nonlinearity, which is what allows neural networks to learn curves, edges, and complex patterns instead of just straight lines. In fact, a neural network with enough neurons and nonlinear activations can approximate virtually any function — this is known as the universal approximation theorem. The activation function is what makes that possible.
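You can check the “stacking linear functions stays linear” claim numerically. Here’s a quick sketch (not part of the widget above) that composes two linear layers with no activation in between, then collapses them into a single equivalent linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two linear layers with no activation in between
W1, b1 = rng.standard_normal((3, 2)), rng.standard_normal(3)
W2, b2 = rng.standard_normal((1, 3)), rng.standard_normal(1)

x = rng.standard_normal(2)
two_layers = W2 @ (W1 @ x + b1) + b2

# The same computation as ONE linear layer with combined parameters
W = W2 @ W1            # (1x3) @ (3x2) -> (1x2)
b = W2 @ b1 + b2
one_layer = W @ x + b

print(np.allclose(two_layers, one_layer))  # True — the stack is still linear
```

No matter how many linear layers you stack, you can always collapse them this way. An activation between the layers breaks the collapse, which is exactly why it’s needed.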

The weights and bias are the neuron’s parameters — the values it needs to learn. A neuron stores one weight per input — in our widget above, that’s a vector of 3 values. The weights determine how much each input matters, and the bias shifts the result up or down. Together, these parameters define what the neuron responds to. Different weights and biases make the same neuron detect completely different patterns in its inputs.

Since the multiply-and-sum is just a dot product, this is usually written in vector form:

\text{output} = f(\mathbf{w} \cdot \mathbf{x} + b)

Here x is the vector (list) of all inputs — e.g. [x₁, x₂, x₃] — and w is the vector of all weights — e.g. [w₁, w₂, w₃]. The dot product w · x multiplies each pair and sums the results: w₁x₁ + w₂x₂ + w₃x₃.

In Python, that might look like this:

import numpy as np

class Neuron:
    def __init__(self, n_inputs):
        self.w = np.random.randn(n_inputs)   # w — weight vector, e.g. [w₁, w₂, w₃]
        self.b = 0.0                         # b — bias

    def forward(self, x):                    # x — input vector, e.g. [x₁, x₂, x₃]
        z = np.dot(self.w, x) + self.b       # w · x + b — dot product + bias
        return max(0, z)                     # f(z) — activation function (ReLU)

# Create a neuron with 3 inputs and run it
neuron = Neuron(3)
output = neuron.forward(np.array([1.0, 0.5, 0.7]))

A layer is a bunch of neurons

Stack many of these neurons together and you get a layer. Put several layers together and you get a neural net. Here’s a small one — 2 inputs, two hidden layers of 3 neurons each, and 1 output. They’re called “hidden” because you only see the inputs going in and the output coming out — the layers in between are internal to the network, invisible from the outside:

[diagram — a network with inputs x₁, x₂, two hidden layers (layer 1, layer 2), and output y]

We saw above that every neuron holds a set of weights for all its inputs and a bias, wrapped in the activation function:

\text{output} = f(w_1 x_1 + w_2 x_2 + w_3 x_3 + b)

A layer typically stores all of them together in a weight matrix (W) — one row of weights per neuron. For the network depicted above, layer1.W is a 3×2 matrix — 3 neurons, each with 2 weights, since each neuron gets 2 inputs:

layer1.W = [[ 0.4, -0.2],    ← neuron 0: weights for x₁, x₂
            [ 0.1,  0.7],    ← neuron 1: weights for x₁, x₂
            [-0.3,  0.5]]    ← neuron 2: weights for x₁, x₂

For a single neuron, we had a dot product between one weight vector and the input: f(w · x + b). For a full layer, we stack all the weight vectors into a matrix W and all the biases into a vector b, so the same operation applies to every neuron at once:

\text{output} = f(W\mathbf{x} + \mathbf{b})

When we compute Wx, each row of W gets dot-producted with the input — that’s one neuron’s weighted sum. The matrix multiply does all of them in a single operation.

[diagram — single neuron: one weight row (w₁ w₂ w₃) dotted with x gives one output; full layer: stack 3 rows into W, and W @ x computes all three dot products at once, one per neuron]

In Python, that looks like this:

import numpy as np

class Layer:
    def __init__(self, n_inputs, n_neurons):
        # W is a matrix where each ROW is one neuron's weights.
        # Shape: (n_neurons, n_inputs) — so W[0] is neuron 0's weights,
        # W[1] is neuron 1's weights, etc.
        self.W = np.random.randn(n_neurons, n_inputs)
        self.b = np.zeros(n_neurons)  # b — bias vector, one per neuron

    def forward(self, x):
        # W @ x multiplies every neuron's weight row by the input,
        # computing all dot products at once
        return np.maximum(0, self.W @ x + self.b)  # ReLU activation

# Build the network from the diagram above
layer1 = Layer(2, 3)   # 2 inputs  → 3 neurons  (6 weights + 3 biases = 9)
layer2 = Layer(3, 3)   # 3 inputs  → 3 neurons  (9 weights + 3 biases = 12)
output = Layer(3, 1)   # 3 inputs  → 1 neuron   (3 weights + 1 bias   = 4)

# Forward pass — each layer's output feeds into the next
x = np.array([0.5, 0.8])
h1 = layer1.forward(x)       # input → hidden layer 1
h2 = layer2.forward(h1)      # hidden layer 1 → hidden layer 2
y  = output.forward(h2)      # hidden layer 2 → output

When we compute W @ x, each row gets dot-producted with the input — all neurons in the layer computed in one operation. That’s why neural networks use linear algebra, and why GPUs (built for matrix math) make training fast.

Every weight and every bias is a parameter. Count them up in the network above — layer 1 has 9, layer 2 has 12, the output has 4 — and you get 25 parameters total. That’s a tiny network. GPT-2 had about 1.5 billion parameters; GPT-3 had 175 billion. Research on scaling laws has shown that model performance tends to improve predictably as you increase model size, training data, and compute — which is why the field keeps pushing these numbers higher, though there are signs that simply adding more parameters is reaching diminishing returns, and the focus is shifting toward better training data, more efficient architectures, and techniques like reasoning and chain-of-thought that get more out of existing model sizes.
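The count is easy to verify in code: each neuron has one weight per input, plus one bias. A quick sketch using the layer sizes from the diagram:

```python
def n_params(n_inputs, n_neurons):
    # each neuron: one weight per input, plus one bias
    return n_neurons * n_inputs + n_neurons

# layer 1 + layer 2 + output layer, sizes from the diagram above
total = n_params(2, 3) + n_params(3, 3) + n_params(3, 1)
print(total)  # 25
```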

To get a sense of the compute involved: training GPT-3 required roughly 3 × 10²³ operations. At 1 billion operations per second, that’s 10 million years on a single processor. Thousands of GPUs working in parallel compressed it to weeks.

How does a network learn?

Training a neural network means finding the right values for all of these parameters — every weight in every layer’s matrix, every bias in every layer’s vector. At the start of training, these parameters are initialized to small random values — the network literally knows nothing. Training is the process of iteratively adjusting these random numbers until they produce useful predictions.

The training loop that repeats every iteration looks like this:

  1. Forward pass — run inputs through every layer of the network, multiplying by weights and applying activations, to produce a prediction.
  2. Loss computation — compare the prediction to the actual target value using a loss function (e.g. MSE) that reduces all the errors to a single number — how wrong is the model?
  3. Backpropagation — walk backwards through the network using the chain rule to compute a gradient for each parameter — which direction should it move, and by how much?
  4. Gradient descent — update each parameter by subtracting a small fraction (learning rate) of its gradient, nudging the whole network toward lower loss.

These four stages split into two big parts. The forward pass (stage 1) runs inputs through the network to produce a prediction. When a model is deployed for inference — making predictions in production — only the forward pass runs. No weights are updated.

Training (stages 2, 3, and 4) is everything that happens after the prediction: measuring the error, figuring out which way to adjust each parameter, and applying the update. These three stages work together, and they’re the focus of this article.

The process has two directions — data flows forward to produce a prediction, then gradients flow backward to update the weights:

  Forward pass: each layer receives activations, passes output →

              activations    activations    activations
  Input ─────────▶ Layer 1 ─────────▶ Layer 2 ─────────▶ Output ──▶ Loss


  Backward pass: each layer receives gradient signal, passes it ←

              gradients      gradients      gradients
  Input ◀───────── Layer 1 ◀───────── Layer 2 ◀───────── Output ◀── ∂L
              ↓ ∂L/∂W₁           ↓ ∂L/∂W₂           ↓ ∂L/∂W₃
           (own weight       (own weight         (own weight
            gradients)        gradients)          gradients)

Notice the symmetry: in the forward pass, each layer receives activations from the previous layer and passes its output forward. In the backward pass, each layer receives a gradient signal from the next layer and passes it backward. In both directions, each layer needs an input from its neighbor to do its work.

A simple example: fitting a line

To understand how this works, we’ll strip everything down to the simplest possible network: one neuron, one input, one weight, one bias. Once you see how gradient descent works with 2 parameters (one weight and one bias), the leap to 25 or 25 billion is just scale.

Remember what a single neuron computes: f(w · x + b), where x is a vector of inputs and w is a vector of weights. Strip it down to one input and ignore the activation function, and you get f(x) = wx + b — the basic linear function you learned in school. That’s not a simplification; it’s literally what the neuron does before the activation function adds nonlinearity. So training a single neuron to fit a line is the purest version of the problem.

Suppose someone hands you five points and says they come from a linear function f(x) = wx + b, and asks you to find w and b:

x (input):   -2  -1   0   1   2
y (output):  -3  -1   1   3   5

This is the opposite of what you did in school. In algebra class, you’re given an equation like y = 3x + 5 and asked to “solve for x” — the function’s parameters w = 3 and b = 5 are known, and you’re finding the input x. In our case we’re not solving for x. We’re solving for the parameters that define the function itself — w (the slope, how steep the line is) and b (the intercept, where it crosses the y-axis).

But why search for a function if we already have the data? Because the whole point is to handle inputs you’ve never seen before. If someone asks “what’s the output when x = 1.5?” and 1.5 isn’t in your data, a table can’t help. But if you’ve discovered that the underlying function is f(x) = 2x + 1, you can instantly answer: 4. That’s generalization — the ability to make correct predictions on new, unseen inputs.

So let’s put those points on the graph (green dots) and try to find w and b manually by adjusting the sliders. Use the loss value to guide you — drag them and see if you can get the loss to zero. You’ll find that w = 2 and b = 1 bring the loss to zero — those are the exact parameters that generated the data, giving us y = 2x + 1:


We were using the loss to guide our search — but what is it? The loss is a single number that tells you how wrong the model is overall. When the loss is high, the predictions are far from the data. When it’s zero, the model fits perfectly.

As you drag the sliders, notice the red dashed lines — those are the individual errors at each data point, showing how far off the prediction is from the actual value. We compute an error for each point as error = prediction - actual.

When training the model, we need a single number that tells us how wrong the prediction is overall. That’s the loss — a function that takes all the errors and reduces them to one score. There are many loss functions in machine learning, each suited to different tasks:

  • Mean Squared Error (MSE) — for regression (predicting numbers). Squares each error and averages them.
  • Cross-Entropy — for classification (predicting categories). Measures how far predicted probabilities are from the true labels.
  • Mean Absolute Error (MAE) — like MSE but uses absolute values instead of squares, less sensitive to outliers.

Since we’re fitting a line — a task known as regression (predicting a continuous number) — we’ve used Mean Squared Error (MSE): take each error, square it, then average them all. Squaring does two things — it makes all errors positive (so they don’t cancel each other out), and it punishes large errors much more than small ones:

MSE = \frac{1}{n} \sum_{i=1}^{n} (y_{\text{pred}_i} - y_{\text{actual}_i})^2 = \frac{(y_{\text{pred}_1} - y_{\text{actual}_1})^2 + (y_{\text{pred}_2} - y_{\text{actual}_2})^2 + \cdots + (y_{\text{pred}_n} - y_{\text{actual}_n})^2}{n}

Let’s apply this formula to our data. Say your current guess is w = 3, b = 1, so f(x) = 3x + 1. For each of our 5 data points, we compute the prediction, the error (how far off), and the squared error:


  x    y_actual    y_pred = 3x + 1    error    error²
 -2    -3          -5                 -2       4
 -1    -1          -2                 -1       1
  0     1           1                  0       0
  1     3           4                  1       1
  2     5           7                  2       4
                                     mean →   loss = 2.0

Square each error (so negatives don’t cancel out), then average. The result is one number: a loss of 2.0 means our predictions are off by about 1.4 on average (√2 ≈ 1.4). Higher means worse fit, zero means perfect. When we set w = 2 and b = 1 in the widget above, the loss drops to zero because those are the exact parameters that generated the data.
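The table above can be reproduced in a few lines of NumPy — a small sketch using the five data points:

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2], dtype=float)
y = np.array([-3, -1, 1, 3, 5], dtype=float)

def mse(w, b):
    y_pred = w * x + b                  # prediction for every point
    return np.mean((y_pred - y) ** 2)   # square the errors, then average

print(mse(3.0, 1.0))  # 2.0 — the guess from the table
print(mse(2.0, 1.0))  # 0.0 — the exact parameters behind the data
```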

We’ve managed to find the parameters manually for our simple 2-parameter function, but imagine doing this with 25 parameters, let alone millions. Manual tuning is impossible at that scale — no human could explore a space of millions of dimensions. We need a systematic way to look at the loss and mathematically figure out which direction to nudge each parameter to make it smaller. That’s exactly what backpropagation and gradient descent do together: backpropagation computes which way to adjust each parameter, and gradient descent takes a small step in that direction. We repeat that process until the loss is minimal.

For a simple linear model, there’s an exact formula — the Normal Equation — that gives you the perfect w and b in one shot, no iteration needed. But it only works for linear models. The moment you have nonlinear activations, multiple layers, and millions of parameters, there is no formula. Gradient descent doesn’t need one — it only needs to measure the error and compute which direction to nudge. It works for any differentiable model, which is why it’s the universal training algorithm.

The training loop, step by step

Let’s first see how this automatic algorithm works. The widget below lets you run backpropagation and gradient descent for our task step by step and watch everything happen:

  • Left chart: the data points (green dots) and the model’s prediction line (blue) based on the current values of ww and bb. Red dashed lines show the error at each point — the difference between the prediction and the actual value. These errors are squared and averaged to produce the MSE loss.
  • Right chart: the loss plotted over each step — this is how you monitor whether training is working. A steadily decreasing curve means the model is learning; if it plateaus or spikes, something needs adjusting.

Click Step to run one gradient descent update, or Step x10 to run ten at once.

[interactive widget — Backpropagation: w, b sliders (-3.0 … 3.0) · Gradient Descent: learning rate 0.10 · Computation (step 0) panel]

After one step, the line is still wrong, but it’s less wrong. Run it again. And again. Each step makes the error smaller, the gradients shrink, and the line creeps closer to the target. Expand the “Computation” section to see the numbers at each step. For now, keep the learning rate at its default (0.1) — we’ll explore what it does and how to choose it in the next section.

As you keep clicking Step, watch the right chart — the loss drops steeply at first (the model corrects its worst mistakes quickly, because the gradients are large), then flattens out as it gets close to the answer (smaller errors mean smaller gradients, so each step does less). This is called convergence — the model settling into the right parameters.

This happens even though the learning rate stays constant the whole time. Remember, the gradient is dw = 2 * mean(error * x) — it’s computed from the errors. As the model gets closer to the correct answer, the errors shrink, which makes the gradient smaller, which makes the update lr * dw smaller. The learning rate doesn’t change, but the steps get smaller automatically because there’s less error to correct. This self-regulating behavior is the signature of gradient descent: it’s fast when it matters most, then carefully fine-tunes without you having to change anything.

Each click of “Step” runs one full training iteration — the four stages of the training loop we described earlier. In Python, it looks like this:

# 1. forward pass
y_pred = w * x + b

# 2. loss computation
error  = y_pred - y
loss   = np.mean(error ** 2)

# 3. backpropagation
dw = 2 * np.mean(error * x)
db = 2 * np.mean(error)

# 4. gradient descent
w = w - lr * dw
b = b - lr * db

Steps 1 and 2 are the forward pass and loss. The forward pass runs each input through y = wx + b to get a prediction (you can see the per-input predictions in the Computation panel). The loss computation then measures how wrong we are — the difference between each prediction and actual value, squared and averaged into a single number (MSE). We covered how MSE works in the previous section.

Steps 3 and 4 are where the learning happens. Backpropagation (step 3) computes a gradient for each parameter — which direction to nudge it and by how much. Gradient descent (step 4) applies those gradients, subtracting a small fraction (the learning rate) from each parameter. Let’s briefly touch on the learning rate — the lr in step 4 — before diving into backpropagation, which is where most of the complexity lives.

Choosing the learning rate

In the computation section above, you can see that gradient descent updates parameters like this:

w = w - lr * dw
b = b - lr * db

The gradient (dw, db) tells us which direction to move each parameter and by how much relative to the others. But how far should we actually step? That’s what the learning rate (lr) controls — it scales every gradient before applying it.

The direction is always correct — but the step size can be wrong. If lr is too small, each step barely moves and training takes forever. If lr is too large, you overshoot the minimum, landing on the other side where the loss is worse than before. Think of the learning rate as a confidence knob:

  • Too small (try 0.01) — each step is tiny. The model inches toward the answer, taking hundreds of steps to get there. Safe but painfully slow.
  • Just right (try 0.1) — the model takes confident steps, converging in 20-30 steps. The loss drops quickly at first, then fine-tunes.
  • A bit too large (try 0.5) — the model overshoots the minimum, bouncing back and forth across it. But each overshoot lands closer to the bottom where gradients are smaller, so the bounces shrink and it still converges — just with a zigzag path and more steps than lr = 0.1.
  • Too large (try 1.0) — the overshooting gets more extreme. Each step lands far from the minimum, where the gradient is still large, causing another large step. It may still converge, but it’s jittery and wasteful.
  • Way too large (try 1.5) — the overshooting is so extreme that each step lands somewhere worse than before. The gradient gets larger, not smaller, so the next step is even bigger — a feedback loop that sends the loss spiraling upward. This is called divergence.

Try it yourself — change the learning rate and click Step x10 to see the effect:

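The same loop makes these regimes easy to see in code. A sketch on the five points from earlier — note that the exact thresholds for overshooting and divergence depend on the data, so they differ from the widget’s:

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2], dtype=float)
y = 2 * x + 1

def loss_after(lr, steps=30):
    w, b = 0.0, 0.0
    for _ in range(steps):
        error = (w * x + b) - y
        w -= lr * 2 * np.mean(error * x)   # gradient descent update for w
        b -= lr * 2 * np.mean(error)       # gradient descent update for b
    return np.mean(((w * x + b) - y) ** 2)

for lr in [0.01, 0.1, 1.5]:
    # too small: still far off · just right: essentially zero · too large: exploded
    print(lr, loss_after(lr))
```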

There’s no formula for the “right” learning rate. In practice, most people start with a common default (0.001 or 0.0001), use an adaptive optimizer like Adam that automatically adjusts the step size per parameter based on how its gradients have been behaving, and apply a learning rate schedule that starts large (big steps to get roughly close) and shrinks during training (small steps to fine-tune). Almost everyone uses Adam or a variant instead of plain gradient descent.
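To make “adaptive” concrete, here’s a minimal sketch of the Adam update rule applied to our two-parameter line fit. The hyperparameters are the common defaults; this is an illustration of the update rule, not a full optimizer (no minibatching, no schedule):

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2], dtype=float)
y = np.array([-3, -1, 1, 3, 5], dtype=float)

lr, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8
params = np.zeros(2)              # [w, b]
m = np.zeros(2)                   # running mean of gradients (momentum)
v = np.zeros(2)                   # running mean of squared gradients

for t in range(1, 201):
    w, b = params
    error = w * x + b - y
    grad = np.array([2 * np.mean(error * x), 2 * np.mean(error)])
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)                   # bias correction
    v_hat = v / (1 - beta2 ** t)
    params -= lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter step size

w, b = params
print(round(float(np.mean((w * x + b - y) ** 2)), 4))  # final loss — near zero
```

Dividing by √v̂ is what makes the step size adaptive: parameters with consistently large gradients get smaller effective steps, and vice versa.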

All of these are refinements on top of the same 4-stage training loop. The core algorithm doesn’t change.

Computing backpropagation

In the computation section above, you can see that backpropagation computes the gradients like this:

dw = 2 * np.mean(error * x)
db = 2 * np.mean(error)

There’s a lot packed in these two lines. Why do we multiply error by x for dw but not for db? Where does the 2 come from? What does mean have to do with anything? Let’s unpack it step by step.

Remember, the loss is computed from the predictions, and the predictions depend on w and b. The loss is ultimately a function of the parameters — change w or b, and the loss changes. Drag w or b in the widget below and watch the loss change — the white dot moves along the curve, showing exactly where you are on the loss landscape:


Try dragging w — the dot moves along the left curve, but the right curve reshapes. Why? The right chart asks “for every possible b, what’s the loss?” — with w fixed at whatever the slider says. When you change w, you change that fixed value, which changes the errors at every b, producing a completely different curve. The same happens in reverse: drag b and the left curve reshapes. The best value for w depends on where b is, and vice versa — they’re coupled.

The gradient formulas come from taking the derivative of these curves — measuring how much the loss changes when you nudge each parameter by a tiny amount. So, 2 * mean(error * x) is simply the derivative of the loss function with respect to w.

To understand how we get from the loss function to 2 * mean(error * x), we need three concepts that build on each other:

  1. Derivatives — what it means to measure how a function changes
  2. The chain rule — how to compute derivatives when functions are chained together
  3. Partial derivatives and gradients — how to handle multiple parameters at once

By the end, we’ll trace exactly where every piece of that formula comes from. Let’s start with what a derivative actually is.

The formula we just saw — 2 * mean(error * x) — is specific to MSE loss with a linear model. Different loss functions and architectures produce different gradient formulas — but the underlying math principles are always the same. For our simple model we can derive the formula by hand; for deep networks with millions of parameters, frameworks like PyTorch and TensorFlow compute derivatives automatically using autograd (automatic differentiation).
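Before deriving the formula, we can at least verify it numerically, the same way autograd implementations are often sanity-checked: nudge each parameter a tiny bit and measure how the loss responds. A small sketch on our five data points:

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2], dtype=float)
y = np.array([-3, -1, 1, 3, 5], dtype=float)

def loss(w, b):
    return np.mean((w * x + b - y) ** 2)

w, b, eps = 3.0, 1.0, 1e-6

# analytic gradients — the backpropagation formulas
error = w * x + b - y
dw = 2 * np.mean(error * x)
db = 2 * np.mean(error)

# numerical gradients — nudge each parameter, measure the loss change
dw_num = (loss(w + eps, b) - loss(w, b)) / eps
db_num = (loss(w, b + eps) - loss(w, b)) / eps

print(dw, round(dw_num, 3))  # both ≈ 4.0
print(db, round(db_num, 3))  # both ≈ 0.0
```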

The derivative: slope at a point

A derivative answers one question: if I nudge this input a tiny bit, how much does the output change? Think of it as the slope of a curve at a single point. If you’re standing on a hill, the derivative tells you how steep the ground is under your feet — and in which direction it goes downhill.

Take a simple function like f(x) = x². Drag the point x along the curve and watch how the slope and the derivative change:

[interactive widget — at x = 1.0 the derivative is 2.0 · sliders: x = 1.0, dx = 0.80]

Computation

Set x = 2 and dx = 0.5 in the widget. The yellow line (dx) is a nudge to the input, the green line (df) is how much the output changes in response. The Computation section below the chart shows how these combine into the derivative.

First, we evaluate the function at our point: f(2) = 4. Then we nudge the input by dx and evaluate again: f(2.5) = 6.25. The difference tells us how much the output changed: df = 6.25 - 4 = 2.25. Dividing by the nudge gives us the rate of change: df/dx = 2.25 / 0.5 = 4.5.

That ratio (4.5) is approximately the derivative at x = 2 — it tells you the rate: at this point, the output changes about 4× faster than the input. It’s not exactly 4 because dx = 0.5 is still a large nudge. Now let’s reduce the dx — try dragging it down to 0.1:

  • f(2) = 4
  • f(2.1) = 4.41
  • df = 0.41
  • df/dx = 0.41 / 0.1 = 4.1 — closer to 4

As dx shrinks, the ratio converges to the exact derivative.

That’s the whole idea — the derivative is what df/dx approaches as dx shrinks toward zero: the exact rate of change at a single point.
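The slider experiment translates directly to code — shrink dx and watch the ratio settle on the exact derivative (floating-point noise aside, the trend is clear):

```python
def f(x):
    return x ** 2

x = 2.0
for dx in [0.5, 0.1, 0.01, 0.001]:
    ratio = (f(x + dx) - f(x)) / dx   # the df/dx ratio for this nudge
    print(dx, ratio)                  # approaches the exact derivative 2x = 4
```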

The general formula looks complex, but it’s actually exactly what we just did:

f'(x) = \lim_{dx \to 0} \frac{f(x + dx) - f(x)}{dx}

f(x + dx) - f(x) is the change in output (df). Divide by dx to get the ratio. The lim part just means “shrink dx toward zero” — exactly what you did with the slider, watching the ratio converge to the exact value.

For f(x) = x², we can work it out:

f(x + dx) = (x + dx)^2 = x^2 + 2x \cdot dx + dx^2

f(x + dx) - f(x) = 2x \cdot dx + dx^2

\frac{f(x + dx) - f(x)}{dx} = 2x + dx

As dx → 0, that’s just 2x. So df/dx = 2x.

That’s the whole point of calculus — it eliminates the need to pick a dx. The widget shows why you can trust the exact formula: no matter what dx you pick, the ratio trends toward 2x as dx shrinks. So we skip the shrinking and just use 2x.

If you want to build a deeper understanding of derivatives, The Essence of Calculus by 3Blue1Brown is the best explanation out there. The entire series is worth watching — it builds the intuition that textbooks often skip.

The chain rule: computing derivatives of combined functions

We know how to take the derivative of a simple function like f(x) = x². But what happens when one function feeds into another? That’s called function composition — and it’s exactly what our computation does:

y_pred = w * x + b             # prediction
error  = y_pred - y            # how far off
loss   = np.mean(error ** 2)   # squared error, averaged

We want to figure out how to nudge w to reduce the loss, but only f₁ contains w. We need to compute the derivative of the loss with respect to w, but loss (f₃) doesn’t take w as a parameter — it takes error. And error (f₂) doesn’t take w either — it takes y_pred. Only y_pred (f₁) finally takes w.

So we can see that computing the loss from w isn’t one function — it’s a chain of three functions, each feeding its output into the next:

w \xrightarrow{f_1} y\_pred \xrightarrow{f_2} error \xrightarrow{f_3} error^2

Spelled out:

  • f₁(w) = w · x + b — the model’s prediction
  • f₂(y_pred) = y_pred - y — how far off we are
  • f₃(error) = error² — the squared error (what we want to minimize)

The loss is f₃(f₂(f₁(w))) — three functions nested inside each other.

We can find the derivative of each individual function, but how do we combine them to get the derivative of the whole chain? The answer is the chain rule: multiply the local derivatives together.

\frac{d(\text{loss})}{dw} = f'_1 \cdot f'_2 \cdot f'_3

Remember the derivative formula:

f'(x) = \lim_{dx \to 0} \frac{f(x + dx) - f(x)}{dx}

The numerator f(x + dx) - f(x) is the change in the output — df. The denominator dx is the change in the input. So the whole thing is df/dx — “change in f divided by change in x”. This is another way to write f'(x). The output goes on top, the input goes on the bottom:

  • f'_1: output is y_pred, input is w → d(y_pred)/dw
  • f'_2: output is error, input is y_pred → d(error)/d(y_pred)
  • f'_3: output is error², input is error → d(error²)/d(error)

Using this notation, the chain rule expands to:

\frac{d(\text{loss})}{dw} = f'_1 \cdot f'_2 \cdot f'_3 = \frac{d(y\_pred)}{dw} \cdot \frac{d(\text{error})}{d(y\_pred)} \cdot \frac{d(\text{error}^2)}{d(\text{error})}

Why multiplication? Because each function is nested inside the next — the output of one becomes the input of another. Think of it as a chain of nudges: if you nudge w by a tiny amount, y_pred changes by x times that nudge. Then error changes by 1 times whatever y_pred changed. Then error² changes by 2·error times whatever error changed. Each link in the chain scales the nudge — and scaling compounds by multiplication.
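You can watch this compounding numerically. A minimal sketch (reusing the article’s running numbers w = 3, b = 1 and the data point x = 2, y = 5, with an arbitrary nudge size eps): each intermediate value changes by exactly the factor its local derivatives predict.

```python
w, b, x, y = 3.0, 1.0, 2.0, 5.0
eps = 1e-6  # the tiny nudge to w

def forward(w):
    y_pred = w * x + b                 # f1
    error = y_pred - y                 # f2
    return y_pred, error, error ** 2   # f3

y_pred0, error0, loss0 = forward(w)
y_pred1, error1, loss1 = forward(w + eps)

print((y_pred1 - y_pred0) / eps)  # ≈ 2 = x             (f'1)
print((error1 - error0) / eps)    # ≈ 2 = x · 1         (f'1 · f'2)
print((loss1 - loss0) / eps)      # ≈ 8 = x · 1 · 2·err (the full chain)
```

The nudge to w gets scaled by x on its way to y_pred, passes through error unchanged, then gets scaled by 2 · error on its way to the loss: 2 · 1 · 4 = 8.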

The three ways to combine functions

There are three fundamental ways to combine two functions f(x) and g(x), and each has its own rule for how derivatives combine:

  1. Addition: h(x) = f(x) + g(x) — derivatives add. If f changes by 3 and g changes by 5, the sum changes by 8. This is the sum rule: h'(x) = f'(x) + g'(x).

  2. Multiplication: h(x) = f(x) · g(x) — it’s more complex because both factors can change. This is the product rule: h'(x) = f'(x) · g(x) + f(x) · g'(x). You have to account for each function changing while the other is held constant.

  3. Composition (nesting): h(x) = f(g(x)) — the output of g feeds into f. Derivatives multiply. This is the chain rule: h'(x) = f'(g(x)) · g'(x). A nudge to x gets scaled by g', then that scaled change gets scaled again by f'.

Our loss computation is a composition — f_3(f_2(f_1(w))) — which is why we multiply the derivatives. If the functions were added or multiplied together instead, we’d use the corresponding rule. In practice, neural networks use all three: addition (bias terms), multiplication (weights times inputs), and composition (layers feeding into each other). Backpropagation applies whichever rule matches each operation.
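To see all three rules at once, here is a quick numerical check. The functions f(x) = x² and g(x) = 3x are arbitrary picks for illustration:

```python
f = lambda x: x ** 2          # f'(x) = 2x
g = lambda x: 3 * x           # g'(x) = 3
x, eps = 2.0, 1e-6

# central-difference numerical derivative of any function h at x
num = lambda h: (h(x + eps) - h(x - eps)) / (2 * eps)

print(num(lambda t: f(t) + g(t)))  # sum rule:     f' + g'       = 4 + 3    = 7
print(num(lambda t: f(t) * g(t)))  # product rule: f'·g + f·g'   = 4·6 + 4·3 = 36
print(num(lambda t: f(g(t))))      # chain rule:   f'(g(x)) · g' = 12 · 3    = 36
```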

Okay, so let’s compute the combined derivative of our loss chain. We showed how to find the derivative of x², which gave us 2x. The same approach works for simpler functions: the derivative of ax + b is just a (a constant multiplier), and the derivative of x - c is 1 (subtracting a constant doesn’t change the rate). This makes calculating each individual derivative straightforward:

  • To compute f'_1 = d(y_pred)/dw, we use the fact that the derivative of ax + b is a. Since y_pred = w · x + b, the derivative is x.
  • To compute f'_2 = d(error)/d(y_pred), we use the fact that the derivative of x - c is 1. Since error = y_pred - y, the derivative is 1.
  • To compute f'_3 = d(error²)/d(error), we use the fact that the derivative of x² is 2x. Since the function is error², the derivative is 2 · error.

Which gives us:

\frac{d(\text{loss})}{dw} = f'_1 \cdot f'_2 \cdot f'_3 = x \cdot 1 \cdot (2 \cdot error) = 2 \cdot error \cdot x

That’s 2 · error · x for a single data point.

Let’s trace it with real numbers. With w = 3, b = 1, take data point x = 2, y = 5:

w = 3
  ↓  × x = ×2
y_pred = 3·2 + 1 = 7
  ↓  × 1
error = 7 - 5 = 2
  ↓  × 2·error = ×4
error² = 4

Chain rule: f'_1 · f'_2 · f'_3 = 2 · 1 · 4 = 8. That means if we nudge w by 1, this data point’s squared error changes by 8.

But we have 5 data points, not one. Since MSE averages the squared errors over all data points, we need to average the derivatives too. For each point we compute 2 · error · x:

| x | y | y_pred = 3x + 1 | error | 2 · error · x |
|---|---|---|---|---|
| -2 | -3 | -5 | -2 | 2 · (-2) · (-2) = 8 |
| -1 | -1 | -2 | -1 | 2 · (-1) · (-1) = 2 |
| 0 | 1 | 1 | 0 | 2 · 0 · 0 = 0 |
| 1 | 3 | 4 | 1 | 2 · 1 · 1 = 2 |
| 2 | 5 | 7 | 2 | 2 · 2 · 2 = 8 |

Average them: (8 + 2 + 0 + 2 + 8) / 5 = 4. So dw = 4 — the gradient tells us the loss increases when we increase w, so we should decrease it. (And indeed, the true value is w = 2, which is lower than our guess of 3.)

In math notation, that’s:

\frac{d(\text{loss})}{dw} = \frac{1}{n} \sum_{i=1}^{n} 2 \cdot error_i \cdot x_i = 2 \cdot \frac{1}{n} \sum_{i=1}^{n} error_i \cdot x_i

And in Python:

dw = 2 * np.mean(error * x)
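Running that formula on our 5 data points (with the guess w = 3, b = 1) reproduces the per-point values from the table and their average:

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2], dtype=float)
y = 2 * x + 1                  # the data: y = 2x + 1
w, b = 3.0, 1.0                # our current (wrong) guess

error = (w * x + b) - y        # per-point errors
per_point = 2 * error * x      # [8, 2, 0, 2, 8], as in the table
dw = np.mean(per_point)        # 4.0
print(per_point, dw)
```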

The widget below lets you trace this chain for each data point. Click the different x= buttons to see how the local derivatives change — notice how the chain rule gives a different value for each point, because x and error are different:

[Interactive widget: traces the chain w → y_pred = w·x + b → error = y_pred - y → error² for the selected data point, with a w slider from -3.0 to 3.0]

For db it’s the same chain, except f'_1 is different: since y_pred = w · x + b, the derivative with respect to b is just 1 (instead of x). So:

\frac{d(\text{loss})}{db} = f'_1 \cdot f'_2 \cdot f'_3 = 1 \cdot 1 \cdot (2 \cdot error) = 2 \cdot error

And in our Python code:

db = 2 * np.mean(error)
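A good habit is to sanity-check these formulas against a numerical derivative: nudge each parameter by a tiny eps and measure the change in the loss directly. A sketch on the same 5-point dataset (w = 3, b = 1; note that db comes out 0 here because b = 1 is already correct):

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2], dtype=float)
y = 2 * x + 1
w, b = 3.0, 1.0

def loss(w, b):
    return np.mean((w * x + b - y) ** 2)

error = (w * x + b) - y
dw = 2 * np.mean(error * x)    # analytic: 4.0
db = 2 * np.mean(error)        # analytic: 0.0 (b is already right)

eps = 1e-6
dw_num = (loss(w + eps, b) - loss(w - eps, b)) / (2 * eps)
db_num = (loss(w, b + eps) - loss(w, b - eps)) / (2 * eps)
print(dw, dw_num)   # both ≈ 4
print(db, db_num)   # both ≈ 0
```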

This is why the chain rule matters: any time the loss is computed through a sequence of operations (and it always is), you need it to trace back how each parameter affected the final result. For our 2-parameter model the chain has 3 steps. A deep neural network might have hundreds — one for each layer — but the principle is identical: each layer is one more function in the composition, one more local derivative to multiply.

Applying derivatives to our loss function

Now that we know how to compute individual derivatives and combine them with the chain rule, let’s apply this knowledge to our problem. The “curve” we want to minimize is our loss function — the formula we chose to measure error. We know this function exactly:

loss(w, b) = \frac{1}{n}\sum_{i=1}^{n}(w \cdot x_i + b - y_i)^2

loss = np.mean((w * x + b - y) ** 2)

What we don’t know is which values of w and b make it smallest. The derivative helps us find out: it tells us, if we increase w by a tiny amount, does the loss go up or down? And how fast?

The data (x and y) is fixed — it’s our training data. If we also hold b constant for now (say b = 3), then the loss becomes a function of w alone, and we can plot it as a simple curve. For example, at w = 0:

y_pred = w * x + b                  # 0 * [-2,-1,0,1,2] + 3 = [3, 3, 3, 3, 3]
error  = y_pred - y                 # [3,3,3,3,3] - [-3,-1,1,3,5] = [6, 4, 2, 0, -2]
loss   = np.mean(error ** 2)        # mean([36, 16, 4, 0, 4]) = 12.0

That gives us one point on the curve: (w=0, loss=12). Do this for every w from -5 to 5 (keeping b = 3 fixed) and we get the full picture — the loss as a function of w alone:

[Interactive widget: the loss curve as a function of w, with b = 3.0 fixed]

The x-axis is w (the parameter value we’re testing), the y-axis is the loss (how wrong the model is at that w). The result is a parabola — and its lowest point is at w = 2. The loss there isn’t zero: b is still stuck at 3 (the true value is 1), so every prediction is off by 2 and the loss bottoms out at 4. Drag the w slider and watch the calculation below — for each of the 5 data points from our training set above (x = [-2, -1, 0, 1, 2]), it computes a prediction, error, and squared error, then averages them into a single loss value. That’s the white dot on the curve. (The range -5 to 5 is arbitrary — just wide enough to show the shape and include the minimum. We could sweep -100 to 100, but the curve would be too zoomed out to see the detail.)

We can do the same for b — this time fixing w = 2 and varying b from -5 to 5:

[Interactive widget: the loss curve as a function of b, with w = 2.0 fixed]

The same parabola shape, but now the x-axis is b. The minimum is at b = 1, where the loss drops to zero. Together, w = 2 and b = 1 are the exact parameters that generated our data — y = 2x + 1.

For both w and b calculations, we computed the loss at many values across the range to plot the full curve. This helps build intuition — but in practice, you’d never do this. With 2 parameters, trying every combination is trivial. But a real neural network has millions of parameters. To plot the loss landscape, you’d need to try all of them in all combinations — impossibly expensive. That’s why we need the derivative: instead of mapping out the entire curve to find the minimum, we compute the slope at a single point and step downhill. We never see the full picture. We just feel the ground under our feet.
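The sweep itself is only a few lines. A sketch, with b pinned at 3 to match the first curve:

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2], dtype=float)
y = 2 * x + 1
b = 3.0                                   # frozen

ws = np.linspace(-5, 5, 101)              # candidate w values
losses = np.array([np.mean((w * x + b - y) ** 2) for w in ws])

best_w = ws[np.argmin(losses)]
print(best_w, losses.min())   # ≈ 2.0 and 4.0: the bottom of the parabola
```

The minimum loss is 4, not 0, because b = 3 is still wrong; only the b-sweep with w = 2 fixed can reach zero.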

Partial derivatives

Notice what we just did: to understand how the loss depends on w, we froze b and varied w alone. To understand how it depends on b, we froze w and varied b alone. This is exactly what a partial derivative is — the derivative of the loss with respect to one parameter, while holding all others fixed:

  • ∂loss/∂w — how the loss changes when you nudge w (with b frozen)
  • ∂loss/∂b — how the loss changes when you nudge b (with w frozen)

Each of the two curves above is a slice through the loss landscape along one parameter. The slope of that curve at any point is the partial derivative:

dw = 2 * np.mean(error * x)        # ∂loss/∂w — how loss changes with w
db = 2 * np.mean(error)            # ∂loss/∂b — how loss changes with b

The widget below is a combined view of the two curves we saw above, now showing the partial derivatives in action. The left chart varies w (holding b fixed) — the right chart varies b (holding w fixed). On each chart, the white dot is where you are now, the blue dashed line is the tangent (its slope is the partial derivative), and the green arrow shows which direction to move to reduce the loss.

[Interactive widget: two loss curves side by side; the left varies w (b = 3.0 fixed), the right varies b (w = -3.0 fixed), each with tangent line and descent arrow]

Drag the sliders and watch what happens:

  • Far from the minimum — the curve is steep, the tangent tilts sharply, and the derivative is a large number. Gradient descent takes big steps here.
  • Near the minimum — the curve flattens out, the tangent is nearly horizontal, and the derivative is close to zero. Steps get tiny — the model is fine-tuning.
  • At the minimum — the tangent is perfectly flat. The derivative is zero. There’s nowhere to go — you’ve arrived.

Notice something interesting: when you drag w, the right chart reshapes — and vice versa. Why?

The left chart asks: “for every possible w, what’s the loss?” — with b fixed at whatever the slider says. When you drag w, you’re just picking which point on that curve to stand on. The curve itself doesn’t change because b hasn’t changed.

But the right chart asks: “for every possible b, what’s the loss?” — with w fixed. When you drag w, you change the fixed w that’s used to compute every point on the right curve. Different w means different errors at every b, which means a completely different curve. (And vice versa — dragging b reshapes the left chart but just moves the dot on the right.)

This is why we compute both partial derivatives from the same errors before updating anything. The best direction to nudge w depends on where b currently is, and vice versa — so you measure both slopes first, then move both parameters.

From derivative to gradient

The gradient is simply the vector of all partial derivatives bundled together: [dw, db]. It points in the direction of steepest increase in loss. So we move in the opposite direction — that’s why the update rule subtracts: w = w - lr * dw.
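Put together, the gradient formulas plus the update rule form the whole training loop. A minimal sketch (the learning rate 0.1 and step count are arbitrary choices, not necessarily the widget’s settings):

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2], dtype=float)
y = 2 * x + 1
w, b, lr = -3.0, 3.0, 0.1       # start far from the answer

for step in range(200):
    error = (w * x + b) - y             # forward pass
    dw = 2 * np.mean(error * x)         # ∂loss/∂w
    db = 2 * np.mean(error)             # ∂loss/∂b
    w -= lr * dw                        # step downhill
    b -= lr * db

print(round(w, 3), round(b, 3))   # converges to w = 2.0, b = 1.0
```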

The two charts above are really just 2D slices of a 3D surface. With two parameters, we can visualize the full loss landscape — w on one axis, b on another, and loss as the height. Even though our model is linear (y = wx + b), the loss function is quadratic — a bowl shape — because MSE squares the errors. Different loss functions (like cross-entropy) produce different landscape shapes. This bowl is called a convex loss landscape — there’s only one valley, so no matter where you start, every direction downhill leads to the same bottom. Try clicking Step (both) from different starting points — you’ll always end up at w ≈ 2, b ≈ 1. Deep neural networks have more complex, non-convex landscapes with multiple valleys, but the same gradient descent algorithm still works remarkably well in practice.

[Interactive widget: the 3D loss surface over (w, b), starting at w = -3.0, b = 3.0]

Try clicking Step (w) and Step (b) separately — you’ll see the point move along one axis at a time, creating a staircase pattern down the bowl. Then try Step (both) — this is what real gradient descent does, updating both parameters at once. You can rotate the surface by dragging to see it from different angles.

The yellow arrow is the gradient vector — it shows the direction the next step will go. It combines dw and db into a single direction: “move this way to reduce the loss fastest.” When you click Step (w) or Step (b), you’re moving along just one component of this vector. When you click Step (both), you follow the full arrow.

You might notice that the arrow points mostly along the w axis. That’s because the gradients aren’t equal — at the starting point, dw = -20 but db = 4. The w component is 5x larger, so it dominates the direction. Gradient descent doesn’t move equally in all directions; it moves proportionally to how sensitive the loss is to each parameter. The loss is much steeper along w here, so w gets corrected first.

Why is the loss so much more sensitive to w than to b? The answer is in the gradient formulas. Compare them — dw = 2 * mean(error * x) versus db = 2 * mean(error). Notice the key difference: dw multiplies each error by the corresponding x value, while db uses just the errors alone. Our x values range from -2 to 2, so when the model is very wrong (large errors) and the inputs are large, the product error * x becomes huge. The bias gradient db only averages the errors themselves — no multiplication by x — so it’s naturally smaller.
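You can verify those two numbers directly at the starting point w = -3, b = 3:

```python
import numpy as np

x = np.array([-2, -1, 0, 1, 2], dtype=float)
y = 2 * x + 1
w, b = -3.0, 3.0               # the widget's starting point

error = (w * x + b) - y
dw = 2 * np.mean(error * x)
db = 2 * np.mean(error)
print(dw, db)   # -20.0 and 4.0: |dw| is 5x |db|, so the step moves mostly along w
```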

This is a general property, not specific to our toy example. In any neural network, some parameters affect the loss more than others. The gradient captures this automatically — parameters with large gradients get large updates, and parameters with small gradients get small updates. Each parameter is corrected in proportion to how much it contributed to the error. That’s what makes gradient descent efficient: it doesn’t waste effort on parameters that are already close to correct.

Stochastic gradient descent

So far, we’ve been computing gradients using all of our training data at once. Every time you clicked “Step” in the widget above, dw = 2 * mean(error * x) computed error * x for each of our 5 data points individually, then averaged them and doubled the result:

dw = 2 * mean(error * x)
   = 2 * mean([-24.00, -7.00, 0.00, -3.00, -16.00])
   = -20.00

With 5 points that’s trivial. The advantage of using all the data is that the averaged gradient points in the best possible direction — every data point gets a vote, so no single outlier can steer the update off course. But real datasets have millions or billions of examples. Computing a gradient for every single one and averaging them all before taking a single step is expensive. Imagine a dataset with 1 billion examples — you’d have to process all 1 billion before updating w and b even once. That’s one step. Then do it again for the next step. Each step gives you a very accurate gradient, but you’re waiting forever between updates.

The fix is simple: don’t use all the data at once. Shuffle the training examples, then split them into small groups — mini-batches. Run the 5-step recipe on the first mini-batch: compute predictions, errors, loss, gradients, and update parameters — using only those few examples. Then move to the next mini-batch, and so on.

This is stochastic gradient descent (SGD). “Stochastic” just means random — referring to the random shuffle. The update rule is the same, just summing over the mini-batch instead of the full dataset:

w = w - lr \cdot \frac{1}{|B|} \sum_{i \in B} \frac{\partial \text{loss}_i}{\partial w}

where B is the current mini-batch. Each mini-batch gives a noisy estimate of the true gradient — it won’t point in the exact right direction, but it will be roughly correct. Over the course of an epoch, every training example contributes, and the noise averages out.

Once you’ve gone through every example, that’s one epoch. Shuffle again and start the next epoch. This is why you see “epoch” in training logs — each epoch means the model has seen every example in the dataset exactly once.

For our 5 data points with a batch size of 2, it looks like this:

| Epoch | Shuffled data | Batch 1 | Batch 2 | Batch 3 |
|---|---|---|---|---|
| 1 | [0, 2, -1, -2, 1] | (0, 2) | (-1, -2) | (1) |
| 2 | [2, -2, 1, 0, -1] | (2, -2) | (1, 0) | (-1) |
| 3 | [-1, 1, -2, 2, 0] | (-1, 1) | (-2, 2) | (0) |

Each batch runs the full update loop (forward pass → loss → backpropagation → gradient descent), so each epoch does 3 parameter updates instead of 1. By the end of each epoch, every data point has been used exactly once — but the order is different each time, which prevents the model from memorizing the sequence.
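Here is a minimal SGD sketch over our 5 points with batch size 2 (the seeded shuffle is an arbitrary choice, so the batch order won’t match the table above):

```python
import numpy as np

rng = np.random.default_rng(0)          # seeded so runs are reproducible
x = np.array([-2, -1, 0, 1, 2], dtype=float)
y = 2 * x + 1
w, b, lr, batch_size = -3.0, 3.0, 0.1, 2

for epoch in range(500):
    idx = rng.permutation(len(x))                # shuffle once per epoch
    for start in range(0, len(x), batch_size):
        batch = idx[start:start + batch_size]    # batches of 2, 2, 1
        error = (w * x[batch] + b) - y[batch]    # forward pass on the batch only
        w -= lr * 2 * np.mean(error * x[batch])
        b -= lr * 2 * np.mean(error)

print(round(w, 2), round(b, 2))   # noisy steps, but it still lands near w = 2, b = 1
```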

The widget below runs both methods side by side on our 5-point dataset so you can compare them directly. Both start from the same parameters (w = -3, b = 3) and use the same learning rate. Each click of “Step” does one parameter update for each method. The blue line (full batch) uses all 5 points every step. The orange line (mini-batch) uses only batch_size points — orange circles show which ones, and the epoch bar tracks progress through the dataset.

[Interactive widget: full-batch vs mini-batch training, with batch size (2) and learning rate (0.10) controls]

Click “Step” a few times and watch the loss curves on the right. The blue curve (full batch) drops smoothly — every step uses all the data, so the gradient always points in the best direction. The orange curve (mini-batch) zigzags.

By default, fixed order is enabled so the batches are the same every run, making this zigzag pattern reproducible. Uncheck it to shuffle randomly each epoch — the orange curve will look different every time, but the overall behavior is the same.

Click through the first few steps to see why: step 1→2 the loss drops (the batch happened to give a good gradient), but step 2→3 the loss goes up — that batch pulled the parameters in a direction that helped its own points but hurt others. Then step 3→4 it drops again. This is normal: each batch only sees a slice of the data, so some steps overshoot or even go the wrong way. Over many steps, these errors average out and the model still converges.

Full batch converges in fewer steps because each step uses all the data — so why bother with mini-batches? Compare the cost for 3 parameter updates with our 5 data points:

  • Full batch (3 steps): every step uses all 5 points. That’s 3 × 5 = 15 data point computations. Each point is processed 3 times.
  • SGD with batch size 2 (3 steps = 1 epoch): each step uses only 2 points (or 1 for the last batch). That’s 2 + 2 + 1 = 5 data point computations. Each point is processed once.

Both do 3 parameter updates, but SGD does it with 3x less computation. The updates are noisier, but the savings are enormous at scale — with 1 billion examples and a batch size of 1000, one epoch gives you 1 million updates while processing each example only once. Full batch would need to process all 1 billion examples for each of those updates. The widget can’t show this cost difference (both are just a click), but at real scale, mini-batch reaches the answer faster in wall-clock time despite needing more steps.

You can also try different batch sizes to see the difference in behavior:

  • batch size = 1 — maximum noise, each step uses a single point. The loss curve zigzags wildly, but it still converges. 5 steps = 1 epoch (each point seen once). This is the original “stochastic” gradient descent.
  • batch size = 2 — less noise, each epoch takes 3 steps (2 + 2 + 1 leftover). This is closer to what’s used in practice.
  • batch size = 5 — that’s all our data in one batch, so it’s identical to full batch gradient descent. Both lines overlap perfectly. 1 step = 1 epoch.

In practice, batch sizes of 32, 64, or 256 are common. The tradeoff: smaller batches mean more steps per epoch (noisier but faster per step); larger batches mean fewer steps (smoother but more computation per step). The noise from small batches can actually help — it prevents the model from getting stuck in shallow local minima.

The chain rule across many layers

Remember how backpropagation uses the chain rule to multiply local derivatives together? Our example had the simplest case: a single neuron with one weight w, one bias b, and one input:

[Diagram: a single neuron: input x, parameters w and b, output ŷ, then error and loss. ∂loss/∂w = x · 1 · 2·error; ∂loss/∂b = 1 · 1 · 2·error]

The chain from each parameter to the loss had 3 links (f'_1 · f'_2 · f'_3) that looked like this in Python:

y_pred = w * x + b                  # f1: prediction
error  = y_pred - y                 # f2: how far off
loss   = error ** 2                 # f3: squared error

dw = x * 1 * (2 * error)           # chain rule for w: f'1 · f'2 · f'3
db = 1 * 1 * (2 * error)           # chain rule for b: same chain, different f'1

But real networks have multiple neurons per layer, each with their own weights. The gradient computation doesn’t change — we still compute a partial derivative for each individual weight using the chain rule, same as before. The difference is scale. The chain gets longer for weights that sit further from the loss — more layers to pass through, more derivatives to multiply. And it gets wider — a neuron’s output can feed into many neurons in the next layer, so the gradient must sum contributions from all of those paths.

Look at this network with 2 layers. Toggle between the two buttons to see how the gradient path differs depending on where the weight sits:

[Diagram: a 2-layer network: inputs x₁, x₂ feed layer-1 neurons h₁, h₂, h₃ (w₁ is the weight on x₁→h₁); those feed layer-2 neurons g₁, g₂, g₃ (v₁ is the weight on h₁→g₁); the g’s feed the output ŷ and the loss]

Gradient for v₁ (layer 2) starts active — v₁ is the weight on the connection from h₁ to g₁. Its gradient chain is short: just that one connection, then g₁→ŷ→loss. Notice that only the h₁→g₁ connection lights up, not h₂→g₁ or h₃→g₁. Why? Because when we compute the partial derivative with respect to v₁, the other inputs to g₁ are held constant — they’re multiplied by different weights and don’t appear in v₁’s derivative. It’s the same principle from our single-neuron model: the derivative of w·x + b with respect to w is just x — the other parameter b doesn’t appear. Same here: the derivative of v₁·h₁ + v₂·h₂ + v₃·h₃ with respect to v₁ is just h₁. So the gradient is:

\frac{\partial \text{loss}}{\partial v_1} = \underbrace{h_1}_{\text{local derivative}} \cdot \underbrace{\frac{\partial \text{loss}}{\partial g_1}}_{\text{from output → loss}}

Another way to look at this formula is to expand ∂loss/∂g₁ — since the loss is just error² and error flows straight from the output, it resolves to 2 · error:

\frac{\partial \text{loss}}{\partial v_1} = \underbrace{h_1}_{\text{local derivative}} \cdot \underbrace{\frac{\partial \text{loss}}{\partial g_1}}_{\text{from output → loss}} = \underbrace{h_1}_{\text{input}} \cdot \underbrace{2 \cdot error}_{\text{from loss}}

Now click Gradient for w₁ (layer 1) — w₁ is the weight on the x₁→h₁ connection. The chain starts the same way (one connection), but then h₁’s output feeds into every neuron in layer 2 — g₁, g₂, and g₃. A change in w₁ ripples through all of them before reaching the output and the loss. The gradient must sum the contributions from all three paths:

\frac{\partial \text{loss}}{\partial w_1} = x_1 \cdot \text{relu}'(z_1) \cdot \underbrace{\left(v_1 \cdot \frac{\partial \text{loss}}{\partial g_1} + v_2 \cdot \frac{\partial \text{loss}}{\partial g_2} + v_3 \cdot \frac{\partial \text{loss}}{\partial g_3}\right)}_{\text{sum over all paths through layer 2}}

Expanding each ∂loss/∂gᵢ to 2 · error (same as before — the loss is just error²):

= x_1 \cdot \text{relu}'(z_1) \cdot \left(v_1 \cdot 2 \cdot error + v_2 \cdot 2 \cdot error + v_3 \cdot 2 \cdot error\right)

So the chain is not just longer (more layers to pass through) but also wider (more paths to sum over at each layer). Deeper networks mean longer chains; wider layers mean more paths per chain.

It’s the same chain rule principle we used with f'_1 · f'_2 · f'_3. The output neuron computes ŷ = v₁·h₁ + v₂·h₂ + b. When we ask “how does ŷ change if h₁ changes?”, the local derivative is v₁ — just like in our single-neuron model where the derivative of w·x + b with respect to x was w. So v₁ appears not as a parameter being optimized, but as the local derivative of the function connecting h₁ to ŷ. Each link in the chain contributes its local derivative, and we multiply them all — one of those derivatives just happens to be another layer’s weight. This is the key difference from our single-layer model: earlier layers’ gradients must pass through all later layers’ weights.

Once we have all the gradients, the update is the same as before — subtract lr × gradient from each weight:

# Output layer (short chain)
W2 = W2 - lr * dW2          # 2 weights

# Hidden layer (long chain — gradients passed through W2)
W1 = W1 - lr * dW1          # 4 weights (2×2 matrix)

Here’s the gradient computation for a weight in each layer, side by side:

# Gradient for v₁ (output layer) — short chain
dv1 = h1 * 2 * error
#     ↑    ↑
#     │    └── from loss
#     └── local derivative: ∂ŷ/∂v₁ = h₁

# Gradient for w₁₁ (hidden layer) — longer chain
dw11 = x1 * relu_deriv(z1) * v1 * 2 * error
#      ↑    ↑                 ↑    ↑
#      │    │                 │    └── from loss
#      │    │                 └── passes through output layer weight
#      │    └── activation derivative
#      └── local derivative: ∂z₁/∂w₁₁ = x₁

The output layer weight v₁ has 2 terms in its chain. The hidden layer weight w₁₁ has 4 — it must pass through the activation function and the output layer’s weight v₁ to reach the loss. Each layer you add extends every earlier layer’s chain by one more multiplication. With 100 layers, the first layer’s gradient is a product of over 100 terms. What happens when you multiply that many numbers together?

The vanishing gradient problem

If each layer’s local derivative is 0.5, then after 100 layers the gradient is 0.5^100 — a number so small it’s effectively zero. The first layers get no useful gradient signal. They don’t learn. This is the vanishing gradient problem, and it plagued deep networks for decades.

The reverse is just as bad: if local derivatives are greater than 1, the gradient explodes — growing so large that updates become wildly unstable.
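The arithmetic is easy to verify (0.5 and 1.5 are just example values for the per-layer local derivative):

```python
# Gradient reaching layer 1 after backpropagating through 100 layers,
# if every layer contributes the same local derivative
vanished = 0.5 ** 100
exploded = 1.5 ** 100
print(vanished)   # ≈ 7.9e-31: effectively zero
print(exploded)   # ≈ 4.1e17: astronomically large
```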

The widget below lets you see this in action. Drag the local derivative below 1 and watch the gradient fade to nothing as it flows backwards through the layers. Then try values above 1 and watch it explode:

[Interactive widget: gradient magnitude flowing backward through the layers, with layer count (8) and local derivative (0.50) sliders]

This is why deep learning was stuck for a long time — with sigmoid activations (whose derivatives are always < 1), gradients vanished in networks deeper than a few layers. The breakthroughs that fixed this include:

  • ReLU activation — its derivative is either 0 or 1, so gradients don’t shrink as they pass through
  • Residual connections (skip connections) — give the gradient a shortcut past layers, so it doesn’t have to multiply through every single one
  • Batch normalization — keeps the values flowing through each layer in a well-behaved range, preventing derivatives from consistently shrinking or growing

These are all ways of keeping local derivatives close to 1, so the chain rule product doesn’t vanish or explode even across hundreds of layers.