The effect of hyperparameters on training

In the previous article, we trained a neural network on the MNIST dataset — taking every concept from the theory article (softmax, cross-entropy, backpropagation, gradient descent) and applying it to classify handwritten digits. We built the network from scratch in NumPy, replicated it in Keras, and reached 97% test accuracy with a simple 2-layer architecture and default settings: one hidden layer of 128 neurons, learning rate 0.1, batch size 32.

Every one of those choices — learning rate, batch size, number of layers, activation function — is a hyperparameter: a setting chosen by you, not learned by the model. The goal of this article is to show how much these choices matter — a wrong learning rate can be the difference between 97% accuracy and complete failure — and to give practical recommendations for each one. But these recommendations are just starting points, not answers. The right hyperparameters are discovered through experimentation — change one thing, observe the effect, adjust — and depend on your specific model, dataset, and task.

Each experiment below has an interactive chart. You can click the labels to show or hide individual runs, switch between loss and accuracy views, toggle “show baseline” to include the initial random-weights measurement (epoch 0), and expand the “Computation log” section to see the raw training output. The charts show training metrics by default — our analysis is based on training values — but you can click “val” to overlay validation metrics for comparison. Code blocks with a marimo icon (visible on hover) link to an interactive notebook where you can run the experiment yourself and modify the code.

We’ll use the same dataset and model setup from the previous article:

import keras
import numpy as np

# Training: 60000 images, Test: 10000 images
(train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()

# Flatten 28×28 → 784 and normalize to [0, 1]
X_train = train_images.reshape(-1, 784).astype("float32") / 255.0
X_test = test_images.reshape(-1, 784).astype("float32") / 255.0

y_train = train_labels
y_test = test_labels

Learning rate

When we were looking at how networks learn, we saw how the learning rate controls the step size of gradient descent. Too small and training goes very slowly; too large and it overshoots the minimum, making it harder to converge toward a low loss. We demonstrated this on a 2-parameter model with an interactive widget. Now let’s see the exact same phenomenon on a real network with 100,000+ parameters.
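The overshoot effect is easy to reproduce on a toy problem before touching the real network. A minimal sketch (a 1-parameter model, not the MNIST network) minimizing f(w) = w², whose gradient is 2w:

```python
# Gradient descent on f(w) = w^2, gradient 2w.
# Each update w <- w - lr * 2w multiplies w by (1 - 2*lr):
# |1 - 2*lr| < 1 converges, |1 - 2*lr| > 1 diverges.
def descend(lr, steps=20, w=1.0):
    for _ in range(steps):
        w = w - lr * 2 * w  # one gradient step
    return w

print(descend(0.01))  # too small: w shrinks, but slowly
print(descend(0.4))   # well chosen: w is essentially 0
print(descend(1.1))   # too large: each step overshoots and w blows up
```

The same three regimes — too slow, just right, divergent — are exactly what we'll see next at scale.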

We’ll train the same model five times, changing only the learning rate:

for lr in [0.001, 0.01, 0.1, 1.0, 10.0]:
    model = keras.Sequential([
        keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        keras.layers.Dense(10, activation="softmax"),
    ])

    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=lr),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

    model.fit(X_train, y_train, epochs=10, batch_size=32,
              validation_split=0.2, verbose=2)

Let’s see how the loss evolves per epoch for each learning rate:

Loss by Learning Rate
Computation log
 

The first thing to notice with the “show baseline” toggle on: all five runs start at the same place — a random network with ~2.3 loss. This makes sense: with 10 classes and random weights, the model assigns roughly equal probability to each class — about 0.1 per class. The cross-entropy loss for a correct class with probability 0.1 is −log(0.1) ≈ 2.3. This is the baseline — the loss of a model that has learned nothing.
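That baseline is just arithmetic — the cross-entropy of a uniform 10-way prediction:

```python
import math

num_classes = 10
p_correct = 1 / num_classes           # random weights -> roughly uniform probabilities
baseline_loss = -math.log(p_correct)  # cross-entropy assigned to the true class
print(round(baseline_loss, 3))        # -> 2.303
```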

Next, look at lr = 10.0 — complete failure. The loss spikes to 12.2 on the first epoch as the weight updates overshoot so far that the weights blow up, then stays stuck around 2.5 for the remaining epochs. The model outputs the same prediction for every input — training accuracy is ~10% (random guessing) and never improves. This is what happens when the learning rate is so large that gradient descent can’t make any progress at all.

Let’s toggle off lr=10.0 by clicking its label — because its spike to 12.2 compresses the y-axis and makes it hard to see the detail in the other curves. With it hidden, the remaining four runs tell a clearer story:

  • lr = 0.001 — the loss drops, but painfully slowly. After 10 epochs it’s still at 0.41 — higher than where lr=0.1 was after just 1 epoch (0.33). The gradients point in the right direction, but each step is so tiny that the model barely moves. It would need many more epochs to catch up.
  • lr = 0.01 — steady progress. The loss drops smoothly from 2.3 down to 0.18 by epoch 10, and the curve is still going down — the model is clearly still improving. Given more epochs, it would reach the same loss as lr=0.1. Smaller steps don’t limit where you end up, they just take longer to get there.
  • lr = 0.1 — the sweet spot for this network. The loss drops sharply to 0.03 by epoch 10, with the curve flattening as it approaches the minimum. Most of the learning happens in the first 2–3 epochs.
  • lr = 1.0 — the loss drops fast initially but can’t settle down. It bounces around — 0.20, then 0.25, then 0.21, then 0.24 — instead of decreasing smoothly. Switch to the accuracy view and you’ll see validation accuracy actually drops in the last few epochs. The overshooting doesn’t prevent learning, but it prevents the model from fine-tuning to a good minimum.

In practice, tuning the learning rate is a more intricate process than just picking a number — it’s tied to the choice of optimizer, and plain SGD with a hand-tuned learning rate is rarely used today. Adaptive optimizers like Adam adjust the step size per-parameter and maintain momentum, making them much less sensitive to the initial learning rate choice. We explore how SGD noise helps escape local minima, how Adam’s adaptive scaling works, and why learning rate schedules and warmup are standard in modern training — with interactive demos — in the next article.

Batch size

When we were looking at how neural networks learn, we compared mini-batches of 2 with full-batch gradient descent on 5 data points. The mini-batch version was noisier (the loss zigzagged) but converged with less total computation. Now let’s see how batch size affects training on a real dataset — we’ll train the same model with batch sizes of 1, 32, 256, and 60,000 — the entire training set at once:

for bs in [1, 32, 256, 60000]:
    model = keras.Sequential([
        keras.layers.Dense(128, activation="relu", input_shape=(784,)),
        keras.layers.Dense(10, activation="softmax"),
    ])

    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=0.1),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

    model.fit(X_train, y_train, epochs=10, batch_size=bs,
              validation_split=0.2, verbose=2)

Let’s see how the loss evolves per epoch for each batch size:

Loss by Batch Size
Computation log
 

The two middle settings — bs=32 and bs=256 — both work well, with bs=32 converging faster:

  • bs = 32 — loss drops sharply from 2.4 to 0.03 by epoch 10. With 1,500 batches per epoch, the model gets frequent updates with gradients that are noisy but roughly correct.
  • bs = 256 — slower but steady. The loss drops to 0.17 after 10 epochs and the curve is still going down. Each epoch has only 188 batches (vs 1,500 for bs=32), so fewer updates — but each update is based on a more reliable gradient average.

The two extremes tell a more interesting story:

  • bs = 1 — surprisingly bad with lr=0.1. The loss barely decreases and oscillates wildly between epochs, because each update is based on a single image, so the gradient reflects the quirks of that one example rather than the overall dataset. With bs=32, these individual differences average out into a reasonable estimate. With bs=1, every update is pulled in a random direction, and lr=0.1 makes each of those random steps large enough to undo previous progress.
  • bs = 60000 — full batch. The loss drops painfully slowly — from 2.4 to 1.6 after 10 epochs, far worse than bs=32’s 0.03 over the same period. This might seem counterintuitive — shouldn’t a perfect gradient computed over all images be better than a noisy estimate from 32? In terms of direction, yes. But the total distance the model travels through the loss landscape matters too: bs=32 takes 15,000 steps in 10 epochs, while bs=60000 takes a single step per epoch and would need 1,000 epochs to accumulate 1,000 steps. With only 1 update per epoch, 10 epochs means just 10 gradient steps total. The perfect gradient can’t compensate for having so few opportunities to update.
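The step-count arithmetic is worth making explicit. A quick sketch, assuming the 48,000-image training split (60,000 images minus the 20% validation split):

```python
import math

train_size = 48_000  # 60,000 images minus the 20% validation split

for bs in [1, 32, 256, 60_000]:
    # Keras caps the batch at the available samples, so bs=60000 -> 1 batch
    updates_per_epoch = math.ceil(train_size / bs)
    print(f"bs={bs:>6}: {updates_per_epoch:>6} updates/epoch, "
          f"{updates_per_epoch * 10:>7} updates in 10 epochs")
```

This reproduces the numbers quoted above: 1,500 updates per epoch for bs=32, 188 for bs=256, and a single update per epoch for the full batch.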

In our experiments, batch size 32 gave the best results — and that’s generally a strong default. It balances gradient quality against update frequency. In practice, batch size is also constrained by GPU memory, since larger batches need more memory to hold the activations and gradients for all samples simultaneously. If your GPU has room, try 128 or 256 — training will be faster per epoch (fewer updates, but each update processes more data in parallel), though you may need to increase the learning rate to compensate. Batch size 1 is rarely used because it’s too noisy and can’t take advantage of GPU parallelism. Very large batches (thousands+) require careful tuning of the learning rate and are mainly used in distributed training across multiple GPUs.

One thing to note is that batch size and learning rate should be tuned together. Our lr=0.1 is tuned for bs=32 — it’s too large for bs=1 (causing the oscillations) and arguably too small for bs=60000 (which would benefit from a larger step). Switch bs=1 to lr=0.001 and it reaches a loss of 0.09 — almost matching bs=32. The noise is fine when each step is small enough that no single bad gradient can do much damage. Larger batches need larger learning rates to compensate for fewer updates per epoch. You can’t evaluate batch size without considering the learning rate.
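A common heuristic for co-tuning the two is the linear scaling rule: multiply the batch size by k, multiply the learning rate by k as well. A sketch of the rule, taking our bs=32 / lr=0.1 pairing as the (assumed) calibration point — treat it as a starting guess, not a guarantee:

```python
def scaled_lr(batch_size, base_lr=0.1, base_bs=32):
    """Linear scaling rule: learning rate grows proportionally with batch size."""
    return base_lr * batch_size / base_bs

for bs in [1, 32, 256]:
    print(f"bs={bs:>3} -> lr ~ {scaled_lr(bs):.4f}")
```

For bs=1 the rule suggests roughly 0.003 — the same ballpark as the 0.001 that tamed the oscillations above — and for bs=256 it suggests a step several times larger than 0.1.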

Network depth and width

Our baseline network has one hidden layer of 128 neurons. What happens when we go wider, deeper, or both? Let’s find out — we’ll train five variations with everything else held constant (lr=0.1, bs=32, 10 epochs):

configs = {
    "narrow (32)":  [32],
    "baseline (128)": [128],
    "wide (512)":   [512],
    "deep (2×128)": [128, 128],
    "deep (3×128)": [128, 128, 128],
}

for name, hidden_sizes in configs.items():
    model = keras.Sequential()
    model.add(keras.layers.Dense(hidden_sizes[0], activation="relu", input_shape=(784,)))
    for size in hidden_sizes[1:]:
        model.add(keras.layers.Dense(size, activation="relu"))
    model.add(keras.layers.Dense(10, activation="softmax"))

    model.compile(
        optimizer=keras.optimizers.SGD(learning_rate=0.1),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )

    model.fit(X_train, y_train, epochs=10, batch_size=32,
              validation_split=0.2, verbose=2)

Let’s see how the loss evolves for each architecture:

Loss by Architecture
Computation log
 

A few things stand out:

  • Width helps: “width” here means the number of neurons in a single hidden layer — going from 32 to 128 to 512 neurons steadily lowers the loss. A wider layer has more parameters to detect patterns. But 512 has 4× the parameters of 128 for only a small improvement in loss — diminishing returns.
  • Depth helps too: adding a second hidden layer (2×128) reaches a lower loss than the single-layer baseline, even though the total parameter count is similar. Deeper networks can learn hierarchical features — the first layer might detect edges, the second layer might combine edges into shapes.
  • More depth, more risk: with 3 layers, the final loss matches the baseline — and the validation loss is noticeably noisier. Deeper networks are harder to train — gradients have to flow through more layers, and the vanishing gradient problem starts to matter.

In general, for fully connected networks, start simple — one or two hidden layers is often enough. Add depth only if the simpler model’s loss plateaus. For width, 128–512 neurons per layer is a reasonable range for fully connected networks. The real gains in architecture come from using the right type of layer for the data: convolutional layers for images, recurrent layers or transformers for sequences. These specialized architectures are far more parameter-efficient than the fully connected layers we’re using here — a convolutional network can match our 512-neuron model’s loss with a fraction of the parameters.
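The parameter-count claims above can be checked by hand: a dense layer has inputs × outputs weights plus one bias per output. A sketch counting each architecture in the experiment:

```python
def dense_params(n_in, n_out):
    return n_in * n_out + n_out  # weight matrix + bias vector

def total_params(hidden_sizes, n_in=784, n_out=10):
    sizes = [n_in] + hidden_sizes + [n_out]
    return sum(dense_params(a, b) for a, b in zip(sizes, sizes[1:]))

for name, hidden in [("narrow (32)", [32]), ("baseline (128)", [128]),
                     ("wide (512)", [512]), ("deep (2x128)", [128, 128]),
                     ("deep (3x128)", [128, 128, 128])]:
    print(f"{name:15} {total_params(hidden):>8,} parameters")
```

This confirms the claims in the bullets: wide (512) has 407,050 parameters versus the baseline’s 101,770 — a 4× increase — while deep (2×128) lands at 118,282, close to the baseline.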

Activation functions: sigmoid vs ReLU

There are many activation functions to choose from — ReLU, sigmoid, tanh, Leaky ReLU, GELU, and others. We’ll focus on two here because the theory article made a specific prediction about them: sigmoid should struggle in deep networks because its derivative is always less than 1 (squashing gradients as they flow backward through layers), while ReLU should not (its derivative is 0 or 1, letting gradients pass through unchanged). Let’s test that prediction.

In Keras, switching activation functions is just changing a string — "relu" vs "sigmoid". We’ll train networks with 1, 3, and 5 hidden layers to see how depth interacts with the choice of activation:

for n_layers in [1, 3, 5]:
    for activation in ["relu", "sigmoid"]:
        model = keras.Sequential()
        model.add(keras.layers.Dense(128, activation=activation, input_shape=(784,)))
        for _ in range(n_layers - 1):
            model.add(keras.layers.Dense(128, activation=activation))
        model.add(keras.layers.Dense(10, activation="softmax"))

        model.compile(
            optimizer=keras.optimizers.SGD(learning_rate=0.1),
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"],
        )

        model.fit(X_train, y_train, epochs=10, batch_size=32,
                  validation_split=0.2, verbose=2)

Let’s see how the loss evolves for each combination of depth and activation function:

Loss by Activation Function
Computation log
 

The results confirm the theory article’s prediction — and the effect is even more dramatic than expected:

  • 1 hidden layer — both activations work well. ReLU’s loss drops to 0.03 by epoch 10, sigmoid’s to 0.15. With only one layer, the gradient passes through a single activation, so sigmoid’s gradient shrinking doesn’t matter much.
  • 3 hidden layers — sigmoid starts to fall behind. ReLU’s loss reaches 0.016, sigmoid’s only gets to 0.14 — nearly 10x higher. Sigmoid is noticeably slower in the first few epochs — its loss at epoch 3 is still higher than where ReLU was after epoch 1.
  • 5 hidden layers — sigmoid completely fails. The loss barely moves from its initial ~2.3 value across all 10 epochs — the model has learned essentially nothing. Meanwhile, ReLU with 5 layers reaches a loss of 0.024 — virtually identical to 1 and 3 layers.

This is the vanishing gradient problem in action. The sigmoid derivative is always less than 1 (its maximum is 0.25, at z = 0), so after 5 layers of multiplication the gradient reaching the first layer is at most 0.25⁵ ≈ 0.001 times the original signal. The first layers can’t learn because the gradient is too small to move the weights.

ReLU doesn’t have this problem — its derivative is either 0 or 1, so gradients pass through unchanged (for active neurons). This is why all three ReLU networks perform nearly identically regardless of depth.
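The multiplication is easy to simulate — a sketch comparing the best-case per-layer gradient factor for each activation across 5 layers:

```python
import math

def sigmoid_deriv(z):
    s = 1 / (1 + math.exp(-z))
    return s * (1 - s)  # maximized at z = 0, where it equals 0.25

depth = 5
sigmoid_factor = sigmoid_deriv(0.0) ** depth  # best case: 0.25^5
relu_factor = 1.0 ** depth                    # active ReLU neurons: derivative is 1

print(f"sigmoid, {depth} layers: gradient scaled by at most {sigmoid_factor:.6f}")
print(f"relu,    {depth} layers: gradient scaled by {relu_factor:.1f} (active path)")
```

Even in the best case, the gradient reaching the first sigmoid layer carries less than a thousandth of its original magnitude; in practice most pre-activations are away from 0, so the real factor is smaller still.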

Today, ReLU is the default activation for most feed-forward and convolutional networks. If you encounter “dying ReLU” problems (neurons that output zero for all inputs and stop learning), try Leaky ReLU or ELU — variants that allow a small gradient when the input is negative. For transformer architectures, GELU has become the standard. Sigmoid and tanh are still used in specific places — sigmoid for binary outputs (e.g. the final layer of a binary classifier), tanh for gating mechanisms in LSTMs — but not as general-purpose hidden layer activations.
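The variants mentioned above differ only in how they treat negative inputs. A sketch of ReLU versus Leaky ReLU — the 0.01 negative slope used here is a common default, assumed for illustration:

```python
def relu(z):
    return max(z, 0.0)  # zero output and zero gradient for z < 0

def leaky_relu(z, alpha=0.01):
    # a small negative slope keeps a nonzero gradient for z < 0,
    # so a "dead" neuron can still receive weight updates
    return z if z > 0 else alpha * z

for z in [-2.0, -0.5, 0.0, 1.5]:
    print(f"z={z:+.1f}  relu={relu(z):+.2f}  leaky={leaky_relu(z):+.3f}")
```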

Number of epochs

In all the experiments above, we trained for 10 epochs. But how many epochs should you actually train for? Let’s find out by training our best configuration (lr=0.1, batch size 32) for 50 epochs and watching what happens:

model = keras.Sequential([
    keras.layers.Dense(128, activation="relu", input_shape=(784,)),
    keras.layers.Dense(10, activation="softmax"),
])

model.compile(
    optimizer=keras.optimizers.SGD(learning_rate=0.1),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)

history = model.fit(X_train, y_train, epochs=50, batch_size=32,
                    validation_split=0.2, verbose=2)

The training loss keeps dropping across all 50 epochs — from 0.33 at epoch 1 down to 0.0015 at epoch 50. Training accuracy hits 100% by epoch 29. The model has memorized every single training image perfectly.

But the validation loss tells a different story: it improves from 0.19 at epoch 1 to 0.075 around epoch 13, then stops improving and starts slowly increasing — 0.078 at epoch 20, 0.081 at epoch 30, 0.085 at epoch 50. Meanwhile, validation accuracy plateaus around 98% from epoch 13 onward and barely moves for the remaining 37 epochs.

This is overfitting — the model has started memorizing the training data instead of learning patterns that generalize. After a certain point, more training makes the model worse at predicting new data, even as it gets better at the data it’s already seen.

The number of epochs is a hyperparameter like any other, but it has a unique property: too few and the model hasn’t learned enough (underfitting), too many and it memorizes the training data (overfitting). Unlike learning rate or batch size, where there’s a sweet spot you can find and stick with, the right number of epochs depends on the model, the data, and all your other hyperparameters.

In general, rather than guessing the right number of epochs, use early stopping — a Keras callback that monitors validation loss and stops training automatically when it stops improving:

early_stop = keras.callbacks.EarlyStopping(
    monitor="val_loss",
    patience=5,
    restore_best_weights=True,
)

model.fit(X_train, y_train, epochs=100, batch_size=32,
          validation_split=0.2,
          callbacks=[early_stop])

Set a generous upper bound for epochs (like 100) and let early stopping find the right point. The patience=5 parameter means training stops if validation loss hasn’t improved for 5 consecutive epochs, and restore_best_weights=True rolls the model back to the epoch with the best validation loss. This way you don’t need to tune the number of epochs at all — the data tells you when to stop.
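Under the hood, the patience rule is simple. A plain-Python sketch of the same logic (not Keras’s implementation), applied to a made-up validation-loss curve shaped like the one from our 50-epoch run:

```python
def early_stop_epoch(val_losses, patience=5):
    """Return (epoch training stops at, best epoch to roll back to), 1-indexed."""
    best_loss, best_epoch, waited = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0  # new best: reset counter
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch  # stop; restore_best_weights -> best_epoch
    return len(val_losses), best_epoch

# Made-up curve: improves, bottoms out at epoch 5, then creeps back up
losses = [0.19, 0.12, 0.09, 0.08, 0.075, 0.078, 0.081, 0.080, 0.083, 0.085, 0.09]
print(early_stop_epoch(losses))  # -> (10, 5): stops 5 epochs after the epoch-5 best
```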

We’ll explore overfitting and early stopping in more depth in upcoming articles.

In this article we tuned hyperparameters manually — changing one at a time and observing the effect. This builds intuition, but it doesn’t scale. When you have dozens of hyperparameters and thousands of possible combinations, you need tools that automate the search.

Weights & Biases Sweeps can run grid search, random search, or Bayesian optimization across learning rates, batch sizes, architectures, and any other hyperparameter — automatically tracking every run and comparing results. Keras Tuner offers similar functionality integrated directly into Keras. These tools become valuable when manual experimentation doesn’t scale — when you need to search a large space efficiently and keep track of what you’ve tried.