Understanding optimizers

In the hyperparameter tuning article, we saw how the learning rate controls training — too small and progress is painfully slow, too large and the model can’t settle into a good minimum. We used plain SGD (stochastic gradient descent) for all those experiments because it’s the simplest optimizer: compute the gradient, multiply by the learning rate, subtract from the weights.
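
In code, that update is one line per parameter tensor. Here's a minimal NumPy sketch, with w and grad standing in for a weight array and its mini-batch gradient:

import numpy as np

def sgd_step(w, grad, lr):
    """Plain SGD: move each weight against its gradient, scaled by the learning rate."""
    return w - lr * grad

w = np.array([0.5, -1.2])        # toy weights
grad = np.array([0.1, -0.4])     # gradient of the loss with respect to those weights
w = sgd_step(w, grad, lr=0.1)    # -> array([ 0.49, -1.16])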

But plain SGD has limitations. This article explores what those limitations are and how modern optimizers like Adam address them — with interactive demos that let you see the difference.

The local minima problem

Given enough epochs, a smaller learning rate will often reach the same loss as a larger one, but not always: a very small learning rate can get trapped in shallow local minima or saddle points, where the gradient produces a step too small to escape. A larger learning rate overshoots these shallow valleys, effectively jumping past them and landing in a deeper, potentially better minimum.

But it’s not the learning rate alone that enables escape. In practice, we use mini-batch SGD — each batch gives a slightly different gradient estimate, adding random noise to the updates. A larger learning rate amplifies this noise into bigger perturbations that can kick the optimizer out of shallow valleys.

The widget below demonstrates this on a simple 1D loss landscape with two minima — a shallow local minimum and a deeper global minimum. Both dots receive the same noisy gradient (simulating mini-batch SGD). The small learning rate (red) absorbs the noise and settles into the first valley. The large learning rate (green) amplifies the same noise into bigger steps that can kick it over the bump. Click “Step ×10” a few times — the result is random each time, but the large lr escapes much more often:

Escaping local minima with SGD noise
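
If you'd rather poke at the same idea in code, here's a rough standalone simulation in NumPy. The landscape and the numbers are invented for illustration (they aren't the widget's), but the qualitative behaviour matches: the small learning rate almost always settles in the first valley it finds, while the large one escapes far more often:

import numpy as np

# Invented 1D landscape: a shallow valley near x ≈ 1.8, a deeper one near x ≈ -2.2,
# separated by a bump near x ≈ 0.4. Loss: 0.05x^4 - 0.4x^2 + 0.3x; below is its gradient.
def grad(x):
    return 0.2 * x**3 - 0.8 * x + 0.3

def escapes(lr, rng, steps=300, noise=1.5, x0=3.0):
    """Run noisy SGD from x0; return True if it ends up in the deeper (left) valley."""
    x = x0
    for _ in range(steps):
        g = grad(x) + noise * rng.normal()  # noisy mini-batch gradient estimate
        x -= lr * g
    return x < 0.4

rng = np.random.default_rng(0)
for lr in (0.05, 0.4):
    rate = np.mean([escapes(lr, rng) for _ in range(200)])
    print(f"lr={lr}: escaped the shallow valley in {rate:.0%} of runs")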

This is why the best strategy is often to start with a larger learning rate (to explore the loss landscape broadly) and decay it during training (to fine-tune once the model is in a good region).

SGD noise as implicit regularization

There’s a subtler effect: the noise in mini-batch gradients can actually help generalization. The random perturbations prevent the model from settling into sharp, narrow minima and push it toward flatter minima that generalize better. This is a well-studied phenomenon called the implicit regularization of SGD — the noise acts as a form of built-in regularization that you get for free just by using small batches.

This is one reason why a small batch size like 32 can outperform full-batch gradient descent even when both are given the same total number of gradient steps — the noise isn’t just an imperfection to tolerate, it’s a feature.

From SGD to Adam

SGD uses the same learning rate for every parameter in the network. If the gradient for one weight is consistently large and for another is tiny, they both get scaled by the same lr. This is a problem — different parameters in a neural network can have very different gradient magnitudes, and a single learning rate can’t be optimal for all of them simultaneously.

Adam (Adaptive Moment Estimation) addresses this with two key ideas:

Per-parameter adaptive learning rates: Adam tracks a running average of each parameter’s squared gradients (the second moment). Parameters with large, consistent gradients get their effective step size reduced — they’re already moving fast. Parameters with small or noisy gradients get a larger effective step. This adaptation means Adam is much less sensitive to the initial learning rate choice.

Momentum: Adam also maintains a running average of past gradients (the first moment). This acts like momentum — if the gradient has been consistently pointing in one direction, the optimizer builds up speed in that direction. This helps push through flat regions and shallow local minima where the current gradient alone would be too small to make progress.
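
Putting the two ideas together, here's a minimal NumPy sketch of the standard Adam update for one parameter array, using the usual default hyperparameters. The moment estimates m and v start as zero arrays the same shape as w, and t counts update steps starting at 1:

import numpy as np

def adam_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: momentum (m) plus per-parameter scaling by the squared-gradient average (v)."""
    m = beta1 * m + (1 - beta1) * grad            # first moment: running average of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2       # second moment: running average of squared gradients
    m_hat = m / (1 - beta1 ** t)                  # bias correction for the zero-initialized averages
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)   # large v_hat -> smaller effective step, and vice versa
    return w, m, v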

Here’s the same landscape, comparing SGD lr=1.0 (which can escape the local minimum but oscillates wildly) with Adam lr=0.1. Notice how Adam reaches the global minimum more smoothly — its momentum builds up in the consistent downhill direction while the adaptive scaling dampens the noisy oscillations:

SGD vs Adam

Learning rate schedules and warmup

Even with Adam, the learning rate still matters. A common strategy is to change it during training:

Learning rate decay: start with a larger learning rate to explore broadly, then reduce it to fine-tune. Keras provides built-in schedules:

lr_schedule = keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=1000,
    decay_rate=0.9,
)
model.compile(optimizer=keras.optimizers.Adam(learning_rate=lr_schedule), ...)
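
One handy property of these schedule objects is that you can call them directly with a step number to see the learning rate they would produce at that point in training:

print(float(lr_schedule(0)))     # 0.01
print(float(lr_schedule(1000)))  # 0.009  (one decay period: 0.01 * 0.9)
print(float(lr_schedule(2000)))  # 0.0081 (two decay periods: 0.01 * 0.9^2)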

Warmup: start with a very small learning rate, ramp up to the target over the first few hundred steps, then decay. The ramp-up keeps the earliest updates small while the weights are still close to their random initialization (a region where a large lr can immediately cause instability), and the decay phase lets the model settle. Warmup is especially common when training transformers and large models.
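
If you want warmup without any extra machinery, one option is a custom schedule that ramps linearly and then decays. Here's a rough sketch, assuming Keras 3 (where keras.ops provides backend-agnostic math); the step counts and rates are placeholder values, not recommendations:

import keras
from keras import ops

class WarmupThenDecay(keras.optimizers.schedules.LearningRateSchedule):
    """Linear warmup to peak_lr over warmup_steps, then exponential decay (illustrative)."""

    def __init__(self, peak_lr=1e-3, warmup_steps=500, decay_steps=1000, decay_rate=0.9):
        super().__init__()
        self.peak_lr = peak_lr
        self.warmup_steps = warmup_steps
        self.decay_steps = decay_steps
        self.decay_rate = decay_rate

    def __call__(self, step):
        step = ops.cast(step, "float32")
        # Ramp linearly from 0 to peak_lr over the first warmup_steps updates...
        warmup_lr = self.peak_lr * step / self.warmup_steps
        # ...then decay exponentially from peak_lr.
        decayed_lr = self.peak_lr * self.decay_rate ** ((step - self.warmup_steps) / self.decay_steps)
        return ops.where(step < self.warmup_steps, warmup_lr, decayed_lr)

optimizer = keras.optimizers.Adam(learning_rate=WarmupThenDecay())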

In practice

For most projects, start with Adam and learning_rate=0.001. If training is unstable, lower it by 3–10x. If it’s too slow, raise it. This single default works reasonably well across a wide range of problems — which is why Adam is the default choice for most practitioners.

SGD with momentum is still used in some settings (notably large-scale image classification) where practitioners have the time to carefully tune the learning rate and schedule. But for getting started, Adam is the pragmatic choice.
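
For reference, both options are one-liners in Keras; the SGD numbers below are illustrative placeholders, exactly the kind of thing you'd expect to tune:

import keras

# The pragmatic default: Adam with its standard learning rate.
optimizer = keras.optimizers.Adam(learning_rate=0.001)

# The hand-tuned alternative: SGD with momentum, usually paired with a decay schedule.
optimizer = keras.optimizers.SGD(learning_rate=0.1, momentum=0.9, nesterov=True)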