On Gradient Descent and Building Intuition for Optimisation
A clear walkthrough of gradient descent — the algorithm underpinning most of modern machine learning — with emphasis on intuition over notation.
Most introductions to gradient descent jump straight to partial derivatives. This is the wrong starting point. The notation obscures the idea, and the idea is remarkably simple.
The core intuition
You have a function that measures how wrong your model is. You want to make it less wrong. Gradient descent is the systematic way to do that: measure the slope, step downhill, repeat.
The function is your loss surface. The slope is the gradient. The step size is the learning rate. That’s the entire algorithm.
# Initialise, then repeatedly step downhill along the negative gradient.
parameters = initialise_randomly()
for step in range(num_steps):
    loss = compute_loss(parameters, data)
    gradient = compute_gradient(loss, parameters)
    parameters = parameters - learning_rate * gradient
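To make the loop concrete, here is a minimal runnable sketch that minimises a one-dimensional quadratic, f(x) = (x − 3)², whose gradient is 2(x − 3). The objective, starting point, and learning rate are illustrative choices, not prescribed by anything above.

```python
def compute_loss(x):
    # Illustrative objective: f(x) = (x - 3)^2, minimised at x = 3.
    return (x - 3) ** 2

def compute_gradient(x):
    # Its derivative: f'(x) = 2 * (x - 3).
    return 2 * (x - 3)

x = 0.0              # fixed start instead of a random initialisation
learning_rate = 0.1

for step in range(100):
    x = x - learning_rate * compute_gradient(x)
```

Each step multiplies the distance to the minimum by (1 − 2 · 0.1) = 0.8, so after 100 steps x sits essentially at 3 — the same contraction-toward-the-minimum behaviour the loop above describes in the abstract.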
Everything else in optimisation — Adam, momentum, learning rate schedules, weight decay — is a refinement of this loop.
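As one example of such a refinement, classical momentum keeps a running velocity that accumulates past gradients instead of stepping from the raw gradient alone. This is a sketch on an illustrative quadratic (f(x) = x², gradient 2x); the coefficients are typical but arbitrary choices, not taken from the text.

```python
def gradient(x):
    # Illustrative objective: f(x) = x^2, so f'(x) = 2x.
    return 2 * x

x = 5.0
velocity = 0.0
learning_rate, beta = 0.1, 0.9   # beta is the momentum coefficient

for step in range(200):
    # Decay the old velocity, add the new downhill direction, then move.
    velocity = beta * velocity - learning_rate * gradient(x)
    x = x + velocity
```

The only change from the plain loop is the `velocity` accumulator; everything else — loss, gradient, step — is the same machinery.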
Why the learning rate matters disproportionately
The learning rate is a single scalar, yet it has more impact on training dynamics than most architectural choices. Too large and the optimiser diverges, overshooting minima entirely. Too small and convergence takes orders of magnitude longer than necessary.
This is not a theoretical concern. In practice, learning rate selection (or scheduling) is often the difference between a model that trains in hours and one that never converges at all. Adaptive optimisers like Adam partially address this by maintaining per-parameter rates, but the base learning rate still sets the scale.
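A small experiment makes the asymmetry visible. The quadratic and the specific rates here are illustrative assumptions, but the qualitative behaviour — divergence, stagnation, fast convergence — is exactly the trichotomy described above.

```python
def run(learning_rate, steps=50):
    # Minimise f(x) = x^2 (gradient 2x) starting from x = 1.
    x = 1.0
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x

too_large  = run(1.5)    # per-step factor |1 - 3| = 2: the error doubles and diverges
too_small  = run(1e-4)   # per-step factor 0.9998: after 50 steps, barely moved
just_right = run(0.4)    # per-step factor 0.2: converges in a handful of steps
```

Three runs, identical code, and the single scalar decides the outcome — which is why the learning rate is usually the first hyperparameter worth tuning.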
Local minima are less problematic than commonly taught
The standard pedagogical worry — “what if you land in a local minimum?” — matters less than it appears. In the high-dimensional loss landscapes where neural networks operate, strict local minima are rare; saddle points are far more common, and modern optimisers with momentum handle them reasonably well.
The more practical concern is the quality of the minimum: flat, wide minima tend to generalise better than sharp, narrow ones. This is an active area of research, and the relationship between optimiser choice, learning rate, and generalisation remains an open question.
Takeaway
Gradient descent is the foundation. If you’re building intuition for machine learning, understanding this algorithm deeply — not just the formula, but the dynamics of convergence, the role of curvature, the sensitivity to hyperparameters — will pay dividends across everything you study afterward.