On Gradient Descent and Building Intuition for Optimisation
A clear walkthrough of gradient descent — the algorithm underpinning most of modern machine learning — with emphasis on intuition over notation.
Most introductions to gradient descent jump straight to partial derivatives. This is the wrong starting point. The notation obscures the idea, and the idea is remarkably simple.
The core intuition
You have a function that measures how wrong your model is. You want to make it less wrong. Gradient descent is the systematic way to do that: measure the slope, step downhill, repeat.
The function is your loss surface. The slope is the gradient. The step size is the learning rate. That’s the entire algorithm.
# Initialise, then repeatedly step downhill along the negative gradient.
parameters = initialise_randomly()
for step in range(num_steps):
    loss = compute_loss(parameters, data)
    gradient = compute_gradient(loss, parameters)
    parameters = parameters - learning_rate * gradient
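To make the loop concrete, here is a minimal runnable sketch that minimises a one-dimensional quadratic, f(x) = (x − 3)², whose gradient is 2(x − 3). The objective, starting point, and learning rate are illustrative choices, not prescribed by anything above.

```python
def compute_loss(x):
    # Illustrative objective: f(x) = (x - 3)^2, minimised at x = 3.
    return (x - 3) ** 2

def compute_gradient(x):
    # Its derivative: f'(x) = 2 * (x - 3).
    return 2 * (x - 3)

x = 0.0              # fixed start instead of a random initialisation
learning_rate = 0.1

for step in range(100):
    x = x - learning_rate * compute_gradient(x)
```

Each step multiplies the distance to the minimum by (1 − 2 · 0.1) = 0.8, so after 100 steps x sits essentially at 3 — the same contraction-toward-the-minimum behaviour the loop above describes in the abstract.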
Everything else in optimisation — Adam, momentum, learning rate schedules, weight decay — is a refinement of this loop.
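As one example of such a refinement, classical momentum keeps a running velocity that accumulates past gradients instead of stepping from the raw gradient alone. This is a sketch on an illustrative quadratic (f(x) = x², gradient 2x); the coefficients are typical but arbitrary choices, not taken from the text.

```python
def gradient(x):
    # Illustrative objective: f(x) = x^2, so f'(x) = 2x.
    return 2 * x

x = 5.0
velocity = 0.0
learning_rate, beta = 0.1, 0.9   # beta is the momentum coefficient

for step in range(200):
    # Decay the old velocity, add the new downhill direction, then move.
    velocity = beta * velocity - learning_rate * gradient(x)
    x = x + velocity
```

The only change from the plain loop is the `velocity` accumulator; everything else — loss, gradient, step — is the same machinery.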
Why the learning rate matters disproportionately
The learning rate is a single scalar, yet it has more impact on training dynamics than most architectural choices. Too large and the optimiser diverges, overshooting minima entirely. Too small and convergence takes orders of magnitude longer than necessary.
This is not a theoretical concern. In practice, learning rate selection (or scheduling) is often the difference between a model that trains in hours and one that never converges at all. Adaptive optimisers like Adam partially address this by maintaining per-parameter rates, but the base learning rate still sets the scale.
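A small experiment makes the asymmetry visible. The quadratic and the specific rates here are illustrative assumptions, but the qualitative behaviour — divergence, stagnation, fast convergence — is exactly the trichotomy described above.

```python
def run(learning_rate, steps=50):
    # Minimise f(x) = x^2 (gradient 2x) starting from x = 1.
    x = 1.0
    for _ in range(steps):
        x = x - learning_rate * 2 * x
    return x

too_large  = run(1.5)    # per-step factor |1 - 3| = 2: the error doubles and diverges
too_small  = run(1e-4)   # per-step factor 0.9998: after 50 steps, barely moved
just_right = run(0.4)    # per-step factor 0.2: converges in a handful of steps
```

Three runs, identical code, and the single scalar decides the outcome — which is why the learning rate is usually the first hyperparameter worth tuning.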
Local minima are less problematic than commonly taught
The standard pedagogical worry — “what if you land in a local minimum?” — matters less than it appears. In the high-dimensional loss landscapes where neural networks operate, strict local minima are rare; saddle points are far more common, and modern optimisers with momentum handle them reasonably well.
The more practical concern is the quality of the minimum: flat, wide minima tend to generalise better than sharp, narrow ones. This is an active area of research, and the relationship between optimiser choice, learning rate, and generalisation remains an open question.
Takeaway
Gradient descent is the foundation. If you’re building intuition for machine learning, understanding this algorithm deeply — not just the formula, but the dynamics of convergence, the role of curvature, the sensitivity to hyperparameters — will pay dividends across everything you study afterward.