
What is Gradient Descent?
Gradient descent is an optimization algorithm used to train most machine learning models. It iteratively adjusts model parameters in the direction that reduces the loss function fastest, like walking downhill in a landscape of errors to find the lowest point.
Why It Matters
Gradient descent is the workhorse of nearly all neural network training. Virtually every LLM, vision model, and deep learning system is trained with some variant of it. Understanding it explains why training requires so much compute (every update needs a forward and a backward pass, repeated a very large number of times) and why hyperparameters like the learning rate matter.
How It Works
- Initialize weights randomly.
- Compute the loss — forward pass through the network, then calculate error.
- Compute gradients — use backpropagation to find the derivative of the loss with respect to each weight. The gradient points in the direction of steepest increase.
- Update weights — move each weight in the opposite direction of its gradient (downhill), scaled by the learning rate.
- Repeat until the loss converges.
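The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not how a real framework does it: it assumes a one-feature linear model y = w*x + b with mean squared error as the loss, and the gradients are written out by hand rather than computed by backpropagation. The function name `train_linear`, the toy data, and the default learning rate and step count are all hypothetical choices made for the example.

```python
import random

def train_linear(xs, ys, learning_rate=0.05, steps=2000, seed=0):
    """Fit y ~ w*x + b by gradient descent on mean squared error."""
    rng = random.Random(seed)
    # Step 1: initialize weights randomly.
    w = rng.uniform(-1.0, 1.0)
    b = rng.uniform(-1.0, 1.0)
    n = len(xs)
    for _ in range(steps):
        # Step 2: forward pass, then the per-example prediction errors.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Step 3: gradients of the mean squared error w.r.t. w and b.
        grad_w = (2.0 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2.0 / n) * sum(errors)
        # Step 4: move against the gradient, scaled by the learning rate.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
        # Step 5: repeat. A fixed step count stands in for a convergence check.
    loss = sum(((w * x + b) - y) ** 2 for x, y in zip(xs, ys)) / n
    return w, b, loss

# Toy data generated from y = 2x + 1 (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b, loss = train_linear(xs, ys)
print(f"w={w:.4f}, b={b:.4f}, loss={loss:.2e}")
```

On this noiseless data the learned `w` and `b` should end up very close to 2 and 1. In a neural network the only conceptual change is that step 3 is done by backpropagation over many more parameters.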
Variants:
- Batch gradient descent — compute gradients over the entire dataset. Precise but slow.
- Stochastic Gradient Descent (SGD) — compute gradients on a single random example. Fast but noisy.
- Mini-batch SGD — compute gradients on small batches (e.g., 32 or 64 examples). The practical standard.
- Adam — adaptive learning rate optimizer that's the default for most deep learning. Combines momentum with per-parameter learning rate adjustment.
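The first three variants differ only in how many examples are used to estimate the gradient for each update, which a single batch-size parameter can express. The sketch below assumes the same kind of toy linear model and mean squared error loss; `fit_line` and its defaults are hypothetical names and values chosen for illustration. Adam is not shown, since it changes the update rule rather than the batch size.

```python
import random

def fit_line(xs, ys, batch_size, learning_rate=0.02, steps=3000, seed=0):
    """Fit y ~ w*x + b; batch_size selects the gradient descent variant."""
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)
    b = rng.uniform(-1.0, 1.0)
    indices = range(len(xs))
    for _ in range(steps):
        # Draw the examples used to estimate the gradient for this update.
        batch = rng.sample(indices, batch_size)
        grad_w = 0.0
        grad_b = 0.0
        for i in batch:
            error = (w * xs[i] + b) - ys[i]
            grad_w += 2.0 * error * xs[i] / batch_size
            grad_b += 2.0 * error / batch_size
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (assumed for illustration).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]

results = {
    "batch": fit_line(xs, ys, batch_size=len(xs)),  # whole dataset per update
    "sgd": fit_line(xs, ys, batch_size=1),          # one random example per update
    "mini-batch": fit_line(xs, ys, batch_size=2),   # small random subset per update
}
for name, (w, b) in results.items():
    print(f"{name}: w={w:.4f}, b={b:.4f}")
```

Because the toy data has no noise, all three settings should reach roughly the same answer here. On real data the smaller batches give noisier gradient estimates but much cheaper updates, which is the trade-off the list describes.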
The learning rate is critical: too high and each update overshoots the minimum, so the loss oscillates or even diverges; too low and training takes far longer than necessary or stalls in flat regions and poor local minima.
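The effect of the learning rate is easy to see on a one-dimensional toy problem. The sketch below assumes the loss f(x) = x**2, whose gradient is 2*x and whose minimum is at x = 0; the function name `descend` and the three learning rates are illustrative choices. Since this loss has a single minimum, it shows overshooting and slow progress but not local minima.

```python
def descend(learning_rate, steps=50, x=1.0):
    """Run gradient descent on f(x) = x**2, whose gradient is 2*x."""
    for _ in range(steps):
        x -= learning_rate * 2.0 * x
    return x

too_high = descend(1.1)    # each step multiplies x by -1.2, so |x| keeps growing
reasonable = descend(0.1)  # each step multiplies x by 0.8, so x shrinks toward 0
too_low = descend(0.001)   # each step multiplies x by 0.998, so x barely moves

print(too_high, reasonable, too_low)
```

After the same 50 steps, the high rate has moved far away from the minimum, the moderate rate is very close to it, and the low rate is still near its starting point.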
Example
Imagine you're blindfolded on a hilly terrain trying to reach the lowest valley. You feel the slope under your feet (the gradient) and take a step downhill. Gradient descent does the same thing, but in a space with millions or even billions of dimensions (one per model parameter).