
What is Gradient Descent?
Gradient descent is the core optimization algorithm behind most machine learning training. It iteratively adjusts model parameters in the direction that most reduces the loss function, like walking downhill through a landscape of errors toward the lowest point.
Why It Matters
Gradient descent is the workhorse of neural network training: every LLM, vision model, and deep learning system is trained with some variant of it. Understanding it explains why training demands so much compute and why hyperparameters like the learning rate matter.
How It Works
- Initialize weights randomly.
- Compute the loss — forward pass through the network, then calculate error.
- Compute gradients — use backpropagation to find the derivative of the loss with respect to each weight. The gradient points in the direction of steepest increase.
- Update weights — move each weight in the opposite direction of its gradient (downhill), scaled by the learning rate.
- Repeat until the loss converges.
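The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the toy data (which follows y = 2x), the learning rate, and the step count are all illustrative choices.

```python
# Fit a one-parameter model y = w * x to toy data with mean squared error.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # true relationship: y = 2x

w = 0.0      # step 1: initialize the weight (zero here for clarity)
lr = 0.01    # learning rate: scales each update

for step in range(1000):
    # step 2: forward pass, then the loss (mean squared error)
    preds = [w * x for x in xs]
    loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)

    # step 3: gradient of the loss with respect to w
    # d/dw mean((w*x - y)^2) = mean(2 * (w*x - y) * x)
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

    # step 4: move opposite the gradient, scaled by the learning rate
    w -= lr * grad

print(round(w, 3))  # converges toward 2.0
```

In a real network, step 3 is what backpropagation automates: instead of a hand-derived formula, the framework computes the gradient for every weight.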
Variants:
- Batch gradient descent — computes gradients over the entire dataset before each update. Precise but slow.
- Stochastic gradient descent (SGD) — computes gradients on a single random example per update. Fast but noisy.
- Mini-batch gradient descent — computes gradients on a small random subset of examples. The standard compromise in practice, and usually what "SGD" means in deep learning.
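The batch and stochastic update rules can be contrasted on the same toy data as above (all names, data, and hyperparameters here are illustrative):

```python
import random

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]   # true relationship: y = 2x
lr = 0.01

def batch_step(w):
    # One update uses the gradient averaged over the whole dataset.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    return w - lr * grad

def sgd_step(w, rng):
    # One update uses the gradient of a single randomly chosen example.
    i = rng.randrange(len(xs))
    grad = 2 * (w * xs[i] - ys[i]) * xs[i]
    return w - lr * grad

rng = random.Random(0)
w_batch = w_sgd = 0.0
for _ in range(500):
    w_batch = batch_step(w_batch)
    w_sgd = sgd_step(w_sgd, rng)
# Both approach w = 2, but the SGD trajectory is noisier step to step.
```

The trade-off is cost per update: the batch step touches every example, while the stochastic step touches one, so SGD makes many cheap, noisy updates in the time batch gradient descent makes one precise update.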