
What are Batch Size and Learning Rate?
Batch size is the number of training examples processed together before the model's weights are updated. Learning rate is the step size of each update: how far the model moves its parameters in response to the gradient computed from a batch. Together, they are among the most influential hyperparameters controlling how a neural network learns.
Why It Matters
These two numbers can make or break model training. A learning rate that is too high makes the loss oscillate or diverge; one that is too low makes training crawl or stall. A batch size that is too small yields noisy gradient estimates; one that is too large spends extra compute per update for diminishing returns and may hurt generalization. Understanding these trade-offs is essential for anyone training or fine-tuning AI models.
How It Works
Learning rate:
- Controls the magnitude of weight updates: new_weight = old_weight - learning_rate × gradient
- Typical range: 1e-5 to 1e-2 (0.00001 to 0.01)
- Too high → training diverges, loss explodes
- Too low → training is very slow and may stall on a plateau or in a poor local minimum
- Just right → smooth convergence to a good solution
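The three regimes above can be seen on a toy problem. The following is a minimal sketch, assuming a one-parameter loss f(w) = w**2 (chosen only because its gradient, 2w, is easy to follow); the function name and the specific rates are illustrative and not recommendations for real models.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimize f(w) = w**2 with plain gradient descent; return the final w."""
    w = start
    for _ in range(steps):
        gradient = 2.0 * w                   # derivative of w**2
        w = w - learning_rate * gradient     # new_weight = old_weight - learning_rate * gradient
    return w

# On this toy loss each step multiplies w by (1 - 2 * learning_rate).
too_high = gradient_descent(1.1)     # factor -1.2: sign flips every step and |w| grows
too_low = gradient_descent(0.001)    # factor 0.998: w has barely moved after 20 steps
moderate = gradient_descent(0.3)     # factor 0.4: w shrinks quickly toward the minimum at 0
```

Real loss surfaces are far less regular, so the thresholds between these regimes have to be found empirically.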
Learning rate schedules:
- Warmup — start with a very low rate, gradually increase (prevents early instability)
- Cosine decay — gradually reduce the learning rate following a cosine curve
- Step decay — reduce by a factor at specific epochs
- Warmup + cosine — the standard for transformer training
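A warmup + cosine schedule can be written as a small function of the step number. This is a sketch under stated assumptions: linear warmup, a single cosine half-period with no restarts, and hypothetical parameter names (`peak_lr`, `min_lr`); deep learning frameworks ship their own schedulers whose details may differ.

```python
import math

def warmup_cosine_lr(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Learning rate at `step`: linear warmup, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Linear warmup: ramp from a small value up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(progress, 1.0)
    # Cosine factor goes from 1 (start of decay) to 0 (end of decay).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine

# Example values (assumed): 1000 steps, 100 of them warmup, peak rate 5e-5.
schedule = [
    warmup_cosine_lr(s, total_steps=1000, warmup_steps=100, peak_lr=5e-5)
    for s in range(1000)
]
```

The rate starts at a small fraction of the peak, reaches the peak at the end of warmup, and then falls smoothly toward `min_lr`.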
Batch size:
- Number of examples per gradient update step
- Small batches (8-32) — noisy gradients, good regularization, slower training
- Large batches (256-4096+) — smoother gradients, faster training, may need learning rate scaling
- Gradient accumulation — simulate large batches on limited GPU memory
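Gradient accumulation works because the gradient of a mean loss over a large batch equals the average of the gradients over equal-sized micro-batches. The sketch below checks this on a toy model y = w * x with mean squared error; the data, the model, and all names are assumptions made for illustration.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error for the model y = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(float(x), 3.0 * x) for x in range(1, 9)]   # 8 examples drawn from y = 3x
w = 0.5

# Gradient from one large batch of 8 examples.
full_batch_grad = mse_gradient(w, data)

# Same 8 examples as four micro-batches of 2, accumulated before one update.
micro_batch_size = 2
accumulation_steps = len(data) // micro_batch_size
accumulated_grad = 0.0
for i in range(accumulation_steps):
    micro_batch = data[i * micro_batch_size:(i + 1) * micro_batch_size]
    # Divide by the number of accumulation steps so the running sum is an average.
    accumulated_grad += mse_gradient(w, micro_batch) / accumulation_steps

learning_rate = 0.01
w_after_update = w - learning_rate * accumulated_grad   # single update, effective batch of 8
```

Only one micro-batch has to be held in memory at a time, at the cost of more forward and backward passes per weight update.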
The relationship:
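- Larger batch sizes often call for higher learning rates (the linear scaling rule)
- For plain SGD, the ratio learning_rate / batch_size sets the scale of gradient noise, so it often matters more than either value alone
- Popular heuristic: when doubling the batch size, also double the learning rate; this tends to break down at very large batch sizes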
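The linear scaling heuristic amounts to holding learning_rate / batch_size constant. A minimal sketch follows, with a hypothetical helper name and example values that are assumptions rather than tuned settings.

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: keep learning_rate / batch_size constant."""
    return base_lr * new_batch_size / base_batch_size

# Assumed starting point: a rate of 1e-3 that worked at batch size 256.
scaled = scale_learning_rate(base_lr=1e-3, base_batch_size=256, new_batch_size=512)

# The ratio is unchanged after scaling.
ratio_before = 1e-3 / 256
ratio_after = scaled / 512
```

The scaled value is best treated as a starting point for a short sweep, not a guaranteed optimum.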
Optimizers that adapt learning rate:
- Adam — maintains per-parameter adaptive learning rates (the default for transformers)
- AdamW — Adam with decoupled weight decay (standard for LLM training)
- SGD with momentum — simpler, sometimes better for CNNs
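To make "per-parameter adaptive learning rates" and "decoupled weight decay" concrete, here is a sketch of one AdamW update for a single scalar parameter. It follows the commonly published form of the update; the default hyperparameter values are assumptions, and library implementations add details (parameter groups, numerical variants) not shown here.

```python
import math

def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update for a scalar parameter. Returns (w, m, v); t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad           # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad    # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction for early steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * weight_decay * w                # decoupled weight decay, applied to w directly
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)   # step normalized by gradient magnitude
    return w, m, v

# Two parameters whose gradients differ by a factor of 10,000 (decay switched off).
big, _, _ = adamw_step(1.0, grad=100.0, m=0.0, v=0.0, t=1, weight_decay=0.0)
small, _, _ = adamw_step(1.0, grad=0.01, m=0.0, v=0.0, t=1, weight_decay=0.0)

# With a zero gradient, only the weight decay term moves the parameter.
decayed, _, _ = adamw_step(1.0, grad=0.0, m=0.0, v=0.0, t=1)
```

Both `big` and `small` move by roughly `lr` on the first step despite the very different gradient sizes, which is the adaptive behavior the list describes. Setting `weight_decay=0.0` reduces the step to plain Adam.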
For LLM fine-tuning:
- Typical learning rate: 1e-5 to 5e-5
- Typical batch size: 4-16 (with gradient accumulation to simulate larger)
- Warmup: 5-10% of training steps
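These settings interact: the per-device batch size and the accumulation steps determine the effective batch size, which in turn determines how many optimizer steps a run has and how many of them are warmup. The helper below is a hypothetical sketch assuming a single device; the dataset size, epoch count, and field names are illustrative.

```python
import math

def finetune_plan(num_examples, epochs, per_device_batch_size,
                  accumulation_steps, warmup_fraction):
    """Derive step counts for a single-device fine-tuning run."""
    effective_batch_size = per_device_batch_size * accumulation_steps
    # One optimizer step consumes one effective batch; a partial last batch still counts.
    steps_per_epoch = math.ceil(num_examples / effective_batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = round(total_steps * warmup_fraction)
    return {
        "effective_batch_size": effective_batch_size,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }

# Assumed run: 10,000 examples, 3 epochs, batch size 8 with 4 accumulation steps, 5% warmup.
plan = finetune_plan(
    num_examples=10_000,
    epochs=3,
    per_device_batch_size=8,
    accumulation_steps=4,
    warmup_fraction=0.05,
)
```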
Example
Fine-tuning LLaMA 3 on a custom dataset: the practitioner starts with learning_rate=5e-5, batch_size=8, 100 warmup steps, and a cosine schedule, and the loss decreases smoothly. Doubling the learning rate to 1e-4 makes the loss spike erratically. Halving it to 2.5e-5 trains more slowly but converges to a slightly better result. Finding a good pair of values usually takes only a few such experiments.