Batch Size & Learning Rate Explained | AI Dictionary

What are Batch Size and Learning Rate?

Batch size is the number of training examples processed together before the model's weights are updated. Learning rate is the step size of each update: how far the model moves its parameters in response to a batch's gradient. Together, they are two of the most influential hyperparameters controlling how a neural network learns.

Why It Matters

These two numbers can make or break model training. A learning rate that is too high makes training diverge or oscillate wildly; one that is too low makes the model learn very slowly or stall. A batch size that is too small gives noisy gradient estimates; one that is too large can waste compute and may reduce generalization. Understanding these trade-offs is essential for anyone training or fine-tuning AI models.

How It Works

Learning rate:

Controls the magnitude of weight updates: new_weight = old_weight - learning_rate × gradient
Typical range: 1e-5 to 1e-2 (0.00001 to 0.01), depending on the optimizer, model, and task
Too high → training diverges, loss explodes
Too low → training is very slow, may stall on a plateau or in a poor local minimum
Just right → smooth convergence to a good solution
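
The update rule above can be tried on a toy problem. The sketch below is an illustration only: it minimises the one-parameter function f(w) = w**2 (an assumption chosen because its gradient, 2w, is easy to check by hand), and the three learning rates are picked to show the "too high", "too low", and "just right" cases rather than taken from any real model.

```python
def gradient_descent(learning_rate, steps=20, start=1.0):
    """Minimise f(w) = w**2 with plain gradient descent; return every w visited."""
    w = start
    history = [w]
    for _ in range(steps):
        gradient = 2.0 * w                 # derivative of w**2
        w = w - learning_rate * gradient   # new_weight = old_weight - learning_rate * gradient
        history.append(w)
    return history


too_high = gradient_descent(1.1)     # each step overshoots the minimum and grows
too_low = gradient_descent(0.001)    # barely moves away from the start
reasonable = gradient_descent(0.3)   # shrinks quickly towards the minimum at w = 0

for name, run in [("too high", too_high), ("too low", too_low), ("reasonable", reasonable)]:
    print(f"{name:>10}: final w = {run[-1]:.6g}")
```

With the high rate the sign of w flips on every step while its size grows, which is the oscillating divergence described above.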

Learning rate schedules:

Warmup — start with a very low rate, gradually increase (prevents early instability)
Cosine decay — gradually reduce the learning rate following a cosine curve
Step decay — reduce by a factor at specific epochs
Warmup + cosine — the standard for transformer training
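
As a rough sketch, the schedules listed above can be written as plain functions of the step (or epoch) number. The function names, the linear shape of the warmup, and the example values (peak rate 1e-3, 10 warmup steps, 100 total steps) are assumptions made for illustration; training libraries ship their own scheduler classes with their own conventions.

```python
import math


def warmup_cosine_lr(step, peak_lr, warmup_steps, total_steps, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Warmup: ramp up linearly so the first updates are small.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + (peak_lr - min_lr) * cosine


def step_decay_lr(base_lr, epoch, step_size, factor):
    """Multiply the rate by `factor` once every `step_size` epochs."""
    return base_lr * factor ** (epoch // step_size)


schedule = [warmup_cosine_lr(s, peak_lr=1e-3, warmup_steps=10, total_steps=100)
            for s in range(101)]
print("first:", schedule[0], "peak:", max(schedule), "last:", schedule[-1])
```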

Batch size:

Number of examples per gradient update step
Small batches (8-32) — noisy gradients, good regularization, slower training
Large batches (256-4096+) — smoother gradients, faster training, may need learning rate scaling
Gradient accumulation — simulate large batches on limited GPU memory
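
Gradient accumulation can be illustrated without a deep learning framework. The sketch below assumes a one-parameter model y_hat = w * x with mean squared error and a made-up dataset of eight points; it only shows that summing suitably weighted micro-batch gradients reproduces the gradient of the full batch, which is why accumulation can stand in for a larger batch when memory is limited.

```python
def mse_gradient(w, batch):
    """Gradient of mean squared error with respect to w for y_hat = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


def accumulated_gradient(w, data, micro_batch_size):
    """Process the data in small chunks and accumulate their gradients."""
    total = 0.0
    for i in range(0, len(data), micro_batch_size):
        micro_batch = data[i:i + micro_batch_size]
        # Weight each chunk by its share of the examples so the sum equals
        # the mean gradient over the whole batch.
        total += mse_gradient(w, micro_batch) * len(micro_batch) / len(data)
    return total  # a single weight update would be applied with this value


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # hypothetical data, y = 3x
w = 0.5

full_batch = mse_gradient(w, data)
accumulated = accumulated_gradient(w, data, micro_batch_size=2)
print(full_batch, accumulated)
```

The weights are updated once per accumulated batch, not once per micro-batch, so the number of optimizer steps matches training with the larger batch.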

The relationship:

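Larger batch sizes often need higher learning rates (the linear scaling rule)
Under that rule, the ratio learning_rate / batch_size matters more than either value alone
Popular heuristic: when doubling batch size, also double learning rate; treat this as a starting point, since it tends to break down at very large batch sizes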
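
The linear scaling heuristic amounts to keeping learning_rate / batch_size constant. A minimal sketch, with a hypothetical helper name and made-up baseline values (learning rate 0.1 at batch size 256):

```python
def linearly_scaled_lr(base_lr, base_batch_size, new_batch_size):
    """Scale the learning rate in proportion to the batch size (a heuristic)."""
    return base_lr * new_batch_size / base_batch_size


base_lr, base_batch = 0.1, 256  # hypothetical baseline that is known to train well
for batch_size in (128, 256, 512, 1024):
    lr = linearly_scaled_lr(base_lr, base_batch, batch_size)
    # The ratio lr / batch_size stays the same for every row.
    print(f"batch_size={batch_size:5d}  lr={lr:.4f}  ratio={lr / batch_size:.6f}")
```

The result is a first guess to validate with a short training run, not a guarantee that the scaled configuration will behave like the baseline.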

Optimizers and the learning rate:

Adam: maintains per-parameter adaptive learning rates (the default for transformers)
AdamW: Adam with decoupled weight decay (standard for LLM training)
SGD with momentum: simpler, with one global learning rate rather than per-parameter adaptation; sometimes better for CNNs
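
To make "per-parameter adaptive" concrete, here is a simplified sketch of a single Adam-style update for one scalar parameter, next to a plain SGD step (momentum is left out to keep the comparison short). The function names and the dictionary used for optimizer state are assumptions for illustration; the default values for beta1, beta2, and eps are the commonly used ones. Setting weight_decay above zero applies the decay directly to the parameter, in the decoupled style associated with AdamW.

```python
import math


def adam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0):
    """One Adam update for a scalar parameter; `state` holds t, m, and v."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad         # running mean of gradients
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad  # running mean of squared gradients
    m_hat = state["m"] / (1 - beta1 ** state["t"])               # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    decayed = param * (1 - lr * weight_decay)                    # decoupled weight decay
    return decayed - lr * m_hat / (math.sqrt(v_hat) + eps)


def sgd_step(param, grad, lr=1e-3):
    """Plain SGD: the step is proportional to the raw gradient."""
    return param - lr * grad


for grad in (0.001, 1.0):
    state = {"t": 0, "m": 0.0, "v": 0.0}
    adam_move = abs(adam_step(0.0, grad, state))
    sgd_move = abs(sgd_step(0.0, grad))
    print(f"grad={grad:<6} adam step={adam_move:.2e}  sgd step={sgd_move:.2e}")
```

On the first step the Adam update is close to the learning rate in size for both gradients, because each gradient is divided by its own running magnitude, while the SGD step differs by the same factor of 1000 as the gradients.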

For LLM fine-tuning:

Typical learning rate: 1e-5 to 5e-5
Typical batch size: 4-16 per step (with gradient accumulation to simulate larger batches)
Warmup: 5-10% of training steps
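
These settings interact, so it can help to compute the derived quantities before launching a run. The helper below is a hypothetical planning sketch, not part of any library: the dataset size (10,000 examples), epoch count, and accumulation steps are made-up inputs, and the warmup length uses the 5% end of the range above.

```python
import math


def plan_fine_tuning(num_examples, epochs, micro_batch_size,
                     accumulation_steps, warmup_fraction=0.05, num_devices=1):
    """Derive effective batch size, optimizer steps, and warmup steps."""
    # Examples that contribute to one weight update.
    effective_batch_size = micro_batch_size * accumulation_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    return {
        "effective_batch_size": effective_batch_size,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


plan = plan_fine_tuning(num_examples=10_000, epochs=3,
                        micro_batch_size=8, accumulation_steps=4)
print(plan)
```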

Example

Fine-tuning LLaMA 3 on a custom dataset: a practitioner starts with learning_rate=5e-5, batch_size=8, 100 warmup steps, and a cosine schedule, and the loss decreases smoothly. Doubling the learning rate to 1e-4 makes the loss spike erratically. Halving it to 2.5e-5 trains more slowly but converges to a slightly better result. Finding a good batch size and learning rate usually takes a few such experiments.
