BVDNET
Core Concepts
Intermediate
2026-W17

What are Batch Size and Learning Rate

Batch size (examples per update) and learning rate (step size for weight updates) are two of the most important hyperparameters controlling how neural networks train.

Also known as: batchgrootte (Dutch for batch size), leersnelheid (Dutch for learning rate), hyperparameters, training hyperparameters

What are Batch Size and Learning Rate?

Batch size is the number of training examples processed together before the model's weights are updated. Learning rate is the step size of each update: how far the model moves its parameters in response to the gradient computed from a batch. Together, they are two of the most important hyperparameters controlling how a neural network learns.

Why It Matters

These two numbers can make or break model training. A learning rate that is too high makes the loss oscillate or diverge; one that is too low makes training crawl or stall. A batch size that is too small produces noisy gradient estimates; one that is too large gives diminishing returns for the extra compute and may hurt generalization. Understanding these trade-offs is essential for anyone training or fine-tuning AI models.

How It Works

Learning rate:

  • Controls the magnitude of weight updates: new_weight = old_weight - learning_rate × gradient
  • Typical range: 1e-5 to 1e-2 (0.00001 to 0.01)
  • Too high → training diverges, loss explodes
  • Too low → training is very slow and may stall on plateaus or in poor local minima
  • Just right → smooth convergence to a good solution
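The three regimes above can be reproduced on a toy problem. The following is a minimal sketch, assuming a one-parameter loss f(w) = w² chosen purely for illustration; the function name `gradient_descent` and the three learning rates are hypothetical and do not come from any real training run.

```python
def gradient_descent(learning_rate, steps=50, start=1.0):
    """Minimize the toy loss f(w) = w**2 with plain gradient descent.

    Returns the final value of w; the optimum is w = 0.
    """
    w = start
    for _ in range(steps):
        gradient = 2.0 * w                  # derivative of w**2
        w = w - learning_rate * gradient    # new_weight = old_weight - learning_rate * gradient
    return w


# Illustrative learning rates for this toy loss only (not realistic model values).
for lr in (1.1, 0.001, 0.1):
    final_w = gradient_descent(lr)
    print(f"learning_rate={lr}: distance from optimum = {abs(final_w):.3g}")
```

With the rate of 1.1 each step overshoots the minimum and the distance grows, with 0.001 the weight has barely moved after 50 steps, and with 0.1 it ends up very close to zero. The numeric thresholds are specific to this toy loss; real networks diverge at much smaller rates.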

Learning rate schedules:

  • Warmup — start with a very low rate, gradually increase (prevents early instability)
  • Cosine decay — gradually reduce the learning rate following a cosine curve
  • Step decay — reduce by a factor at specific epochs
  • Warmup + cosine — the standard for transformer training
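The warmup + cosine combination can be written as a single function of the step number. This is a sketch under stated assumptions: linear warmup, a cosine decay down to `min_lr`, and a hypothetical function name `warmup_cosine_lr`. Training frameworks ship their own schedulers, whose exact shapes may differ slightly.

```python
import math


def warmup_cosine_lr(step, total_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Learning rate at a given step: linear warmup, then cosine decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from peak_lr / warmup_steps up to peak_lr.
        return peak_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    # Cosine factor goes from 1 (start of decay) to 0 (end of training).
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (peak_lr - min_lr) * cosine


# Print a few points of a 1000-step schedule with 100 warmup steps.
for step in (0, 50, 99, 550, 1000):
    print(step, warmup_cosine_lr(step, total_steps=1000, peak_lr=5e-5, warmup_steps=100))
```

The rate reaches `peak_lr` at the end of warmup, falls to half the peak midway through the decay phase, and ends at `min_lr`.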

Batch size:

  • Number of examples per gradient update step
  • Small batches (8-32): noisy gradients that can act as regularization, but lower hardware throughput
  • Large batches (256-4096+): smoother gradients and higher throughput, but the learning rate usually needs rescaling
  • Gradient accumulation: sum gradients over several small micro-batches before each update to simulate a large batch on limited GPU memory
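Gradient accumulation works because the gradient of a mean loss over a large batch equals the average of the gradients over equally sized micro-batches. The sketch below checks this on a hypothetical one-parameter model y = w · x with made-up data; `mse_gradient` and the dataset are illustrative assumptions, not part of any library.

```python
def mse_gradient(w, batch):
    """Gradient of the mean squared error of the model y = w * x over one batch."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)


# Made-up (x, y) pairs, roughly following y = 2x.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9), (4.0, 8.2),
        (5.0, 9.8), (6.0, 12.1), (7.0, 13.9), (8.0, 16.0)]
w = 0.5
micro_batch_size = 2

# Gradient computed on all 8 examples at once (needs memory for the full batch).
full_batch_grad = mse_gradient(w, data)

# Same gradient built from 4 micro-batches of 2 examples each.
micro_batches = [data[i:i + micro_batch_size]
                 for i in range(0, len(data), micro_batch_size)]
accumulated = 0.0
for micro_batch in micro_batches:
    # Divide by the number of micro-batches so the sum is an average.
    accumulated += mse_gradient(w, micro_batch) / len(micro_batches)

print(full_batch_grad, accumulated)  # equal up to floating-point rounding
```

Only one weight update is applied after the loop, so the optimizer sees an effective batch of 8 while memory only ever holds 2 examples. The equivalence is exact only when the micro-batches have equal size.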

The relationship:

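  • Larger batch sizes usually tolerate, and often need, higher learning rates
  • For SGD, the ratio learning_rate / batch_size sets the scale of the gradient noise, so it often matters more than either value alone
  • Popular heuristic (linear scaling rule): when doubling batch size, also double learning rate; adaptive optimizers such as Adam are often scaled more conservatively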
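The doubling heuristic above amounts to keeping learning_rate / batch_size constant. A small helper makes the arithmetic explicit. The name `scale_learning_rate` is hypothetical, and the square-root option is an assumption added here as a commonly cited, more conservative alternative; it is not taken from the text above.

```python
def scale_learning_rate(base_lr, base_batch_size, new_batch_size, rule="linear"):
    """Rescale a tuned learning rate when the batch size changes.

    rule="linear": keep learning_rate / batch_size constant (linear scaling rule).
    rule="sqrt":   scale by the square root of the batch-size ratio (assumed alternative).
    """
    ratio = new_batch_size / base_batch_size
    if rule == "linear":
        return base_lr * ratio
    if rule == "sqrt":
        return base_lr * ratio ** 0.5
    raise ValueError(f"unknown rule: {rule!r}")


# Doubling the batch size doubles the learning rate under the linear rule.
print(scale_learning_rate(0.1, base_batch_size=256, new_batch_size=512))
# Quadrupling the batch size doubles it under the square-root rule.
print(scale_learning_rate(0.1, base_batch_size=256, new_batch_size=1024, rule="sqrt"))
```

Either result is a starting point for further tuning rather than a guarantee; the heuristic tends to break down at very large batch sizes.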

Optimizers and how they use the learning rate:

  • Adam: maintains per-parameter adaptive learning rates
  • AdamW: Adam with decoupled weight decay (the usual default for transformer and LLM training)
  • SGD with momentum: simpler, applies one global learning rate without per-parameter adaptation, sometimes better for CNNs
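To make "per-parameter adaptive learning rates" concrete, here is a sketch of a single Adam-style update for one scalar weight, with an optional decoupled weight-decay term in the style of AdamW. The function `adamw_step` and its `state` dictionary are hypothetical illustrations; the default values of `beta1`, `beta2`, and `eps` are the commonly used ones and are an assumption here.

```python
import math


def adamw_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.0):
    """One Adam update for a scalar weight; weight_decay > 0 adds AdamW-style decay."""
    state["t"] += 1
    # Running averages of the gradient and of the squared gradient.
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = state["m"] / (1 - beta1 ** state["t"])
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    # Decoupled weight decay: shrink the weight directly, outside the gradient term.
    w = w - lr * weight_decay * w
    # The step is normalized by the gradient's typical magnitude.
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)


fresh_state = {"t": 0, "m": 0.0, "v": 0.0}
# Two very different gradient magnitudes, each starting from a fresh state.
w_small = adamw_step(1.0, 0.01, dict(fresh_state), lr=0.1)
w_large = adamw_step(1.0, 100.0, dict(fresh_state), lr=0.1)
print(w_small, w_large)
```

Both weights move by roughly `lr` even though the gradients differ by four orders of magnitude. Plain SGD would instead move them by `lr × gradient`, which is why SGD is usually more sensitive to the learning rate setting.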

For LLM fine-tuning:

  • Typical learning rate: 1e-5 to 5e-5
  • Typical batch size: 4-16 (with gradient accumulation to reach a larger effective batch size)
  • Warmup: 5-10% of training steps
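These settings translate into step counts with a little arithmetic. The sketch below assumes a single device and a hypothetical dataset of 10,000 examples; `training_plan` and all of its parameter names are illustrative and not tied to any particular training library.

```python
import math


def training_plan(num_examples, micro_batch_size, accumulation_steps,
                  epochs, warmup_fraction):
    """Derive effective batch size, optimizer steps, and warmup steps (single device)."""
    # Each optimizer step consumes this many examples.
    effective_batch_size = micro_batch_size * accumulation_steps
    # The last, partial batch of an epoch still counts as one step.
    steps_per_epoch = math.ceil(num_examples / effective_batch_size)
    total_steps = steps_per_epoch * epochs
    warmup_steps = round(total_steps * warmup_fraction)
    return {
        "effective_batch_size": effective_batch_size,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


plan = training_plan(num_examples=10_000, micro_batch_size=8,
                     accumulation_steps=4, epochs=3, warmup_fraction=0.05)
print(plan)
```

Here a micro-batch of 8 with 4 accumulation steps gives an effective batch of 32, and 5% warmup over 939 optimizer steps comes to 47 steps.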

Example

Consider fine-tuning LLaMA 3 on a custom dataset. The practitioner starts with learning_rate=5e-5, batch_size=8, 100 warmup steps, and a cosine schedule, and the loss decreases smoothly. Doubling the learning rate to 1e-4 makes the loss spike erratically. Halving the original rate to 2.5e-5 trains more slowly but converges to a slightly better result. Finding a good combination of batch size and learning rate typically takes a handful of such experiments.

Sources

  1. Smith et al. – Don't Decay the Learning Rate, Increase the Batch Size
  2. Google – Deep Learning Tuning Playbook
