Core Concepts
Advanced

What Are Scaling Laws for LLMs?

Empirical patterns showing that LLM capabilities improve predictably as model size, training data, and compute increase — enabling reliable planning of AI investments

Also known as:
Schalingswetten (Dutch)
Neural Scaling Laws
Chinchilla Laws
Compute-Optimal Training
What Are Scaling Laws for LLMs? How Model Size, Data & Compute Interact

Scaling laws are empirical relationships that describe how LLM performance improves as a predictable function of three variables: model size (number of parameters), training data (number of tokens), and compute (number of floating-point operations). First rigorously characterized in papers from OpenAI (Kaplan et al., 2020) and DeepMind (Hoffmann et al., 2022, the "Chinchilla" paper), scaling laws revealed that language model loss follows power-law curves — performance improves smoothly and predictably when any of the three scaling axes increases, with no sign of plateauing at current scales. The Chinchilla finding further showed that many earlier models were undertrained relative to their size: for a given compute budget, there exists an optimal balance between model size and training data, approximately 20 tokens per parameter. Scaling laws transformed AI development from trial-and-error experimentation into a quantitative engineering discipline where capability can be reliably forecasted before spending billions on training.
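The 20-tokens-per-parameter balance can be turned into a small calculation. A minimal sketch in Python, assuming the standard training-cost approximation C ≈ 6 · N · D (FLOPs ≈ 6 × parameters × tokens) combined with the Chinchilla ratio; the budget value used below is roughly Chinchilla's reported training compute:

```python
import math

def chinchilla_optimal(flops: float, tokens_per_param: float = 20.0) -> tuple[float, float]:
    """Split a compute budget into a compute-optimal model size and token count.

    Combines the common approximation C ≈ 6 * N * D with the Chinchilla
    ratio D ≈ 20 * N, which gives N = sqrt(C / (6 * 20)).
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly Chinchilla's training budget (~5.8e23 FLOPs):
n, d = chinchilla_optimal(5.76e23)
print(f"params ≈ {n / 1e9:.0f}B, tokens ≈ {d / 1e12:.1f}T")  # params ≈ 69B, tokens ≈ 1.4T
```

The result lands close to Chinchilla's actual configuration (70B parameters, 1.4T tokens), which is exactly the point: the allocation falls out of the ratio rather than out of trial and error.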

Why it matters

Scaling laws are the foundation of strategic AI investment decisions worth hundreds of millions of dollars. They allow organizations to predict with reasonable accuracy what capabilities a model will have at a given scale, how much training will cost, and whether increasing scale will yield enough improvement to justify the investment. Without scaling laws, every new model generation would be a gamble. With them, frontier labs can project that a 10× increase in compute will yield a specific improvement in benchmark performance, plan multi-year training roadmaps, and make business cases for billion-dollar GPU clusters. For organizations using AI rather than building frontier models, scaling laws explain why larger models cost more but deliver genuinely better results (not just marketing claims), help predict when smaller models will be "good enough" for specific tasks, and inform build-vs-buy decisions. Scaling laws also frame the discussion of emergence, the phenomenon where capabilities like chain-of-thought reasoning and few-shot learning appear suddenly at specific scales rather than improving gradually; the underlying loss falls smoothly and predictably, but these discrete capability jumps are considerably harder to forecast than the loss itself.

How it works

Scaling laws express the relationship between loss (a measure of model error) and each scaling variable as a power law: L(X) ∝ X^(-α), where X is the quantity being scaled and α is an empirically determined exponent. For model parameters, α ≈ 0.076; for training tokens, α ≈ 0.095; for compute, α ≈ 0.050. These exponents mean that each 10× increase in parameters reduces loss by approximately 16%, each 10× increase in data reduces loss by approximately 20%, and improvements from the different sources are approximately additive. The Chinchilla insight formalized compute-optimal training: given a fixed compute budget C, the optimal strategy grows model size N and training data D in proportion, with an optimal ratio of approximately 20 tokens per parameter. This explained why a 70B model trained on 1.4 trillion tokens (Chinchilla) outperformed a 280B model trained on only 300 billion tokens (Gopher) despite using comparable compute. Modern training runs use scaling laws to run small-scale experiments first, fit the power-law curves, and extrapolate to predict the performance of full-scale models — before committing hundreds of millions of dollars in compute.
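The extrapolation workflow in the last sentence can be sketched as a log-log least-squares fit. A minimal sketch; the "small-scale runs" here are synthetic, generated from α = 0.076 purely for illustration rather than measured from real training:

```python
import math

def fit_power_law(sizes: list[float], losses: list[float]) -> tuple[float, float]:
    """Fit L(N) = c * N**(-alpha) by linear least squares in log-log space."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    k = len(xs)
    mx, my = sum(xs) / k, sum(ys) / k
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return math.exp(my - slope * mx), -slope  # (c, alpha)

# Synthetic "pilot runs" at 1M, 10M, and 100M parameters.
sizes = [1e6, 1e7, 1e8]
losses = [5.0 * s ** -0.076 for s in sizes]

c, alpha = fit_power_law(sizes, losses)
predicted_1b = c * 1e9 ** (-alpha)  # extrapolated loss for a 1B-parameter model
```

In practice, labs fit such curves to real pilot runs and extrapolate several orders of magnitude upward before committing a full training budget.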

Example

A company is deciding between licensing a 70B-parameter API model and a 7B-parameter model that they can self-host. Scaling laws predict that the 10× parameter difference will yield approximately 16% lower loss on the larger model — which translates to measurably better quality on complex reasoning tasks but marginal differences on simple classification. They run a structured evaluation: on their core use cases (customer email classification, FAQ response, and document summarization), the 70B model outperforms the 7B model by 2%, 8%, and 15% respectively. Scaling laws predicted this pattern — the improvement grows with task complexity. For email classification (simple task), the 7B model at €0.001 per request is cost-optimal. For document summarization (complex task), the 70B model's 15% quality advantage justifies its €0.01 per request cost given the business value of accurate summaries. They implement model routing using task complexity as the selector, achieving 92% of frontier quality at 35% of frontier cost — a decision structure enabled by the predictability that scaling laws provide.
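The routing decision at the end of the example can be sketched as a lookup keyed on task type. A minimal sketch; the model names and per-request prices are illustrative values from the example, not real offerings:

```python
# Task -> (model, price per request in EUR); complex tasks go to the large model.
ROUTES = {
    "email_classification": ("7b-self-hosted", 0.001),  # simple: small model is cost-optimal
    "faq_response": ("7b-self-hosted", 0.001),
    "document_summarization": ("70b-api", 0.01),        # complex: quality gap justifies price
}

def route(task: str) -> tuple[str, float]:
    """Pick a model for a task, defaulting to the cheap model for unknown tasks."""
    return ROUTES.get(task, ("7b-self-hosted", 0.001))

model, price = route("document_summarization")  # -> ("70b-api", 0.01)
```

Production routers often replace the static table with a learned classifier over the incoming request, but the economics are the same: pay for scale only where task complexity demands it.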

Sources

  1. Kaplan et al. (2020) — Scaling Laws for Neural Language Models (arXiv)
  2. Hoffmann et al. (2022) — Training Compute-Optimal Large Language Models (Chinchilla) (arXiv)
  3. Wikipedia

Need help implementing AI?

I can help you apply this concept to your business.

Get in touch

Related Concepts

Large Language Model (LLM)
A neural network trained on massive text data to understand and generate human-like language
AI Inference
The process of running a trained LLM to generate output from input
Quantization
Reducing model weight precision from 16/32-bit to 8/4-bit to shrink size and speed up inference
Token in AI
The smallest unit of text an LLM processes — approximately 4 characters or 0.75 words


© 2026 BVDNET. All rights reserved.