What is Speculative Decoding?

Speculative decoding is an inference optimization technique that uses a small, fast "draft" model to generate candidate tokens, which are then verified in parallel by the larger, more capable target model. It speeds up text generation without changing the output quality or distribution.

Why It Matters

Autoregressive generation is inherently sequential: each token depends on all the tokens before it, so tokens are produced one at a time. For large models like GPT-4 or Claude, each token requires a full forward pass through billions of parameters. Speculative decoding can achieve 2-3x speedups by amortizing the cost of these forward passes over several tokens, generating the same quality output in less time and at lower cost.

How It Works

The core idea:

A small draft model (e.g., 1B parameters) is fast but less accurate
The large target model (e.g., 70B parameters) is slow but more accurate
For most tokens (common words, predictable continuations), the small model's prediction matches what the large model would have produced
The large model still checks every token, but it only needs to "fix" the output where the small model disagrees with it

Step by step:

Draft — the small model generates k tokens quickly (e.g., k=5)
Verify — the large model processes all k tokens in a single forward pass (this is fast because it's parallel, unlike generation)
Accept/reject — compare the large model's distribution with the draft tokens:

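If the draft token matches what the large model would have produced → accept (with greedy decoding this is an exact match; with sampling, the token is accepted with probability min(1, p/q), where p and q are the probabilities the large and draft models assign to it)
If not → reject, use the large model's token instead (with sampling, a token drawn from the normalized difference max(0, p - q)), and discard the remaining draft tokens
If all k draft tokens are accepted, the same forward pass of the large model also yields one extra token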

Repeat — draft again from the last accepted token
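
The steps above can be sketched in a few lines of Python. This is a minimal illustration of the greedy-decoding case only, not a production implementation: the "models" are hypothetical callables that map a token sequence to their single most likely next token, and toy_target and toy_draft are made-up stand-ins. A real system would read all k + 1 target predictions off one batched forward pass; the sketch calls the target once per position purely for readability.

```python
from typing import Callable, List

# Hypothetical interface: a "model" maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_round(tokens: List[int], draft_next: NextToken,
                      target_next: NextToken, k: int = 5) -> List[int]:
    """Run one draft/verify/accept round and return the newly emitted tokens."""
    # Draft: the small model proposes k tokens, one after another.
    draft: List[int] = []
    for _ in range(k):
        draft.append(draft_next(tokens + draft))

    # Verify: the target's own choice at each of the k + 1 positions.
    # (One batched forward pass in a real system; a loop here for clarity.)
    target = [target_next(tokens + draft[:i]) for i in range(k + 1)]

    # Accept/reject: keep draft tokens up to the first disagreement.
    emitted: List[int] = []
    for i in range(k):
        if draft[i] != target[i]:
            emitted.append(target[i])  # use the target's token, drop the rest
            return emitted
        emitted.append(draft[i])
    emitted.append(target[k])  # every draft token accepted: one extra token
    return emitted


def speculative_generate(prompt: List[int], draft_next: NextToken,
                         target_next: NextToken, n: int, k: int = 5) -> List[int]:
    """Repeat rounds until n new tokens exist; return exactly n of them."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + n:
        tokens += speculative_round(tokens, draft_next, target_next, k)
    return tokens[len(prompt):len(prompt) + n]


def greedy_generate(prompt: List[int], next_token: NextToken, n: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used as the reference."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(next_token(tokens))
    return tokens[len(prompt):]


# Toy stand-ins for real models (made up for illustration).
def toy_target(seq: List[int]) -> int:
    return (3 * seq[-1] + len(seq)) % 7


def toy_draft(seq: List[int]) -> int:
    # Agrees with the target except at every fourth position.
    guess = toy_target(seq)
    return (guess + 1) % 7 if len(seq) % 4 == 0 else guess


if __name__ == "__main__":
    spec = speculative_generate([1], toy_draft, toy_target, n=20)
    ref = greedy_generate([1], toy_target, n=20)
    print(spec == ref)  # speculative output should equal plain greedy output
```

Because a draft token is kept only when it equals the target's own choice for the same prefix, the emitted sequence is identical to what the target would produce alone; the draft model affects only how many tokens each round yields.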

Why it works:

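The verification step scores all k draft tokens in one forward pass, at roughly the cost of generating 1 token, because decoding is usually limited by reading the model weights from memory rather than by compute
If most draft tokens are accepted, you get close to k tokens for the cost of about one large-model pass plus k cheap draft passes
The acceptance rate depends on how well the draft model's distribution matches the target's
With exact-match verification under greedy decoding, or the rejection-sampling rule under sampling (accept with probability min(1, p/q), resample on rejection), the output is mathematically guaranteed to follow the same distribution as the target model alone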
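
The distribution guarantee can be checked empirically for the sampling case. The sketch below assumes the commonly used rejection-sampling rule: a draft token x sampled from the draft distribution q is accepted with probability min(1, p[x]/q[x]), where p is the target distribution, and on rejection a replacement is drawn from the normalized residual max(0, p - q). The function names and the example distributions are made up for illustration.

```python
import random
from typing import List


def accept_or_resample(p: List[float], q: List[float], x: int,
                       rng: random.Random) -> int:
    """Return the token to emit, given draft token x that was sampled from q.

    p and q are the target and draft next-token distributions over the
    same vocabulary (lists of probabilities that each sum to 1).
    """
    # Accept the draft token with probability min(1, p[x] / q[x]).
    # q[x] > 0 here because x was sampled from q.
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # Rejected: resample from the residual max(0, p - q).
    # rng.choices normalizes the weights itself.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(range(len(p)), weights=residual)[0]


def emitted_frequencies(p: List[float], q: List[float], n: int,
                        seed: int = 0) -> List[float]:
    """Draft from q, apply the accept/resample rule n times, return token frequencies."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n):
        x = rng.choices(range(len(q)), weights=q)[0]  # draft proposes x ~ q
        counts[accept_or_resample(p, q, x, rng)] += 1
    return [c / n for c in counts]


if __name__ == "__main__":
    target_p = [0.6, 0.3, 0.1]  # illustrative target distribution
    draft_q = [0.3, 0.5, 0.2]   # illustrative (mismatched) draft distribution
    print(emitted_frequencies(target_p, draft_q, n=100_000))
```

Even though the tokens are proposed from q, the emitted frequencies should track p up to sampling noise. The chance that a draft token is accepted is the sum over tokens of min(p, q), which is why a draft model whose distribution is closer to the target's gets more of its tokens accepted.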

Variants:

Self-speculative decoding — use early layers of the same model as the draft
Medusa — add extra prediction heads to the target model
Eagle — use a specialized draft architecture trained on the target model's representations

Example

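Generating a response with a 70B model at 30 tokens/sec (about 33 ms per token): the 1B draft model generates 5 candidate tokens in 2 ms. The 70B model verifies all 5 in one pass (taking about 33 ms instead of 5×33 ms). If 4 out of 5 are accepted, the effective speed is at least 4 tokens per 35 ms ≈ 114 tokens/sec, nearly 4x faster for the same output quality. Counting the token the 70B model supplies in place of the rejected one, the round yields 5 tokens, or about 143 tokens/sec. These figures are idealized: they assume verifying 5 tokens costs the same as generating 1 and ignore any other overhead.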
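
The arithmetic in this example can be reproduced with a short script. It uses the example's own timings (2 ms of drafting, 33 ms of verification per round) and reports throughput both for the 4 accepted draft tokens alone and for 5 tokens, which also counts the token the target model supplies at the rejected position. The helper expected_tokens_per_round is an extra illustration under a simplifying assumption that is not in the example: each draft token is accepted independently with the same probability alpha.

```python
def tokens_per_second(tokens_per_round: float, draft_ms: float,
                      verify_ms: float) -> float:
    """Throughput when each round costs draft_ms + verify_ms milliseconds."""
    return tokens_per_round * 1000.0 / (draft_ms + verify_ms)


def expected_tokens_per_round(alpha: float, k: int) -> float:
    """Expected tokens emitted per round with k draft tokens.

    Assumption (for illustration only): every draft token is accepted
    independently with probability alpha. The count includes the one token
    the target model always contributes (a replacement or an extra token).
    """
    if alpha >= 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


baseline = 1000.0 / 33.0                          # target alone, ~30 tokens/sec
accepted_only = tokens_per_second(4, 2.0, 33.0)   # 4 accepted draft tokens
with_target_token = tokens_per_second(5, 2.0, 33.0)  # plus the target's token

if __name__ == "__main__":
    print(f"baseline:          {baseline:.1f} tokens/sec")
    print(f"accepted only:     {accepted_only:.1f} tokens/sec "
          f"({accepted_only / baseline:.2f}x)")
    print(f"with target token: {with_target_token:.1f} tokens/sec "
          f"({with_target_token / baseline:.2f}x)")
    print(f"expected tokens per round at alpha=0.8, k=5: "
          f"{expected_tokens_per_round(0.8, 5):.2f}")
```

Under the independence assumption, the expected yield per round flattens out as k grows unless alpha is very high, which is one reason draft lengths are usually kept small.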