
What is Speculative Decoding?
Speculative decoding is an inference optimization technique that uses a small, fast "draft" model to generate candidate tokens, which are then verified in parallel by the larger, more capable target model. It speeds up text generation without changing the output quality or distribution.
Why It Matters
Autoregressive generation is inherently slow because tokens are produced one at a time. For large models like GPT-4 or Claude, each token requires a full forward pass through billions of parameters. Speculative decoding amortizes the cost of each large-model forward pass over several tokens, with reported speedups typically around 2-3x. The output quality is unchanged; it simply arrives in less time, and often at a lower serving cost per token.
How It Works
The core idea:
- A small draft model (e.g., 1B parameters) is fast but less accurate
- The large target model (e.g., 70B parameters) is slow but more accurate
- For most tokens (common words, predictable continuations), the small model's predictions agree with the large model's
- The large model still checks every draft token, but it only has to override the output where the small model's prediction diverges
Step by step:
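- Draft: the small model generates k tokens autoregressively (e.g., k=5)
- Verify: the large model scores all k draft positions in a single forward pass (fast because the positions are processed in parallel, unlike token-by-token generation)
- Accept/reject: walk through the draft tokens in order, comparing each against the large model's distribution at that position:
  - With greedy decoding, accept the draft token if it matches the large model's top choice. With sampling, accept it with probability min(1, p(x)/q(x)), where p is the large model's probability and q is the draft model's probability for token x
  - On the first rejection, discard that token and all later draft tokens, and use the large model's token for that position instead (with sampling, drawn from the normalized residual max(0, p - q))
  - If all k draft tokens are accepted, the same pass also yields one extra token from the large model
- Repeat: draft again from the end of the accepted sequence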
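The steps above can be sketched for the greedy case as follows. This is a minimal illustration, not a production implementation: `target_next` and `draft_next` are hypothetical toy functions standing in for real models, and the toy calls the target once per position where a real system would score all positions in one batched forward pass.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model: maps a token sequence to the next
# token it would pick under greedy decoding.
NextToken = Callable[[List[int]], int]


def target_next(seq: List[int]) -> int:
    # Toy "large" model: an arbitrary deterministic rule over the context.
    return (seq[-1] * 3 + len(seq)) % 7


def draft_next(seq: List[int]) -> int:
    # Toy "small" model: agrees with the target except when the context
    # length is a multiple of 4, where it deliberately guesses wrong.
    guess = target_next(seq)
    return (guess + 1) % 7 if len(seq) % 4 == 0 else guess


def greedy_baseline(prompt: List[int], n_new: int, target: NextToken) -> List[int]:
    # Plain autoregressive decoding: one target call per generated token.
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target(seq))
    return seq


def speculative_greedy(
    prompt: List[int], n_new: int, k: int, draft: NextToken, target: NextToken
) -> Tuple[List[int], int]:
    seq = list(prompt)
    target_passes = 0
    while len(seq) - len(prompt) < n_new:
        # Draft: propose k tokens autoregressively with the small model.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(seq + proposal))

        # Verify: counted as ONE target pass, since a real model would
        # score every draft position in a single batched forward pass.
        target_passes += 1
        accepted: List[int] = []
        for i in range(k):
            expected = target(seq + proposal[:i])
            if proposal[i] == expected:
                accepted.append(proposal[i])   # accept the draft token
            else:
                accepted.append(expected)      # reject: use the target's token
                break                          # discard remaining draft tokens
        else:
            # All k draft tokens accepted: the same pass yields one extra token.
            accepted.append(target(seq + proposal))
        seq.extend(accepted)
    return seq[: len(prompt) + n_new], target_passes


tokens, passes = speculative_greedy([1, 2], 20, 5, draft_next, target_next)
print(tokens == greedy_baseline([1, 2], 20, target_next), passes)
```

In this toy setup the speculative output is identical to plain greedy decoding, while the number of target passes is well below the number of generated tokens.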
Why it works:
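- The verification step scores k tokens in one pass at roughly the cost of generating 1 token, because decoding at small batch sizes is typically limited by reading the model weights from memory rather than by compute
- If most draft tokens are accepted, you get close to k tokens for the cost of about one large-model pass plus the much cheaper draft passes
- The acceptance rate depends on how closely the draft model's distribution matches the target's
- With the accept/reject rule used in speculative sampling (accept with probability min(1, p/q), resample from the residual on rejection), the output is mathematically guaranteed to follow the same distribution as the target model alone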
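The distribution guarantee can be checked numerically for a single token position. The sketch below assumes the standard speculative sampling rule (accept a draft token x with probability min(1, p(x)/q(x)); on rejection, resample from the normalized residual max(0, p - q)). The function names and the example distributions `p` and `q` are illustrative choices, not part of any library.

```python
from typing import List


def acceptance_probability(p: List[float], q: List[float]) -> float:
    # Chance that a token drawn from the draft distribution q is accepted:
    # sum over x of q(x) * min(1, p(x)/q(x)) = sum over x of min(p(x), q(x)).
    return sum(min(pi, qi) for pi, qi in zip(p, q))


def residual_distribution(p: List[float], q: List[float]) -> List[float]:
    # Distribution to resample from after a rejection: normalized max(0, p - q).
    raw = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(raw)
    if total == 0.0:
        return list(p)  # p equals q, so a rejection never actually happens
    return [r / total for r in raw]


def output_distribution(p: List[float], q: List[float]) -> List[float]:
    # Exact distribution of the emitted token:
    # (mass of accepted draft tokens) + P(reject) * (residual distribution).
    accepted = [qi * min(1.0, pi / qi) if qi > 0 else 0.0 for pi, qi in zip(p, q)]
    reject_prob = 1.0 - sum(accepted)
    residual = residual_distribution(p, q)
    return [a + reject_prob * r for a, r in zip(accepted, residual)]


p = [0.6, 0.3, 0.1]  # illustrative target distribution over 3 tokens
q = [0.4, 0.4, 0.2]  # illustrative draft distribution over the same tokens

print(output_distribution(p, q))     # matches p up to floating-point error
print(acceptance_probability(p, q))  # about 0.8 for these two distributions
```

The emitted distribution equals `p` regardless of how poor `q` is; a worse draft only lowers the acceptance probability, and with it the speedup.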
Variants:
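- Self-speculative decoding: use a subset of the target model's own layers (for example, its early layers) as the draft, so no separate draft model is needed
- Medusa: add extra prediction heads to the target model so that it proposes several future tokens itself
- EAGLE: use a specialized draft architecture trained on the target model's internal representations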
Example
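Suppose a 70B model generates 30 tokens/sec on its own, about 33 ms per token. The 1B draft model proposes 5 candidate tokens in 2 ms, and the 70B model verifies all 5 in one pass (about 33 ms instead of 5×33 ms). If 4 of the 5 are accepted, the step yields 4 tokens in 35 ms, or about 114 tokens/sec: roughly 3.8x faster for the same output quality. Counting the large model's own replacement for the rejected fifth token, the step actually yields 5 tokens, or about 143 tokens/sec.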
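The arithmetic in this example can be reproduced with a short script. The helper names are illustrative, and `expected_tokens_per_step` rests on a simplifying assumption that is not part of the example: that each draft token is accepted independently with the same probability `alpha`.

```python
def tokens_per_second(tokens_per_step: float, draft_ms: float, verify_ms: float) -> float:
    # Throughput when each draft + verify step takes draft_ms + verify_ms.
    return tokens_per_step / ((draft_ms + verify_ms) / 1000.0)


def expected_tokens_per_step(alpha: float, k: int) -> float:
    # Expected tokens emitted per step, ASSUMING each of the k draft tokens
    # is accepted independently with probability alpha, and counting the
    # one token the target contributes at the end of every step.
    if alpha == 1.0:
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


VERIFY_MS = 33.0  # one 70B forward pass (about 30 tokens/sec)
DRAFT_MS = 2.0    # time for the 1B model to draft 5 tokens

baseline = 1000.0 / VERIFY_MS                          # plain decoding
speculative = tokens_per_second(4, DRAFT_MS, VERIFY_MS)  # 4 tokens per step

print(f"baseline:    {baseline:.1f} tokens/sec")
print(f"speculative: {speculative:.1f} tokens/sec")
print(f"speedup:     {speculative / baseline:.2f}x")
print(f"expected tokens/step at alpha=0.8, k=5: {expected_tokens_per_step(0.8, 5):.2f}")
```

Under the independence assumption, raising k gives diminishing returns: the expected tokens per step can never exceed 1 / (1 - alpha), while the draft time keeps growing with k.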