
What is Speculative Decoding?
Speculative decoding is an inference optimization technique that uses a small, fast "draft" model to generate candidate tokens, which are then verified in parallel by the larger, more capable target model. It speeds up text generation without changing the output quality or distribution.
Why It Matters
Autoregressive generation is inherently slow because tokens are produced one at a time. For large models like GPT-4 or Claude, each token requires a full forward pass through billions of parameters. Speculative decoding can achieve 2-3x speedups by amortizing the cost of these forward passes, generating the same quality output in less time and at lower cost.
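The claimed 2-3x speedup can be sanity-checked with a back-of-envelope model. The sketch below is illustrative, not a benchmark: it assumes each draft token costs a fraction `c` of a target forward pass, draft tokens are accepted independently at rate `alpha`, and one target pass verifies a `k`-token draft. All numbers are hypothetical.

```python
# Back-of-envelope speedup estimate for speculative decoding.
# Assumptions (illustrative, not measured): acceptance rate "alpha" is
# independent per token, the draft model costs "c" target-passes per
# token, and one target forward pass verifies all k drafted tokens.

def expected_speedup(alpha, k, c):
    # Expected tokens emitted per round: 1 + alpha + ... + alpha^k
    # (each verification pass always yields at least one token).
    expected_tokens = sum(alpha ** i for i in range(k + 1))
    # Cost per round, in units of one target forward pass:
    # one verification pass plus k cheap draft passes.
    cost = 1 + k * c
    # Baseline autoregressive cost is one target pass per token.
    return expected_tokens / cost

print(round(expected_speedup(alpha=0.8, k=5, c=0.05), 2))  # prints 2.95
```

With an 80% acceptance rate, a 5-token draft, and a draft model 20x cheaper than the target, this toy model lands squarely in the 2-3x range quoted above.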
How It Works
The core idea:
- A small draft model (e.g., 1B parameters) is fast but less accurate
- The large target model (e.g., 70B parameters) is slow but more accurate
- For most tokens (common words, predictable continuations), the small model's predictions are correct
- Only when the small model is wrong does the large model need to "fix" the output
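The core idea above can be sketched end to end with toy probability tables standing in for the two models. Everything here is hypothetical: the vocabulary, the distributions, and the function names are made up for illustration, and for simplicity the sketch omits the usual "bonus" token sampled when every draft token is accepted.

```python
import random

# Toy "models": fixed next-token probability tables over a tiny vocabulary.
VOCAB = ["the", "cat", "sat", "on", "mat"]

def draft_probs(prefix):
    # Small, fast model: biased toward "the".
    return {"the": 0.4, "cat": 0.2, "sat": 0.2, "on": 0.1, "mat": 0.1}

def target_probs(prefix):
    # Large model: the distribution we actually want to sample from.
    return {"the": 0.25, "cat": 0.25, "sat": 0.2, "on": 0.15, "mat": 0.15}

def sample(probs, rng):
    tokens, weights = zip(*probs.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

def speculative_step(prefix, k, rng):
    """One draft/verify/accept-reject round; returns the accepted tokens."""
    # Draft: the small model proposes k tokens sequentially (cheap).
    drafted, ctx = [], list(prefix)
    for _ in range(k):
        tok = sample(draft_probs(ctx), rng)
        drafted.append(tok)
        ctx.append(tok)
    # Verify: in a real system the target model scores all k positions
    # in one parallel forward pass; here we just look up probabilities.
    accepted, ctx = [], list(prefix)
    for tok in drafted:
        p = target_probs(ctx)[tok]  # target probability of the draft token
        q = draft_probs(ctx)[tok]   # draft probability of the same token
        # Accept with probability min(1, p/q); this keeps the output
        # distributed exactly as if the target model had sampled alone.
        if rng.random() < min(1.0, p / q):
            accepted.append(tok)
            ctx.append(tok)
        else:
            # On rejection, resample from the renormalized residual
            # distribution max(0, p - q) and stop this round.
            residual = {t: max(0.0, target_probs(ctx)[t] - draft_probs(ctx)[t])
                        for t in VOCAB}
            total = sum(residual.values())
            residual = {t: w / total for t, w in residual.items()}
            accepted.append(sample(residual, rng))
            break
    return accepted
```

A single call like `speculative_step(["the"], k=5, rng=random.Random(0))` returns between 1 and 5 tokens: the run of accepted draft tokens, plus a corrected token if the target model rejected one.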
Step by step:
- Draft: the small model generates k tokens quickly (e.g., k=5)
- Verify: the large model processes all k tokens in a single forward pass (this is fast because scoring known tokens is parallel, unlike one-at-a-time generation)
- Accept/reject: compare the large model's distribution with the draft tokens: