Models & Architecture
Advanced
2026-W17

What is Speculative Decoding?

Speculative decoding speeds up LLM inference by having a small draft model generate candidate tokens that the large model verifies in parallel. The output distribution is unchanged, and reported speedups are typically around 2-3x.

Also known as:
speculatief decoderen
assisted generation
draft-and-verify

Speculative decoding is an inference optimization technique that uses a small, fast "draft" model to generate candidate tokens, which are then verified in parallel by the larger, more capable target model. It speeds up text generation without changing the output quality or distribution.

Why It Matters

Autoregressive generation is inherently slow because tokens are produced one at a time. For large models like GPT-4 or Claude, each token requires a full forward pass through billions of parameters, and at small batch sizes that pass is typically limited by memory bandwidth (loading the weights) rather than by arithmetic. Speculative decoding can achieve 2-3x speedups by amortizing one large-model forward pass over several tokens, producing output from the same distribution in less time and often at lower cost.

How It Works

The core idea:

  • A small draft model (e.g., 1B parameters) is fast but less accurate
  • The large target model (e.g., 70B parameters) is slow but more accurate
  • For many tokens (common words, predictable continuations), the small model's prediction agrees with what the large model would have produced
  • Only when the small model's proposal is rejected does the large model need to "fix" the output

Step by step:

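  1. Draft: the small model generates k tokens autoregressively (e.g., k=5)
  2. Verify: the large model scores the prefix plus all k draft tokens in a single forward pass, which yields its next-token distribution at every draft position (the positions are processed in parallel, unlike token-by-token generation)
  3. Accept/reject: going left to right, compare the large model's distribution p with the draft model's distribution q at each draft token x:
  • Accept x with probability min(1, p(x)/q(x)); with greedy decoding this reduces to accepting when the draft token matches the token the large model would have produced
  • On the first rejection, sample a replacement from the large model's adjusted distribution (proportional to max(0, p - q)) and discard the remaining draft tokens
  • If all k draft tokens are accepted, the same forward pass also supplies one extra token from the large model
  4. Repeat: draft again from the last emitted token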
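
The loop above can be sketched with two toy "models" in plain Python. This is a minimal illustration under stated assumptions, not a production implementation: `target_probs` and `draft_probs` are hypothetical stand-ins for the large and small model (here, fixed distributions over a 4-token vocabulary), and the verify step calls the target once per position, where a real system would obtain all k+1 distributions from one batched forward pass.

```python
import random

VOCAB_SIZE = 4  # toy vocabulary: token ids 0..3 (assumption for illustration)


def target_probs(prefix):
    """Hypothetical 'large' model: strongly prefers (last token + 1) mod 4."""
    probs = [0.1] * VOCAB_SIZE
    probs[(prefix[-1] + 1) % VOCAB_SIZE] = 0.7
    return probs


def draft_probs(prefix):
    """Hypothetical 'small' model: same preference, but less confident."""
    probs = [0.2] * VOCAB_SIZE
    probs[(prefix[-1] + 1) % VOCAB_SIZE] = 0.4
    return probs


def sample(weights, rng):
    """Sample a token id; weights need not be normalized."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def speculative_round(prefix, k, rng):
    """One draft/verify/accept round. Returns between 1 and k+1 new tokens."""
    # 1. Draft: the small model proposes k tokens autoregressively.
    drafted, q_list, ctx = [], [], list(prefix)
    for _ in range(k):
        q = draft_probs(ctx)
        tok = sample(q, rng)
        drafted.append(tok)
        q_list.append(q)
        ctx.append(tok)

    # 2. Verify: target distributions at all k+1 positions
    #    (one batched forward pass in a real system).
    p_list = [target_probs(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # 3. Accept/reject, left to right.
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_list[i], q_list[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)  # accepted draft token
        else:
            # Rejected: resample from the residual max(0, p - q),
            # then discard the remaining draft tokens.
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            out.append(sample(residual, rng))
            return out
    # All k accepted: the target's last distribution gives one extra token.
    out.append(sample(p_list[k], rng))
    return out


def generate(prefix, n_tokens, k=4, seed=0):
    """4. Repeat rounds until n_tokens new tokens have been produced."""
    rng = random.Random(seed)
    tokens, rounds = list(prefix), 0
    while len(tokens) - len(prefix) < n_tokens:
        tokens.extend(speculative_round(tokens, k, rng))
        rounds += 1
    return tokens[: len(prefix) + n_tokens], rounds


tokens, rounds = generate([0], n_tokens=20)
print(f"{len(tokens) - 1} tokens in {rounds} target-model rounds")
```

Each round costs one target-model pass but emits at least one token, so the number of rounds never exceeds the number of tokens generated; the better the draft matches the target, the fewer rounds are needed.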

Why it works:

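  • The verification step processes k tokens in one pass, at roughly the cost of generating 1 token, because small-batch decoding is dominated by loading the model weights rather than by arithmetic
  • If most draft tokens are accepted, you get close to k tokens for the cost of about one large-model pass plus k cheap draft passes
  • The acceptance rate depends on how well the draft model matches the target
  • With the probabilistic accept/reject rule (accept a draft token x with probability min(1, p(x)/q(x)), otherwise resample from the normalized residual max(0, p - q)), the output is mathematically guaranteed to follow the same distribution as the target model alone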
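
The distribution guarantee can be checked exactly for a single position. The sketch below is an illustration under stated assumptions: `p` and `q` are made-up target and draft distributions over four tokens, and `output_distribution` is a hypothetical helper that computes the probability of emitting each token after one draft-then-accept/resample step, using exact fractions so there is no rounding.

```python
from fractions import Fraction as F


def output_distribution(p, q):
    """Exact distribution of the emitted token for one draft + accept/resample step.

    p: target distribution, q: draft distribution (lists of Fractions).
    """
    n = len(p)
    # Probability that token x is drafted AND accepted:
    # q(x) * min(1, p(x)/q(x)) = min(p(x), q(x)).
    accepted = [min(p[x], q[x]) for x in range(n)]
    reject_mass = 1 - sum(accepted)  # probability that the draft is rejected
    # On rejection, resample from the normalized residual max(0, p - q).
    residual = [max(F(0), p[x] - q[x]) for x in range(n)]
    total = sum(residual)
    if total == 0:  # p == q: nothing is ever rejected
        return accepted
    return [accepted[x] + reject_mass * residual[x] / total for x in range(n)]


# Made-up example distributions (assumptions for illustration).
p = [F(7, 10), F(1, 10), F(1, 10), F(1, 10)]  # target
q = [F(2, 5), F(1, 5), F(1, 5), F(1, 5)]      # draft

assert output_distribution(p, q) == p  # emitted token follows the target exactly

acceptance_rate = sum(min(pi, qi) for pi, qi in zip(p, q))
print(acceptance_rate)  # 7/10
```

The acceptance rate here is the overlap between the two distributions, the sum of min(p(x), q(x)), which is why a draft model that matches the target more closely gets more of its tokens accepted.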

Variants:

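  • Self-speculative decoding: use a subset of the target model's own layers (for example, early exit or layer skipping) as the draft, so no separate model is needed
  • Medusa: add extra prediction heads to the target model that propose several future tokens at once
  • EAGLE: use a lightweight draft module trained on the target model's internal representations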

Example

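Suppose a 70B model generates 30 tokens/sec on its own, i.e. about 33 ms per token. The 1B draft model generates 5 candidate tokens in 2 ms, and the 70B model verifies all 5 in one pass (about 33 ms instead of 5×33 ms). If 4 out of 5 are accepted, the round yields 4 tokens in 35 ms ≈ 114 tokens/sec, about 3.8x faster with the same output distribution. This slightly undercounts the round, since the large model also supplies a replacement for the rejected token. The numbers are illustrative; measured speedups depend on the acceptance rate and on how cheap the draft model really is.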
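
The arithmetic of this example can be reproduced in a few lines. The sketch below also includes the expected number of tokens per round from Leviathan et al., (1 - α^(k+1)) / (1 - α), which assumes each draft token is accepted independently with probability α; the symbol α and the function names are introduced here for illustration and do not appear in the example above.

```python
def tokens_per_second(tokens_per_round, draft_ms, verify_ms):
    """Throughput when each round takes draft_ms + verify_ms milliseconds."""
    return tokens_per_round / ((draft_ms + verify_ms) / 1000.0)


def expected_tokens_per_round(alpha, k):
    """Expected tokens emitted per round with k draft tokens.

    Assumes each draft token is accepted independently with probability
    alpha; counts the token the target model contributes at the end.
    """
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


baseline = 30.0  # tokens/sec for the 70B model alone (from the example)

# The example's numbers: 4 tokens per round, 2 ms drafting, 33 ms verification.
speculative = tokens_per_second(4, draft_ms=2.0, verify_ms=33.0)
print(round(speculative))               # 114
print(round(speculative / baseline, 2)) # 3.81

# Same timings, but with an assumed per-token acceptance rate instead of a
# fixed "4 out of 5" outcome.
for alpha in (0.6, 0.8, 0.9):
    n = expected_tokens_per_round(alpha, k=5)
    rate = tokens_per_second(n, 2.0, 33.0)
    print(f"alpha={alpha}: {n:.2f} tokens/round, {rate:.0f} tokens/sec")
```

The second part shows how sensitive the speedup is to the acceptance rate: with the same timings, a lower acceptance rate yields fewer tokens per round and a correspondingly smaller gain.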
