
What is Autoregressive Generation?
Autoregressive generation is the method by which large language models produce text: they generate one token at a time, and each new token is predicted from the prompt plus all previously generated tokens. The model's output feeds back as input for the next step, creating a sequential chain of predictions.
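The definition above is commonly written as a chain-rule factorization. The symbols here are assumptions introduced for illustration and do not come from the text: c is the prompt, x_t is the t-th generated token, and T is the output length.

```latex
p(x_1, \dots, x_T \mid c) \;=\; \prod_{t=1}^{T} p\left(x_t \mid c,\, x_1, \dots, x_{t-1}\right)
```

Each factor on the right corresponds to one prediction step, which is why the output has to be produced in order.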
Why It Matters
Understanding autoregressive generation explains fundamental LLM behaviors: why responses stream in word by word, why longer outputs take longer (and cost more), why models can "lose the thread" in long responses, and why techniques like KV-cache and speculative decoding exist to speed up generation. It's the core mechanism behind every GPT, Claude, Gemini, and LLaMA response.
How It Works
1. Input processing: the model receives the full prompt and encodes it using self-attention (the "prefill" phase).
2. Token prediction: the model predicts a probability distribution over its entire vocabulary for the next token.
3. Sampling: one token is selected from this distribution (using temperature, top-p, or other sampling strategies).
4. Feedback: the selected token is appended to the sequence, and the model uses the extended sequence to predict the next token.
5. Repeat: steps 2–4 continue until the model produces a stop token or reaches a length limit.
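The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular model is implemented: `toy_logits` is a hypothetical stand-in for a real forward pass, and `VOCAB`, `sample`, and `generate` are names invented for this example.

```python
import math
import random

VOCAB = ["<bos>", "Soft", "drops", "fall", "on", "leaves", "<eos>"]
EOS_ID = VOCAB.index("<eos>")

def toy_logits(token_ids):
    """Hypothetical stand-in for a model forward pass over the whole sequence."""
    # Strongly prefer the vocabulary entry that follows the last token.
    preferred = (token_ids[-1] + 1) % len(VOCAB)
    return [4.0 if i == preferred else 0.0 for i in range(len(VOCAB))]

def sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Pick one token id using temperature and top-p (nucleus) sampling."""
    if temperature == 0:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=logits.__getitem__)
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]
    # Keep the smallest set of most-likely tokens whose mass reaches top_p.
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, mass = [], 0.0
    for i in ranked:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

def generate(prompt_ids, max_new_tokens=10, temperature=1.0, top_p=1.0, seed=0):
    """Autoregressive loop: predict, sample, append, repeat."""
    rng = random.Random(seed)
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):          # length limit
        logits = toy_logits(token_ids)       # token prediction
        next_id = sample(logits, temperature, top_p, rng)  # sampling
        token_ids.append(next_id)            # feedback
        if next_id == EOS_ID:                # stop token
            break
    return token_ids[len(prompt_ids):]
```

Calling `generate([0], temperature=0)` walks the toy vocabulary one token per iteration and stops at `<eos>`. A real model would replace `toy_logits` with a neural network forward pass, but the loop structure stays the same.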
Generation is inherently sequential: each token depends on all previous tokens, so it cannot be parallelized across output tokens. This is why:
- Prompts are fast: all prompt tokens are known up front, so they can be processed in parallel
- Generation is slower: output tokens are produced one at a time, each requiring its own forward pass
- Output tokens usually cost more per token than input tokens in API pricing
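To make the asymmetry between prompt and output tokens concrete, here is a rough back-of-the-envelope estimator. It assumes a simple linear model of latency and cost. The function name, its parameters, and every number in the usage note are hypothetical and do not reflect any provider's real throughput or prices.

```python
def estimate_request(prompt_tokens, output_tokens,
                     prefill_tokens_per_s, decode_tokens_per_s,
                     input_price_per_mtok, output_price_per_mtok):
    """Rough latency (seconds) and cost for one request, assuming linear scaling."""
    # Prefill handles the prompt in parallel, so its throughput is much higher.
    prefill_s = prompt_tokens / prefill_tokens_per_s
    # Decoding emits one token per step, so time grows with output length.
    decode_s = output_tokens / decode_tokens_per_s
    # Prices are expressed per million tokens (hypothetical units).
    cost = (prompt_tokens * input_price_per_mtok
            + output_tokens * output_price_per_mtok) / 1_000_000
    return {"prefill_s": prefill_s, "decode_s": decode_s, "cost": cost}
```

With made-up inputs such as `estimate_request(2000, 500, 5000, 50, 1.0, 4.0)`, the 500 output tokens account for most of the estimated time even though the prompt is four times longer. Real systems are not perfectly linear, so treat this only as an intuition aid.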
Speed optimizations:
- KV-cache: stores the attention keys and values computed for earlier tokens so they are not recalculated for each new token
- Speculative decoding: a smaller model drafts several tokens that the larger model then verifies in parallel
- Batching: processes multiple users' generation steps simultaneously on the GPU, which improves overall throughput
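The KV-cache idea can be illustrated with a toy single-head attention step in plain Python. This is a sketch under simplifying assumptions: `project` is a hypothetical stand-in for learned key/value projection matrices, and `KVCache`, `attend`, and `step_without_cache` are names invented for this example, not any library's API.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]  # stable softmax numerators
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(dim)]

def project(x, factor):
    """Hypothetical stand-in for a learned projection (e.g. W_k or W_v)."""
    return [factor * xi for xi in x]

class KVCache:
    """Keeps keys and values of earlier tokens so each step projects only the new one."""
    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # number of key/value pairs computed so far

    def step(self, x):
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))
        self.projections += 1
        return attend(x, self.keys, self.values)

def step_without_cache(xs):
    """Recompute keys and values for the whole prefix, then attend from the last token."""
    keys = [project(x, 0.5) for x in xs]
    values = [project(x, 2.0) for x in xs]
    return attend(xs[-1], keys, values), len(xs)
```

Feeding the same token embeddings through `KVCache.step` one at a time gives the same attention output as `step_without_cache` on the growing prefix, but the cached version computes one key/value pair per step instead of one per token in the prefix.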
Example
When Claude responds to "Write a haiku about rain," it doesn't generate the entire poem at once. It might first predict "Soft", then, given "Soft", predict "drops", then, given "Soft drops", predict "fall", and so on, one token at a time until the haiku is complete. A token may be a whole word or only part of one. This is why responses appear piece by piece in streaming mode.