
What is Autoregressive Generation?
Autoregressive generation is the method by which large language models produce text: they generate one token at a time, where each new token is predicted based on all previously generated tokens. The model's output feeds back as input for the next step, creating a sequential chain of predictions.
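The feedback loop described above can be sketched in a few lines of Python. This is a toy illustration only: `next_token` here is a hypothetical stand-in that "predicts" the next word of a memorized phrase, whereas a real model would run a neural network forward pass.

```python
# Toy illustration of the autoregressive feedback loop.
# `next_token` is a hypothetical stand-in for a real model:
# it simply recalls the next word of a memorized phrase.
PHRASE = ["the", "cat", "sat", "on", "the", "mat"]

def next_token(tokens):
    # The prediction is conditioned on the whole sequence so far
    # (here, trivially, on its length).
    i = len(tokens)
    return PHRASE[i] if i < len(PHRASE) else "<eos>"

tokens = ["the"]  # the prompt
while tokens[-1] != "<eos>":
    # Each generated token is appended and becomes part of the
    # input for the next prediction step.
    tokens.append(next_token(tokens))

print(" ".join(tokens[:-1]))  # -> "the cat sat on the mat"
```

The essential point is the loop structure: the model's own output is fed back as input, one token per iteration.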
Why It Matters
Understanding autoregressive generation explains fundamental LLM behaviors: why responses stream in word by word, why longer outputs take longer (and cost more), why models can "lose the thread" in long responses, and why techniques like KV-cache and speculative decoding exist to speed up generation. It's the core mechanism behind every GPT, Claude, Gemini, and LLaMA response.
How It Works
- Input processing – the model receives the full prompt and encodes it using self-attention (the "prefill" phase).
- Token prediction – the model predicts a probability distribution over its entire vocabulary for the next token.
- Sampling – one token is selected from this distribution (using temperature, top-p, or other sampling strategies).
- Feedback – the selected token is appended to the sequence, and the model uses the extended sequence to predict the next token.
- Repeat – steps 2–4 continue until the model produces a stop token or reaches a length limit.
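The steps above can be put together into a minimal generation loop. This is a hedged sketch, not a real LLM: `model_logits` is a hypothetical mock that returns one logit per vocabulary entry, and sampling uses a plain temperature-scaled softmax (top-p filtering would be an extra step applied to the probabilities before drawing).

```python
import math
import random

random.seed(0)

VOCAB = ["hello", "world", "!", "<eos>"]

def model_logits(tokens):
    # Hypothetical stand-in for a model forward pass: one logit per
    # vocabulary entry, conditioned on the sequence so far.
    n = len(tokens)
    return [1.0 if i == min(n, len(VOCAB) - 1) else 0.0
            for i in range(len(VOCAB))]

def sample(logits, temperature=1.0):
    # Temperature-scaled softmax, then draw one token id.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    r = random.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1

def generate(prompt, max_tokens=10):
    tokens = list(prompt)                  # step 1: process the prompt
    for _ in range(max_tokens):            # length limit
        logits = model_logits(tokens)      # step 2: predict distribution
        tok = VOCAB[sample(logits, temperature=0.7)]  # step 3: sample
        if tok == "<eos>":                 # stop token ends generation
            break
        tokens.append(tok)                 # step 4: feedback
    return tokens
```

Note that the loop body cannot start predicting token *n+1* until token *n* has been sampled and appended, which is exactly the sequential dependency discussed below.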
This process is inherently sequential – each token depends on all previous tokens, so generation cannot be parallelized across output tokens. This is why: