
What is Beam Search?
Beam search is a text generation strategy that explores multiple possible output sequences simultaneously, keeping the top-k most promising candidates ("beams") at each step. Unlike greedy decoding (which always picks the most likely next token), beam search considers that the globally best sequence might not start with the locally most likely token.
Why It Matters
The choice of generation strategy significantly affects output quality. Greedy decoding can miss better sequences, pure sampling can produce incoherent text, and beam search offers a deterministic middle ground that finds high-probability sequences. Understanding generation strategies explains why the same model produces different outputs depending on settings.
How It Works
Generation strategies compared:
1. Greedy decoding:
- Always select the highest-probability token at each step
- Fast but can miss globally optimal sequences
- Example: "The" → "cat" → "is" might be better than "The" → "big" → "cat" → "is" but greedy picks "big" if it's locally most likely
2. Beam search:
- Maintain k parallel hypotheses (beams), typically k=4-8
- At each step, expand each beam with all possible next tokens
- Keep only the top-k scoring sequences (by total log-probability)
- Continue until all beams produce an end token
- Return the highest-scoring complete sequence
- Deterministic — same input always produces same output
3. Sampling-based (nucleus/top-p, top-k, temperature):
- Randomly sample from the probability distribution
- More creative and diverse outputs
- Non-deterministic
- Preferred for creative tasks and chat
When to use beam search:
- Machine translation (finding the best translation)
- Summarization (coherent, faithful summaries)
- Speech recognition (finding most likely transcript)
- Any task where you want the single "best" output
When NOT to use beam search:
- Creative writing, brainstorming, chat
- When diversity is desired
- Long-form generation (beam search tends to produce repetitive text)
Beam search variations:
- Length normalization — divide score by sequence length to avoid short-sequence bias
- Diverse beam search — penalize beams that are too similar
- Constrained beam search — require certain tokens to appear in the output
Example
Translating "The weather is nice" to Dutch with beam width 3: the model explores three parallel paths simultaneously. Beam 1: "Het weer is mooi." Beam 2: "Het weer is aangenaam." Beam 3: "Het weer is fijn." After scoring all complete sequences, beam search returns the highest-probability translation.