What is Beam Search?

Beam search is a text generation strategy that explores multiple possible output sequences simultaneously, keeping the top-k most promising candidates ("beams") at each step. Unlike greedy decoding (which always picks the most likely next token), beam search considers that the globally best sequence might not start with the locally most likely token.

Why It Matters

The choice of generation strategy significantly affects output quality. Greedy decoding can miss better sequences, pure sampling can produce incoherent text, and beam search offers a deterministic middle ground that finds high-probability sequences. Understanding generation strategies explains why the same model produces different outputs depending on settings.

How It Works

Generation strategies compared:

1. Greedy decoding:

Always select the highest-probability token at each step
Fast but can miss globally optimal sequences
Example: "The" → "cat" → "is" might be better than "The" → "big" → "cat" → "is" but greedy picks "big" if it's locally most likely

2. Beam search:

Maintain k parallel hypotheses (beams), typically k=4-8
At each step, expand each beam with all possible next tokens
Keep only the top-k scoring sequences (by total log-probability)
Continue until all beams produce an end token
Return the highest-scoring complete sequence
Deterministic — same input always produces same output

3. Sampling-based (nucleus/top-p, top-k, temperature):

Randomly sample from the probability distribution
More creative and diverse outputs
Non-deterministic
Preferred for creative tasks and chat

When to use beam search:

Machine translation (finding the best translation)
Summarization (coherent, faithful summaries)
Speech recognition (finding most likely transcript)
Any task where you want the single "best" output

When NOT to use beam search:

Creative writing, brainstorming, chat
When diversity is desired
Long-form generation (beam search tends to produce repetitive text)

Beam search variations:

Length normalization — divide score by sequence length to avoid short-sequence bias
Diverse beam search — penalize beams that are too similar
Constrained beam search — require certain tokens to appear in the output

Example

Translating "The weather is nice" to Dutch with beam width 3: the model explores three parallel paths simultaneously. Beam 1: "Het weer is mooi." Beam 2: "Het weer is aangenaam." Beam 3: "Het weer is fijn." After scoring all complete sequences, beam search returns the highest-probability translation.