
Top-P Sampling (also called nucleus sampling) is a decoding strategy that selects the next token from a dynamically sized candidate set: only the smallest group of tokens whose cumulative probability exceeds a threshold p is considered, and the final token is sampled from that group. Unlike top-k sampling, which always draws from a fixed number of candidates regardless of the probability distribution, top-p adapts its candidate pool — when the model is confident, only a few tokens qualify; when it is uncertain, dozens or hundreds may enter the pool. This adaptive behavior makes top-p the default sampling method for most commercial LLM APIs, typically set between 0.9 and 0.95 for general-purpose text generation.
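The mechanism can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular API: the function names and the five-token toy distribution are assumptions made for the example.

```python
import random

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalize so the kept probabilities sum to 1."""
    # Sort token indices by probability, highest first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= p:  # the nucleus is complete
            break
    # Renormalize the surviving probabilities.
    return {i: probs[i] / total for i in kept}

def sample_top_p(probs, p=0.9, rng=random):
    nucleus = top_p_filter(probs, p)
    tokens, weights = zip(*nucleus.items())
    return rng.choices(tokens, weights=weights, k=1)[0]

# Toy distribution over a five-token vocabulary (illustrative values):
# a confident model concentrates mass on few tokens, so the nucleus is small.
probs = [0.05, 0.7, 0.15, 0.08, 0.02]
```

With p=0.9, only tokens 1, 2, and 3 survive here (0.7 + 0.15 + 0.08 = 0.93 is the first running total to reach 0.9); the other two are discarded regardless of how many candidates a fixed top-k would have kept.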
Why it matters
Top-P sampling solves a fundamental problem with fixed-size candidate pools: the optimal number of candidates varies with every token position. After the phrase "The capital of France is", the model is extremely confident — only one or two tokens are plausible. A fixed top-k of 50 would include dozens of nonsensical tokens and risk incoherent output. Conversely, for a creative continuation like "The sunset looked like a", many tokens are plausible — a top-k of 5 would be too restrictive and produce bland text. Top-p handles both situations automatically by adjusting the pool size to match model confidence. For production applications this means fewer hallucinations in factual contexts (smaller pool) and richer creativity in open-ended contexts (larger pool), all from a single parameter setting.
How it works
During inference, the model produces a probability distribution over its entire vocabulary for the next token (logits passed through a softmax). The top-p algorithm sorts these probabilities from highest to lowest, then accumulates them until the running total reaches or exceeds p. Only the tokens inside this cumulative threshold (the "nucleus") are kept as candidates; the rest are discarded. The remaining probabilities are renormalized to sum to 1.0, and the next token is sampled from this reduced distribution. For example, with p=0.9, if the top three tokens have probabilities 0.7, 0.15, and 0.08 (cumulative: 0.93), only those three tokens are candidates. Top-p is frequently combined with temperature: temperature reshapes the distribution first (making it sharper or flatter), then top-p selects the candidate pool from the reshaped distribution. Most APIs apply temperature before top-p in the generation pipeline.
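The full pipeline (temperature first, then top-p) can be sketched as follows. The function names and the four-token logits are illustrative assumptions, not drawn from any real library:

```python
import math
import random

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities; temperature reshapes the
    distribution first (T < 1 sharpens it, T > 1 flattens it)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def nucleus(probs, top_p):
    """Indices of the smallest set of highest-probability tokens whose
    cumulative probability reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, total = [], 0.0
    for i in order:
        kept.append(i)
        total += probs[i]
        if total >= top_p:
            break
    return kept

def sample(logits, temperature=1.0, top_p=0.9, rng=random):
    # Temperature is applied first; top-p then filters the reshaped
    # distribution, matching the pipeline order described above.
    probs = softmax(logits, temperature)
    kept = nucleus(probs, top_p)
    weights = [probs[i] for i in kept]  # choices() renormalizes weights
    return rng.choices(kept, weights=weights, k=1)[0]
```

With the toy logits [4.0, 2.5, 1.8, 0.0] at temperature 1.0, the nucleus for top_p=0.9 contains only two tokens; raising the temperature to 5.0 flattens the distribution enough that all four tokens qualify, showing how the pool adapts to model confidence.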
Example
A legal-tech company configures its contract review assistant with top-p set to 0.85 for clause analysis. When the model identifies a standard indemnification clause, confidence is high — the nucleus contains only 3-4 tokens at each position, producing precise, predictable legal language. When the same system generates a risk summary requiring nuanced judgment, confidence is lower and the nucleus expands to 20-30 tokens, allowing more varied and contextually appropriate phrasing. A competing system using fixed top-k=40 produces the opposite pattern: the clause analysis sometimes includes bizarre word choices from low-probability tokens, while the risk summaries feel repetitive because 40 candidates are too few for genuinely diverse expression. By switching to top-p, the legal-tech company reduces clause-analysis errors by 25% and improves the readability scores of risk summaries by 18% — without any prompt changes or model retraining.