Core Concepts
Intermediate

What Is Top-p (Nucleus) Sampling?

A decoding method that samples from the smallest set of tokens whose cumulative probability exceeds a threshold p — adapting candidate pool size to model confidence

Also known as:
Nucleus Sampling
Top-p
Nucleus Decoding
What Is Top-p (Nucleus) Sampling? How It Controls LLM Output Diversity

Top-p sampling (also called nucleus sampling) is a decoding strategy that selects the next token from a dynamically sized candidate set: only the smallest group of tokens whose cumulative probability exceeds a threshold p is considered, and the next token is sampled from that group. Unlike top-k sampling, which always draws from a fixed number of candidates regardless of the probability distribution, top-p adapts its candidate pool — when the model is confident, only a few tokens qualify; when it is uncertain, dozens or hundreds may enter the pool. This adaptive behavior makes top-p the default sampling method in most commercial LLM APIs, typically set between 0.9 and 0.95 for general-purpose text generation.
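
In practice this is a single request parameter. A hedged sketch of how it might appear in an OpenAI-style chat request body — the field names mirror the OpenAI API, but the model name is a placeholder and other providers may name things differently:

```python
# Hypothetical request body for an OpenAI-style chat completions endpoint.
# "example-model" is a placeholder, not a real model name.
payload = {
    "model": "example-model",
    "messages": [{"role": "user", "content": "Write a product description."}],
    "top_p": 0.9,        # nucleus threshold: sample from the smallest set covering 90%
    "temperature": 1.0,  # the distribution is reshaped by temperature before top-p applies
}
print(payload["top_p"])  # → 0.9
```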

Why it matters

Top-p sampling solves a fundamental problem with fixed-size candidate pools: the optimal number of candidates varies with every token position. After the phrase "The capital of France is", the model is extremely confident — only one or two tokens are plausible. A fixed top-k of 50 would include dozens of nonsensical tokens and risk incoherent output. Conversely, for a creative continuation like "The sunset looked like a", many tokens are plausible — a top-k of 5 would be too restrictive and produce bland text. Top-p handles both situations automatically by adjusting the pool size to match model confidence. For production applications this means fewer hallucinations in factual contexts (smaller pool) and richer creativity in open-ended contexts (larger pool), all from a single parameter setting.
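
The adaptivity described above can be sketched in a few lines of Python — a toy illustration with made-up probability vectors, not real model output:

```python
import numpy as np

def nucleus_size(probs, p=0.9):
    """Size of the smallest set of tokens whose cumulative probability reaches p."""
    sorted_probs = np.sort(probs)[::-1]      # highest probability first
    cumulative = np.cumsum(sorted_probs)
    return int(np.searchsorted(cumulative, p)) + 1

# Confident distribution ("The capital of France is ..."): one token dominates.
confident = np.array([0.92, 0.03, 0.02, 0.01, 0.01, 0.01])
# Open-ended distribution ("The sunset looked like a ..."): mass spread widely.
creative = np.array([0.12] * 5 + [0.08] * 5)

print(nucleus_size(confident))  # 1 — the nucleus is a single token
print(nucleus_size(creative))   # 9 — nine tokens are needed to cover 90%
```

The same `p` yields a one-token pool in the factual case and a nine-token pool in the creative case; a fixed top-k cannot do both.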

How it works

During inference, the model produces a probability distribution (logits converted through softmax) over its entire vocabulary for the next token. The top-p algorithm sorts these probabilities from highest to lowest, then accumulates them until the running total exceeds p. Only tokens within this cumulative threshold are kept as candidates; the rest are discarded. The remaining probabilities are renormalized to sum to 1.0, and the next token is sampled from this reduced distribution. For example, with p=0.9, if the top three tokens have probabilities 0.7, 0.15, and 0.08 (cumulative: 0.93), only those three tokens are candidates. Top-p is frequently combined with temperature: temperature reshapes the distribution first (making it sharper or flatter), then top-p selects the candidate pool from the reshaped distribution. Most APIs apply temperature before top-p in the generation pipeline.
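
The steps above — softmax, sort, accumulate past p, renormalize, sample — can be sketched as a minimal NumPy implementation. This is an illustration of the algorithm, not the exact code used by any particular API:

```python
import numpy as np

def top_p_sample(logits, p=0.9, temperature=1.0, rng=None):
    """Sample one token id using temperature scaling followed by top-p filtering."""
    rng = rng if rng is not None else np.random.default_rng()
    # Temperature reshapes the distribution first (sharper below 1, flatter above 1).
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                    # numerical stability for softmax
    probs = np.exp(scaled) / np.exp(scaled).sum()
    # Sort descending and accumulate probabilities until the running total reaches p.
    order = np.argsort(probs)[::-1]
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # smallest nucleus covering p
    nucleus = order[:cutoff]
    # Renormalize the surviving probabilities to sum to 1.0, then sample.
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()
    return int(rng.choice(nucleus, p=nucleus_probs))

# With the article's example distribution (0.7, 0.15, 0.08, ...) and p=0.9,
# only the top three tokens can ever be drawn.
logits = np.log(np.array([0.7, 0.15, 0.08, 0.05, 0.02]))
samples = {top_p_sample(logits, p=0.9, rng=np.random.default_rng(i)) for i in range(200)}
print(samples)  # a subset of {0, 1, 2}
```

Note the ordering: temperature is applied to the logits before the nucleus is selected, matching the pipeline described above.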

Example

A legal-tech company configures its contract review assistant with top-p set to 0.85 for clause analysis. When the model identifies a standard indemnification clause, confidence is high — the nucleus contains only 3-4 tokens at each position, producing precise, predictable legal language. When the same system generates a risk summary requiring nuanced judgment, confidence is lower and the nucleus expands to 20-30 tokens, allowing more varied and contextually appropriate phrasing. A competing system using fixed top-k=40 produces the opposite pattern: the clause analysis sometimes includes bizarre word choices from low-probability tokens, while the risk summaries feel repetitive because 40 candidates are too few for genuinely diverse expression. By switching to top-p, the legal-tech company reduces clause-analysis errors by 25% and improves the readability scores of risk summaries by 18% — without any prompt changes or model retraining.



Related Concepts

Temperature in AI
A parameter controlling the randomness of LLM output — lower values produce consistent results, higher values increase creativity
Token in AI
The smallest unit of text an LLM processes — approximately 4 characters or 0.75 words
AI Inference
The process of running a trained LLM to generate output from input

