
What Is Top-p (Nucleus) Sampling?

A decoding method that samples from the smallest set of tokens whose cumulative probability exceeds a threshold p — adapting candidate pool size to model confidence

Also known as:
Nucleus Sampling
Top-p
Nucleus Decoding
What Is Top-p (Nucleus) Sampling? How It Controls LLM Output Diversity

Top-p sampling (also called nucleus sampling) is a decoding strategy that selects the next token from a dynamically sized candidate set: only the smallest group of tokens whose cumulative probability reaches or exceeds a threshold p is considered, and the final token is sampled from that group. Unlike top-k sampling, which always draws from a fixed number of candidates regardless of the shape of the probability distribution, top-p adapts its candidate pool: when the model is confident, only a few tokens qualify; when it is uncertain, dozens or hundreds may enter the pool. This adaptive behavior is why most commercial LLM APIs expose top-p as a sampling parameter; values between 0.9 and 0.95 are common choices for general-purpose text generation.
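
The definition above can be written compactly. The following is a sketch of the usual formal statement; the symbols (V for the vocabulary, x_{1:i-1} for the tokens generated so far, V^{(p)} for the nucleus, P' for the renormalized distribution) are notation assumed here, not taken from this page. It uses "at least p" as the cutoff, and implementations differ on whether the boundary is inclusive.

```latex
% V: vocabulary; P(x | x_{1:i-1}): the model's next-token distribution.
% The nucleus is the smallest set of tokens whose total probability reaches p.
V^{(p)} = \operatorname*{arg\,min}_{V' \subseteq V} |V'|
\quad \text{subject to} \quad
\sum_{x \in V'} P(x \mid x_{1:i-1}) \ge p

% Sampling then uses the distribution renormalized over the nucleus.
P'(x \mid x_{1:i-1}) =
\begin{cases}
  \dfrac{P(x \mid x_{1:i-1})}{\sum_{x' \in V^{(p)}} P(x' \mid x_{1:i-1})} & \text{if } x \in V^{(p)} \\[1ex]
  0 & \text{otherwise}
\end{cases}
```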

Why it matters

Top-p sampling solves a fundamental problem with fixed-size candidate pools: the optimal number of candidates varies with every token position. After the phrase "The capital of France is", the model is extremely confident, and only one or two tokens are plausible. A fixed top-k of 50 would still admit dozens of implausible tokens and risk incoherent output. Conversely, for a creative continuation like "The sunset looked like a", many tokens are plausible, and a top-k of 5 would be too restrictive and produce bland text. Top-p handles both situations automatically by adjusting the pool size to match model confidence. For production applications this tends to mean fewer implausible tokens in factual contexts (smaller pool) and more varied output in open-ended contexts (larger pool), all from a single parameter setting.
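
To make the contrast with a fixed pool concrete, here is a minimal sketch that counts how many tokens top-p would keep for two next-token distributions. The function name `nucleus_size` and both distributions are hypothetical and chosen only for illustration; they are not real model outputs.

```python
def nucleus_size(probs, p):
    """Count how many of the most probable tokens are needed to reach cumulative mass p."""
    total = 0.0
    for count, prob in enumerate(sorted(probs, reverse=True), start=1):
        total += prob
        if total >= p:
            return count
    # p was not reached (for example, because of rounding): keep every token.
    return len(probs)


# Hypothetical distributions over an 8-token vocabulary (illustrative values only).
confident = [0.92, 0.04, 0.02, 0.01, 0.005, 0.003, 0.001, 0.001]  # one dominant token
uncertain = [0.125] * 8                                           # all tokens equally likely

print(nucleus_size(confident, p=0.9))  # 1
print(nucleus_size(uncertain, p=0.9))  # 8
```

A fixed top-k of 5 would keep five candidates in both cases: four more than the confident distribution needs, and three fewer than the uncertain one offers.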

How it works

During inference, the model produces a probability distribution over its entire vocabulary for the next token (logits converted to probabilities through softmax). The top-p algorithm sorts these probabilities from highest to lowest, then accumulates them until the running total reaches or exceeds p. The tokens accumulated up to and including the one that crosses the threshold are kept as candidates; the rest are discarded. The remaining probabilities are renormalized to sum to 1.0, and the next token is sampled from this reduced distribution. For example, with p=0.9, if the top three tokens have probabilities 0.7, 0.15, and 0.08, the cumulative total is 0.85 after two tokens (still below p) and 0.93 after three, so only those three tokens are candidates. Top-p is frequently combined with temperature: temperature reshapes the distribution first (making it sharper or flatter), then top-p selects the candidate pool from the reshaped distribution. Most implementations apply temperature before top-p in the generation pipeline.
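
The steps described above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the code of any particular library or API: the function names (`softmax`, `top_p_filter`, `sample_top_p`) are hypothetical, the cutoff treats "reaches p" as inclusive, and the last two probabilities in the usage example (0.04 and 0.03) are made up to complete the distribution from the worked example.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, applying temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs, p):
    """Return {token_index: renormalized_probability} for the nucleus."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:  # the token that crosses the threshold is included
            break
    return {i: probs[i] / cumulative for i in kept}


def sample_top_p(logits, p=0.9, temperature=1.0, rng=random):
    """Temperature, then top-p filtering, then one weighted draw."""
    nucleus = top_p_filter(softmax(logits, temperature), p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Worked example from the text: 0.7 + 0.15 + 0.08 = 0.93 >= 0.9.
probs = [0.7, 0.15, 0.08, 0.04, 0.03]  # last two values are assumed
nucleus = top_p_filter(probs, p=0.9)
print(sorted(nucleus))  # [0, 1, 2]
```

After filtering, the three surviving probabilities are divided by 0.93, so they become roughly 0.753, 0.161, and 0.086 and again sum to 1.0. Passing a seeded `random.Random` instance as `rng` makes the draw reproducible.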

Example

As an illustrative scenario, consider a legal-tech company that configures its contract review assistant with top-p set to 0.85 for clause analysis. When the model identifies a standard indemnification clause, confidence is high: the nucleus contains only 3-4 tokens at each position, producing precise, predictable legal language. When the same system generates a risk summary requiring nuanced judgment, confidence is lower and the nucleus expands to 20-30 tokens, allowing more varied and contextually appropriate phrasing. A competing system using fixed top-k=40 produces the opposite pattern: the clause analysis sometimes includes bizarre word choices from low-probability tokens, while the risk summaries feel repetitive because 40 candidates are too few for genuinely diverse expression. In this scenario, switching to top-p reduces clause-analysis errors by 25% and improves the readability scores of risk summaries by 18%, without any prompt changes or model retraining.


© 2026 BVDNET. All rights reserved.
