Core Concepts · Intermediate

What Is Top-p (Nucleus) Sampling?

A decoding method that samples from the smallest set of tokens whose cumulative probability exceeds a threshold p — adapting candidate pool size to model confidence

Also known as:
Nucleus Sampling
Top-p
Nucleus Decoding

Top-p sampling (also called nucleus sampling) is a decoding strategy that selects the next token from a dynamically sized candidate set: only the smallest group of tokens whose cumulative probability exceeds a threshold p is considered, and the next token is sampled from that group. Unlike top-k sampling, which always draws from a fixed number of candidates regardless of the probability distribution, top-p adapts its candidate pool — when the model is confident, only a few tokens qualify; when it is uncertain, dozens or hundreds may enter the pool. This adaptive behavior has made top-p the default sampling method in most commercial LLM APIs, typically set between 0.9 and 0.95 for general-purpose text generation.
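In practice this is a single request parameter. An OpenAI-style request body might look like the following sketch (the top_p and temperature field names come from the OpenAI API reference; the model name and message are placeholders):

```json
{
  "model": "gpt-4o",
  "messages": [
    { "role": "user", "content": "Explain nucleus sampling in one sentence." }
  ],
  "top_p": 0.9,
  "temperature": 1.0
}
```

The OpenAI API reference suggests altering top_p or temperature, but generally not both at once, since the two interact.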

Why it matters

Top-p sampling solves a fundamental problem with fixed-size candidate pools: the optimal number of candidates varies with every token position. After the phrase "The capital of France is", the model is extremely confident — only one or two tokens are plausible. A fixed top-k of 50 would include dozens of nonsensical tokens and risk incoherent output. Conversely, for a creative continuation like "The sunset looked like a", many tokens are plausible — a top-k of 5 would be too restrictive and produce bland text. Top-p handles both situations automatically by adjusting the pool size to match model confidence. For production applications this means fewer hallucinations in factual contexts (smaller pool) and richer creativity in open-ended contexts (larger pool), all from a single parameter setting.
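This adaptive behavior can be seen directly by counting how many tokens survive the cumulative cutoff for a peaked versus a flat distribution. A minimal sketch: the nucleus_size helper and the two toy distributions below are illustrative, not from any particular library.

```python
import numpy as np

def nucleus_size(probs, p=0.9):
    """Size of the smallest set of tokens whose cumulative probability reaches p."""
    cumulative = np.cumsum(np.sort(probs)[::-1])  # sort descending, then accumulate
    return int(np.searchsorted(cumulative, p)) + 1

# Confident context ("The capital of France is ..."): one token dominates.
confident = np.array([0.95, 0.02, 0.01, 0.01, 0.01])
# Uncertain context ("The sunset looked like a ..."): mass spread over 100 tokens.
uncertain = np.full(100, 0.01)

print(nucleus_size(confident, p=0.9))  # tiny pool: the top token alone reaches p
print(nucleus_size(uncertain, p=0.9))  # large pool: roughly 90 tokens needed
```

A fixed top-k would use the same pool size in both cases; here the pool shrinks to a single token when the model is confident and expands to most of the vocabulary when it is not.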

How it works

During inference, the model produces a probability distribution (logits converted through softmax) over its entire vocabulary for the next token. The top-p algorithm sorts these probabilities from highest to lowest, then accumulates them until the running total exceeds p. Only tokens within this cumulative threshold are kept as candidates; the rest are discarded. The remaining probabilities are renormalized to sum to 1.0, and the next token is sampled from this reduced distribution. For example, with p=0.9, if the top three tokens have probabilities 0.7, 0.15, and 0.08 (cumulative: 0.93), only those three tokens are candidates. Top-p is frequently combined with temperature: temperature reshapes the distribution first (making it sharper or flatter), then top-p selects the candidate pool from the reshaped distribution. Most APIs apply temperature before top-p in the generation pipeline.
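The full pipeline described above (temperature first, then the cumulative cutoff, then renormalization) can be sketched in a few lines of NumPy. top_p_sample is an illustrative helper under those assumptions, not a library function:

```python
import numpy as np

def top_p_sample(logits, p=0.9, temperature=1.0, rng=None):
    """Sample one token index with top-p (nucleus) sampling.

    Temperature reshapes the distribution first; then the smallest set of
    tokens whose cumulative probability reaches p is kept, renormalized,
    and sampled from.
    """
    rng = rng if rng is not None else np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled -= scaled.max()                      # numerical stability for softmax
    probs = np.exp(scaled) / np.exp(scaled).sum()
    order = np.argsort(probs)[::-1]             # token indices, most probable first
    cumulative = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cumulative, p)) + 1  # smallest prefix reaching p
    nucleus = order[:cutoff]
    nucleus_probs = probs[nucleus] / probs[nucleus].sum()  # renormalize to 1.0
    return int(rng.choice(nucleus, p=nucleus_probs))
```

With the article's example distribution (0.7, 0.15, 0.08, plus small tail probabilities) and p = 0.9, the cumulative totals are 0.7, 0.85, and 0.93, so exactly three tokens form the nucleus and every sample comes from those three.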

Example

A legal-tech company configures its contract review assistant with top-p set to 0.85 for clause analysis. When the model identifies a standard indemnification clause, confidence is high — the nucleus contains only 3-4 tokens at each position, producing precise, predictable legal language. When the same system generates a risk summary requiring nuanced judgment, confidence is lower and the nucleus expands to 20-30 tokens, allowing more varied and contextually appropriate phrasing. A competing system using fixed top-k=40 produces the opposite pattern: the clause analysis sometimes includes bizarre word choices from low-probability tokens, while the risk summaries feel repetitive because 40 candidates are too few for genuinely diverse expression. In this illustrative scenario, switching to top-p reduces clause-analysis errors by 25% and improves the readability scores of risk summaries by 18% — without any prompt changes or model retraining.

Sources

  1. Holtzman et al. — "The Curious Case of Neural Text Degeneration" (arXiv)
  2. OpenAI API Reference — top_p parameter
  3. Wikipedia


Related Concepts

Token in AI
The smallest unit of text an LLM processes — approximately 4 characters or 0.75 words
AI Inference
The process of running a trained LLM to generate output from input
Neural Network
A network of interconnected artificial neurons that learns patterns from data — the foundational architecture behind all modern AI
Prompt
The input text or instructions given to an LLM to generate a response
