
What is a Tokenizer?

A tokenizer converts raw text into tokens — the discrete units a language model processes — using subword algorithms like BPE or SentencePiece.

Also known as: tokenization, tokenisatie (Dutch), BPE tokenizer, subword tokenization

A tokenizer is the component that converts raw text into tokens — the discrete units that a language model actually processes. Tokenization is the essential first step in any NLP pipeline: before a model can read a sentence, the tokenizer must break it into pieces the model understands.

Why It Matters

Tokenization directly affects model performance, cost, and behavior. The way text is split into tokens determines how many tokens a prompt uses (which affects API pricing), how well the model handles different languages, and whether it can process code, URLs, or unusual text correctly. Understanding tokenization explains why some languages cost more to process and why models sometimes split words in unexpected ways.

How It Works

Modern LLMs use subword tokenization — a middle ground between character-level and word-level splitting:

  1. Training the tokenizer — analyze a large text corpus to find frequently occurring character sequences (subwords). Common words become single tokens; rare words are split into pieces.
  2. Encoding — convert input text into a sequence of token IDs from the vocabulary.
  3. Decoding — convert token IDs back to readable text.
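
The encoding and decoding steps can be sketched with a toy vocabulary. This is a minimal illustration, not how any particular model's tokenizer is implemented: the vocabulary, the `<unk>` fallback token, and the function names are all hypothetical, and the greedy longest-match strategy is a simplification (real BPE encoders apply learned merge rules instead).

```python
# Toy vocabulary (hypothetical, not taken from any real model).
# The list index of each token is its token ID.
TOY_VOCAB = ["<unk>", "Token", "ization", " is", " important"]
TOKEN_TO_ID = {token: idx for idx, token in enumerate(TOY_VOCAB)}


def encode(text):
    """Convert text to token IDs using greedy longest-match lookup."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest remaining substring first, then shorter ones.
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos = end
                break
        else:
            # No vocabulary entry starts here: emit <unk> and skip one character.
            ids.append(TOKEN_TO_ID["<unk>"])
            pos += 1
    return ids


def decode(ids):
    """Convert token IDs back to text by concatenating their strings."""
    return "".join(TOY_VOCAB[i] for i in ids)


ids = encode("Tokenization is important")
print(ids)          # [1, 2, 3, 4]
print(decode(ids))  # Tokenization is important
```

Note that the round trip is only lossless when every character is covered by the vocabulary; anything mapped to `<unk>` in this sketch cannot be recovered on decoding.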

Popular tokenization algorithms:

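  • BPE (Byte Pair Encoding): used by GPT models. Starts from individual characters or bytes and iteratively merges the most frequent adjacent pair of symbols into a new token.
  • WordPiece: used by BERT. Similar to BPE, but picks the merge that most increases the likelihood of the training data rather than simply the most frequent pair.
  • SentencePiece: a language-agnostic tokenizer library used by LLaMA and T5. It works on raw text without requiring pre-split words, treats whitespace as an ordinary symbol, and implements both BPE and unigram language model tokenization.
  • Tiktoken: OpenAI's fast BPE implementation (a library rather than a separate algorithm), used for GPT-3.5/4.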
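
To make the BPE merge loop concrete, here is a minimal sketch of the training step on a tiny word list. It is a character-level toy, not the byte-level implementation used by production tokenizers; the function name `train_bpe` and the tie-breaking rule (the first pair seen wins) are assumptions made for illustration.

```python
from collections import Counter


def train_bpe(words, num_merges):
    """Learn up to num_merges BPE merge rules from a list of words."""
    # Represent each distinct word as a tuple of symbols, with its frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # Every word is already a single symbol.
        # Pick the most frequent pair (ties go to the pair seen first).
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite the corpus with the chosen pair fused into one symbol.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged = []
            i = 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges, corpus


merges, corpus = train_bpe(["low", "low", "lower", "lowest"], num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
```

After two merges the frequent word "low" has become a single symbol, while "lower" and "lowest" are still split into "low" plus individual characters. This mirrors the behavior described above: common words collapse into one token and rarer ones stay in pieces.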

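Vocabulary sizes vary: GPT-4's tokenizer has a vocabulary of roughly 100K tokens, while LLaMA 2's has about 32K. A larger vocabulary tends to cover more languages and domains with fewer tokens per text, but it requires more memory because the model's embedding and output layers grow with the vocabulary size.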
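
As a rough illustration of the memory trade-off, the size of a model's token embedding table can be estimated as vocabulary size times embedding width times bytes per weight. The embedding width of 4096 and the 16-bit (2-byte) weights below are assumptions chosen for the arithmetic, not published figures for any specific model, and the helper name is hypothetical.

```python
def embedding_table_bytes(vocab_size, hidden_dim, bytes_per_param=2):
    """Estimate the memory used by one token embedding table."""
    return vocab_size * hidden_dim * bytes_per_param


MIB = 1024 * 1024

# Assumed embedding width of 4096 and 2 bytes per weight for both cases.
small = embedding_table_bytes(32_000, 4096)
large = embedding_table_bytes(100_000, 4096)

print(small / MIB)  # 250.0
print(large / MIB)  # 781.25
```

Under these assumptions, moving from a 32K to a 100K vocabulary roughly triples the embedding table, and a model with a separate output projection pays that cost a second time.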

Example

The sentence "Tokenization is important" might be tokenized as ["Token", "ization", " is", " important"]: four tokens for three words. The single uncommon word "onomatopoeia" might become ["on", "omat", "opo", "eia"], which is also four tokens. This is why rare or non-English words cost more tokens: the vocabulary has no single entry for them, so the tokenizer falls back to several smaller pieces.

Related

See also: Token, Large Language Model, Context Window, Token Economics


© 2026 BVDNET. All rights reserved.

Privacy Policy•Terms of Service•Cookie Policy