
What is a Tokenizer?
A tokenizer is the component that converts raw text into tokens: the discrete units that a language model actually processes. Tokenization is the essential first step in any NLP pipeline: before a model can read a sentence, the tokenizer must break it into pieces the model understands.
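As a concrete illustration, here is a minimal sketch of how a tokenizer breaks text into known pieces and maps them to integer IDs. The vocabulary and IDs below are invented for the example, not taken from any real model:

```python
# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
toy_vocab = {"Token": 0, "ization": 1, " is": 2, " fun": 3, "!": 4}

def toy_encode(text):
    """Greedily match the longest vocabulary piece at each position."""
    ids = []
    while text:
        for piece in sorted(toy_vocab, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(toy_vocab[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError(f"no vocabulary piece matches: {text!r}")
    return ids

print(toy_encode("Tokenization is fun!"))  # [0, 1, 2, 3, 4]
```

Note how "Tokenization" is not in the vocabulary, so it gets split into the two pieces "Token" and "ization", which is exactly the behavior that makes models split words in unexpected ways.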
Why It Matters
Tokenization directly affects model performance, cost, and behavior. The way text is split into tokens determines how many tokens a prompt uses (which affects API pricing), how well the model handles different languages, and whether it can process code, URLs, or unusual text correctly. Understanding tokenization explains why some languages cost more to process and why models sometimes split words in unexpected ways.
How It Works
Modern LLMs use subword tokenization, a middle ground between character-level and word-level splitting:
- Training the tokenizer: analyze a large text corpus to find frequently occurring character sequences (subwords). Common words become single tokens; rare words are split into pieces.
- Encoding: convert input text into a sequence of token IDs from the vocabulary.
- Decoding: convert token IDs back to readable text.
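The encoding and decoding steps above can be sketched with a fixed toy vocabulary (the pieces and IDs here are invented for illustration):

```python
# Invented toy vocabulary; a trained tokenizer would learn these pieces from a corpus.
id_to_piece = ["low", "er", "est", "new"]
piece_to_id = {p: i for i, p in enumerate(id_to_piece)}

def encode(pieces):
    """Encoding: map each subword piece to its integer ID."""
    return [piece_to_id[p] for p in pieces]

def decode(ids):
    """Decoding: map IDs back to pieces and join into text."""
    return "".join(id_to_piece[i] for i in ids)

ids = encode(["low", "er"])
print(ids)          # [0, 1]
print(decode(ids))  # lower
```

Decoding is the exact inverse of encoding, so `decode(encode(pieces))` reconstructs the original text.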
Popular tokenization algorithms:
- BPE (Byte Pair Encoding): used by GPT models. Starts from individual characters and iteratively merges the most frequent adjacent pair into a new symbol.
- WordPiece: used by BERT. Similar to BPE, but chooses merges that most increase the likelihood of the training data rather than raw pair frequency.
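The BPE training loop can be sketched in a few lines. This is a simplified toy (words are represented as space-separated symbols, and ties between equally frequent pairs are broken by first occurrence); the corpus below is invented for the example:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(pair, words):
    """Merge every occurrence of the pair into a single new symbol."""
    return {word.replace(" ".join(pair), "".join(pair)): freq
            for word, freq in words.items()}

# Tiny invented corpus: each word as space-separated characters, with its count.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(3):  # perform three merge steps
    pair = most_frequent_pair(words)
    words = merge_pair(pair, words)
    merges.append("".join(pair))

print(merges)  # ['es', 'est', 'lo']
```

Each merge creates a new subword symbol ("e"+"s" becomes "es", then "es"+"t" becomes "est"), so frequent endings like "est" end up as single tokens while rare words remain split into pieces.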