
What is a Tokenizer?
A tokenizer is the component that converts raw text into tokens: the discrete units that a language model actually processes. Tokenization is the essential first step in any NLP pipeline: before a model can read a sentence, the tokenizer must break it into pieces the model understands.
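As a concrete illustration, here is a minimal sketch of how a tokenizer breaks text into known pieces and maps them to integer IDs. The vocabulary and IDs below are invented for the example, not taken from any real model:

```python
# Hypothetical toy vocabulary; real tokenizers have tens of thousands of entries.
toy_vocab = {"Token": 0, "ization": 1, " is": 2, " fun": 3, "!": 4}

def toy_encode(text):
    """Greedily match the longest vocabulary piece at each position."""
    ids = []
    while text:
        for piece in sorted(toy_vocab, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(toy_vocab[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError(f"no vocabulary piece matches: {text!r}")
    return ids

print(toy_encode("Tokenization is fun!"))  # [0, 1, 2, 3, 4]
```

Note how "Tokenization" is not in the vocabulary, so it gets split into the two pieces "Token" and "ization", which is exactly the behavior that makes models split words in unexpected ways.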
Why It Matters
Tokenization directly affects model performance, cost, and behavior. The way text is split into tokens determines how many tokens a prompt uses (which affects API pricing), how well the model handles different languages, and whether it can process code, URLs, or unusual text correctly. Understanding tokenization explains why some languages cost more to process and why models sometimes split words in unexpected ways.
How It Works
Modern LLMs use subword tokenization, a middle ground between character-level and word-level splitting:
- Training the tokenizer: analyze a large text corpus to find frequently occurring character sequences (subwords). Common words become single tokens; rare words are split into pieces.
- Encoding: convert input text into a sequence of token IDs from the vocabulary.
- Decoding: convert token IDs back to readable text.
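The encoding and decoding steps above can be sketched with a fixed toy vocabulary (the pieces and IDs here are invented for illustration):

```python
# Invented toy vocabulary; a trained tokenizer would learn these pieces from a corpus.
id_to_piece = ["low", "er", "est", "new"]
piece_to_id = {p: i for i, p in enumerate(id_to_piece)}

def encode(pieces):
    """Encoding: map each subword piece to its integer ID."""
    return [piece_to_id[p] for p in pieces]

def decode(ids):
    """Decoding: map IDs back to pieces and join into text."""
    return "".join(id_to_piece[i] for i in ids)

ids = encode(["low", "er"])
print(ids)          # [0, 1]
print(decode(ids))  # lower
```

Decoding is the exact inverse of encoding, so `decode(encode(pieces))` reconstructs the original text.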
Popular tokenization algorithms:
- BPE (Byte Pair Encoding): used by GPT models. Starts from individual characters and iteratively merges the most frequent adjacent pair into a new symbol.
- WordPiece: used by BERT. Similar to BPE, but chooses merges that most increase the likelihood of the training data rather than raw pair frequency.
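The BPE training loop can be sketched in a few lines. This is a simplified toy (words are represented as space-separated symbols, and ties between equally frequent pairs are broken by first occurrence); the corpus below is invented for the example:

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(pair, words):
    """Merge every occurrence of the pair into a single new symbol."""
    return {word.replace(" ".join(pair), "".join(pair)): freq
            for word, freq in words.items()}

# Tiny invented corpus: each word as space-separated characters, with its count.
words = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6, "w i d e s t": 3}

merges = []
for _ in range(3):  # perform three merge steps
    pair = most_frequent_pair(words)
    words = merge_pair(pair, words)
    merges.append("".join(pair))

print(merges)  # ['es', 'est', 'lo']
```

Each merge creates a new subword symbol ("e"+"s" becomes "es", then "es"+"t" becomes "est"), so frequent endings like "est" end up as single tokens while rare words remain split into pieces.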