
What is Positional Encoding?
Positional encoding is a mechanism that gives transformer models information about the order and position of tokens in a sequence. Because transformers process all tokens in parallel (unlike RNNs, which process them one at a time), they need an explicit signal to know that "the cat sat on the mat" is different from "the mat sat on the cat."
Why It Matters
Without positional encoding, a transformer would treat text as a bag of words and lose any meaning that depends on word order. The choice of positional encoding also strongly influences how long a context a model can handle and how well it generalizes to long sequences. Methods such as RoPE and ALiBi, together with RoPE scaling techniques, helped make the jump from 2K to 128K+ context windows possible.
How It Works
The problem:
- Self-attention computes relationships between all token pairs
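- But attention on its own is permutation-equivariant: reordering the input tokens only reorders the outputs, so it has no inherent notion of token order
- "Dog bites man" and "Man bites dog" would produce the same representation for each word, just in a different order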
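The order-blindness described above can be checked numerically. The following is a minimal NumPy sketch, assuming a single attention head with random weights and no mask; the function name `self_attention` and the three random vectors standing in for "dog", "bites", "man" are illustrative, not taken from any particular model.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(3, d))   # stand-ins for "dog", "bites", "man"
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = [2, 1, 0]              # "man", "bites", "dog"
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Reordering the input only reorders the output rows.
same_up_to_order = np.allclose(out[perm], out_perm)
```

Each word ends up with the same vector in both sentences, which is why some positional signal has to be injected.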
Solutions:
1. Sinusoidal positional encoding (original transformer):
- Add fixed sine/cosine waves of different frequencies to token embeddings
- Each position gets a unique pattern; nearby positions have similar patterns
- Fixed (not learned): the formula can be evaluated at any position, but models tend to generalize poorly beyond the lengths seen during training
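A minimal NumPy sketch of the sine/cosine scheme, assuming the commonly used base of 10000 and an even model dimension. The function name `sinusoidal_encoding` and the sizes are illustrative choices.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a (num_positions, d_model) table; d_model is assumed even."""
    positions = np.arange(num_positions)[:, None]           # (P, 1)
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)   # (d_model / 2,)
    angles = positions * freqs                              # (P, d_model / 2)
    table = np.zeros((num_positions, d_model))
    table[:, 0::2] = np.sin(angles)   # even dimensions
    table[:, 1::2] = np.cos(angles)   # odd dimensions
    return table

pe = sinusoidal_encoding(num_positions=128, d_model=64)

# Nearby positions have more similar patterns than distant ones.
near = pe[10] @ pe[11]
far = pe[10] @ pe[60]
```

The table is added element-wise to the token embeddings before the first attention layer.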
2. Learned positional embeddings (GPT-2, BERT):
- Train a separate embedding for each position (position 1, position 2, ... position N)
- More flexible, but limited to the maximum trained length: no embedding exists for positions beyond N
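As a rough illustration, a learned positional embedding is just a lookup table with one row per position. In this sketch the table is random rather than trained, and the sizes and the helper `add_learned_positions` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
max_positions, d_model = 512, 64   # hypothetical sizes

# In a real model this table is a trainable parameter; here it is random.
position_table = rng.normal(scale=0.02, size=(max_positions, d_model))

def add_learned_positions(token_embeddings, table):
    """Add one table row per position to a (seq_len, d_model) input."""
    seq_len = token_embeddings.shape[0]
    if seq_len > table.shape[0]:
        raise ValueError(
            f"sequence length {seq_len} exceeds the {table.shape[0]} trained positions"
        )
    return token_embeddings + table[:seq_len]

tokens = rng.normal(size=(10, d_model))
with_positions = add_learned_positions(tokens, position_table)
```

The explicit length check makes the limitation visible: there is simply no row to look up past the end of the table.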
3. RoPE — Rotary Position Embedding (LLaMA, Mistral, Qwen):
- Encodes position by rotating the query and key vectors
- Relative positions are captured naturally in the attention computation: the query-key dot product depends only on the offset between the two positions
- Can be extended beyond training length with scaling techniques (NTK-aware, YaRN)
- The dominant approach for modern LLMs
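The rotation idea can be sketched in a few lines of NumPy. This assumes a single vector with an even number of dimensions, a base of 10000, and rotation of adjacent dimension pairs; real implementations differ in how they pair dimensions, and the function name `rope` is illustrative.

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotate consecutive dimension pairs of a 1-D vector x by position-dependent angles."""
    d = x.shape[-1]                               # assumed even
    freqs = base ** (-np.arange(0, d, 2) / d)     # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=16), rng.normal(size=16)

# Same offset (3 positions apart) at two different absolute locations.
score_near_start = rope(q, 5) @ rope(k, 2)
score_far_along = rope(q, 105) @ rope(k, 102)
```

The two scores agree up to floating-point error, because rotating both vectors by the same extra angle leaves their dot product unchanged. Only the offset between query and key matters.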
4. ALiBi — Attention with Linear Biases (BLOOM, MPT):
- Adds a penalty to each attention score that grows linearly with the distance between query and key, with a different slope per head
- Closer tokens are penalized less, so attention is biased toward recent context
- Simple and effective for length extrapolation
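A minimal sketch of the bias matrix, assuming causal attention and a power-of-two number of heads so that the slopes form the geometric sequence 2^(-8/n), 2^(-16/n), and so on. The function name `alibi_bias` is illustrative.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Return a (num_heads, seq_len, seq_len) additive bias for causal attention."""
    # Geometric slopes; this simple form assumes num_heads is a power of two.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    positions = np.arange(seq_len)
    distance = positions[:, None] - positions[None, :]   # query index minus key index
    distance = np.maximum(distance, 0)                   # future keys are left to the causal mask
    return -slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(seq_len=6, num_heads=8)
```

The bias is added to the query-key scores before the softmax. Nothing in it depends on a trained table, which is part of why the scheme extends to longer sequences.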
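Context length extension: RoPE scaling techniques can let models trained on 4K tokens work at 128K+, usually with some additional fine-tuning at the longer length:
- Linear scaling (position interpolation): divides position indices by a scale factor; simple, but quality tends to degrade as the factor grows
- NTK-aware scaling: enlarges the frequency base, so low frequencies are stretched more than high ones
- YaRN: combines frequency-dependent interpolation with a rescaling of the attention scores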
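The first two strategies can be compared directly on the rotation angles. This is a sketch under stated assumptions: a head dimension of 64, a 4K trained length, a scale factor of 8, and the commonly cited NTK-aware rule `new_base = base * scale ** (d / (d - 2))`. The function names are illustrative, and YaRN is not implemented here.

```python
import numpy as np

def rope_angles(positions, d, base=10000.0):
    """Rotation angle for every (position, frequency pair)."""
    freqs = base ** (-np.arange(0, d, 2) / d)
    return np.outer(positions, freqs)

def linear_scaled_angles(positions, d, scale, base=10000.0):
    """Linear scaling: squeeze positions back into the trained range."""
    return rope_angles(np.asarray(positions) / scale, d, base)

def ntk_scaled_angles(positions, d, scale, base=10000.0):
    """NTK-aware scaling: keep positions, enlarge the frequency base."""
    new_base = base * scale ** (d / (d - 2))
    return rope_angles(positions, d, new_base)

d, trained_len, scale = 64, 4096, 8.0
target_pos = np.array([trained_len * scale - 1])   # last position of a 32K window

plain = rope_angles(target_pos, d)
linear = linear_scaled_angles(target_pos, d, scale)
ntk = ntk_scaled_angles(target_pos, d, scale)
```

Linear scaling shrinks every angle by the same factor. With the NTK-aware rule, the highest-frequency pair is left untouched while the lowest-frequency pair is shrunk by the full factor, so fine-grained local position information is better preserved.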
Example
In the sentences "The bank by the river" and "I went to the bank," the word "bank" appears at different positions. Positional encoding tells the model where each word sits in the sequence, so attention can tell which words are next to "bank" and use that surrounding context to determine that one "bank" means a riverbank and the other means a financial institution.