
What is Positional Encoding?
Positional encoding is a mechanism that gives transformer models information about the order and position of tokens in a sequence. Since transformers process all tokens simultaneously (unlike RNNs which process sequentially), they need an explicit signal to know that "the cat sat on the mat" is different from "the mat sat on the cat."
Why It Matters
Without positional encoding, transformers would treat text as a bag of words, losing all meaning that depends on word order. The choice of positional encoding also determines a model's maximum context length and how well it handles long sequences. Recent schemes like RoPE and ALiBi helped enable the jump from 2K to 128K+ token context windows.
How It Works
The problem:
- Self-attention computes relationships between all token pairs
- But attention is permutation-invariant: it doesn't inherently know token order
- "Dog bites man" and "Man bites dog" would produce the same attention patterns
Solutions:
1. Sinusoidal positional encoding (original transformer):
- Add fixed sine/cosine waves of different frequencies to token embeddings
- Each position gets a unique pattern; nearby positions have similar patterns
- Fixed, not learned: defined for any position in principle, but in practice models generalize poorly beyond the trained context length
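The sinusoidal scheme above can be sketched in a few lines of NumPy, following the original transformer's formulas PE(pos, 2i) = sin(pos / 10000^(2i/d)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/d)); the function name and sizes here are illustrative choices, not from the source.

```python
import numpy as np

def sinusoidal_encoding(seq_len, d_model):
    # Even columns get sines, odd columns get cosines, at frequencies
    # that decay geometrically from 1 down to 1/10000 across the dims.
    positions = np.arange(seq_len)[:, None]           # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # (1, d_model // 2)
    angles = positions / (10000 ** (dims / d_model))  # (seq_len, d_model // 2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_encoding(seq_len=50, d_model=16)
# Each row is a unique pattern, and nearby positions stay similar:
# position 10 correlates more with position 11 than with position 40.
assert pe[10] @ pe[11] > pe[10] @ pe[40]
```

In a transformer this matrix is simply added to the token embeddings before the first attention layer.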