What is Positional Encoding?

Positional encoding is a mechanism that gives transformer models information about the order and position of tokens in a sequence. Because transformers process all tokens in parallel (unlike RNNs, which process them one at a time), they need an explicit signal to know that "the cat sat on the mat" is different from "the mat sat on the cat."

Why It Matters

Without positional encoding, a transformer would treat text as a bag of words and lose all meaning that depends on word order. The choice of positional encoding also strongly influences a model's usable context length and how well it handles long sequences. Schemes such as RoPE and ALiBi were key ingredients in the jump from 2K to 128K+ context windows.

How It Works

The problem:

Self-attention computes relationships between all token pairs
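But attention is permutation-equivariant (often loosely called permutation-invariant): shuffling the input tokens only shuffles the outputs, so it has no inherent notion of token order
"Dog bites man" and "Man bites dog" would give each word the same attention weights and the same output vector, just in a different order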

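To make the permutation point concrete, here is a minimal NumPy sketch of single-head self-attention with no positional signal. The weight matrices, the embedding size of 8, and the three random vectors standing in for "dog", "bites", and "man" are illustrative assumptions, not values from any real model.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head self-attention with no positional information."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(3, d))            # toy embeddings: "dog", "bites", "man"
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 1, 0])             # reordered: "man", "bites", "dog"
out = self_attention(x, w_q, w_k, w_v)
out_perm = self_attention(x[perm], w_q, w_k, w_v)

# Each token receives the same output vector in both orderings;
# only the row order of the result changes.
same_up_to_order = np.allclose(out[perm], out_perm)
```

Here `same_up_to_order` comes out true: reordering the words changes nothing except the order of the output rows, which is exactly the gap that positional encoding fills.
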
Solutions:

1. Sinusoidal positional encoding (original transformer):

Add fixed sine/cosine waves of different frequencies to token embeddings
Each position gets a unique pattern; nearby positions have similar patterns
Fixed (not learned): the formula can be evaluated at any position, but models tend to generalize poorly to positions beyond the lengths seen in training
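
A minimal sketch of the sinusoidal scheme, assuming an even model dimension and the base of 10000 used in the original transformer. The function name `sinusoidal_encoding` is a hypothetical choice for this example.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a (num_positions, d_model) table; d_model is assumed even."""
    positions = np.arange(num_positions)[:, None]            # (P, 1)
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)    # (d_model/2,)
    angles = positions * freqs                               # (P, d_model/2)
    table = np.zeros((num_positions, d_model))
    table[:, 0::2] = np.sin(angles)   # even feature indices hold sines
    table[:, 1::2] = np.cos(angles)   # odd feature indices hold cosines
    return table

pe = sinusoidal_encoding(num_positions=128, d_model=64)

# Nearby positions have more similar encodings than distant ones.
near = pe[10] @ pe[11]
far = pe[10] @ pe[60]
```

In a model, the first `seq_len` rows of this table would be added to the token embeddings before the first layer. In this example `near` is larger than `far`, which matches the claim that nearby positions get similar patterns.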

2. Learned positional embeddings (GPT-2, BERT):

Train a separate embedding for each position (position 1, position 2, ... position N)
More flexible, but there is simply no embedding for positions beyond the maximum trained length N
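
As a rough illustration, learned positional embeddings amount to a lookup table with one row per position. The class name `LearnedPositionalEmbedding` is hypothetical, and the random initialization stands in for values that would normally be learned during training.

```python
import numpy as np

class LearnedPositionalEmbedding:
    """Lookup table with one vector per position, up to max_len."""

    def __init__(self, max_len, d_model, seed=0):
        rng = np.random.default_rng(seed)
        # Random values stand in for trained parameters (assumption).
        self.table = rng.normal(scale=0.02, size=(max_len, d_model))

    def __call__(self, token_embeddings):
        seq_len = token_embeddings.shape[0]
        if seq_len > self.table.shape[0]:
            # No row exists for positions past max_len.
            raise ValueError("sequence is longer than the trained maximum")
        return token_embeddings + self.table[:seq_len]

pos_emb = LearnedPositionalEmbedding(max_len=8, d_model=4)
inputs = pos_emb(np.zeros((5, 4)))     # works: 5 <= 8
```

Passing a 9-token sequence to `pos_emb` would raise the error, which is the hard length limit described above.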

3. RoPE — Rotary Position Embedding (LLaMA, Mistral, Qwen):

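Encodes position by rotating the query and key vectors through an angle proportional to the token's position
Relative positions are captured naturally: the query-key dot product depends only on the offset between the two positions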
Can be extended beyond training length with scaling techniques (NTK-aware, YaRN)
The dominant approach for modern LLMs
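
The rotation idea can be sketched in a few lines of NumPy. This version rotates adjacent feature pairs (0,1), (2,3), and so on; some implementations pair the features differently, so treat the layout, the function name `rope`, and the base of 10000 as assumptions for illustration.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate adjacent feature pairs of x (seq_len, d) by position-dependent angles."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)     # (d/2,)
    angles = positions[:, None] * freqs           # (seq_len, d/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 16))
k = rng.normal(size=(1, 16))

def score(q_pos, k_pos):
    """Attention logit between q placed at q_pos and k placed at k_pos."""
    return (rope(q, np.array([q_pos])) @ rope(k, np.array([k_pos])).T).item()

# The same offset (3 tokens back) at very different absolute positions.
early = score(5, 2)
late = score(105, 102)
```

The two scores agree up to floating-point error, because the rotations cancel except for the angle set by the offset between the positions.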

4. ALiBi — Attention with Linear Biases (BLOOM, MPT):

Adds a penalty to each attention score that grows linearly with the distance between query and key, with a fixed slope per head
Closer tokens are penalized less, so attention is biased toward nearby context
Simple and effective for length extrapolation
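
A small sketch of the ALiBi bias for causal attention, assuming a power-of-two number of heads and the geometric slope schedule from the ALiBi paper (for 8 heads: 1/2, 1/4, ..., 1/256). The function names are hypothetical.

```python
import numpy as np

def alibi_slopes(num_heads):
    """Geometric slopes 2^(-8/n), 2^(-16/n), ... for a power-of-two head count n."""
    return np.array([2.0 ** (-8.0 * (h + 1) / num_heads) for h in range(num_heads)])

def alibi_bias(seq_len, num_heads):
    """Return (num_heads, seq_len, seq_len) biases to add to attention scores."""
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]   # query index minus key index
    distance = np.maximum(distance, 0)       # future keys are left to the causal mask
    return -alibi_slopes(num_heads)[:, None, None] * distance

bias = alibi_bias(seq_len=5, num_heads=8)
# For head 0 (slope 1/2), a key 4 tokens back is penalized by 0.5 * 4.
penalty = bias[0, 4, 0]
```

The bias for a given head would be added to that head's scaled query-key scores before the causal mask and softmax. Because it depends only on distance, the same formula applies unchanged to sequences longer than those seen in training.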

Context length extension: RoPE scaling techniques let a model trained on, say, 4K tokens run at much longer contexts (up to 128K+), often with some fine-tuning at the longer length:

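Linear scaling (position interpolation): divides position indices by a constant factor; simple, but it squeezes neighboring positions together, so quality degrades without fine-tuning
NTK-aware scaling: enlarges the RoPE frequency base, so low-frequency dimensions are interpolated while high-frequency dimensions stay nearly unchanged
YaRN: combines frequency-dependent interpolation with a rescaling of the attention logits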
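
The first two strategies can be compared with a short sketch. It assumes the commonly cited NTK-aware rule, new base = base * factor^(d / (d - 2)), and a scale factor of 32 (4K to 128K); the function names are hypothetical, and YaRN is left out because it involves more than a single formula.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """Per-pair rotation frequencies for a head dimension d (assumed even)."""
    return base ** (-np.arange(0, d, 2) / d)

def linear_scaled_angles(positions, d, factor, base=10000.0):
    """Linear scaling: squeeze positions by `factor` so they fit the trained range."""
    return (positions[:, None] / factor) * rope_frequencies(d, base)

def ntk_scaled_base(d, factor, base=10000.0):
    """NTK-aware scaling: enlarge the base instead of squeezing positions."""
    return base * factor ** (d / (d - 2))

d, factor = 64, 32.0                     # e.g. 4K trained length, 128K target
plain = rope_frequencies(d)
ntk = rope_frequencies(d, base=ntk_scaled_base(d, factor))

# The highest frequency is untouched; the lowest is slowed by the full factor.
high_ratio = ntk[0] / plain[0]
low_ratio = plain[-1] / ntk[-1]
```

Under these assumptions `high_ratio` is 1 and `low_ratio` is approximately 32: NTK-aware scaling preserves fine-grained local position information and stretches only the slow, long-range components, whereas linear scaling slows every frequency by the same factor.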

Example

In "The bank by the river" and "I went to the bank," the word "bank" appears at different positions (second and fifth). Positional encoding tells the model where each word sits and how far apart the words are, so attention can weigh the surrounding context, such as the nearby "river," when determining that one "bank" means a riverbank and the other means a financial institution.