Models & Architecture
Advanced
2026-W17

What is Positional Encoding?

Positional encoding tells transformers the order of tokens in a sequence, since self-attention alone is position-agnostic. Modern approaches like RoPE, combined with scaling techniques, help enable 128K+ context windows.

Also known as:
positie-encodering
RoPE
rotary embedding
ALiBi

Positional encoding is a mechanism that gives transformer models information about the order and position of tokens in a sequence. Since transformers process all tokens simultaneously (unlike RNNs which process sequentially), they need an explicit signal to know that "the cat sat on the mat" is different from "the mat sat on the cat."

Why It Matters

Without positional encoding, transformers would treat text as a bag of words, losing all meaning that depends on word order. The choice of positional encoding also strongly influences a model's usable context length and how well it handles long sequences. Innovations like RoPE and ALiBi helped enable the jump from 2K to 128K+ context windows.

How It Works

The problem:

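  • Self-attention computes relationships between all token pairs
  • But attention without positional information is permutation-equivariant: reordering the input tokens only reorders the outputs, so it has no inherent notion of token order
  • "Dog bites man" and "Man bites dog" would give each word the same representation in both sentences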
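The problem can be checked numerically. The following is a minimal sketch, assuming a single attention head with random weights and hypothetical random embeddings for the three words; it is not taken from any particular model. It shows that shuffling the input only shuffles the output.

```python
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention with no positional signal."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8
# Hypothetical embeddings for the tokens "dog", "bites", "man".
tokens = rng.normal(size=(3, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

perm = np.array([2, 1, 0])  # reorder to "man bites dog"
out_original = self_attention(tokens, w_q, w_k, w_v)
out_permuted = self_attention(tokens[perm], w_q, w_k, w_v)

# Each token receives the same output vector in both orderings.
print(np.allclose(out_original[perm], out_permuted))  # True
```

Because the outputs match row for row, nothing downstream can tell the two sentences apart unless position is injected somewhere.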

Solutions:

1. Sinusoidal positional encoding (original transformer):

  • Add fixed sine/cosine waves of different frequencies to token embeddings
  • Each position gets a unique pattern; nearby positions have similar patterns
  • Fixed rather than learned; it can be computed for any position, but models tend to generalize poorly beyond the lengths seen in training
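A minimal NumPy sketch of the sinusoidal table, following the sine/cosine formulation from Section 3.5 of Vaswani et al. The function name and the base of 10000 as a parameter are illustrative choices, and an even model dimension is assumed.

```python
import numpy as np

def sinusoidal_encoding(num_positions, d_model, base=10000.0):
    """Return a (num_positions, d_model) table; d_model is assumed to be even."""
    positions = np.arange(num_positions)[:, None]          # (P, 1)
    freqs = base ** (-np.arange(0, d_model, 2) / d_model)  # (d_model / 2,)
    angles = positions * freqs                             # (P, d_model / 2)
    table = np.zeros((num_positions, d_model))
    table[:, 0::2] = np.sin(angles)  # even dimensions: sine
    table[:, 1::2] = np.cos(angles)  # odd dimensions: cosine
    return table

pe = sinusoidal_encoding(num_positions=64, d_model=16)

# Nearby positions have more similar patterns than distant ones.
sim_near = pe[10] @ pe[11]
sim_far = pe[10] @ pe[40]
print(sim_near > sim_far)  # True
```

In the original transformer this table is simply added to the token embeddings before the first layer.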

2. Learned positional embeddings (GPT-2, BERT):

  • Train a separate embedding for each position (position 1, position 2, ... position N)
  • More flexible, but limited to the maximum length seen in training, since later positions have no embedding
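As a rough sketch, a learned positional embedding is just a second lookup table indexed by position. The sizes below are hypothetical, and random values stand in for parameters that would normally be trained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, max_positions, d_model = 100, 16, 8  # illustrative sizes

# Both tables would be trained parameters; random values stand in for them here.
token_table = rng.normal(size=(vocab_size, d_model))
position_table = rng.normal(size=(max_positions, d_model))

def embed(token_ids):
    """Add a per-position embedding to each token embedding."""
    token_ids = np.asarray(token_ids)
    if len(token_ids) > max_positions:
        raise ValueError("sequence is longer than the position table")
    positions = np.arange(len(token_ids))
    return token_table[token_ids] + position_table[positions]

x = embed([5, 7, 5])
# The same token (id 5) yields different inputs at positions 0 and 2.
print(np.allclose(x[0], x[2]))  # False
```

The explicit length check makes the limitation visible: there is no row in the table for a position past `max_positions`.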

3. RoPE — Rotary Position Embedding (LLaMA, Mistral, Qwen):

  • Encodes position by rotating the query and key vectors
  • Relative positions are captured naturally in the attention computation: the query-key dot product depends only on the distance between the two tokens
  • Can be extended beyond training length with scaling techniques (NTK-aware, YaRN)
  • The dominant approach for modern LLMs
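The rotation can be sketched in a few lines of NumPy. This is a simplified illustration that pairs adjacent dimensions and uses a base of 10000; real implementations differ in how they pair dimensions and cache the angles. The check at the end shows that the attention score depends only on the offset between query and key positions.

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles.

    x: (seq_len, d) with d even; positions: sequence of length seq_len.
    """
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)         # (d / 2,)
    angles = np.asarray(positions)[:, None] * freqs   # (seq_len, d / 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

rng = np.random.default_rng(0)
q = rng.normal(size=(1, 16))
k = rng.normal(size=(1, 16))

# Same offset (3) between query and key at two different absolute positions.
score_a = rope(q, [5]) @ rope(k, [2]).T
score_b = rope(q, [105]) @ rope(k, [102]).T
print(np.allclose(score_a, score_b))  # True
```

Since a rotation never changes vector length, the token content is preserved and only the relative angle between query and key carries position.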

4. ALiBi — Attention with Linear Biases (BLOOM, MPT):

  • Adds a linear distance-based penalty to attention scores, with a fixed slope per attention head
  • Closer tokens get higher attention; distant tokens are penalized
  • Simple and effective for length extrapolation
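A minimal sketch of the bias that ALiBi adds to the attention scores before the softmax. It assumes causal attention and a number of heads that is a power of two, in which case the per-head slopes form the geometric sequence described in the ALiBi paper; the function name is illustrative.

```python
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Return a (num_heads, seq_len, seq_len) causal bias tensor.

    Slopes are 2^(-8/n), 2^(-16/n), ..., 2^(-8) for n heads
    (assumes n is a power of two).
    """
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    distance = pos[:, None] - pos[None, :]     # query index minus key index
    bias = -slopes[:, None, None] * distance   # zero on the diagonal, more negative further back
    future = distance < 0
    return np.where(future, -np.inf, bias)     # keys after the query are masked out

bias = alibi_bias(seq_len=4, num_heads=8)
print(bias[0, 3])  # penalties for the last query in head 0 (slope 0.5)
```

The bias is added to the raw query-key scores; no positional vector is added to the token embeddings at all.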

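Context length extension: RoPE scaling techniques let models trained on 4K tokens work at 128K+, usually after some fine-tuning at the longer length:

  • Linear scaling (position interpolation): divides position indices by a scale factor; simple, but quality degrades as the factor grows
  • NTK-aware scaling: raises the frequency base, so low frequencies are stretched while high frequencies stay nearly unchanged
  • YaRN: combines frequency-dependent interpolation with a rescaling of attention scores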
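The first two strategies differ only in how the rotation angles are computed, which a short sketch makes concrete. The function names are hypothetical, and the NTK-aware base adjustment uses the commonly cited form base · scale^(d / (d − 2)); YaRN is not shown because it involves additional per-frequency blending.

```python
import numpy as np

def rope_frequencies(d, base=10000.0):
    """Per-pair rotation frequencies for head dimension d (d even)."""
    return base ** (-np.arange(0, d, 2) / d)

def linear_scaled_angles(positions, d, scale, base=10000.0):
    """Linear scaling: squeeze positions by `scale` before rotating."""
    return (np.asarray(positions)[:, None] / scale) * rope_frequencies(d, base)

def ntk_scaled_angles(positions, d, scale, base=10000.0):
    """NTK-aware scaling: keep positions as they are, enlarge the base instead."""
    new_base = base * scale ** (d / (d - 2))
    return np.asarray(positions)[:, None] * rope_frequencies(d, new_base)

d, scale = 64, 32.0   # e.g. stretching a 4K-trained model towards 128K
pos = [100_000]
lin = linear_scaled_angles(pos, d, scale)
ntk = ntk_scaled_angles(pos, d, scale)

# Highest frequency: NTK-aware leaves it untouched, linear scaling compresses it.
print(ntk[0, 0], lin[0, 0])
# Lowest frequency: both methods stretch it by the same factor.
print(np.isclose(ntk[0, -1], lin[0, -1]))  # True
```

Leaving the high frequencies alone is the design point of NTK-aware scaling: those dimensions distinguish neighbouring tokens, and compressing them is one reason plain linear scaling loses quality.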

Example

In the sentences "The bank by the river" and "I went to the bank," the word "bank" appears at different positions. Positional encoding tells the model where each word sits in the sequence, which helps attention use the surrounding context to determine that one "bank" means a riverbank and the other a financial institution.

Sources

  1. Vaswani et al. – Attention Is All You Need (Section 3.5)
  2. Su et al. – RoFormer: Enhanced Transformer with Rotary Position Embedding
