Core Concepts
Beginner

What Is a Context Window?

The maximum number of tokens an LLM can process in a single request

Also known as:
Contextvenster
Context Length
Token Limit
Context Window

The context window is the maximum number of tokens a Large Language Model can process in a single request, encompassing both the input (system prompt, user message, retrieved documents, conversation history) and the generated output. Modern context windows range from 8K tokens for lightweight models to over 1 million tokens for frontier models like Claude and Gemini. The context window defines the model's working memory — everything it needs to consider when generating a response must fit within this limit. When content exceeds the window, it is simply invisible to the model, making context management a critical engineering discipline for any non-trivial AI application.
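The budget arithmetic above can be sketched in a few lines. This is a minimal illustration, not a real tokenizer: it assumes the common ~4 characters/token rule of thumb, and the function names (`estimate_tokens`, `fits_in_window`) are invented for this example. Production code should count tokens with the model provider's own tokenizer.

```python
# Rough context-budget check: input tokens plus tokens reserved for the
# output must fit within the model's context window.

def estimate_tokens(text: str) -> int:
    """Approximate token count using the ~4 characters/token heuristic."""
    return max(1, len(text) // 4)

def fits_in_window(system_prompt: str, history: list[str], user_message: str,
                   max_output_tokens: int, window_size: int) -> bool:
    """True if the full request (input + reserved output) fits the window."""
    input_tokens = (estimate_tokens(system_prompt)
                    + sum(estimate_tokens(turn) for turn in history)
                    + estimate_tokens(user_message))
    return input_tokens + max_output_tokens <= window_size

# A short request easily fits an 8K window:
print(fits_in_window("You are a helpful assistant.", [],
                     "Summarize this.", 1024, 8192))  # → True
```

Note that the output reservation matters: a request whose input alone fits can still fail if the requested output length pushes the total past the window.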

Why it matters

The context window determines what an LLM can "see" during any single interaction. A 200K-token window sounds vast until you calculate that a 100-page technical manual consumes around 25K tokens, a day of customer conversation history might use 50K tokens, and instructions plus examples take another 2K tokens. For applications involving long documents, multi-turn conversations, or multi-source RAG, context window size directly constrains what is possible. Cost also scales with context length — processing 100K tokens costs roughly 12× more than processing 8K tokens. This creates the fundamental tradeoff in AI application design: include more context for better quality, or compress aggressively for lower cost. Context compression techniques have emerged as a critical optimization discipline specifically to maximize the value extracted per token of context.
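The cost scaling mentioned above is simple proportional arithmetic. The price below is a placeholder, not any provider's actual rate; real providers price per million tokens and rates vary widely by model.

```python
# Illustrative input-cost comparison at an assumed flat price per input token.
PRICE_PER_TOKEN = 3.00 / 1_000_000  # hypothetical $3 per million input tokens

def input_cost(tokens: int) -> float:
    """Cost of processing a given number of input tokens."""
    return tokens * PRICE_PER_TOKEN

ratio = input_cost(100_000) / input_cost(8_000)
print(f"100K vs 8K input cost ratio: {ratio:.1f}x")  # → 12.5x
```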

How it works

The model's transformer architecture processes all tokens in the context window simultaneously through attention mechanisms. Each token attends to every other token, with computational cost scaling quadratically — doubling the context from 50K to 100K tokens roughly quadruples the attention computation. This is why larger context windows require more compute and cost more. In practice, the context window is filled in order: system prompt first, then any injected context (RAG documents, tool results), then conversation history, and finally the user's current message. If the total exceeds the window, earlier content must be truncated or summarized. Most applications implement a context management strategy — sliding windows that keep only recent turns, summarization of older context, or priority-based inclusion where the most relevant content is always retained.
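The sliding-window strategy described above can be sketched as follows. This is a simplified illustration under stated assumptions: token counts are passed in alongside each text (in practice they would come from a tokenizer), and the `build_context` function name is invented for this example. A real implementation would also summarize dropped turns rather than discard them silently.

```python
# Sliding-window context management: always keep the system prompt and the
# current user message, then fill the remaining budget with the most recent
# conversation turns, dropping the oldest first.

def build_context(system_prompt: tuple[str, int],
                  turns: list[tuple[str, int]],
                  user_message: tuple[str, int],
                  window_budget: int) -> list[str]:
    """Each item is (text, token_count). Returns the texts that fit."""
    used = system_prompt[1] + user_message[1]
    kept: list[str] = []
    # Walk history from newest to oldest, keeping turns while budget allows.
    for text, tokens in reversed(turns):
        if used + tokens > window_budget:
            break  # everything older is dropped (or would be summarized)
        kept.append(text)
        used += tokens
    kept.reverse()  # restore chronological order
    return [system_prompt[0], *kept, user_message[0]]
```

With a 130-token budget, a 10-token system prompt, a 10-token message, and three 50-token turns, only the two most recent turns survive; the oldest is cut to make the request fit.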

Example

A research team builds an AI assistant that analyzes scientific papers. Each paper averages 12K tokens. The team wants the system to compare three papers simultaneously while following a detailed analysis prompt (2K tokens) and maintaining conversation context (3K tokens). Total needed: 41K tokens — well within a 200K-token window but expensive at scale. They implement a two-tier strategy: for initial analysis, the full papers are loaded into context. For follow-up questions, only the relevant sections (identified via semantic search on embeddings) are included, reducing context to 8K tokens per request. This context compression approach cuts costs by 80% on follow-up queries while maintaining answer quality, because the model only sees the passages it actually needs rather than entire papers.
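The follow-up tier of the example can be sketched as a greedy budget fill. This assumes relevance scores have already been produced by semantic search over embeddings (not shown), and `select_sections` is an invented name for illustration.

```python
# Greedy selection of the most relevant paper sections under a token budget:
# sort by relevance score, then include sections while they still fit.

def select_sections(scored_sections: list[tuple[float, str, int]],
                    budget: int) -> list[str]:
    """scored_sections items are (relevance_score, text, token_count)."""
    chosen: list[str] = []
    used = 0
    for score, text, tokens in sorted(scored_sections, reverse=True):
        if used + tokens <= budget:
            chosen.append(text)
            used += tokens
    return chosen

# Three candidate sections, 8K-token budget: the 6K section scores well but
# no longer fits after the top section is taken, so the smaller one is used.
sections = [(0.9, "methods", 3000), (0.8, "results", 6000), (0.5, "intro", 4000)]
print(select_sections(sections, 8000))  # → ['methods', 'intro']
```

A priority-based scheme like this is what keeps per-request context near 8K tokens instead of the full 41K.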

Sources

  1. Vaswani et al. — Attention Is All You Need (arXiv)
  2. Anthropic — Claude Models and Context Limits (web)
  3. Wikipedia


Related Concepts

Context Compression for AI Agents
Techniques to reduce token counts while preserving meaning — critical for agentic workflows that exhaust even million-token context windows.
Token in AI
The smallest unit of text an LLM processes — approximately 4 characters or 0.75 words
RAG (Retrieval-Augmented Generation)
A technique that combines LLMs with external knowledge retrieval to improve accuracy and reduce hallucinations
Large Language Model (LLM)
A neural network trained on massive text data to understand and generate human-like language
Prompt Caching
Storing and reusing processed prompt prefixes on LLM servers to reduce costs by up to 90% and latency by 3×
KV Cache
A memory optimization that stores previously computed key-value pairs in transformer attention layers — avoiding redundant computation and accelerating generation 3-5×

