
The context window is the maximum number of tokens a Large Language Model can process in a single request, encompassing both the input (system prompt, user message, retrieved documents, conversation history) and the generated output. Modern context windows range from 8K tokens for lightweight models to over 1 million tokens for frontier models like Claude and Gemini. The context window defines the model's working memory — everything it needs to consider when generating a response must fit within this limit. When content exceeds the window, it is simply invisible to the model, making context management a critical engineering discipline for any non-trivial AI application.
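Because everything must fit in one budget, applications typically check the total before sending a request. A minimal sketch, assuming a rough four-characters-per-token heuristic (real systems would count with the model's actual tokenizer):

```python
# Rough context-budget check. The 4-characters-per-token ratio is a common
# heuristic for English text, not an exact tokenizer count.
CHARS_PER_TOKEN = 4

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // CHARS_PER_TOKEN)

def fits_in_window(parts: list[str], window: int, reserved_output: int) -> bool:
    """True if all input parts plus tokens reserved for output fit the window."""
    input_tokens = sum(estimate_tokens(p) for p in parts)
    return input_tokens + reserved_output <= window

# System prompt + retrieved document + user message against an 8K window,
# reserving 1K tokens for the model's response.
parts = ["You are a helpful assistant.", "..." * 100, "Summarize the document."]
print(fits_in_window(parts, window=8_000, reserved_output=1_000))
```

Note that output tokens count against the same window, so a budget check has to reserve room for the response, not just the input.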
Why it matters
The context window determines what an LLM can "see" during any single interaction. A 200K-token window sounds vast until you calculate that a 100-page technical manual consumes around 25K tokens, a day of customer conversation history might use 50K tokens, and instructions plus examples take another 2K tokens. For applications involving long documents, multi-turn conversations, or multi-source RAG, context window size directly constrains what is possible. Cost also scales with context length — processing 100K tokens costs roughly 12× more than processing 8K tokens. This creates the fundamental tradeoff in AI application design: include more context for better quality, or compress aggressively for lower cost. Context compression techniques have emerged as a critical optimization discipline specifically to maximize the value extracted per token of context.
How it works
The model's transformer architecture processes all tokens in the context window simultaneously through attention mechanisms. Each token attends to every other token, with computational cost scaling quadratically — doubling the context from 50K to 100K tokens roughly quadruples the attention computation. This is why larger context windows require more compute and cost more. In practice, the context window is filled in order: system prompt first, then any injected context (RAG documents, tool results), then conversation history, and finally the user's current message. If the total exceeds the window, earlier content must be truncated or summarized. Most applications implement a context management strategy — sliding windows that keep only recent turns, summarization of older context, or priority-based inclusion where the most relevant content is always retained.
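The sliding-window strategy above can be sketched as follows: always keep the system prompt and current message, then admit history newest-first until the budget runs out. This is a minimal illustration, with `estimate_tokens` standing in for a real tokenizer count:

```python
# Sliding-window context manager: keep the system prompt and current
# message, then as many of the most recent turns as fit the token budget.

def estimate_tokens(text: str) -> int:
    # Stand-in for a real tokenizer; ~4 characters per token.
    return max(1, len(text) // 4)

def build_context(system_prompt: str, history: list[str],
                  current_message: str, window: int) -> list[str]:
    budget = window - estimate_tokens(system_prompt) - estimate_tokens(current_message)
    kept = []
    # Walk history newest-first, keeping turns until the budget is exhausted.
    for turn in reversed(history):
        cost = estimate_tokens(turn)
        if cost > budget:
            break
        kept.append(turn)
        budget -= cost
    kept.reverse()  # restore chronological order
    return [system_prompt, *kept, current_message]
```

Dropping the *oldest* turns first mirrors how the window fills in practice: the system prompt and current message are non-negotiable, while distant history is the cheapest thing to sacrifice. Summarization and priority-based variants replace the `break` with a summarize-and-retry or relevance-scored selection step.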
Example
A research team builds an AI assistant that analyzes scientific papers. Each paper averages 12K tokens. The team wants the system to compare three papers simultaneously while following a detailed analysis prompt (2K tokens) and maintaining conversation context (3K tokens). Total needed: 41K tokens — well within a 200K-token window but expensive at scale. They implement a two-tier strategy: for initial analysis, the full papers are loaded into context. For follow-up questions, only the relevant sections (identified via semantic search on embeddings) are included, reducing context to 8K tokens per request. This context compression approach cuts costs by roughly 80% on follow-up queries while maintaining answer quality, because the model only sees the passages it actually needs rather than entire papers.
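The follow-up tier can be sketched as a select-the-relevant-sections step. The team's system scores sections with embedding similarity; the word-overlap scorer below is a self-contained stand-in so the sketch runs without an embedding model:

```python
# Select only the paper sections most relevant to a follow-up question,
# under a token cap. Word overlap stands in for embedding similarity here.

def score(section: str, question: str) -> int:
    # Count question words that also appear in the section.
    return len(set(question.lower().split()) & set(section.lower().split()))

def select_sections(sections: list[str], question: str,
                    max_tokens: float, tokens_per_word: float = 1.3) -> list[str]:
    ranked = sorted(sections, key=lambda s: score(s, question), reverse=True)
    chosen, used = [], 0.0
    for sec in ranked:
        cost = len(sec.split()) * tokens_per_word
        if used + cost > max_tokens:
            continue  # skip sections that would blow the budget
        chosen.append(sec)
        used += cost
    return chosen
```

The structure is the same regardless of the scorer: rank candidates by relevance, then greedily pack them under the token cap, so each follow-up request carries an 8K slice instead of 41K of full papers.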