Models & Architecture
Advanced

What Is the KV Cache?

A memory optimization that stores previously computed key-value pairs in transformer attention layers — avoiding redundant computation and accelerating generation 3-5×

Also known as:
Key-Value Cache
KV Cache
Attention Cache
What Is the KV Cache? How Key-Value Caching Accelerates LLM Inference

The KV cache (key-value cache) is a memory optimization technique used in transformer-based language models that stores the key and value tensors computed by the attention mechanism for all previously processed tokens. During autoregressive text generation, where the model produces one token at a time, the KV cache eliminates the need to recompute keys and values over the entire sequence for each new token, reducing the per-token computational cost from quadratic to linear in sequence length. This optimization is so fundamental that virtually every production LLM deployment enables KV-caching by default, typically achieving 3-5× speedups in generation latency. Without it, real-time conversational AI would be impractically slow for sequences beyond a few hundred tokens.

Why it matters

KV-caching is what makes interactive LLM applications feasible at scale. Without caching, generating the 100th token in a sequence requires recomputing keys and values for all 99 previous tokens, and the 200th requires recomputing 199. Per-token cost therefore grows with the length of the sequence already processed, so late responses in a 4,000-token conversation cost far more per token than the first response. KV-caching collapses each step to a single lightweight operation: compute Q, K, and V for the new token only, score its query against the cached keys, use those scores to weight the cached values, and append the new key-value pair to the cache. For businesses running LLM inference at scale, KV-caching directly determines cost and responsiveness. The trade-off is memory: a KV cache for a model like Claude or GPT-4 can consume 1-4 GB per active session on long contexts, making GPU memory capacity, not compute, the primary bottleneck for concurrent user count in production deployments.
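The asymptotics can be made concrete with a toy count of query-key comparisons. This is a hedged sketch that ignores projections, heads, and layers; `total_comparisons` is an illustrative helper, not a real API:

```python
def total_comparisons(n, cached):
    """Total query-key dot products needed to generate tokens 1..n.

    With a cache, step t scores 1 new query against t keys: the sum grows ~n^2/2.
    Without one, step t redoes attention for all t tokens: the sum grows ~n^3/3.
    """
    return sum(t if cached else t * t for t in range(1, n + 1))

print(total_comparisons(500, cached=True))   # → 125250
print(total_comparisons(500, cached=False))  # → 41791750, over 300x more work
```

The totals still grow with sequence length either way; what the cache removes is the wasteful re-scoring of token pairs whose attention was already computed on earlier steps.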

How it works

In a transformer's attention mechanism, each token is projected into three vectors: Query (Q), Key (K), and Value (V). Attention scores are computed by comparing the current token's Query against all Keys, then using those scores to weight the Values. During autoregressive generation the model processes tokens sequentially, and without caching it recomputes K and V for every previous token at each step. The KV cache instead stores the K and V vectors for all tokens processed so far. When generating a new token, the model computes Q, K, and V for that single new token, appends the new K and V to the cache, and computes attention between the new Query and the full cached set. Each transformer layer maintains its own cache, so total cache size per sequence equals layers × sequence_length × 2 × hidden_dimension × precision_bytes, where the factor of 2 accounts for storing both K and V. Advanced techniques like grouped-query attention (GQA) and multi-query attention (MQA) shrink the cache by sharing K and V heads across query heads, replacing hidden_dimension with the much smaller combined width of the K/V heads.
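The per-step loop can be sketched in a few lines of NumPy. This is a minimal single-head, unbatched illustration under simplifying assumptions (`attention_step` and the growing-list cache are illustrative; production systems preallocate cache memory and run many heads and layers in parallel):

```python
import numpy as np

def attention_step(x_new, W_q, W_k, W_v, cache):
    """One autoregressive step with a KV cache (single head, no batching).

    x_new: (d,) embedding of the newly arrived token.
    cache: dict with "K" and "V" arrays of shape (t, d) holding the keys and
           values of the t tokens seen so far (empty arrays before token 1).
    """
    d = x_new.shape[0]
    # Project only the NEW token; cached K/V are reused, never recomputed.
    q = x_new @ W_q
    k = x_new @ W_k
    v = x_new @ W_v
    # Append the new key/value pair to the cache.
    cache["K"] = np.vstack([cache["K"], k])
    cache["V"] = np.vstack([cache["V"], v])
    # Attend: one query against all cached keys (O(t) work per step).
    scores = cache["K"] @ q / np.sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ cache["V"]  # (d,) attention output for the new token

# Usage: process 3 tokens, reusing (and growing) the cache each step.
d = 8
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
cache = {"K": np.empty((0, d)), "V": np.empty((0, d))}
for _ in range(3):
    out = attention_step(rng.standard_normal(d), W_q, W_k, W_v, cache)
```

After three steps the cache holds three key rows and three value rows; each subsequent token adds exactly one row per tensor instead of reprojecting the whole history.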

Example

A cloud provider benchmarks its LLM serving infrastructure with and without KV-caching on a 70B-parameter model. Without caching, generating a 500-token response to a 2,000-token prompt takes 12.4 seconds, because each new token requires a full attention pass over the growing sequence. With KV-caching enabled, the same generation completes in 2.8 seconds: the prompt is processed once (the prefill phase, 1.6 seconds) and the 500-token generation phase takes only 1.2 seconds, because each token computes attention only against the cached keys and values. The speedup is 4.4×. However, the cache consumes 2.1 GB of GPU memory per session. At 80 GB per GPU, this limits concurrent sessions to roughly 38 per GPU. The provider implements a paged attention system (like vLLM's PagedAttention) that manages cache memory in fixed-size blocks, reducing fragmentation and increasing concurrent sessions to 52 per GPU: a 37% improvement in throughput with no quality impact.
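Back-of-envelope cache sizing follows directly from the formula in the previous section. The configuration below is hypothetical (loosely modeled on a 70B-class model with 80 layers, 128-dimensional heads, and fp16 precision), not the benchmarked provider's actual deployment:

```python
def kv_cache_bytes(layers, seq_len, kv_heads, head_dim, bytes_per_value=2):
    """Per-sequence KV-cache size: layers x tokens x 2 (K and V) x K/V width x precision."""
    return layers * seq_len * 2 * kv_heads * head_dim * bytes_per_value

# Hypothetical 70B-class config in fp16: 80 layers, head_dim 128,
# 2,500 cached tokens (2,000-token prompt + 500 generated).
full_mha = kv_cache_bytes(layers=80, seq_len=2500, kv_heads=64, head_dim=128)
gqa = kv_cache_bytes(layers=80, seq_len=2500, kv_heads=8, head_dim=128)  # GQA: 8 shared K/V heads
print(f"MHA: {full_mha / 2**30:.2f} GiB, GQA: {gqa / 2**30:.2f} GiB")  # GQA is 8x smaller here
```

The eightfold reduction from grouped-query attention is why most recent large models ship with GQA: it is often the difference between a handful of concurrent sessions per GPU and dozens.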



Related Concepts

Attention Mechanism
The mathematical mechanism that allows transformers to dynamically focus on the most relevant parts of the input when processing each token
Transformer
The neural network architecture underlying all modern LLMs, using attention mechanisms to process text
Context Window
The maximum number of tokens an LLM can process in a single request
RAG (Retrieval-Augmented Generation)
A technique that combines LLMs with external knowledge retrieval to improve accuracy and reduce hallucinations

