What Is the KV Cache?

A memory optimization that stores previously computed key and value tensors in transformer attention layers, avoiding redundant computation and accelerating generation by an estimated 3-5×

Also known as:
Key-Value Cache
KV Cache
Attention Cache
What Is the KV Cache? How Key-Value Caching Accelerates LLM Inference

The KV cache (key-value cache) is a memory optimization used in transformer-based language models. It stores the key and value tensors computed by the attention mechanism for all previously processed tokens. During autoregressive text generation, where the model produces one token at a time, the cache removes the need to recompute keys and values for the entire sequence at each step, reducing the attention cost of each new token from quadratic to linear in sequence length. The optimization is fundamental enough that virtually every production LLM deployment uses KV caching by default, with generation speedups on the order of 3-5× that grow with sequence length. Without it, real-time conversational AI would be impractically slow for sequences beyond a few hundred tokens.

Why it matters

KV caching is what makes interactive LLM applications feasible at scale. Without caching, generating the 100th token in a sequence requires recomputing keys, values, and attention for all 99 previous tokens, and the 200th token requires the same for 199. The redundant work grows with every step: at 4,000 tokens of context, each step reprocesses about 40× as many tokens as it did at 100 tokens, so per-token latency keeps climbing as a conversation gets longer. KV caching removes the recomputation. For each new token the model computes one query, key, and value, attends over the stored keys and values, and appends the new key-value pair to the cache. The projection work per token becomes constant, and the attention step grows only linearly with the number of cached tokens.

For businesses running LLM inference at scale, KV caching directly determines cost and responsiveness. The trade-off is memory: for a large model, the cache can consume gigabytes per active session on long contexts (on the order of 1-4 GB, depending on architecture and numeric precision), which often makes memory capacity the primary bottleneck for the number of concurrent users in production deployments.
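The growth in redundant work can be made concrete with a small counting sketch. It is a simplified model, not a benchmark: it counts only how many per-token key/value projections one attention layer performs, and the function names are hypothetical, chosen for this illustration.

```python
def kv_projections_without_cache(num_tokens: int) -> int:
    """K/V projections when every step reprocesses the whole sequence so far."""
    # Step t (1-based) re-projects all t tokens seen up to that point.
    return sum(t for t in range(1, num_tokens + 1))


def kv_projections_with_cache(num_tokens: int) -> int:
    """K/V projections when earlier keys and values are stored and reused."""
    # Each token is projected exactly once and then kept in the cache.
    return num_tokens


if __name__ == "__main__":
    for n in (100, 1000, 4000):
        slow = kv_projections_without_cache(n)
        fast = kv_projections_with_cache(n)
        print(f"{n:>5} tokens: {slow:>9} vs {fast:>5} projections ({slow / fast:.1f}x)")
```

The count ignores the attention scores themselves, which still scale with the number of cached tokens at every step, so it shows where caching saves work rather than the total speedup.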

How it works

In a transformer's attention mechanism, each token is projected into three vectors: Query (Q), Key (K), and Value (V). Attention scores are computed by comparing the current token's Query against all Keys; the scores are then used to weight the Values. During autoregressive generation, the model processes tokens sequentially. Without caching, it recomputes K and V for every previous token at each step. The KV cache stores the K and V vectors for all tokens processed so far. When generating a new token, the model computes Q, K, and V only for that single new token, appends the new K and V to the cache, and computes attention between the new Q and the full set of cached K and V. Each transformer layer maintains its own cache, so for one sequence the total cache size is: layers × sequence_length × 2 × hidden_dimension × precision_bytes, where the factor 2 counts K and V and hidden_dimension is the combined width of all key-value heads (equal to the model's hidden dimension in standard multi-head attention). Techniques such as grouped-query attention (GQA) and multi-query attention (MQA) reduce cache size by sharing K and V heads across multiple query heads, which shrinks that width.
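The mechanism described above can be sketched for a single attention head in one layer. This is a minimal illustration in plain Python, not how any particular framework implements it: all function names are hypothetical, the weights are random, and real systems use batched tensor operations across many heads and layers. The sketch checks that decoding with a cache gives the same outputs as recomputing K and V at every step, and it includes the size formula from the paragraph above.

```python
import math
import random


def matvec(vec, mat):
    """Multiply a row vector (length n) by an n x m matrix given as nested lists."""
    return [sum(v * row[j] for v, row in zip(vec, mat)) for j in range(len(mat[0]))]


def attend(q, keys, values):
    """Scaled dot-product attention of one query over lists of keys and values."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]


def generate_without_cache(tokens, w_q, w_k, w_v):
    """At every step, re-project K and V for the whole prefix."""
    outputs = []
    for t in range(len(tokens)):
        keys = [matvec(x, w_k) for x in tokens[: t + 1]]    # recomputed each step
        values = [matvec(x, w_v) for x in tokens[: t + 1]]  # recomputed each step
        outputs.append(attend(matvec(tokens[t], w_q), keys, values))
    return outputs


def generate_with_cache(tokens, w_q, w_k, w_v):
    """Project K and V once per token and append them to the cache."""
    cache_k, cache_v, outputs = [], [], []
    for x in tokens:
        cache_k.append(matvec(x, w_k))  # only the new token is projected
        cache_v.append(matvec(x, w_v))
        outputs.append(attend(matvec(x, w_q), cache_k, cache_v))
    return outputs


def kv_cache_bytes(layers, seq_len, kv_dim, bytes_per_value, batch=1):
    """Cache size for `batch` sequences; kv_dim = number of KV heads x head size."""
    return layers * seq_len * 2 * kv_dim * bytes_per_value * batch


def random_matrix(rng, rows, cols):
    """Helper for the demo: a rows x cols matrix of Gaussian samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    d_model, d_head, seq_len = 8, 4, 6
    tokens = random_matrix(rng, seq_len, d_model)
    w_q, w_k, w_v = (random_matrix(rng, d_model, d_head) for _ in range(3))
    slow = generate_without_cache(tokens, w_q, w_k, w_v)
    fast = generate_with_cache(tokens, w_q, w_k, w_v)
    assert all(math.isclose(a, b, abs_tol=1e-12)
               for r, s in zip(slow, fast) for a, b in zip(r, s))
    # Illustrative shape only: 32 layers, 4,096-wide K/V, 16-bit values, 2,048 tokens.
    print(kv_cache_bytes(32, 2048, 4096, 2) / 2**30, "GiB")
```

For GQA or MQA, passing a smaller `kv_dim` (fewer key-value heads) shows how sharing K and V heads shrinks the cache proportionally.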

Example

A cloud provider benchmarks its LLM serving infrastructure with and without KV caching on a 70B-parameter model. Without caching, generating a 500-token response to a 2,000-token prompt takes 12.4 seconds, because each new token requires a full attention pass over the growing sequence. With KV caching enabled, the same generation completes in 2.8 seconds: the prompt is processed once (the prefill phase, 1.6 seconds), and the 500-token generation phase takes only 1.2 seconds because each new token computes attention only against the cached keys and values. The speedup is 4.4×. However, the cache consumes 2.1 GB of GPU memory per session. At 80 GB per GPU, this limits concurrent sessions to roughly 38 per GPU. The provider implements a paged attention system (like vLLM's PagedAttention) that manages cache memory in fixed-size blocks, reducing fragmentation and increasing concurrent sessions to 52 per GPU: 37% more concurrent sessions with no impact on output quality.
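The arithmetic in this example can be reproduced with a few lines of Python. The figures are the ones quoted above; the helper name and its `reserved_gb` parameter are hypothetical additions, included only to show how memory set aside for model weights or activations would lower the session count.

```python
import math


def sessions_per_gpu(gpu_gb: float, cache_gb_per_session: float,
                     reserved_gb: float = 0.0) -> int:
    """Whole sessions that fit when each one holds a fixed-size cache."""
    # reserved_gb is an assumed knob for memory not available to the cache.
    return math.floor((gpu_gb - reserved_gb) / cache_gb_per_session)


if __name__ == "__main__":
    no_cache_s = 12.4                # generation time without caching
    prefill_s, decode_s = 1.6, 1.2   # cached run: prompt pass plus 500 new tokens
    cached_s = prefill_s + decode_s

    print(f"speedup: {no_cache_s / cached_s:.1f}x")
    print(f"decode latency: {decode_s / 500 * 1000:.1f} ms per token")

    baseline = sessions_per_gpu(80, 2.1)
    paged = 52                       # figure quoted for the paged setup
    print(f"sessions per GPU: {baseline} -> {paged}")
    print(f"increase: {(paged - baseline) / baseline:.0%}")
```

Setting `reserved_gb` to a nonzero value gives a rough sense of how quickly the session budget drops once other memory consumers are counted.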

Sources

  1. Vaswani et al., "Attention Is All You Need" (arXiv)
  2. Pope et al., "Efficiently Scaling Transformer Inference" (arXiv)
  3. Wikipedia

© 2026 BVDNET. All rights reserved.
