
KV-Cache (Key-Value Cache) is a memory optimization technique used in transformer-based language models that stores the key and value tensors computed during the attention mechanism for all previously processed tokens. During autoregressive text generation, where the model produces one token at a time, the KV-cache eliminates the need to recompute keys and values over the entire sequence for each new token, reducing per-token computational cost from quadratic to linear in sequence length. The optimization is so fundamental that virtually every production LLM deployment enables KV-caching by default, typically achieving 3-5× speedups in generation latency. Without it, real-time conversational AI would be impractically slow for sequences beyond a few hundred tokens.
Why it matters
KV-caching is what makes interactive LLM applications feasible at scale. Without caching, generating the 100th token in a sequence requires recomputing keys and values for all 99 previous tokens, and the 200th requires recomputing 199. Per-token recomputation cost therefore grows linearly with position (token 4,000 in a long conversation costs roughly 40× as much as token 100), and the total cost of generating a sequence grows quadratically. KV-caching collapses each step to a single incremental operation: compute Q, K, and V for the new token only, attend against the cached keys and values, and append the new key-value pair to the cache. For businesses running LLM inference at scale, KV-caching directly determines cost and responsiveness. The trade-off is memory: a KV-cache for a model like Claude or GPT-4 can consume 1-4 GB per active session on long contexts, making memory capacity the primary bottleneck on concurrent user count in production deployments.
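The cost growth described above can be made concrete with a toy count of K/V projections per layer. This is an illustrative sketch, not a benchmark; it only counts how many per-token key/value computations each strategy performs:

```python
def kv_computations(n, cached):
    """Number of per-token K/V projections needed to generate n tokens (one layer)."""
    if cached:
        return n                      # each token's K and V are computed exactly once
    return sum(range(1, n + 1))       # step t recomputes K and V for all t tokens so far

for n in (100, 1000, 4000):
    print(f"{n:>5} tokens: uncached={kv_computations(n, cached=False):>9,}  "
          f"cached={kv_computations(n, cached=True):,}")
```

The uncached count grows as n(n+1)/2, i.e. quadratically, while the cached count grows linearly.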
How it works
In a transformer's attention mechanism, each token is projected into three vectors: Query (Q), Key (K), and Value (V). Attention scores are computed by comparing the current token's Query against all Keys, then using those scores to weight the Values. During autoregressive generation, the model processes tokens sequentially. Without caching, it recomputes K and V for every previous token at each step. The KV-cache stores the K and V vectors for all tokens processed so far. When generating a new token, the model only computes Q, K, and V for that single new token, retrieves the cached K and V from all previous tokens, concatenates the new K and V to the cache, and computes attention using the full set. Each transformer layer maintains its own cache, so total cache size per sequence equals: 2 × layers × sequence_length × kv_dimension × precision_bytes, where kv_dimension is the total width of the key/value projections (equal to the hidden dimension in standard multi-head attention). Advanced techniques like grouped-query attention (GQA) and multi-query attention (MQA) reduce cache size by sharing K and V heads across query heads, shrinking kv_dimension.
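The generation loop above can be sketched for a single attention head in NumPy. The projection matrices Wq, Wk, and Wv are random stand-ins for a real model's learned weights, and the code checks that cached incremental attention produces exactly the same outputs as recomputing K and V from scratch at every step:

```python
import numpy as np

def attention(q, K, V):
    """Scaled dot-product attention for a single query vector q against K, V."""
    scores = K @ q / np.sqrt(q.shape[0])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
d, n = 8, 6                                  # head dimension, sequence length
Wq, Wk, Wv = (rng.standard_normal((d, d)) for _ in range(3))
tokens = rng.standard_normal((n, d))         # stand-in token embeddings

# Cached generation: project each token's K and V once, then append to the cache.
K_cache, V_cache = np.empty((0, d)), np.empty((0, d))
cached_out = []
for x in tokens:
    K_cache = np.vstack([K_cache, x @ Wk])
    V_cache = np.vstack([V_cache, x @ Wv])
    cached_out.append(attention(x @ Wq, K_cache, V_cache))

# Uncached reference: recompute K and V for the whole prefix at every step.
for t in range(n):
    prefix = tokens[: t + 1]
    ref = attention(tokens[t] @ Wq, prefix @ Wk, prefix @ Wv)
    assert np.allclose(ref, cached_out[t])
print("cached attention matches full recomputation")
```

A real implementation keeps one such cache per layer (and per K/V head) and batches the projections, but the append-then-attend structure is the same.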
Example
A cloud provider benchmarks their LLM serving infrastructure with and without KV-caching on a 70B-parameter model. Without caching, generating a 500-token response to a 2,000-token prompt takes 12.4 seconds — each new token requires a full attention pass over the growing sequence. With KV-caching enabled, the same generation completes in 2.8 seconds: the prompt is processed once (prefill phase, 1.6 seconds) and the 500-token generation phase takes only 1.2 seconds because each token only computes attention against the cached keys and values. The speedup is 4.4×. However, the cache consumes 2.1 GB of GPU memory per session. At 80 GB per GPU, this limits concurrent sessions to roughly 38 per GPU. The provider implements a paged attention system (like vLLM's PagedAttention) that manages cache memory in fixed-size blocks, reducing fragmentation and increasing concurrent sessions to 52 per GPU — a 37% improvement in throughput with no quality impact.
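The figures in this example are internally consistent, as a quick arithmetic check shows (numbers taken directly from the benchmark above):

```python
# Speedup from enabling the KV-cache
no_cache_s, cached_s = 12.4, 2.8
print(f"speedup: {no_cache_s / cached_s:.1f}x")            # 4.4x

# Concurrent sessions limited by cache memory per GPU
cache_gb, gpu_gb = 2.1, 80
sessions = int(gpu_gb // cache_gb)
print(f"sessions per GPU: {sessions}")                     # 38

# Throughput gain from paged cache management
paged_sessions = 52
print(f"throughput gain: {paged_sessions / sessions - 1:.0%}")  # 37%
```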