
Prompt caching is an optimization technique in which the processed representation of a prompt prefix (such as a system prompt, knowledge base, or set of few-shot examples) is stored on the LLM provider's servers and reused across subsequent requests, dramatically reducing both cost and latency. The first request that establishes the cache pays a modest write premium (typically 1.25× the normal token cost), but every subsequent request that shares the same prefix reads from cache at a 90% discount (0.1× normal cost). For high-volume AI applications where the same system prompt and context are sent with every request, prompt caching can reduce API costs by over 80% while also cutting response latency by roughly a factor of three, since cached tokens don't need to be reprocessed.
Why it matters
Prompt caching is perhaps the highest-ROI optimization available for production LLM applications. Consider a customer support chatbot that sends an 8,000-token system prompt, a 10,000-token knowledge base, and only 500 tokens of unique user input with each request. Without caching, the full 18,500 tokens are processed every time. With caching, the 18,000 stable tokens are processed once and then read from cache at 10% cost on every subsequent request. At 1,000 requests per day, this translates to annual savings of over €20,000, from a two-hour implementation that simply adds cache-control markers to API requests. The break-even point is roughly 1.4 requests with the same prefix (the 1.25× write cost is recouped after about 1.4 discounted reads), so caching pays for itself almost immediately. For any application making more than a handful of requests per day with a stable system prompt, not using prompt caching is leaving money on the table.
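The chatbot arithmetic above can be checked in a few lines. This is a back-of-envelope sketch in abstract "normal-token units" (no real provider prices), using the article's token counts and the 1.25×/0.1× multipliers; the function name `cost_units` is just illustrative.

```python
CACHE_WRITE_MULT = 1.25   # first request pays a 25% write premium
CACHE_READ_MULT = 0.10    # cache hits cost 10% of the normal rate

def cost_units(cached_tokens: int, fresh_tokens: int, requests: int,
               use_cache: bool) -> float:
    """Total cost in normal-token units across one cache lifetime."""
    if not use_cache:
        return (cached_tokens + fresh_tokens) * requests
    write = cached_tokens * CACHE_WRITE_MULT                 # first request
    reads = cached_tokens * CACHE_READ_MULT * (requests - 1)  # later hits
    return write + reads + fresh_tokens * requests

# Chatbot scenario: 18,000 stable tokens, 500 fresh, 1,000 requests/day.
without = cost_units(18_000, 500, 1_000, use_cache=False)
with_cache = cost_units(18_000, 500, 1_000, use_cache=True)
print(f"daily savings: {1 - with_cache / without:.1%}")  # roughly 87% cheaper

# Break-even: a full 1.25x cache write is recouped once the 0.9x
# per-read savings add up, i.e. after 1.25 / 0.90 ≈ 1.4 cache reads.
break_even_reads = CACHE_WRITE_MULT / (1 - CACHE_READ_MULT)
```

Note that the per-day savings here (about 87%) are larger than the conservative "over 80%" figure in the text, since real workloads also pay occasional cache rewrites after expiry.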
How it works
When an LLM provider receives a request with caching enabled, it computes and stores the internal key-value (KV) representations of the cacheable prefix: the computationally expensive intermediate results that the transformer produces while processing those tokens. On subsequent requests with an identical prefix, the provider detects the cache hit, skips the expensive computation for the cached portion, and processes only the new, non-cached tokens. The cache typically expires after 5 minutes of inactivity, so low-traffic periods incur a "cold start" cache-write cost when traffic resumes. Implementation is straightforward: you mark stable prompt sections (system prompt, knowledge base, examples) with a cache-control parameter, and the API handles the rest. Multi-level caching strategies separate content by update frequency: completely static guidelines in one cached block, semi-static product catalogs in another, and dynamic per-user context outside the cache entirely.
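As a concrete sketch of the multi-level layout, here is what a cached request body might look like using Anthropic's `cache_control` convention for the Messages API. The payload is built as a plain dict so no SDK is required, the model name and placeholder strings are illustrative assumptions, and other providers mark cacheable content differently (some cache shared prefixes automatically).

```python
# Two cache breakpoints, ordered from most to least stable: a breakpoint
# covers everything up to and including its block, so updating the
# catalog leaves the static-guidelines prefix cache intact.
STATIC_GUIDELINES = "placeholder: full support guidelines"  # rarely changes
PRODUCT_CATALOG = "placeholder: current product catalog"    # updated daily

def build_request(user_message: str) -> dict:
    return {
        "model": "claude-sonnet-4-20250514",  # example model name
        "max_tokens": 1024,
        "system": [
            {   # level 1: completely static content
                "type": "text",
                "text": STATIC_GUIDELINES,
                "cache_control": {"type": "ephemeral"},
            },
            {   # level 2: semi-static content, separate breakpoint
                "type": "text",
                "text": PRODUCT_CATALOG,
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # level 3: dynamic per-user content stays outside the cache
        "messages": [{"role": "user", "content": user_message}],
    }
```

Sending this payload on every request lets the provider reuse the KV representations for both cached system blocks while reprocessing only the user message.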
Example
A SaaS platform runs an AI document analysis tool that processes customer-uploaded contracts. Each request includes a 6,000-token system prompt with analysis instructions, a 12,000-token legal reference library, and the variable-length customer document (3,000 tokens on average). At 5,000 analyses per day, the uncached monthly API bill is €32,000. The team implements two-level prompt caching: the system prompt and legal library (18,000 tokens) are marked as cacheable. The first request each morning pays 1.25× for the cache write; the remaining 4,999 requests read those 18,000 tokens at 0.1× cost, paying full price only for the 3,000 variable document tokens. Monthly cost drops to roughly €7,200, a 77% reduction. They add a keep-alive ping every 4 minutes during business hours to prevent cache expiration during brief traffic lulls; it costs an additional €80 per month but eliminates the sporadic cold-start latency spikes users were noticing.
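The example's arithmetic can be verified directly. The sketch below uses the article's token counts and multipliers and ignores the once-per-morning write premium, which is negligible across 5,000 daily requests; it lands near €7,300, consistent with the article's roughly €7,200 after rounding, before the €80 of keep-alive pings.

```python
# Per-request cost after caching, as a fraction of the uncached cost.
CACHED, VARIABLE = 18_000, 3_000   # tokens per request, from the example
READ_MULT = 0.10                   # cache-hit discount

fraction = (CACHED * READ_MULT + VARIABLE) / (CACHED + VARIABLE)
monthly = 32_000 * fraction        # EUR, scaled from the uncached bill
print(f"{monthly:,.0f} EUR/month, {1 - fraction:.0%} reduction")
```

The keep-alive pings are cheap for the same reason the steady-state requests are: each ping re-reads the 18,000 cached tokens at 0.1× rather than paying a fresh 1.25× write.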