What is the KV Cache in LLM Inference?

The KV Cache (key-value cache) is a memory structure used during LLM inference that stores the key and value tensors computed for previous tokens. Each new token can then attend to all prior context without recomputing keys and values for tokens already seen, which sharply reduces per-token compute during autoregressive generation.

Last updated 2026-04-29 · BearPlex AI Engineering Team

Overview

The KV Cache is one of the most important production optimizations in LLM inference. Without it, every generation step would have to recompute keys and values for all prior tokens, so the work per step grows with sequence length and the total cost of generating a sequence grows at least quadratically. With a KV Cache, each token's keys and values are computed once and reused: each step only projects the new token and attends over the cached entries, so the per-step projection cost is constant while the attention read still grows linearly with context length. The trade-off is memory: the KV Cache consumes a significant amount of GPU memory (proportional to context length × number of layers × hidden size, for every concurrent sequence). KV Cache memory is often the binding constraint on LLM serving capacity, limiting how many concurrent requests a GPU can serve. Production inference engines (vLLM, TensorRT-LLM, llama.cpp) implement sophisticated KV Cache management (PagedAttention, prefix caching, eviction policies) to maximize serving capacity.
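
The scaling argument above can be checked with a small counting sketch. It counts only how many per-token K/V projections are performed when generating n tokens from scratch; the function names are illustrative, and the attention read itself is deliberately left out.

```python
def kv_projections_without_cache(n_tokens: int) -> int:
    """Step t re-projects K and V for all t tokens seen so far."""
    return sum(range(1, n_tokens + 1))


def kv_projections_with_cache(n_tokens: int) -> int:
    """Each token's K and V are projected once, then read from the cache."""
    return n_tokens


for n in (10, 100, 1000):
    print(n, kv_projections_without_cache(n), kv_projections_with_cache(n))
```

The first count is n(n+1)/2, which grows quadratically, while the second grows linearly.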

How KV Cache works

During LLM inference, each generation step computes attention between the new token's query vector and the key and value vectors of all prior tokens. Without caching, you would recompute K and V for all prior tokens at every step. The KV Cache stores K and V from prior tokens, so each step only needs to compute K and V for the new token and attend across the cached K/V plus the new entries. For the K/V computation, this is the difference between O(n²) and O(n) total work for generating n tokens. The cache grows as generation proceeds: for a sequence of n tokens with hidden size d and L layers, the KV Cache size is approximately 2 × L × n × d × 2 bytes in FP16 (the first 2 counts K and V, the last 2 is the bytes per FP16 value). For a 70B model with 80 layers and hidden size 8192, that is about 2.5 MB per token, so a 4K-token sequence needs roughly 10 GB of GPU memory for the KV Cache alone when every attention head keeps its own K and V. Models that use grouped-query attention (GQA) store proportionally less.
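
As a minimal sketch of the mechanism described above, the following pure-Python example runs one attention head over a handful of toy vectors, once by re-projecting K and V at every step and once with a cache, and checks that both give the same output. The hidden size, the random weights, and the `KVCache` class are assumptions made for illustration; a real model has many heads and layers and runs on tensors.

```python
import math
import random

random.seed(0)
D = 4  # toy hidden size (assumption; real models use thousands)


def rand_matrix(rows, cols):
    return [[random.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


def project(x, W):
    """Row vector x (length D) times matrix W (D x D)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]


def attend(q, keys, values):
    """Softmax attention of one query over lists of key and value vectors."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(D) for k in keys]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[j] for e, v in zip(exps, values)) for j in range(D)]


W_q, W_k, W_v = rand_matrix(D, D), rand_matrix(D, D), rand_matrix(D, D)


def step_without_cache(xs):
    """Re-project K and V for every token seen so far, then attend."""
    keys = [project(x, W_k) for x in xs]
    values = [project(x, W_v) for x in xs]
    return attend(project(xs[-1], W_q), keys, values)


class KVCache:
    """Holds K and V for earlier tokens; only the new token is projected."""

    def __init__(self):
        self.keys, self.values = [], []

    def step(self, x):
        self.keys.append(project(x, W_k))
        self.values.append(project(x, W_v))
        return attend(project(x, W_q), self.keys, self.values)


tokens = rand_matrix(6, D)  # stand-ins for token hidden states
cache = KVCache()
max_diff = 0.0
for t in range(len(tokens)):
    cached = cache.step(tokens[t])
    recomputed = step_without_cache(tokens[: t + 1])
    max_diff = max(max_diff, max(abs(a - b) for a, b in zip(cached, recomputed)))

print(f"largest difference between cached and recomputed outputs: {max_diff:.2e}")
```

The cached path performs one K projection and one V projection per step, while the uncached path performs one per token seen so far; the attention output is the same either way.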

Production KV Cache management

Production inference engines manage KV Cache aggressively. PagedAttention (vLLM's contribution) treats KV Cache like virtual memory with paging: it allocates cache in fixed-size blocks rather than contiguous arrays, which sharply reduces memory waste from variable-length sequences. Prefix caching shares the KV Cache for a common prefix (system prompt, retrieved context) across the requests that start with it, a major cost saving for production RAG and chat applications. Eviction policies decide which cache entries to drop, based on usage patterns, when KV Cache memory fills. Together, these techniques can raise throughput 5-10× over naive serving, and production deployments depend on them.
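
The block-based allocation idea can be sketched with a toy allocator. This is not vLLM's implementation: the class, its method names, and the block size are hypothetical, and it tracks only block bookkeeping rather than any actual K/V tensors.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence maps to a list of fixed-size blocks."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        length = self.lengths.get(seq_id, 0)
        if length % self.block_size == 0:  # no block yet, or last block is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; an engine would evict or preempt here")
            self.block_tables.setdefault(seq_id, []).append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the free pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Allocated token slots that hold no token yet."""
        return sum(
            len(blocks) * self.block_size - self.lengths[seq_id]
            for seq_id, blocks in self.block_tables.items()
        )


alloc = PagedKVAllocator(num_blocks=8, block_size=16)
for _ in range(40):
    alloc.append_token("a")
for _ in range(5):
    alloc.append_token("b")

print(alloc.block_tables, "wasted slots:", alloc.wasted_slots())
```

Because blocks are claimed only when a sequence actually needs them, waste is bounded by one partially filled block per sequence, instead of the gap between each sequence's real length and a preallocated maximum.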

Why KV Cache memory is the binding constraint

GPU memory has two main demands: model weights (fixed) and KV Cache (variable, growing with sequence length and the number of concurrent requests). For a 70B model, weights take ~140GB in FP16 or ~35GB in INT4. The remaining GPU memory holds the KV Cache. A single H100 80GB GPU running an INT4-quantized 70B model has ~45GB available for KV Cache: enough for maybe 50-100 concurrent moderate-context conversations if the model uses grouped-query attention to shrink its per-token cache, or far fewer long-context requests. KV Cache memory therefore determines serving capacity per GPU. This is why long-context LLMs are expensive: long contexts require a huge KV Cache, which sharply reduces concurrent serving capacity. Quantization techniques (FP8 KV Cache, INT8 KV Cache) help by reducing per-token cache size at a small accuracy cost.
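
The capacity arithmetic can be made concrete with a short calculation. The ~45GB budget, 80 layers, and hidden size 8192 come from the text; the grouped-query configuration (8 KV heads of dimension 128, giving a K/V width of 1024) and the 2,048-token conversation length are assumptions chosen for illustration.

```python
def kv_bytes_per_token(num_layers: int, kv_dim: int, bytes_per_value: int = 2) -> int:
    """K and V (the factor 2) for every layer, kv_dim values each."""
    return 2 * num_layers * kv_dim * bytes_per_value


def max_cached_tokens(budget_bytes: int, num_layers: int, kv_dim: int,
                      bytes_per_value: int = 2) -> int:
    return budget_bytes // kv_bytes_per_token(num_layers, kv_dim, bytes_per_value)


budget = 45 * 10**9   # ~45GB left for KV Cache after INT4 weights
conversation = 2048   # assumed tokens per "moderate-context" conversation

# Full multi-head attention: K/V width equals the hidden size.
mha_tokens = max_cached_tokens(budget, num_layers=80, kv_dim=8192)
# Assumed GQA layout: 8 KV heads x 128 dims = 1024 values per K or V.
gqa_tokens = max_cached_tokens(budget, num_layers=80, kv_dim=1024)

print("MHA:", mha_tokens, "tokens,", mha_tokens // conversation, "conversations")
print("GQA:", gqa_tokens, "tokens,", gqa_tokens // conversation, "conversations")
```

Under these assumptions the budget holds about 17 thousand tokens (8 such conversations) with full multi-head attention and about 137 thousand tokens (67 conversations) with the GQA layout, which shows how strongly the per-token cache size drives serving capacity.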

Use cases

  • Production LLM serving (every modern inference engine uses KV Cache)
  • Prefix caching for RAG (share KV Cache for retrieved context across users)
  • Multi-turn chat applications (KV Cache persists across turns within a conversation)
  • Optimizing serving capacity per GPU
  • Long-context LLM applications (KV Cache management is critical)

Examples in production

vLLM

The PagedAttention paper introduced block-based KV Cache management and became the foundation for vLLM, one of the most widely used open-source LLM inference engines.

Anthropic prompt caching

Anthropic's prompt caching feature exposes prefix caching to API users: cached prefix tokens are billed at a 90% discount, a large cost saving for production applications with stable system prompts.

OpenAI prompt caching

OpenAI's prompt caching feature exposes prefix caching with a 50% discount on cached prefixes: the same idea as Anthropic's, but at a smaller discount.

KV Cache compared to alternatives

Alternative: Recomputing attention each step
Compute K and V for all tokens at every generation step.
Choose KV Cache when: KV Cache is essentially mandatory for production inference; recomputing is wasteful.
Choose alternative when: Recomputing is only used in research, or for very short sequences where caching overhead exceeds compute savings.

Alternative: Quantized KV Cache (INT8, FP8)
Reduce KV Cache memory by quantizing K and V values.
Choose KV Cache when: You want an FP16 KV Cache for highest quality.
Choose alternative when: Use an INT8 / FP8 KV Cache when memory is the binding constraint and a small accuracy loss is acceptable.
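
To make the quantized alternative concrete, here is a minimal sketch of symmetric INT8 quantization applied to a single key vector. The per-vector scale, the 128-value vector, and the 4-byte scale size are assumptions for illustration; production engines choose their own granularity and formats.

```python
import random


def quantize_int8(values):
    """Symmetric per-vector INT8: one float scale plus one signed byte per value."""
    scale = max(abs(v) for v in values) / 127 or 1.0  # fall back to 1.0 for all-zero input
    return scale, [max(-127, min(127, round(v / scale))) for v in values]


def dequantize_int8(scale, quantized):
    return [scale * q for q in quantized]


random.seed(0)
key_vector = [random.gauss(0, 1) for _ in range(128)]  # stand-in for one head's K

scale, quantized = quantize_int8(key_vector)
restored = dequantize_int8(scale, quantized)
max_error = max(abs(a - b) for a, b in zip(key_vector, restored))

fp16_bytes = 2 * len(key_vector)
int8_bytes = len(key_vector) + 4  # assumes the scale is stored as one 4-byte float

print(f"FP16: {fp16_bytes} bytes, INT8: {int8_bytes} bytes, max abs error: {max_error:.4f}")
```

The rounding error per value is bounded by half the scale, which is the "small accuracy loss" traded for roughly half the memory.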

Common pitfalls

  • Underestimating KV Cache memory requirements: long contexts can blow up memory unexpectedly
  • Not using prefix caching for RAG / chat applications: major cost / latency benefit available
  • Mixing concurrent requests with very different context lengths: wastes KV Cache memory
  • Naive serving without PagedAttention or equivalent: significant capacity loss
  • Trying to serve long-context models without provisioning sufficient GPU memory

FAQ

Questions about KV Cache.

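How much memory does the KV Cache use?

It depends on the model and the sequence length. For a 70B model with 80 layers and hidden size 8192, an FP16 KV Cache takes about 2.5 MB per token, so a 4K-token sequence uses roughly 10 GB of GPU memory when every attention head keeps its own K and V; grouped-query attention reduces this in proportion to the number of KV heads. Long contexts (128K+ tokens) can use 10-50+ GB for a single request's KV Cache even with such reductions. KV Cache is often the binding constraint on serving capacity per GPU.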

What is prefix caching, and should I use it?

Prefix caching shares KV Cache across requests with common prefixes (system prompt, retrieved context). For production applications with stable system prompts, prefix caching sharply reduces cost (Anthropic offers a 90% discount and OpenAI a 50% discount on cached tokens) and latency. In production, the answer is almost always yes.
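
A toy sketch of how prefix sharing can be detected: split each request's tokens into fixed-size blocks and key every block by a hash of the whole prefix up to and including it, so two requests hit the same entries only while their prefixes match. The block size, the `PrefixCache` class, and the placeholder values are hypothetical and do not reflect any particular engine's or provider's implementation.

```python
import hashlib

BLOCK = 4  # tokens per cache block (toy value)


def block_keys(token_ids):
    """One key per full block; each key hashes the entire prefix up to that block."""
    keys = []
    for end in range(BLOCK, len(token_ids) + 1, BLOCK):
        prefix = ",".join(map(str, token_ids[:end])).encode()
        keys.append(hashlib.sha256(prefix).hexdigest())
    return keys


class PrefixCache:
    def __init__(self):
        self.blocks = {}  # key -> placeholder for a cached K/V block

    def lookup_and_fill(self, token_ids):
        """Return how many leading tokens were already cached, then cache the rest."""
        hits = 0
        matching = True
        for key in block_keys(token_ids):
            if matching and key in self.blocks:
                hits += BLOCK
            else:
                matching = False
                self.blocks[key] = "kv-block"
        return hits


system_prompt = list(range(100, 112))  # 12 shared tokens
cache = PrefixCache()
first = cache.lookup_and_fill(system_prompt + [1, 2, 3, 4, 5])
second = cache.lookup_and_fill(system_prompt + [7, 8, 9, 10])

print("tokens reused by first request:", first)
print("tokens reused by second request:", second)
```

The first request finds nothing cached, while the second reuses the 12 system-prompt tokens and only needs fresh K/V for its own suffix.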

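Can KV Cache memory be reduced?

Yes. Quantization (an FP8 or INT8 KV Cache instead of FP16) halves KV Cache memory at a small accuracy cost, and 4-bit formats quarter it. Architectural techniques also help: grouped-query attention (GQA) shares K/V across query heads, so fewer K/V vectors are stored per token. Many modern models combine both.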

What is PagedAttention?

PagedAttention is vLLM's contribution: it manages KV Cache like OS virtual memory with paging, allocating cache in fixed-size blocks rather than contiguous arrays, which sharply reduces memory waste from variable-length sequences. Block-based KV Cache management is now common in production inference engines.