What's prefix caching and should we use it?

Prefix caching shares the KV Cache across requests that begin with the same prefix (system prompt, retrieved context). For production applications with stable system prompts, it substantially reduces both cost (Anthropic discounts cached tokens by 90%, OpenAI by 50%) and latency. For production workloads, the answer is almost always yes.

Can we reduce KV Cache memory?

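Yes. Quantizing the KV Cache to FP8 or INT8 instead of FP16 halves its memory at a small accuracy cost. Grouped-query attention (GQA) is a complementary technique: several query heads share each K/V head, so the cache shrinks in proportion to the number of K/V heads. The two can be combined: many recent open-weight models use GQA, and serving engines such as vLLM offer an FP8 KV Cache option.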

What is the KV Cache in LLM Inference?

What's PagedAttention?

PagedAttention is vLLM's contribution: it manages the KV Cache the way an operating system manages virtual memory, with paging. The cache is allocated in fixed-size blocks rather than contiguous arrays, which greatly reduces the memory wasted on variable-length sequences. The approach is now standard in production inference engines.

The KV Cache (key-value cache) is a memory structure used during LLM inference that stores the key and value tensors computed for previous tokens. Each new token can then attend to all prior context without recomputing keys and values for tokens already seen, which sharply reduces per-token compute during autoregressive generation.

Last updated 2026-04-29 · BearPlex AI Engineering Team

Overview

The KV Cache is one of the most important production optimizations in LLM inference. Without it, each generation step would recompute keys and values for all prior tokens, so that work grows quadratically with sequence length over a full generation. With a KV Cache, prior computations are reused: each step projects only the new token, so the projection cost per token is constant and only the attention read over the cache grows with context. The trade-off is memory: the KV Cache consumes a large amount of GPU memory (proportional to context length × number of layers × hidden size). KV Cache memory is often the binding constraint on LLM serving capacity, limiting how many concurrent requests a GPU can serve. Production inference engines (vLLM, TensorRT-LLM, llama.cpp) implement sophisticated KV Cache management (PagedAttention, prefix caching, eviction policies) to maximize serving capacity.

How KV Cache works

During LLM inference, each generation step computes attention between the new token's query vector and all prior tokens' key and value vectors. Without caching, you'd recompute K and V for all prior tokens at every step. The KV Cache stores K and V from prior tokens, so each step only needs to compute K and V for the new token and attend across the cached K/V plus the new ones. For the K/V projections, this is the difference between O(n²) and O(n) total work when generating n tokens; the attention read over the cached entries still costs O(n) per step. The cache grows as generation proceeds: for a sequence of n tokens with hidden size d and L layers, the KV Cache size is approximately 2 × L × n × d × 2 bytes in FP16 (one factor of 2 for K and V, the other for bytes per value). For a 70B model with 80 layers and hidden size 8192, a 4K-token sequence works out to about 10.7GB of GPU memory under this formula; models that use GQA store proportionally less.
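
The mechanism described above can be sketched in a few lines of plain Python. This is a minimal single-head, single-layer illustration, not how any particular engine implements it: the helper names (`project`, `attend`, `generate_with_cache`, `generate_without_cache`) and the random weights are assumptions made for the example, and the "tokens" are random vectors standing in for hidden states.

```python
import math
import random

def project(x, W):
    """Multiply row vector x (length d_in) by matrix W (d_in rows, d_out columns)."""
    return [sum(xi * w for xi, w in zip(x, column)) for column in zip(*W)]

def attend(q, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def generate_with_cache(tokens, Wq, Wk, Wv):
    """Project K and V once per token and keep them in a growing cache."""
    k_cache, v_cache, outputs, projections = [], [], [], 0
    for x in tokens:
        k_cache.append(project(x, Wk))  # only the new token's key
        v_cache.append(project(x, Wv))  # only the new token's value
        projections += 1
        outputs.append(attend(project(x, Wq), k_cache, v_cache))
    return outputs, projections

def generate_without_cache(tokens, Wq, Wk, Wv):
    """Recompute K and V for every prior token at every step."""
    outputs, projections = [], 0
    for t in range(1, len(tokens) + 1):
        keys = [project(x, Wk) for x in tokens[:t]]
        values = [project(x, Wv) for x in tokens[:t]]
        projections += t
        outputs.append(attend(project(tokens[t - 1], Wq), keys, values))
    return outputs, projections

rng = random.Random(0)
d, n = 4, 6  # toy sizes chosen for the example
tokens = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(n)]
Wq, Wk, Wv = ([[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)] for _ in range(3))

cached_out, cached_proj = generate_with_cache(tokens, Wq, Wk, Wv)
naive_out, naive_proj = generate_without_cache(tokens, Wq, Wk, Wv)
print(cached_proj, naive_proj)
```

Both functions produce the same attention outputs; the cached version performs n K/V projections in total, while the uncached version performs n(n+1)/2.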

Production KV Cache management

Production inference engines manage the KV Cache aggressively, using three main techniques. PagedAttention (vLLM's contribution) treats the KV Cache like virtual memory with paging: it allocates the cache in fixed-size blocks rather than contiguous arrays, greatly reducing memory waste from variable-length sequences. Prefix caching shares the KV Cache for a common prefix (system prompt, retrieved context) across the requests that use it, a major cost saving for production RAG and chat applications. Eviction policies decide which cache entries to drop, based on usage patterns, when KV Cache memory fills. Together, these techniques can raise throughput 5-10× over naive serving, and production deployments depend on them.
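
To make the block-based idea concrete, here is a toy allocator that hands out fixed-size blocks on demand and lets a new sequence share the full blocks of an existing one through reference counts. It is a sketch under stated assumptions, not vLLM's implementation: the class `PagedKVAllocator` and its methods are hypothetical names, it tracks only block ids (no tensors), and sharing happens only at whole-block granularity.

```python
class PagedKVAllocator:
    """Toy block allocator: each sequence maps to a list of fixed-size block ids."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.refcount = {}      # block id -> number of sequences using it
        self.block_tables = {}  # sequence id -> ordered list of block ids
        self.lengths = {}       # sequence id -> number of tokens stored

    def add_sequence(self, seq_id):
        self.block_tables[seq_id] = []
        self.lengths[seq_id] = 0

    def append_token(self, seq_id):
        """Reserve room for one more token, allocating a block only when needed."""
        if self.lengths[seq_id] % self.block_size == 0:
            if not self.free_blocks:
                raise MemoryError("no free KV blocks; a real engine would evict or preempt")
            block = self.free_blocks.pop()
            self.refcount[block] = 1
            self.block_tables[seq_id].append(block)
        self.lengths[seq_id] += 1

    def fork(self, parent_id, child_id):
        """Start a child sequence that shares the parent's full blocks (its prefix)."""
        full = self.lengths[parent_id] // self.block_size
        shared = self.block_tables[parent_id][:full]
        for block in shared:
            self.refcount[block] += 1
        self.block_tables[child_id] = list(shared)
        self.lengths[child_id] = full * self.block_size

    def release(self, seq_id):
        """Drop a sequence; blocks return to the free list once nobody uses them."""
        for block in self.block_tables.pop(seq_id):
            self.refcount[block] -= 1
            if self.refcount[block] == 0:
                del self.refcount[block]
                self.free_blocks.append(block)
        del self.lengths[seq_id]

alloc = PagedKVAllocator(num_blocks=8, block_size=4)
alloc.add_sequence("a")
for _ in range(10):          # 10 tokens occupy 3 blocks, the last one partly filled
    alloc.append_token("a")
alloc.fork("a", "b")         # "b" reuses the 2 full blocks of "a"
alloc.append_token("b")      # the first new token of "b" gets its own block
print(len(alloc.free_blocks))
```

At most one partly filled block per sequence is wasted, and the shared prefix is stored once. The `MemoryError` branch marks where an eviction policy would act in a real engine.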

Why KV Cache memory is the binding constraint

GPU memory has two main demands: model weights (fixed) and the KV Cache (variable, growing with sequence length and the number of concurrent requests). For a 70B model, weights take ~140GB in FP16 or ~35GB in INT4. Whatever remains holds the KV Cache. A single H100 80GB GPU running an INT4-quantized 70B model has at most ~45GB available for KV Cache: enough for perhaps 50-100 concurrent moderate-context conversations if the model uses GQA, and far fewer with full multi-head attention or long-context requests. KV Cache memory therefore determines serving capacity per GPU. This is why long-context LLMs are expensive: long contexts need a very large KV Cache, which sharply reduces concurrent serving capacity. Quantization (FP8 or INT8 KV Cache) helps by reducing per-token cache size at a small accuracy cost.
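
The arithmetic behind this budget can be written out as a short calculation. The sketch below reuses the figures from the text (80 layers, hidden size 8192, an 80GB GPU, ~35GB of INT4 weights) and adds one assumption that is not in the text: a GQA configuration with 8 K/V heads of dimension 128. The function names are illustrative, and the result is an upper bound because activations and runtime overhead are ignored.

```python
def kv_bytes_per_token(num_layers, kv_width, bytes_per_value=2):
    """Bytes of K plus V stored per token across all layers (FP16 by default)."""
    return 2 * num_layers * kv_width * bytes_per_value

def max_cached_tokens(gpu_bytes, weight_bytes, bytes_per_token):
    """Upper bound on cacheable tokens after weights; ignores activations and overhead."""
    return (gpu_bytes - weight_bytes) // bytes_per_token

GB = 10**9
budget = dict(gpu_bytes=80 * GB, weight_bytes=35 * GB)

# Full multi-head attention: K/V width equals the hidden size.
mha_per_token = kv_bytes_per_token(num_layers=80, kv_width=8192)
# Assumed GQA layout for illustration: 8 K/V heads of dimension 128.
gqa_per_token = kv_bytes_per_token(num_layers=80, kv_width=8 * 128)

for label, per_token in [("MHA", mha_per_token), ("GQA", gqa_per_token)]:
    tokens = max_cached_tokens(bytes_per_token=per_token, **budget)
    print(f"{label}: {per_token / 1e6:.2f} MB/token, ~{tokens:,} tokens, "
          f"~{tokens // 4096} concurrent 4K-token sequences")
```

Under these assumptions the 45GB budget holds roughly 17,000 tokens with full multi-head attention and roughly 137,000 with the GQA layout, which is why the attention variant matters as much as the GPU size when estimating concurrency.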

Use cases

Production LLM serving (every modern inference engine uses KV Cache)
Prefix caching for RAG (share KV Cache for retrieved context across users)
Multi-turn chat applications (KV Cache persists across turns within a conversation)
Optimizing serving capacity per GPU
Long-context LLM applications (KV Cache management is critical)

Examples in production

vLLM

The PagedAttention paper introduced block-based KV Cache management and became the foundation for vLLM, one of the most widely used open-source LLM inference engines.

Anthropic prompt caching

Anthropic's prompt caching feature exposes prefix caching to API users. Cached prefix tokens are billed at a 90% discount, a large cost saving for production applications with stable system prompts.

OpenAI prompt caching

OpenAI's prompt caching feature exposes prefix caching with a 50% discount on cached prefix tokens: the same idea as Anthropic's, at a smaller discount.
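
A quick way to see what the two discounts mean for a bill is to compute the input cost of a prompt whose prefix is served from cache. The sketch below is plain arithmetic with a hypothetical helper, `input_cost`, and a made-up unit price of 1.0 per token; it ignores any cache-write surcharge, minimum prefix length, or cache expiry a provider may apply.

```python
def input_cost(prompt_tokens, cached_tokens, price_per_token, cached_discount):
    """Input cost when cached_tokens of the prompt are billed at a discount."""
    if not 0 <= cached_tokens <= prompt_tokens:
        raise ValueError("cached_tokens must be between 0 and prompt_tokens")
    uncached = prompt_tokens - cached_tokens
    return price_per_token * (uncached + cached_tokens * (1 - cached_discount))

# Hypothetical request: 10,000 prompt tokens, 8,000 of them a stable cached prefix.
baseline = input_cost(10_000, 0, price_per_token=1.0, cached_discount=0.0)
cost_90 = input_cost(10_000, 8_000, price_per_token=1.0, cached_discount=0.90)
cost_50 = input_cost(10_000, 8_000, price_per_token=1.0, cached_discount=0.50)

print(f"90% discount: {cost_90 / baseline:.0%} of the uncached input cost")
print(f"50% discount: {cost_50 / baseline:.0%} of the uncached input cost")
```

The saving scales with the share of the prompt that is a stable prefix, so long system prompts and reused retrieved context benefit most.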

KV Cache compared to alternatives

Alternative	Choose KV Cache when	Choose alternative when
Recomputing attention each step: compute K and V for all tokens at every generation step	Always in production: the KV Cache is essentially mandatory, and recomputing is wasteful	Only in research, or for very short sequences where caching overhead exceeds the compute savings
Quantized KV Cache (INT8, FP8): reduce KV Cache memory by quantizing K and V values	Output quality matters most and memory allows an FP16 KV Cache	Memory is the binding constraint and a small accuracy loss is acceptable
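
The quantized KV Cache row can be illustrated with a minimal INT8 round trip. This sketch assumes symmetric absmax quantization with one scale per vector, which is only one of several possible schemes; real engines choose their own granularity and may use FP8 formats instead. The function names and the random 128-value key vector are made up for the example.

```python
import random
import struct

def quantize_int8(vector):
    """Symmetric absmax quantization: int8 codes plus one float scale per vector."""
    scale = max(abs(v) for v in vector) / 127 or 1.0  # fall back to 1.0 for all-zero input
    codes = [max(-127, min(127, round(v / scale))) for v in vector]
    return codes, scale

def dequantize_int8(codes, scale):
    """Recover approximate values from int8 codes."""
    return [c * scale for c in codes]

rng = random.Random(0)
k_vector = [rng.gauss(0, 1) for _ in range(128)]  # stand-in for one head's key vector

codes, scale = quantize_int8(k_vector)
restored = dequantize_int8(codes, scale)

# Compare storage: 2 bytes per value in FP16 vs 1 byte per value plus a float32 scale.
fp16_bytes = len(struct.pack(f"<{len(k_vector)}e", *k_vector))
int8_bytes = len(struct.pack(f"<{len(codes)}b", *codes)) + 4
max_error = max(abs(a - b) for a, b in zip(k_vector, restored))

print(fp16_bytes, int8_bytes, max_error <= scale / 2 + 1e-12)
```

Storage drops to roughly half, and the rounding error per value is bounded by half the scale, which is the "small accuracy loss" the table refers to.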

Common pitfalls

Underestimating KV Cache memory requirements: long contexts can blow up memory unexpectedly
Not using prefix caching for RAG / chat applications: major cost / latency benefit available
Mixing concurrent requests with very different context lengths under contiguous or padded allocation: wastes KV Cache memory
Naive serving without PagedAttention or equivalent: significant capacity loss
Trying to serve long-context models without provisioning sufficient GPU memory

FAQ

Questions about KV Cache.

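KV Cache memory use depends on the model and sequence length. For a 70B model with 80 layers and hidden size 8192, a 4K-token sequence needs about 10.7GB of FP16 KV Cache under standard multi-head attention (2 × 80 × 4096 × 8192 × 2 bytes). GQA reduces this in proportion to the number of K/V heads, to about 1.3GB with 8 K/V heads of dimension 128. Long contexts (128K+ tokens) can use 10-50+ GB for a single request's KV Cache, even with GQA. KV Cache is often the binding constraint on serving capacity per GPU.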
