Prefix Caching Is Free Throughput

Prefix caching is free throughput.

Not free as in no cost; free as in already paid. The KV cache your model computed for that 1,000-token system prompt? Without caching, it is computed again for the next request. And the one after that. And the ten thousand after that. Every time, identical GPU cycles spent on identical tokens.

Automatic Prefix Caching (APC) stops that. vLLM hashes each full block of tokens in a request's prompt together with everything that precedes it. If a block with the same hash already lives in memory, it is reused, and the prefill for those tokens is skipped entirely. Only the new tokens, the ones that actually differ, require computation.
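To make the lookup concrete, here is a minimal sketch of chained block hashing. It is a toy model, not vLLM's implementation: the function names, the use of SHA-256, and the token IDs are all assumptions made for illustration.

```python
import hashlib
from typing import List, Set

BLOCK_SIZE = 16  # tokens per block; assumed value for this sketch


def block_hashes(token_ids: List[int], block_size: int = BLOCK_SIZE) -> List[str]:
    """Return one hash per full block; each hash also covers all earlier blocks."""
    hashes: List[str] = []
    parent = ""  # hash of the previous block, empty at the start of the prompt
    full = len(token_ids) - len(token_ids) % block_size  # ignore the partial tail
    for start in range(0, full, block_size):
        block = token_ids[start:start + block_size]
        payload = parent + "|" + ",".join(map(str, block))
        parent = hashlib.sha256(payload.encode()).hexdigest()
        hashes.append(parent)
    return hashes


def cached_prefix_tokens(token_ids: List[int], cache: Set[str],
                         block_size: int = BLOCK_SIZE) -> int:
    """Count prompt tokens covered by cached blocks, stopping at the first miss."""
    hits = 0
    for h in block_hashes(token_ids, block_size):
        if h not in cache:
            break
        hits += 1
    return hits * block_size


# Hypothetical token IDs: a 1,000-token system prompt shared by two requests.
system = list(range(1000))
request_a = system + [5000, 5001, 5002]
request_b = system + [7000, 7001]

cache = set(block_hashes(request_a))            # request A populates the cache
reused = cached_prefix_tokens(request_b, cache)
print(reused)  # 992: 62 full blocks of 16 tokens
```

In this sketch, request B recomputes only the 10 tokens that fall outside the 62 cached blocks: the last 8 tokens of the system prompt, which sit in a partial block, plus its own 2 new tokens.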

The mechanism depends on PagedAttention's block-level memory model. KV cache is managed in fixed-size pages (typically 16 tokens per block), not as contiguous per-sequence allocations. This makes content-addressable reuse possible: blocks are shared across requests by pointer, not by copy.
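A toy version of that sharing might look like the following. The `BlockPool` class, its fields, and the reference counts are hypothetical names for this sketch; a real engine also handles eviction and partial blocks, which are left out here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

BLOCK_SIZE = 16  # tokens per block; assumed value for this sketch


@dataclass
class BlockPool:
    """Toy content-addressed pool: one physical block per distinct prefix."""
    by_key: Dict[Tuple[int, Tuple[int, ...]], int] = field(default_factory=dict)
    ref_count: Dict[int, int] = field(default_factory=dict)

    def allocate(self, token_ids: List[int]) -> List[int]:
        """Return the block table (physical block ids) for the full blocks."""
        table: List[int] = []
        parent = -1  # id of the previous block; -1 marks the start of a sequence
        full = len(token_ids) - len(token_ids) % BLOCK_SIZE
        for start in range(0, full, BLOCK_SIZE):
            key = (parent, tuple(token_ids[start:start + BLOCK_SIZE]))
            block_id = self.by_key.get(key)
            if block_id is None:
                # Miss: register a new physical block (nothing is ever evicted here).
                block_id = len(self.by_key)
                self.by_key[key] = block_id
                self.ref_count[block_id] = 0
            # Hit or miss, the sequence now holds a reference, not a copy.
            self.ref_count[block_id] += 1
            table.append(block_id)
            parent = block_id
        return table


pool = BlockPool()
shared = list(range(32))                                 # two shared blocks
table_a = pool.allocate(shared + list(range(100, 116)))
table_b = pool.allocate(shared + list(range(200, 216)))
print(table_a, table_b)  # [0, 1, 2] [0, 1, 3]
```

Both block tables point at the same two physical blocks for the shared prefix, and only the final block differs.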

The numbers are not marginal. On workloads with stable system prompts or shared retrieved context:

Cache hit rates of 70–90% are typical for chatbot workloads.
Aggregate throughput improves 30–50%.
Time-to-first-token latency drops 5–10x on cache hits when prompts are structured with static content first.

The structure matters. APC matches prefixes from the start of the prompt. If your static system prompt is first and the user query is appended at the end, every request reuses the cached blocks for the system prompt. If you interleave static and dynamic content, you break the hash chain at the first dynamic token: every block from that point on misses the cache, even if the text after it is identical across requests.
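How much of a prompt stays reusable depends on where the first per-request token sits. The sketch below compares three layouts in a toy model where reuse equals the longest common prefix rounded down to whole blocks. The function name, block size, and token IDs are assumptions for illustration.

```python
from typing import List

BLOCK_SIZE = 16  # tokens per block; assumed value for this sketch


def reusable_tokens(cached: List[int], incoming: List[int],
                    block_size: int = BLOCK_SIZE) -> int:
    """Tokens of `incoming` covered by full blocks matching `cached` from the start."""
    common = 0
    for a, b in zip(cached, incoming):
        if a != b:
            break
        common += 1
    return common - common % block_size  # only whole blocks can be reused


# Hypothetical token IDs: 256 static tokens and two different 16-token queries.
system = list(range(1000, 1256))
query_1 = list(range(1, 17))
query_2 = list(range(50, 66))
half = len(system) // 2

static_first = reusable_tokens(system + query_1, system + query_2)
dynamic_first = reusable_tokens(query_1 + system, query_2 + system)
interleaved = reusable_tokens(
    system[:half] + query_1 + system[half:],
    system[:half] + query_2 + system[half:],
)
print(static_first, dynamic_first, interleaved)  # 256 0 128
```

With the query at the end, all 256 static tokens are reusable. With the query at the front, none are. With the query in the middle, only the static tokens ahead of it are.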

The rule is simple: static before dynamic. System prompt. Retrieved documents, when they are shared across requests. Few-shot examples. Then the query.
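One way to enforce that ordering is to route every prompt through a single builder. This is a minimal sketch: `build_prompt`, its parameters, and the sample strings are hypothetical and not part of any library.

```python
import os
from typing import Sequence


def build_prompt(static_sections: Sequence[str], query: str) -> str:
    """Join static sections in a fixed order, then append the per-request query."""
    # Keep timestamps, request IDs, and user names out of static_sections:
    # anything that changes per request ends the shared prefix where it appears.
    return "\n\n".join([*static_sections, query])


# Hypothetical static content, identical for every request.
static = [
    "You are a support assistant.",
    "Q: How do I reset my password?\nA: Use the account page.",
]

prompt_1 = build_prompt(static, "How do I change my email?")
prompt_2 = build_prompt(static, "Where is my invoice?")

# The character-level common prefix spans all static sections.
shared = os.path.commonprefix([prompt_1, prompt_2])
static_text = "\n\n".join(static) + "\n\n"
print(shared == static_text)  # True
```

The check at the end works on characters rather than tokens, so it is only a quick test that nothing dynamic leaked into the static part.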

In vLLM this is a single engine argument: enable_prefix_caching=True (--enable-prefix-caching on the command line). In recent releases it is on by default.
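A minimal sketch of passing that argument through vLLM's offline `LLM` class follows. The helper name, the sampling settings, and the way prompts are assembled are assumptions for illustration, and the model name is left to the caller.

```python
from typing import List

# The engine argument named above, set explicitly even where it is the default.
ENGINE_KWARGS = {"enable_prefix_caching": True}


def run_with_prefix_cache(model_name: str, shared_prefix: str,
                          questions: List[str]) -> List[str]:
    """Generate one answer per question; every prompt starts with shared_prefix."""
    # Imported lazily so this module still loads on machines without vLLM.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_name, **ENGINE_KWARGS)
    params = SamplingParams(temperature=0.0, max_tokens=64)  # illustrative values
    prompts = [shared_prefix + question for question in questions]
    outputs = llm.generate(prompts, params)
    return [output.outputs[0].text for output in outputs]
```

Because every prompt begins with the same `shared_prefix`, its blocks only need to be computed once and can be reused by the later prompts.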

The optimization costs nothing to deploy. The failure to deploy it costs throughput on every request, indefinitely.

---

References

1. vLLM Project. "Automatic Prefix Caching." vLLM Documentation. https://docs.vllm.ai/en/latest/features/automatic_prefix_caching/

2. Kwon, W., et al. (2023). "Efficient Memory Management for Large Language Model Serving with PagedAttention." arXiv:2309.06180. https://arxiv.org/abs/2309.06180

3. vLLM Team. (2025). "Inside vLLM: Anatomy of a High-Throughput LLM Inference System." vLLM Blog. https://vllm.ai/blog/2025-09-05-anatomy-of-vllm