Prompt Caching Is Free Money
Prompt Caching Is Free Money
Every transformer request has two stages: prefill and decode.
Prefill processes your input tokens — it computes key-value (KV) tensors across every attention head and every layer. This is expensive. It scales with input length. Decode generates one token at a time and is comparatively cheap.
When your application sends the same system prompt on every request, it pays full prefill cost every time. Prompt caching stops that.
The Mechanism
Transformers compute KV projections for each token during attention. Prompt caching persists those tensors to memory and reuses them when a subsequent request begins with an identical prefix.
The match must be exact. Even a single character difference misses the cache. But for structured applications — system prompts, few-shot examples, retrieved documents, tool definitions — the prefix is usually stable.
A cache hit skips prefill entirely for the matched prefix. The model picks up decode from where the prefix ends.
What It Costs You to Not Use It
Latency numbers from providers:
- 1,000-token prefixes: ~7% TTFT improvement on a cache hit
- 150,000-token prefixes: ~67% TTFT improvement
Cost numbers:
- OpenAI: cached tokens billed at up to 90% discount (triggering at 1,024+ tokens, in 128-token increments)
- Anthropic: cached reads billed at 0.1× base input cost — effectively 90% off
For an agent that sends 2,000 tokens of system context per request, at GPT-4o pricing, caching cuts input costs in half or more on repeated calls. At scale, the savings are not marginal.
Where It Applies
Prompt caching pays off most when:
- System prompts are long and static
- You run many completions against the same document or context
- Tool definitions or few-shot examples repeat across calls
- Agent loops re-inject the same prior context
It pays off least when inputs are short or highly dynamic. Below 1,024 tokens, OpenAI won't cache. Above that threshold, structure your prompt so stable content leads.
The Design Rule
Put static content at the top. Put dynamic content at the bottom.
If your system prompt changes per-user mid-block, you break prefix matching for everyone. Separate what is constant from what varies. Build your prompts the way you build a URL — shared base, dynamic suffix.
Prompt caching is not a feature to opt into later. It is a constraint on how you structure prompts now, with a compounding payoff every time a user repeats an action.
References
- OpenAI. (2024). Prompt caching. OpenAI API Documentation. https://platform.openai.com/docs/guides/prompt-caching
- OpenAI. (2024). Prompt Caching in the API. OpenAI Blog. https://openai.com/index/api-prompt-caching
- OpenAI. (2024). Prompt Caching 201. OpenAI Cookbook. https://developers.openai.com/cookbook/examples/prompt_caching_201
- Anthropic. (2024). Prompt caching with Claude. Anthropic News. https://www.anthropic.com/news/prompt-caching
- Gim, I., Chen, G., Lee, S., Shen, M., Zheng, L., & Jin, X. (2024). Prompt Cache: Modular Attention Reuse for Low-Latency Inference. MLSys 2024. https://proceedings.mlsys.org/paper_files/paper/2024/file/a66caa1703fe34705a4368c3014c1966-Paper-Conference.pdf
- Redis. (2025). What Is Prompt Caching? LLM Speed & Cost Guide. Redis Blog. https://redis.io/blog/what-is-prompt-caching/
Cite as
devinfo.dev. (2026). "Prompt Caching Is Free Money." devinfo.dev:2026.0014. https://devinfo.dev/d/2026.0014
devinfo.dev | https://devinfo.dev/d/2026.0014
Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.
https://devinfo.dev