Prompt Caching Is Free Money

Every transformer request has two stages: prefill and decode.

Prefill processes your input tokens — it computes key-value (KV) tensors across every attention head and every layer. This is expensive. It scales with input length. Decode generates one token at a time and is comparatively cheap.

When your application sends the same system prompt on every request, it pays full prefill cost every time. Prompt caching stops that.

The Mechanism

Transformers compute KV projections for each token during attention. Prompt caching persists those tensors to memory and reuses them when a subsequent request begins with an identical prefix.

The match must be exact. Even a single character difference misses the cache. But for structured applications — system prompts, few-shot examples, retrieved documents, tool definitions — the prefix is usually stable.

A cache hit skips prefill entirely for the matched prefix. The model picks up decode from where the prefix ends.

What It Costs You to Not Use It

Latency numbers from providers:

1,000-token prefixes: ~7% TTFT improvement on a cache hit
150,000-token prefixes: ~67% TTFT improvement

Cost numbers:

OpenAI: cached tokens billed at up to 90% discount (triggering at 1,024+ tokens, in 128-token increments)
Anthropic: cached reads billed at 0.1× base input cost — effectively 90% off

For an agent that sends 2,000 tokens of system context per request, at GPT-4o pricing, caching cuts input costs in half or more on repeated calls. At scale, the savings are not marginal.

Where It Applies

Prompt caching pays off most when:

System prompts are long and static
You run many completions against the same document or context
Tool definitions or few-shot examples repeat across calls
Agent loops re-inject the same prior context

It pays off least when inputs are short or highly dynamic. Below 1,024 tokens, OpenAI won't cache. Above that threshold, structure your prompt so stable content leads.

The Design Rule

Put static content at the top. Put dynamic content at the bottom.

If your system prompt changes per-user mid-block, you break prefix matching for everyone. Separate what is constant from what varies. Build your prompts the way you build a URL — shared base, dynamic suffix.

Prompt caching is not a feature to opt into later. It is a constraint on how you structure prompts now, with a compounding payoff every time a user repeats an action.

References

OpenAI. (2024). Prompt caching. OpenAI API Documentation. https://platform.openai.com/docs/guides/prompt-caching
OpenAI. (2024). Prompt Caching in the API. OpenAI Blog. https://openai.com/index/api-prompt-caching
OpenAI. (2024). Prompt Caching 201. OpenAI Cookbook. https://developers.openai.com/cookbook/examples/prompt_caching_201
Anthropic. (2024). Prompt caching with Claude. Anthropic News. https://www.anthropic.com/news/prompt-caching
Gim, I., Chen, G., Lee, S., Shen, M., Zheng, L., & Jin, X. (2024). Prompt Cache: Modular Attention Reuse for Low-Latency Inference. MLSys 2024. https://proceedings.mlsys.org/paper_files/paper/2024/file/a66caa1703fe34705a4368c3014c1966-Paper-Conference.pdf
Redis. (2025). What Is Prompt Caching? LLM Speed & Cost Guide. Redis Blog. https://redis.io/blog/what-is-prompt-caching/