
Prompt Caching Is Free Money

inspiration | devinfo.dev | May 28, 2026 | devinfo.dev:2026.0014

Every time your app resends the same system prompt, you pay to compute it again. Prompt caching avoids that recomputation by reusing precomputed KV tensors across requests. On providers that cache automatically it requires no code changes, and it can cut the cost of repeated input tokens by up to 90%.


Every transformer request has two stages: prefill and decode.

Prefill processes your input tokens: it computes key-value (KV) tensors for every token, across every attention head and every layer. This is expensive, and it scales with input length. Decode then generates output one token at a time, reusing those KV tensors rather than recomputing them.

When your application sends the same system prompt on every request, it pays full prefill cost every time. Prompt caching stops that.

The Mechanism

Transformers compute KV projections for each token during attention. Prompt caching persists those tensors to memory and reuses them when a subsequent request begins with an identical prefix.

The match must be exact. A single differing character ends the match at that point, and everything after it is recomputed. But for structured applications (system prompts, few-shot examples, retrieved documents, tool definitions), the prefix is usually stable.

A cache hit skips prefill for the matched prefix. The model still prefills any input tokens after the prefix, then decodes as usual.
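A toy model makes the exact-prefix rule concrete. This is an illustrative sketch, not any provider's implementation: prompts are assumed to be lists of integer token IDs, `ToyPrefixCache` is a hypothetical class, and a placeholder string stands in for the per-layer KV tensors a real system would store.

```python
from __future__ import annotations


class ToyPrefixCache:
    """Illustrative prefix cache: a placeholder string stands in for KV tensors."""

    def __init__(self) -> None:
        # Maps every previously processed prefix (tuple of token IDs) to a fake KV state.
        self._store: dict[tuple[int, ...], str] = {}

    def longest_cached_prefix(self, tokens: list[int]) -> int:
        """Length of the longest prefix of `tokens` already in the cache."""
        for length in range(len(tokens), 0, -1):
            if tuple(tokens[:length]) in self._store:
                return length
        return 0

    def process(self, tokens: list[int]) -> tuple[int, int]:
        """Simulate one request and return (cached_tokens, prefilled_tokens)."""
        hit = self.longest_cached_prefix(tokens)
        # Tokens past the matched prefix still need prefill; store their states.
        for length in range(hit + 1, len(tokens) + 1):
            self._store[tuple(tokens[:length])] = f"kv[0:{length}]"
        return hit, len(tokens) - hit


SYSTEM = [1, 2, 3, 4, 5]  # hypothetical token IDs for a shared system prompt
cache = ToyPrefixCache()
print(cache.process(SYSTEM + [10, 11]))          # (0, 7): cold cache
print(cache.process(SYSTEM + [20, 21]))          # (5, 2): system prompt reused
print(cache.process([1, 2, 99, 4, 5, 10, 11]))   # (2, 5): match ends at the change
```

The third request differs only at its third token, yet it still reuses the first two; everything from the change onward is prefilled again. Storing every prefix keeps the sketch short but costs memory quadratic in prompt length, so a real system would use a more compact structure.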

What It Costs You to Not Use It

Latency numbers from providers:

Cost numbers:

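For an agent that sends 2,000 tokens of system context per request, GPT-4o bills cached input tokens at half its normal input rate, so caching halves the cost of that repeated context on every call that hits the cache. Other providers discount cached reads more steeply. At scale, the savings are not marginal.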
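The arithmetic behind that claim can be sketched as follows. The prices are assumptions (GPT-4o's list prices when cached-input pricing launched, in USD per 1M tokens), as are the 200 dynamic tokens per request and the 1M-request volume; `input_cost` is a hypothetical helper.

```python
# Assumed USD prices per 1M input tokens: GPT-4o's list prices when cached-input
# pricing launched. Check your provider's current rates before relying on these.
PRICE_PER_M = 2.50
CACHED_PRICE_PER_M = 1.25


def input_cost(requests: int, prefix_tokens: int, suffix_tokens: int, cached: bool) -> float:
    """Total input cost in USD for `requests` calls that share a stable prefix.

    With cached=True, assumes every call hits the cache for the whole prefix
    (ignores the first cold call and any rounding to cache block sizes).
    """
    if cached:
        per_request = prefix_tokens * CACHED_PRICE_PER_M + suffix_tokens * PRICE_PER_M
    else:
        per_request = (prefix_tokens + suffix_tokens) * PRICE_PER_M
    return per_request * requests / 1_000_000


# 2,000 tokens of stable system context plus 200 dynamic tokens, 1M requests.
without = input_cost(1_000_000, 2_000, 200, cached=False)
with_cache = input_cost(1_000_000, 2_000, 200, cached=True)
print(f"${without:,.2f} -> ${with_cache:,.2f} ({1 - with_cache / without:.0%} saved)")
```

With these assumed numbers it prints `$5,500.00 -> $3,000.00 (45% saved)`: the dynamic suffix is still billed at full price, so the total saving lands a little under half.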

Where It Applies

Prompt caching pays off most when:

It pays off least when inputs are short or highly dynamic. Below 1,024 tokens, OpenAI won't cache. Above that threshold, structure your prompt so stable content leads.
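A sketch of how the threshold limits what can be cached. The constants are assumptions based on OpenAI's documented behavior (no caching below 1,024 tokens, with cache hits growing in 128-token steps above that); other providers use different thresholds, and `max_cached_tokens` is a hypothetical helper.

```python
# Assumed constants, based on OpenAI's documented behavior: no caching below
# 1,024 prompt tokens, and cache hits growing in 128-token steps above that.
MIN_CACHEABLE = 1024
INCREMENT = 128


def max_cached_tokens(stable_prefix_tokens: int) -> int:
    """Upper bound on tokens a request could read from cache, given how many
    leading tokens are identical across requests."""
    if stable_prefix_tokens < MIN_CACHEABLE:
        return 0
    extra_blocks = (stable_prefix_tokens - MIN_CACHEABLE) // INCREMENT
    return MIN_CACHEABLE + extra_blocks * INCREMENT


for prefix in (900, 1_024, 2_000):
    print(prefix, max_cached_tokens(prefix))
```

Under these assumptions a 900-token stable prefix caches nothing, and a 2,000-token one caches at most 1,920 tokens. Only the leading identical tokens count, which is why stable content has to come first.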

The Design Rule

Put static content at the top. Put dynamic content at the bottom.

If per-user content sits in the middle of your system prompt, the prefix shared across users ends there, and everything after it is recomputed on every request. Separate what is constant from what varies. Build your prompts the way you build a URL: shared base, dynamic suffix.
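A rough sketch of the rule, comparing two layouts for the same content. The prompt blocks (`INTRO`, `POLICY`, `TOOLS`), the company name, and the helper functions are all hypothetical. Shared characters are only a proxy here: real caches match tokens, not characters, but the ordering effect is the same.

```python
import os

# Hypothetical static blocks: identical on every request.
INTRO = "You are a support agent for ExampleCo.\n"
POLICY = "Policy: be concise, cite order IDs, never promise refunds.\n" * 40
TOOLS = "Tools: search_orders(order_id), refund(order_id, amount)\n"


def prompt_dynamic_mid(user: str, question: str) -> str:
    # Anti-pattern: per-user text is spliced into the middle of the static block.
    return INTRO + f"The current user is {user}.\n" + POLICY + TOOLS + question


def prompt_static_first(user: str, question: str) -> str:
    # Static content leads; everything that varies comes last.
    return INTRO + POLICY + TOOLS + f"The current user is {user}.\n" + question


def shared_prefix_chars(a: str, b: str) -> int:
    """Characters two prompts share from the start (a proxy for the cacheable prefix)."""
    return len(os.path.commonprefix([a, b]))


q1, q2 = ("alice", "Where is my order?"), ("bob", "Can I get a refund?")
print("dynamic mid-block:", shared_prefix_chars(prompt_dynamic_mid(*q1), prompt_dynamic_mid(*q2)))
print("static first:", shared_prefix_chars(prompt_static_first(*q1), prompt_static_first(*q2)))
```

In the mid-block layout, two users share only the intro and the words before the name; in the static-first layout they share the entire policy and tool block as well.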

Prompt caching is not a feature to opt into later. It is a constraint on how you structure prompts now, with a payoff that recurs every time a user repeats an action.
