Every transformer request has two stages: prefill and decode.
Prefill processes your input tokens — it computes key-value (KV) tensors across every attention head and every layer. This is expensive. It scales with input length. Decode generates one token at a time and is comparatively cheap.
When your application sends the same system prompt on every request, it pays full prefill cost every time. Prompt caching stops that.
Transformers compute KV projections for each token during attention. Prompt caching persists those tensors to memory and reuses them when a subsequent request begins with an identical prefix.
The match must be exact. A single differing character ends the match at that point, and everything after it must be recomputed. But for structured applications (system prompts, few-shot examples, retrieved documents, tool definitions) the prefix is usually stable.
A cache hit skips prefill entirely for the matched prefix. The model prefills only the tokens that follow the prefix, then starts decoding.
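The prefix-matching idea can be sketched with a toy cache. This is a minimal illustration, not how any provider implements it: `ToyPrefixCache` and `prefill_work` are hypothetical names, tokens are plain integers, and matching is done token by token, whereas production systems typically match in larger blocks and store real KV tensors.

```python
from typing import List


def common_prefix_len(a: List[int], b: List[int]) -> int:
    """Count how many leading tokens two sequences share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class ToyPrefixCache:
    """Remembers token sequences that were already prefilled (hypothetical)."""

    def __init__(self) -> None:
        self._entries: List[List[int]] = []

    def lookup(self, tokens: List[int]) -> int:
        """Return the longest prefix of `tokens` shared with any stored entry."""
        return max((common_prefix_len(tokens, e) for e in self._entries), default=0)

    def store(self, tokens: List[int]) -> None:
        self._entries.append(list(tokens))


def prefill_work(cache: ToyPrefixCache, tokens: List[int]) -> int:
    """Number of tokens that still need prefill after reusing the cached prefix."""
    hit = cache.lookup(tokens)
    cache.store(tokens)
    return len(tokens) - hit


cache = ToyPrefixCache()
system = list(range(100))                            # stand-in for a 100-token system prompt
first = prefill_work(cache, system + [500, 501])     # cold: all 102 tokens prefilled
second = prefill_work(cache, system + [600])         # warm: only the 1 new token
shifted = prefill_work(cache, [999] + system)        # prefix changed: all 101 tokens
```

The last call shows why the match must start at the first token: prepending a single token shifts the whole sequence, so nothing is reused.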
Latency numbers from providers:
Cost numbers:
For an agent that sends 2,000 tokens of system context per request, at GPT-4o pricing, caching cuts the input cost of that repeated context roughly in half on cache hits. At scale, the savings are not marginal.
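To make the arithmetic concrete, here is a small sketch of per-request input cost with and without a cache hit. The price per million tokens, the 50% cached-token discount, and the 200-token dynamic suffix are assumptions chosen for illustration; check your provider's current price sheet before relying on them. `input_cost` is a hypothetical helper.

```python
def input_cost(total_tokens: int, cached_tokens: int,
               price_per_mtok: float, cached_discount: float) -> float:
    """Input cost in dollars for one request.

    cached_discount is the fraction taken off the price of cached tokens
    (0.5 means cached tokens cost half the normal input price).
    """
    uncached_tokens = total_tokens - cached_tokens
    cached_price = price_per_mtok * (1.0 - cached_discount)
    return (uncached_tokens * price_per_mtok + cached_tokens * cached_price) / 1_000_000


# Assumed values, for illustration only.
PRICE_PER_MTOK = 2.50      # dollars per million input tokens
CACHED_DISCOUNT = 0.5      # cached tokens billed at half price
SYSTEM_TOKENS = 2_000      # stable system context from the text
DYNAMIC_TOKENS = 200       # hypothetical per-request suffix

total = SYSTEM_TOKENS + DYNAMIC_TOKENS
cold = input_cost(total, 0, PRICE_PER_MTOK, CACHED_DISCOUNT)
warm = input_cost(total, SYSTEM_TOKENS, PRICE_PER_MTOK, CACHED_DISCOUNT)
saving = 1.0 - warm / cold

print(f"cold: ${cold:.4f}  warm: ${warm:.4f}  saving: {saving:.1%}")
```

Under these assumptions the overall saving equals the discount multiplied by the cached share of the prompt, so it grows as the stable prefix makes up more of each request.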
Prompt caching pays off most when:
It pays off least when inputs are short or highly dynamic. Below 1,024 tokens, OpenAI won't cache. Above that threshold, structure your prompt so stable content leads.
Put static content at the top. Put dynamic content at the bottom.
If your system prompt changes per-user mid-block, you break prefix matching for everyone. Separate what is constant from what varies. Build your prompts the way you build a URL — shared base, dynamic suffix.
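The difference between the two layouts can be sketched as follows. The prompt text and the builder functions are hypothetical, and shared characters are used here as a rough proxy for shared tokens, since providers match on tokens rather than characters.

```python
STATIC_RULES = (
    "You are a support assistant. Answer concisely, cite the policy "
    "section you relied on, and never reveal internal notes."
)


def build_prompt_bad(user_name: str, question: str) -> str:
    """Per-user text comes first, so the shared prefix ends almost immediately."""
    return f"You are helping {user_name}.\n{STATIC_RULES}\nQuestion: {question}"


def build_prompt_good(user_name: str, question: str) -> str:
    """Static rules lead; everything that varies is appended after them."""
    return f"{STATIC_RULES}\nYou are helping {user_name}.\nQuestion: {question}"


def shared_prefix_len(a: str, b: str) -> int:
    """Number of leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


bad = shared_prefix_len(
    build_prompt_bad("Alice", "Where is my order?"),
    build_prompt_bad("Bob", "How do I get a refund?"),
)
good = shared_prefix_len(
    build_prompt_good("Alice", "Where is my order?"),
    build_prompt_good("Bob", "How do I get a refund?"),
)
print(f"shared prefix, per-user text first: {bad} chars")
print(f"shared prefix, static text first:   {good} chars")
```

In the first layout the two prompts diverge at the user's name, so the static rules that follow can never be reused across users. In the second, the whole rules block is shared and only the suffix differs.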
Prompt caching is not a feature to opt into later. It is a constraint on how you structure prompts now, with a compounding payoff every time a user repeats an action.