#optimization
2 papers
-
inspiration
Prompt Caching Is Free Money
Every time your app resends the same system prompt, you pay to compute it again. Prompt caching eliminates that cost by reusing precomputed KV tensors across requests. It requires no code changes and delivers up to 90% input token savings.
-
inspiration
Speculative Decoding: The Free Tokens
Speculative decoding cuts inference latency 2–3x without changing a single output token. The gain is real. So is the catch.