#optimization

inspiration

Compression Is Not Cheating

Sending fewer tokens to the model is not a workaround. It is engineering. The context window tells you the maximum; it does not tell you the optimum.

June 16, 2026

inspiration

Every time your app resends the same system prompt, you pay to compute it again. Prompt caching eliminates that cost by reusing precomputed KV tensors across requests. It requires no code changes and delivers up to 90% input token savings.

May 28, 2026

inspiration

Speculative Decoding: The Free Tokens

Speculative decoding cuts inference latency 2–3x without changing a single output token. The gain is real. So is the catch.

May 25, 2026

Compression Is Not Cheating

Prompt Caching Is Free Money

Speculative Decoding: The Free Tokens