#kv-cache — devinfo.dev

inspiration
Attention Heads Are Not Equal

Multi-head attention gives every query head its own key and value head. That is thorough, and expensive. Grouped-Query Attention shows how much of that is redundant: Llama 3 70B serves 64 query heads from 8 KV heads, shrinks its KV cache 8x relative to full multi-head attention, and loses almost nothing in quality.
June 26, 2026
inspiration
The KV Cache Is Your Real Memory Budget

The KV cache — not the model weights — is what limits how many tokens you can generate and how many requests you can serve. Understanding it changes how you provision hardware and tune inference.
June 3, 2026
inspiration
Attention Sinks: The Tokens That Hold Everything Together

Transformers quietly route a disproportionate share of attention to their first tokens — not because those tokens are important, but because softmax needs somewhere to put mass. Understanding this changes how you think about KV cache design.
June 2, 2026