#kv-cache
2 papers
-
inspiration
The KV Cache Is Your Real Memory Budget
The KV cache — not the model weights — is what limits how many tokens you can generate and how many requests you can serve. Understanding it changes how you provision hardware and tune inference.
-
inspiration
Attention Sinks: The Tokens That Hold Everything Together
Transformers quietly route a disproportionate share of attention to their first tokens — not because those tokens are important, but because softmax needs somewhere to put mass. Understanding this changes how you think about KV cache design.