#transformers
1 paper
-
inspiration
Attention Sinks: The Tokens That Hold Everything Together
Transformers quietly route a disproportionate share of attention to their first tokens — not because those tokens are important, but because softmax needs somewhere to put mass. Understanding this changes how you think about KV cache design.