Attention Sinks: The Tokens That Hold Everything Together
Softmax has a constraint: every row of the attention matrix must sum to 1.
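When a token has nothing relevant to attend to, the model still has to put that attention mass somewhere. In practice a disproportionate share lands on the first few tokens of the sequence, largely regardless of what those tokens actually contain. In an autoregressive model the initial tokens are visible to every later position, which makes them the natural place to park the surplus.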
These are attention sinks.
They are not semantically important. They are arithmetically necessary.
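The sum-to-1 constraint is easy to see in a few lines. The sketch below uses made-up attention scores for a single query (the numbers are illustrative assumptions, not measurements from any model) and shows that the row sums to 1 even when no key is a good match.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for one query over 6 keys. Position 0 plays the
# sink: no key is a strong match, but position 0 scores a little higher.
scores = [2.0, -1.0, -1.2, -0.8, -1.1, -0.9]
weights = softmax(scores)

row_sum = sum(weights)                      # forced to 1 by the softmax
top_position = weights.index(max(weights))  # key that absorbed the most mass

# Even when every score is equally poor, the mass cannot vanish.
# It is spread out instead.
flat = softmax([-50.0, -50.0, -50.0, -50.0])

print(round(row_sum, 6), top_position, flat)
```

With these assumed scores, most of the weight ends up on position 0 simply because it is the least bad option. The softmax has no way to express "attend to nothing".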
What Breaks Without Them
Standard sliding-window attention evicts old tokens from the KV cache to save memory. Evict the first few tokens, the sinks, and the model's attention weights lose their anchor. Perplexity spikes. Output degrades catastrophically.
This is not a subtle degradation. It is a cliff.
What StreamingLLM Found
Xiao et al. (2023) at MIT-HAN Lab showed a clean fix: retain the KV cache states of roughly 4 initial tokens permanently, then slide a window of recent tokens alongside them.
That is the entire trick.
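With 4 sink tokens held in place plus a recent-context window, the model runs stably across streams of 4 million tokens and more, without fine-tuning and without modification to the model weights. The model still attends only to the sinks and the recent window, so this stabilizes streaming. It does not enlarge the effective context.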
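The difference between the two eviction policies comes down to which positions survive in the cache. Here is a minimal sketch of that bookkeeping. The function names and the window size of 8 are illustrative assumptions, not the API of the streaming-llm repository.

```python
def window_only(seq_len, window):
    """Positions kept by plain sliding-window eviction."""
    return list(range(max(0, seq_len - window), seq_len))

def sinks_plus_window(seq_len, n_sink=4, window=8):
    """Positions kept when the first n_sink tokens are pinned and the
    remaining budget slides over the most recent tokens."""
    if seq_len <= n_sink + window:
        return list(range(seq_len))  # nothing needs evicting yet
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))

# Same cache budget of 12 entries, 20 tokens seen so far.
plain = window_only(20, window=12)
pinned = sinks_plus_window(20, n_sink=4, window=8)

print(plain)   # positions 8..19: the sinks are gone
print(pinned)  # positions 0..3 plus 12..19: the sinks are kept
```

Both policies hold the same number of entries. The only change is that four slots are reserved for the start of the sequence instead of being spent on slightly older recent tokens.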
Throughput: up to 22.2× faster than sliding-window attention with recomputation, the baseline that keeps quality but rebuilds the cache for every new token. Memory: bounded and predictable.
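"Bounded and predictable" can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes a Llama-2-7B-style configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values), batch size 1, and a hypothetical recent window of 2,044 tokens. All of these are assumptions chosen for illustration, not figures from the source.

```python
def kv_cache_bytes(tokens, n_layers=32, n_kv_heads=32, head_dim=128,
                   bytes_per_value=2):
    """Approximate KV cache size for one sequence.
    The leading 2 accounts for storing both keys and values."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * tokens

GIB = 2 ** 30

per_token = kv_cache_bytes(1)                 # bytes per cached token
bounded = kv_cache_bytes(4 + 2044)            # 4 sinks + assumed window
unbounded = kv_cache_bytes(4_000_000)         # caching everything instead

print(per_token, bounded / GIB, unbounded / GIB)
```

Under these assumptions the cache stays at 1 GiB no matter how long the stream runs. Keeping every token of a 4-million-token stream would need roughly 1,950 GiB.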
The Practical Implication
If you are building long-context inference infrastructure — a RAG pipeline, a streaming chat application, a document processing service — your KV cache eviction strategy matters as much as your model choice.
Evict indiscriminately and your inference will fail on long inputs in ways that look random but are not.
The fix is not expensive. Protect the first 4 tokens. Always.
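In a decode loop the same rule can be enforced incrementally, one token at a time. The class below is a toy sketch. `SinkCache` is a hypothetical name, and it stores plain integers as stand-ins for per-layer key/value tensors. A real inference stack would apply the same policy to its own cache structure.

```python
from collections import deque

class SinkCache:
    """Toy cache: pinned sink entries plus a bounded recent window."""

    def __init__(self, n_sink=4, window=8):
        self.n_sink = n_sink
        self.sinks = []                     # never evicted
        self.recent = deque(maxlen=window)  # oldest entry drops automatically

    def append(self, entry):
        # The first n_sink entries are pinned; everything after them
        # goes through the sliding window.
        if len(self.sinks) < self.n_sink:
            self.sinks.append(entry)
        else:
            self.recent.append(entry)

    def entries(self):
        return self.sinks + list(self.recent)

cache = SinkCache(n_sink=4, window=8)
for pos in range(1000):
    cache.append(pos)  # stand-in for a (key, value) pair

kept = cache.entries()
print(kept)
```

Using `deque(maxlen=...)` makes eviction a constant-time side effect of appending, and the pinned list makes it structurally impossible to evict a sink by accident.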
A Deeper Lesson
The model is not doing what you think it is doing.
Attention sinks exist because the model learned to use initial tokens as a mathematical pressure valve during training. No one designed this. No one intended it. It emerged.
This pattern repeats throughout deep learning: the model finds its own solution, and that solution is often not interpretable, not obvious, and not safe to ignore in production.
You cannot fully optimize what you do not understand. Attention sinks are a small, tractable example of this principle — concrete enough to reason about, important enough to act on.
Know your inference stack. Know what it assumes.
References
- Xiao, G., Tian, Y., Chen, B., Han, S., & Lewis, M. (2023). Efficient Streaming Language Models with Attention Sinks. arXiv:2309.17453. https://arxiv.org/abs/2309.17453
- MIT-HAN Lab. (2023). StreamingLLM: How Attention Sinks Keep Language Models Stable. Blog post. https://hanlab.mit.edu/blog/streamingllm
- MIT-HAN Lab. StreamingLLM Project Page. https://hanlab.mit.edu/projects/streamingllm
- MIT-HAN Lab. streaming-llm (GitHub repository). https://github.com/mit-han-lab/streaming-llm
Cite as
devinfo.dev. (2026). "Attention Sinks: The Tokens That Hold Everything Together." devinfo.dev:2026.0020. https://devinfo.dev/d/2026.0020
Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.