
Attention Sinks: The Tokens That Hold Everything Together

inspiration | devinfo.dev | June 2, 2026 | devinfo.dev:2026.0020

Transformers quietly route a disproportionate share of attention to their first tokens — not because those tokens are important, but because softmax needs somewhere to put mass. Understanding this changes how you think about KV cache design.

Softmax has a constraint: every row of the attention matrix must sum to 1.

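When a token has nothing relevant to attend to, the model still has to put that attention mass somewhere. In trained autoregressive models, most of it goes to the first few tokens. Consistently. Largely regardless of what those tokens actually contain. Under causal masking the initial tokens are visible to every later position, which makes them the natural place for the model to learn this habit.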

These are attention sinks.

They are not semantically important. They are arithmetically necessary.
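The constraint is easy to see on a single attention row. The sketch below is a toy illustration, not a measurement from any real model: the score values are made up, and `softmax` is a plain-Python helper written for this example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for a query with nothing relevant to attend to.
# Every key scores very low, yet the row still has to sum to 1, so the
# mass is spread evenly over tokens the query does not care about.
no_sink = softmax([-9.0, -9.0, -9.0, -9.0])

# Same situation, but position 0 carries a high score (made-up value).
# Almost all of the unwanted mass is parked there instead.
with_sink = softmax([4.0, -1.0, -1.2, -0.8, -1.1])

print("row sums:", sum(no_sink), sum(with_sink))
print("weights without a sink:", no_sink)
print("share on position 0 with a sink:", with_sink[0])
```

Low scores everywhere do not produce low weights everywhere. The only way for a row to put very little weight on its content tokens is to put a lot of weight on something else.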

What Breaks Without Them

Standard sliding-window attention evicts old tokens to save memory. Evict the first few tokens (the sinks) and the softmax loses its anchor: the mass that used to sit on them is forced onto the remaining tokens. Perplexity spikes. Output degrades catastrophically.

This is not a subtle degradation. It is a cliff.
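A toy calculation shows why eviction distorts the remaining weights. This is a sketch with made-up scores, assuming the first position acts as the sink. It illustrates the renormalization only, not the downstream effect on perplexity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over one row of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores for one query. Position 0 is the sink.
scores = [4.0, -1.0, -1.2, -0.8, -1.1]

with_sink = softmax(scores)          # sink still in the KV cache
without_sink = softmax(scores[1:])   # sink evicted, same content scores

# The content scores did not change, but their weights did: the mass the
# sink was absorbing is redistributed over tokens the query never wanted.
inflation = without_sink[0] / with_sink[1]

print("weight on first content token, sink kept:   ", with_sink[1])
print("weight on first content token, sink evicted:", without_sink[0])
print("inflation factor:", inflation)
```

The later layers were trained on weights from the first distribution. After eviction they receive the second.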

What StreamingLLM Found

Xiao et al. (2023) at MIT HAN Lab showed a clean fix: keep the KV cache states of the first few tokens (roughly 4) permanently, then slide a window of recent tokens alongside them.

That is the entire trick.
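As an eviction policy, the idea fits in a few lines. The sketch below is a minimal illustration, not the StreamingLLM implementation: `SinkKVCache`, `n_sink`, and `window` are hypothetical names, and plain integers stand in for per-token key/value tensors.

```python
from collections import deque

class SinkKVCache:
    """Toy eviction policy: pin the first n_sink entries, roll the rest."""

    def __init__(self, n_sink=4, window=8):
        if n_sink < 0 or window < 1:
            raise ValueError("n_sink must be >= 0 and window must be >= 1")
        self.n_sink = n_sink
        self.sinks = []                      # never evicted
        self.recent = deque(maxlen=window)   # oldest entry drops when full

    def append(self, kv):
        # The first n_sink entries are pinned; everything after that
        # goes through the rolling window.
        if len(self.sinks) < self.n_sink:
            self.sinks.append(kv)
        else:
            self.recent.append(kv)

    def entries(self):
        """Entries attention would see: pinned sinks, then recent window."""
        return self.sinks + list(self.recent)

    def __len__(self):
        return len(self.sinks) + len(self.recent)

# Stream 100 tokens through a cache that holds 4 sinks + 8 recent entries.
cache = SinkKVCache(n_sink=4, window=8)
for token_index in range(100):
    cache.append(token_index)

kept = cache.entries()
print(kept)        # -> [0, 1, 2, 3, 92, 93, 94, 95, 96, 97, 98, 99]
print(len(cache))  # -> 12
```

The cache size never exceeds `n_sink + window`, however long the stream runs. The paper also assigns positions relative to the cache rather than to the original text. This sketch does not model that.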

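With 4 sink tokens held in place plus a recent-context window, the model stays stable on streams of up to 4 million tokens, without fine-tuning and without any change to the model weights. Stable does not mean it remembers everything: tokens evicted from the window are gone, and the model only attends to the sinks plus the recent context.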

Throughput: up to 22.2× faster than the sliding-window-with-recomputation baseline, which rebuilds the KV states of the recent window for every new token. Memory: bounded and predictable.
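"Bounded and predictable" can be made concrete with back-of-envelope arithmetic. The sketch below assumes a hypothetical 7B-class decoder shape (32 layers, 32 KV heads, head dimension 128, 2-byte values) and a hypothetical window of 2044 recent tokens. None of these figures come from the article, so substitute your own model's values.

```python
def kv_cache_bytes(cached_tokens, n_layers, n_kv_heads, head_dim, bytes_per_value):
    """Bytes needed to hold keys and values for `cached_tokens` positions."""
    # Factor 2: one key vector and one value vector per head, per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * cached_tokens

# Hypothetical model shape (an assumption for illustration only).
shape = dict(n_layers=32, n_kv_heads=32, head_dim=128, bytes_per_value=2)

per_token = kv_cache_bytes(1, **shape)
bounded = kv_cache_bytes(4 + 2044, **shape)     # 4 sinks + recent window
unbounded = kv_cache_bytes(4_000_000, **shape)  # caching every token instead

print(per_token / 2**20, "MiB per cached token")           # -> 0.5
print(bounded / 2**30, "GiB with sinks + window")          # -> 1.0
print(round(unbounded / 2**40, 2), "TiB for full caching")  # -> 1.91
```

With a fixed cache the footprint depends on the window you choose, not on how long the stream has been running.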

The Practical Implication

If you are building long-context inference infrastructure — a RAG pipeline, a streaming chat application, a document processing service — your KV cache eviction strategy matters as much as your model choice.

Evict indiscriminately and your inference will fail on long inputs in ways that look random but are not.

The fix is not expensive. Protect the first 4 tokens. Always.

A Deeper Lesson

The model is not doing what you think it is doing.

Attention sinks exist because the model learned to use initial tokens as a mathematical pressure valve during training. No one designed this. No one intended it. It emerged.

This pattern repeats throughout deep learning: the model finds its own solution, and that solution is often not interpretable, not obvious, and not safe to ignore in production.

You cannot fully optimize what you do not understand. Attention sinks are a small, tractable example of this principle — concrete enough to reason about, important enough to act on.

Know your inference stack. Know what it assumes.

References