Long Context Is Not Long Attention

devinfo.dev — June 14, 2026

devinfo.dev:2026.0032

#long-context #attention #inference #llm-engineering

A 128k context window means the model can accept 128k tokens. It does not mean the model uses them well.

These are different claims. Most practitioners conflate them.

---

The Evidence

Liu et al. (2024) ran a controlled experiment: they placed the relevant document at different positions inside a multi-document QA context and measured accuracy at each position. The result was a U-shaped performance curve. Models performed best when the answer appeared at the very beginning or end of the context, and performance degraded significantly when it sat in the middle.

This is called the lost-in-the-middle problem.
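
The shape of that experiment is easy to reproduce against your own model. The sketch below is a minimal harness, not the authors' code: `build_context`, `accuracy_by_depth`, and `toy_model` are names invented for this example, and `toy_model` is a deliberately fake stand-in that only sees the first and last 150 characters of the prompt, so its edge-only behaviour is built in rather than measured.

```python
def build_context(distractors, needle, depth):
    """Insert `needle` among `distractors` at a relative depth in [0, 1]."""
    index = round(depth * len(distractors))
    return distractors[:index] + [needle] + distractors[index:]


def accuracy_by_depth(ask_model, distractors, needle, question, answer, depths):
    """Return {depth: 1.0 or 0.0} depending on whether the reply contains the answer."""
    results = {}
    for depth in depths:
        docs = build_context(distractors, needle, depth)
        prompt = "\n\n".join(docs) + "\n\nQuestion: " + question
        reply = ask_model(prompt)
        results[depth] = 1.0 if answer.lower() in reply.lower() else 0.0
    return results


def toy_model(prompt):
    """Fake model (assumption for the demo): echoes only the edges of the prompt."""
    return prompt[:150] + prompt[-150:]


distractors = [f"Document {i}: nothing relevant here." for i in range(20)]
needle = "The vault code is 7319."

scores = accuracy_by_depth(
    toy_model, distractors, needle,
    question="What is the vault code?", answer="7319",
    depths=[0.0, 0.5, 1.0],
)
print(scores)  # {0.0: 1.0, 0.5: 0.0, 1.0: 1.0}
```

To use it for real, replace `toy_model` with a call to your own model and average over many needle and question pairs per depth. A single example per position is too noisy to draw a curve from.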

A 2026 theoretical analysis by Barbero et al. confirmed the cause is structural, not learned. The U-shape appears at initialization, before any training. Causal masking guarantees primacy bias. Residual connections guarantee a recency anchor. The model's architecture creates the problem before the model has seen a single training token.
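
One ingredient of the primacy claim shows up in a toy calculation. This is an illustration only, not the derivation in Barbero et al.: it assumes a single attention layer whose scores are all equal (a rough stand-in for an untrained model), so each token attends uniformly over itself and everything before it. `received_attention` is a name made up for this example, and the sketch says nothing about the recency anchor from residual connections.

```python
def received_attention(n):
    """Total attention each position receives under uniform causal attention.

    Query i (0-indexed) spreads weight 1 / (i + 1) evenly over positions 0..i,
    so position j collects the sum of 1 / (i + 1) over all i >= j.
    """
    totals = [0.0] * n
    for i in range(n):
        weight = 1.0 / (i + 1)
        for j in range(i + 1):
            totals[j] += weight
    return totals


totals = received_attention(8)
for position, total in enumerate(totals):
    print(position, round(total, 3))
```

The totals fall strictly from the first position to the last: the first token is visible to every query, while the last token is visible only to itself. Under these assumptions the causal mask alone tilts attention mass toward the start of the sequence.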

---

Why RoPE Does Not Fix It

Rotary Position Embedding (RoPE) is how most modern LLMs encode token positions. When a model is trained at 4k context and you extend it to 128k via RoPE scaling (NTK-aware scaling, YaRN, LongRoPE), the scaling adjusts the rotation frequencies so the position encoding covers the longer range. Perplexity stays low.
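
For a sense of what the scaling step touches, here is a minimal sketch of NTK-aware base scaling. The head dimension of 128 and base of 10000 are common defaults assumed for illustration, `rope_inv_freq` and `ntk_scaled_base` are names invented here, and YaRN and LongRoPE use more elaborate per-frequency schemes that are not shown.

```python
def rope_inv_freq(dim, base=10000.0):
    """Rotation frequency for each dimension pair: base ** (-2i / dim)."""
    return [base ** (-2.0 * i / dim) for i in range(dim // 2)]


def ntk_scaled_base(base, scale, dim):
    """NTK-aware scaling: enlarge the base rather than shrinking the positions."""
    return base * scale ** (dim / (dim - 2))


dim = 128                      # assumed head dimension
trained_len = 4096
target_len = 131072
scale = target_len / trained_len   # 32x extension

original = rope_inv_freq(dim)
extended = rope_inv_freq(dim, ntk_scaled_base(10000.0, scale, dim))

# The fastest-rotating pair is unchanged; the slowest pair is slowed by `scale`.
print(original[0] / extended[0])
print(original[-1] / extended[-1])
```

The first ratio is 1 and the last is approximately 32, the extension factor. Everything the scaling does happens in these frequencies. Nothing in it directly controls where attention lands.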

But low perplexity is not the same as correct retrieval.

Pal et al. (2024) showed that RoPE extensions can produce superficially low perplexity while losing the ability to retrieve specific information from long contexts. The model looks healthy on aggregate metrics. It fails on targeted lookups.

A separate study of RoPE extensions from an attention perspective (Jin et al., 2024) found that retrieval errors cluster at positions where attention uncertainty is high, that is, where the model cannot confidently locate the relevant tokens. RoPE scaling does not reduce that uncertainty. It redistributes it.
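
One common way to put a number on attention uncertainty is the Shannon entropy of a query's attention distribution. Whether this matches the exact measure used in the study is an assumption. The sketch below only shows the quantity itself, using hand-picked scores rather than real model outputs.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_entropy(weights):
    """Shannon entropy, in nats, of one attention distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


focused = softmax([8.0, 0.0, 0.0, 0.0])   # one key clearly wins
diffuse = softmax([0.0, 0.0, 0.0, 0.0])   # no key stands out

print(attention_entropy(focused))   # close to 0
print(attention_entropy(diffuse))   # log(4), the maximum for four keys
```

Logging this per query position on a retrieval benchmark gives you something to correlate against the positions where lookups fail.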

---

The Sliding Window Tradeoff

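Mistral 7B (Jiang et al., 2023) addressed the compute problem with sliding window attention (SWA): each token attends only to a fixed window of the W most recent tokens (W = 4096). Because each layer can reach W tokens further back than the layer below it, information propagates beyond the window as layers stack. With Mistral 7B's 32 layers, the theoretical attention span is roughly 131k tokens (32 × 4096).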
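
A small sketch makes the mask and the span arithmetic concrete. It assumes the window counts the current token, which is a convention that can differ by one between implementations, and the function names are invented for this example.

```python
def sliding_window_mask(seq_len, window):
    """mask[i][j] is True when query i may attend to key j.

    Causal (j <= i) and limited to the `window` most recent tokens,
    counting the current one.
    """
    return [
        [(j <= i) and (i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]


def theoretical_span(window, num_layers):
    """Upper bound on how far back information can travel through the stack."""
    return window * num_layers


for row in sliding_window_mask(6, 3):
    print("".join("x" if allowed else "." for allowed in row))

print(theoretical_span(4096, 32))  # 131072
```

The printed mask is a diagonal band: every row past the third sees exactly three keys. The span figure is an upper bound on reachability, not a claim that the model makes good use of tokens that far back.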

The tradeoff is explicit in the Mistral documentation: tokens outside the sliding window still influence predictions through layer stacking, but when sequences become very long, the model stops using the full context.

SWA solves the quadratic cost of full attention. It does not solve position bias. It trades one problem for another — and the documentation says so directly.

---

What This Means in Practice

If you are building a system that relies on long-context retrieval:

Chunk placement matters. Critical information should not be buried in the middle of a long context. This is not a stylistic preference. It is a response to documented architecture behaviour.

Window size is not recall accuracy. A 128k context window is a necessary condition for long-context tasks, not a sufficient one. Benchmark your actual retrieval accuracy across positions — not just aggregate scores.

Perplexity is not your health metric. A model can score well on perplexity while failing consistently at targeted retrieval. Use position-aware evals.

Reranking context before injection helps. Placing the most relevant chunks at the beginning or end of the context window is a practical mitigation with immediate, measurable benefit. It exploits primacy and recency bias rather than fighting it.
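
The reordering step can be sketched in a few lines. This assumes the chunks arrive already sorted by relevance, most relevant first, from whatever reranker you use. `reorder_for_edges` is a name invented for this example, not a library function.

```python
def reorder_for_edges(chunks_by_relevance):
    """Put the most relevant chunks at the edges and the least relevant in the middle.

    `chunks_by_relevance` must be sorted most relevant first.
    """
    front, back = [], []
    for rank, chunk in enumerate(chunks_by_relevance):
        if rank % 2 == 0:
            front.append(chunk)   # ranks 0, 2, 4, ... fill in from the start
        else:
            back.append(chunk)    # ranks 1, 3, 5, ... fill in from the end
    return front + back[::-1]


ranked = ["r1", "r2", "r3", "r4", "r5"]
print(reorder_for_edges(ranked))  # ['r1', 'r3', 'r5', 'r4', 'r2']
```

The best chunk lands first, the second best lands last, and the weakest ends up in the middle. Whether the start or the end deserves the top chunk depends on the model and on where the question sits in the prompt, so measure both orderings on your own data.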

---

The long context race — 128k, 1M, 2M tokens — is about capacity. Attention quality across that capacity is a different engineering problem. Buying a larger bucket does not fix the leak.

References

Liu, N.F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. (2024). Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12, 157–173. https://aclanthology.org/2024.tacl-1.9.pdf

Barbero, F. et al. (2026). Lost in the Middle at Birth: An Exact Theory of Transformer Position Bias. arXiv preprint arXiv:2603.10123. https://arxiv.org/pdf/2603.10123v1

Pal, A. et al. (2024). Understanding superficial long context capability with RoPE base values. arXiv preprint arXiv:2405.14591. https://arxiv.org/pdf/2405.14591

Jiang, A.Q. et al. (2023). Mistral 7B. arXiv preprint arXiv:2310.06825. https://arxiv.org/pdf/2310.06825

Jin, D. et al. (2024). Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective. arXiv preprint arXiv:2406.13282. https://arxiv.org/html/2406.13282v2

Cite as

devinfo.dev. (2026). "Long Context Is Not Long Attention." devinfo.dev:2026.0032. https://devinfo.dev/d/2026.0032

Content licensed under CC BY-NC 4.0. Free to share with attribution for non-commercial use.
