Use Ctrl+P (or Cmd+P) to save as PDF. Back to paper
A model that fits 128K tokens can still fail to use information you placed at token 60K. The context window is a capacity claim. Where you put information inside that window is a separate engineering decision — one with a measurable performance cost if you get it wrong.
In 2023, Liu et al. ran a systematic experiment: they placed a relevant document at different positions inside a long input context, then measured multi-document question-answering accuracy. The result was a U-shaped performance curve.
Models performed best when the relevant information appeared at the very beginning or the very end of the context. Performance degraded significantly — in some cases by more than 20 percentage points — when the same information was placed in the middle. In the worst tested configuration (20–30 documents), GPT-3.5-Turbo's performance fell below its closed-book baseline: the model did better with no documents than with the relevant document buried in the middle.
The phenomenon has a name: Lost in the Middle.
The U-shape reflects two overlapping biases:
Primacy bias — models attend disproportionately to early tokens. This is structural. Causal masking in a deep autoregressive transformer guarantees that early tokens accumulate more contextualized representations across layers. Chowdhury (2026) showed analytically that primacy bias is not a training artifact — it is a structural inevitability of deep autoregressive transformers. Causal masking alone guarantees it.
Recency bias — models attend disproportionately to recent tokens. Wu et al. (2025) showed that LayerNorm, not just positional encoding, induces this effect. The residual connection architecture preserves identity paths through depth, biasing finite-depth rollout toward recent tokens.
The middle of the context has no such structural advantage. It is simply farther from both anchors.
For RAG systems: If retrieval returns five chunks and you place the highest-relevance chunk third, you have engineered a degraded response. Chunk ordering is not cosmetic — it is part of the retrieval quality problem. The finding from Liu et al. applies directly: reranking by relevance and placing the top result first (or last) is not optional.
For long-document tasks: Summarization over a 50-page document does not uniformly attend to all pages. Instructions or constraints buried in the middle of a system prompt are less likely to be followed than those at the top.
For agentic pipelines: Tool call results appended to the middle of an accumulating context are in the worst position. If a result is critical, it belongs at a structural boundary — beginning of a new message, or last in the sequence.
Place high-salience content at the structural edges of your context. Beginning and end are not arbitrary — they are architecturally privileged positions. This is not a workaround. It is an acknowledgment of what the model is.
The context window tells you the maximum. Position tells you what the model will actually use.