Context Windows and Lost-in-the-Middle

You bought a model with a giant context window. You pasted in forty pages. The model answered using the first page and the last, and treated the thirty-eight pages in between like background noise it had agreed to ignore. This isn't a bug report. It's the documented, reproducible default behavior of long-context models, and pretending otherwise is how good RAG systems quietly rot.

What a context window actually is

The context window is the maximum number of tokens — prompt plus generated output — the model can attend over in a single pass. Mechanically, it's bounded by the positions the model was trained to handle and by memory: attention builds a KV cache that grows with sequence length, so longer contexts cost more memory and compute per step.
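To put rough numbers on the memory side, here is a back-of-the-envelope sketch. The model shape (32 layers, 8 KV heads, head dimension 128, 16-bit values) is a made-up example rather than any particular model, and the formula ignores weights, activations, and batching.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for a single sequence.

    Each layer stores one key vector and one value vector per token per
    KV head, hence the leading factor of 2.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical model shape, chosen only for illustration.
SHAPE = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for tokens in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(tokens, **SHAPE) / 2**30
    print(f"{tokens:>7} tokens -> {gib:.2f} GiB of KV cache")
```

With these assumed dimensions the cache costs 128 KiB per token, so it grows linearly from 0.5 GiB at 4,096 tokens to 16 GiB at 131,072.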

A bigger window means the model can reference more. It does not mean the model attends to all of it evenly. Those are separate claims, and the gap between them is the entire topic of this post.

The U-curve

In 2023, Liu and colleagues ran a clean experiment (published in TACL the following year). Give a model a multi-document question-answering task. Bury the one document that actually contains the answer at different positions — first, middle, last — and measure accuracy as you slide it around.

The result was a U. Accuracy is highest when the relevant information sits near the beginning or the end of the context, and it sags in the middle. On some models the dip was large enough that, with the answer document placed dead-center, accuracy fell below what the model scored with no documents at all (the closed-book setting). They named it lost in the middle, and the name stuck because so many people who tested their own pipelines saw the same shape.

[Figure: retrieval accuracy versus the position of the relevant text in the context window. The lost-in-the-middle U-curve: accuracy peaks at the edges and sags in the middle.]

The exact numbers depend on the model and task, so treat the curve as the shape, not a spec sheet. But the shape has shown up across many models and tasks, and it carries an unsettling implication: where you place a fact in the prompt changes whether the model can use it, independent of the fact itself.
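The position sweep described in this section can be sketched in a few lines. This is a minimal harness, not the original authors' code: the prompt template, the substring-match scoring, and every name here are assumptions. The `edge_only_reader` function is a toy stand-in for a model that produces a U by construction, so its output only shows that the harness works.

```python
from typing import Callable, Sequence


def build_prompt(distractors: Sequence[str], gold: str, position: int, question: str) -> str:
    """Insert the gold document at `position` among the distractors."""
    docs = list(distractors)
    docs.insert(position, gold)
    numbered = [f"Document [{i + 1}]: {d}" for i, d in enumerate(docs)]
    return "\n".join(numbered) + f"\n\nQuestion: {question}\nAnswer:"


def accuracy_by_position(
    ask: Callable[[str], str],
    examples: Sequence[dict],
    positions: Sequence[int],
) -> dict:
    """Fraction of examples answered correctly for each gold position."""
    results = {}
    for pos in positions:
        hits = 0
        for ex in examples:
            prompt = build_prompt(ex["distractors"], ex["gold"], pos, ex["question"])
            # Crude scoring: the expected answer appears somewhere in the reply.
            if ex["answer"].lower() in ask(prompt).lower():
                hits += 1
        results[pos] = hits / len(examples)
    return results


def edge_only_reader(prompt: str) -> str:
    """Toy stand-in for a model: it only 'reads' the first and last document."""
    docs = [line for line in prompt.splitlines() if line.startswith("Document [")]
    return docs[0] + " " + docs[-1]


# Invented example data; swap in your own task and a real model call for `ask`.
examples = [
    {
        "distractors": [
            "Project Osprey launched in March.",
            "Project Kestrel was cancelled last year.",
            "Project Wren has no launch code.",
            "Project Finch shipped on schedule.",
        ],
        "gold": "The launch code for Project Heron is 7419.",
        "question": "What is the launch code for Project Heron?",
        "answer": "7419",
    }
]

curve = accuracy_by_position(edge_only_reader, examples, positions=[0, 2, 4])
print(curve)  # {0: 1.0, 2: 0.0, 4: 1.0}
```

To run this against a real system, replace `edge_only_reader` with a function that sends the prompt to your model and returns its text reply, and use enough examples that the per-position averages mean something.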

Why the middle gets lost

A few forces stack up, and it's worth separating them because the fixes differ.

Position bias from training. Models see a lot of text where the important stuff lives at the top (topic sentences) or the bottom (conclusions, recent dialogue turns). The recency end is reinforced hard by the next-token objective — the model is always predicting what comes right after the most recent tokens. The beginning gets a primacy boost. The middle gets neither.

Attention dilution. With thousands of tokens competing, the softmax has to spread its weight thin. A genuinely relevant token in the middle has to out-shout enormous volumes of mildly-relevant filler around it, and often loses.
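A toy calculation shows how quickly the dilution bites. It assumes a single softmax over one relevant token and a crowd of identical filler tokens, with invented scores; real attention has many heads and layers with learned scores, so read this as arithmetic, not a model of any actual network.

```python
import math


def weight_on_relevant(n_filler: int, relevant_score: float, filler_score: float) -> float:
    """Softmax weight of one relevant token among n_filler equal-scoring tokens."""
    relevant = math.exp(relevant_score)
    filler = n_filler * math.exp(filler_score)
    return relevant / (relevant + filler)


# Illustrative scores only: the relevant token sits 5 logits above the filler.
for n in (100, 1_000, 10_000, 100_000):
    w = weight_on_relevant(n, relevant_score=5.0, filler_score=0.0)
    print(f"{n:>7} filler tokens -> {w:.4f} of the attention weight")
```

Even with a 5-logit advantage, the relevant token's share falls from about 0.60 among 100 filler tokens to under 0.01 among 100,000.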

Positional encoding behavior at length. Schemes like RoPE encode relative distance, and models are usually trained on sequences far shorter than the windows they're later stretched to serve. Positions the model rarely saw at training time are handled less reliably, and those tend to be the long-range, deep-in-the-middle ones.
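The "relative distance" property is easy to check numerically. The sketch below applies a rotary encoding to a single 2-D feature pair, written as a complex number; the frequency `theta` and the vectors are arbitrary choices, and a real model uses many such pairs at different frequencies.

```python
import cmath


def rope_score(q: complex, k: complex, q_pos: int, k_pos: int, theta: float = 0.01) -> float:
    """Attention score for one 2-D feature pair under rotary encoding.

    Each vector is rotated by an angle proportional to its position. The
    score is the ordinary dot product of the two rotated 2-D vectors,
    computed here as the real part of q_rot * conj(k_rot).
    """
    q_rot = q * cmath.exp(1j * q_pos * theta)
    k_rot = k * cmath.exp(1j * k_pos * theta)
    return (q_rot * k_rot.conjugate()).real


q, k = complex(1.0, 0.5), complex(0.3, -0.8)

near_start = rope_score(q, k, q_pos=10, k_pos=5)
deep_in_context = rope_score(q, k, q_pos=100_010, k_pos=100_005)
different_offset = rope_score(q, k, q_pos=10, k_pos=2)

# Same offset (5 tokens apart) gives the same score wherever the pair sits.
print(abs(near_start - deep_in_context) < 1e-9)
# A different offset gives a different score.
print(abs(near_start - different_offset) > 1e-3)
```

Because only the offset matters, a model that has mostly seen short offsets in training meets rotation angles it has little experience with once the query and the relevant key are tens of thousands of tokens apart.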

"Just use a bigger window" — the contrarian read

The marketing line is that million-token windows make retrieval obsolete: dump everything in, let the model sort it out. The data says be careful.

There's a difference between retrieval and reasoning over what you retrieved. The needle-in-a-haystack test — hide one sentence in a long document, ask for it verbatim — is a retrieval test, and modern long-context models pass it impressively. That's why the demos look magical. But benchmarks like RULER (2024) raised the bar: ask for multiple needles, ask the model to aggregate or reason across positions, add distractors that look like the answer. Effective context — the length at which a model still performs the task, not just spots a string — is routinely much shorter than the advertised window. A model rated for hundreds of thousands of tokens might hold task quality only across a fraction of that before degrading.
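One way to make "effective context" concrete is to score the same task at several lengths and report the longest length that still clears a quality bar. The sketch below does that; the threshold and the scores are invented for illustration, and the function is a hypothetical helper, not the scoring code of any benchmark.

```python
def effective_context(scores_by_length: dict, threshold: float) -> int:
    """Longest tested length up to which every score stays at or above threshold.

    Returns 0 if even the shortest tested length falls below the threshold.
    """
    effective = 0
    for length in sorted(scores_by_length):
        if scores_by_length[length] < threshold:
            break
        effective = length
    return effective


# Made-up task scores for a hypothetical model with a 128k advertised window.
scores = {
    4_000: 0.96,
    8_000: 0.94,
    16_000: 0.91,
    32_000: 0.86,
    64_000: 0.74,
    128_000: 0.58,
}

print(effective_context(scores, threshold=0.85))  # 32000
```

In this invented case the advertised window is 128k tokens but the task only holds up to 32k, a quarter of it.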

So the honest answer to "does a 1M-token window kill RAG?" is: it kills the naive reason to do RAG (fitting more text), and it sharpens the real reason (giving the model less, better-ordered text so attention isn't spread across noise). Bigger windows make retrieval less mandatory and good curation more valuable, not less.

What to actually do

This is a layout-and-curation problem as much as a model problem. Concretely:

Put the load-bearing content at the edges. If you have a key instruction or the most relevant chunk, place it near the start or the very end of the prompt — not buried in position 15 of 30.
Send fewer documents. Retrieve broadly, then rerank and keep the top handful. Ten well-chosen chunks tend to beat fifty mediocre ones, often by a wide margin. The middle you never include can't get lost.
Order by relevance, not by source. Some teams place the highest-scoring chunk last (nearest the question, riding the recency boost). Test which end your model prefers — it varies.
Compress the middle. Summarize or drop low-value spans so the surviving context is denser.
Measure your own curve. Take your real task, slide a known-relevant chunk through positions, and plot accuracy. Your pipeline has a U. You just haven't looked at it yet.
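The first three items can be combined into one small layout step. This is a sketch under assumptions: `scored_chunks` is taken to be a list of (score, text) pairs from whatever reranker you use, and `select_and_order` is a hypothetical helper, not part of any library.

```python
def select_and_order(scored_chunks, keep=6, best_last=False):
    """Keep the top-`keep` chunks, then push the strongest to the edges.

    Ranks alternate between the front and the back of the prompt, so the
    weakest surviving chunks land in the middle. With best_last=True the
    order is mirrored and the top chunk sits nearest the question.
    """
    top = sorted(scored_chunks, key=lambda pair: pair[0], reverse=True)[:keep]
    front, back = [], []
    for rank, (_, text) in enumerate(top):
        (front if rank % 2 == 0 else back).append(text)
    ordered = front + back[::-1]
    return ordered[::-1] if best_last else ordered


# Invented reranker scores; the letters stand in for chunk text.
chunks = [(0.91, "A"), (0.40, "B"), (0.77, "C"), (0.12, "D"),
          (0.85, "E"), (0.63, "F"), (0.30, "G")]

print(select_and_order(chunks, keep=5))                  # ['A', 'C', 'B', 'F', 'E']
print(select_and_order(chunks, keep=5, best_last=True))  # ['E', 'F', 'B', 'C', 'A']
```

The two lowest-scoring chunks are dropped, the best two sit at the ends, and the weakest survivor ("B") is in the middle. Which value of `best_last` works better is something to measure on your own model rather than assume.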

The uncomfortable part is that none of this is fixed by waiting for a longer window. A model that attends unevenly across 100k tokens will attend unevenly across a million — the dead zone just gets wider. Context length is a budget, and like any budget, the wins come from spending less of it on things that don't matter. The model reads everything you give it. It just doesn't believe all of it equally, and it never told you which parts it skipped.