Long context does not mean reliable context
Every modern SLM advertises a long context window — Gemma 3 at 128k, Qwen3 at 128k, Phi-4-mini at 128k. But advertised context length and usable context length are different things. Liu et al. 2023 (“Lost in the Middle: How Language Models Use Long Contexts”) ran a multi-document QA probe on GPT-3.5-Turbo, Claude 1.3, MPT-30B-Instruct, and LongChat. They put a “needle” (the document containing the answer) at different positions in a 20-document prompt and measured retrieval. Accuracy was ~75% when the needle sat in position 1, dropped smoothly to ~53% at position 10 (the middle), and climbed back to ~63% at position 20. The shape is a lopsided U — the start of context is the most recoverable, the tail is second, and the deep middle loses roughly a third of recall versus the edges. This held even on models whose advertised window was several times larger than the prompt.
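The probe itself is simple to reconstruct. A minimal sketch of the prompt-building step — the document texts and question here are placeholders, not the paper's actual data:

```python
# Sketch of the "Lost in the Middle" probe: build a 20-document prompt
# with the answer-bearing document (the "needle") at a chosen position.
# In the real probe each distractor is a retrieved Wikipedia passage and
# the model is then asked the needle's question.

def build_probe_prompt(needle_doc: str, distractors: list[str], position: int) -> str:
    """Place needle_doc at 1-indexed `position` among the distractors."""
    docs = distractors[: position - 1] + [needle_doc] + distractors[position - 1 :]
    numbered = "\n\n".join(f"Document [{i + 1}]: {d}" for i, d in enumerate(docs))
    return numbered + "\n\nQuestion: <question about the needle>\nAnswer:"

# Sweep the needle through positions and score the model at each one
# (scoring is the hypothetical part; the curve in the text is what you'd plot).
distractors = [f"Distractor passage {i}." for i in range(19)]
for pos in (1, 10, 20):
    prompt = build_probe_prompt("NEEDLE: the answer is 42.", distractors, pos)
    # accuracy[pos] = score(model(prompt))
```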
Drag the slider below to move a needle through a 128k-token context. Watch the retrieval curve.
Why the middle gets lost
Attention is a softmax over all tokens. With a 128k-token context, the average weight per token is 1/128000 ≈ 8×10⁻⁶. For the model to actually retrieve the token at position 60,000, its attention needs to concentrate sharply on that position. That concentration requires training: the model has to have seen many examples of “attend precisely to a needle deep in a long context.” But long-context training is expensive, so there are few such examples in pretraining.
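The arithmetic makes the difficulty concrete. Because softmax weights are exponential in the logits, the logit margin the needle needs over every other token grows only logarithmically with context length — but it still has to be learned. A small worked example:

```python
import math

N = 128_000  # context length in tokens

# Uniform baseline: with no learned preference, each token gets 1/N
# of the attention mass.
uniform = 1 / N  # ≈ 7.8e-6

def needle_weight(delta: float, n: int = N) -> float:
    """Post-softmax weight on the needle if its attention logit exceeds
    every other token's logit by `delta`."""
    return math.exp(delta) / (math.exp(delta) + (n - 1))

# Margin needed for the needle to capture half the attention mass:
# delta = ln(N - 1), i.e. it grows logarithmically with context length.
delta_half = math.log(N - 1)  # ≈ 11.76
```

So retrieving one token out of 128k means the model must produce a logit roughly 12 nats above the crowd — a sharp, trained behavior, not something attention does by default.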
Meanwhile, the beginning of the context is reinforced by every subsequent attention head naturally routing through it (it's the start of every sequence), and the end is reinforced by recency bias (recent tokens are closest to the next prediction). The middle gets neither advantage.
The production answer — RAG over short context
Long context should be treated as unreliable for retrieval. For production SLM workloads:
- Retrieve first, then summarize. Use a vector store to pull the top-k relevant chunks, concatenate them into a 2–8k-token prompt, and let the SLM work in its sweet spot.
- Re-rank to the tail. When you do stuff a long prompt, put the most relevant retrieved chunks last, not first. Liu's U-curve is asymmetric — tail recall beats middle recall by ~10 points — so exploiting recency with a reranker (bge-reranker, Cohere Rerank) gives a free accuracy bump with no retraining.
- Short effective context. Even if the model supports 128k, keep the actual prompt under 16k for reliable behavior.
- Needle-in-a-haystack fine-tuning. Anthropic's Claude 2.1 release notes explicitly described retraining on synthetic long-context retrieval tasks to flatten the U; the published curves showed the middle dip shrinking from ~30 points to ~5. It does not go to zero — the attention sink at position 0 is baked into the positional distribution — but it turns a cliff into a gentle slope.
- Gemma 3's local-global hybrid helps, but doesn't fully fix the middle — it makes long context feasible, not reliable.
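The retrieve-and-rerank pattern above can be sketched end to end. This is a toy: the relevance score is keyword overlap standing in for a real embedding search plus a reranker such as bge-reranker; the point is the ordering step, which places the best chunk at the tail of the prompt:

```python
# Minimal retrieve-then-rerank sketch. Relevance scoring here is a toy
# stand-in for a vector store + reranker; what matters is that chunks are
# sorted so the MOST relevant one lands LAST, exploiting the tail of the
# U-curve instead of fighting the lost middle.

def score(query: str, chunk: str) -> float:
    """Toy relevance: fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)

def build_context(query: str, chunks: list[str], k: int = 4) -> str:
    top_k = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]
    top_k.reverse()  # ascending relevance: best chunk ends up at the tail
    return "\n\n".join(top_k) + f"\n\nQuestion: {query}\nAnswer:"
```

Swapping `score` for a cross-encoder reranker keeps the same shape: retrieve wide, rerank, reverse, and keep the assembled prompt well under the model's advertised window.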
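For teams fine-tuning their own models, the synthetic-retrieval idea is easy to reproduce. A hypothetical generator in that spirit — the filler text, "secret code" framing, and field names are all illustrative assumptions, not any published recipe:

```python
import random

# Hypothetical generator for needle-in-a-haystack training examples:
# drop a fact at a uniformly random position (including the deep middle)
# inside filler text, and train the model to answer a question about it.

def make_needle_example(n_filler: int, rng: random.Random) -> dict:
    key = rng.randint(1000, 9999)
    needle = f"The secret code is {key}."
    filler = [f"Filler sentence number {i}." for i in range(n_filler)]
    pos = rng.randrange(len(filler) + 1)  # uniform over positions
    filler.insert(pos, needle)
    return {
        "context": " ".join(filler),
        "question": "What is the secret code?",
        "answer": str(key),
    }
```

Sampling the needle position uniformly is the whole trick: it forces the training distribution to cover the middle positions that pretraining underrepresents.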