Long context does not mean reliable context
Every modern SLM advertises a long context window — Gemma 3 at 128k, Qwen3 at 128k, Phi-4-mini at 128k. But advertised context length and usable context length are different things. Liu et al. 2023 (“Lost in the Middle: How Language Models Use Long Contexts”) ran a multi-document QA probe on GPT-3.5-Turbo, Claude 1.3, MPT-30B-Instruct, and LongChat. They put a “needle” (the document containing the answer) at different positions in a 20-document prompt and measured retrieval. Accuracy was ~75% when the needle sat in position 1, dropped smoothly to ~53% at position 10 (the middle), and climbed back to ~63% at position 20. The shape is a lopsided U — the start of context is the most recoverable, the tail is second, and the deep middle loses roughly a third of recall versus the edges. This held even on models whose advertised window was several times larger than the prompt.
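The probe itself is simple to reconstruct. A minimal sketch of the prompt-building step — the document texts and question here are placeholders, not the paper's actual data:

```python
# Sketch of the "Lost in the Middle" probe: build a 20-document prompt
# with the answer-bearing document (the "needle") at a chosen position.
# In the real probe each distractor is a retrieved Wikipedia passage and
# the model is then asked the needle's question.

def build_probe_prompt(needle_doc: str, distractors: list[str], position: int) -> str:
    """Place needle_doc at 1-indexed `position` among the distractors."""
    docs = distractors[: position - 1] + [needle_doc] + distractors[position - 1 :]
    numbered = "\n\n".join(f"Document [{i + 1}]: {d}" for i, d in enumerate(docs))
    return numbered + "\n\nQuestion: <question about the needle>\nAnswer:"

# Sweep the needle through positions and score the model at each one
# (scoring is the hypothetical part; the curve in the text is what you'd plot).
distractors = [f"Distractor passage {i}." for i in range(19)]
for pos in (1, 10, 20):
    prompt = build_probe_prompt("NEEDLE: the answer is 42.", distractors, pos)
    # accuracy[pos] = score(model(prompt))
```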
Drag the slider below to move a needle through a 128k-token context. Watch the retrieval curve.
Why the middle gets lost
Attention is a softmax over all tokens. With a 128k-token context, the average weight per token is 1/128000 ≈ 8×10⁻⁶. For the model to actually retrieve the token at position 60,000, its attention needs to concentrate sharply on that position. That concentration requires training: the model has to have seen many examples of “attend precisely to a needle deep in a long context.” But long-context training is expensive, so there are few such examples in pretraining.
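The arithmetic makes the difficulty concrete. Because softmax weights are exponential in the logits, the logit margin the needle needs over every other token grows only logarithmically with context length — but it still has to be learned. A small worked example:

```python
import math

N = 128_000  # context length in tokens

# Uniform baseline: with no learned preference, each token gets 1/N
# of the attention mass.
uniform = 1 / N  # ≈ 7.8e-6

def needle_weight(delta: float, n: int = N) -> float:
    """Post-softmax weight on the needle if its attention logit exceeds
    every other token's logit by `delta`."""
    return math.exp(delta) / (math.exp(delta) + (n - 1))

# Margin needed for the needle to capture half the attention mass:
# delta = ln(N - 1), i.e. it grows logarithmically with context length.
delta_half = math.log(N - 1)  # ≈ 11.76
```

So retrieving one token out of 128k means the model must produce a logit roughly 12 nats above the crowd — a sharp, trained behavior, not something attention does by default.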
Meanwhile, the beginning of the context is reinforced by every subsequent attention head naturally routing through it (it's the start of every sequence), and the end is reinforced by recency bias (recent tokens are closest to the next prediction). The middle gets neither advantage.
The production answer — RAG over short context
Long context should be treated as unreliable for retrieval. For production SLM workloads:
- Retrieve first, then summarize. Use a vector store to pull the top-k relevant chunks, concatenate them into a 2–8k-token prompt, and let the SLM work in its sweet spot.
- Re-rank to the tail. When you do stuff a long prompt, put the most relevant retrieved chunks last, not first. Liu's U-curve is asymmetric — tail recall beats middle recall by ~10 points — so exploiting recency with a reranker (bge-reranker, Cohere Rerank) gives a free accuracy bump with no retraining.
- Short effective context. Even if the model supports 128k, keep the actual prompt under 16k for reliable behavior.
- Needle-in-a-haystack fine-tuning. Anthropic's Claude 2.1 release notes explicitly described retraining on synthetic long-context retrieval tasks to flatten the U; the published curves showed the middle dip shrinking from ~30 points to ~5. It does not go to zero — the attention sink at position 0 is baked into the positional distribution — but it turns a cliff into a gentle slope.
- Gemma 3's local-global hybrid helps, but doesn't fully fix the middle — it makes long context feasible, not reliable.
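The retrieve-and-rerank pattern above can be sketched end to end. This is a toy: the relevance score is keyword overlap standing in for a real embedding search plus a reranker such as bge-reranker; the point is the ordering step, which places the best chunk at the tail of the prompt:

```python
# Minimal retrieve-then-rerank sketch. Relevance scoring here is a toy
# stand-in for a vector store + reranker; what matters is that chunks are
# sorted so the MOST relevant one lands LAST, exploiting the tail of the
# U-curve instead of fighting the lost middle.

def score(query: str, chunk: str) -> float:
    """Toy relevance: fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)

def build_context(query: str, chunks: list[str], k: int = 4) -> str:
    top_k = sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]
    top_k.reverse()  # ascending relevance: best chunk ends up at the tail
    return "\n\n".join(top_k) + f"\n\nQuestion: {query}\nAnswer:"
```

Swapping `score` for a cross-encoder reranker keeps the same shape: retrieve wide, rerank, reverse, and keep the assembled prompt well under the model's advertised window.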
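For teams fine-tuning their own models, the synthetic-retrieval idea is easy to reproduce. A hypothetical generator in that spirit — the filler text, "secret code" framing, and field names are all illustrative assumptions, not any published recipe:

```python
import random

# Hypothetical generator for needle-in-a-haystack training examples:
# drop a fact at a uniformly random position (including the deep middle)
# inside filler text, and train the model to answer a question about it.

def make_needle_example(n_filler: int, rng: random.Random) -> dict:
    key = rng.randint(1000, 9999)
    needle = f"The secret code is {key}."
    filler = [f"Filler sentence number {i}." for i in range(n_filler)]
    pos = rng.randrange(len(filler) + 1)  # uniform over positions
    filler.insert(pos, needle)
    return {
        "context": " ".join(filler),
        "question": "What is the secret code?",
        "answer": str(key),
    }
```

Sampling the needle position uniformly is the whole trick: it forces the training distribution to cover the middle positions that pretraining underrepresents.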