Maybe position is the problem, not the solution
You just learned that RoPE encodes position by rotating Q and K vectors. Every layer uses RoPE. That was the conventional wisdom until Yang et al. (2025) asked a surprisingly sharp question: what if some layers do better without any position encoding at all?
The argument runs: position is a lexical signal. Early layers absolutely need it — the model needs to know that “dog” comes before “bites”. Late layers, though, are doing more abstract aggregation — pulling together information across the sequence into a final representation. Forcing positional structure into those aggregations can bottleneck them, because the model has to spend capacity working around angular extrapolation at long contexts.
Selective NoPE — a compromise
Instead of all-RoPE or no-RoPE, SmolLM3 adopted the pattern from the RoPE-to-NoPE paper: remove RoPE from every 4th layer. The other three-quarters of layers keep it; the remaining quarter is positionally blind.
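The layer pattern is simple enough to sketch directly. This is a hypothetical illustration of the "every 4th layer skips RoPE" rule — the layer count and helper name here are made up, and the real SmolLM3 config may index layers differently:

```python
# Hypothetical sketch of selective NoPE: every 4th layer (0-indexed
# layers 3, 7, 11, ...) skips RoPE; the rest apply it as usual.
def uses_rope(layer_idx: int) -> bool:
    return (layer_idx + 1) % 4 != 0

n_layers = 12  # illustrative, not the actual SmolLM3 depth
rope_layers = [i for i in range(n_layers) if uses_rope(i)]
nope_layers = [i for i in range(n_layers) if not uses_rope(i)]
# rope_layers -> 9 layers with RoPE; nope_layers -> [3, 7, 11]
```

In a forward pass, the branch is just `q, k = apply_rope(q, k, pos) if uses_rope(i) else (q, k)` — the NoPE layers see raw content projections.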
Ablations on the same 3B model showed the NoPE-included version generalized better to longer contexts than the pure-RoPE baseline, without hurting short-context performance. The mechanistic story is that the position-free layers learn to aggregate global information without fighting RoPE's angular extrapolation, while the RoPE-carrying layers handle local syntax.
Why a position-free layer isn't positionally blind
The first thing you should ask is: if you strip RoPE from a layer in a decoder-only model, can it still tell token order at all? Astonishingly, yes — and that fact predates NoPE by three years. Haviv et al. (2022) showed that the causal attention mask by itself leaks positional information: the first token can only attend to itself, the second to two keys, the t-th to t keys. Token count along the causal triangle is a free positional signal, and a transformer with no explicit position encoding still learns to use it. Their GPT-style model at 1.3B parameters trained to within 0.05 nats/token of its ALiBi baseline — effectively matched, with zero position information injected anywhere.
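You can see the leak in a toy model. Below is a minimal illustration (not the Haviv et al. setup): with no position encoding, identical queries and keys, and only a distinct first token, uniform causal attention still produces a different output at every position — the prefix length itself is the signal:

```python
import numpy as np

T, d = 6, 4
rng = np.random.default_rng(0)
v_bos = rng.normal(size=d)          # distinct value for the first token
v_tok = rng.normal(size=d)          # identical value for every later token
V = np.vstack([v_bos] + [v_tok] * (T - 1))

# Identical queries/keys and no position encoding means softmax over the
# causal prefix is uniform: weight 1/(t+1) on each of the first t+1 values.
outputs = np.array([V[: t + 1].mean(axis=0) for t in range(T)])

# The BOS value is diluted as 1/(t+1), so each row is different: the
# output encodes the token count t even though nothing injected position.
norms = np.linalg.norm(outputs - v_tok, axis=1)  # strictly shrinking in t
```

The distance to `v_tok` falls off exactly as 1/(t+1) — a clean, learnable function of position, delivered for free by the causal mask.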
NoPE layers feel paradoxical — how can removing information help? The answer is: it isn't removing information, it's removing a constraint. The causal mask still leaks position into the layer via attention shape. What you're removing is the forced angular structure — the requirement that the Q–K dot product depends on the cosine of a rotation difference. Long-context attention pays a real price for that: at 128k tokens, the highest-index RoPE pair rotates by angles the model never saw during 8k pretraining, and those extrapolated cosines drive the score toward noise. A NoPE layer side-steps the entire extrapolation regime. A layer forced to reason about positions has less flexibility to learn pure content aggregations. Removing position lets that layer specialise in the content-only job.
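The extrapolation gap is easy to quantify with back-of-envelope arithmetic. A RoPE pair at index i rotates by angle m · base^(−2i/d) at position m; the numbers below assume the common defaults base = 10000 and head dimension d = 128, which are illustrative rather than SmolLM3's exact settings:

```python
# Hedged back-of-envelope: how far RoPE angles extrapolate at 128k when
# pretraining only covered 8k. Assumes base=10000, head dim d=128.
base, d = 10000.0, 128
theta_min = base ** (-(d - 2) / d)   # slowest-rotating (highest-index) pair

def angle_at(m: int) -> float:
    return m * theta_min             # rotation angle at position m (radians)

train_max = angle_at(8192)           # largest angle seen during pretraining
infer_128k = angle_at(131072)        # angle required at 128k inference

# Angles grow linearly in position, so the relative angle between a query
# at 131072 and a key at 0 is 16x anything seen in training: the cosine
# the attention score depends on there is pure extrapolation.
ratio = infer_128k / train_max       # 131072 / 8192 == 16.0
```

Note the slowest pair never even completes one radian during 8k pretraining, yet inference at 128k demands roughly 15 radians from it — a regime the learned Q and K projections have never been scored in.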
The 2025 SmolLM3 release was the first major open SLM to ship selective NoPE as a deliberate design choice. Expect to see more. It's a lovely example of how architectural questions that could have been answered years ago with an ablation were simply unasked because “every layer has position encoding” felt like a law instead of a choice.