Microscale · Act II: Inside the Machine
lesson: NoPE · 8 min · 40 xp

NoPE layers

Sometimes the best position encoding is none

Maybe position is the problem, not the solution

You just learned that RoPE encodes position by rotating Q and K vectors. Every layer uses RoPE. That was the conventional wisdom until Yang et al. (2025) asked a surprisingly sharp question: what if some layers do better without any position encoding at all?

The argument runs: position is a lexical signal. Early layers absolutely need it — the model needs to know that “dog” comes before “bites”. Late layers, though, are doing more abstract aggregation — pulling together information across the sequence into a final representation. Forcing positional structure into those aggregations can bottleneck them, because the model has to spend capacity reasoning around angular extrapolation at long contexts.

Selective NoPE — a compromise

Instead of all-RoPE or no-RoPE, SmolLM3 adopted the pattern from the RoPE-to-NoRoPE paper: remove RoPE from every 4th layer. The other three-quarters of layers keep it; the remaining quarter is positionally blind.
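The pattern is simple enough to sketch in a few lines. This is an illustrative helper, not SmolLM3's actual code; the function name and the stride parameter are assumptions based on the "every 4th layer" description above.

```python
# Sketch of selective NoPE layer selection, assuming a 32-layer model
# where every 4th layer (1-indexed 4, 8, ..., 32) drops RoPE.
def layer_uses_rope(layer_idx: int, nope_stride: int = 4) -> bool:
    """True if this layer applies RoPE; every `nope_stride`-th layer skips it."""
    return (layer_idx + 1) % nope_stride != 0

n_layers = 32
rope_layers = [i for i in range(n_layers) if layer_uses_rope(i)]
nope_layers = [i for i in range(n_layers) if not layer_uses_rope(i)]
print(len(rope_layers), len(nope_layers))  # 24 RoPE layers, 8 NoPE layers
```

With 32 layers this yields 24 RoPE layers and 8 positionally blind ones, matching the three-quarters / one-quarter split.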

Ablations on the same 3B model showed the NoPE-included version generalized better to longer contexts than the pure-RoPE baseline, without hurting short-context performance. The mechanistic story is that the position-free layers learn to aggregate global information without fighting RoPE's angular extrapolation, while the RoPE-carrying layers handle local syntax.

[figure: layer map — one NoPE layer every 4 layers · total 32 layers; legend: layer with RoPE / layer without position (NoPE)]

[figure: long-context quality (schematic, based on published ablations) — normalised quality (0–1) vs. log₂ context length, 1k to 128k]

Why a position-free layer isn't positionally blind

The first thing you should ask is: if you strip RoPE from a layer in a decoder-only model, can it still tell token order at all? Astonishingly, yes — and that fact predates NoPE by three years. Haviv et al. (2022) showed that the causal attention mask by itself leaks positional information: the first token can only attend to itself, the second to two keys, the t-th to t keys. Token count along the causal triangle is a free positional signal, and a transformer with no explicit position encoding still learns to use it. Their GPT-style model at 1.3B parameters trained to within 0.05 nats/token of its ALiBi baseline — effectively matched, with zero position information injected anywhere.
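The leak is visible even in a degenerate case. In this minimal NumPy sketch (my own toy illustration, not the paper's setup), every attention logit is identical — no content signal, no position encoding — yet the causal mask alone makes each query row's weight pattern depend on its position:

```python
import numpy as np

# With all logits equal, softmax under a causal mask spreads weight
# uniformly over the t+1 visible keys: row t gets weights 1/(t+1).
# The attention pattern itself encodes the token's position.
T = 6
logits = np.zeros((T, T))
logits[np.triu_indices(T, k=1)] = -np.inf       # causal mask: no future keys
weights = np.exp(logits)
weights /= weights.sum(axis=1, keepdims=True)   # softmax per query row

for t in range(T):
    # Row t has exactly t+1 nonzero weights, each equal to 1/(t+1).
    assert np.allclose(weights[t, : t + 1], 1.0 / (t + 1))
print(weights[3])
```

Row 3 comes out as four weights of 0.25 and two zeros: the weight magnitude alone tells the layer "you are token four".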

◆ paper
Transformer Language Models without Positional Encodings Still Learn Positional Information
Haviv, Ram, Press, Izsak, Levy · 2022
arxiv:2203.16634
The empirical result that licensed NoPE layers three years later. The causal mask is a position signal disguised as a data-flow constraint.

NoPE layers feel paradoxical — how can removing information help? The answer is: it isn't removing information, it's removing a constraint. The causal mask still leaks position into the layer via attention shape. What you're removing is the forced angular structure — the requirement that the Q–K dot product depends on the cosine of a rotation difference. Long-context attention pays a real price for that: at 128k tokens, the highest-index RoPE pair rotates by angles the model never saw during 8k pretraining, and those extrapolated cosines drive the score toward noise. A NoPE layer side-steps the entire extrapolation regime. A layer forced to reason about positions has less flexibility to learn pure content aggregations. Removing position lets that layer specialise in the content-only job.
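The extrapolation problem can be made concrete with a back-of-envelope check. The sketch below assumes the standard RoPE defaults (head dimension 128, base 10000) — not necessarily SmolLM3's exact values — and counts the frequency pairs that never complete a full rotation within an 8k training window, so any longer context drives them into angles they have never seen:

```python
import numpy as np

# RoPE pair i rotates at frequency theta_i = base**(-2i/d).
# If a pair never completes a full 2*pi turn within the training
# window, part of its circle is unseen, and longer contexts land there.
head_dim, base = 128, 10000.0            # assumed standard defaults
train_len = 8192                         # 8k pretraining context
theta = base ** (-2 * np.arange(head_dim // 2) / head_dim)

max_train_angle = train_len * theta
unseen = max_train_angle < 2 * np.pi     # pairs that never wrapped in training
print(int(unseen.sum()), "of", head_dim // 2, "pairs hit unseen angles beyond 8k")
```

Under these assumed defaults, a sizeable tail of low-frequency pairs falls in that regime — exactly the pairs whose extrapolated cosines turn long-range attention scores into noise.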

The 2025 SmolLM3 release was the first major open SLM to ship selective NoPE as a deliberate design choice. Expect to see more. It's a lovely example of how architectural questions that could have been answered years ago with an ablation were simply unasked because “every layer has position encoding” felt like a law instead of a choice.

NoPE is not the same as “no position information anywhere”. The model still has RoPE on 75% of its layers, which is plenty to carry positional signal forward. The NoPE layers are downstream of RoPE layers and inherit the positional structure implicitly.
comprehension check · 1 / 2

Why does selectively removing RoPE from some layers improve long-context performance?