Microscale · Act II: Inside the Machine
lesson: NoPE · 8 min · 40 xp

NoPE layers

Sometimes the best position encoding is none

Maybe position is the problem, not the solution

You just learned that RoPE encodes position by rotating Q and K vectors. Every layer uses RoPE. That was the conventional wisdom until Yang et al. (2025) asked a surprisingly sharp question: what if some layers do better without any position encoding at all?

The argument runs: position is a lexical signal. Early layers absolutely need it — the model needs to know that “dog” comes before “bites”. Late layers, though, are doing more abstract aggregation — pulling together information across the sequence into a final representation. Forcing positional structure into those aggregations can bottleneck them, because the model has to spend capacity reasoning around angular extrapolation at long contexts.

Selective NoPE — a compromise

Instead of all-RoPE or no-RoPE, SmolLM3 adopted the pattern from the RoPE-to-NoRoPE paper: remove RoPE from every 4th layer. The other three-quarters of layers keep it; the remaining quarter is positionally blind.
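The pattern is simple enough to sketch in a few lines. This is an illustrative helper, not SmolLM3's actual code; the function name and the stride parameter are assumptions based on the "every 4th layer" description above.

```python
# Sketch of selective NoPE layer selection, assuming a 32-layer model
# where every 4th layer (1-indexed 4, 8, ..., 32) drops RoPE.
def layer_uses_rope(layer_idx: int, nope_stride: int = 4) -> bool:
    """True if this layer applies RoPE; every `nope_stride`-th layer skips it."""
    return (layer_idx + 1) % nope_stride != 0

n_layers = 32
rope_layers = [i for i in range(n_layers) if layer_uses_rope(i)]
nope_layers = [i for i in range(n_layers) if not layer_uses_rope(i)]
print(len(rope_layers), len(nope_layers))  # 24 RoPE layers, 8 NoPE layers
```

With 32 layers this yields 24 RoPE layers and 8 positionally blind ones, matching the three-quarters / one-quarter split.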

Ablations on the same 3B model showed the NoPE-included version generalized better to longer contexts than the pure-RoPE baseline, without hurting short-context performance. The mechanistic story is that the position-free layers learn to aggregate global information without fighting RoPE's angular extrapolation, while the RoPE-carrying layers handle local syntax.

[figure: layer map — one NoPE layer every 4 layers · total 32 layers; legend: layer with RoPE / layer without position (NoPE)]

[figure: long-context quality (schematic, based on published ablations) — normalised quality (0–1) vs. log₂ context length, 1k to 128k]

Why a position-free layer isn't positionally blind

The first thing you should ask is: if you strip RoPE from a layer in a decoder-only model, can it still tell token order at all? Astonishingly, yes — and that fact predates NoPE by three years. Haviv et al. (2022) showed that the causal attention mask by itself leaks positional information: the first token can only attend to itself, the second to two keys, the t-th to t keys. Token count along the causal triangle is a free positional signal, and a transformer with no explicit position encoding still learns to use it. Their GPT-style model at 1.3B parameters trained to within 0.05 nats/token of its ALiBi baseline — effectively matched, with zero position information injected anywhere.
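The leak is visible even in a degenerate case. In this minimal NumPy sketch (my own toy illustration, not the paper's setup), every attention logit is identical — no content signal, no position encoding — yet the causal mask alone makes each query row's weight pattern depend on its position:

```python
import numpy as np

# With all logits equal, softmax under a causal mask spreads weight
# uniformly over the t+1 visible keys: row t gets weights 1/(t+1).
# The attention pattern itself encodes the token's position.
T = 6
logits = np.zeros((T, T))
logits[np.triu_indices(T, k=1)] = -np.inf       # causal mask: no future keys
weights = np.exp(logits)
weights /= weights.sum(axis=1, keepdims=True)   # softmax per query row

for t in range(T):
    # Row t has exactly t+1 nonzero weights, each equal to 1/(t+1).
    assert np.allclose(weights[t, : t + 1], 1.0 / (t + 1))
print(weights[3])
```

Row 3 comes out as four weights of 0.25 and two zeros: the weight magnitude alone tells the layer "you are token four".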

◆ paper
Transformer Language Models without Positional Encodings Still Learn Positional Information
Haviv, Ram, Press, Izsak, Levy · 2022
arxiv:2203.16634
The empirical result that licensed NoPE layers three years later. The causal mask is a position signal disguised as a data-flow constraint.

NoPE layers feel paradoxical — how can removing information help? The answer is: it isn't removing information, it's removing a constraint. The causal mask still leaks position into the layer via attention shape. What you're removing is the forced angular structure — the requirement that the Q–K dot product depends on the cosine of a rotation difference. Long-context attention pays a real price for that: at 128k tokens, the highest-index RoPE pair rotates by angles the model never saw during 8k pretraining, and those extrapolated cosines drive the score toward noise. A NoPE layer side-steps the entire extrapolation regime. A layer forced to reason about positions has less flexibility to learn pure content aggregations. Removing position lets that layer specialise in the content-only job.
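The extrapolation problem can be made concrete with a back-of-envelope check. The sketch below assumes the standard RoPE defaults (head dimension 128, base 10000) — not necessarily SmolLM3's exact values — and counts the frequency pairs that never complete a full rotation within an 8k training window, so any longer context drives them into angles they have never seen:

```python
import numpy as np

# RoPE pair i rotates at frequency theta_i = base**(-2i/d).
# If a pair never completes a full 2*pi turn within the training
# window, part of its circle is unseen, and longer contexts land there.
head_dim, base = 128, 10000.0            # assumed standard defaults
train_len = 8192                         # 8k pretraining context
theta = base ** (-2 * np.arange(head_dim // 2) / head_dim)

max_train_angle = train_len * theta
unseen = max_train_angle < 2 * np.pi     # pairs that never wrapped in training
print(int(unseen.sum()), "of", head_dim // 2, "pairs hit unseen angles beyond 8k")
```

Under these assumed defaults, a sizeable tail of low-frequency pairs falls in that regime — exactly the pairs whose extrapolated cosines turn long-range attention scores into noise.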

The 2025 SmolLM3 release was the first major open SLM to ship selective NoPE as a deliberate design choice. Expect to see more. It's a lovely example of how architectural questions that could have been answered years ago with an ablation were simply unasked because “every layer has position encoding” felt like a law instead of a choice.

NoPE is not the same as “no position information anywhere”. The model still has RoPE on 75% of its layers, which is plenty to carry positional signal forward. The NoPE layers are downstream of RoPE layers and inherit the positional structure implicitly.
comprehension check · 1 / 2

Why does selectively removing RoPE from some layers improve long-context performance?