Act II · Inside the Machine
lesson rope · 10 min · 55 xp

RoPE as rotation

Position as literal rotation in the plane

Attention is permutation-invariant — we have to fix that

Attention as you learned it in the first Act II lesson does not care about the order of tokens. If you shuffle a sentence, the attention scores rearrange to match, and you get exactly the same output. That's useless for language — “dog bites man” is not “man bites dog”. Position has to be injected somehow.

historical note
2021 · Jianlin Su et al., RoFormer team
Before RoPE, transformers added learned absolute positional embeddings to token embeddings at the input layer. This worked but had three problems: the embeddings were fixed to training sequence length, long contexts required retraining from scratch, and position information had to survive every downstream layer to still be useful at the top. Su et al. proposed applying position as a literal rotation of query and key vectors inside the attention mechanism. The result was elegant, cheap, and strictly better on every long-context benchmark. Llama, Qwen, Phi, Gemma, and SmolLM all use RoPE; learned absolute embeddings have essentially vanished from modern production models.
◆ paper
RoFormer: Enhanced Transformer with Rotary Position Embedding
Su, Lu, Pan, Murtadha, Wen, Liu · 2021
arxiv:2104.09864
The original RoPE paper. The construction is fully geometric and relies on a beautiful property of rotation matrices that we'll derive in the next few paragraphs.

The construction — one pair at a time

Treat each pair of adjacent feature dimensions as a 2D vector. For a token at position m, rotate that pair by angle mθ for some frequency θ. Concretely, if the pair is (q_x, q_y):

R_m \begin{pmatrix} q_x \\ q_y \end{pmatrix} = \begin{pmatrix} \cos(m\theta) & -\sin(m\theta) \\ \sin(m\theta) & \cos(m\theta) \end{pmatrix} \begin{pmatrix} q_x \\ q_y \end{pmatrix}

Do this to every Q and every K in the attention layer before computing dot products. In a real d-dimensional transformer the d-dim vectors are carved into d/2 pairs, each rotated with a different frequency θ_i. The geometry is still just 2D rotations — stacked.
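The per-pair rotation is a few lines of plain Python. A minimal sketch (the function name and argument layout are illustrative, not from any library):

```python
import math

def rotate_pairs(x, m, thetas):
    """Rotate each adjacent pair (x[2i], x[2i+1]) by angle m * thetas[i].

    x      : flat list of d floats (d even)
    m      : integer token position
    thetas : d/2 per-pair frequencies
    """
    out = list(x)
    for i, theta in enumerate(thetas):
        c, s = math.cos(m * theta), math.sin(m * theta)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = c * a - s * b      # first row of the rotation matrix
        out[2 * i + 1] = s * a + c * b  # second row
    return out
```

Applied to every query and key vector with m set to the token's position, this is the whole mechanism — nothing here is learned.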

[interactive demo · m = 3, n = 8, θ = 0.40 (real RoPE uses tiny θ for high dims, big θ for low dims): rotating Q by R_3 and K by R_8 gives ⟨R_m Q, R_n K⟩ = −1.196 and ⟨Q, R_{n−m} K⟩ = −1.196. Notice: both readouts are identical. That's the offset-invariance proof in live numbers.]

The offset-invariance property — derivation

Rotations have a beautiful algebraic property that is the whole reason RoPE works. The inner product between two rotated vectors depends only on the difference of the rotation angles, not on their absolute values.

offset invariance, line by line
\langle R_m Q,\, R_n K \rangle
= (R_m Q)^\top (R_n K)
= Q^\top R_m^\top R_n K
= Q^\top R_{n-m} K

The last step uses two facts: rotations compose (R_a R_b = R_{a+b}) and rotations are orthogonal (R_a^\top = R_{-a}), so R_m^\top R_n = R_{-m} R_n = R_{n-m}.

So the attention score between a query at position m and a key at position n depends only on the relative offset n − m. This is exactly what you want for language — a verb's dependence on its subject depends on the distance between them, not on where they happen to sit in the document. Absolute positions cancel cleanly.
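The cancellation is easy to check numerically. A quick sketch with arbitrary 2D vectors (the q, k values are made up for illustration; m, n, θ match the demo above):

```python
import math

def rot2(v, ang):
    """Rotate a 2-D vector by `ang` radians."""
    c, s = math.cos(ang), math.sin(ang)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def dot(u, v):
    return u[0] * v[0] + u[1] * v[1]

q, k = (0.3, -1.1), (0.8, 0.5)   # arbitrary query / key pair
theta, m, n = 0.4, 3, 8

lhs = dot(rot2(q, m * theta), rot2(k, n * theta))  # <R_m Q, R_n K>
rhs = dot(q, rot2(k, (n - m) * theta))             # <Q, R_{n-m} K>
assert abs(lhs - rhs) < 1e-12  # identical: only n - m matters
```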

Multi-dimensional — frequency schedule

The 2D picture is a teaching lie — a useful one. The real thing operates on pairs of dimensions across the full d-dimensional Q and K. Each pair (2i, 2i+1) for i = 0, 1, …, d/2 − 1 is rotated by its own frequency θ_i. The original paper uses a geometric schedule:

\theta_i \;=\; \text{base}^{-2i/d}, \quad \text{base} = 10000

Plot this against i: low-index pairs have big θ and rotate fast with position. High-index pairs have tiny θ and rotate slowly. The idea: different pairs encode position at different time scales.
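In code the schedule is one line. A sketch (the helper name is illustrative):

```python
def rope_thetas(d, base=10000.0):
    """Per-pair frequencies theta_i = base**(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2 * i / d) for i in range(d // 2)]

thetas = rope_thetas(8)
# theta_0 is always 1.0; each later pair rotates more slowly
```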

RoPE frequency schedule · d = 64
[plot: θ_i against dim pair index i, for two values of base]
base = 10,000 (RoFormer default)
base = 10⁶ (Gemma 3 global layers)
Both schedules drop off quickly — most pairs have tiny θ. The base controls how quickly the drop happens, and therefore how much positional resolution the late-index pairs retain.

Context extension — YaRN and position interpolation

You trained your model at 4k context. You want to deploy it at 32k. Two things can go wrong:

  1. The positions beyond 4k rotate to angles the model has never seen. Output degenerates immediately past training length.
  2. Even if the model tolerates the new angles, far-apart tokens can land on rotation phases that alias with those of nearby tokens. The model can't tell a token 20k positions away from one 2k away.

The solution family is called context extension; position interpolation and YaRN, both named in this section's heading, are the standard practical methods.
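Position interpolation is the simplest member of the family: squeeze the new positions back into the trained range by rescaling m before computing the rotation angle. A sketch of the idea, not a production implementation (the function name is illustrative):

```python
def interpolated_angle(m, theta, train_len, target_len):
    """Scale position m by train_len / target_len, so even the last
    position of the extended context maps to an angle range the model
    saw during training."""
    return (m * train_len / target_len) * theta
```

YaRN refines this by interpolating the slow, low-frequency pairs more aggressively than the fast ones, preserving short-range resolution while stretching long-range coverage.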

Why RoPE is applied to Q and K but not V

This is a subtle but important point. The attention mechanism has three things: query, key, value. RoPE rotates Q and K only. V is left alone. Why?

We want the attention score (which depends on the Q–K dot product) to encode relative position. We do not want the retrieved content (which is V) to be position-dependent, because V carries the actual information being retrieved. If you rotated V too, the mechanism would correctly retrieve "what's at position n relative to m" but then hand back a value scrambled by an irrelevant rotation angle.

Position lives only in the attention score computation. Content lives in V, free of position. This clean separation is one of the properties that makes RoPE composable with long-context tricks like sliding windows and FlashAttention — the value pipeline never cares about positions.
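A toy single-pair attention step makes the separation concrete: RoPE touches q and k, while v flows through untouched. All names and numbers here are illustrative, assuming one 2D head with a single frequency:

```python
import math

def rope2(v, pos, theta=0.4):
    """Rotate one 2-D pair by pos * theta."""
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])

def attend(q, m, keys, values, theta=0.4):
    """Single-query attention: position enters only through the scores."""
    qm = rope2(q, m, theta)                        # query rotated by its position
    rot_keys = [rope2(k, n, theta) for n, k in enumerate(keys)]
    scores = [qm[0] * kn[0] + qm[1] * kn[1] for kn in rot_keys]
    mx = max(scores)
    exps = [math.exp(s - mx) for s in scores]
    z = sum(exps)
    # values are mixed by the softmax weights but never rotated
    return tuple(sum(e / z * v[i] for e, v in zip(exps, values))
                 for i in range(2))
```

Because the value pipeline never sees a position, the softmax output is always a plain convex combination of the original, unrotated values.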

comprehension check
comprehension · 1 / 4

What is the key algebraic property that makes RoPE work as a relative position encoding?