Microscale
Act II: Inside the Machine
lesson local-global · 9 min · 45 xp

Local + global attention

Gemma 3's 5:1 trick

Full attention is O(L²) — a problem at 128k tokens

A full attention layer at sequence length $L$ costs $O(L^2 \cdot d)$ compute and stores a KV cache that grows linearly in $L$. At $L = 128{,}000$ those numbers become catastrophic: the KV cache alone for a 30-layer model with 8 KV heads at $d_h = 128$ in FP16 is 15.0 GB per sequence. That's more than the weights of a 14B model.
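That headline number is easy to reproduce with back-of-envelope arithmetic (a sketch; the exact figure depends on rounding conventions, so it lands slightly above the 15 GB quoted above):

```python
# All-global KV cache at 128k context: 2 tensors (K and V) per layer,
# each of shape [kv_heads, seq_len, head_dim] in FP16 (2 bytes per value).
L, layers, kv_heads, d_h, fp16_bytes = 128_000, 30, 8, 128, 2
kv_bytes = 2 * layers * kv_heads * L * d_h * fp16_bytes
print(f"{kv_bytes / 1e9:.1f} GB per sequence")  # 15.7 GB per sequence
```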

Mistral 7B tried pure sliding-window attention at 4k tokens and saw long-range tasks break — information from the beginning of a document literally couldn't propagate to the end. Gemma 3's fix is elegant: interleave local and global layers.

Five local for every global

Gemma 3's pattern is 5:1 local-to-global. Five sliding-window-1024 layers, one full-attention layer, repeating. The local layers handle syntactic and semantic work that only needs recent context. The periodic global layers re-inject long-range context, so signal from the far past can still propagate forward.
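The interleaving is simple enough to sketch in a few lines. This is a minimal illustration of a 5:1 schedule; exactly where the global layer sits within each group of six is an assumption here, not a claim about Gemma 3's actual layer indexing:

```python
# Sketch: every 6th layer is global, the other five use sliding-window attention.
def layer_types(n_layers: int, ratio: int = 5) -> list[str]:
    """Return 'local'/'global' labels for a ratio:1 interleaved stack."""
    return ["global" if (i + 1) % (ratio + 1) == 0 else "local"
            for i in range(n_layers)]

pattern = layer_types(30)
print(pattern.count("local"), pattern.count("global"))  # 25 5
```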

The KV cache saving is dramatic. Local layers only need to cache 1024 keys (the window) regardless of actual sequence length. Global layers still cache the full sequence. With 5:1 interleaving, ~83% of layers are bounded-cache, ~17% are full-cache. Over a 128k context, total KV memory drops from 15.0 GB (all-global) to 2.6 GB (5:1 hybrid) — roughly a 5.8× reduction.
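The saving can be checked directly. A minimal sketch, assuming 8 KV heads, head dim 128, and FP16 (the config used in this lesson's example); note that computed values come out near the lesson's rounded figures:

```python
def kv_cache_gb(full_layers, local_layers, seq_len, window=1024,
                kv_heads=8, head_dim=128, bytes_per_val=2):
    """Total KV cache in GB: 2 tensors (K and V) * heads * dim * cached tokens."""
    per_token = 2 * kv_heads * head_dim * bytes_per_val
    cached = full_layers * seq_len + local_layers * min(window, seq_len)
    return cached * per_token / 1e9

all_global = kv_cache_gb(30, 0, 128_000)   # every layer caches the full sequence
hybrid     = kv_cache_gb(5, 25, 128_000)   # 25 layers cap their cache at 1024
print(f"{all_global:.1f} GB -> {hybrid:.1f} GB ({all_global / hybrid:.1f}x)")
```

Note that the local layers contribute almost nothing at 128k: their cache is fixed at the window size, so nearly all of the hybrid total comes from the five global layers.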

[Interactive diagram: 30 layers, alternating local/global. Local: 25 layers (window=1024). Global: 5 layers (full context). KV cache @ 128k: all-global 15.0 GB, 5:1 hybrid 2.6 GB, a 5.8× reduction.]

Different RoPE bases for different roles

There's a beautiful second trick in Gemma 3: the local and global layers use different RoPE base frequencies. Local layers keep $\theta_\text{base} = 10{,}000$ (the original RoPE choice). Global layers use $\theta_\text{base} = 10^6$, 100× larger.

Why? Recall from the RoPE lesson that pair $j$ rotates by $\theta_\text{base}^{-2j/d}$ radians per token: a larger base slows the high-index pairs, so their angles stay unambiguous across longer distances instead of wrapping around. Local layers only see a 1024-token window, where base 10k keeps even the slowest pairs well within one revolution. Global layers see 128k tokens and need the larger base so the high-index pairs don't wrap and alias far-apart positions.
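A quick numerical check makes the scale mismatch concrete. This uses the standard RoPE per-token angle $\theta_j = \text{base}^{-2j/d}$ with $d = 128$ (the head dim from this lesson's example); once a pair's total angle over a distance exceeds $2\pi$, it has wrapped and positions start to alias:

```python
import math

def total_angle(base: float, j: int, distance: int, d: int = 128) -> float:
    """Angle (radians) swept by RoPE pair j over `distance` tokens."""
    return distance * base ** (-2 * j / d)

j = 63                                              # slowest pair for d=128
local        = total_angle(10_000,    j, 1024)      # within a local window
global_small = total_angle(10_000,    j, 128_000)   # base 10k over 128k tokens
global_big   = total_angle(1_000_000, j, 128_000)   # base 1e6 over 128k tokens
print(global_small > 2 * math.pi, global_big < 2 * math.pi)  # True True
```

At base 10k the slowest pair wraps several times over 128k tokens, while at base $10^6$ it stays under one revolution; inside a 1024-token window, base 10k never wraps at all.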

Matching the position-encoding choice to each layer's attention scope is a small architectural detail that preserves quality at long contexts. The paper reports that the KV cache's share of total memory footprint drops from roughly 60% to under 15% with the hybrid scheme.

comprehension check

Why does pure sliding-window attention fail at long-range tasks?