Microscale
Act II: Inside the Machine
lesson local-global · 9 min · 45 xp

Local + global attention

Gemma 3's 5:1 trick

Full attention is O(L²) — a problem at 128k tokens

A full attention layer at sequence length $L$ costs $O(L^2 \cdot d)$ compute and stores a KV cache that grows linearly in $L$. At $L = 128{,}000$ those numbers become catastrophic: the KV cache alone for a 30-layer model with 8 KV heads at $d_h = 128$ in FP16 is 15.0 GB per sequence. That's more than the weights of a 14B model.
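That headline number is easy to reproduce with back-of-envelope arithmetic (a sketch; the exact figure depends on rounding conventions, so it lands slightly above the 15 GB quoted above):

```python
# All-global KV cache at 128k context: 2 tensors (K and V) per layer,
# each of shape [kv_heads, seq_len, head_dim] in FP16 (2 bytes per value).
L, layers, kv_heads, d_h, fp16_bytes = 128_000, 30, 8, 128, 2
kv_bytes = 2 * layers * kv_heads * L * d_h * fp16_bytes
print(f"{kv_bytes / 1e9:.1f} GB per sequence")  # 15.7 GB per sequence
```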

Mistral 7B tried pure sliding-window attention at 4k tokens and saw long-range tasks break — information from the beginning of a document literally couldn't propagate to the end. Gemma 3's fix is elegant: interleave local and global layers.

Five local for every global

Gemma 3's pattern is 5:1 local-to-global. Five sliding-window-1024 layers, one full-attention layer, repeating. The local layers handle syntactic and semantic work that only needs recent context. The periodic global layers re-inject long-range context, so signal from the far past can still propagate forward.
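The interleaving is simple enough to sketch in a few lines. This is a minimal illustration of a 5:1 schedule; exactly where the global layer sits within each group of six is an assumption here, not a claim about Gemma 3's actual layer indexing:

```python
# Sketch: every 6th layer is global, the other five use sliding-window attention.
def layer_types(n_layers: int, ratio: int = 5) -> list[str]:
    """Return 'local'/'global' labels for a ratio:1 interleaved stack."""
    return ["global" if (i + 1) % (ratio + 1) == 0 else "local"
            for i in range(n_layers)]

pattern = layer_types(30)
print(pattern.count("local"), pattern.count("global"))  # 25 5
```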

The KV cache saving is dramatic. Local layers only need to cache 1024 keys (the window) regardless of actual sequence length. Global layers still cache the full sequence. With 5:1 interleaving, ~83% of layers are bounded-cache, ~17% are full-cache. Over a 128k context, total KV memory drops from 15.0 GB (all-global) to 2.6 GB (5:1 hybrid) — roughly a 5.8× reduction.
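The saving can be checked directly. A minimal sketch, assuming 8 KV heads, head dim 128, and FP16 (the config used in this lesson's example); note that computed values come out near the lesson's rounded figures:

```python
def kv_cache_gb(full_layers, local_layers, seq_len, window=1024,
                kv_heads=8, head_dim=128, bytes_per_val=2):
    """Total KV cache in GB: 2 tensors (K and V) * heads * dim * cached tokens."""
    per_token = 2 * kv_heads * head_dim * bytes_per_val
    cached = full_layers * seq_len + local_layers * min(window, seq_len)
    return cached * per_token / 1e9

all_global = kv_cache_gb(30, 0, 128_000)   # every layer caches the full sequence
hybrid     = kv_cache_gb(5, 25, 128_000)   # 25 layers cap their cache at 1024
print(f"{all_global:.1f} GB -> {hybrid:.1f} GB ({all_global / hybrid:.1f}x)")
```

Note that the local layers contribute almost nothing at 128k: their cache is fixed at the window size, so nearly all of the hybrid total comes from the five global layers.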

[Interactive diagram: 30 layers, alternating local/global. Local: 25 layers (window=1024). Global: 5 layers (full context). KV cache @ 128k: all-global 15.0 GB, 5:1 hybrid 2.6 GB, a 5.8× reduction.]

Different RoPE bases for different roles

There's a beautiful second trick in Gemma 3: the local and global layers use different RoPE base frequencies. Local layers keep $\theta_\text{base} = 10{,}000$ (the original RoPE choice). Global layers use $\theta_\text{base} = 10^6$, 100× larger.

Why? Recall from the RoPE lesson that pair $j$ rotates by $\theta_\text{base}^{-2j/d}$ radians per token: a larger base slows the high-index pairs, so their angles stay unambiguous across longer distances instead of wrapping around. Local layers only see a 1024-token window, where base 10k keeps even the slowest pairs well within one revolution. Global layers see 128k tokens and need the larger base so the high-index pairs don't wrap and alias far-apart positions.
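A quick numerical check makes the scale mismatch concrete. This uses the standard RoPE per-token angle $\theta_j = \text{base}^{-2j/d}$ with $d = 128$ (the head dim from this lesson's example); once a pair's total angle over a distance exceeds $2\pi$, it has wrapped and positions start to alias:

```python
import math

def total_angle(base: float, j: int, distance: int, d: int = 128) -> float:
    """Angle (radians) swept by RoPE pair j over `distance` tokens."""
    return distance * base ** (-2 * j / d)

j = 63                                              # slowest pair for d=128
local        = total_angle(10_000,    j, 1024)      # within a local window
global_small = total_angle(10_000,    j, 128_000)   # base 10k over 128k tokens
global_big   = total_angle(1_000_000, j, 128_000)   # base 1e6 over 128k tokens
print(global_small > 2 * math.pi, global_big < 2 * math.pi)  # True True
```

At base 10k the slowest pair wraps several times over 128k tokens, while at base $10^6$ it stays under one revolution; inside a 1024-token window, base 10k never wraps at all.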

Matching the position-encoding choice to each layer's attention scope is a small architectural detail that preserves quality at long contexts. The paper reports that the KV cache's share of total memory footprint drops from roughly 60% to under 15% with the hybrid scheme.

comprehension check

Why does pure sliding-window attention fail at long-range tasks?