Full attention is O(L²) — a problem at 128k tokens
A full attention layer at sequence length L costs O(L²) compute and stores a KV cache linear in L. At L = 128k those numbers become catastrophic: the KV cache alone for a 30-layer model with 8 KV heads at FP16 is 15.0 GB per sequence (a figure consistent with a head dimension of 128). That's more than the weights of a 14B model quantized to 8 bits, which come to about 14 GB.
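The 15.0 GB figure can be checked with a few lines of arithmetic. This is a rough sketch, not code from any model release: the function name `kv_cache_bytes` is made up for illustration, and the head dimension of 128 is an assumption (the text does not state it, but it is the value that reproduces 15.0 when 128k is read as 131,072 tokens and GB as GiB).

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading 2 counts one key tensor plus one value tensor per layer.
    bytes_per_elem=2 corresponds to FP16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


# Assumed configuration: head_dim=128 is not given in the text.
full = kv_cache_bytes(n_layers=30, n_kv_heads=8, head_dim=128, seq_len=128 * 1024)
print(f"{full / 2**30:.1f} GiB")  # prints "15.0 GiB"
```

Note that the cache grows linearly in every argument, so doubling the context length doubles the memory per sequence.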
Mistral 7B used pure sliding-window attention with a 4k-token window, and long-range tasks suffered: a token at the end of a document cannot attend directly to the beginning, so early information reaches it only by hopping from window to window through the layer stack. Gemma 3's fix is elegant: interleave local and global layers.
Five local for every global
Gemma 3's pattern is 5:1 local-to-global: five sliding-window layers with a 1024-token window, then one full-attention layer, repeating. The local layers handle syntactic and semantic work that only needs recent context. The periodic global layers re-inject document-wide context, so information from the far past can still propagate.
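The interleaving can be sketched as a layer schedule plus a per-layer attention rule. This is an illustrative sketch only: `layer_kinds` and `can_attend` are hypothetical helpers, and the convention that the 1024-token window includes the current token is an assumption, not something the text specifies.

```python
def layer_kinds(n_layers, local_per_global=5):
    """Label each layer: five local layers, then one global, repeating."""
    period = local_per_global + 1
    return ["global" if (i + 1) % period == 0 else "local" for i in range(n_layers)]


def can_attend(query_pos, key_pos, kind, window=1024):
    """Causal attention rule for one layer kind.

    Assumption: the window counts the query token itself, so a local
    layer sees key positions query_pos - window + 1 .. query_pos.
    """
    if key_pos > query_pos:
        return False  # causal: no attending to future tokens
    if kind == "global":
        return True  # full attention over the whole prefix
    return query_pos - key_pos < window


kinds = layer_kinds(30)
print(kinds[:6])  # first block of the repeating pattern
print(can_attend(5000, 0, "local"), can_attend(5000, 0, "global"))
```

For a 30-layer stack this yields 25 local and 5 global layers. A token at position 5000 cannot see position 0 in a local layer, but it can in any global layer.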
The KV cache saving is dramatic. Local layers only need to cache the keys and values for 1024 positions (the window), regardless of actual sequence length. Global layers still cache the full sequence. With 5:1 interleaving, ~83% of layers are bounded-cache and ~17% are full-cache. Over a 128k context, total KV memory drops from 15.0 GB (all-global) to 2.6 GB (5:1 hybrid), roughly a 5.8× reduction.
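The 2.6 GB and 5.8× figures follow from summing the cache layer by layer. The sketch below assumes the same 30-layer, 8-KV-head, FP16 model, plus a head dimension of 128 that the text does not state; `kv_bytes_per_layer` is a hypothetical helper name, and GB is read as GiB.

```python
def kv_bytes_per_layer(cached_tokens, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Keys plus values for one layer; head_dim=128 is an assumed value."""
    return 2 * cached_tokens * n_kv_heads * head_dim * bytes_per_elem


seq_len = 128 * 1024   # 128k-token context
window = 1024          # sliding-window size of the local layers
n_layers = 30
n_global = n_layers // 6          # one global layer per block of six
n_local = n_layers - n_global     # the remaining layers are local

# Every layer caches the full sequence.
all_global = n_layers * kv_bytes_per_layer(seq_len)

# Local layers cache at most `window` tokens; global layers cache everything.
hybrid = (n_global * kv_bytes_per_layer(seq_len)
          + n_local * kv_bytes_per_layer(min(window, seq_len)))

print(f"all-global: {all_global / 2**30:.1f} GiB")  # 15.0
print(f"hybrid:     {hybrid / 2**30:.1f} GiB")      # 2.6
print(f"reduction:  {all_global / hybrid:.1f}x")    # 5.8
```

Almost all of the remaining 2.6 GB comes from the five global layers; the 25 local layers together add only about 0.1 GB.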
Different RoPE bases for different roles
There's a beautiful second trick in Gemma 3: the local and global layers use different RoPE base frequencies. Local layers keep a base of 10,000 (the original RoPE choice). Global layers use a base of 1,000,000, which is 100× larger.
Why? You learned in the RoPE lesson that a larger base slows the rotation of the high-index feature pairs, so their angles stay meaningful across longer sequences. Local layers only see a 1024-token window, and base 10k is plenty for that range. Global layers see 128k tokens and need the high-index pairs to rotate slowly enough that their angles do not wrap around within the context, which keeps far-apart positions distinguishable.
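One way to see what the base changes is to compute each feature pair's wavelength, meaning the number of tokens it takes for that pair to complete a full rotation. The sketch below uses the standard RoPE frequency schedule, where pair i rotates by base^(-2i/d) radians per token. The helper name `rope_wavelengths` is hypothetical and the head dimension of 128 is an assumption.

```python
import math


def rope_wavelengths(base, head_dim=128):
    """Tokens per full rotation for each of the head_dim // 2 feature pairs.

    Pair i rotates by theta_i = base ** (-2 * i / head_dim) radians per
    token, so its wavelength is 2 * pi / theta_i.
    """
    return [2 * math.pi * base ** (2 * i / head_dim) for i in range(head_dim // 2)]


context = 128 * 1024
for base in (10_000, 1_000_000):
    longest = rope_wavelengths(base)[-1]  # slowest-rotating (highest-index) pair
    print(f"base={base:>9,}: longest wavelength ~{longest:,.0f} tokens, "
          f"covers 128k context: {longest > context}")
```

Under these assumptions, the slowest pair at base 10k completes a full turn in fewer than 128k tokens, so its angle repeats inside a global layer's context. At base 1M, the slowest pair's wavelength is far longer than the context.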
This matching of position-encoding choice to attention scope is a tiny architectural detail that saves quality at long contexts. The paper reports KV cache memory share drops from ~60% of total footprint to under 15% with the hybrid scheme.