Gemma 3
Gemma 3's signature architectural choice is a 5:1 ratio of local sliding-window attention to full attention. Most Gemma 3 layers attend only to the most recent 1,024 tokens; every sixth layer attends globally.
The 5:1 local-global attention pattern is the most load-bearing design choice in Gemma 3. Five consecutive layers use a 1,024-token sliding window: fast, cheap, and sufficient for most local syntactic and coreference work. Every sixth layer uses full global attention so distant information still flows through the stack. The net effect is that the KV-cache footprint drops dramatically at long contexts while quality holds, because the global layers carry the long-range dependencies without every layer paying the full-attention cost.
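A minimal sketch of the two mask types and the 5:1 schedule, assuming a 1,024-token window and counting "every sixth layer" as layers 6, 12, 18, and so on. It materialises dense boolean masks for clarity; real implementations fold the window into the attention kernel instead.

```python
import numpy as np

def is_global_layer(layer_idx: int, pattern: int = 6) -> bool:
    """In a 5:1 interleave, every sixth layer is global; the rest are local."""
    return (layer_idx + 1) % pattern == 0

def attention_mask(seq_len: int, window: int | None = None) -> np.ndarray:
    """Boolean causal mask; with `window` set, each query also ignores keys
    more than `window - 1` positions in the past (sliding-window attention)."""
    q = np.arange(seq_len)[:, None]  # query positions
    k = np.arange(seq_len)[None, :]  # key positions
    causal = k <= q
    if window is None:
        return causal                 # global layer: full causal attention
    return causal & (q - k < window)  # local layer: only the last `window` tokens

# A 12-layer toy stack: five local layers, then one global layer, repeated.
schedule = ["global" if is_global_layer(i) else "local" for i in range(12)]
print(schedule)  # ['local', 'local', 'local', 'local', 'local', 'global', ...]

local_mask = attention_mask(seq_len=8, window=4)  # tiny window so the band is visible
global_mask = attention_mask(seq_len=8)
print(local_mask.astype(int))
print(global_mask.astype(int))
```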
The sliding-window lesson walks through the 5:1 ratio and the KV-cache math: for a 128K-context Gemma 3 27B, the hybrid attention cuts the KV cache roughly 3× versus a same-size fully global model, with measurable wall-clock decode speedups via FlashAttention.
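The bookkeeping behind that kind of comparison is short: a global layer caches keys and values for the whole context, while a local layer never caches more than its window. Below is a back-of-the-envelope helper; the layer count, KV-head count, head dimension, and bf16 cache width are assumed illustrative values, not the official Gemma 3 27B configuration, and the exact factor you get depends on which configuration, baseline, and context length you plug in.

```python
def kv_cache_bytes(global_layers: int, local_layers: int, context: int,
                   window: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Total K+V cache size: global layers store `context` tokens,
    local layers store at most `window` tokens."""
    per_token = 2 * kv_heads * head_dim * bytes_per_value  # K and V, per layer, per token
    return (global_layers * context + local_layers * min(window, context)) * per_token

# Assumed, illustrative configuration (not official numbers):
layers, context, window = 62, 128 * 1024, 1024
global_layers = layers // 6                  # 5:1 interleave
local_layers = layers - global_layers

hybrid = kv_cache_bytes(global_layers, local_layers, context, window, kv_heads=16, head_dim=128)
fully_global = kv_cache_bytes(layers, 0, context, window, kv_heads=16, head_dim=128)
print(f"hybrid: {hybrid / 2**30:.1f} GiB, fully global: {fully_global / 2**30:.1f} GiB")
```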
The size progression (1B → 4B → 12B → 27B) is deliberately dense at the small-model end: 1B and 4B are the SLM-relevant variants; both quantise well and both serve competitively against Llama 3.2's 1B and 3B peers. The 4B variant adds multimodal capability via a SigLIP-based vision encoder. See the model museum for a direct head-to-head against Llama 3.2 and SmolLM3 at the same scale.
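As one hedged illustration of the quantisation point, the 1B text variant can be loaded in 4-bit with Hugging Face Transformers and bitsandbytes; the checkpoint name and NF4 settings below are assumptions for illustration, not an official recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed checkpoint name; swap in whichever Gemma 3 text checkpoint you actually use.
model_id = "google/gemma-3-1b-it"

# NF4 4-bit weights with bf16 compute: a common small-footprint serving setup.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Sliding-window attention is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```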
- Sizes: 1B, 4B, 12B, 27B
- Architecture: Dense, GQA
- Attention: 5:1 local-global (1,024-token window)
- Context: 128K (32K for the 1B)
- Multimodal: 4B+ (SigLIP vision)
- Act II · 9 min · 45 xp · Local + global attention: interleave 5 local sliding-window layers with 1 global layer. How Gemma 3's 5:1 ratio slashes KV cache while keeping long-context coherence.
- Act III · 10 min · 40 xp · The model museum: explore every major SLM (Phi-4, Llama 3.2, Qwen3, Gemma 3, SmolLM3, BitNet) with architecture diagrams, training recipes, and benchmarks.