Gemma 3
Gemma 3's signature architectural choice is a 5:1 ratio of local sliding-window attention to full attention. Most Gemma 3 layers attend only to the most recent 1,024 tokens; every sixth layer attends globally.
The 5:1 local-global attention pattern is the most load-bearing design choice in Gemma 3. Five consecutive layers use a 1,024-token sliding window: fast, cheap, and sufficient for most local syntactic and coreference work. Every sixth layer uses full global attention so distant information still flows through the stack. The net effect is that the KV-cache footprint drops dramatically at long contexts while quality holds, because the global layers carry the long-range dependencies without every layer paying the full-attention cost.
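A minimal sketch of the two mask types and the 5:1 schedule, assuming a 1,024-token window and counting "every sixth layer" as layers 6, 12, 18, and so on. It materialises dense boolean masks for clarity; real implementations fold the window into the attention kernel instead.

```python
import numpy as np

def is_global_layer(layer_idx: int, pattern: int = 6) -> bool:
    """In a 5:1 interleave, every sixth layer is global; the rest are local."""
    return (layer_idx + 1) % pattern == 0

def attention_mask(seq_len: int, window: int | None = None) -> np.ndarray:
    """Boolean causal mask; with `window` set, each query also ignores keys
    more than `window - 1` positions in the past (sliding-window attention)."""
    q = np.arange(seq_len)[:, None]  # query positions
    k = np.arange(seq_len)[None, :]  # key positions
    causal = k <= q
    if window is None:
        return causal                 # global layer: full causal attention
    return causal & (q - k < window)  # local layer: only the last `window` tokens

# A 12-layer toy stack: five local layers, then one global layer, repeated.
schedule = ["global" if is_global_layer(i) else "local" for i in range(12)]
print(schedule)  # ['local', 'local', 'local', 'local', 'local', 'global', ...]

local_mask = attention_mask(seq_len=8, window=4)  # tiny window so the band is visible
global_mask = attention_mask(seq_len=8)
print(local_mask.astype(int))
print(global_mask.astype(int))
```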
The sliding-window lesson walks through the 5:1 ratio and the KV-cache math: for a 128K-context Gemma 3 27B, the hybrid attention cuts the KV cache roughly 3× versus a same-size fully global model, with measurable wall-clock decode speedups via FlashAttention.
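The bookkeeping behind that kind of comparison is short: a global layer caches keys and values for the whole context, while a local layer never caches more than its window. Below is a back-of-the-envelope helper; the layer count, KV-head count, head dimension, and bf16 cache width are assumed illustrative values, not the official Gemma 3 27B configuration, and the exact factor you get depends on which configuration, baseline, and context length you plug in.

```python
def kv_cache_bytes(global_layers: int, local_layers: int, context: int,
                   window: int, kv_heads: int, head_dim: int,
                   bytes_per_value: int = 2) -> int:
    """Total K+V cache size: global layers store `context` tokens,
    local layers store at most `window` tokens."""
    per_token = 2 * kv_heads * head_dim * bytes_per_value  # K and V, per layer, per token
    return (global_layers * context + local_layers * min(window, context)) * per_token

# Assumed, illustrative configuration (not official numbers):
layers, context, window = 62, 128 * 1024, 1024
global_layers = layers // 6                  # 5:1 interleave
local_layers = layers - global_layers

hybrid = kv_cache_bytes(global_layers, local_layers, context, window, kv_heads=16, head_dim=128)
fully_global = kv_cache_bytes(layers, 0, context, window, kv_heads=16, head_dim=128)
print(f"hybrid: {hybrid / 2**30:.1f} GiB, fully global: {fully_global / 2**30:.1f} GiB")
```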
The size progression (1B → 4B → 12B → 27B) is deliberately dense at the small-model end: 1B and 4B are the SLM-relevant variants; both quantise well and both serve competitively against Llama 3.2's 1B and 3B peers. The 4B variant adds multimodal capability via a SigLIP-based vision encoder. See the model museum for a direct head-to-head against Llama 3.2 and SmolLM3 at the same scale.
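As one hedged illustration of the quantisation point, the 1B text variant can be loaded in 4-bit with Hugging Face Transformers and bitsandbytes; the checkpoint name and NF4 settings below are assumptions for illustration, not an official recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumed checkpoint name; swap in whichever Gemma 3 text checkpoint you actually use.
model_id = "google/gemma-3-1b-it"

# NF4 4-bit weights with bf16 compute: a common small-footprint serving setup.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,
    device_map="auto",
)

inputs = tokenizer("Sliding-window attention is", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```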
- Sizes: 1B, 4B, 12B, 27B
- Architecture: Dense, GQA
- Attention: 5:1 local-global (1,024-token window)
- Context: 128K (32K for the 1B)
- Multimodal: 4B+ (SigLIP vision)
- Act II · 9 min · 45 xp · Local + global attention: interleave 5 local sliding-window layers with 1 global layer. How Gemma 3's 5:1 ratio slashes KV cache while keeping long-context coherence.
- Act III · 10 min · 40 xp · The model museum: explore every major SLM (Phi-4, Llama 3.2, Qwen3, Gemma 3, SmolLM3, BitNet) with architecture diagrams, training recipes, and benchmarks.