Llama 4
Llama 4 is Meta's first large-scale mixture-of-experts (MoE) release. It ships in two sizes, Scout (17B active / 109B total) and Maverick (17B active / 400B total), both natively multimodal from the start. Scout extends context to 10M tokens.
Llama 4 is where Meta switched lanes: the 3.x series was dense at every size, while Llama 4 is MoE-first. Both Scout and Maverick activate the same 17B parameters per token (matching the dense Llama-3.x workhorse tier) but pack total parameter capacity far higher. The MoE lesson explains the basic trade; Llama 4's specifics (shared-expert count, routing top-k, training curriculum) differ from DeepSeek-V3 and Kimi K2, but the underlying math is the same.
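The routing math behind "17B active out of 109B total" can be sketched in a few lines. This is a hypothetical toy router, not Llama 4's actual implementation: the expert count, top-k, router weights, and shared-expert handling here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_route(hidden, n_experts=16, k=1):
    """Score each token against every expert and keep the top-k.

    Toy sketch: random router weights stand in for a learned gate.
    """
    d = hidden.shape[-1]
    gate = rng.standard_normal((d, n_experts))   # learned router, faked here
    logits = hidden @ gate                       # (tokens, n_experts)
    chosen = np.argsort(logits, axis=-1)[:, -k:] # expert ids per token
    # Softmax over only the selected logits -> mixture weights.
    sel = np.take_along_axis(logits, chosen, axis=-1)
    w = np.exp(sel - sel.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return chosen, w

tokens = rng.standard_normal((4, 32))            # 4 tokens, toy d_model=32
experts, weights = topk_route(tokens, n_experts=16, k=1)
```

Only the k chosen experts run their FFN for a given token, which is how total parameters grow far past the per-token compute budget.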
The differentiating feature is context length. Scout's 10M-token context is the largest in any open-weights release to date, pushing well past the 128K–1M range where most contemporary models stopped. The Long-context lesson covers why naive RoPE extension breaks long before 10M; Llama 4 Scout interleaves attention layers that drop positional embeddings entirely with standard RoPE layers, plus inference-time attention temperature scaling, a combination Meta calls the iRoPE architecture, with choices explicitly aimed at the million-plus regime.
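To see why naive RoPE runs out of headroom, it helps to look at the rotation wavelengths the standard frequency schedule produces. The sketch below uses the common RoPE formulation with the usual base of 10000; the `scale` knob is an illustrative stand-in for frequency-rescaling schemes, not Meta's exact recipe.

```python
import numpy as np

def rope_freqs(d_head=128, base=10000.0, scale=1.0):
    # One rotation frequency per dimension pair. Raising `base` or
    # dividing the frequencies by `scale` stretches the wavelengths,
    # so far-apart positions stay distinguishable.
    i = np.arange(d_head // 2)
    return (base ** (-2.0 * i / d_head)) / scale

freqs = rope_freqs()
wavelengths = 2 * np.pi / freqs
# The slowest pair completes one rotation in roughly 5e4 positions,
# orders of magnitude short of 10M. Hence rescaling tricks, and in
# Scout's case, interleaved layers with no positional encoding at all.
```

Past the longest wavelength, relative rotations the model saw in training start aliasing, which is the failure mode long-context extensions work around.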
"Natively multimodal" means vision tokens and text tokens flow through the same transformer from the first layer (early fusion), rather than a separate vision encoder being grafted onto a text model. That's an architectural departure from Llama 3.2's vision variants, which use the encoder-adapter pattern. Early-fusion multimodality isn't covered as its own Microscale lesson yet; it's the clearest gap in the current curriculum if Llama 4's approach becomes standard.
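The structural difference between early fusion and the encoder-adapter pattern is just where the modalities meet. A minimal sketch, with made-up dimensions and random arrays standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Early fusion: both modalities are projected into the same d_model
# and concatenated into ONE sequence before the first transformer layer.
text_tokens = rng.standard_normal((12, d_model))    # 12 text tokens
image_patches = rng.standard_normal((16, d_model))  # 16 vision tokens

sequence = np.concatenate([image_patches, text_tokens], axis=0)
# Every layer now self-attends across both modalities jointly. Contrast
# Llama 3.2's adapter pattern, where the text stack cross-attends to a
# separate vision encoder's outputs instead of sharing one sequence.
```

From layer one onward the model sees a single 28-token stream; there is no point where "the vision model" ends and "the language model" begins.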
- Scout: 17B active, 109B total (MoE)
- Maverick: 17B active, 400B total (MoE)
- Architecture: MoE + early-fusion multimodal
- Scout context: 10M tokens
- Maverick context: 1M tokens
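The spec list above implies how sparse each model is per token, which is worth making explicit:

```python
# Activation fraction per token, from the figures above.
scout_ratio = 17 / 109     # roughly 16% of weights active per token
maverick_ratio = 17 / 400  # 4.25% active per token
```

Maverick buys nearly 4x the capacity of Scout at identical per-token compute, which is the whole point of the MoE trade.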