
Llama 4

by Meta · Released April 2025 · LAST REVIEWED APR 2026

Llama 4 is Meta's first mixture-of-experts (MoE) release, in two variants: Scout (17B active / 109B total) and Maverick (17B active / 400B total). Both are natively multimodal from the start, and Scout extends context to 10M tokens.

what's new in this one

Llama 4 is where Meta switched lanes — the 3.x series was dense across every size, Llama 4 is MoE-first. Both Scout and Maverick activate the same 17B per token (matching the dense-Llama-3.x workhorse tier) but pack total parameter capacity far higher. The MoE lesson explains the basic trade; Llama 4's specifics (shared-expert count, routing top-k, training curriculum) differ from DeepSeek-V3 and Kimi K2, but the underlying math is the same.
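The trade is easy to see in code. Below is a minimal MoE forward pass in plain NumPy — the expert count, top-k, and shapes are illustrative, not Llama 4's actual configuration: each token runs through only its top-k routed experts plus one always-on shared expert, so compute per token stays fixed while total parameters scale with the number of experts.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def moe_forward(x, router_w, experts, shared_expert, top_k=1):
    # Router scores every token against every expert; only the top_k
    # routed experts actually run per token, plus the shared expert.
    probs = softmax(x @ router_w)            # (n_tokens, n_experts)
    out = shared_expert(x)                   # shared expert sees all tokens
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]  # indices of the top_k experts
        for j in top:
            out[t] = out[t] + probs[t, j] * experts[j](x[t])
    return out
```

Active parameters are `shared + top_k` experts per token regardless of how many experts exist in total — that is the whole point of the 17B-active / 400B-total split.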

The differentiating feature is context length. Scout's 10M-token context is the largest in any open-weights release to date, pushing well past the 128K–1M range where most contemporary models stopped. The Long-context lesson covers why naive RoPE extension breaks long before 10M; Llama 4 Scout uses what Meta calls the iRoPE architecture — interleaving standard RoPE attention layers with layers that use no positional embeddings at all, plus inference-time temperature scaling of attention — choices explicitly aimed at the million-plus regime.
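As a refresher on what breaks, here is a minimal RoPE sketch in NumPy (dimensions illustrative, not Llama 4's schedule): each channel pair rotates at its own frequency, and at positions far beyond the training window the fast bands have wrapped many times over, which is why naive extension degrades.

```python
import numpy as np

def rope_freqs(dim, base=10000.0):
    # One rotation frequency per channel pair: low indices rotate fast
    # (local positions), high indices slowly (long-range structure).
    # Raising `base` stretches the bands -- one common extension trick.
    return 1.0 / base ** (np.arange(0, dim, 2) / dim)

def apply_rope(x, pos, freqs):
    # Rotate each channel pair of token vector x by angle pos * freq.
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out
```

iRoPE's move is different in kind: some layers skip `apply_rope` entirely, so no frequency table has to generalize to position 10,000,000.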

Natively multimodal means vision tokens and text tokens flow through the same transformer from the first layer (early fusion), not a separate vision encoder grafted onto a text model. That's an architectural departure from Llama 3.2's vision variants, which use the encoder-adapter pattern. Early-fusion multimodal isn't covered as its own Microscale lesson yet — it's the clearest gap in the current curriculum if Llama 4's approach becomes standard.
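The structural difference fits in a few lines. A sketch of the early-fusion input path, with illustrative names and shapes (none of this is Meta's actual code): image patches are projected into the same embedding space as text tokens and concatenated into one sequence before layer 1, rather than passing through a separate vision encoder whose output is adapted in later.

```python
import numpy as np

def early_fusion_sequence(text_ids, image_patches, embed_table, patch_proj):
    # Project flattened image patches into the text embedding space and
    # concatenate them with the text embeddings into ONE sequence, so
    # every transformer layer attends across both modalities.
    text_emb = embed_table[text_ids]          # (n_text, d)
    vision_emb = image_patches @ patch_proj   # (n_patches, d)
    return np.concatenate([vision_emb, text_emb], axis=0)
```

In the encoder-adapter pattern (Llama 3.2 vision), the text transformer only sees vision features through cross-attention bolted on after pretraining; here the two modalities share every layer from the start.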

the shape in numbers
Scout: 17B active, 109B total (MoE)
Maverick: 17B active, 400B total (MoE)
Architecture: MoE + early-fusion multimodal
Scout context: 10M tokens
Maverick context: 1M tokens
read alongside