Llama 4
Llama 4 is Meta's first large-scale mixture-of-experts (MoE) release. It ships in two sizes, Scout (17B active / 109B total) and Maverick (17B active / 400B total), both natively multimodal from the start. Scout extends context to 10M tokens.
Llama 4 is where Meta switched lanes: the 3.x series was dense at every size, while Llama 4 is MoE-first. Both Scout and Maverick activate the same 17B parameters per token (matching the dense Llama-3.x workhorse tier) but pack total parameter capacity far higher. The MoE lesson explains the basic trade; Llama 4's specifics (shared-expert count, routing top-k, training curriculum) differ from DeepSeek-V3 and Kimi K2, but the underlying math is the same.
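The routing math behind "17B active out of 109B total" can be sketched in a few lines. This is a hypothetical toy router, not Llama 4's actual implementation: the expert count, top-k, router weights, and shared-expert handling here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def topk_route(hidden, n_experts=16, k=1):
    """Score each token against every expert and keep the top-k.

    Toy sketch: random router weights stand in for a learned gate.
    """
    d = hidden.shape[-1]
    gate = rng.standard_normal((d, n_experts))   # learned router, faked here
    logits = hidden @ gate                       # (tokens, n_experts)
    chosen = np.argsort(logits, axis=-1)[:, -k:] # expert ids per token
    # Softmax over only the selected logits -> mixture weights.
    sel = np.take_along_axis(logits, chosen, axis=-1)
    w = np.exp(sel - sel.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return chosen, w

tokens = rng.standard_normal((4, 32))            # 4 tokens, toy d_model=32
experts, weights = topk_route(tokens, n_experts=16, k=1)
```

Only the k chosen experts run their FFN for a given token, which is how total parameters grow far past the per-token compute budget.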
The differentiating feature is context length. Scout's 10M-token context is the largest in any open-weights release to date, pushing well past the 128K–1M range where most contemporary models stopped. The Long-context lesson covers why naive RoPE extension breaks long before 10M; Llama 4 Scout interleaves attention layers that drop positional embeddings entirely with standard RoPE layers, plus inference-time attention temperature scaling, a combination Meta calls the iRoPE architecture, with choices explicitly aimed at the million-plus regime.
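To see why naive RoPE runs out of headroom, it helps to look at the rotation wavelengths the standard frequency schedule produces. The sketch below uses the common RoPE formulation with the usual base of 10000; the `scale` knob is an illustrative stand-in for frequency-rescaling schemes, not Meta's exact recipe.

```python
import numpy as np

def rope_freqs(d_head=128, base=10000.0, scale=1.0):
    # One rotation frequency per dimension pair. Raising `base` or
    # dividing the frequencies by `scale` stretches the wavelengths,
    # so far-apart positions stay distinguishable.
    i = np.arange(d_head // 2)
    return (base ** (-2.0 * i / d_head)) / scale

freqs = rope_freqs()
wavelengths = 2 * np.pi / freqs
# The slowest pair completes one rotation in roughly 5e4 positions,
# orders of magnitude short of 10M. Hence rescaling tricks, and in
# Scout's case, interleaved layers with no positional encoding at all.
```

Past the longest wavelength, relative rotations the model saw in training start aliasing, which is the failure mode long-context extensions work around.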
"Natively multimodal" means vision tokens and text tokens flow through the same transformer from the first layer (early fusion), rather than a separate vision encoder being grafted onto a text model. That's an architectural departure from Llama 3.2's vision variants, which use the encoder-adapter pattern. Early-fusion multimodality isn't covered as its own Microscale lesson yet; it's the clearest gap in the current curriculum if Llama 4's approach becomes standard.
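The structural difference between early fusion and the encoder-adapter pattern is just where the modalities meet. A minimal sketch, with made-up dimensions and random arrays standing in for real embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 64

# Early fusion: both modalities are projected into the same d_model
# and concatenated into ONE sequence before the first transformer layer.
text_tokens = rng.standard_normal((12, d_model))    # 12 text tokens
image_patches = rng.standard_normal((16, d_model))  # 16 vision tokens

sequence = np.concatenate([image_patches, text_tokens], axis=0)
# Every layer now self-attends across both modalities jointly. Contrast
# Llama 3.2's adapter pattern, where the text stack cross-attends to a
# separate vision encoder's outputs instead of sharing one sequence.
```

From layer one onward the model sees a single 28-token stream; there is no point where "the vision model" ends and "the language model" begins.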
- Scout: 17B active, 109B total (MoE)
- Maverick: 17B active, 400B total (MoE)
- Architecture: MoE + early-fusion multimodal
- Scout context: 10M tokens
- Maverick context: 1M tokens
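The spec list above implies how sparse each model is per token, which is worth making explicit:

```python
# Activation fraction per token, from the figures above.
scout_ratio = 17 / 109     # roughly 16% of weights active per token
maverick_ratio = 17 / 400  # 4.25% active per token
```

Maverick buys nearly 4x the capacity of Scout at identical per-token compute, which is the whole point of the MoE trade.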