Llama 3.1
Llama 3.1 is the open-weights dense-transformer workhorse. 405B scales the frontier-class recipe; 70B and 8B are the teacher and student variants behind half of 2025's distillation pipelines.
The Llama 3.1 release is the "dense, done well" point in the open-weights landscape. No MoE, no MLA, no MTP — just a conventional transformer decoder scaled carefully. The contribution is execution: strong base pretraining, clean GQA configuration, and a context extension that actually works at 128K.
Attention uses GQA with 8 KV heads on the 70B and 405B variants. That's the compression ratio the MHA-to-GQA lesson uses as its worked example — collapsing 32 KV heads into 8 groups, which cuts the KV-cache footprint by 4× with minimal quality loss. Llama 3.1 is the canonical production example of that compression ratio.
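The per-token cache arithmetic behind that 4× claim is simple to check. A minimal sketch, assuming an 8B-style config (32 layers, head dimension 128, bf16 at 2 bytes per element — representative values, not pulled from the text above):

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    # Each layer caches one K and one V vector per KV head per token,
    # hence the leading factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# MHA-style baseline: every one of the 32 attention heads keeps its own KV.
mha = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)

# GQA: the 32 heads share 8 KV groups.
gqa = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

print(mha, gqa, mha // gqa)  # 524288 131072 4 — a 4x cache reduction
```

At 128K context the difference is what makes long-context serving practical: 512 KB/token versus 128 KB/token of cache, per sequence.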
Context extension to 128K uses Meta's own llama3 rope_type with a scaling factor of 8 — specifically not YaRN, though the two approaches reach similar effective contexts. The Long-context lesson contrasts the two. Llama 3.1's approach is a frequency-band-split interpolation tailored to its base RoPE configuration.
When a 2025 published reasoning or instruction-following recipe distills from "a 70B teacher" or "a 405B teacher" without naming the teacher, it's this model. Worth knowing when reading the Distillation lesson: the teacher distributions the student is matching were almost certainly produced by Llama 3.1.
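"Matching the teacher's distribution" usually means minimizing a KL divergence between softened teacher and student next-token distributions. A minimal sketch with pure Python (temperature value and function names are illustrative, not from any specific recipe):

```python
import math

def softmax(logits, temp=1.0):
    # Temperature-softened distribution over the vocabulary.
    exps = [math.exp(x / temp) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distill_kl(teacher_logits, student_logits, temp=2.0):
    # Forward KL(teacher || student): the per-token loss a student
    # minimizes to match the teacher's output distribution.
    p = softmax(teacher_logits, temp)
    q = softmax(student_logits, temp)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical logits -> zero loss; mismatched logits -> positive loss.
print(distill_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(distill_kl([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

In the Llama 3.1 setting, `teacher_logits` would come from the 70B or 405B model and `student_logits` from the 8B (or smaller) student being trained.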
- Sizes
- 8B, 70B, 405B
- Architecture
- Dense, GQA
- Attention
- GQA with 8 KV heads (70B, 405B)
- Context
- 128K (llama3 rope_type, factor 8)
- Role
- Teacher for most 2025 distillations
- Act II · 9 min · 50 xp · From MHA to GQA: collapse 32 KV heads into 8 groups and cut memory 4×. The MHA-to-MQA-to-GQA progression, per-token cache math, and the Ainslie uptraining recipe
- Act III · 10 min · 40 xp · The model museum: explore every major SLM — Phi-4, Llama 3.2, Qwen3, Gemma 3, SmolLM3, BitNet — with architecture diagrams, training recipes, and benchmarks