Llama 3.1
Llama 3.1 is the open-weights dense-transformer workhorse. 405B scales the frontier-class recipe; 70B and 8B are the teacher and student variants behind half of 2025's distillation pipelines.
The Llama 3.1 release is the "dense, done well" point in the open-weights landscape. No MoE, no MLA, no MTP — just a conventional transformer decoder scaled carefully. The contribution is execution: strong base pretraining, clean GQA configuration, and a context extension that actually works at 128K.
Attention uses GQA with 8 KV heads on the 70B and 405B variants. That's the compression ratio the MHA-to-GQA lesson uses as its worked example — collapsing 32 KV heads into 8 groups, which cuts the KV-cache footprint by 4× with minimal quality loss. Llama 3.1 is the canonical production example of that compression ratio.
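The per-token cache arithmetic behind that 4× claim is simple to check. A minimal sketch, assuming an 8B-style config (32 layers, head dimension 128, bf16 at 2 bytes per element — representative values, not pulled from the text above):

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, bytes_per_elem: int = 2) -> int:
    # Each layer caches one K and one V vector per KV head per token,
    # hence the leading factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

# MHA-style baseline: every one of the 32 attention heads keeps its own KV.
mha = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=32, head_dim=128)

# GQA: the 32 heads share 8 KV groups.
gqa = kv_cache_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)

print(mha, gqa, mha // gqa)  # 524288 131072 4 — a 4x cache reduction
```

At 128K context the difference is what makes long-context serving practical: 512 KB/token versus 128 KB/token of cache, per sequence.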
Context extension to 128K uses Meta's own llama3 rope_type with a scaling factor of 8 — specifically not YaRN, though the two approaches reach similar effective contexts. The Long-context lesson contrasts the two. Llama 3.1's approach is a frequency-band-split interpolation tailored to its base RoPE configuration.
When a 2025 published reasoning or instruction-following recipe distills from "a 70B teacher" or "a 405B teacher" without naming the teacher, it's this model. Worth knowing when reading the Distillation lesson: the teacher distributions the student is matching were almost certainly produced by Llama 3.1.
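"Matching the teacher's distribution" usually means minimizing a KL divergence between softened teacher and student next-token distributions. A minimal sketch with pure Python (temperature value and function names are illustrative, not from any specific recipe):

```python
import math

def softmax(logits, temp=1.0):
    # Temperature-softened distribution over the vocabulary.
    exps = [math.exp(x / temp) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def distill_kl(teacher_logits, student_logits, temp=2.0):
    # Forward KL(teacher || student): the per-token loss a student
    # minimizes to match the teacher's output distribution.
    p = softmax(teacher_logits, temp)
    q = softmax(student_logits, temp)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Identical logits -> zero loss; mismatched logits -> positive loss.
print(distill_kl([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))
print(distill_kl([1.0, 2.0, 3.0], [3.0, 2.0, 1.0]))
```

In the Llama 3.1 setting, `teacher_logits` would come from the 70B or 405B model and `student_logits` from the 8B (or smaller) student being trained.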
- Sizes
- 8B, 70B, 405B
- Architecture
- Dense, GQA
- Attention
- GQA with 8 KV heads (70B, 405B)
- Context
- 128K (llama3 rope_type, factor 8)
- Role
- Teacher for most 2025 distillations
- Act II · 9 min · 50 xp · From MHA to GQA: collapse 32 KV heads into 8 groups and cut memory 4×. The MHA-to-MQA-to-GQA progression, per-token cache math, and the Ainslie uptraining recipe
- Act III · 10 min · 40 xp · The model museum: explore every major SLM — Phi-4, Llama 3.2, Qwen3, Gemma 3, SmolLM3, BitNet — with architecture diagrams, training recipes, and benchmarks