Hard labels versus soft labels
Classical supervised learning uses hard labels — the training signal tells you the correct answer and nothing else. Distillation uses soft labels — the training signal is a whole probability distribution from a teacher model. The soft label tells you not just “the answer is Paris” but also “if it wasn't Paris, it was very likely Lyon or Marseille and very unlikely London.”
That extra information — the runner-up probabilities — is what Hinton called “dark knowledge.” It encodes semantic structure that a hard label cannot: Lyon and Marseille and Bordeaux are all French cities; London and Berlin and Rome are European capitals; the teacher knows the topic and communicates that shape through its probability mass.
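A toy example makes the distinction concrete (the vocabulary and probabilities here are invented for illustration, not real model outputs):

```python
# Toy example: a one-hot hard label vs. a teacher's soft label.
# City list and probabilities are made up for illustration.
cities = ["Paris", "Lyon", "Marseille", "London"]

hard_label = [1.0, 0.0, 0.0, 0.0]      # "the answer is Paris" and nothing else
soft_label = [0.90, 0.06, 0.03, 0.01]  # the teacher's full distribution

# The hard label is recoverable from the soft one via argmax...
top = cities[max(range(len(soft_label)), key=soft_label.__getitem__)]

# ...but only the soft label ranks the alternatives — the "dark knowledge".
ranked = [c for _, c in sorted(zip(soft_label, cities), reverse=True)]
```

The ranking puts Lyon and Marseille above London, which is exactly the semantic structure a one-hot label throws away.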
The distillation loss
The classic loss is a weighted sum of a hard-label cross-entropy term and a temperature-softened KL term:

$$\mathcal{L} = (1-\alpha)\,\mathrm{CE}\big(y,\ \sigma_1(z_s)\big) \;+\; \alpha\,T^2\,\mathrm{KL}\big(\sigma_T(z_t)\,\big\|\,\sigma_T(z_s)\big),$$

where $\sigma_T$ denotes softmax at temperature $T$, $z_s$ and $z_t$ are the student and teacher logits, and $y$ is the hard label. At $T=1$ you're matching the teacher's natural distribution. At higher $T$ you flatten it, which amplifies the runner-ups — exposing more dark knowledge. The factor $T^2$ in front of the KL is there because the softmax derivative shrinks by $1/T^2$ and the gradient magnitude would otherwise shrink with $T$; you multiply by $T^2$ to restore scale.
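A minimal sketch of this loss in plain Python. The logits and hyperparameter values are placeholders; a real implementation would use framework tensors and batched operations:

```python
import math

def softmax(logits, T=1.0):
    """Softmax at temperature T; higher T flattens the distribution."""
    exps = [math.exp(z / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q, eps=1e-12):
    """KL(p || q) in nats."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def distill_loss(student_logits, teacher_logits, hard_target, T=2.0, alpha=0.5):
    """(1 - alpha) * CE(hard label) + alpha * T^2 * KL(teacher_T || student_T).

    The T^2 factor compensates for the 1/T^2 shrinkage of the soft-target
    gradients, keeping both terms at a comparable scale.
    """
    p_student = softmax(student_logits)  # T = 1 for the hard-label term
    ce = -math.log(p_student[hard_target] + 1e-12)
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    return (1 - alpha) * ce + alpha * T * T * kl(p_t, p_s)
```

When the student's logits match the teacher's and `alpha=1.0`, the loss is zero; raising `T` visibly flattens the softmax output, which is what surfaces the runner-up probabilities.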
How Llama 3.2 1B/3B use this
Llama 3.2's 1B and 3B models were created by:
- Structured pruning from Llama 3.1 8B — remove entire layers, heads, and FFN dims guided by importance scores.
- Continued pretraining of the pruned student, with the loss augmented by a KL term against the teacher distributions of Llama 3.1 8B and Llama 3.1 70B.
- Standard post-training — SFT, rejection sampling, DPO.
The dual-teacher distillation (8B + 70B) is unusual and clever: the 8B gives a strong signal on most tokens, but the 70B injects occasional higher-quality runners-up on tokens where the 8B and 70B disagree. It's a way of getting “some” of the 70B capability at 1B or 3B inference cost.
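One simple way to combine two teachers, sketched here with made-up weights (this is an illustrative scheme, not necessarily Llama 3.2's published recipe), is a weighted mixture of per-token KL terms:

```python
import math

def kl_div(p, q, eps=1e-12):
    """Forward KL(p || q) in nats."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def dual_teacher_kl(p_student, p_8b, p_70b, w_large=0.5):
    """Mix KL terms against both teachers' token distributions.

    w_large weights the 70B teacher; the 50/50 default is an
    assumption for illustration, not a documented hyperparameter.
    """
    return (1 - w_large) * kl_div(p_8b, p_student) + w_large * kl_div(p_70b, p_student)
```

On tokens where the teachers agree, the two terms reinforce each other; where they disagree, the `w_large` term pulls the student toward the 70B's runners-up.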
Forward KL, reverse KL, and why modern distillation switched
Hinton's original loss uses forward KL: $\mathrm{KL}\big(p_{\text{teacher}}\,\|\,q_{\text{student}}\big)$. Forward KL is mean-seeking — wherever the teacher puts probability mass, the student is punished for putting near-zero mass there. A small student forced to cover every mode of a 405B teacher's distribution ends up averaging across incompatible modes: when the teacher thinks the answer is “Paris” OR “Lyon” the student learns to put 40% on each, and then generates a blurry mixture that matches neither. This is the same pathology that makes VAEs produce blurry images.
MiniLLM (Gu et al. 2024) replaced forward KL with reverse KL: $\mathrm{KL}\big(q_{\text{student}}\,\|\,p_{\text{teacher}}\big)$. Reverse KL is mode-seeking — the student is punished only for putting mass where the teacher disagrees, and is free to drop teacher modes it cannot represent. For a capacity-limited student this is exactly what you want: pick one mode confidently, get it right, and let the modes you cannot represent go to zero. The student ends up with a sharper, not blurrier, distribution than naive distillation produces. Mode collapse is reframed from a bug into a feature.
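The asymmetry is easy to see numerically. Below, a hypothetical bimodal teacher is compared against two students, one that blurrily hedges across both modes and one that commits to a single mode (all distributions are invented for illustration):

```python
import math

def kl(p, q):
    """KL(p || q) in nats; assumes strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Hypothetical bimodal teacher: two strong modes, little mass between them.
teacher = [0.495, 0.01, 0.495]

# A student that hedges across both modes vs. one that picks a single mode.
blurry = [0.30, 0.40, 0.30]
sharp  = [0.90, 0.05, 0.05]

forward_blurry = kl(teacher, blurry)   # forward KL: teacher || student
forward_sharp  = kl(teacher, sharp)
reverse_blurry = kl(blurry, teacher)   # reverse KL: student || teacher
reverse_sharp  = kl(sharp, teacher)
```

Forward KL scores the blurry student better (it covers every teacher mode), while reverse KL scores the sharp student better (it never puts mass where the teacher has almost none) — the mean-seeking versus mode-seeking behavior in miniature.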