Act IV · How They Learn
lesson distillation · 11 min · 55 xp

Distillation & dark knowledge

Watch teacher distributions flow into students

Hard labels versus soft labels

Classical supervised learning uses hard labels — the training signal tells you the correct answer and nothing else. Distillation uses soft labels — the training signal is a whole probability distribution from a teacher model. The soft label tells you not just “the answer is Paris” but also “if it wasn't Paris, it was very likely Lyon or Marseille, and very unlikely London.”

That extra information — the runner-up probabilities — is what Hinton called “dark knowledge.” It encodes semantic structure that a hard label cannot: Lyon, Marseille, and Bordeaux are all French cities; London, Berlin, and Rome are European capitals; the teacher knows the topic and communicates that shape through its probability mass.

The distillation loss

\mathcal{L}_\text{KD} = \alpha \cdot \mathcal{L}_\text{CE}(y_\text{hard}) + (1 - \alpha) \cdot T^2 \cdot \text{KL}\!\left(p_T^\text{teacher} \,\|\, p_T^\text{student}\right)

p_T denotes the softmax taken at temperature T. At T = 1 you're matching the teacher's natural distribution. At higher T you flatten it, which amplifies the runner-ups — exposing more dark knowledge. The T² factor in front of the KL term compensates for temperature: the gradient of the soft-target term with respect to the logits scales as 1/T², so multiplying by T² keeps the soft and hard terms at comparable magnitude as you vary T.
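A minimal NumPy sketch of this loss; the function names, logits, and constants are illustrative, not from any particular implementation.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T flattens the distribution.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """alpha * CE(hard label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    # Hard-label cross-entropy, evaluated at T = 1 as usual.
    p_s = softmax(student_logits, T=1.0)
    ce = -np.log(p_s[hard_label] + 1e-12)

    # Soft-label term: forward KL between temperature-scaled distributions.
    pt = softmax(teacher_logits, T)
    ps = softmax(student_logits, T)
    kl = np.sum(pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12)))

    # T^2 restores the gradient scale of the soft term (it shrinks as 1/T^2).
    return alpha * ce + (1 - alpha) * T**2 * kl
```

A student whose logits already match the teacher's (and the hard label) pays only the small residual cross-entropy; any disagreement adds a positive KL penalty on top.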

[interactive demo · temperature slider, shown at T = 2.0 · low T = sharp (argmax-like), high T = flat (rich dark knowledge)]

| token | teacher p_T | student p_T |
| --- | --- | --- |
| Paris | 60.9% | 62.0% |
| Lyon | 11.1% | 10.8% |
| Marseille | 10.1% | 9.8% |
| Bordeaux | 7.1% | 6.9% |
| Nice | 5.8% | 5.6% |
| London | 2.1% | 2.1% |
| Berlin | 1.6% | 1.5% |
| Rome | 1.4% | 1.3% |
Slide T from low to high and watch the runner-ups (Lyon, Marseille, Bordeaux) light up. At T = 1 they're vanishingly small. At T = 5 the student can see that the teacher thinks Lyon is a real candidate — information a hard label would never transmit.
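You can reproduce the effect with a few lines of NumPy. The logits below are made up to roughly match the shape of the demo above; only the qualitative behavior matters.

```python
import numpy as np

def softmax_T(logits, T):
    # Temperature-scaled softmax over a 1-D logit vector.
    z = logits / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical teacher logits for the eight candidate cities.
cities = ["Paris", "Lyon", "Marseille", "Bordeaux",
          "Nice", "London", "Berlin", "Rome"]
logits = np.array([6.0, 2.6, 2.4, 1.7, 1.3, -0.7, -1.2, -1.5])

for T in (1.0, 2.0, 5.0):
    p = softmax_T(logits, T)
    print(T, dict(zip(cities, p.round(3))))
```

At T = 1 almost all the mass sits on Paris; raising T drains mass from the top token into the runner-ups without changing their ranking.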

How Llama 3.2 1B/3B use this

Llama 3.2's 1B and 3B models were created by:

  1. Structured pruning from Llama 3.1 8B — remove entire layers, heads, and FFN dims guided by importance scores.
  2. Continued pretraining of the pruned student, with the loss augmented by a KL term against the teacher distributions of Llama 3.1 8B and Llama 3.1 70B.
  3. Standard post-training — SFT, rejection sampling, DPO.

The dual-teacher distillation (8B + 70B) is unusual and clever: the 8B gives strong signal on most tokens, but the 70B injects occasional higher-quality runners-up on tokens where the 8B and 70B disagree. It's a way of getting “some” of the 70B capability at 1B or 3B inference cost.
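One simple way to combine two teachers is a weighted sum of per-teacher KL terms. This is a sketch under that assumption — Meta has not published the exact mixing scheme, so the weights and function names here are illustrative.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax with max-subtraction for stability.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward_kl(p, q):
    # KL(p || q) for two categorical distributions.
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def dual_teacher_kl(student_logits, logits_8b, logits_70b, w_8b=0.5, T=2.0):
    # Weighted sum of KL terms against each teacher; w_8b is an assumed knob,
    # not a published hyperparameter.
    q = softmax(student_logits, T)
    kl_8b = forward_kl(softmax(logits_8b, T), q)
    kl_70b = forward_kl(softmax(logits_70b, T), q)
    return w_8b * kl_8b + (1 - w_8b) * kl_70b
```

On tokens where the two teachers agree, the two terms pull in the same direction; where they disagree, the student is nudged toward a compromise weighted by w_8b.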

A naive full KL over a 128k vocabulary is memory-heavy — the teacher logits alone are B · L · V floats (batch × sequence length × vocabulary). Production distillation uses top-k distillation (only match the teacher's top-k tokens per position) or offline logit caching. Quality is similar; the memory footprint drops by orders of magnitude.
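A sketch of one common way to do top-k distillation: keep only the teacher's top-k token ids per position and lump each distribution's remaining tail into a single bucket so both stay normalized. The tail-bucket trick is one of several options, not the only production recipe.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax over a 1-D logit vector.
    z = z / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def topk_kl(teacher_logits, student_logits, k=8, T=1.0):
    # Indices of the teacher's top-k tokens (only these need to be stored:
    # k + 1 floats per position instead of the full vocabulary).
    idx = np.argsort(teacher_logits)[-k:]
    pt_full = softmax(teacher_logits, T)
    ps_full = softmax(student_logits, T)
    # Append one 'other' bucket holding the leftover tail mass.
    pt = np.append(pt_full[idx], max(1.0 - pt_full[idx].sum(), 1e-12))
    ps = np.append(ps_full[idx], max(1.0 - ps_full[idx].sum(), 1e-12))
    return float(np.sum(pt * (np.log(pt) - np.log(ps))))
```

Since both truncated vectors are valid (k + 1)-way distributions, the usual KL properties hold: it is zero when student matches teacher and positive otherwise.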

Forward KL, reverse KL, and why modern distillation switched

Hinton's original loss uses forward KL: KL(p_teacher ‖ p_student). Forward KL is mean-seeking — wherever the teacher puts probability mass, the student is punished for putting near-zero mass there. A small student forced to cover every mode of a 405B teacher's distribution ends up averaging across incompatible modes: when the teacher thinks the answer is “Paris” OR “Lyon”, the student learns to put 40% on each, and then generates a blurry mixture that matches neither. This is the same pathology that makes VAEs produce blurry images.

MiniLLM (Gu et al., 2024) replaced forward KL with reverse KL: KL(p_student ‖ p_teacher). Reverse KL is mode-seeking — the student is punished only for putting mass where the teacher puts little, and is free to drop teacher modes it cannot represent. For a capacity-limited student this is exactly what you want: pick one mode confidently, get it right, and let the modes you cannot represent go to zero. The student ends up with a sharper, not blurrier, distribution than naive distillation produces. Mode collapse is reframed from a bug into a feature.
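The asymmetry is easy to see numerically. The toy distributions below are made up: a teacher with two near-equal modes, a student that commits to one mode, and a student that leaks mass onto a token the teacher rules out.

```python
import numpy as np

def kl(p, q):
    # KL(p || q) for categorical distributions with full support.
    return float(np.sum(p * np.log(p / q)))

# Toy teacher: two near-equal modes plus a tiny tail (values invented).
teacher = np.array([0.499, 0.499, 0.002])

# Mode-seeking student: commits to one teacher mode, drops the other.
picks_one_mode = np.array([0.98, 0.01, 0.01])
# Covering-gone-wrong student: leaks mass onto a token the teacher rules out.
leaks_mass = np.array([0.40, 0.40, 0.20])

# Forward KL (teacher || student) hammers the dropped mode...
assert kl(teacher, picks_one_mode) > kl(picks_one_mode, teacher)
# ...while reverse KL (student || teacher) hammers the leaked mass.
assert kl(leaks_mass, teacher) > kl(teacher, leaks_mass)
```

Under reverse KL, committing hard to one teacher mode is cheap and hallucinating mass the teacher rejects is expensive — exactly the incentive structure a capacity-limited student needs.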

◆ paper
MiniLLM: Knowledge Distillation of Large Language Models
Gu, Dong, Wang, Hou, Huang, Chen, Wei · ICLR 2024
arxiv:2306.08543
Introduces reverse-KL distillation for language models and shows that a 1.5B student distilled from a 13B teacher via reverse-KL matches or beats a forward-KL student on instruction-following evals. DistiLLM (Ko et al. 2024) extended this with a skew-KL objective that interpolates between forward and reverse.
comprehension check · 1 / 2

What is 'dark knowledge' in the distillation sense?