Act IV · How They Learn
lesson distillation · 11 min · 55 xp

Distillation & dark knowledge

Watch teacher distributions flow into students

Hard labels versus soft labels

Classical supervised learning uses hard labels — the training signal tells you the correct answer and nothing else. Distillation uses soft labels — the training signal is a whole probability distribution from a teacher model. The soft label tells you not just “the answer is Paris” but also “if it wasn't Paris, it was very likely Lyon or Marseille, and very unlikely London.”

That extra information — the runner-up probabilities — is what Hinton called “dark knowledge.” It encodes semantic structure that a hard label cannot: Lyon, Marseille, and Bordeaux are all French cities; London, Berlin, and Rome are European capitals; the teacher knows the topic and communicates that shape through its probability mass.

The distillation loss

\mathcal{L}_\text{KD} = \alpha \cdot \mathcal{L}_\text{CE}(y_\text{hard}) + (1 - \alpha) \cdot T^2 \cdot \text{KL}\!\left(p_T^\text{teacher} \,\|\, p_T^\text{student}\right)

p_T denotes the softmax taken at temperature T. At T = 1 you're matching the teacher's natural distribution. At higher T you flatten it, which amplifies the runner-ups — exposing more dark knowledge. The T² factor in front of the KL term compensates for temperature: the gradient of the soft-target term with respect to the logits scales as 1/T², so multiplying by T² keeps the soft and hard terms at comparable magnitude as you vary T.
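A minimal NumPy sketch of this loss; the function names, logits, and constants are illustrative, not from any particular implementation.

```python
import numpy as np

def softmax(logits, T=1.0):
    # Temperature-scaled softmax: higher T flattens the distribution.
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_logits, hard_label, T=2.0, alpha=0.5):
    """alpha * CE(hard label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    # Hard-label cross-entropy, evaluated at T = 1 as usual.
    p_s = softmax(student_logits, T=1.0)
    ce = -np.log(p_s[hard_label] + 1e-12)

    # Soft-label term: forward KL between temperature-scaled distributions.
    pt = softmax(teacher_logits, T)
    ps = softmax(student_logits, T)
    kl = np.sum(pt * (np.log(pt + 1e-12) - np.log(ps + 1e-12)))

    # T^2 restores the gradient scale of the soft term (it shrinks as 1/T^2).
    return alpha * ce + (1 - alpha) * T**2 * kl
```

A student whose logits already match the teacher's (and the hard label) pays only the small residual cross-entropy; any disagreement adds a positive KL penalty on top.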

[interactive demo · temperature slider, shown at T = 2.0 · low T = sharp (argmax-like), high T = flat (rich dark knowledge)]

| token | teacher p_T | student p_T |
| --- | --- | --- |
| Paris | 60.9% | 62.0% |
| Lyon | 11.1% | 10.8% |
| Marseille | 10.1% | 9.8% |
| Bordeaux | 7.1% | 6.9% |
| Nice | 5.8% | 5.6% |
| London | 2.1% | 2.1% |
| Berlin | 1.6% | 1.5% |
| Rome | 1.4% | 1.3% |
Slide T from low to high and watch the runner-ups (Lyon, Marseille, Bordeaux) light up. At T = 1 they're vanishingly small. At T = 5 the student can see that the teacher thinks Lyon is a real candidate — information a hard label would never transmit.
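You can reproduce the effect with a few lines of NumPy. The logits below are made up to roughly match the shape of the demo above; only the qualitative behavior matters.

```python
import numpy as np

def softmax_T(logits, T):
    # Temperature-scaled softmax over a 1-D logit vector.
    z = logits / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical teacher logits for the eight candidate cities.
cities = ["Paris", "Lyon", "Marseille", "Bordeaux",
          "Nice", "London", "Berlin", "Rome"]
logits = np.array([6.0, 2.6, 2.4, 1.7, 1.3, -0.7, -1.2, -1.5])

for T in (1.0, 2.0, 5.0):
    p = softmax_T(logits, T)
    print(T, dict(zip(cities, p.round(3))))
```

At T = 1 almost all the mass sits on Paris; raising T drains mass from the top token into the runner-ups without changing their ranking.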

How Llama 3.2 1B/3B use this

Llama 3.2's 1B and 3B models were created by:

  1. Structured pruning from Llama 3.1 8B — remove entire layers, heads, and FFN dims guided by importance scores.
  2. Continued pretraining of the pruned student, with the loss augmented by a KL term against the teacher distributions of Llama 3.1 8B and Llama 3.1 70B.
  3. Standard post-training — SFT, rejection sampling, DPO.

The dual-teacher distillation (8B + 70B) is unusual and clever: the 8B gives strong signal on most tokens, but the 70B injects occasional higher-quality runners-up on tokens where the 8B and 70B disagree. It's a way of getting “some” of the 70B capability at 1B or 3B inference cost.
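One simple way to combine two teachers is a weighted sum of per-teacher KL terms. This is a sketch under that assumption — Meta has not published the exact mixing scheme, so the weights and function names here are illustrative.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax with max-subtraction for stability.
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def forward_kl(p, q):
    # KL(p || q) for two categorical distributions.
    return float(np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12))))

def dual_teacher_kl(student_logits, logits_8b, logits_70b, w_8b=0.5, T=2.0):
    # Weighted sum of KL terms against each teacher; w_8b is an assumed knob,
    # not a published hyperparameter.
    q = softmax(student_logits, T)
    kl_8b = forward_kl(softmax(logits_8b, T), q)
    kl_70b = forward_kl(softmax(logits_70b, T), q)
    return w_8b * kl_8b + (1 - w_8b) * kl_70b
```

On tokens where the two teachers agree, the two terms pull in the same direction; where they disagree, the student is nudged toward a compromise weighted by w_8b.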

A naive full KL over a 128k vocabulary is memory-heavy — the teacher logits alone are B · L · V floats (batch × sequence length × vocabulary). Production distillation uses top-k distillation (only match the teacher's top-k tokens per position) or offline logit caching. Quality is similar; the memory footprint drops by orders of magnitude.
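A sketch of one common way to do top-k distillation: keep only the teacher's top-k token ids per position and lump each distribution's remaining tail into a single bucket so both stay normalized. The tail-bucket trick is one of several options, not the only production recipe.

```python
import numpy as np

def softmax(z, T=1.0):
    # Temperature-scaled softmax over a 1-D logit vector.
    z = z / T
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def topk_kl(teacher_logits, student_logits, k=8, T=1.0):
    # Indices of the teacher's top-k tokens (only these need to be stored:
    # k + 1 floats per position instead of the full vocabulary).
    idx = np.argsort(teacher_logits)[-k:]
    pt_full = softmax(teacher_logits, T)
    ps_full = softmax(student_logits, T)
    # Append one 'other' bucket holding the leftover tail mass.
    pt = np.append(pt_full[idx], max(1.0 - pt_full[idx].sum(), 1e-12))
    ps = np.append(ps_full[idx], max(1.0 - ps_full[idx].sum(), 1e-12))
    return float(np.sum(pt * (np.log(pt) - np.log(ps))))
```

Since both truncated vectors are valid (k + 1)-way distributions, the usual KL properties hold: it is zero when student matches teacher and positive otherwise.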

Forward KL, reverse KL, and why modern distillation switched

Hinton's original loss uses forward KL: KL(p_teacher ‖ p_student). Forward KL is mean-seeking — wherever the teacher puts probability mass, the student is punished for putting near-zero mass there. A small student forced to cover every mode of a 405B teacher's distribution ends up averaging across incompatible modes: when the teacher thinks the answer is “Paris” OR “Lyon”, the student learns to put 40% on each, and then generates a blurry mixture that matches neither. This is the same pathology that makes VAEs produce blurry images.

MiniLLM (Gu et al., 2024) replaced forward KL with reverse KL: KL(p_student ‖ p_teacher). Reverse KL is mode-seeking — the student is punished only for putting mass where the teacher puts little, and is free to drop teacher modes it cannot represent. For a capacity-limited student this is exactly what you want: pick one mode confidently, get it right, and let the modes you cannot represent go to zero. The student ends up with a sharper, not blurrier, distribution than naive distillation produces. Mode collapse is reframed from a bug into a feature.
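The asymmetry is easy to see numerically. The toy distributions below are made up: a teacher with two near-equal modes, a student that commits to one mode, and a student that leaks mass onto a token the teacher rules out.

```python
import numpy as np

def kl(p, q):
    # KL(p || q) for categorical distributions with full support.
    return float(np.sum(p * np.log(p / q)))

# Toy teacher: two near-equal modes plus a tiny tail (values invented).
teacher = np.array([0.499, 0.499, 0.002])

# Mode-seeking student: commits to one teacher mode, drops the other.
picks_one_mode = np.array([0.98, 0.01, 0.01])
# Covering-gone-wrong student: leaks mass onto a token the teacher rules out.
leaks_mass = np.array([0.40, 0.40, 0.20])

# Forward KL (teacher || student) hammers the dropped mode...
assert kl(teacher, picks_one_mode) > kl(picks_one_mode, teacher)
# ...while reverse KL (student || teacher) hammers the leaked mass.
assert kl(leaks_mass, teacher) > kl(teacher, leaks_mass)
```

Under reverse KL, committing hard to one teacher mode is cheap and hallucinating mass the teacher rejects is expensive — exactly the incentive structure a capacity-limited student needs.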

◆ paper
MiniLLM: Knowledge Distillation of Large Language Models
Gu, Dong, Wang, Hou, Huang, Chen, Wei · ICLR 2024
arxiv:2306.08543
Introduces reverse-KL distillation for language models and shows that a 1.5B student distilled from a 13B teacher via reverse-KL matches or beats a forward-KL student on instruction-following evals. DistiLLM (Ko et al. 2024) extended this with a skew-KL objective that interpolates between forward and reverse.
comprehension check · 1 / 2

What is 'dark knowledge' in the distillation sense?