Microscale
Act IV · How They Learn
lesson scaling-laws · 10 min · 50 xp

Scaling laws, alive

Chinchilla vs inference-optimal

The Chinchilla loss function

historical note
2020–2022 · Kaplan and then Hoffmann
The first scaling-law paper (Kaplan et al. 2020) fit a power law to GPT-style model training and concluded that model size was the dominant factor — for a given compute budget, build the biggest model you can and feed it as much data as you have time for. That conclusion shaped the GPT-3 era. Two years later, Hoffmann et al. (2022) re-fit the scaling law on a broader grid of model sizes and token counts and discovered that Kaplan's models had been massively undertrained. The new optimum was roughly 20 tokens per parameter — so much more data, so much smaller a model for the same compute.
◆ paper
Training Compute-Optimal Large Language Models (Chinchilla)
Hoffmann, Borgeaud, Mensch, Buchatskaya, Cai, et al. · 2022 · NeurIPS 2022
arxiv:2203.15556
The paper that dethroned the “bigger is better” era and introduced the 20-tokens-per-parameter rule. Chinchilla 70B (trained to the new law) outperformed Gopher 280B (trained Kaplan-style) on almost every benchmark.

Hoffmann et al. fit a three-term power law to training loss:

L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}

E is the irreducible loss (how close to entropy you can get). A/N^\alpha is how much loss you pay for using fewer parameters. B/D^\beta is how much you pay for training on fewer tokens. They fit \alpha \approx 0.34 and \beta \approx 0.28. For a fixed training-compute budget C_\text{train} = 6ND, minimising loss gives an optimal pair (N^*, D^*) with D^*/N^* \approx 20.
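This optimisation is easy to do numerically: fix a budget C, note that D = C/(6N) is then determined by the choice of N, and scan N for the minimum loss. A minimal sketch, using the parametric constants Hoffmann et al. report for this fit (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7); the example budget is an assumption, chosen to be roughly Chinchilla 70B's training compute:

```python
# Chinchilla parametric loss, with the constants reported by Hoffmann et al. (2022)
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N: float, D: float) -> float:
    """E + A/N^alpha + B/D^beta: irreducible + parameter + data terms."""
    return E + A / N**alpha + B / D**beta

def chinchilla_optimum(C: float):
    """Minimise loss subject to C = 6*N*D by scanning N on a log grid.
    D is then fixed by the budget: D = C / (6*N)."""
    grid = (10 ** (e / 100) for e in range(8 * 100, 12 * 100))  # N in [1e8, 1e12)
    _, N_opt = min((loss(N, C / (6 * N)), N) for N in grid)
    return N_opt, C / (6 * N_opt)

N_opt, D_opt = chinchilla_optimum(5.76e23)  # roughly Chinchilla 70B's budget
print(f"N* ~ {N_opt:.3g} params, D* ~ {D_opt:.3g} tokens")
```

One caveat worth knowing: the published parametric constants above actually imply a tokens-per-parameter ratio somewhat higher than 20; the famous D/N ≈ 20 comes from Hoffmann et al.'s other two estimation approaches, which agree with each other more closely than with this fit.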

Add inference and watch the optimum move

Sardana et al. (2024) added the inference term. Now we're minimising:

\text{total cost}(N, D; T) = \underbrace{6ND}_{\text{training}} + \underbrace{2NT}_{\text{inference}}

subject to a total-cost budget, where T is the expected lifetime inference volume in tokens. The plot below shows loss as a function of model size N for a fixed total budget, in two cases: Chinchilla (zero inference volume) and a nonzero inference volume T. As T grows, the inference-aware optimum slides to the left, toward smaller, longer-trained models.
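The shifted optimum can be computed the same way as the training-only one: hold the total budget fixed, subtract the inference FLOPs 2NT for each candidate N, and spend whatever remains on training tokens. A sketch under the same assumed parametric constants as before (repeated so the snippet runs standalone); the budget and inference volumes are illustrative:

```python
# Parametric Chinchilla loss (Hoffmann et al. constants), repeated to run standalone
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N: float, D: float) -> float:
    return E + A / N**alpha + B / D**beta

def inference_aware_optimum(C: float, T: float) -> float:
    """Minimise loss subject to 6*N*D + 2*N*T = C (Sardana et al.'s framing).
    For each candidate N, the budget left after inference buys the
    training tokens: D = (C - 2*N*T) / (6*N)."""
    best = (float("inf"), None)
    for e in range(8 * 100, 12 * 100):  # N on a log grid, 1e8..1e12
        N = 10 ** (e / 100)
        D = (C - 2 * N * T) / (6 * N)
        if D <= 0:                      # inference alone exhausts the budget
            continue
        best = min(best, (loss(N, D), N))
    return best[1]

C = 5.76e23
N_chin = inference_aware_optimum(C, T=0)      # T = 0 recovers Chinchilla
N_infer = inference_aware_optimum(C, T=5e12)  # ~5T lifetime inference tokens
print(f"T = 0:    N* ~ {N_chin:.3g}")
print(f"T = 5e12: N* ~ {N_infer:.3g}")
```

At T = 0 the inference term vanishes and the optimiser reproduces the Chinchilla answer; with a few trillion lifetime inference tokens the optimal N drops noticeably, which is exactly the leftward shift the plot illustrates.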

◆ paper
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
Sardana, Portes, Doubov, Frankle · 2024
arxiv:2401.00448
The paper that re-derived the scaling optimum with inference cost included. Showed that for expected inference volumes above ~10⁹ tokens (standard for any deployed product), the optimum is substantially smaller and longer-trained than Chinchilla. This is why Llama 3 8B is trained at D/N ≈ 1,875 and why SmolLM3 3B is at D/N ≈ 3,700 — both far past Chinchilla's 20.
[interactive plot: loss vs model size N (1B–100B, log scale) at a fixed total budget, comparing the Chinchilla (training-only) curve with the inference-aware (Sardana) curve. At zero inference volume the two optima coincide (N ≈ 31.6B at this budget); raising the inference volume into the billions of tokens shifts the inference-aware optimum left.]

What this means for SLMs

As soon as your expected inference volume crosses a threshold, the inference-optimal model is much smaller than the training-optimal one. Every modern SLM is trained far past Chinchilla's D/N \approx 20: Llama 3 8B is at D/N \approx 1{,}875, SmolLM3 3B at D/N \approx 3{,}700, Phi-4-mini at D/N \approx 1{,}300. These are not mistakes. They are the new deployment-optimal.
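Those ratios are just D/N arithmetic on published training figures; a quick check, treating the parameter and token counts below as approximate public values:

```python
# Approximate public (parameter count, training tokens) figures for some SLMs
models = {
    "Llama 3 8B": (8e9, 15e12),    # ~15T tokens
    "SmolLM3 3B": (3e9, 11.1e12),  # ~11.1T tokens
    "Phi-4-mini": (3.8e9, 5e12),   # ~5T tokens
}
for name, (N, D) in models.items():
    ratio = D / N
    print(f"{name:>11}: D/N ~ {ratio:,.0f} ({ratio / 20:.0f}x Chinchilla's 20)")
```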

The broader lesson: every “law” in ML is a law for some cost function. Chinchilla optimized training compute; Sardana added inference; some day someone will add energy cost, memory cost, latency cost. Each addition moves the optimum. Know which cost function you are optimizing before you pick a model size.

comprehension check
comprehension · 1 / 3

What does Chinchilla's D/N ≈ 20 optimum assume?