Microscale
Act IV · How They Learn
lesson scaling-laws · 10 min · 50 xp

Scaling laws, alive

Chinchilla vs inference-optimal

The Chinchilla loss function

historical note
2020–2022 · Kaplan and then Hoffmann
The first scaling-law paper (Kaplan et al. 2020) fit a power law to GPT-style model training and concluded that model size was the dominant factor — for a given compute budget, build the biggest model you can and feed it as much data as you have time for. That conclusion shaped the GPT-3 era. Two years later, Hoffmann et al. (2022) re-fit the scaling law on a broader grid of model sizes and token counts and discovered that Kaplan's models had been massively undertrained. The new optimum was roughly 20 tokens per parameter — so much more data, so much smaller a model for the same compute.
◆ paper
Training Compute-Optimal Large Language Models (Chinchilla)
Hoffmann, Borgeaud, Mensch, Buchatskaya, Cai, et al. · 2022 · NeurIPS 2022
arxiv:2203.15556
The paper that dethroned the “bigger is better” era and introduced the 20-tokens-per-parameter rule. Chinchilla 70B (trained to the new law) outperformed Gopher 280B (trained Kaplan-style) on almost every benchmark.

Hoffmann et al. fit a three-term power law to training loss:

L(N, D) = E + \frac{A}{N^\alpha} + \frac{B}{D^\beta}

E is the irreducible loss (how close to entropy you can get). A/N^\alpha is how much loss you pay for using fewer parameters. B/D^\beta is how much you pay for training on fewer tokens. They fit \alpha \approx 0.34 and \beta \approx 0.28. For a fixed training-compute budget C_\text{train} = 6ND, minimising loss gives an optimal pair (N^*, D^*) with D^*/N^* \approx 20.
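This optimisation is easy to do numerically: fix a budget C, note that D = C/(6N) is then determined by the choice of N, and scan N for the minimum loss. A minimal sketch, using the parametric constants Hoffmann et al. report for this fit (E ≈ 1.69, A ≈ 406.4, B ≈ 410.7); the example budget is an assumption, chosen to be roughly Chinchilla 70B's training compute:

```python
# Chinchilla parametric loss, with the constants reported by Hoffmann et al. (2022)
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N: float, D: float) -> float:
    """E + A/N^alpha + B/D^beta: irreducible + parameter + data terms."""
    return E + A / N**alpha + B / D**beta

def chinchilla_optimum(C: float):
    """Minimise loss subject to C = 6*N*D by scanning N on a log grid.
    D is then fixed by the budget: D = C / (6*N)."""
    grid = (10 ** (e / 100) for e in range(8 * 100, 12 * 100))  # N in [1e8, 1e12)
    _, N_opt = min((loss(N, C / (6 * N)), N) for N in grid)
    return N_opt, C / (6 * N_opt)

N_opt, D_opt = chinchilla_optimum(5.76e23)  # roughly Chinchilla 70B's budget
print(f"N* ~ {N_opt:.3g} params, D* ~ {D_opt:.3g} tokens")
```

One caveat worth knowing: the published parametric constants above actually imply a tokens-per-parameter ratio somewhat higher than 20; the famous D/N ≈ 20 comes from Hoffmann et al.'s other two estimation approaches, which agree with each other more closely than with this fit.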

Add inference and watch the optimum move

Sardana et al. (2024) added the inference term. Now we're minimising:

\text{total cost}(N, D; T) = \underbrace{6ND}_{\text{training}} + \underbrace{2NT}_{\text{inference}}

subject to a total-cost budget, where T is the expected lifetime inference volume in tokens. The plot below shows loss as a function of model size N for a fixed total budget, in two cases: Chinchilla (zero inference volume) and a nonzero inference volume T. As T grows, the inference-aware optimum slides to the left, toward smaller, longer-trained models.
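The shifted optimum can be computed the same way as the training-only one: hold the total budget fixed, subtract the inference FLOPs 2NT for each candidate N, and spend whatever remains on training tokens. A sketch under the same assumed parametric constants as before (repeated so the snippet runs standalone); the budget and inference volumes are illustrative:

```python
# Parametric Chinchilla loss (Hoffmann et al. constants), repeated to run standalone
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N: float, D: float) -> float:
    return E + A / N**alpha + B / D**beta

def inference_aware_optimum(C: float, T: float) -> float:
    """Minimise loss subject to 6*N*D + 2*N*T = C (Sardana et al.'s framing).
    For each candidate N, the budget left after inference buys the
    training tokens: D = (C - 2*N*T) / (6*N)."""
    best = (float("inf"), None)
    for e in range(8 * 100, 12 * 100):  # N on a log grid, 1e8..1e12
        N = 10 ** (e / 100)
        D = (C - 2 * N * T) / (6 * N)
        if D <= 0:                      # inference alone exhausts the budget
            continue
        best = min(best, (loss(N, D), N))
    return best[1]

C = 5.76e23
N_chin = inference_aware_optimum(C, T=0)      # T = 0 recovers Chinchilla
N_infer = inference_aware_optimum(C, T=5e12)  # ~5T lifetime inference tokens
print(f"T = 0:    N* ~ {N_chin:.3g}")
print(f"T = 5e12: N* ~ {N_infer:.3g}")
```

At T = 0 the inference term vanishes and the optimiser reproduces the Chinchilla answer; with a few trillion lifetime inference tokens the optimal N drops noticeably, which is exactly the leftward shift the plot illustrates.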

◆ paper
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
Sardana, Portes, Doubov, Frankle · 2024
arxiv:2401.00448
The paper that re-derived the scaling optimum with inference cost included. Showed that for expected inference volumes above ~10⁹ tokens (standard for any deployed product), the optimum is substantially smaller and longer-trained than Chinchilla. This is why Llama 3 8B is trained at D/N ≈ 1,875 and why SmolLM3 3B is at D/N ≈ 3,700 — both far past Chinchilla's 20.
[interactive plot: loss vs model size N (1B–100B, log scale) at a fixed total budget, comparing the Chinchilla (training-only) curve with the inference-aware (Sardana) curve. At zero inference volume the two optima coincide (N ≈ 31.6B at this budget); raising the inference volume into the billions of tokens shifts the inference-aware optimum left.]

What this means for SLMs

As soon as your expected inference volume crosses a threshold, the inference-optimal model is much smaller than the training-optimal one. Every modern SLM is trained far past Chinchilla's D/N \approx 20: Llama 3 8B is at D/N \approx 1{,}875, SmolLM3 3B at D/N \approx 3{,}700, Phi-4-mini at D/N \approx 1{,}300. These are not mistakes. They are the new deployment-optimal.
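Those ratios are just D/N arithmetic on published training figures; a quick check, treating the parameter and token counts below as approximate public values:

```python
# Approximate public (parameter count, training tokens) figures for some SLMs
models = {
    "Llama 3 8B": (8e9, 15e12),    # ~15T tokens
    "SmolLM3 3B": (3e9, 11.1e12),  # ~11.1T tokens
    "Phi-4-mini": (3.8e9, 5e12),   # ~5T tokens
}
for name, (N, D) in models.items():
    ratio = D / N
    print(f"{name:>11}: D/N ~ {ratio:,.0f} ({ratio / 20:.0f}x Chinchilla's 20)")
```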

The broader lesson: every “law” in ML is a law for some cost function. Chinchilla optimized training compute; Sardana added inference; some day someone will add energy cost, memory cost, latency cost. Each addition moves the optimum. Know which cost function you are optimizing before you pick a model size.

comprehension check
comprehension · 1 / 3

What does Chinchilla's D/N ≈ 20 optimum assume?