The first scaling-law paper (Kaplan et al. 2020) fit a power law to GPT-style model training and concluded that model size was the dominant factor — for a given compute budget, build the biggest model you can and feed it as much data as you have time for. That conclusion shaped the GPT-3 era. Two years later, Hoffmann et al. (2022) re-fit the scaling law on a broader grid of model sizes and token counts and discovered that Kaplan's models had been massively undertrained. The new optimum was roughly 20 tokens per parameter — far more data and a far smaller model for the same compute.
◆ paper
Training Compute-Optimal Large Language Models (Chinchilla)
The paper that dethroned the “bigger is better” era and introduced the 20-tokens-per-parameter rule. Chinchilla 70B (trained to the new law) outperformed Gopher 280B (trained Kaplan-style) on almost every benchmark.
Hoffmann et al. fit a three-term power law to training loss:
L(N, D) = E + A/N^α + B/D^β
E is the irreducible loss (how close to the entropy of natural text you can get). A/N^α is how much loss you pay for using fewer parameters; B/D^β is how much you pay for training on fewer tokens. They fit α ≈ 0.34 and β ≈ 0.28. For a fixed training-compute budget C_train = 6ND, minimising loss gives an optimal pair (N*, D*) with D*/N* ≈ 20.
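The constrained minimisation above has a closed form: substitute D = C/(6N) and set dL/dN = 0. A minimal sketch, using the fitted constants quoted in the text (note that Hoffmann et al. derived the ≈20 tokens-per-parameter rule from their isoFLOP analysis; the parametric fit's own optimum is known to deviate from it somewhat, so treat these constants as illustrative):

```python
import numpy as np

# Fitted constants from Hoffmann et al. (2022), parametric approach
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N, D):
    """Parametric training loss L(N, D) = E + A/N^alpha + B/D^beta."""
    return E + A / N**alpha + B / D**beta

def chinchilla_optimal(C):
    """Closed-form minimiser of loss subject to C = 6*N*D.
    From dL/dN = 0 after substituting D = C/(6N):
      N* = (alpha*A / (beta*B))^(1/(alpha+beta)) * (C/6)^(beta/(alpha+beta))
    """
    G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
    N_star = G * (C / 6) ** (beta / (alpha + beta))
    D_star = C / (6 * N_star)
    return N_star, D_star

# Roughly Chinchilla's budget: 70B params x 1.4T tokens x 6 FLOPs
C = 5.76e23
N_star, D_star = chinchilla_optimal(C)

# Sanity check: the closed form beats every other (N, D) pair on the
# same compute budget.
for N in np.geomspace(N_star / 10, N_star * 10, 101):
    assert loss(N_star, D_star) <= loss(N, C / (6 * N)) + 1e-12
```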
Add inference and watch the optimum move
Sardana et al. (2024) added the inference term. Now we're minimising:
total cost(N, D; T) = 6ND (training) + 2NT (inference)

where T is the expected lifetime inference-token volume: training costs roughly 6 FLOPs per parameter per token (forward and backward pass), inference roughly 2 (forward only).
subject to a total-cost budget. The plot below shows loss as a function of model size N for a fixed total budget, for two cases: Chinchilla (zero inference volume) and the user-chosen inference volume. Drag the slider and watch the inference-aware optimum slide to the left — toward smaller, longer-trained models.
◆ paper
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
The paper that re-derived the scaling optimum with inference cost included. Showed that for expected inference volumes above ~10⁹ tokens (standard for any deployed product), the optimum is substantially smaller and longer-trained than Chinchilla. This is why Llama 3 8B is trained at D/N ≈ 1,875 and why SmolLM3 3B is at D/N ≈ 3,700 — both far past Chinchilla's 20.
[Interactive plot: loss vs model size N at a fixed total budget, comparing the Chinchilla (training-only) optimum against the inference-aware (Sardana) optimum. A slider sets the expected inference volume (e.g. 10B tokens); readouts show each optimum N. Zero inference reproduces Chinchilla; billions of inference tokens shift the optimum left.]
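The sweep behind this plot can be sketched numerically: fix the total FLOP budget, let D be whatever training tokens remain after paying for T inference tokens, and minimise loss over N. A sketch under the same 6ND / 2NT accounting; the grid bounds and the 1T-token inference volume are illustrative choices, not values from the paper:

```python
import numpy as np

# Chinchilla loss fit (Hoffmann et al.), constants as in the text
E, A, B = 1.69, 406.4, 410.7
alpha, beta = 0.34, 0.28

def loss(N, D):
    return E + A / N**alpha + B / D**beta

def optimal_N(C_total, T, n_grid=4000):
    """Grid-search the model size N that minimises loss when a total
    FLOP budget C_total = 6*N*D + 2*N*T must cover both training and
    T inference tokens (Sardana et al.'s accounting)."""
    N = np.geomspace(1e8, 1e12, n_grid)
    D = (C_total - 2 * N * T) / (6 * N)  # tokens left over for training
    ok = D > 0                           # budget must leave room to train
    N, D = N[ok], D[ok]
    return N[np.argmin(loss(N, D))]

C = 5.76e23
n_chinchilla = optimal_N(C, T=0)    # zero inference volume = Chinchilla
n_aware = optimal_N(C, T=1e12)      # serve 1T tokens after training
assert n_aware < n_chinchilla       # optimum shifts to a smaller model
```

The assertion at the end is the whole point of the plot: paying for inference inside the same budget always pushes the optimum toward a smaller, longer-trained model.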
What this means for SLMs
As soon as your expected inference volume crosses a threshold, the inference-optimal model is much smaller than the training-optimal one. Every modern SLM is trained far past Chinchilla's D/N ≈ 20 — Llama 3 8B is at D/N ≈ 1,875, SmolLM3 3B at D/N ≈ 3,700, Phi-4-mini at D/N ≈ 1,300. These are not mistakes; they are the new deployment-optimal.
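The quoted D/N ratios imply token counts you can check by simple arithmetic (parameter counts here are the models' headline sizes; Phi-4-mini's ~3.8B is an assumption on my part):

```python
# Training tokens implied by the D/N ratios quoted above
models = {
    "Llama 3 8B": (8e9, 1875),
    "SmolLM3 3B": (3e9, 3700),
    "Phi-4-mini": (3.8e9, 1300),  # assumed ~3.8B parameters
}
for name, (n_params, ratio) in models.items():
    tokens = n_params * ratio
    print(f"{name}: ~{tokens / 1e12:.1f}T training tokens "
          f"({ratio / 20:.0f}x Chinchilla's 20 tok/param)")
```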
The broader lesson: every “law” in ML is a law for some cost function. Chinchilla optimized training compute; Sardana added inference; some day someone will add energy cost, memory cost, latency cost. Each addition moves the optimum. Know which cost function you are optimizing before you pick a model size.