“Small” is a question about deployment, not a parameter count
When you decide how big to train a model, you're not optimising one thing — you're trading off at least three:
- Training compute — one-time cost, paid on a cluster for weeks.
- Inference compute — paid on every single token served, for the entire life of the model.
- Memory footprint — determines what hardware can host it at all.
The landmark Chinchilla result (Hoffmann et al., 2022) showed that given a fixed training budget, you should scale model size and training tokens roughly equally, at a ratio near 20 training tokens per parameter. That gives the minimum training loss per training FLOP. But it says nothing about inference.
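The Chinchilla rule can be turned into a tiny allocator: fix a training budget C, apply C ≈ 6ND and D ≈ 20N, and solve N = √(C/120). A minimal sketch — the function name and the example budget are mine, not from the paper:

```python
import math

def chinchilla_allocation(train_flops: float) -> tuple[float, float]:
    """Compute-optimal (parameters, tokens) for a fixed training budget.

    Uses C ≈ 6*N*D FLOPs and the Chinchilla rule D ≈ 20*N,
    so C ≈ 120*N**2, giving N = sqrt(C/120) and D = 20*N.
    """
    n_params = math.sqrt(train_flops / 120)
    train_tokens = 20 * n_params
    return n_params, train_tokens

# Chinchilla-70B's own budget (6 * 70e9 * 1.4e12 ≈ 5.9e23 FLOPs)
# recovers the paper's 70B-parameter, 1.4T-token allocation:
n, d = chinchilla_allocation(5.88e23)  # → n ≈ 70e9, d ≈ 1.4e12
```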
The 2024 follow-up “Beyond Chinchilla-Optimal” (Sardana et al.) added the missing term. Once you include the cost of serving, the optimum moves toward smaller, longer-trained models, because inference cost dominates lifetime compute the moment you serve more than a handful of billion tokens. Llama 3 8B was trained on roughly 15T tokens, about 1,875 tokens per parameter — about 94× more than Chinchilla says is optimal. That's not a mistake. It's the new frontier.
Play with the dial
Slide the parameter count and the serving volume. Watch the numbers on the right. In particular, watch the dashed teal curve at the bottom: it shows what fraction of lifetime compute is inference, not training, as a function of model size. For any meaningful deployment volume, that curve rises fast.
The math, made explicit
Three well-known approximations underlie the plot:
- Training compute: C_train ≈ 6ND FLOPs, for a model with N parameters trained on D tokens.
- Chinchilla-optimal data: D ≈ 20N.
- Inference compute: ≈ 2N FLOPs per generated token.

Combining the first two gives the Chinchilla training compute C_train ≈ 120N². The inference cost over D_inf served tokens is ≈ 2N·D_inf. The crossover (inference equals training) happens when 2N·D_inf = 120N², i.e. D_inf = 60N — in words: once you serve more tokens than 60× your parameter count, inference starts dominating. For a 3B model that's 180B tokens — just a few weeks of a mid-traffic product. After that, every one of those parameters is paying rent every token, and shrinking N pays back linearly.
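The derivation can be checked numerically. A minimal sketch under the same approximations (helper names are mine):

```python
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float):
    """Training vs. inference FLOPs under the 6ND / 2N-per-token approximations."""
    train = 6 * n_params * train_tokens
    inference = 2 * n_params * served_tokens
    return train, inference

def crossover_tokens(n_params: float) -> float:
    """Served tokens at which inference FLOPs equal Chinchilla training FLOPs:
    2*N*D_inf = 120*N**2  =>  D_inf = 60*N."""
    return 60 * n_params

# A Chinchilla-trained 3B model (D = 20N = 60B tokens) breaks even
# after serving 60N = 180B tokens; past that, inference dominates.
```

Past the crossover, the inference share of lifetime compute keeps climbing toward 1 — that is the dashed teal curve in the plot.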
Memory is a separate constraint
Compute tells you what costs money. Memory tells you what fits at all. FP16 weights are 2 bytes per parameter, so a 3B model is 6 GB of weights alone. A 70B model is 140 GB — already past any single consumer GPU. 4-bit quantization (Q4) cuts this by 4×: a 70B becomes 35 GB, fitting on a single 80 GB A100 or a Mac Studio with 96 GB of unified memory. That's the whole reason Act VII exists.
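The arithmetic generalizes to any parameter count and precision. A one-liner — note it counts raw weights only, ignoring the KV cache and runtime overhead that land on top:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage; KV cache and activations add to this."""
    return n_params * bits_per_weight / 8 / 1e9

weight_memory_gb(3e9, 16)   # FP16 3B  → 6.0 GB
weight_memory_gb(70e9, 16)  # FP16 70B → 140.0 GB
weight_memory_gb(70e9, 4)   # Q4 70B   → 35.0 GB
```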
Translate the FLOPs into the number a CFO actually asks for. At April-2026 spot pricing on open-weight inference providers (Together, Fireworks, DeepInfra), a Llama-3.1-8B endpoint runs roughly $0.10–0.20 per million output tokens; a 70B is $0.80–0.90; GPT-4o is $10; Claude Opus 4 is $75. That's a 500× gap from the cheapest 8B workhorse to the top of the frontier, for tokens that — on a well-scoped intent classification or a RAG rewrite — are indistinguishable to the end user. The entire commercial case for SLMs lives in that ratio: if your workload fits inside what an 8B can do, paying Opus-per-token is lighting ~99.8% of your inference bill on fire.
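To make the ratio concrete, the same arithmetic as a sketch. Prices are the figures quoted above, taking midpoints of the quoted ranges (my simplification); the dict keys are illustrative labels:

```python
# USD per million output tokens (April-2026 spot figures from the text;
# range midpoints are a simplification)
price_per_mtok = {
    "llama-3.1-8b": 0.15,
    "llama-3.1-70b": 0.85,
    "gpt-4o": 10.0,
    "claude-opus-4": 75.0,
}

def monthly_bill_usd(model: str, tokens_per_month: float) -> float:
    return price_per_mtok[model] * tokens_per_month / 1e6

# At 1B output tokens/month: ~$150 on the 8B vs ~$75,000 on Opus —
# the 500× gap, i.e. ~99.8% of the bill is the frontier premium.
```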
The “pragmatic frontier” of sub-3B isn't an aesthetic call — it's where three hard budgets intersect. Disk: a Q4-quantized 3B is ~1.8 GB, small enough to ship inside an iOS app bundle (Apple's soft limit on cellular download is 200 MB, but on-device downloads routinely hit a few GB — Gemma-2-2B-Q4 at 1.6 GB ships in Google AI Edge today). RAM: a 3B Q4 fits in the ~3 GB working set a mid-tier Android phone will give you without being killed by the OOM killer; a 7B at ~4 GB already needs a Pixel 9 Pro or better. Latency: at the 2N-FLOPs-per-token approximation, a 3B model decodes at ~60 tok/s on an M3-class NPU and ~120 tok/s on an H100 — fast enough to stay under the ~300 ms first-token budget a voice agent needs. A 13B model misses all three of those budgets at once. That's why Phi-3.5-mini (3.8B), Gemma-3-4B, Qwen3-4B, and Llama-3.2-3B cluster so tightly — they are each other's competitive set because the hardware envelope says so.
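The three budgets can be wired into a single feasibility check. Every constant below is an assumption calibrated to the figures above (≈0.6 bytes/param for Q4 with overhead so that 3B ≈ 1.8 GB, a ~2 GB shippable download, a 3 GB RAM working set with ~0.8 GB reserved for KV cache and runtime, and the sustained throughput implied by 60 tok/s at 3B), not device specs:

```python
def meets_mobile_budgets(n_params: float) -> dict[str, bool]:
    """Disk / RAM / latency feasibility for a Q4-quantized model on a
    mid-tier phone. All constants are rough calibrations, not specs."""
    q4_gb = n_params * 0.6 / 1e9             # ~0.6 bytes/param: Q4 + overhead
    eff_flops = 3.6e11                       # sustained rate implied by 60 tok/s @ 3B
    tok_per_s = eff_flops / (2 * n_params)   # 2N FLOPs per decoded token
    return {
        "disk_ok": q4_gb <= 2.0,             # shippable download size
        "ram_ok": q4_gb + 0.8 <= 3.0,        # +0.8 GB KV cache / runtime
        "latency_ok": tok_per_s >= 50.0,     # keeps the first-token budget reachable
    }

meets_mobile_budgets(3e9)   # all three True
meets_mobile_budgets(13e9)  # all three False — misses every budget at once
```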