The cost curve
Chinchilla scaling laws vs inference-optimal training — drag a slider from 1B to 500B parameters and see how model size drives compute cost and latency
Why this matters in dollars
Before any math, the number that decides every architecture choice in 2026: a million output tokens, priced across today's inference market.
Same task, ~750× price gap at the April-2026 snapshot above. If your workload (intent classification, RAG rewrite, function-call extraction, summarisation) fits inside what an 8B can do, paying Opus prices is lighting almost all of that inference bill on fire for tokens the user can't tell apart. The entire commercial case for SLMs lives in that ratio.
But how small can you go before the model stops being able to do the work at all? That's the rest of this lesson.
What's actually happening underneath
The price gap above isn't a markup. It's a mechanical consequence of three trade-offs every model designer is making at once:
- Training compute — one-time cost, paid on a cluster for weeks.
- Inference compute — paid on every single token served, for the entire life of the model.
- Memory footprint — determines what hardware can host it at all.
Training is one-time. Inference is forever. Once a model crosses a few weeks of mid-traffic serving, every parameter starts paying rent on every token it generates — and that rent is what shows up in the price ladder above. Smaller models pay less rent, faster.
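The "rent" argument can be sketched with two common approximations: training costs about 6 FLOPs per parameter per training token (the 6ND heuristic listed in the references), and generation costs about 2 FLOPs per parameter per token served. The 2N figure, the function names, and the example volumes below are assumptions for illustration, not numbers taken from this lesson's widget.

```python
def training_flops(n_params: float, train_tokens: float) -> float:
    """Approximate one-time training compute: 6 * N * D."""
    return 6.0 * n_params * train_tokens


def inference_flops(n_params: float, served_tokens: float) -> float:
    """Approximate lifetime serving compute: 2 * N * T (assumed heuristic)."""
    return 2.0 * n_params * served_tokens


def inference_share(n_params: float, train_tokens: float, served_tokens: float) -> float:
    """Fraction of lifetime compute spent on inference rather than training."""
    train = training_flops(n_params, train_tokens)
    serve = inference_flops(n_params, served_tokens)
    return serve / (train + serve)


def break_even_served_tokens(train_tokens: float) -> float:
    """Served tokens at which inference compute equals training compute.

    6 * N * D == 2 * N * T  =>  T == 3 * D, independent of model size N.
    """
    return 3.0 * train_tokens


if __name__ == "__main__":
    n = 3e9        # a 3B-parameter model
    d = 20 * n     # D/N ~ 20, the compute-optimal ratio cited in the references
    print(f"break-even served tokens: {break_even_served_tokens(d):.3g}")
    for served in (1e10, 1e11, 1e12, 1e13):  # hypothetical lifetime volumes
        share = inference_share(n, d, served)
        print(f"served={served:.0e}  inference share={share:.1%}")
```

Under these approximations the break-even point depends only on the training-token count: once the model has served three times as many tokens as it was trained on, inference has cost as much compute as training did.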
Play with the dial
Slide the parameter count and the serving volume. Watch the numbers on the right. The dashed teal curve at the bottom shows what fraction of lifetime compute is inference (not training) as a function of model size. For any meaningful deployment volume, that curve rises fast.
Where 3B sits in the world
Compute decides what costs money; memory decides what fits at all. FP16 weights are 2 bytes per parameter, so a 3B model is 6 GB of weights alone. A 70B model is 140 GB — already past any single consumer GPU. 4-bit quantization (Q4) slashes this by 4× — a 70B becomes 35 GB, fitting a single A100-80 or a Mac Studio with 96 GB unified memory. That's the whole reason Act VII exists.
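The weight arithmetic above is easy to reproduce. This is a minimal sketch that counts weights only, in decimal gigabytes; the function name is illustrative, and it ignores the KV cache, activations, and runtime overhead.

```python
def weight_footprint_gb(n_params: float, bits_per_param: float) -> float:
    """Weights-only footprint in decimal GB (1 GB = 1e9 bytes)."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1e9


if __name__ == "__main__":
    for name, n_params in (("3B", 3e9), ("70B", 70e9)):
        fp16 = weight_footprint_gb(n_params, 16)  # 2 bytes per parameter
        q4 = weight_footprint_gb(n_params, 4)     # idealised 4-bit weights
        print(f"{name}: FP16 = {fp16:.0f} GB, 4-bit = {q4:.1f} GB")
```

Real quantized files typically land somewhat above the idealised 4-bit figure, because they also store scaling factors and keep some tensors at higher precision.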
But memory is only the first hurdle. The reason the field keeps clustering at 3B specifically — Phi-3.5-mini, Gemma-3, Qwen3, Llama-3.2 all sitting in a tight band of model sizes — isn't consensus. It's where three hard budgets intersect. Each one has a concrete number a 3B model clears comfortably and a 13B model misses entirely:
A Q4-quantized 3B fits inside an iOS app bundle. Apple's soft cellular-download cap is 200 MB but on-device downloads routinely hit a few GB — Gemma-2-2B-Q4 at 1.6 GB ships in Google AI Edge today.
The working set a mid-tier Android phone will give you without the OOM killer terminating your process. A 7B model at ~4 GB already needs a Pixel 9 Pro or better.
A 3B model on M3-class Apple silicon (MLX on the GPU), by the FLOPs-per-token approximation; H100 single-stream is similar order of magnitude. Fast enough to stay under the 300 ms first-token budget a voice agent needs.
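One way to read the FLOPs-per-token approximation against the 300 ms first-token budget: processing the prompt costs roughly 2 FLOPs per parameter per prompt token, so time to first token is that total divided by the throughput the hardware actually sustains. The throughput value and the prompt length below are hypothetical placeholders, not measurements of any real chip, and the sketch ignores memory bandwidth, attention cost, and runtime overhead.

```python
TTFT_BUDGET_S = 0.300  # 300 ms first-token budget quoted in the text

# HYPOTHETICAL sustained throughput in FLOP/s; replace with a measured value.
ASSUMED_ACHIEVED_FLOPS = 5e12


def prefill_seconds(n_params: float, prompt_tokens: int,
                    achieved_flops_per_s: float = ASSUMED_ACHIEVED_FLOPS) -> float:
    """Estimated time to process the prompt: 2 * N * prompt_tokens / throughput."""
    return 2.0 * n_params * prompt_tokens / achieved_flops_per_s


def fits_budget(n_params: float, prompt_tokens: int,
                budget_s: float = TTFT_BUDGET_S) -> bool:
    """True if the estimated prefill time stays within the first-token budget."""
    return prefill_seconds(n_params, prompt_tokens) <= budget_s


if __name__ == "__main__":
    prompt_tokens = 100  # hypothetical short voice-agent prompt
    for name, n_params in (("3B", 3e9), ("13B", 13e9)):
        t = prefill_seconds(n_params, prompt_tokens)
        print(f"{name}: ~{t * 1000:.0f} ms, fits budget: {fits_budget(n_params, prompt_tokens)}")
```

Because the estimate scales linearly with parameter count, a 13B model takes a little over four times as long as a 3B model on the same hardware and prompt. Whether either clears the budget depends entirely on the measured throughput plugged in.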
A 13B model misses all three of these budgets at once. That's why Phi-3.5-mini at 3.8B, Gemma-3-4B, Qwen3-4B, and Llama-3.2-3B sit in the same competitive band on published benchmarks: Phi-3.5-mini-instruct at MMLU ~69 (Phi-3 tech report), Gemma-3-4B-IT at IFEval 90.2 (Gemma 3 tech report), Qwen3-4B with comparable scores in its own tech-report Table 4 (Qwen3 tech report), and Llama-3.2-3B state-of-the-art for its size class on summarization and instruction-following (Llama 3.2 announcement). They are each other's competitive set because the hardware envelope says so. The competition isn't about benchmarks; it's about which 3B model best uses the same fixed budget every other 3B has to live inside.
- Scaling Laws for Neural Language Models. Kaplan, McCandlish, Henighan et al. · 2020 · arXiv (OpenAI). Origin of the 6ND training-compute heuristic. Forward + backward FLOPs accounting.
- Training Compute-Optimal Large Language Models. Hoffmann, Borgeaud, Mensch et al. · 2022 · NeurIPS 2022. Compute-optimal D/N ≈ 20. Trained ~400 models from 70M to 16B params to fit the scaling law.
- Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws. Sardana, Portes, Doubov, Frankle · 2024 · ICML 2024. Adds inference cost to the Chinchilla objective. Quality keeps improving up to D/N ≈ 10,000.
- The Llama 3 Herd of Models. Grattafiori et al. · 2024 · arXiv (Meta AI). Llama 3 8B / 70B / 405B trained on 15T multilingual tokens; D/N = 1,875 for the 8B.
- Phi-3 Technical Report. Abdin et al. (Microsoft) · 2024 · arXiv (Microsoft). Phi-3.5-mini (3.8B): MMLU ~69 on the instruct variant.
- Gemma 3 Technical Report. Gemma Team (Google DeepMind) · 2025 · arXiv (Google DeepMind). 5:1 local:global sliding-window ratio. IFEval 90.2 on the 4B-IT.
- Qwen3 Technical Report. Qwen Team (Alibaba) · 2025 · arXiv (Alibaba). Qwen3-4B-Base benchmarks across MMLU, GSM8K, IFEval; see Table 4.
- Llama 3.2: Revolutionizing edge AI and vision. Meta AI · 2024 · Meta AI announcement. Llama 3.2 1B/3B, Meta's edge-deployment line.
- Gemma-2-2B-it GGUF (Q4_K_M = 1.64 GB). bartowski (community quantization) · 2024 · Hugging Face. Q4_K_M Gemma-2-2B at 1.64 GB, the size that fits inside a mobile app bundle.