“Small” is a question about deployment, not a parameter count
When you decide how big to train a model, you're not optimising one thing — you're trading off at least three:
- Training compute — one-time cost, paid on a cluster for weeks.
- Inference compute — paid on every single token served, for the entire life of the model.
- Memory footprint — determines what hardware can host it at all.
The landmark Chinchilla result (Hoffmann et al., 2022) showed that given a fixed training budget, you should scale model size and training tokens roughly equally, at a ratio near 20 training tokens per parameter. That gives the minimum training loss per training FLOP. But it says nothing about inference.
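The Chinchilla rule can be turned into a tiny allocator: fix a training budget C, apply C ≈ 6ND and D ≈ 20N, and solve N = √(C/120). A minimal sketch — the function name and the example budget are mine, not from the paper:

```python
import math

def chinchilla_allocation(train_flops: float) -> tuple[float, float]:
    """Compute-optimal (parameters, tokens) for a fixed training budget.

    Uses C ≈ 6*N*D FLOPs and the Chinchilla rule D ≈ 20*N,
    so C ≈ 120*N**2, giving N = sqrt(C/120) and D = 20*N.
    """
    n_params = math.sqrt(train_flops / 120)
    train_tokens = 20 * n_params
    return n_params, train_tokens

# Chinchilla-70B's own budget (6 * 70e9 * 1.4e12 ≈ 5.9e23 FLOPs)
# recovers the paper's 70B-parameter, 1.4T-token allocation:
n, d = chinchilla_allocation(5.88e23)  # → n ≈ 70e9, d ≈ 1.4e12
```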
The 2024 follow-up “Beyond Chinchilla-Optimal” (Sardana et al.) added the missing term. Once you include the cost of serving, the optimum moves toward smaller, longer-trained models, because inference cost dominates lifetime compute the moment you serve more than a handful of billion tokens. Llama 3 8B was trained on roughly 15T tokens, about 1,875 tokens per parameter — about 94× more than Chinchilla says is optimal. That's not a mistake. It's the new frontier.
Play with the dial
Slide the parameter count and the serving volume. Watch the numbers on the right. In particular, watch the dashed teal curve at the bottom: it shows what fraction of lifetime compute is inference, not training, as a function of model size. For any meaningful deployment volume, that curve rises fast.
The math, made explicit
Three well-known approximations underlie the plot:
- Training compute: C_train ≈ 6ND FLOPs, for a model with N parameters trained on D tokens.
- Chinchilla-optimal data: D ≈ 20N.
- Inference compute: ≈ 2N FLOPs per generated token.

Combining the first two gives the Chinchilla training compute C_train ≈ 120N². The inference cost over D_inf served tokens is ≈ 2N·D_inf. The crossover (inference equals training) happens when 2N·D_inf = 120N², i.e. D_inf = 60N — in words: once you serve more tokens than 60× your parameter count, inference starts dominating. For a 3B model that's 180B tokens — just a few weeks of a mid-traffic product. After that, every one of those parameters is paying rent every token, and shrinking N pays back linearly.
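The derivation can be checked numerically. A minimal sketch under the same approximations (helper names are mine):

```python
def lifetime_flops(n_params: float, train_tokens: float, served_tokens: float):
    """Training vs. inference FLOPs under the 6ND / 2N-per-token approximations."""
    train = 6 * n_params * train_tokens
    inference = 2 * n_params * served_tokens
    return train, inference

def crossover_tokens(n_params: float) -> float:
    """Served tokens at which inference FLOPs equal Chinchilla training FLOPs:
    2*N*D_inf = 120*N**2  =>  D_inf = 60*N."""
    return 60 * n_params

# A Chinchilla-trained 3B model (D = 20N = 60B tokens) breaks even
# after serving 60N = 180B tokens; past that, inference dominates.
```

Past the crossover, the inference share of lifetime compute keeps climbing toward 1 — that is the dashed teal curve in the plot.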
Memory is a separate constraint
Compute tells you what costs money. Memory tells you what fits at all. FP16 weights are 2 bytes per parameter, so a 3B model is 6 GB of weights alone. A 70B model is 140 GB — already past any single consumer GPU. 4-bit quantization (Q4) cuts this by 4×: a 70B becomes 35 GB, fitting on a single 80 GB A100 or a Mac Studio with 96 GB of unified memory. That's the whole reason Act VII exists.
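The arithmetic generalizes to any parameter count and precision. A one-liner — note it counts raw weights only, ignoring the KV cache and runtime overhead that land on top:

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Raw weight storage; KV cache and activations add to this."""
    return n_params * bits_per_weight / 8 / 1e9

weight_memory_gb(3e9, 16)   # FP16 3B  → 6.0 GB
weight_memory_gb(70e9, 16)  # FP16 70B → 140.0 GB
weight_memory_gb(70e9, 4)   # Q4 70B   → 35.0 GB
```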
Translate the FLOPs into the number a CFO actually asks for. At April-2026 spot pricing on open-weight inference providers (Together, Fireworks, DeepInfra), a Llama-3.1-8B endpoint runs roughly $0.10–0.20 per million output tokens; a 70B is $0.80–0.90; GPT-4o is $10; Claude Opus 4 is $75. That's a 500× gap from the cheapest 8B workhorse to the top of the frontier, for tokens that — on a well-scoped intent classification or a RAG rewrite — are indistinguishable to the end user. The entire commercial case for SLMs lives in that ratio: if your workload fits inside what an 8B can do, paying Opus-per-token is lighting ~99.8% of your inference bill on fire.
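To make the ratio concrete, the same arithmetic as a sketch. Prices are the figures quoted above, taking midpoints of the quoted ranges (my simplification); the dict keys are illustrative labels:

```python
# USD per million output tokens (April-2026 spot figures from the text;
# range midpoints are a simplification)
price_per_mtok = {
    "llama-3.1-8b": 0.15,
    "llama-3.1-70b": 0.85,
    "gpt-4o": 10.0,
    "claude-opus-4": 75.0,
}

def monthly_bill_usd(model: str, tokens_per_month: float) -> float:
    return price_per_mtok[model] * tokens_per_month / 1e6

# At 1B output tokens/month: ~$150 on the 8B vs ~$75,000 on Opus —
# the 500× gap, i.e. ~99.8% of the bill is the frontier premium.
```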
The “pragmatic frontier” of sub-3B isn't an aesthetic call — it's where three hard budgets intersect. Disk: a Q4-quantized 3B is ~1.8 GB, small enough to ship inside an iOS app bundle (Apple's soft limit on cellular download is 200 MB, but on-device downloads routinely hit a few GB — Gemma-2-2B-Q4 at 1.6 GB ships in Google AI Edge today). RAM: a 3B Q4 fits in the ~3 GB working set a mid-tier Android phone will give you without being killed by the OOM killer; a 7B at ~4 GB already needs a Pixel 9 Pro or better. Latency: at the 2N-FLOPs-per-token approximation, a 3B model decodes at ~60 tok/s on an M3-class NPU and ~120 tok/s on an H100 — fast enough to stay under the ~300 ms first-token budget a voice agent needs. A 13B model misses all three of those budgets at once. That's why Phi-3.5-mini (3.8B), Gemma-3-4B, Qwen3-4B, and Llama-3.2-3B cluster so tightly — they are each other's competitive set because the hardware envelope says so.
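The three budgets can be wired into a single feasibility check. Every constant below is an assumption calibrated to the figures above (≈0.6 bytes/param for Q4 with overhead so that 3B ≈ 1.8 GB, a ~2 GB shippable download, a 3 GB RAM working set with ~0.8 GB reserved for KV cache and runtime, and the sustained throughput implied by 60 tok/s at 3B), not device specs:

```python
def meets_mobile_budgets(n_params: float) -> dict[str, bool]:
    """Disk / RAM / latency feasibility for a Q4-quantized model on a
    mid-tier phone. All constants are rough calibrations, not specs."""
    q4_gb = n_params * 0.6 / 1e9             # ~0.6 bytes/param: Q4 + overhead
    eff_flops = 3.6e11                       # sustained rate implied by 60 tok/s @ 3B
    tok_per_s = eff_flops / (2 * n_params)   # 2N FLOPs per decoded token
    return {
        "disk_ok": q4_gb <= 2.0,             # shippable download size
        "ram_ok": q4_gb + 0.8 <= 3.0,        # +0.8 GB KV cache / runtime
        "latency_ok": tok_per_s >= 50.0,     # keeps the first-token budget reachable
    }

meets_mobile_budgets(3e9)   # all three True
meets_mobile_budgets(13e9)  # all three False — misses every budget at once
```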