Microscale
Act I · The Landscape
lesson five-drivers · 7 min · 25 xp

The five drivers

Inference · edge · latency · specialization · velocity

Why did the frontier come back for the small?

From 2020–2023 the rule was obvious: bigger is better. Every step up the scaling ladder — GPT-2 → GPT-3 → GPT-4 — revealed capabilities the previous step didn't have. So why, in 2024–2026, did the frontier labs all release sub-5B-parameter variants (Phi-4-mini, Llama 3.2, Qwen3, Gemma 3) as first-class citizens of their model families?

Five reasons, none of them “small is cute.” Flip through them on the right — each is a distinct force, and understanding them individually is how you pick the right model for a given job in the rest of the curriculum.

step through the drivers
Inference economics
Training is one-time; inference is forever.
A 70B model serving a billion tokens a day accumulates inference compute that, over its deployment lifetime, dwarfs its training cost. Once the deployment curve dominates, the optimization target changes: you want fewer active parameters, not more. Chinchilla says D/N ≈ 20 for compute-optimal training; inference-optimal is D/N in the thousands. Every modern SLM is trained far past Chinchilla because the inference ledger makes it cheaper over the life of the model.
key metric · $C_{\text{infer}} \gg C_{\text{train}}$ once you cross $T \approx 60N$
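The crossover in the key metric follows from the standard FLOP approximations (≈6 FLOPs per parameter per training token, ≈2 per served token). A minimal sketch of that arithmetic; the function names are illustrative, not from any library:

```python
def train_flops(n_params: int, n_train_tokens: int) -> int:
    # Standard approximation: ~6 FLOPs per parameter per training token.
    return 6 * n_params * n_train_tokens

def infer_flops(n_params: int, n_served_tokens: int) -> int:
    # ~2 FLOPs per parameter per served token (forward pass only).
    return 2 * n_params * n_served_tokens

def crossover_tokens(n_params: int, tokens_per_param: int = 20) -> int:
    # Served-token count T where lifetime inference compute overtakes
    # training compute: 2NT = 6ND  ->  T = 3D.
    # With Chinchilla-optimal D = 20N, that gives T = 60N.
    return 3 * tokens_per_param * n_params
```

For a Chinchilla-optimal 1B-parameter model, the crossover sits at 60B served tokens; everything beyond that point is dominated by the inference ledger.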

None of these means "smaller is better" in general

Pay attention to what I'm not claiming. I'm not saying SLMs outperform LLMs at open-domain reasoning. They don't. I'm not saying you should replace your production GPT-4 with a 3B model and hope for the best. That will disappoint you.

What I'm saying is that each of these five drivers is a real, domain-specific force that can make a smaller model the right answer. In practice most production systems end up with an SLM somewhere in the pipeline (intent classification, tool-call execution, summarization of retrieved passages) even if a larger model handles the free-form generation hop. The next lesson makes that concrete.

A concrete example of the five-driver stack in one system: an April-2026 voice agent at a mid-market SaaS company typically looks like Deepgram (ASR) → Qwen3-1.7B intent classifier → Phi-4-mini tool-caller → GPT-4o-mini fallback for open-ended answers → ElevenLabs (TTS). Four models in the hot path, and only one of them is over 4B parameters. The 1.7B classifier runs in <40 ms on a co-located A10G and handles ~85% of turns without ever waking the fallback; the fallback fires on the ~15% of turns where the user said something the classifier scored below its confidence threshold.

Every driver you just read is in there:
- inference economics: the classifier is ~200× cheaper per token than the fallback
- edge-ish deployment: the SLMs colocate with the audio pipeline, eliminating a network hop
- latency: first-token under 300 ms because prefill is tiny
- specialization: the tool-caller was fine-tuned on the company's own function registry
- velocity: they re-ran the classifier fine-tune in three hours last Tuesday

The frontier model isn't banished; it's rationed.
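The confidence-threshold routing in that pipeline can be sketched in a few lines. This is an illustrative pattern, not the company's actual code; `classify_intent`, `call_fallback`, and the 0.85 threshold are all assumed names and values:

```python
THRESHOLD = 0.85  # assumed cutoff; tuned per deployment in practice

def route_turn(utterance, classify_intent, call_fallback):
    """Route one conversational turn between the SLM and the fallback.

    classify_intent(utterance) -> (intent, confidence in [0, 1])
    call_fallback(utterance)   -> free-form answer from the large model
    """
    intent, confidence = classify_intent(utterance)
    if confidence >= THRESHOLD:
        # The small classifier's answer is trusted as-is (~85% of turns).
        return ("slm", intent)
    # Low confidence wakes the larger fallback model (~15% of turns).
    return ("fallback", call_fallback(utterance))
```

The design point worth noticing: the expensive model is only invoked inside the low-confidence branch, so its cost scales with the classifier's miss rate rather than with total traffic.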

comprehension check
comprehension · 1 / 3

Which statement best captures the inference-economics driver?