Twelve moments where someone has to pick a model
You're going to read twelve little stories below — real things people actually ask language models to do, day to day. For each one, you're making the call that an engineer in the room would make: does this sound like an SLM job, is this one where the big frontier model earns its keep, or is this one of those hybrid cases where the right answer is “both, but each doing a different thing”?
Production is usually uglier than a clean choice — there are budget constraints, latency SLAs, risk tolerance, compliance, the team's comfort level. The answers I've marked as correct are the most defensible starting point: where you'd begin if someone handed you a blank whiteboard and said “design this from scratch in April 2026.”
Don't worry about being wrong — each explanation tells you why the starting point is what it is, and more importantly, what signal in the story tipped the answer one way or the other. That's the pattern you're training.
What you're actually pattern-matching on
Look at the ones you got wrong. If most of your misses went bigger than the right answer — if you thought “I'll just reach for the LLM, it's safer” — that's the real lesson of this exercise. For about five years the industry default was “use the biggest thing you can afford.” The default is quietly flipping: use the smallest thing that can do the job reliably, and reach for a bigger one only when something in the task genuinely requires it. That's not a contrarian take; it's the direct consequence of the cost curve you saw earlier.
Signals that tip a task toward a small model: narrow label set, known schema, fixed vocabulary, high volume, tight latency, you own the input distribution. Signals that tip it toward a big model: open-ended input, genuine reasoning across multiple facts, creative writing of any meaningful length, or a task the system has never seen before and can't be retrained on. Hybrid shows up whenever one hop wants one and a different hop wants the other — which in practice is more often than you'd think.
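The triage above can be written down as a toy decision rule. This is a minimal sketch, not a real router: every field name and the tally-based logic are hypothetical, chosen only to mirror the signal lists in the paragraph.

```python
from dataclasses import dataclass


@dataclass
class TaskSignals:
    # Hypothetical yes/no signals, mirroring the two lists above.
    # Small-model signals:
    narrow_label_set: bool = False
    known_schema: bool = False
    high_volume: bool = False
    tight_latency: bool = False
    own_input_distribution: bool = False
    # Big-model signals:
    open_ended_input: bool = False
    multi_fact_reasoning: bool = False
    long_creative_writing: bool = False
    unseen_task: bool = False


def route(task: TaskSignals) -> str:
    """Toy triage: tally small-model and big-model signals."""
    small = sum([task.narrow_label_set, task.known_schema,
                 task.high_volume, task.tight_latency,
                 task.own_input_distribution])
    big = sum([task.open_ended_input, task.multi_fact_reasoning,
               task.long_creative_writing, task.unseen_task])
    if small and big:
        # One hop wants each model: split the pipeline.
        return "hybrid"
    return "slm" if small >= big else "llm"


# A bank-queue router: narrow labels, huge volume, no open-ended hop.
print(route(TaskSignals(narrow_label_set=True, high_volume=True)))  # → slm
```

The point of the sketch is the shape of the decision, not the tallying: in practice the “hybrid” branch fires whenever any big-model signal coexists with any small-model signal, which is why hybrids are more common than a first pass suggests.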
There's a more precise way to think about “narrow.” Every production task has two separable difficulties. The first is distribution shift: how far the real-world inputs sit from the base model's pretraining distribution. The second is a capability ceiling: whether the task requires a cognitive operation (multi-hop reasoning, novel code synthesis, long-range coherence) that the architecture can't perform no matter how much you fine-tune it. A narrow task — in the sense this lesson means it — is one with high distribution shift but a low capability ceiling. Routing a bank's 40 support queues is wildly out-of-distribution for pretraining (no web corpus contains your internal queue names), but it's cognitively trivial: the model just needs to learn a mapping from phrasings to labels. That's the shape where a fine-tuned SLM dominates, because every parameter you have is spent absorbing your distribution instead of holding onto Shakespeare and Python and Hindi grammar. The xLAM-7B-vs-GPT-4 result on the Berkeley Function Calling Leaderboard is the same phenomenon: tool-calling is massively out-of-distribution for web pretraining, but its capability ceiling is low enough that a specialised 7B saturates it.
The tasks where SLMs still lose are the ones where the capability ceiling is the binding constraint. AIME math isn't out-of-distribution — the problems look like the ones in training — but the cognitive operation (multi-step deductive search with backtracking) is one that seems to require a model big enough to host parallel reasoning traces. Phi-4-mini-reasoning at 3.8B tops out around AMC-10 level (~80% on GSM8K, ~35% on MATH-500); o3-mini and DeepSeek-R1 clear AIME. No amount of fine-tuning a 1B model on math data closes that gap, because the ceiling isn't about what the model has seen — it's about what the forward pass can compute in one sweep. Knowing which failure you're looking at — distribution shift or capability ceiling — is the whole diagnostic.
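The diagnostic can be phrased as a decision rule: fine-tune a small model on in-domain data and watch where the accuracy lands. If it climbs to the target, the gap was distribution shift; if it barely moves off the baseline, you've hit the ceiling. A crude sketch, with all thresholds hypothetical:

```python
def diagnose(baseline_acc: float, finetuned_acc: float,
             target_acc: float, margin: float = 0.05) -> str:
    """Crude read on why a small model underperforms a task.

    baseline_acc  : zero-shot accuracy of the small model
    finetuned_acc : accuracy after fine-tuning on in-domain data
    target_acc    : what the big model (or your SLA) achieves
    margin        : hypothetical tolerance for "close enough"
    """
    if finetuned_acc >= target_acc - margin:
        # Fine-tuning closed the gap: the inputs were just
        # out-of-distribution (the bank-queue case).
        return "distribution shift"
    if finetuned_acc - baseline_acc < margin:
        # More in-domain data barely moves the needle: the forward
        # pass can't compute what the task needs (the AIME case).
        return "capability ceiling"
    return "inconclusive: keep scaling data before scaling the model"


print(diagnose(0.30, 0.92, 0.95))  # → distribution shift
print(diagnose(0.30, 0.33, 0.95))  # → capability ceiling
```

The middle branch matters in practice: a curve that is still climbing but hasn't converged tells you to collect more data before concluding the architecture is the problem.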
The rest of Microscale is the working-out of this instinct. Act II shows you how SLMs actually work inside. Act III is the bestiary of ones you can use today. Act IV is how they learn to be that good. Act V is where they still break. Acts VI through IX take you from “I want a specialist” to “the specialist is live, serving traffic, paying for itself.”