Microscale
Act I · The Landscape
lesson when-slm-fits · 6 min · 30 xp

When an SLM fits

Drag 12 scenarios into the right buckets

Twelve moments where someone has to pick a model

You're going to read twelve little stories below — real things people actually ask language models to do, day to day. For each one, you're making the call that an engineer in the room would make: does this sound like an SLM job, is this one where the big frontier model earns its keep, or is this one of those hybrid cases where the right answer is “both, but each doing a different thing”?

Production is usually uglier than a clean choice — there are budget constraints, latency SLAs, risk tolerance, compliance, the team's comfort level. The answers I've marked as correct are the most defensible starting point: where you'd begin if someone handed you a blank whiteboard and said “design this from scratch in April 2026.”

Don't worry about being wrong — each explanation tells you why the starting point is what it is, and more importantly, what signal in the story tipped the answer one way or the other. That's the pattern you're training.

A customer just messaged your bank's help line. You have forty possible support queues to route them into — loan defaults, card replacement, business onboarding, the lot. Who handles this?
A designer says 'build me a settings panel with a theme picker and two tabs'. You want the actual React component back, compiled, ready to drop in.
Someone asks your agent 'what's the weather in Tokyo tomorrow and should I bring an umbrella'. You have fifty tools sitting in a registry, one of which is a weather API. You need to pick the right one and fill out its arguments as valid JSON.
A junior associate drops a 50-page contract on your desk and asks for a real analysis — the risks, the unusual clauses, how it compares to the standard template.
Your docs site gets a question. You've already retrieved the five most relevant passages from the vector store. Now you need to weave them into a single clear answer.
A user is writing a short story and wants a scene drafted — 1500 words, specific tone, a twist at the end, characters they've been developing across several conversations.
Your voice agent hears 'remind me about the Friday review at 3pm'. You need the date, the time, and the intent extracted in under 200 milliseconds, before the user notices a pause.
Someone hands you a hard AIME problem — think olympiad math, multiple non-obvious steps, the kind of thing that would stump most undergraduates.
Your inbox provider processes ten million incoming emails a day. Each one needs to be flagged as spam or legit, and you're paying per call.
A clinician asks 'what's the current first-line treatment for Stage II colon cancer in patients over 70?' — they want a real answer, with citations to actual papers, and no confabulation.
A user hands your agent an ambiguous goal — 'plan my trip to Kyoto next month, I want it to be relaxed but I don't want to miss anything real'. You need to break this into a ten-step workflow and actually execute it.
Your support desk receives short chat messages in a dozen languages — Spanish, Hindi, Vietnamese, Arabic — and needs to translate them into English so the support team can answer.

What you're actually pattern-matching on

Look at the ones you got wrong. If most of your misses went bigger than the right answer — if you thought “I'll just reach for the LLM, it's safer” — that's the real lesson of this exercise. For about five years the industry default was “use the biggest thing you can afford.” The default is quietly flipping: use the smallest thing that can do the job reliably, and reach for a bigger one only when something in the task genuinely requires it. That's not a contrarian take; it's the direct consequence of the cost curve you saw earlier.

Signals that tip a task toward a small model: narrow label set, known schema, fixed vocabulary, high volume, tight latency, you own the input distribution. Signals that tip it toward a big model: open-ended input, genuine reasoning across multiple facts, creative writing of any meaningful length, or a task the system has never seen before and can't be retrained on. Hybrid shows up whenever one hop wants one and a different hop wants the other — which in practice is more often than you'd think.
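Those signals can be written down as a checklist. Here's a toy sketch of that decision as code — the signal names and the tallying rule are illustrative assumptions for this lesson, not a production rubric:

```python
# Toy routing heuristic over the small-model / big-model signals above.
# Signal names and the simple tallying rule are assumptions for illustration.

def route(task: dict) -> str:
    """Return 'slm', 'llm', or 'hybrid' from coarse task signals."""
    small = sum([
        task.get("label_set_is_narrow", False),
        task.get("schema_is_known", False),
        task.get("volume_is_high", False),
        task.get("latency_is_tight", False),
        task.get("input_distribution_is_owned", False),
    ])
    big = sum([
        task.get("input_is_open_ended", False),
        task.get("needs_multi_fact_reasoning", False),
        task.get("needs_long_creative_output", False),
        task.get("task_is_novel", False),
    ])
    if small and big:
        return "hybrid"  # different hops want different models
    return "slm" if small >= big else "llm"

# e.g. the spam-flagging story: high volume, narrow labels, owned inputs
print(route({
    "label_set_is_narrow": True,
    "volume_is_high": True,
    "input_distribution_is_owned": True,
}))  # -> slm
```

Notice that `hybrid` falls out naturally: the moment any small-model signal and any big-model signal fire on the same task, you're looking at two hops, not one model.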

There's a more precise way to think about “narrow.” Every production task has two separable difficulties. The first is distribution shift: how far the real-world inputs sit from the base model's pretraining distribution. The second is a capability ceiling: whether the task requires a cognitive operation (multi-hop reasoning, novel code synthesis, long-range coherence) that the architecture can't perform at any amount of fine-tuning. A narrow task — in the sense this lesson means it — is one with high distribution shift but a low capability ceiling. Routing a bank's 40 support queues is wildly out-of-distribution for pretraining (no web corpus contains your internal queue names), but it's cognitively trivial: the model just needs to learn a mapping from phrasings to labels. That's the shape where a fine-tuned SLM dominates, because every parameter you have is spent absorbing your distribution instead of holding onto Shakespeare and Python and Hindi grammar. The xLAM-7B-vs-GPT-4 result on the Berkeley Function Calling Leaderboard is the same phenomenon: tool-calling is massively out-of-distribution for web pretraining, but its capability ceiling is low enough that a specialised 7B saturates it.
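To make “cognitively trivial mapping from phrasings to labels” concrete, here's a deliberately tiny sketch of queue routing with no neural model at all — just bag-of-words overlap against per-queue vocabularies. The queue names and training messages are made up for the sketch; a real system would fine-tune a small model on thousands of labelled tickets, but the *shape* of the task is the same:

```python
# Toy illustration of "high distribution shift, low capability ceiling":
# routing support messages to internal queue names (which no web corpus
# contains) is just a learned mapping from phrasings to labels.
# Queue names and examples are hypothetical.
from collections import Counter

TRAIN = [
    ("my card was stolen, I need a new one", "CARD_REPLACEMENT"),
    ("lost my debit card yesterday",         "CARD_REPLACEMENT"),
    ("I missed a loan payment, what now",    "LOAN_DEFAULT"),
    ("behind on my loan, can we talk terms", "LOAN_DEFAULT"),
    ("opening an account for my company",    "BUSINESS_ONBOARDING"),
    ("how do I register my business here",   "BUSINESS_ONBOARDING"),
]

def tokens(text):
    return text.lower().replace(",", " ").split()

# one bag-of-words "centroid" per queue, built from the labelled examples
centroids = {}
for text, label in TRAIN:
    centroids.setdefault(label, Counter()).update(tokens(text))

def route_queue(message):
    """Pick the queue whose training vocabulary overlaps the message most."""
    words = set(tokens(message))
    return max(centroids, key=lambda q: sum(centroids[q][w] for w in words))

print(route_queue("lost my card, send a replacement"))  # CARD_REPLACEMENT
```

Everything this toy spends its “parameters” on is absorbing your distribution — which phrasings map to which queue. That's exactly the budget argument for the fine-tuned SLM: none of its capacity is wasted holding onto Shakespeare, Python, or Hindi grammar.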

The tasks where SLMs still lose are the ones where the capability ceiling is the binding constraint. AIME math isn't out-of-distribution — the problems look like the ones in training — but the cognitive operation (multi-step deductive search with backtracking) is one that seems to require a model big enough to host parallel reasoning traces. Phi-4-mini-reasoning at 3.8B tops out around AMC-10 level (~80% on GSM8K, ~35% on MATH-500); o3-mini and DeepSeek-R1 clear AIME. No amount of fine-tuning a 1B model on math data closes that gap, because the ceiling isn't about what the model has seen — it's about what the forward pass can compute in one sweep. Knowing which failure you're looking at — distribution shift or capability ceiling — is the whole diagnostic.

The rest of Microscale is the working-out of this instinct. Act II shows you how SLMs actually work inside. Act III is the bestiary of ones you can use today. Act IV is how they learn to be that good. Act V is where they still break. Acts VI through IX take you from “I want a specialist” to “the specialist is live, serving traffic, paying for itself.”