Microscale
Act I · The Landscape
lesson when-slm-fits · 6 min · 30 xp

When an SLM fits

Drag 12 scenarios into the right buckets

Twelve moments where someone has to pick a model

You're going to read twelve little stories below — real things people actually ask language models to do, day to day. For each one, you're making the call that an engineer in the room would make: does this sound like an SLM job, is this one where the big frontier model earns its keep, or is this one of those hybrid cases where the right answer is “both, but each doing a different thing”?

Production is usually uglier than a clean choice — there are budget constraints, latency SLAs, risk tolerance, compliance, the team's comfort level. The answers I've marked as correct are the most defensible starting point: where you'd begin if someone handed you a blank whiteboard and said “design this from scratch in April 2026.”

Don't worry about being wrong — each explanation tells you why the starting point is what it is, and more importantly, what signal in the story tipped the answer one way or the other. That's the pattern you're training.

A customer just messaged your bank's help line. You have forty possible support queues to route them into — loan defaults, card replacement, business onboarding, the lot. Who handles this?
A designer says 'build me a settings panel with a theme picker and two tabs'. You want the actual React component back, compiled, ready to drop in.
Someone asks your agent 'what's the weather in Tokyo tomorrow and should I bring an umbrella'. You have fifty tools sitting in a registry, one of which is a weather API. You need to pick the right one and fill out its arguments as valid JSON.
A junior associate drops a 50-page contract on your desk and asks for a real analysis — the risks, the unusual clauses, how it compares to the standard template.
Your docs site gets a question. You've already retrieved the five most relevant passages from the vector store. Now you need to weave them into a single clear answer.
A user is writing a short story and wants a scene drafted — 1500 words, specific tone, a twist at the end, characters they've been developing across several conversations.
Your voice agent hears 'remind me about the Friday review at 3pm'. You need the date, the time, and the intent extracted in under 200 milliseconds, before the user notices a pause.
Someone hands you a hard AIME problem — think olympiad math, multiple non-obvious steps, the kind of thing that would stump most undergraduates.
Your inbox provider processes ten million incoming emails a day. Each one needs to be flagged as spam or legit, and you're paying per call.
A clinician asks 'what's the current first-line treatment for Stage II colon cancer in patients over 70?' — they want a real answer, with citations to actual papers, and no confabulation.
A user hands your agent an ambiguous goal — 'plan my trip to Kyoto next month, I want it to be relaxed but I don't want to miss anything real'. You need to break this into a ten-step workflow and actually execute it.
Your support desk receives short chat messages in a dozen languages — Spanish, Hindi, Vietnamese, Arabic — and needs to translate them into English so the support team can answer.

What you're actually pattern-matching on

Look at the ones you got wrong. If most of your misses went bigger than the right answer — if you thought “I'll just reach for the LLM, it's safer” — that's the real lesson of this exercise. For about five years the industry default was “use the biggest thing you can afford.” The default is quietly flipping: use the smallest thing that can do the job reliably, and reach for a bigger one only when something in the task genuinely requires it. That's not a contrarian take; it's the direct consequence of the cost curve you saw earlier.

Signals that tip a task toward a small model: narrow label set, known schema, fixed vocabulary, high volume, tight latency, you own the input distribution. Signals that tip it toward a big model: open-ended input, genuine reasoning across multiple facts, creative writing of any meaningful length, or a task the system has never seen before and can't be retrained on. Hybrid shows up whenever one hop wants one and a different hop wants the other — which in practice is more often than you'd think.
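Those signals can be written down as a checklist. Here's a toy sketch of that decision as code — the signal names and the tallying rule are illustrative assumptions for this lesson, not a production rubric:

```python
# Toy routing heuristic over the small-model / big-model signals above.
# Signal names and the simple tallying rule are assumptions for illustration.

def route(task: dict) -> str:
    """Return 'slm', 'llm', or 'hybrid' from coarse task signals."""
    small = sum([
        task.get("label_set_is_narrow", False),
        task.get("schema_is_known", False),
        task.get("volume_is_high", False),
        task.get("latency_is_tight", False),
        task.get("input_distribution_is_owned", False),
    ])
    big = sum([
        task.get("input_is_open_ended", False),
        task.get("needs_multi_fact_reasoning", False),
        task.get("needs_long_creative_output", False),
        task.get("task_is_novel", False),
    ])
    if small and big:
        return "hybrid"  # different hops want different models
    return "slm" if small >= big else "llm"

# e.g. the spam-flagging story: high volume, narrow labels, owned inputs
print(route({
    "label_set_is_narrow": True,
    "volume_is_high": True,
    "input_distribution_is_owned": True,
}))  # -> slm
```

Notice that `hybrid` falls out naturally: the moment any small-model signal and any big-model signal fire on the same task, you're looking at two hops, not one model.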

There's a more precise way to think about “narrow.” Every production task has two separable difficulties. The first is distribution shift: how far the real-world inputs sit from the base model's pretraining distribution. The second is a capability ceiling: whether the task requires a cognitive operation (multi-hop reasoning, novel code synthesis, long-range coherence) that the architecture can't perform at any amount of fine-tuning. A narrow task — in the sense this lesson means it — is one with high distribution shift but a low capability ceiling. Routing a bank's 40 support queues is wildly out-of-distribution for pretraining (no web corpus contains your internal queue names), but it's cognitively trivial: the model just needs to learn a mapping from phrasings to labels. That's the shape where a fine-tuned SLM dominates, because every parameter you have is spent absorbing your distribution instead of holding onto Shakespeare and Python and Hindi grammar. The xLAM-7B-vs-GPT-4 result on the Berkeley Function Calling Leaderboard is the same phenomenon: tool-calling is massively out-of-distribution for web pretraining, but its capability ceiling is low enough that a specialised 7B saturates it.
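To make “cognitively trivial mapping from phrasings to labels” concrete, here's a deliberately tiny sketch of queue routing with no neural model at all — just bag-of-words overlap against per-queue vocabularies. The queue names and training messages are made up for the sketch; a real system would fine-tune a small model on thousands of labelled tickets, but the *shape* of the task is the same:

```python
# Toy illustration of "high distribution shift, low capability ceiling":
# routing support messages to internal queue names (which no web corpus
# contains) is just a learned mapping from phrasings to labels.
# Queue names and examples are hypothetical.
from collections import Counter

TRAIN = [
    ("my card was stolen, I need a new one", "CARD_REPLACEMENT"),
    ("lost my debit card yesterday",         "CARD_REPLACEMENT"),
    ("I missed a loan payment, what now",    "LOAN_DEFAULT"),
    ("behind on my loan, can we talk terms", "LOAN_DEFAULT"),
    ("opening an account for my company",    "BUSINESS_ONBOARDING"),
    ("how do I register my business here",   "BUSINESS_ONBOARDING"),
]

def tokens(text):
    return text.lower().replace(",", " ").split()

# one bag-of-words "centroid" per queue, built from the labelled examples
centroids = {}
for text, label in TRAIN:
    centroids.setdefault(label, Counter()).update(tokens(text))

def route_queue(message):
    """Pick the queue whose training vocabulary overlaps the message most."""
    words = set(tokens(message))
    return max(centroids, key=lambda q: sum(centroids[q][w] for w in words))

print(route_queue("lost my card, send a replacement"))  # CARD_REPLACEMENT
```

Everything this toy spends its “parameters” on is absorbing your distribution — which phrasings map to which queue. That's exactly the budget argument for the fine-tuned SLM: none of its capacity is wasted holding onto Shakespeare, Python, or Hindi grammar.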

The tasks where SLMs still lose are the ones where the capability ceiling is the binding constraint. AIME math isn't out-of-distribution — the problems look like the ones in training — but the cognitive operation (multi-step deductive search with backtracking) is one that seems to require a model big enough to host parallel reasoning traces. Phi-4-mini-reasoning at 3.8B tops out around AMC-10 level (~80% on GSM8K, ~35% on MATH-500); o3-mini and DeepSeek-R1 clear AIME. No amount of fine-tuning a 1B model on math data closes that gap, because the ceiling isn't about what the model has seen — it's about what the forward pass can compute in one sweep. Knowing which failure you're looking at — distribution shift or capability ceiling — is the whole diagnostic.

The rest of Microscale is the working-out of this instinct. Act II shows you how SLMs actually work inside. Act III is the bestiary of ones you can use today. Act IV is how they learn to be that good. Act V is where they still break. Acts VI through IX take you from “I want a specialist” to “the specialist is live, serving traffic, paying for itself.”