The model museum
Explore every major SLM — Phi-4, Llama 3.2, Qwen3, Gemma 3, SmolLM3, BitNet — with architecture diagrams, training recipes, and benchmarks
Lab 04 · Model Autopsy · 45–60 min
The bestiary
Eight current specimens. These are the models I would actually consider for an SLM specialization project in April 2026. Every entry has a full technical report, open weights, and demonstrated production use.
Pattern recognition
Look across the dossiers and you'll notice every model makes the same small number of choices: SwiGLU, GQA, RoPE, RMSNorm. Where they differ:
- Data strategy: textbook-style synthetic data (Phi), distillation from a bigger sibling (Llama 3.2), pure quality curriculum (SmolLM3), multilingual-heavy (Qwen3)
- Post-training: SFT + DPO is the minimum; reasoning variants add RLVR (reinforcement learning with verifiable rewards)
- Architecture twists: Gemma 3's interleaved local and global attention layers, SmolLM3's NoPE layers (no positional embedding), BitNet's ternary weights
- License: Apache 2.0 preferred when possible
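To make two of the shared choices concrete, here is a minimal pure-Python sketch of RMSNorm and a SwiGLU feed-forward layer. It is illustrative only: the function names (`rms_norm`, `swiglu_ffn`, `matvec`), the list-of-lists weight layout, and the `eps` default are assumptions for this example, not any specific model's implementation.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root-mean-square, then apply a per-feature gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def silu(z):
    """SiLU (swish) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down( silu(gate(x)) * up(x) )."""
    gate = [silu(g) for g in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

# Toy shapes: model width 2, hidden width 3 (hypothetical values).
x = [1.0, 2.0]
normed = rms_norm(x, gain=[1.0, 1.0])
w_gate = [[0.5, 0.0], [0.0, 0.5], [1.0, 1.0]]
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
w_down = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
print(swiglu_ffn(normed, w_gate, w_up, w_down))
```

Note that SwiGLU needs three weight matrices where a classic two-layer MLP needs two, which is one reason parameter counts for the feed-forward block are worth reading carefully when comparing models.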
The family trees matter more than the individual dossiers.
Phi runs Phi-1 (1.3B, 2023) → Phi-1.5 → Phi-2 (2.7B) → Phi-3-mini (3.8B) → Phi-3.5-mini → Phi-4-mini. Every generation doubles down on Microsoft's “textbooks are all you need” thesis, and the SFT recipe is more valuable than the weights.
Llama runs Llama-1 (Feb 2023 research release, leaked on 4chan on March 3) → Llama-2 (commercial license; the paper applies GQA to the 34B and 70B models, but of those two only the 70B was publicly released) → Llama-3 (128k vocab, untied embeddings) → Llama-3.1 (405B teacher) → Llama-3.2 (1B / 3B distilled from both 3.1-8B and 70B with logit KD; tied embeddings reintroduced for the small variants).
Qwen runs 1 → 1.5 → 2 → 2.5 → 3, picking up YaRN long context at Qwen2 and the thinking-mode toggle at Qwen3.
When a “new” SLM drops, the first questions are: which lineage does it extend, and what did the parent already know? Most of what the child can do, it inherited.
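As a rough illustration of what "logit KD" means in the Llama-3.2 entry, the sketch below computes a generic distillation loss for one token position: the KL divergence between temperature-softened teacher and student distributions. This is the textbook formulation, not Meta's actual recipe; the function names, the temperature value, and the T² scaling are assumptions for this example.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)  # teacher targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * (math.log(pi) - math.log(qi))
             for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2

# Hypothetical logits over a 3-token vocabulary.
teacher = [1.0, 2.0, 3.0]
print(kd_loss([1.0, 2.0, 3.0], teacher))  # student matches teacher
print(kd_loss([3.0, 2.0, 1.0], teacher))  # student disagrees: larger loss
```

In practice this term is averaged over every token position and often mixed with the ordinary next-token cross-entropy; the point here is only that the student is trained against the teacher's full output distribution, not just the single correct token.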
In the next lesson we run head-to-head comparisons between some of these on specific benchmarks — and learn why most of those numbers lie a little.