lesson museum · 10 min · 40 xp

The model museum

Explore every major SLM — Phi-4, Llama 3.2, Qwen3, Gemma 3, SmolLM3, BitNet — with architecture diagrams, training recipes, and benchmarks

Lab 04 · Model Autopsy · 45–60 min

The bestiary

Eight current specimens. These are the models I would actually consider for an SLM specialization project in April 2026. Every entry has a full technical report, open weights, and demonstrated production use.

specimen dossier

Phi-4-mini · Microsoft · 3.8B · MIT

Headline: 200k vocab + tied embeddings + Phi-4's synthetic recipe.

Training: ~5T tokens, incorporating the synthetic data built for Phi-4 (14B); the 50-types / 400B-tokens recipe is described in the report for that larger model. Post-training: SFT + DPO; the reasoning variant adds RL with verifiable rewards.

Best for: Structured output, reasoning, tool calling.
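To see why a 200k vocabulary and tied embeddings go together, here is a rough parameter count. It is a sketch only: the hidden size of 3,072 is an assumed value for illustration, the vocabulary is rounded to 200,000, and `embedding_params` is a hypothetical helper, not part of any library.

```python
def embedding_params(vocab_size: int, hidden_size: int, tied: bool) -> int:
    """Parameters spent on the input embedding table plus the output (LM head) projection.

    With tied embeddings the same matrix serves both roles, so it is counted once.
    """
    table = vocab_size * hidden_size
    return table if tied else 2 * table


# Assumed, illustrative dimensions (not taken from the dossier).
VOCAB = 200_000
HIDDEN = 3_072
TOTAL_PARAMS = 3_800_000_000  # the 3.8B headline figure

tied = embedding_params(VOCAB, HIDDEN, tied=True)
untied = embedding_params(VOCAB, HIDDEN, tied=False)
saved = untied - tied

print(f"tied:   {tied:,} parameters")
print(f"untied: {untied:,} parameters")
print(f"saved:  {saved:,} ({saved / TOTAL_PARAMS:.1%} of a 3.8B budget)")
```

Under these assumptions, untying would cost roughly another 0.6B parameters, which is a large share of a sub-4B budget. That is the usual argument for tying when the vocabulary is this big.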

Pattern recognition

Look across the dossiers and you'll notice every model makes the same small number of choices: SwiGLU, GQA, RoPE, RMSNorm. Where they differ:

Data strategy — textbook synthetic (Phi), distillation from a bigger sibling (Llama 3.2), pure quality curriculum (SmolLM3), multilingual-heavy (Qwen3)
Post-training — SFT + DPO is the minimum; reasoning variants add RLVR
Architecture twists — Gemma 3's local-global, SmolLM3's NoPE, BitNet's ternary
License — Apache 2.0 preferred when possible
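Two of the shared choices, RMSNorm and SwiGLU, are small enough to write out by hand. The sketch below uses plain Python lists so it runs without dependencies; the function names, the toy weights, and the `eps` default are illustrative assumptions, and real models use tensor libraries and learned weights.

```python
import math


def rms_norm(x, gain, eps=1e-6):
    """Divide x by its root-mean-square, then apply a per-feature gain."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for v, g in zip(x, gain)]


def silu(v):
    """SiLU / swish activation: v * sigmoid(v)."""
    return v / (1.0 + math.exp(-v))


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: down( silu(gate(x)) * up(x) )."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)


# Toy usage with identity weights, purely for illustration.
identity = [[1.0, 0.0], [0.0, 1.0]]
normed = rms_norm([3.0, 4.0], gain=[1.0, 1.0])
out = swiglu_ffn([1.0, 0.0], identity, identity, identity)
print(normed, out)
```

Note that RMSNorm skips the mean subtraction that LayerNorm performs, and SwiGLU needs three weight matrices where a classic MLP block needs two.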

The family trees matter more than the individual dossiers.

Phi runs Phi-1 (1.3B, 2023) → Phi-1.5 → Phi-2 (2.7B) → Phi-3-mini (3.8B) → Phi-3.5-mini → Phi-4-mini. Every generation doubles down on Microsoft's “textbooks are all you need” thesis, and the SFT recipe is more valuable than the weights.

Llama runs Llama-1 (research release in Feb 2023, leaked on 4chan on March 3) → Llama-2 (commercial license; the paper applies GQA to the 34B and 70B models, and of those two only the 70B was publicly released) → Llama-3 (128k vocab, untied embeddings) → Llama-3.1 (405B teacher) → Llama-3.2 (1B / 3B distilled from both 3.1-8B and 3.1-70B with logit KD; tied embeddings reintroduced for the small variants).

Qwen runs 1 → 1.5 → 2 → 2.5 → 3, picking up YaRN long context at Qwen2 and the thinking-mode toggle at Qwen3.

When a “new” SLM drops, the first question is: which lineage does it extend, and what did the parent already know? Most of what the child can do, it inherited.
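The "logit KD" mentioned for Llama-3.2 can be illustrated with the textbook distillation loss: the student is trained to match the teacher's softened output distribution. This is a generic sketch, not Meta's actual recipe; the temperature value, the T² scaling, and the toy logits are all assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the conventional correction that keeps gradient
    magnitudes comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)  # teacher target
    q = softmax(student_logits, temperature)  # student prediction
    kl = sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Toy next-token logits over a 4-token vocabulary (made-up numbers).
teacher = [4.0, 1.0, 0.5, -2.0]
close_student = [3.5, 1.2, 0.4, -1.5]
far_student = [-2.0, 0.5, 1.0, 4.0]

print(kd_loss(close_student, teacher))
print(kd_loss(far_student, teacher))
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge, which is why the student inherits the teacher's ranking over tokens and not only its top pick.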

MMXXVI

historical note

Feb 2023 → Apr 2026 · three years of open SLMs

Feb 2023: Meta releases LLaMA-1 to approved researchers; the weights leak to 4chan on March 3, kicking off the open era. Sep 2023: Mistral-7B beats Llama-2-13B and proves European labs can ship. Apr 2024: Phi-3-mini lands and is the first sub-4B model anyone takes seriously. Jul 2024: Llama-3.1 ships, with the 405B positioned as a teacher for distillation. Sep 2024: Llama-3.2-1B/3B ship as distilled students of the Llama-3.1 family. 2025: thinking-mode toggles (Qwen3, Gemma 3), ternary weights (BitNet), and hybrid attention (Gemma 3's 5:1 local-to-global layer ratio) become standard ideas. The field went from “can a 7B be useful?” to “which 1.5B reasoning distillate should I fine-tune?” in thirty-six months.

In the next lesson we run head-to-head comparisons between some of these on specific benchmarks — and learn why most of those numbers lie a little.
