lesson battle · 8 min · 40 xp

Head-to-head battle

Pick two small language models and compare benchmark scores on MMLU, HumanEval, GSM8K, and more — interactive head-to-head with real data

Benchmarks are noisy, directions are real

Public benchmarks are rough — they capture something real but they're measured with different prompts, different eval harnesses, different random seeds. Do not read them as “Model A is 2 points better than Model B.” Read them as “Model A is in the neighbourhood of Model B's cluster.”

Worse: the same model scored by two different eval harnesses can differ by 4–8 points. Meta's Llama-3 report uses an internal harness; HuggingFace's Open LLM Leaderboard uses lm-evaluation-harness with 5-shot MMLU and a specific answer-extraction regex. The regex is load-bearing — the same weights, scored against the same 14,042 questions, can move 3 points depending on whether the extractor accepts “The answer is B” as a match for B. When you compare numbers across two blog posts, you are often comparing harnesses, not models. The only fair comparison is a single harness running both candidates on the same hardware with the same seeds.
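To make the extraction effect concrete, here is a minimal sketch that scores the same four outputs with two extractors. The samples and both patterns (`STRICT`, `LENIENT`) are hypothetical illustrations and are not the regex that lm-evaluation-harness actually ships.

```python
import re

# Hypothetical (model output, gold label) pairs; not real benchmark data.
samples = [
    ("B", "B"),
    ("The answer is B", "B"),
    ("C", "C"),
    ("The answer is A", "D"),
]

# Assumed strict extractor: accepts only a bare letter.
STRICT = re.compile(r"^\s*([ABCD])\s*$")
# Assumed lenient extractor: also accepts "The answer is X".
LENIENT = re.compile(r"^\s*(?:The answer is\s+)?([ABCD])\s*$")


def extract(pattern, text):
    """Return the extracted letter, or None when the pattern does not match."""
    match = pattern.search(text)
    return match.group(1) if match else None


def accuracy(pattern, pairs):
    """Percentage of outputs whose extracted letter equals the gold label."""
    correct = sum(extract(pattern, out) == gold for out, gold in pairs)
    return 100.0 * correct / len(pairs)


print(accuracy(STRICT, samples))   # 50.0
print(accuracy(LENIENT, samples))  # 75.0
```

The outputs are identical in both runs; only the extractor changes, and the score moves with it.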

With that caveat firmly in mind, pick two contenders and a task. The bars below are drawn from publicly reported numbers on the standard leaderboards.

Phi-4-mini 3.8B: 70.3 / 100 ▲ winner

Qwen3-4B-Instruct-2507: 61.9 / 100

verdict

Phi-4-mini 3.8B edges ahead

8.4-point gap. For a specialization project, this is the kind of gap that closes with good fine-tuning.
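One way to keep the noise caveat attached to a verdict is to report the gap alongside an assumed noise band. The sketch below is illustrative only: `read_gap` is a hypothetical helper, and the default `noise_band` of 8.0 is an assumption borrowed from the upper end of the 4–8 point cross-harness spread mentioned earlier, not a measured quantity.

```python
def read_gap(score_a: float, score_b: float, noise_band: float = 8.0):
    """Return the absolute gap and a cautious reading of it.

    noise_band is an assumed threshold, not a measured one.
    """
    gap = round(abs(score_a - score_b), 1)
    if gap <= noise_band:
        reading = "within harness noise"
    else:
        reading = "outside the assumed noise band"
    return gap, reading


# Scores taken from the comparison above.
gap, reading = read_gap(70.3, 61.9)
print(f"{gap} points: {reading}")  # 8.4 points: outside the assumed noise band
```

A narrower or wider band changes the reading, which is the point: the label depends on how much harness noise you assume.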

The lesson of the arena

No SLM dominates every task — and the rankings shift the moment you swap benchmark variants or instruction-tuned checkpoints. On the numbers above: Qwen3-4B leads on general knowledge and code; Gemma 3 4B IT leads on math (GSM8K 89.2) and instruction-following (IFEval 90.2); SmolLM3 in thinking mode tops tool-calling (BFCL 88.8); Phi-4-mini is steady across the board without dominating any single column; Llama 3.2 3B is the smallest of the cluster and trades quality for that footprint. Pick the base that's closest to your target task — and remember a 3-point gap on these benchmarks closes faster with good fine-tuning than with model selection.

The more useful move, once you've shortlisted two or three candidates, is to stop reading leaderboards and start reading model cards the way you'd read a spec sheet. A good card (Llama-3.2, Gemma-3, Qwen3, Phi-4) tells you the pretraining token count, the data cutoff, the tokenizer vocab size, the context length it was trained at versus the length it was stretched to with RoPE rescaling, the eval harness used, and the specific safety-tuning stack. A blog post tells you the headline number. A technical report (the PDF on arXiv or the GitHub repo) tells you the ablations: which choices moved the needle and which were neutral. When a release ships only a blog post and a weights drop with no technical report (looking at you, most 2024 Chinese-lab releases before the DeepSeek-V2 paper), treat the numbers as marketing until somebody else reproduces them. Reproduction is the only benchmark that matters.
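The spec-sheet reading can be turned into a simple checklist. The sketch below is one possible shape, not a standard schema: `ModelCardChecklist` and `missing_items` are hypothetical names, and the values filled in are made-up placeholders rather than figures from any real model card.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ModelCardChecklist:
    """Items a good model card should state; None means 'not found on the card'."""
    pretraining_tokens: Optional[str] = None
    data_cutoff: Optional[str] = None
    tokenizer_vocab_size: Optional[int] = None
    trained_context_length: Optional[int] = None
    extended_context_length: Optional[int] = None  # e.g. after RoPE rescaling
    eval_harness: Optional[str] = None
    safety_tuning_stack: Optional[str] = None


def missing_items(card: ModelCardChecklist) -> list:
    """List the checklist fields the card leaves unstated."""
    return [f.name for f in fields(card) if getattr(card, f.name) is None]


# Placeholder values for a hypothetical card, for illustration only.
card = ModelCardChecklist(
    tokenizer_vocab_size=32000,
    trained_context_length=4096,
    extended_context_length=32768,
)
print(missing_items(card))
```

A long list of missing items is a signal to look for the technical report before trusting the headline number.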
