Microscale
Act III: The Current Champions
lesson battle · 8 min · 40 xp

Head-to-head battle

Simulated showdowns on real benchmarks

Benchmarks are noisy, directions are real

Public benchmarks are rough — they capture something real but they're measured with different prompts, different eval harnesses, different random seeds. Do not read them as “Model A is 2 points better than Model B.” Read them as “Model A is in the neighbourhood of Model B's cluster.”
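To put "noisy" in numbers: pure sampling error on a large benchmark is actually small. A back-of-envelope sketch, treating each question as an independent coin flip (an assumption, since real benchmark questions are correlated by subject):

```python
import math

def score_stderr(accuracy: float, n_questions: int) -> float:
    """Standard error of a benchmark score, in points out of 100,
    treating each question as an independent Bernoulli trial."""
    p = accuracy / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_questions)

# MMLU's test split has 14,042 questions; a score around 70 gives:
se = score_stderr(70.0, 14042)
print(f"standard error: {se:.2f} points")     # well under half a point
print(f"95% interval:  +/- {1.96 * se:.2f}")  # under a point
```

Sampling noise alone is under a point, which is exactly why the multi-point swings between reports have to come from somewhere else: prompts, harnesses, and extraction, not statistical chance.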

Worse: the same model scored by two different eval harnesses can differ by 4–8 points. Meta's Llama-3 report uses an internal harness; HuggingFace's Open LLM Leaderboard uses lm-evaluation-harness with 5-shot MMLU and a specific answer-extraction regex. The regex is load-bearing — the same weights, scored against the same 14,042 questions, can move 3 points depending on whether the extractor accepts “The answer is B” as a match for B. When you compare numbers across two blog posts, you are often comparing harnesses, not models. The only fair comparison is a single harness running both candidates on the same hardware with the same seeds.

With that caveat firmly in mind, pick two contenders and a task. The scores below are drawn from publicly reported numbers on the standard leaderboards.

Contender A: Phi-4-mini 3.8B, 70.0 / 100
Contender B: Qwen3-4B, 73.0 / 100 (winner)

Verdict: Qwen3-4B edges ahead. A 3.0-point gap: for a specialization project, this is the kind of gap that closes with good fine-tuning.
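If you want to rerun this matchup yourself under a single harness, lm-evaluation-harness makes the fair version a two-command job. The Hugging Face repo ids below are assumptions; verify the exact instruct checkpoints on the hub before running.

```shell
# Same harness, same task, same few-shot count, same seed for both models.
# Repo ids are assumptions -- check the hub for the exact checkpoint names.
lm_eval --model hf \
  --model_args pretrained=microsoft/Phi-4-mini-instruct \
  --tasks mmlu --num_fewshot 5 --seed 42 --batch_size 8

lm_eval --model hf \
  --model_args pretrained=Qwen/Qwen3-4B \
  --tasks mmlu --num_fewshot 5 --seed 42 --batch_size 8
```

Two numbers produced this way are comparable; two numbers copied from two blog posts are not.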

The lesson of the arena

No SLM dominates every task. Phi-4-mini crushes reasoning and tool-calling; Qwen3 is strongest on multilingual and general knowledge; Llama 3.2 has the best instruction following; Gemma 3 excels at long context; SmolLM3 is the most reproducible. Pick the base that's closest to your target task. Fine-tuning closes small gaps but can't teach a model what its pretraining never showed it.

The more useful move, once you've shortlisted two or three candidates, is to stop reading leaderboards and start reading model cards the way you'd read a spec sheet. A good card (Llama-3.2, Gemma-3, Qwen3, Phi-4) tells you the pretraining token count, the data cutoff, the tokenizer vocab size, the context length the model was trained at versus the length it was stretched to with RoPE rescaling, the eval harness used, and the specific safety-tuning stack. A blog post tells you the headline number. A technical report (the PDF on arXiv or the GitHub repo) tells you the ablations — which choices moved the needle and which were neutral. When a release ships only a blog post and a weights drop with no technical report (looking at you, most 2024 Chinese-lab releases before the DeepSeek-V2 paper), treat the numbers as marketing until somebody else reproduces them. Reproduction is the only benchmark that matters.