The emergence cliff

Scrub the parameter slider and watch capabilities appear, then learn why more than 92% of BIG-Bench's reported emergence traces back to just two discontinuous or nonlinear metrics

Some abilities appear to turn on all at once

Wei et al. 2022 coined “emergent abilities” for capabilities that are absent below a certain model scale and present above it — with a sharp, almost discontinuous-looking transition. Arithmetic, multi-step reasoning, chain-of-thought-following, format compliance: all of these appear to pop into existence around specific parameter counts.

Schaeffer, Miranda, and Koyejo (2023, “Are Emergent Abilities of Large Language Models a Mirage?”) pushed back: emergence is partly a metric effect. If you measure exact-match accuracy (a 0/1 judgment), small per-token improvements compound into a sudden apparent jump. If you measure the log-probability of the correct answer, or even token-level edit distance, the same capability grows smoothly.

Their argument has two parts. First, the diagnostic: more than 92% of the BIG-Bench tasks that Wei et al. flagged as emergent had been scored with just two metrics, Multiple Choice Grade (discontinuous) and Exact String Match (nonlinear). Both gate continuous capability behind a 0/1 decision. Second, the demonstration: when they re-scored representative tasks with linear, continuous metrics such as token edit distance, the cliffs flattened into smooth, continuous curves. On this reading, much of the apparent emergence was an artifact of how we measured, not of how models scaled.
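To make the re-scoring idea concrete, here is a minimal sketch that grades the same predictions with exact match and with a normalized token edit distance. The target answer, the predictions, and the helper names (`token_edit_distance`, `edit_similarity`) are illustrative assumptions, not the scoring code used in the paper, and each digit is treated as one token for simplicity.

```python
def token_edit_distance(pred, target):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(target) + 1))
    for i, p in enumerate(pred, start=1):
        curr = [i]
        for j, t in enumerate(target, start=1):
            cost = 0 if p == t else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def exact_match(pred, target):
    """0/1 score: any single wrong token gives zero credit."""
    return 1 if list(pred) == list(target) else 0


def edit_similarity(pred, target):
    """Continuous score in [0, 1]: 1 minus the normalized edit distance."""
    denom = max(len(pred), len(target), 1)
    return 1 - token_edit_distance(pred, target) / denom


# Hypothetical answer to a 3-digit addition problem, one token per digit.
target = list("579")
# Hypothetical predictions from progressively stronger models.
predictions = [list("123"), list("523"), list("573"), list("579")]

for pred in predictions:
    print("".join(pred),
          "exact:", exact_match(pred, target),
          "edit similarity:", round(edit_similarity(pred, target), 2))
```

Exact match stays at zero until the last prediction, while edit similarity gives partial credit for each additional correct digit. That is the kind of gap between the two metrics the re-scoring argument relies on.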

The underlying mechanism is simple once you see it. Under greedy decoding, exact match only flips to 1 when the correct token's log-probability exceeds that of every distractor. A model can climb steadily from p=0.001 to p=0.49 on the right answer and score zero the entire way; the moment it crosses p=0.5 (assuming a single dominant distractor holds the rest of the probability mass), the metric snaps to 1. The smoothness is in the capability; the discontinuity is in the yardstick. Both stories are partly right: a handful of abilities (modular arithmetic, multi-step symbolic manipulation) do look threshold-like even under log-prob, but most of the famous “emergent” BIG-Bench tasks were metric artifacts.
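The thresholding described above can be sketched in a few lines. Everything here is an illustrative assumption rather than a fit to any real model: the logistic curve in `per_token_prob`, its `midpoint` and `slope`, the 5-token answer length, and the independence of per-token errors. The point is only that a smooth per-token curve turns into a cliff once a score requires every token to be right, or requires the correct token to outrank its best distractor.

```python
import math


def per_token_prob(log10_params, midpoint=9.5, slope=1.2):
    """Hypothetical smooth curve: probability of getting one answer token right."""
    return 1 / (1 + math.exp(-slope * (log10_params - midpoint)))


def expected_exact_match(p_token, answer_len):
    """Every token must be right; assumes independent per-token errors."""
    return p_token ** answer_len


def greedy_exact_match(p_correct, p_best_distractor):
    """Greedy decoding emits the correct token only if it outranks every distractor."""
    return 1 if p_correct > p_best_distractor else 0


for log_n in [8.0, 9.0, 10.0, 11.0]:
    p = per_token_prob(log_n)
    print(f"10^{log_n:.0f} params  "
          f"per-token p={p:.3f}  "
          f"log p={math.log(p):.2f}  "
          f"exact match (5 tokens)={expected_exact_match(p, 5):.3f}")

# Single dominant distractor: the score flips as soon as p crosses 0.5.
print(greedy_exact_match(0.49, 0.51), greedy_exact_match(0.51, 0.49))
```

The log-probability column changes gradually across the sweep, while the 5-token exact-match column stays near zero for the smaller sizes and then climbs steeply. Raising `answer_len` sharpens the apparent cliff further without changing the underlying curve.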

[Interactive figure: a slider over log₁₀(N) parameters. At 10^9.0 = 1.0B, 3-digit addition scores 14% on exact match and 21% on log-prob; chain-of-thought and reliable tool calling both read 0% on log-prob. Two panels, "exact-match metric: looks like a cliff" and "log-probability metric: smooth growth", each plot 3-digit addition, chain-of-thought, and reliable tool calling.]

Why the threshold moves over time

The 2022 CoT threshold was ~62B params. Today's 3.8B Phi-4-mini reasons better than that 62B model did. The threshold didn't disappear; it moved. Better data (Phi's textbook-style synthetic data), distillation from larger teachers (Llama 3.2), and curriculum pretraining (SmolLM3) all push the threshold down. Emergence is a function of training quality, not just parameter count.

But the shape of the phenomenon survives. Today's 3B-class models can do what 2022's ~62B models could; they cannot do what today's 70B models can. There is always a next tier of capability that only the next tier of scale reliably unlocks.

comprehension check

Why does the same capability look 'emergent' on exact-match metrics but 'smooth' on log-probability?
