Microscale
Act IV · How They Learn
lesson three-stages · 9 min · 45 xp

Three-stage curriculum

SmolLM3's scrubbable timeline

Pretraining is not a single phase

SmolLM3's 11.2-trillion-token pretraining run is the clearest public example of curriculum pretraining for an SLM. Three stages, each with a different data mix, designed so that the most signal-dense data is fed last — when the learning rate is low and updates land disproportionately on output-adjacent layers.

0–8T (Stable): 85% web (FineWeb-Edu, DCLM), 12% code, 3% math. Broad world model.

8–10T (Shift): Drop low-quality web, boost code and math. Begin LR decay.

10–11.2T (Anneal): Inject reasoning traces, clean math, textbook-style data. LR near zero.

Context length is on its own curriculum

Data mix is only one axis of the curriculum. SmolLM3 also ramps context length: Stages 1–2 train at a 4k context window, then a dedicated long-context extension phase switches to 32k, followed by a final push to 64k using NTK-aware RoPE scaling (the inverse-frequency base θ is rescaled so the model interpolates rather than extrapolates the rotary embeddings). The reason for the ramp is not just hardware: training a model at 64k from scratch wastes attention compute on sequences that have no long-range dependencies yet worth learning. Short context early teaches local syntax; long context late teaches retrieval and multi-document reasoning on top of an already-competent base.
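The NTK-aware rescaling mentioned above can be sketched in a few lines. The standard formula rescales the rotary base by the context-extension factor raised to d/(d−2); the base of 10,000 and head dimension of 64 below are illustrative defaults, not SmolLM3's exact configuration.

```python
def ntk_scaled_base(base: float, old_ctx: int, new_ctx: int, head_dim: int) -> float:
    """NTK-aware RoPE scaling: grow the rotary base so that positions beyond
    the trained window interpolate between learned frequencies instead of
    extrapolating past them.  base' = base * s**(d / (d - 2)), s = new/old."""
    s = new_ctx / old_ctx
    return base * s ** (head_dim / (head_dim - 2))

def rope_inv_freq(base: float, head_dim: int) -> list[float]:
    """Per-pair inverse frequencies used by rotary embeddings."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

# Extending 4k -> 32k (illustrative base/head_dim, not SmolLM3's config):
scaled = ntk_scaled_base(10_000.0, old_ctx=4096, new_ctx=32768, head_dim=64)
```

Raising the base slows the lowest rotary frequencies, which is exactly the "interpolate rather than extrapolate" behavior the paragraph describes: the slowest-rotating dimensions stretch to cover the longer window while the fast, local-syntax dimensions are left nearly untouched.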

Why the annealing stage is load-bearing

The last 10% of pretraining is not just “more of the same.” Learning rate has decayed far enough that weight updates are small and focused. Those small updates land predominantly on output-adjacent layers — the layers closest to the logits that determine what the model actually emits.

The mechanism, concretely: SmolLM3 uses a WSD (warmup-stable-decay) schedule rather than a cosine schedule. The LR stays flat at its peak (~3e-4) through stage 1, holds through stage 2, then decays linearly to near-zero across stage 3. Because the decay is linear and short, the total parameter update integrated over stage 3 is roughly proportional to the area under the LR curve there — a small fraction of the total training update budget. That small budget gets spent on whatever tokens you feed it. Hägele et al. 2024 (“Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations”) showed that the WSD schedule's decay phase is where most of the benchmark improvement lands — swap the decay-phase data from web to reasoning traces and you can gain several points on MMLU without any increase in total tokens. The annealing stage is, effectively, a free extra training run in a different distribution.
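A minimal sketch of that WSD schedule, assuming a short linear warmup (the warmup fraction is an illustrative placeholder; the source only specifies the peak LR and the linear stage-3 decay). The decay fraction corresponds to the final 1.2T of 11.2T tokens.

```python
def wsd_lr(step: int, total: int, peak: float = 3e-4,
           warmup_frac: float = 0.01, decay_frac: float = 0.107) -> float:
    """Warmup-stable-decay: linear warmup, flat hold at peak, then a short
    linear decay to zero.  decay_frac ~ 1.2T / 11.2T tokens (stage 3);
    warmup_frac is an illustrative assumption."""
    warmup_end = int(total * warmup_frac)
    decay_start = int(total * (1 - decay_frac))
    if step < warmup_end:
        return peak * step / warmup_end          # warmup
    if step < decay_start:
        return peak                              # stable hold (stages 1-2)
    # Linear decay across stage 3; integrated LR here is peak * decay_len / 2.
    return peak * (total - step) / (total - decay_start)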

Feed those layers reasoning traces and the model learns to produce reasoning traces. Feed them instruction-formatted dialogue and the model learns to answer instructions. Feed them junk and junk gets burned into the output. This is why you can sometimes get a surprisingly instruction-following base model out of SmolLM3 — part of the instruction signal is baked into pretraining itself.

When you do continued pretraining on a pretrained base, you are effectively starting a new annealing stage. Whatever data you feed during continued pretraining lands disproportionately on output-adjacent layers — so curate ruthlessly, even more than for SFT.