Three-stage curriculum
Walk through SmolLM3's 11T-token training pipeline — web-scale breadth, math/code specialization, and reasoning annealing — with a scrubbable timeline
Pretraining is not a single phase
SmolLM3's 11.2-trillion-token pretraining run is the clearest public example of curriculum pretraining for a small language model (SLM). It has three stages, each with a different data mix, ordered so that the most signal-dense data is fed last, when the learning rate is low and updates land disproportionately on output-adjacent layers.
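A minimal sketch of how a staged data mix can be selected by training progress. The stage names, boundaries, and mixture weights below are placeholder values chosen for illustration, not SmolLM3's published numbers, and `Stage` and `stage_at` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    end_fraction: float      # stage ends at this fraction of total tokens
    mix: dict[str, float]    # sampling weight per data source, sums to 1


# Placeholder values for illustration only; NOT SmolLM3's published mixture.
CURRICULUM = [
    Stage("stage1_web_breadth", 0.70, {"web": 0.85, "code": 0.10, "math": 0.05}),
    Stage("stage2_math_code", 0.90, {"web": 0.70, "code": 0.20, "math": 0.10}),
    Stage("stage3_annealing", 1.00, {"web": 0.60, "code": 0.25, "math": 0.15}),
]


def stage_at(progress: float) -> Stage:
    """Return the stage active at `progress`, a fraction in [0, 1] of total tokens."""
    if not 0.0 <= progress <= 1.0:
        raise ValueError("progress must be in [0, 1]")
    for stage in CURRICULUM:
        if progress < stage.end_fraction:
            return stage
    # progress == 1.0 falls through the loop; it belongs to the last stage.
    return CURRICULUM[-1]


if __name__ == "__main__":
    for p in (0.0, 0.75, 0.95):
        s = stage_at(p)
        print(f"progress={p:.2f} -> {s.name} {s.mix}")
```

A data loader would call `stage_at` with the fraction of tokens consumed so far and sample each batch's sources according to the returned weights.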
Context length is on its own curriculum
Data mix is only one axis of the curriculum. SmolLM3 also ramps context length. The main pretraining stages train at a 4k context window. A dedicated long-context extension phase then switches to 32k by raising the RoPE base θ to 1.5M, followed by a final push to 64k with θ raised again to 5M. Both steps raise the RoPE base directly rather than applying NTK-aware scaling. At inference, SmolLM3 reaches contexts up to 128k via YaRN. All of this sits on a backbone with NoPE layers (rotary position information is dropped from every fourth layer). The reason for the ramp is not just hardware: training at 64k from scratch spends attention compute on long-range dependencies that the model is not yet ready to learn. Short context early teaches local syntax; long context late teaches retrieval and multi-document reasoning on top of an already-competent base.
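To make the effect of raising θ concrete, the sketch below computes the wavelength (in tokens) of each rotary frequency pair under the standard RoPE formulation, where pair i rotates by θ^(-2i/d) radians per token. The head dimension of 128 is an assumption for illustration, and `rope_wavelengths` is a hypothetical helper, not a library function.

```python
import math


def rope_wavelengths(theta: float, head_dim: int) -> list[float]:
    """Wavelength in tokens of each rotary frequency pair for RoPE base `theta`."""
    # Pair i rotates by theta ** (-2 * i / head_dim) radians per token,
    # so it completes one full turn every 2 * pi * theta ** (2 * i / head_dim) tokens.
    return [2 * math.pi * theta ** (2 * i / head_dim) for i in range(head_dim // 2)]


HEAD_DIM = 128  # assumed head dimension, for illustration only

if __name__ == "__main__":
    # (theta, context length) pairs taken from the two extension steps in the text.
    for theta, context in [(1.5e6, 32_768), (5e6, 65_536)]:
        waves = rope_wavelengths(theta, HEAD_DIM)
        # Pairs whose wavelength exceeds the context never wrap around within it,
        # so they encode position unambiguously across the whole window.
        unwrapped = sum(w > context for w in waves)
        print(f"theta={theta:.1e} context={context}: "
              f"longest wavelength={max(waves):.3e} tokens, "
              f"{unwrapped}/{len(waves)} pairs longer than the context")
```

A larger base stretches every wavelength except the fastest pair, which is why the base is raised each time the context window grows.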
Why the annealing stage is load-bearing
The last 10% of pretraining is not just “more of the same.” The learning rate has decayed far enough that weight updates are small and focused. Those small updates land predominantly on output-adjacent layers, the layers closest to the logits that determine what the model actually emits.
The mechanism, concretely: SmolLM3 uses a WSD (warmup-stable-decay) schedule rather than cosine decay. After warmup, the LR stays flat at its peak (2e-4) through stages 1 and 2, then decays linearly to near-zero across stage 3. If the integrated learning rate is taken as a rough proxy for total parameter movement, stage 3 contributes only the triangle under the decay. A linear ramp over the last 10% of steps has half the area of a flat 10%, so stage 3 accounts for about 0.05 out of roughly 0.95 peak-LR units, or around 5% of the total update budget. That small budget gets spent on whatever tokens you feed it. Hägele et al. 2024 (“Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations”) established that a constant LR followed by a cooldown phase is a predictable, well-scaling alternative to cosine. Hu et al. 2024 (MiniCPM) and the SmolLM3 team then showed the practical move: shifting the cooldown data toward higher-quality math, code, and reasoning lifts downstream benchmarks by several points without increasing total tokens. The annealing stage is, effectively, a free extra training run on a different distribution.
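The area-under-the-curve argument can be checked numerically. The sketch below implements a generic WSD schedule and measures what share of the summed learning rate falls in the decay phase. The peak LR of 2e-4 and the 10% decay window come from the text; the warmup length, the total step count, and the function names are assumptions for illustration, and summed LR is only a rough proxy for actual parameter movement.

```python
def wsd_lr(step: int, total_steps: int, peak_lr: float = 2e-4,
           warmup_steps: int = 2_000, decay_fraction: float = 0.10) -> float:
    """Learning rate at `step` for a warmup-stable-decay schedule with linear decay."""
    decay_start = total_steps - round(total_steps * decay_fraction)
    if step < warmup_steps:
        # Linear warmup from near zero up to the peak.
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        # Stable phase: hold the peak.
        return peak_lr
    # Linear decay from the peak toward zero at the final step.
    return peak_lr * (total_steps - step) / (total_steps - decay_start)


def decay_share(total_steps: int, decay_fraction: float = 0.10) -> float:
    """Fraction of the summed learning rate that is spent during the decay phase."""
    lrs = [wsd_lr(s, total_steps, decay_fraction=decay_fraction)
           for s in range(total_steps)]
    decay_start = total_steps - round(total_steps * decay_fraction)
    return sum(lrs[decay_start:]) / sum(lrs)


if __name__ == "__main__":
    total = 100_000  # assumed step count, for illustration only
    print(f"decay phase share of integrated LR: {decay_share(total):.3f}")
```

Under these assumptions the decay phase covers 10% of the steps but only a little over 5% of the integrated learning rate, which is the sense in which the annealing budget is small.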
Feed those layers reasoning traces and the model learns to produce reasoning traces. Feed them instruction-formatted dialogue and the model learns to answer instructions. Feed them junk and junk gets burned into the output. This is why you can sometimes get a surprisingly instruction-following base model out of SmolLM3 — part of the instruction signal is baked into pretraining itself.