Pretraining is not a single phase
SmolLM3's 11.2-trillion-token pretraining run is the clearest public example of curriculum pretraining for an SLM. Three stages, each with a different data mix, designed so that the most signal-dense data is fed last — when the learning rate is low and updates land disproportionately on output-adjacent layers.
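A staged data curriculum like this can be sketched as a token-indexed sampling schedule. The stage boundaries and mix weights below are illustrative placeholders, not SmolLM3's published numbers:

```python
import random

# Illustrative three-stage curriculum: (token budget, {source: weight}).
# Budgets and weights are hypothetical, chosen only to sum to ~11.2T tokens.
STAGES = [
    (8.0e12, {"web": 0.85, "code": 0.10, "math": 0.05}),
    (2.0e12, {"web": 0.70, "code": 0.20, "math": 0.10}),
    (1.2e12, {"web": 0.30, "code": 0.30, "math": 0.20, "reasoning": 0.20}),
]

def mix_for_token(t: float) -> dict:
    """Return the sampling mix in effect after t tokens have been seen."""
    seen = 0.0
    for budget, mix in STAGES:
        seen += budget
        if t < seen:
            return mix
    return STAGES[-1][1]  # past the planned run: keep the final mix

def sample_source(t: float, rng: random.Random) -> str:
    """Draw one data source according to the mix active at token t."""
    mix = mix_for_token(t)
    sources, weights = zip(*mix.items())
    return rng.choices(sources, weights=weights, k=1)[0]
```

The point of the structure is that the signal-dense sources (here the hypothetical "reasoning" bucket) only appear in the final stage's mix, exactly when the decaying learning rate makes every token count most.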
Context length is on its own curriculum
Data mix is only one axis of the curriculum. SmolLM3 also ramps context length: stages 1–2 train at a 4k context window, then a dedicated long-context extension phase switches to 32k, followed by a final push to 64k using NTK-aware RoPE scaling (the inverse-frequency base θ is rescaled so the model interpolates rather than extrapolates the rotary embeddings). The reason for the ramp is not just hardware: training a model at 64k from scratch wastes attention compute on sequences that do not yet contain long-range dependencies worth learning. Short context early teaches local syntax; long context late teaches retrieval and multi-document reasoning on top of an already-competent base.
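The NTK-aware rescaling mentioned above follows a standard formula: grow the RoPE base θ so the lowest rotary frequencies stretch to cover the new window. A minimal sketch, where the base value and head dimension are generic defaults rather than SmolLM3's exact hyperparameters:

```python
def ntk_rope_base(base: float, orig_ctx: int, new_ctx: int, head_dim: int) -> float:
    """NTK-aware scaling: raise the inverse-frequency base so positions up to
    new_ctx are interpolated rather than extrapolated.
    Standard rule: base' = base * s^(d / (d - 2)), with s = new_ctx / orig_ctx."""
    s = new_ctx / orig_ctx
    return base * s ** (head_dim / (head_dim - 2))

def rope_inv_freqs(base: float, head_dim: int) -> list:
    """Per-pair rotary frequencies 1 / base^(2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]
```

With `base=10000`, `head_dim=64`, and a 4k→64k extension (s=16), the rescaled base is larger by a factor of 16^(64/62), which slows the low-frequency rotations enough that 64k positions fall inside the range the model saw during short-context training.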
Why the annealing stage is load-bearing
The last 10% of pretraining is not just “more of the same.” By then the learning rate has decayed far enough that weight updates are small and focused, and those small updates land predominantly on output-adjacent layers: the layers closest to the logits, which determine what the model actually emits.
The mechanism, concretely: SmolLM3 uses a WSD (warmup-stable-decay) schedule rather than a cosine schedule. The LR stays flat at its peak (~3e-4) through stage 1, holds through stage 2, then decays linearly to near-zero across stage 3. Because the decay is linear and short, the total parameter update integrated over stage 3 is roughly proportional to the area under the LR curve there, which is a small fraction of the total training update budget. That small budget gets spent on whatever tokens you feed it. Hägele et al. 2024 (“Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations”) showed that the WSD schedule's decay phase is where most of the benchmark improvement lands: swap the decay-phase data from web text to reasoning traces and you can gain several points on MMLU without any increase in total tokens. The annealing stage is, effectively, a free extra training run on a different distribution.
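A WSD schedule is a few lines of code, and summing the area under it makes the “small update budget” claim concrete. The step counts and warmup/decay fractions below are illustrative, not SmolLM3's actual values:

```python
def wsd_lr(step: int, total: int, peak: float = 3e-4,
           warmup_frac: float = 0.01, decay_frac: float = 0.10) -> float:
    """Warmup-stable-decay: linear warmup, flat plateau, linear decay to zero."""
    warmup_end = int(total * warmup_frac)
    decay_start = int(total * (1 - decay_frac))
    if step < warmup_end:
        return peak * step / max(warmup_end, 1)        # warmup ramp
    if step < decay_start:
        return peak                                    # stable plateau
    return peak * (total - step) / max(total - decay_start, 1)  # linear decay

total = 100_000
area = sum(wsd_lr(s, total) for s in range(total))
decay_area = sum(wsd_lr(s, total) for s in range(90_000, total))
# The decay phase holds ~5% of the total LR area despite being 10% of the steps:
# a linear ramp to zero integrates to half the plateau rate over the same span.
```

This is the quantitative sense in which the annealing data is cheap to swap: whatever distribution occupies those last steps claims only a sliver of the cumulative update magnitude, yet (per Hägele et al.) a disproportionate share of the benchmark movement.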
Feed those layers reasoning traces and the model learns to produce reasoning traces. Feed them instruction-formatted dialogue and the model learns to answer instructions. Feed them junk and junk gets burned into the output. This is why you can sometimes get a surprisingly instruction-following base model out of SmolLM3 — part of the instruction signal is baked into pretraining itself.