Microscale
Act IV · How They Learn
lesson three-stages · 9 min · 45 xp

Three-stage curriculum

SmolLM3's scrubbable timeline

Pretraining is not a single phase

SmolLM3's 11.2-trillion-token pretraining run is the clearest public example of curriculum pretraining for an SLM. Three stages, each with a different data mix, designed so that the most signal-dense data is fed last — when the learning rate is low and updates land disproportionately on output-adjacent layers.

0–8T (Stable): 85% web (FineWeb-Edu, DCLM), 12% code, 3% math. Broad world model.

8–10T (Shift): Drop low-quality web, boost code and math. Begin LR decay.

10–11.2T (Anneal): Inject reasoning traces, clean math, textbook-style data. LR near zero.

Context length is on its own curriculum

Data mix is only one axis of the curriculum. SmolLM3 also ramps context length: Stages 1–2 train at a 4k context window, then a dedicated long-context extension phase switches to 32k, followed by a final push to 64k using NTK-aware RoPE scaling (the inverse-frequency base θ is rescaled so the model interpolates rather than extrapolates the rotary embeddings). The reason for the ramp is not just hardware: training a model at 64k from scratch wastes attention compute on sequences that have no long-range dependencies yet worth learning. Short context early teaches local syntax; long context late teaches retrieval and multi-document reasoning on top of an already-competent base.
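The NTK-aware rescaling mentioned above can be sketched in a few lines. The standard formula rescales the rotary base by the context-extension factor raised to d/(d−2); the base of 10,000 and head dimension of 64 below are illustrative defaults, not SmolLM3's exact configuration.

```python
def ntk_scaled_base(base: float, old_ctx: int, new_ctx: int, head_dim: int) -> float:
    """NTK-aware RoPE scaling: grow the rotary base so that positions beyond
    the trained window interpolate between learned frequencies instead of
    extrapolating past them.  base' = base * s**(d / (d - 2)), s = new/old."""
    s = new_ctx / old_ctx
    return base * s ** (head_dim / (head_dim - 2))

def rope_inv_freq(base: float, head_dim: int) -> list[float]:
    """Per-pair inverse frequencies used by rotary embeddings."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

# Extending 4k -> 32k (illustrative base/head_dim, not SmolLM3's config):
scaled = ntk_scaled_base(10_000.0, old_ctx=4096, new_ctx=32768, head_dim=64)
```

Raising the base slows the lowest rotary frequencies, which is exactly the "interpolate rather than extrapolate" behavior the paragraph describes: the slowest-rotating dimensions stretch to cover the longer window while the fast, local-syntax dimensions are left nearly untouched.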

Why the annealing stage is load-bearing

The last 10% of pretraining is not just “more of the same.” Learning rate has decayed far enough that weight updates are small and focused. Those small updates land predominantly on output-adjacent layers — the layers closest to the logits that determine what the model actually emits.

The mechanism, concretely: SmolLM3 uses a WSD (warmup-stable-decay) schedule rather than a cosine schedule. The LR stays flat at its peak (~3e-4) through stage 1, holds through stage 2, then decays linearly to near-zero across stage 3. Because the decay is linear and short, the total parameter update integrated over stage 3 is roughly proportional to the area under the LR curve there — a small fraction of the total training update budget. That small budget gets spent on whatever tokens you feed it. Hägele et al. 2024 (“Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations”) showed that the WSD schedule's decay phase is where most of the benchmark improvement lands — swap the decay-phase data from web to reasoning traces and you can gain several points on MMLU without any increase in total tokens. The annealing stage is, effectively, a free extra training run in a different distribution.
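A minimal sketch of that WSD schedule, assuming a short linear warmup (the warmup fraction is an illustrative placeholder; the source only specifies the peak LR and the linear stage-3 decay). The decay fraction corresponds to the final 1.2T of 11.2T tokens.

```python
def wsd_lr(step: int, total: int, peak: float = 3e-4,
           warmup_frac: float = 0.01, decay_frac: float = 0.107) -> float:
    """Warmup-stable-decay: linear warmup, flat hold at peak, then a short
    linear decay to zero.  decay_frac ~ 1.2T / 11.2T tokens (stage 3);
    warmup_frac is an illustrative assumption."""
    warmup_end = int(total * warmup_frac)
    decay_start = int(total * (1 - decay_frac))
    if step < warmup_end:
        return peak * step / warmup_end          # warmup
    if step < decay_start:
        return peak                              # stable hold (stages 1-2)
    # Linear decay across stage 3; integrated LR here is peak * decay_len / 2.
    return peak * (total - step) / (total - decay_start)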

Feed those layers reasoning traces and the model learns to produce reasoning traces. Feed them instruction-formatted dialogue and the model learns to answer instructions. Feed them junk and junk gets burned into the output. This is why you can sometimes get a surprisingly instruction-following base model out of SmolLM3 — part of the instruction signal is baked into pretraining itself.

When you do continued pretraining on a pretrained base, you are effectively starting a new annealing stage. Whatever data you feed during continued pretraining lands disproportionately on output-adjacent layers — so curate ruthlessly, even more than for SFT.