SmolLM3
SmolLM3 is a 3B dense model that uses NoPE layers every 4th block: three RoPE layers, one NoPE layer, repeat. The justification: not every attention layer needs positional bias to learn what it needs to learn.
The 3:1 RoPE-to-NoPE ratio is SmolLM3's signature choice, and it's counterintuitive enough to be worth unpacking. Most transformer layers get positional information via RoPE rotation applied to their query and key vectors. SmolLM3 skips the RoPE on every fourth layer entirely — no position signal at all in those attention blocks. The NoPE lesson shows the long-context benchmark results and the intuition for why this works: letting some layers see tokens "positionally uninformed" seems to improve generalisation to sequence lengths past the training context.
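The layer pattern is easy to express in code. Here is a minimal sketch (not SmolLM3's actual implementation; the function and constant names are illustrative) of applying RoPE to queries and keys in three of every four layers and skipping it on the fourth:

```python
import torch

NOPE_EVERY = 4  # assumed name: NoPE on every 4th layer (3:1 RoPE-to-NoPE ratio)

def rotate_half(x):
    # Standard RoPE helper: rotate the two halves of the last dimension.
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)

def apply_rope(q, k, cos, sin):
    # Rotate query/key vectors by precomputed per-position cos/sin tables.
    return q * cos + rotate_half(q) * sin, k * cos + rotate_half(k) * sin

def positionally_encode(layer_idx, q, k, cos, sin):
    # Layers 0, 1, 2 get RoPE; layer 3 is NoPE; then the pattern repeats.
    if (layer_idx + 1) % NOPE_EVERY == 0:
        return q, k  # NoPE: this attention block sees no position signal
    return apply_rope(q, k, cos, sin)
```

The only architectural change is the `if` branch: NoPE layers pass queries and keys through untouched, so their attention scores depend on content alone.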
SmolLM3 is also the go-to model for studying the three-stage training curriculum. The pretraining pipeline is documented in unusual detail by Hugging Face: 11T tokens split across web-scale general data, math/code specialisation, and a reasoning-annealing final phase. Every other SLM in this list trained on 11-15T tokens too, but the published breakdown of what goes in each phase is SmolLM3-specific. Read it alongside the Scaling laws lesson to see why 11T at 3B is inference-optimal territory, not Chinchilla-optimal — SmolLM3 is explicitly over-trained to spend compute on tokens rather than parameters.
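The three-stage split can be sketched as a simple curriculum scheduler. The stage names below come from the description above; the per-stage token boundaries are made-up placeholders (only the 11T total is from the source), since the published breakdown gives the actual mixture:

```python
# Illustrative three-stage curriculum. Boundaries (in trillions of tokens)
# are placeholder values, NOT SmolLM3's published numbers; only the 11T
# total is from the source.
STAGE_BOUNDARIES_T = [
    (8.0,  "web-scale general"),         # placeholder boundary
    (10.0, "math/code specialisation"),  # placeholder boundary
    (11.0, "reasoning annealing"),       # total matches the 11T figure
]

def stage_for(tokens_seen_T: float) -> str:
    """Return the curriculum stage active after tokens_seen_T trillion tokens."""
    for boundary, name in STAGE_BOUNDARIES_T:
        if tokens_seen_T < boundary:
            return name
    return STAGE_BOUNDARIES_T[-1][1]
```

The point of the sketch is the shape of the pipeline, not the numbers: breadth first, then specialisation, then a short annealing phase at the end of training.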
At 3B dense, SmolLM3 lives in the same bracket as Llama 3.2 3B and Gemma 3 4B. Different architectural bets, similar wall-clock behaviour — the benchmark battle is where the head-to-head numbers live.
- Size: 3B dense
- Architecture: Dense, GQA
- Positional encoding: 3:1 RoPE → NoPE ratio
- Context: 64K
- Training: 11T tokens, 3-stage curriculum
- Act II · 8 min · 40 xp · NoPE layers: Sometimes the best position encoding is none. Why SmolLM3 drops RoPE every 4th layer — the 3:1 NoPE ratio that boosts long-context without hurting short-context performance
- Act III · 10 min · 40 xp · The model museum: Explore every major SLM — Phi-4, Llama 3.2, Qwen3, Gemma 3, SmolLM3, BitNet — with architecture diagrams, training recipes, and benchmarks
- Act IV · 9 min · 45 xp · Three-stage curriculum: Walk through SmolLM3's 11T-token training pipeline — web-scale breadth, math/code specialization, and reasoning annealing — with a scrubbable timeline