
SmolLM3

by Hugging Face · Released July 2025 · Last reviewed April 2026

SmolLM3 is a 3B dense model that inserts a NoPE layer every fourth block: three RoPE layers, then one NoPE layer, repeated through the stack. The justification: not every attention layer needs a positional bias to do its job.

what's new in this one

The 3:1 RoPE-to-NoPE ratio is SmolLM3's signature choice, and it's counterintuitive enough to be worth unpacking. Most transformer layers get positional information via a RoPE rotation applied to their query and key vectors. SmolLM3 skips RoPE on every fourth layer entirely — no position signal at all in those attention blocks. The NoPE lesson shows the long-context benchmark results and the intuition for why this works: letting some layers attend to tokens "positionally uninformed" seems to improve generalisation to sequence lengths beyond the training context.
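The interleaving pattern itself is simple to state in code. A minimal sketch, assuming the layer count is a multiple of 4 and using hypothetical helper names (this is not the real SmolLM3 implementation):

```python
def uses_rope(layer_idx: int) -> bool:
    """Every 4th layer (0-indexed: 3, 7, 11, ...) is a NoPE layer;
    the other three in each group of four get the usual RoPE rotation."""
    return (layer_idx + 1) % 4 != 0

def project_qk(q, k, layer_idx, rope_fn):
    # RoPE layers rotate queries/keys by position; NoPE layers skip the
    # rotation entirely, so those attention heads see no positional signal.
    if uses_rope(layer_idx):
        q, k = rope_fn(q), rope_fn(k)
    return q, k

# The repeating RoPE, RoPE, RoPE, NoPE pattern for the first 8 layers:
pattern = ["RoPE" if uses_rope(i) else "NoPE" for i in range(8)]
print(pattern)
# → ['RoPE', 'RoPE', 'RoPE', 'NoPE', 'RoPE', 'RoPE', 'RoPE', 'NoPE']
```

The key design point is that NoPE layers are not a separate module: they are ordinary attention blocks with the rotation step switched off.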

SmolLM3 is also the go-to model for studying the three-stage training curriculum. The pretraining pipeline is documented in unusual detail by Hugging Face: 11T tokens split across web-scale general data, math/code specialisation, and a reasoning-annealing final phase. Every other SLM in this list trained on 11-15T tokens too, but the published breakdown of what goes in each phase is SmolLM3-specific. Read it alongside the Scaling laws lesson to see why 11T at 3B is inference-optimal territory, not Chinchilla-optimal — SmolLM3 is explicitly over-trained to spend compute on tokens rather than parameters.
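To see just how far past Chinchilla-optimal 11T tokens at 3B parameters is, a back-of-envelope check helps. The ~20 tokens-per-parameter rule of thumb comes from the Chinchilla scaling-law results; the rest is arithmetic:

```python
params = 3e9     # SmolLM3: 3B dense parameters
tokens = 11e12   # 11T pretraining tokens

# Chinchilla-optimal budget at this size: ~20 tokens per parameter.
chinchilla_tokens = 20 * params          # 6e10, i.e. ~60B tokens

tokens_per_param = tokens / params       # ~3,667 tokens/param
overtrain_factor = tokens / chinchilla_tokens  # ~183x past Chinchilla

print(f"{tokens_per_param:,.0f} tokens/param, ~{overtrain_factor:.0f}x Chinchilla")
# → 3,667 tokens/param, ~183x Chinchilla
```

That factor of ~180 is the "over-trained" claim made concrete: compute is deliberately spent on tokens rather than parameters, trading training cost for a cheaper model at inference time.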

At 3B dense, SmolLM3 lives in the same bracket as Llama 3.2 3B and Gemma 3 4B. Different architectural bets, similar wall-clock behaviour — the benchmark battle is where the head-to-head numbers live.

the shape in numbers
Size: 3B dense
Architecture: Dense, GQA
Positional encoding: 3:1 RoPE-to-NoPE ratio
Context: 64K
Training: 11T tokens, 3-stage curriculum
read alongside