model

SmolLM3

by Hugging Face · Released July 2025 · LAST REVIEWED APR 2026

SmolLM3 is a 3B dense model that uses NoPE layers every 4th block: three RoPE layers, one NoPE layer, repeat. The justification: not every attention layer needs positional bias to learn what it needs to learn.

why this one matters

SmolLM3 is the answer to "what does a 3B model that doesn't punt the hard problems look like?" Two choices set it apart from Llama 3.2 3B and Gemma 3 4B: every fourth layer drops positional encoding entirely (NoPE), and the training pipeline is published in unusual detail. Both choices touch lessons we already taught in Act 2.

the architecture, walked

SmolLM3 · 36 layers · 3B params

Token embed
↓
Layer × 3 · RoPE attention (GQA 16q → 4kv)
Layer × 1 · NoPE attention (positional encoding dropped)
Layers 4, 8, 12, 16, 20, 24, 28, 32, 36 apply attention without rotary positional encoding — every fourth layer in the 36-layer stack. The 3:1 RoPE/NoPE ratio is the design choice isolated by the NoPE lesson.
…repeat 9× (36 layers total)
↓
MLP · SwiGLU (2048 → 11008)
↓
RMSNorm + tied unembed

Boring blocks (residual streams, normalisation) covered in Act 2.

The blocks worth examining are positional handling, attention, the MLP, and the tokenizer. Boring blocks (residual, RMSNorm, embedding lookup) are covered cold in build-a-block — the diagram links them out rather than re-explaining them here.

Positional handling. 27 of SmolLM3's 36 layers apply RoPE rotation to query and key vectors. The remaining 9 (every fourth layer, starting at layer 4) skip the rotation entirely. The NoPE lesson walks the intuition: layers whose attention carries no explicit positional signal seem to generalise better past the trained context length.
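The layer pattern can be written down in a few lines. This is a minimal sketch, assuming 1-indexed layers as in the diagram; the names uses_rope, NUM_LAYERS and NOPE_INTERVAL are illustrative and not taken from the SmolLM3 code.

```python
NUM_LAYERS = 36     # layer count from the architecture diagram
NOPE_INTERVAL = 4   # every fourth layer skips the rotary encoding


def uses_rope(layer: int) -> bool:
    """Return True if a 1-indexed layer applies RoPE to its queries and keys."""
    return layer % NOPE_INTERVAL != 0


rope_layers = [n for n in range(1, NUM_LAYERS + 1) if uses_rope(n)]
nope_layers = [n for n in range(1, NUM_LAYERS + 1) if not uses_rope(n)]

print(nope_layers)                         # [4, 8, 12, 16, 20, 24, 28, 32, 36]
print(len(rope_layers), len(nope_layers))  # 27 9
```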

Attention (GQA, 16q/4kv). 16 query heads per layer share 4 KV heads, a 4:1 grouping ratio that is slightly more aggressive than the 3:1 (24q/8kv) of Llama 3.2 3B. The From MHA to GQA lesson shows why ratios in this range sit at the sweet spot: at 4:1 the KV cache footprint drops 4× relative to full multi-head attention, with negligible quality cost. SmolLM3 inherits the recipe rather than experimenting.
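To put a number on the cache saving, here is a rough back-of-envelope sketch. It assumes a head dimension of 128 (2048 hidden / 16 query heads), 2-byte bf16 cache entries, a single sequence of 65,536 tokens, and no cache quantisation; the function name kv_cache_bytes is hypothetical.

```python
def kv_cache_bytes(tokens: int, kv_heads: int, layers: int = 36,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    # Factor of 2: one K tensor and one V tensor per layer.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


CONTEXT = 64 * 1024  # assumption: "64K" read as 65,536 tokens

gqa = kv_cache_bytes(CONTEXT, kv_heads=4)    # SmolLM3's 4 KV heads
mha = kv_cache_bytes(CONTEXT, kv_heads=16)   # hypothetical full multi-head cache

print(gqa / 2**30)  # 4.5  (GiB)
print(mha / 2**30)  # 18.0 (GiB)
print(mha // gqa)   # 4
```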

MLP (SwiGLU, 2048 → 11008). The MLP uses SwiGLU with hidden dimension 2048 and intermediate dimension 11008. That's a 5.375× expansion, about double the canonical SwiGLU (8/3)·d ≈ 2.67× expansion that Llama 3.2 3B uses (3072 → 8192).
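The expansion ratio and its cost in parameters are quick to check. The sketch below assumes the usual three-matrix SwiGLU layout (gate, up and down projections) with no bias terms; that layout is an assumption, not something this page states.

```python
D_MODEL = 2048   # hidden dimension
D_FF = 11008     # intermediate dimension
LAYERS = 36

expansion = D_FF / D_MODEL
canonical = 8 / 3  # the (8/3)·d convention mentioned above

# Assumed layout: gate and up map d_model -> d_ff, down maps d_ff -> d_model.
mlp_params_per_layer = 3 * D_MODEL * D_FF
mlp_params_total = mlp_params_per_layer * LAYERS

print(expansion)             # 5.375
print(round(canonical, 2))   # 2.67
print(mlp_params_per_layer)  # 67633152
print(mlp_params_total)      # 2434793472
```

Under these assumptions the MLPs account for roughly 2.4B of the model's 3B parameters, which is where the wide intermediate dimension shows up.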

Tokenizer. SmolLM3 reuses the Llama 3.2 tokenizer wholesale (128K vocab); the BOS token aside, it's the same tokenizer. The tokenization lesson covers the BPE recipe; SmolLM3's choice here is deliberate reuse, not innovation.

the choices, examined

Why dense, not MoE?

At 3B parameters, the routing overhead and training instability of mixture-of-experts architectures eat the parameter-efficiency gains. The MoE crossover with dense is well past 3B: DeepSeek-V3 has 671B total parameters, and the smallest Qwen3 MoE has 30B total.

Why NoPE on every fourth layer?

Long-context generalisation. SmolLM3 cites Yang et al. 2025 ("Rope to Nope and Back Again"), which shows a hybrid RoPE/NoPE pattern outperforms RoPE-on-every-layer on long-context tasks. The NoPE lesson walks the intuition: positionally-uninformed layers seem to extend the competence ceiling past the trained context length. SmolLM3's 3:1 ratio (RoPE on three, NoPE on one, repeat) is the production-scale design point.
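What "dropping the rotation" means for a single attention score can be shown with a toy example. This is an illustrative sketch in plain Python, not SmolLM3's implementation: it assumes the adjacent-pair RoPE convention and a base of 10,000, and the helper names rope_rotate and attention_score are hypothetical.

```python
import math


def rope_rotate(vec, position, theta=10000.0):
    """Rotate consecutive (even, odd) pairs of vec by position-dependent angles."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        angle = position * theta ** (-i / dim)
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def attention_score(q, k, q_pos, k_pos, layer, nope_interval=4):
    """Unscaled dot-product score; every nope_interval-th layer skips RoPE."""
    if layer % nope_interval != 0:
        q = rope_rotate(q, q_pos)
        k = rope_rotate(k, k_pos)
    return sum(a * b for a, b in zip(q, k))


q = [1.0, 0.0, 0.5, -0.5]
k = [0.3, 0.7, -0.2, 0.1]

# RoPE layer: the score depends only on the relative offset (3 in both calls).
print(attention_score(q, k, 5, 2, layer=1), attention_score(q, k, 105, 102, layer=1))
# NoPE layer: the score ignores position entirely.
print(attention_score(q, k, 5, 2, layer=4), attention_score(q, k, 900, 1, layer=4))
```

In the NoPE layer the only positional signal left is whatever the causal mask and the preceding RoPE layers have already written into the residual stream.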

Why over-train at 11.2T tokens?

The model sees 11.2T tokens against 3B parameters — a D/N ratio of ~3,733, far past Chinchilla-optimal's ~20. The Scaling laws lesson covers the inference-cost-aware reframing: Sardana 2024 shows quality keeps improving up to D/N ≈ 10,000 once inference cost is added to the optimisation objective.
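The ratio arithmetic, as a small sketch. The SmolLM3 and Chinchilla figures come from the paragraph above, and the Llama 3 8B figures from the sources list; the function name tokens_per_param is illustrative.

```python
CHINCHILLA_RATIO = 20  # compute-optimal tokens per parameter (Hoffmann et al.)


def tokens_per_param(tokens: float, params: float) -> float:
    """D/N: training tokens divided by parameter count."""
    return tokens / params


smollm3_ratio = tokens_per_param(11.2e12, 3e9)
llama3_8b_ratio = tokens_per_param(15e12, 8e9)

print(round(smollm3_ratio))                     # 3733
print(round(llama3_8b_ratio))                   # 1875
print(round(smollm3_ratio / CHINCHILLA_RATIO))  # 187
```

By this measure SmolLM3 trains on roughly 187 times the Chinchilla-optimal token count for its size, and about twice the per-parameter budget of Llama 3 8B.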

Why a three-stage curriculum?

Capability annealing. The three-stages lesson uses SmolLM3 as the canonical example because HF publishes the data mix per stage in unusual detail: stage 1 is web-scale general data (85/12/3 web/code/math at 0→8T tokens), stage 2 lifts the math and code share (75/15/10 at 8→10T), stage 3 anneals on reasoning data (63/24/13 at 10→11.1T). Each stage locks in capabilities the next one builds on.
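The per-stage percentages can be turned into absolute token budgets. This sketch uses only the stage boundaries and mixes quoted above, expressed in billions of tokens so the arithmetic stays in integers; the variable names are illustrative.

```python
# (name, start, end, mix): boundaries in billions of tokens, mix in percent.
STAGES = [
    ("stage 1", 0, 8000, {"web": 85, "code": 12, "math": 3}),
    ("stage 2", 8000, 10000, {"web": 75, "code": 15, "math": 10}),
    ("stage 3", 10000, 11100, {"web": 63, "code": 24, "math": 13}),
]

totals = {"web": 0, "code": 0, "math": 0}
for name, start, end, mix in STAGES:
    span = end - start
    assert sum(mix.values()) == 100  # each stage's mix should be complete
    for source, percent in mix.items():
        totals[source] += span * percent // 100

grand_total = sum(totals.values())
for source, billions in totals.items():
    share = 100 * billions / grand_total
    print(f"{source}: {billions}B tokens, {share:.1f}% of the run")
```

Under these figures the blended run comes out at roughly 81% web, 14% code and 5% math: the later stages shift the mix sharply, but stage 1 still dominates the total because it covers 8T of the tokens.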

the training story

The 11.2T tokens split across three stages, each with a deliberate data-mix shift. Stage one is web-scale general data — broad fluency, multilingual coverage across six European languages, the long tail of human writing. Stage two raises the math and code proportion, pushing the model toward formal-reasoning capability without losing the general fluency stage one locked in. Stage three is the reasoning anneal: upsampled math and code, plus instruction and reasoning datasets like OpenMathReasoning.

Three-stage training data composition for SmolLM3. Stage 1 (0-8T tokens): 85% web, 12% code, 3% math. Stage 2 (8-10T tokens): 75% web, 15% code, 10% math. Stage 3 (10-11.1T tokens): 63% web, 24% code, 13% math.
The three-stage data composition. Math climbs from 3% in stage 1 to 13% in stage 3 and code from 12% to 24%, taking share from web data as the model anneals toward reasoning capability.
From Hugging Face SmolLM3 blog — three-stage pretraining data composition.
Reproduced under fair use for educational commentary on SmolLM3's training pipeline.

The context-extension recipe is the other published-in-detail piece. SmolLM3 pretrains at 4K sequence length, then mid-trains an extension to 32K with rope_theta raised to 1.5M, then to 64K with rope_theta raised to 5M. At inference, YaRN extrapolates the 64K-trained model to 128K. The long-context lesson walks the YaRN math.
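One way to see why rope_theta rises with the context length is to look at the slowest-rotating RoPE frequency pair. The sketch below uses the standard RoPE frequency formula and assumes a head dimension of 128 (2048 hidden / 16 query heads) and "32K"/"64K" read as 32,768 and 65,536 tokens; the function name longest_rope_wavelength is hypothetical. The YaRN line only computes the scale factor implied by 64K → 128K and does not implement YaRN itself.

```python
import math

HEAD_DIM = 128  # assumption: hidden size 2048 divided by 16 query heads


def longest_rope_wavelength(theta: float, head_dim: int = HEAD_DIM) -> float:
    """Wavelength, in tokens, of the slowest-rotating RoPE frequency pair."""
    # Pair i rotates at theta ** (-2i / head_dim) radians per token,
    # so the last pair (i = head_dim/2 - 1) has the longest period.
    slowest_exponent = (head_dim - 2) / head_dim
    return 2 * math.pi * theta ** slowest_exponent


extension_stages = [
    ("mid-training to 32K", 32 * 1024, 1.5e6),
    ("mid-training to 64K", 64 * 1024, 5e6),
]
for name, context, theta in extension_stages:
    wavelength = longest_rope_wavelength(theta)
    print(f"{name}: slowest pair spans about {wavelength / context:.0f}x the context")

yarn_factor = (128 * 1024) / (64 * 1024)
print(yarn_factor)  # 2.0
```

Raising theta stretches the slow frequencies so that positions across the longer window stay distinguishable; the exact values SmolLM3 chose are empirical, and this sketch only shows the direction of the effect.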

SmolLM3's long-context extension pipeline: 4K base context, mid-training extension to 32K at rope_theta 1.5M, then 64K at rope_theta 5M, then YaRN extrapolation to 128K at inference. — The full context-extension pipeline: 4K → 32K → 64K during training (with the rope_theta progression), then YaRN doubles it to 128K at inference time.
From Hugging Face SmolLM3 blog — long-context extension pipeline.
Reproduced under fair use for educational commentary on SmolLM3's context-extension recipe.

The whole pipeline is the canonical worked example for the three-stages lesson. Other 2025-era SLMs trained on similar token budgets but published far less detail about the per-stage breakdown — SmolLM3's transparency is what makes it the teaching target, not the architecture per se.

the shape in numbers

Size: 3B dense
Architecture: Dense, GQA
Positional encoding: 3:1 RoPE:NoPE layer ratio
Context: 64K trained, 128K with YaRN at inference
Training: 11.2T tokens, 3-stage curriculum

Sources · primary references · 7

Transformer Language Models without Positional Encodings Still Learn Positional Information
Haviv, Ram, Goldberg, Chen, Levy · 2022 · EMNLP Findings 2022
NoPE debut at 125M–1.3B. Models without explicit position encodings still learn position from causal masking.
Training Compute-Optimal Large Language Models
Hoffmann, Borgeaud, Mensch et al. · 2022 · NeurIPS 2022
Compute-optimal D/N ≈ 20. Trained ~400 models from 70M to 16B params to fit the scaling law.
Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws
Sardana, Portes, Doubov, Frankle · 2024 · ICML 2024
Adds inference cost to the Chinchilla objective. Quality keeps improving up to D/N ≈ 10,000.
The Llama 3 Herd of Models
Grattafiori et al. · 2024 · arXiv (Meta AI)
Llama 3 8B / 70B / 405B trained on 15T multilingual tokens — D/N = 1,875 for the 8B.
Rope to Nope and Back Again: A New Hybrid Attention Strategy
Yang, Venkitesh, Talupuru, Lin, Cairuz · 2025 · arXiv (preprint)
Hybrid RoPE/NoPE outperforms RoPE-on-every-layer on long-context tasks. The paper SmolLM3 cites for its 3:1 RoPE/NoPE ratio.
Gemma 3 Technical Report
Gemma Team (Google DeepMind) · 2025 · arXiv (Google DeepMind)
5:1 local:global sliding-window ratio. IFEval 90.2 on the 4B-IT.
SmolLM3: Smol, multilingual, long-context reasoner
Hugging Face · 2025 · Hugging Face blog (with model card + tech notes)
First scaled adoption of NoPE — every fourth layer drops RoPE. 3B params.