SmolLM3
SmolLM3 is a 3B dense model that uses NoPE layers every 4th block: three RoPE layers, one NoPE layer, repeat. The justification: not every attention layer needs positional bias to learn what it needs to learn.
SmolLM3 is the answer to "what does a 3B model that doesn't punt the hard problems look like?" Two choices set it apart from Llama 3.2 3B and Gemma 3 4B: every fourth layer drops positional encoding entirely (NoPE), and the training pipeline is published in unusual detail. Both choices touch lessons we already taught in Act 2.
- Token embed
- Layer × 3 · RoPE attention (GQA 16q → 4kv)
- Layer × 1 · NoPE attention (positional encoding dropped)
  Layers 4, 8, 12, 16, 20, 24, 28, 32, 36 apply attention without rotary positional encoding: every fourth layer in the 36-layer stack. The 3:1 RoPE/NoPE ratio is the design choice isolated by the NoPE lesson.
- …repeat 9× (36 layers total)
- MLP · SwiGLU (2048 → 11008)
- RMSNorm + tied unembed
The blocks worth examining are positional handling, attention, the MLP, and the tokenizer. The boring blocks (residual streams, RMSNorm, embedding lookup) are covered in Act 2, in build-a-block; the diagram links out to them rather than re-explaining them here.
Positional handling. 27 of SmolLM3's 36 layers apply RoPE rotation to query and key vectors. The remaining 9 (every fourth layer, counting from layer 4) skip the rotation entirely. The NoPE lesson walks the intuition: layers whose attention carries no explicit positional signal seem to generalise better past the trained context length.
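The layer pattern described above can be sketched in a few lines. This is a minimal illustration, not SmolLM3's actual code: the names `NUM_LAYERS`, `NOPE_INTERVAL`, and `uses_rope` are hypothetical, and layers are numbered from 1 to match the text.

```python
NUM_LAYERS = 36      # total decoder layers, from the text
NOPE_INTERVAL = 4    # every fourth layer skips RoPE, counting from layer 4


def uses_rope(layer_number: int) -> bool:
    """Return True if the 1-indexed layer rotates its queries and keys."""
    return layer_number % NOPE_INTERVAL != 0


rope_layers = [n for n in range(1, NUM_LAYERS + 1) if uses_rope(n)]
nope_layers = [n for n in range(1, NUM_LAYERS + 1) if not uses_rope(n)]

print("NoPE layers:", nope_layers)
print("RoPE / NoPE counts:", len(rope_layers), "/", len(nope_layers))
```

The counts come out to 27 RoPE layers and 9 NoPE layers, which is the 3:1 ratio.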
Attention (GQA 16q/4kv). The 16 query heads in each layer share 4 KV heads, a 4:1 compression ratio. That is the ratio Llama 3 8B uses (32 query heads over 8 KV heads); Llama 3.2 3B, with 24 query heads over 8 KV heads, sits at 3:1. The From MHA to GQA lesson shows why this ratio sits at the sweet spot: KV cache footprint drops 4× relative to full multi-head attention with negligible quality cost. SmolLM3 inherits the recipe rather than experimenting.
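To make the 4× figure concrete, here is a rough per-token KV cache estimate. It is a sketch under stated assumptions: the head dimension is taken as hidden size divided by query heads (2048 / 16 = 128), and cached values are assumed to be stored in a 2-byte format such as bf16. Neither is stated in this lesson, and the function name is hypothetical.

```python
NUM_LAYERS = 36
HIDDEN_SIZE = 2048
QUERY_HEADS = 16
KV_HEADS = 4
HEAD_DIM = HIDDEN_SIZE // QUERY_HEADS  # assumption: 128 dims per head
BYTES_PER_VALUE = 2                    # assumption: bf16 / fp16 cache


def kv_cache_bytes_per_token(kv_heads: int) -> int:
    """Bytes cached per token: keys plus values, across every layer."""
    return 2 * NUM_LAYERS * kv_heads * HEAD_DIM * BYTES_PER_VALUE


mha_bytes = kv_cache_bytes_per_token(QUERY_HEADS)  # one KV head per query head
gqa_bytes = kv_cache_bytes_per_token(KV_HEADS)     # 4 shared KV heads

print(f"MHA: {mha_bytes / 1024:.0f} KiB per token")
print(f"GQA: {gqa_bytes / 1024:.0f} KiB per token")
print(f"Reduction: {mha_bytes // gqa_bytes}x")
```

Under these assumptions the cache shrinks from 288 KiB to 72 KiB per token, and the saving scales linearly with context length.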
MLP (SwiGLU, 2048 → 11008). The MLP uses SwiGLU with hidden dimension 2048 and intermediate dimension 11008. That's a 5.375× expansion, noticeably fatter than the canonical SwiGLU expansion of (8/3)·d ≈ 2.67× that Llama 3 uses at a comparable parameter budget.
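The expansion arithmetic, plus a rough parameter count, can be checked directly. The count assumes the usual bias-free SwiGLU layout of three weight matrices (gate, up, down); that layout is an assumption here, not something this lesson states about SmolLM3.

```python
D_MODEL = 2048   # hidden dimension, from the text
D_FF = 11008     # intermediate dimension, from the text
NUM_LAYERS = 36

expansion = D_FF / D_MODEL
canonical_expansion = 8 / 3  # the (8/3)·d rule of thumb for SwiGLU

# Assumption: gate and up projections are d_model x d_ff, down is d_ff x d_model,
# with no bias terms.
params_per_mlp = 3 * D_MODEL * D_FF
mlp_params_total = params_per_mlp * NUM_LAYERS

print(f"Expansion: {expansion:.3f}x vs canonical {canonical_expansion:.2f}x")
print(f"MLP parameters per layer: {params_per_mlp:,}")
print(f"MLP parameters across all layers: {mlp_params_total / 1e9:.2f}B")
```

Under that assumption the MLPs alone account for roughly 2.4B parameters, which shows where a fat expansion ratio spends the budget.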
Tokenizer. SmolLM3 reuses the Llama 3.2 tokenizer wholesale (128K vocab); the BOS token aside, it's the same tokenizer. The tokenization lesson covers the BPE recipe; SmolLM3's choice here is deliberate reuse, not innovation.
Why dense, not MoE?
At 3B parameters, the routing overhead and training instability of mixture-of-experts architectures eat the parameter-efficiency gains. The MoE crossover with dense is well past 3B — DeepSeek-V3 starts at 671B total parameters, Qwen3 MoE starts at 30B total.
Why NoPE on every fourth layer?
Long-context generalisation. SmolLM3 cites Yang et al. 2025 ("Rope to Nope and Back Again"), which shows a hybrid RoPE/NoPE pattern outperforms RoPE-on-every-layer on long-context tasks. The NoPE lesson walks the intuition: positionally-uninformed layers seem to extend the competence ceiling past the trained context length. SmolLM3's 3:1 ratio (RoPE on three, NoPE on one, repeat) is the production-scale design point.
Why over-train at 11.2T tokens?
The model sees 11.2T tokens against 3B parameters — a D/N ratio of ~3,733, far past Chinchilla-optimal's ~20. The Scaling laws lesson covers the inference-cost-aware reframing: Sardana 2024 shows quality keeps improving up to D/N ≈ 10,000 once inference cost is added to the optimisation objective.
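The ratio is simple arithmetic, sketched here with the round figures quoted above (11.2T tokens, 3B parameters, Chinchilla-optimal taken as 20 tokens per parameter). The function name is illustrative.

```python
def tokens_per_parameter(tokens: float, params: float) -> float:
    """D/N: training tokens divided by model parameters."""
    return tokens / params


CHINCHILLA_OPTIMAL = 20  # approximate compute-optimal D/N

smollm3_ratio = tokens_per_parameter(11.2e12, 3e9)
overtraining_factor = smollm3_ratio / CHINCHILLA_OPTIMAL

print(f"SmolLM3 D/N: {smollm3_ratio:,.0f}")
print(f"Multiple of Chinchilla-optimal: {overtraining_factor:.0f}x")
```

That works out to about 3,733 tokens per parameter, roughly 187 times the Chinchilla-optimal ratio and still under the D/N ≈ 10,000 ceiling cited from Sardana 2024.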
Why a three-stage curriculum?
Capability annealing. The three-stages lesson uses SmolLM3 as the canonical example because HF publishes the data mix per stage in unusual detail: stage 1 is web-scale general data (85/12/3 web/code/math at 0→8T tokens), stage 2 lifts the math and code share (75/15/10 at 8→10T), stage 3 anneals on reasoning data (63/24/13 at 10→11.1T). Each stage locks in capabilities the next one builds on.
The 11.2T tokens split across three stages, each with a deliberate data-mix shift. Stage one is web-scale general data — broad fluency, multilingual coverage across six European languages, the long tail of human writing. Stage two raises the math and code proportion, pushing the model toward formal-reasoning capability without losing the general fluency stage one locked in. Stage three is the reasoning anneal: upsampled math and code, plus instruction and reasoning datasets like OpenMathReasoning.
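The per-stage mix can be turned into approximate token counts per domain. This sketch uses the stage boundaries and web/code/math percentages quoted above, which end at 11.1T; the data structure and names are illustrative and not taken from the SmolLM3 training code.

```python
# (name, start, end, mix): token positions in billions, mix in percent
STAGES = [
    ("stage 1", 0, 8_000, {"web": 85, "code": 12, "math": 3}),
    ("stage 2", 8_000, 10_000, {"web": 75, "code": 15, "math": 10}),
    ("stage 3", 10_000, 11_100, {"web": 63, "code": 24, "math": 13}),
]

totals = {"web": 0, "code": 0, "math": 0}
for name, start, end, mix in STAGES:
    span = end - start  # billions of tokens seen in this stage
    per_domain = {domain: span * pct // 100 for domain, pct in mix.items()}
    for domain, tokens in per_domain.items():
        totals[domain] += tokens
    print(name, per_domain)

print("totals (billions of tokens):", totals)
```

Summed over the three stages, web data still dominates at roughly 9T tokens, with about 1.5T of code and 0.6T of math. The later stages change the proportions far more than the absolute totals.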

The context-extension recipe is the other piece published in detail. SmolLM3 pretrains at 4K sequence length, then mid-trains an extension to 32K with rope_theta raised to 1.5M, then to 64K with rope_theta raised to 5M. At inference, YaRN extrapolates the 64K-trained model to 128K. The long-context lesson walks the YaRN math.
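One way to see why rope_theta rises with context length is to look at the slowest-rotating RoPE dimension pair. The sketch below uses the standard RoPE frequency formula, where pair i rotates at theta^(-2i/d). It assumes a head dimension of 128 (2048 / 16) and reads 32K and 64K as 32,768 and 65,536 tokens; both are assumptions. It does not model YaRN.

```python
import math

HEAD_DIM = 128  # assumption: hidden size 2048 / 16 query heads

# (context length in tokens, rope_theta) for the two extension stages in the text
EXTENSION_SCHEDULE = [
    (32_768, 1.5e6),
    (65_536, 5e6),
]


def longest_wavelength(theta: float, head_dim: int = HEAD_DIM) -> float:
    """Tokens needed for the slowest RoPE pair to complete one full rotation."""
    slowest_frequency = theta ** (-(head_dim - 2) / head_dim)
    return 2 * math.pi / slowest_frequency


for context_length, theta in EXTENSION_SCHEDULE:
    wavelength = longest_wavelength(theta)
    print(
        f"context {context_length:>6}: theta {theta:.1e}, "
        f"longest wavelength about {wavelength:,.0f} tokens"
    )
```

In both stages the slowest pair's wavelength is far longer than the training context, so those dimensions never wrap around within a single sequence.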

The whole pipeline is the canonical worked example for the three-stages lesson. Other 2025-era SLMs trained on similar token budgets but published far less detail about the per-stage breakdown — SmolLM3's transparency is what makes it the teaching target, not the architecture per se.
- Size: 3B dense
- Architecture: Dense, GQA
- Positional encoding: 3:1 RoPE → NoPE ratio
- Context: 64K
- Training: 11T tokens, 3-stage curriculum
- Act II · 8 min · 40 xp · NoPE layers: Sometimes the best position encoding is none. Why SmolLM3 drops RoPE every 4th layer, and the 3:1 NoPE ratio that boosts long-context without hurting short.
- Act III · 10 min · 40 xp · The model museum: Explore every major SLM (Phi-4, Llama 3.2, Qwen3, Gemma 3, SmolLM3, BitNet) with architecture diagrams, training recipes, and benchmarks.
- Act IV · 9 min · 45 xp · Three-stage curriculum: Walk through SmolLM3's 11T-token training pipeline (web-scale breadth, math/code specialization, and reasoning annealing) with a scrubbable timeline.
- Transformer Language Models without Positional Encodings Still Learn Positional Information. Haviv, Ram, Goldberg, Chen, Levy · 2022 · EMNLP Findings 2022. NoPE debut at 125M–1.3B. Models without explicit position encodings still learn position from causal masking.
- Training Compute-Optimal Large Language Models. Hoffmann, Borgeaud, Mensch et al. · 2022 · NeurIPS 2022. Compute-optimal D/N ≈ 20. Trained ~400 models from 70M to 16B params to fit the scaling law.
- Beyond Chinchilla-Optimal: Accounting for Inference in Language Model Scaling Laws. Sardana, Portes, Doubov, Frankle · 2024 · ICML 2024. Adds inference cost to the Chinchilla objective. Quality keeps improving up to D/N ≈ 10,000.
- The Llama 3 Herd of Models. Grattafiori et al. · 2024 · arXiv (Meta AI). Llama 3 8B / 70B / 405B trained on 15T multilingual tokens; D/N = 1,875 for the 8B.
- Rope to Nope and Back Again: A New Hybrid Attention Strategy. Yang, Venkitesh, Talupuru, Lin, Cairuz · 2025 · arXiv (preprint). Hybrid RoPE/NoPE outperforms RoPE-on-every-layer on long-context tasks. The paper SmolLM3 cites for its 3:1 RoPE/NoPE ratio.
- Gemma 3 Technical Report. Gemma Team (Google DeepMind) · 2025 · arXiv (Google DeepMind). 5:1 local:global sliding-window ratio. IFEval 90.2 on the 4B-IT.
- SmolLM3: Smol, multilingual, long-context reasoner. Hugging Face · 2025 · Hugging Face blog (with model card + tech notes). First scaled adoption of NoPE: every fourth layer drops RoPE. 3B params.