Qwen3
Qwen3 ships at every scale: 0.6B, 1.7B, 4B, 8B, 14B, 32B dense, plus a 30B-A3B and a 235B-A22B MoE. The pick-your-size family for open-weights work in 2026.
Qwen3's design choice is breadth — every compute budget is covered. A 0.6B dense model fits on a laptop; a 235B-A22B MoE runs on a single 8×H100 node. The architecture stays consistent across sizes (RoPE, GQA, SwiGLU, RMSNorm) so a recipe that works on 4B transfers to 32B without surgery.
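That cross-size consistency is easiest to see as a config: every member of the family shares the same skeleton (RoPE positions, GQA attention, SwiGLU MLP, RMSNorm) and only the dimensions change. A minimal sketch, with illustrative dimensions that are assumptions rather than official hyperparameters:

```python
from dataclasses import dataclass

@dataclass
class QwenLikeConfig:
    # architectural choices shared across the whole family
    pos_encoding: str = "rope"
    attention: str = "gqa"
    mlp: str = "swiglu"
    norm: str = "rmsnorm"
    # the only knobs that move between sizes (values here are illustrative)
    hidden_size: int = 2560
    num_layers: int = 36
    num_heads: int = 32
    num_kv_heads: int = 8

# hypothetical 4B and 32B presets: same skeleton, different dims
qwen_4b = QwenLikeConfig(hidden_size=2560, num_layers=36)
qwen_32b = QwenLikeConfig(hidden_size=5120, num_layers=64, num_heads=40)

# a training or inference recipe keyed on the shared fields transfers as-is
assert qwen_4b.mlp == qwen_32b.mlp == "swiglu"
```

This is why a recipe tuned on the 4B model transfers upward: anything that depends only on the shared fields never has to change.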
The MoE variants use a familiar shape: top-8 routing over 128 experts, plus shared experts that carry universal features. The MoE lesson walks through the routing math, which applies here identically; only the expert count differs from DeepSeek-V3's 256 or Kimi K2's 384.
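The routing step itself is small: softmax the router logits over all 128 experts, keep the top 8, and renormalize those probabilities into mixing weights. A minimal NumPy sketch (the function name and shapes are illustrative, not Qwen3's actual code):

```python
import numpy as np

def topk_route(router_logits, k=8):
    """Top-k expert routing in the shape Qwen3's MoE uses (top-8 of 128).

    Softmax over all expert logits, keep the k largest, then renormalize
    the kept probabilities so the selected experts' outputs mix to weight 1.
    """
    probs = np.exp(router_logits - router_logits.max())
    probs /= probs.sum()
    top = np.argsort(probs)[-k:][::-1]      # indices of the k best experts
    gates = probs[top] / probs[top].sum()   # renormalized mixing weights
    return top, gates

# one token's router logits over 128 experts
logits = np.random.default_rng(0).normal(size=128)
experts, gates = topk_route(logits, k=8)
assert len(experts) == 8
assert abs(gates.sum() - 1.0) < 1e-9
```

Only the 8 selected experts run their FFNs for this token, which is how a 235B-parameter model activates just 22B per token.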
Qwen3-Next is the architecturally interesting variant — it's the first production model to ship MTP with depth ≥ 2, pushing the speculative decoding acceptance rate above what DeepSeek-V3's depth-1 module achieves. The MTP lesson shows why depth-2 is non-obvious (the second module's acceptance is conditional on the first, multiplicatively reducing the expected speedup). Qwen3-Next also extends context to 1M tokens — far past the 128K ceiling where most 2025 models stopped.
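The conditional-acceptance arithmetic is worth making concrete. Each chained MTP module only contributes a token if every module before it was accepted, so the survival probability multiplies down the chain. A small sketch with illustrative acceptance rates (the 0.8 figures are assumptions for the example, not measured numbers):

```python
def expected_tokens_per_step(acceptance):
    """Expected tokens committed per forward pass with chained MTP modules.

    acceptance[i] is the probability that draft token i+1 is accepted
    *given* all earlier drafts were accepted. The base model's own token
    always lands; each deeper module contributes only if the whole chain
    before it survived, so the terms multiply.
    """
    expected, survive = 1.0, 1.0
    for a in acceptance:
        survive *= a
        expected += survive
    return expected

# depth-1 (DeepSeek-V3 style) vs depth-2, at an illustrative 80% acceptance
depth1 = expected_tokens_per_step([0.8])        # 1 + 0.8        = 1.8
depth2 = expected_tokens_per_step([0.8, 0.8])   # 1 + 0.8 + 0.64 = 2.44
```

The second module adds 0.64 expected tokens, not 0.8 — the multiplicative discount is why depth-2 only pays off when per-module acceptance stays high.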
| Spec | Value |
| --- | --- |
| Sizes (dense) | 0.6B, 1.7B, 4B, 8B, 14B, 32B |
| Sizes (MoE) | 30B-A3B, 235B-A22B |
| Routing (MoE) | top-8 of 128 routed |
| Context | 128K+ (Qwen3-Next: 1M) |
| Notable | Strong multilingual + Qwen3-Next MTP |