Kimi K2
Kimi K2 scales the DeepSeek-V3 recipe from 256 routed experts to 384, with one always-on shared expert and a top-8 router. Same bones, bigger room.
The 256 → 384 routed-expert jump is where the "1T total, 37B active" numbers come from. Every token still activates 8 of those experts plus the shared one, so per-token compute stays in the ~37B range while parameter capacity climbs by 50%. The Mixture of Experts lesson walks through the top-k routing math Kimi K2 uses; K2 is the current scale ceiling of that same recipe, not a new mechanism.
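To make the routing concrete, here is a minimal sketch of top-8-of-384 selection with an always-on shared expert. The shapes, weights, and single-matrix "experts" are toy assumptions for readability; only the 384-routed / top-8 / 1-shared structure mirrors the description above.

```python
import numpy as np

N_ROUTED = 384   # routed experts
TOP_K = 8        # experts activated per token
D_MODEL = 16     # toy hidden size (the real model is far larger)

rng = np.random.default_rng(0)

# Toy parameters: a router and a bank of tiny experts (one matrix each), plus a shared expert.
router_w = rng.standard_normal((D_MODEL, N_ROUTED)) * 0.02
experts = rng.standard_normal((N_ROUTED, D_MODEL, D_MODEL)) * 0.02
shared_expert = rng.standard_normal((D_MODEL, D_MODEL)) * 0.02

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token's hidden state through top-8 of 384 experts plus the shared expert."""
    logits = x @ router_w                      # (N_ROUTED,) affinity score per routed expert
    top_idx = np.argsort(logits)[-TOP_K:]      # indices of the 8 highest-scoring experts
    gates = np.exp(logits[top_idx])
    gates /= gates.sum()                       # normalize gate weights over the selected 8
    routed_out = sum(g * (x @ experts[i]) for g, i in zip(gates, top_idx))
    shared_out = x @ shared_expert             # always-on shared expert, no gating
    return routed_out + shared_out

token = rng.standard_normal(D_MODEL)
print(moe_forward(token).shape)                # (16,)
# Only 8 of 384 routed experts run per token, which is why active params stay ~37B
# while total parameter capacity reaches ~1T.
```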
The interesting lever is the training optimizer. Moonshot trained Kimi K2 with MuonClip instead of AdamW: a Muon-based optimizer that adds QK-clip, rescaling an attention head's query and key projection weights whenever that head's maximum attention logit exceeds a threshold. Training stability improved noticeably at the 1T scale, and the published loss curves are smoother than contemporaneous AdamW MoE runs. Most MoE papers hand-wave past the optimizer choice; Moonshot made it the headline.
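A schematic of the QK-clip step, under the assumption that it works roughly as the Kimi K2 report describes: per head, if the maximum attention logit exceeds a threshold, shrink the query and key projections to pull it back under. The threshold value, the alpha split between Q and K, and the single-head framing are illustrative, not Moonshot's actual settings or code.

```python
import numpy as np

def qk_clip(w_q: np.ndarray, w_k: np.ndarray, x: np.ndarray,
            tau: float = 100.0, alpha: float = 0.5) -> float:
    """Rescale one head's W_q / W_k in place if its max attention logit exceeds tau."""
    q = x @ w_q                                 # (seq, d_head) queries
    k = x @ w_k                                 # (seq, d_head) keys
    logits = (q @ k.T) / np.sqrt(w_q.shape[1])  # scaled dot-product attention logits
    s_max = logits.max()
    if s_max > tau:
        gamma = tau / s_max                     # shrink factor, split between Q and K
        w_q *= gamma ** alpha
        w_k *= gamma ** (1.0 - alpha)
    return s_max

rng = np.random.default_rng(0)
d_model, d_head, seq = 32, 16, 64
w_q = rng.standard_normal((d_model, d_head))
w_k = rng.standard_normal((d_model, d_head))
x = rng.standard_normal((seq, d_model))

before = qk_clip(w_q, w_k, x, tau=25.0)  # low threshold so the toy example actually clips
after = qk_clip(w_q, w_k, x, tau=25.0)   # re-measure after rescaling: now capped at tau
print(f"max logit before clip: {before:.1f}, after: {after:.1f}")
```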
Context length is 128K via YaRN-family position scaling — same extension approach as DeepSeek-V3, different base RoPE configuration. See Stretching context for why YaRN-style frequency-band rescaling avoids the high-frequency extrapolation collapse, and RoPE as rotation for the rotation math the extension relies on.
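A simplified sketch of the YaRN-style frequency blend: low-frequency RoPE bands are position-interpolated while high-frequency bands keep their original rotation rates. The original context length, scale factor, and ramp thresholds below are made-up stand-ins rather than Kimi K2's published settings, and YaRN's attention-temperature term is omitted.

```python
import numpy as np

def yarn_inv_freq(d_head: int = 64, base: float = 10000.0, scale: float = 32.0,
                  orig_ctx: int = 4096, beta_fast: float = 32.0, beta_slow: float = 1.0):
    """Blend original and position-interpolated RoPE inverse frequencies per dimension."""
    dims = np.arange(0, d_head, 2)
    inv_freq = base ** (-dims / d_head)          # standard RoPE inverse frequencies
    inv_freq_interp = inv_freq / scale           # fully position-interpolated version

    # How many full rotations each dimension completes over the original context window.
    rotations = orig_ctx * inv_freq / (2 * np.pi)
    # 1.0 for fast-rotating (high-freq) dims -> keep original; 0.0 for slow dims -> interpolate.
    keep = np.clip((rotations - beta_slow) / (beta_fast - beta_slow), 0.0, 1.0)
    return inv_freq * keep + inv_freq_interp * (1.0 - keep)

orig = 10000.0 ** (-np.arange(0, 64, 2) / 64)
blended = yarn_inv_freq()
print("high-freq dims unchanged:", np.allclose(orig[:4], blended[:4]))  # True
print("lowest-freq dim scaled by:", blended[-1] / orig[-1])             # ~1/32
```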
- Total params: ~1T
- Active per token: ~37B
- Routing: top-8 of 384 routed, 1 shared
- Context: 128K
- Trained with: MuonClip optimizer