Kimi K2
Kimi K2 scales the DeepSeek-V3 recipe from 256 routed experts to 384, with one always-on shared expert and a top-8 router. Same bones, bigger room.
The 256 → 384 routed-expert jump is where the "1T total, 37B active" numbers come from. Every token still activates 8 of those experts plus the shared one, so per-token compute stays in the ~37B range while parameter capacity climbs by 50%. The Mixture of Experts lesson walks through the top-k routing math Kimi K2 uses; K2 is the current scale ceiling of that same recipe, not a new mechanism.
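To make the routing concrete, here is a minimal sketch of top-8-of-384 selection with an always-on shared expert. The shapes, weights, and single-matrix "experts" are toy assumptions for readability; only the 384-routed / top-8 / 1-shared structure mirrors the description above.

```python
import numpy as np

N_ROUTED = 384   # routed experts
TOP_K = 8        # experts activated per token
D_MODEL = 16     # toy hidden size (the real model is far larger)

rng = np.random.default_rng(0)

# Toy parameters: a router and a bank of tiny experts (one matrix each), plus a shared expert.
router_w = rng.standard_normal((D_MODEL, N_ROUTED)) * 0.02
experts = rng.standard_normal((N_ROUTED, D_MODEL, D_MODEL)) * 0.02
shared_expert = rng.standard_normal((D_MODEL, D_MODEL)) * 0.02

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token's hidden state through top-8 of 384 experts plus the shared expert."""
    logits = x @ router_w                      # (N_ROUTED,) affinity score per routed expert
    top_idx = np.argsort(logits)[-TOP_K:]      # indices of the 8 highest-scoring experts
    gates = np.exp(logits[top_idx])
    gates /= gates.sum()                       # normalize gate weights over the selected 8
    routed_out = sum(g * (x @ experts[i]) for g, i in zip(gates, top_idx))
    shared_out = x @ shared_expert             # always-on shared expert, no gating
    return routed_out + shared_out

token = rng.standard_normal(D_MODEL)
print(moe_forward(token).shape)                # (16,)
# Only 8 of 384 routed experts run per token, which is why active params stay ~37B
# while total parameter capacity reaches ~1T.
```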
The interesting lever is the training optimizer. Moonshot trained Kimi K2 with MuonClip instead of AdamW: a Muon-based optimizer that adds QK-clip, rescaling an attention head's query and key projection weights whenever that head's maximum attention logit exceeds a threshold. Training stability improved noticeably at the 1T scale, and the published loss curves are smoother than contemporaneous AdamW MoE runs. Most MoE papers hand-wave past the optimizer choice; Moonshot made it the headline.
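A schematic of the QK-clip step, under the assumption that it works roughly as the Kimi K2 report describes: per head, if the maximum attention logit exceeds a threshold, shrink the query and key projections to pull it back under. The threshold value, the alpha split between Q and K, and the single-head framing are illustrative, not Moonshot's actual settings or code.

```python
import numpy as np

def qk_clip(w_q: np.ndarray, w_k: np.ndarray, x: np.ndarray,
            tau: float = 100.0, alpha: float = 0.5) -> float:
    """Rescale one head's W_q / W_k in place if its max attention logit exceeds tau."""
    q = x @ w_q                                 # (seq, d_head) queries
    k = x @ w_k                                 # (seq, d_head) keys
    logits = (q @ k.T) / np.sqrt(w_q.shape[1])  # scaled dot-product attention logits
    s_max = logits.max()
    if s_max > tau:
        gamma = tau / s_max                     # shrink factor, split between Q and K
        w_q *= gamma ** alpha
        w_k *= gamma ** (1.0 - alpha)
    return s_max

rng = np.random.default_rng(0)
d_model, d_head, seq = 32, 16, 64
w_q = rng.standard_normal((d_model, d_head))
w_k = rng.standard_normal((d_model, d_head))
x = rng.standard_normal((seq, d_model))

before = qk_clip(w_q, w_k, x, tau=25.0)  # low threshold so the toy example actually clips
after = qk_clip(w_q, w_k, x, tau=25.0)   # re-measure after rescaling: now capped at tau
print(f"max logit before clip: {before:.1f}, after: {after:.1f}")
```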
Context length is 128K via YaRN-family position scaling — same extension approach as DeepSeek-V3, different base RoPE configuration. See Stretching context for why YaRN-style frequency-band rescaling avoids the high-frequency extrapolation collapse, and RoPE as rotation for the rotation math the extension relies on.
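A simplified sketch of the YaRN-style frequency blend: low-frequency RoPE bands are position-interpolated while high-frequency bands keep their original rotation rates. The original context length, scale factor, and ramp thresholds below are made-up stand-ins rather than Kimi K2's published settings, and YaRN's attention-temperature term is omitted.

```python
import numpy as np

def yarn_inv_freq(d_head: int = 64, base: float = 10000.0, scale: float = 32.0,
                  orig_ctx: int = 4096, beta_fast: float = 32.0, beta_slow: float = 1.0):
    """Blend original and position-interpolated RoPE inverse frequencies per dimension."""
    dims = np.arange(0, d_head, 2)
    inv_freq = base ** (-dims / d_head)          # standard RoPE inverse frequencies
    inv_freq_interp = inv_freq / scale           # fully position-interpolated version

    # How many full rotations each dimension completes over the original context window.
    rotations = orig_ctx * inv_freq / (2 * np.pi)
    # 1.0 for fast-rotating (high-freq) dims -> keep original; 0.0 for slow dims -> interpolate.
    keep = np.clip((rotations - beta_slow) / (beta_fast - beta_slow), 0.0, 1.0)
    return inv_freq * keep + inv_freq_interp * (1.0 - keep)

orig = 10000.0 ** (-np.arange(0, 64, 2) / 64)
blended = yarn_inv_freq()
print("high-freq dims unchanged:", np.allclose(orig[:4], blended[:4]))  # True
print("lowest-freq dim scaled by:", blended[-1] / orig[-1])             # ~1/32
```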
- Total params: ~1T
- Active per token: ~37B
- Routing: top-8 of 384 routed, 1 shared
- Context: 128K
- Trained with: MuonClip optimizer