DeepSeek-V3
DeepSeek-V3 is the reference architecture this site keeps pointing at: MLA for attention, MoE for the MLP, and MTP bolted on the output for inference speed. 671B parameters total, 37B active per token.
DeepSeek-V3 matters because it's the first big open-weights model to ship all three of the heavy optimizations Microscale teaches as independent lessons. The MLA lesson shows how down-projecting keys and values into a shared latent cuts the 128K-context KV cache by 93% vs Llama-style MHA. The MoE lesson shows how 8-of-256 routing plus a single shared expert lets 671B parameters activate just 37B per token. The MTP lesson shows how one extra depth-1 MTP module pushes decode throughput ~1.8× with no quality loss.
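The MLA trick described above can be sketched in a few lines of NumPy. Dimensions here are toy values, not DeepSeek-V3's real sizes, and random matrices stand in for trained weights:

```python
import numpy as np

# Toy MLA sketch: cache one shared latent per token instead of full
# per-head K/V. Dims are illustrative, not DeepSeek-V3's actual sizes.
d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02          # shared down-projection
W_up_k = rng.standard_normal((n_heads, d_latent, d_head)) * 0.02  # per-head K up-projection
W_up_v = rng.standard_normal((n_heads, d_latent, d_head)) * 0.02  # per-head V up-projection

h = rng.standard_normal((1, d_model))    # one token's hidden state
c = h @ W_down                           # the ONLY thing cached: d_latent floats
k = np.einsum('tl,hld->htd', c, W_up_k)  # reconstruct per-head keys on the fly
v = np.einsum('tl,hld->htd', c, W_up_v)  # ...and values

cache_mha = 2 * n_heads * d_head         # floats cached per token under standard MHA
cache_mla = d_latent                     # floats cached per token under MLA
print(f"cache ratio: {cache_mla / cache_mha:.3f}")
```

At these toy dims the latent is ~16× smaller than full K/V; the 93% figure in the lesson comes from the real model's sizes (and MLA's extra decoupled RoPE dims, omitted here for brevity).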
None of these three ideas is unique to DeepSeek. MLA traces to earlier compressed-attention research; MoE routing is a decade old; multi-token prediction (MTP) was popularized by a Meta paper in 2024. DeepSeek-V3's contribution is the assembly: making all three work together at 671B scale, on a training budget roughly one-tenth of reported GPT-4-class compute, with the weights published openly.
Context is 128K via YaRN extension — same technique as Kimi K2, earlier base. See Stretching context for the frequency-band interpolation that keeps the model coherent past its trained context length. If you read one hub page alongside one lesson, read this alongside the MLA lesson: the "Why MLA matters for DeepSeek-V3" framing is where the architectural pieces click.
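The frequency-band interpolation that lesson covers can be sketched directly. This is a minimal rendering of the NTK-by-parts idea YaRN builds on; the `alpha`/`beta` band edges and other parameters here are illustrative choices, not the values any particular model ships:

```python
import numpy as np

# Sketch of NTK-by-parts frequency scaling (the core of YaRN): each RoPE
# dimension is treated by how many full rotations it completes within the
# original context. High-frequency dims (many rotations) are left alone;
# low-frequency dims (under one rotation) are fully interpolated; a linear
# ramp blends the band in between. Parameter values are illustrative.
def yarn_freqs(d_head=64, base=10000.0, orig_ctx=4096, scale=32.0,
               alpha=1.0, beta=32.0):
    inv_freq = 1.0 / base ** (np.arange(0, d_head, 2) / d_head)
    wavelen = 2 * np.pi / inv_freq
    rotations = orig_ctx / wavelen            # rotations within original context
    ramp = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    # ramp = 1 -> frequency untouched; ramp = 0 -> divided by the scale factor
    return inv_freq * (ramp + (1 - ramp) / scale)

f = yarn_freqs()
print(f[0], f[-1])   # fastest dim unchanged; slowest divided by `scale`
```

The point the lesson makes falls out of the ramp: stretching the fast dims would blur local token order, while leaving the slow dims unstretched would push them past anything seen in training.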
- Total params: 671B
- Active per token: 37B
- Routing: top-8 of 256 routed, 1 shared
- Context: 128K (YaRN)
- Unique stack: MLA + MoE + MTP in production
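The routing line in the stats above hides a simple mechanism. Here is a toy top-k MoE forward pass with one always-on shared expert — sizes are tiny and the gate is a plain softmax over the selected experts, a simplification of V3's sigmoid gating with bias-based load balancing:

```python
import numpy as np

# Toy MoE sketch: top-k of n routed experts plus one shared expert.
# Sizes are illustrative; DeepSeek-V3 itself routes top-8 of 256.
rng = np.random.default_rng(0)
d, n_experts, k = 32, 16, 4

gate_W = rng.standard_normal((d, n_experts)) * 0.1
expert_W = rng.standard_normal((n_experts, d, d)) * 0.1  # one matrix per routed expert
shared_W = rng.standard_normal((d, d)) * 0.1             # shared expert, always active

def moe_forward(x):
    scores = x @ gate_W                # router logits, one per routed expert
    topk = np.argsort(scores)[-k:]     # this token's k chosen experts
    w = np.exp(scores[topk])
    w /= w.sum()                       # normalize over the selected experts only
    out = x @ shared_W                 # shared expert always fires
    for wi, i in zip(w, topk):         # only k of n_experts matmuls actually run
        out = out + wi * (x @ expert_W[i])
    return out

y = moe_forward(rng.standard_normal(d))
active = shared_W.size + k * d * d          # params touched for this token
total = shared_W.size + n_experts * d * d   # params that exist in the layer
print(y.shape, active / total)
```

Scaling the same arithmetic to 256 experts with top-8 routing is how 671B total parameters collapse to 37B active per token.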
- Act II · 12 min · 55 xp — MLA: compressing the KV cache into a latent
  How DeepSeek compresses the 128K-context KV cache by 93% — down-project keys and values into a shared latent, reconstruct per-head on the fly
- Act II · 22 min · 65 xp — Eight stations, two lanterns
  Why DeepSeek-V3 claims 671B parameters but only activates 37B per token. Top-k routing, shared experts, and the load-balance thermostat
- Act II · 14 min · 55 xp — Stretching context: YaRN, NTK-by-parts, and attention sinks
  Position Interpolation vs NTK-aware vs YaRN — why low-freq and high-freq RoPE dimensions need different treatment. Plus attention sinks
- Act VIII · 22 min · 70 xp — Predicting further ahead: MTP breaks the one-token-per-step contract
  DeepSeek-V3 and Qwen3-Next predict 2-4 tokens in one forward pass — sequential modules that chain, acceptance rates, and the speed tradeoff