DeepSeek-V3
DeepSeek-V3 is the reference architecture this site keeps pointing at: MLA for attention, MoE for the MLP, and MTP bolted on the output for inference speed. 671B parameters total, 37B active per token.
DeepSeek-V3 matters because it's the first big open-weights model to ship all three of the heavy optimizations Microscale teaches as independent lessons. The MLA lesson shows how down-projecting keys and values into a shared latent cuts the 128K-context KV cache by 93% vs Llama-style MHA. The MoE lesson shows how 8-of-256 routing plus a single shared expert lets 671B parameters activate just 37B per token. The MTP lesson shows how one extra depth-1 MTP module pushes decode throughput ~1.8× with no quality loss.
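The MLA trick described above can be sketched in a few lines of NumPy. Dimensions here are toy values, not DeepSeek-V3's real sizes, and random matrices stand in for trained weights:

```python
import numpy as np

# Toy MLA sketch: cache one shared latent per token instead of full
# per-head K/V. Dims are illustrative, not DeepSeek-V3's actual sizes.
d_model, n_heads, d_head, d_latent = 1024, 16, 64, 128

rng = np.random.default_rng(0)
W_down = rng.standard_normal((d_model, d_latent)) * 0.02          # shared down-projection
W_up_k = rng.standard_normal((n_heads, d_latent, d_head)) * 0.02  # per-head K up-projection
W_up_v = rng.standard_normal((n_heads, d_latent, d_head)) * 0.02  # per-head V up-projection

h = rng.standard_normal((1, d_model))    # one token's hidden state
c = h @ W_down                           # the ONLY thing cached: d_latent floats
k = np.einsum('tl,hld->htd', c, W_up_k)  # reconstruct per-head keys on the fly
v = np.einsum('tl,hld->htd', c, W_up_v)  # ...and values

cache_mha = 2 * n_heads * d_head         # floats cached per token under standard MHA
cache_mla = d_latent                     # floats cached per token under MLA
print(f"cache ratio: {cache_mla / cache_mha:.3f}")
```

At these toy dims the latent is ~16× smaller than full K/V; the 93% figure in the lesson comes from the real model's sizes (and MLA's extra decoupled RoPE dims, omitted here for brevity).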
None of these three ideas is unique to DeepSeek. MLA traces to earlier compressed-attention research; MoE routing is a decade old; multi-token prediction (MTP) was popularized by a Meta paper in 2024. DeepSeek-V3's contribution is the assembly: making all three work together at 671B scale, on a training budget roughly one-tenth of reported GPT-4-class compute, with the weights published openly.
Context is 128K via YaRN extension — same technique as Kimi K2, earlier base. See Stretching context for the frequency-band interpolation that keeps the model coherent past its trained context length. If you read one hub page alongside one lesson, read this alongside the MLA lesson: the "Why MLA matters for DeepSeek-V3" framing is where the architectural pieces click.
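The frequency-band interpolation that lesson covers can be sketched directly. This is a minimal rendering of the NTK-by-parts idea YaRN builds on; the `alpha`/`beta` band edges and other parameters here are illustrative choices, not the values any particular model ships:

```python
import numpy as np

# Sketch of NTK-by-parts frequency scaling (the core of YaRN): each RoPE
# dimension is treated by how many full rotations it completes within the
# original context. High-frequency dims (many rotations) are left alone;
# low-frequency dims (under one rotation) are fully interpolated; a linear
# ramp blends the band in between. Parameter values are illustrative.
def yarn_freqs(d_head=64, base=10000.0, orig_ctx=4096, scale=32.0,
               alpha=1.0, beta=32.0):
    inv_freq = 1.0 / base ** (np.arange(0, d_head, 2) / d_head)
    wavelen = 2 * np.pi / inv_freq
    rotations = orig_ctx / wavelen            # rotations within original context
    ramp = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    # ramp = 1 -> frequency untouched; ramp = 0 -> divided by the scale factor
    return inv_freq * (ramp + (1 - ramp) / scale)

f = yarn_freqs()
print(f[0], f[-1])   # fastest dim unchanged; slowest divided by `scale`
```

The point the lesson makes falls out of the ramp: stretching the fast dims would blur local token order, while leaving the slow dims unstretched would push them past anything seen in training.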
- Total params: 671B
- Active per token: 37B
- Routing: top-8 of 256 routed, 1 shared
- Context: 128K (YaRN)
- Unique stack: MLA + MoE + MTP in production
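The routing line in the stats above hides a simple mechanism. Here is a toy top-k MoE forward pass with one always-on shared expert — sizes are tiny and the gate is a plain softmax over the selected experts, a simplification of V3's sigmoid gating with bias-based load balancing:

```python
import numpy as np

# Toy MoE sketch: top-k of n routed experts plus one shared expert.
# Sizes are illustrative; DeepSeek-V3 itself routes top-8 of 256.
rng = np.random.default_rng(0)
d, n_experts, k = 32, 16, 4

gate_W = rng.standard_normal((d, n_experts)) * 0.1
expert_W = rng.standard_normal((n_experts, d, d)) * 0.1  # one matrix per routed expert
shared_W = rng.standard_normal((d, d)) * 0.1             # shared expert, always active

def moe_forward(x):
    scores = x @ gate_W                # router logits, one per routed expert
    topk = np.argsort(scores)[-k:]     # this token's k chosen experts
    w = np.exp(scores[topk])
    w /= w.sum()                       # normalize over the selected experts only
    out = x @ shared_W                 # shared expert always fires
    for wi, i in zip(w, topk):         # only k of n_experts matmuls actually run
        out = out + wi * (x @ expert_W[i])
    return out

y = moe_forward(rng.standard_normal(d))
active = shared_W.size + k * d * d          # params touched for this token
total = shared_W.size + n_experts * d * d   # params that exist in the layer
print(y.shape, active / total)
```

Scaling the same arithmetic to 256 experts with top-8 routing is how 671B total parameters collapse to 37B active per token.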
- Act II · 12 min · 55 xp — MLA: compressing the KV cache into a latent
  How DeepSeek compresses the 128K-context KV cache by 93% — down-project keys and values into a shared latent, reconstruct per-head on the fly
- Act II · 22 min · 65 xp — Eight stations, two lanterns
  Why DeepSeek-V3 claims 671B parameters but only activates 37B per token. Top-k routing, shared experts, and the load-balance thermostat
- Act II · 14 min · 55 xp — Stretching context: YaRN, NTK-by-parts, and attention sinks
  Position Interpolation vs NTK-aware vs YaRN — why low-freq and high-freq RoPE dimensions need different treatment. Plus attention sinks
- Act VIII · 22 min · 70 xp — Predicting further ahead: MTP breaks the one-token-per-step contract
  DeepSeek-V3 and Qwen3-Next predict 2-4 tokens in one forward pass — sequential modules that chain, acceptance rates, and the speed tradeoff