DeepSeek-R1
DeepSeek-R1 is DeepSeek-V3's base model with a reasoning post-training stage on top. The architecture doesn't change. The training recipe does.
R1 is about the training pipeline, not the model shape. The whole architecture discussion lives in the DeepSeek-V3 stack — R1 inherits all of it unchanged. What R1 adds is a post-training stage that teaches the model to think longer: emit a long chain of tokens between a `<think>` opening tag and the final answer, and self-correct during that thinking. The Reasoning lesson walks through the whole pipeline — R1-Zero's pure-RL "aha moment," the cold-start distillation that makes R1 readable, and the final RL-on-verifiable-rewards step.
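The `<think>`-then-answer output format above is easy to consume downstream. A minimal sketch of a parser for it — the tag names match R1's convention, but this helper function is an illustration, not part of any DeepSeek library:

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split an R1-style output into (thinking, answer).

    Assumes the <think>...</think> convention described above;
    if the tags are absent, everything is treated as the answer.
    """
    open_tag, close_tag = "<think>", "</think>"
    if open_tag in text and close_tag in text:
        start = text.index(open_tag) + len(open_tag)
        end = text.index(close_tag)
        return text[start:end].strip(), text[end + len(close_tag):].strip()
    return "", text.strip()
```

In practice the thinking span is often discarded or hidden in the UI, while evaluation and reward checking run only on the answer span.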
The RL algorithm is GRPO — Group Relative Policy Optimization. The GRPO lesson explains why it works without a critic network: you sample a group of completions, score them against the verifiable reward (code compiles, math answer checks), and use the group's own mean as the advantage baseline. No value model to train, no actor-critic instability, just direct gradient on the policy-likelihood ratio.
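The baseline trick is small enough to write out. A minimal sketch of the group-relative advantage from the GRPO paper — reward minus the group mean, scaled by the group standard deviation — with no learned critic anywhere:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages for one prompt's group of completions.

    The group of sampled completions is its own baseline:
    advantage_i = (r_i - mean(group)) / std(group).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Every completion scored the same -- no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: 4 completions for one prompt, scored 1.0 if the answer checks out
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)  # correct ones get +1, wrong ones -1
```

Each advantage then weights the policy-likelihood ratio for its completion's tokens in the usual clipped PPO-style objective; the only difference from actor-critic methods is where the baseline comes from.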
The "verifiable rewards" framing matters because it's what makes R1's training domain narrower than OpenAI o1's. R1 is strong on math, code, and formal reasoning, where correctness is mechanically checkable. It's weaker on open-ended generation and stylistic preference tasks — those need DPO or RLHF, which the DPO lesson covers. R1 picks its lane and wins it.
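"Mechanically checkable" means the reward is a plain function, not a learned model. A minimal sketch of a math-style verifier, assuming a `#### <answer>` marker convention (GSM8K-style, chosen here for illustration — not DeepSeek's exact format):

```python
def math_reward(completion: str, expected: str) -> float:
    """1.0 if the completion's final answer matches, else 0.0.

    Assumes the answer follows a '####' marker -- an illustrative
    convention, not R1's actual extraction logic.
    """
    if "####" not in completion:
        return 0.0  # no parseable final answer
    answer = completion.split("####")[-1].strip()
    return 1.0 if answer == expected else 0.0
```

A code-domain reward is the same idea with a compiler or unit-test run in place of the string match. The point is that nothing here can be reward-hacked by sounding confident: the answer is right or it isn't.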
- Total params: 671B (same as V3)
- Active per token: 37B
- Base model: DeepSeek-V3
- Post-training: GRPO on verifiable rewards
- Specialty: test-time reasoning
- Act IV · 18 min · 60 xp · Teaching a model to think longer: How o1 and DeepSeek-R1 trade train-time for test-time compute — best-of-N sampling, process reward models, GRPO-on-verifiable-rewards
- Act VI · 11 min · 55 xp · GRPO and RLVR: Group-relative advantages without a critic network — the RL algorithm behind DeepSeek-R1 and reasoning SLMs, visually explained