DeepSeek-R1
DeepSeek-R1 is DeepSeek-V3's base model with a reasoning post-training stage on top. The architecture doesn't change. The training recipe does.
R1 is about the training pipeline, not the model shape. The whole architecture discussion lives in the DeepSeek-V3 stack — R1 inherits all of it unchanged. What R1 adds is a post-training stage that teaches the model to think longer: emit a long chain of tokens between a `<think>` opening tag and the final answer, and self-correct during that thinking. The Reasoning lesson walks through the whole pipeline — R1-Zero's pure-RL "aha moment," the cold-start distillation that makes R1 readable, and the final RL-on-verifiable-rewards step.
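The `<think>`-then-answer output format above is easy to consume downstream. A minimal sketch of a parser for it — the tag names match R1's convention, but this helper function is an illustration, not part of any DeepSeek library:

```python
def split_reasoning(text: str) -> tuple[str, str]:
    """Split an R1-style output into (thinking, answer).

    Assumes the <think>...</think> convention described above;
    if the tags are absent, everything is treated as the answer.
    """
    open_tag, close_tag = "<think>", "</think>"
    if open_tag in text and close_tag in text:
        start = text.index(open_tag) + len(open_tag)
        end = text.index(close_tag)
        return text[start:end].strip(), text[end + len(close_tag):].strip()
    return "", text.strip()
```

In practice the thinking span is often discarded or hidden in the UI, while evaluation and reward checking run only on the answer span.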
The RL algorithm is GRPO — Group Relative Policy Optimization. The GRPO lesson explains why it works without a critic network: you sample a group of completions, score them against the verifiable reward (code compiles, math answer checks), and use the group's own mean as the advantage baseline. No value model to train, no actor-critic instability, just direct gradient on the policy-likelihood ratio.
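The baseline trick is small enough to write out. A minimal sketch of the group-relative advantage from the GRPO paper — reward minus the group mean, scaled by the group standard deviation — with no learned critic anywhere:

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages for one prompt's group of completions.

    The group of sampled completions is its own baseline:
    advantage_i = (r_i - mean(group)) / std(group).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Every completion scored the same -- no learning signal.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Example: 4 completions for one prompt, scored 1.0 if the answer checks out
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)  # correct ones get +1, wrong ones -1
```

Each advantage then weights the policy-likelihood ratio for its completion's tokens in the usual clipped PPO-style objective; the only difference from actor-critic methods is where the baseline comes from.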
The "verifiable rewards" framing matters because it's what makes R1's training domain narrower than OpenAI o1's. R1 is strong on math, code, and formal reasoning, where correctness is mechanically checkable. It's weaker on open-ended generation and stylistic preference tasks — those need DPO or RLHF, which the DPO lesson covers. R1 picks its lane and wins it.
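"Mechanically checkable" means the reward is a plain function, not a learned model. A minimal sketch of a math-style verifier, assuming a `#### <answer>` marker convention (GSM8K-style, chosen here for illustration — not DeepSeek's exact format):

```python
def math_reward(completion: str, expected: str) -> float:
    """1.0 if the completion's final answer matches, else 0.0.

    Assumes the answer follows a '####' marker -- an illustrative
    convention, not R1's actual extraction logic.
    """
    if "####" not in completion:
        return 0.0  # no parseable final answer
    answer = completion.split("####")[-1].strip()
    return 1.0 if answer == expected else 0.0
```

A code-domain reward is the same idea with a compiler or unit-test run in place of the string match. The point is that nothing here can be reward-hacked by sounding confident: the answer is right or it isn't.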
- Total params: 671B (same as V3)
- Active per token: 37B
- Base model: DeepSeek-V3
- Post-training: GRPO on verifiable rewards
- Specialty: test-time reasoning
- Act IV · 18 min · 60 xp · Teaching a model to think longer: How o1 and DeepSeek-R1 trade train-time for test-time compute — best-of-N sampling, process reward models, GRPO-on-verifiable-rewards
- Act VI · 11 min · 55 xp · GRPO and RLVR: Group-relative advantages without a critic network — the RL algorithm behind DeepSeek-R1 and reasoning SLMs, visually explained