Predicting further ahead
Speculative decoding gave us our first answer to the memory wall: ask a cheap draft model to propose several tokens, then verify them in one big forward pass of the target. Multi-token prediction — MTP — is a different move from the same playbook. Instead of a separate draft model, the base model itself grows a short chain of prediction modules and learns, during pretraining, to emit two, three, even four tokens per step.
DeepSeek-V3 trained its MTP modules alongside the main trunk from the start. Qwen3-Next (Sept 2025) did the same and shipped MTP in its default serving path — the first major non-DeepSeek model to do so. Both quote inference speedups in the 1.5–2× range at production batch sizes, without touching the quality of the main head's output. The trick is elegant, the math is clean, and the why it works is the whole memory-wall lesson cashing its check.
[Interactive acceptance calculator] DeepSeek-V3 preset (per-position acceptance ~0.85; accept@k=2: 0.72; accept@k=3: 0.59), batch 1: one forward pass produces ~2.89 tokens in expectation — 2.51× faster than vanilla AR once the per-head compute tax and batch-size scaling are factored in.
Why decode waits on memory
Remember the roofline. An H100's fp16 matrix cores can do ~990 TFLOP/s; its HBM3 feeds ~3.35 TB/s. The ratio — the machine's arithmetic intensity balance point — sits around 295 FLOPs per byte. Above that intensity your kernel is compute-bound; below it, memory-bound.
Decode at batch 1 on a 70B model has an arithmetic intensity of roughly 1–2 FLOPs per byte. A single token forward reads ~140 GB of weights from HBM and does maybe 140 GFLOPs of actual arithmetic on them. Two hundred to six hundred times below the roofline. The GPU spends most of its cycles just waiting for the next weight shard to land in its registers, then crunches on that shard for a microsecond, then waits again.
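A back-of-envelope version of that arithmetic, as a sketch; the constants are the approximate figures quoted above and vary by SKU and precision:

```python
# Roofline arithmetic for batch-1 decode on an H100, using the rough figures above.
peak_flops = 990e12      # fp16 tensor-core peak, FLOP/s (approximate)
hbm_bw     = 3.35e12     # HBM3 bandwidth, bytes/s (approximate)
balance    = peak_flops / hbm_bw              # ~295 FLOPs per byte: the ridge point

# One decode step of a dense 70B model in fp16, batch 1, ignoring the KV cache:
params     = 70e9
bytes_read = params * 2                       # every weight streamed once, 2 bytes each (~140 GB)
flops_done = params * 2                       # ~2 FLOPs per weight per token (multiply-accumulate)
intensity  = flops_done / bytes_read          # ~1 FLOP per byte

print(f"ridge point: {balance:.0f} FLOPs/byte")
print(f"decode     : {intensity:.0f} FLOP/byte ({balance / intensity:.0f}x below it)")
```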
That is the memory-wall lesson. Speculative decoding exploits it by amortising the weight-load across k verified tokens; MTP exploits it by amortising across D predicted tokens in one pass. Both are fighting the same enemy from different angles — one at serving time with a draft model, the other in the architecture with extra heads. When you see the 45 → 125 tok/s jump on the ticker above, you're watching the roofline gap close.
The one-matmul-per-token contract
A vanilla autoregressive transformer produces exactly one token per forward pass. The reason is not a law of physics; it is an accidental contract. Each layer of the trunk emits a hidden state, the LM head projects the final one h_t to a vocabulary distribution, you sample a token x_{t+1}, and that token becomes the next input. To predict x_{t+2} you must know x_{t+1} — causal attention demands it — so you run the whole stack again.
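In code, the contract is just the decode loop every serving engine runs. A minimal greedy sketch, assuming an HF-style causal LM interface and eliding KV caching:

```python
import torch

@torch.no_grad()
def greedy_decode(model, input_ids, max_new_tokens):
    # One full forward pass of the whole stack per generated token:
    # every step streams the entire weight set from HBM to produce a single id.
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits                      # [B, T, vocab]
        next_id = logits[:, -1, :].argmax(-1, keepdim=True)   # only the last position is used
        input_ids = torch.cat([input_ids, next_id], dim=-1)   # feed it back; run everything again
    return input_ids
```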
The model has, in fact, more information than it uses. The trunk at step t already has a rich context representation; it could plausibly guess several next tokens. The standard LM head just doesn't ask it to. Everything MTP does, at bottom, is ask it to.
The standard loss supervises only the next token. Nothing further. The supervised signal never touches what the model thinks about x_{t+2} or x_{t+3}.
MTP's insight: train on further targets
Gloeckle, Youbi Idrissi, Rozière, Lopez-Paz & Synnaeve (Meta, April 2024) asked the minimal question: what if we just add more output heads? Put n linear heads on top of the same shared trunk, have head k predict x_{t+k}, and sum the losses. No new hyperparameters. No routing. No separate draft model. Train that, and see what happens.
What happened was a small pretraining quality win, a non-trivial inference speedup, and a new architectural primitive that nobody had been using. On 13B-scale code models Gloeckle et al. reported +12% on HumanEval and +17% on MBPP at pass@1, with MTP trained from scratch versus a control next-token-only model — and up to 3× wall-clock inference speedup when the extra heads were used as speculative drafters at serving time.
The Meta design is deliberately minimal: parallel linear heads, all reading the same final trunk hidden state. That's the same shape as Medusa (Cai et al., 2024) but trained jointly with the backbone rather than bolted on after the fact. It leaves an obvious axis unexplored: what if the heads could see each other?
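For concreteness, a sketch of that minimal shape (n independent heads on one trunk, losses summed), with illustrative names rather than the paper's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelMTPHeads(nn.Module):
    """Gloeckle-style multi-token prediction in spirit: head k reads the shared
    trunk hidden state and is trained on the token k steps ahead."""
    def __init__(self, d_model: int, vocab_size: int, n_heads: int = 4):
        super().__init__()
        self.heads = nn.ModuleList(
            nn.Linear(d_model, vocab_size, bias=False) for _ in range(n_heads)
        )

    def loss(self, trunk_hidden: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        # trunk_hidden: [B, T, d] final trunk states for input positions 0..T-1
        # tokens:       [B, T + n_heads] ids; tokens[:, i + k] is k steps ahead of position i
        T = trunk_hidden.size(1)
        total = trunk_hidden.new_zeros(())
        for k, head in enumerate(self.heads, start=1):
            logits = head(trunk_hidden)                       # [B, T, vocab]
            target = tokens[:, k : k + T]                     # the token k steps ahead
            total = total + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), target.reshape(-1)
            )
        return total                                          # summed over heads
```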
DeepSeek-V3's upgrade — sequential modules
DeepSeek-V3 (Dec 2024) did exactly that. Their MTP implementation diverges from Gloeckle's parallel heads in one crucial way: the modules form a chain, not a fan. Each module takes as input both the previous module's hidden state and the embedding of the token that previous module just predicted, runs them through a full transformer block, and emits its own hidden state — which the next module then consumes. Concretely, module k:
- Take the previous hidden state h^{k−1} (for k = 1, this is the trunk's final hidden h_t).
- Take the embedding of the previously-predicted token x_{t+k} (for k = 1, the main head's sample) through the shared embedding table.
- RMSNorm both independently. Concatenate on the feature axis → a 2d-dim vector.
- Project through the unshared M_k to get a d-dim input to the transformer block.
- Run the unshared transformer block → hidden state h^k.
- Project through the shared output head → logits. Sample (or argmax) x_{t+k+1}. Feed everything to module k+1.
Notice what is shared and what is not. The embedding and the output projection are shared with the main model — no extra parameters for the token vocabulary, no extra parameters for output logits. But each module's projection matrix and transformer block are unshared across depths. This matters: the job of predicting t+2 is qualitatively different from predicting t+3 (the model knows one less actual token and has to commit to a further-out guess), so giving each depth its own block lets it specialize.
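A sketch of one depth of that chain, assuming a recent PyTorch for nn.RMSNorm; the `block` argument stands in for whatever transformer block the trunk uses, and all names here are illustrative rather than DeepSeek's actual code:

```python
import torch
import torch.nn as nn

class MTPModule(nn.Module):
    """One depth of a DeepSeek-V3-style MTP chain, wired as in the list above."""
    def __init__(self, d_model: int, block: nn.Module):
        super().__init__()
        self.norm_h = nn.RMSNorm(d_model)                         # normalise the incoming hidden state
        self.norm_e = nn.RMSNorm(d_model)                         # normalise the token embedding
        self.proj   = nn.Linear(2 * d_model, d_model, bias=False) # unshared projection M_k
        self.block  = block                                       # unshared transformer block

    def forward(self, prev_hidden, prev_token, embed, lm_head):
        # embed and lm_head are the model's *shared* embedding table and output head.
        e = embed(prev_token)                                     # embedding of the token just predicted
        x = torch.cat([self.norm_h(prev_hidden), self.norm_e(e)], dim=-1)  # concat on feature axis
        h = self.block(self.proj(x))                              # this depth's own hidden state
        logits = lm_head(h)                                       # shared output projection
        return h, logits.argmax(-1)                               # hidden for module k+1, candidate token
```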
The cost is modest. For DeepSeek-V3 — 671B total parameters, 7168 hidden dim, 61 trunk layers, 129280 vocab — adding one MTP module (D = 1) adds roughly 12–14B parameters: one 14336 × 7168 projection plus one full transformer block. That's ~2% of total parameters, in exchange for a ~1.8× inference speedup under self-speculative decoding. The paper ships with D = 1 and does not publish ablations against deeper chains, so the public evidence for how deeper chains would have performed is simply absent — the architecture supports them, but the trade-off wasn't measured in this work.
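A quick check of that arithmetic (the ~12B block figure is taken from the estimate above, not derived here):

```python
# Parameter cost of one DeepSeek-V3 MTP module, using the figures quoted above.
d_model      = 7168
proj_params  = (2 * d_model) * d_model        # unshared 14336 x 7168 projection ≈ 0.10B
block_params = 12e9                           # one full transformer block (the dominant term)
total_params = 671e9

module = proj_params + block_params
print(f"MTP module ≈ {module / 1e9:.1f}B params, {module / total_params:.1%} of the model")
# → roughly 12.1B and ~1.8%, i.e. the "~2% of total parameters" quoted above
```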
MTP mode: modules chain. Module k's input is module k−1's hidden state concatenated with the embedding of the token module k−1 just predicted. The chain lets the model reason about what it said before predicting the next token — which is why acceptance stays high further out.
Training: one trunk, D+1 losses
Training an MTP model looks almost exactly like training a plain LM, with one change. The total loss is the main next-token loss plus the average of the D MTP module losses, weighted by a schedule that decays over training:
L_total = L_main + (λ / D) · Σ_{k=1}^{D} L_MTP^(k)

where L_MTP^(k) is the cross-entropy of module k's logits against the true token at position t+k+1. The 1/D factor is a fairness knob: at D > 1 the MTP losses are averaged, not summed, so deeper chains don't overwhelm the main loss.
λ controls how much the model cares about MTP. DeepSeek-V3's schedule is specific: λ = 0.3 for the first 10T training tokens, then λ = 0.1 for the final 4.8T. The intuition is that early on the MTP signal provides a useful auxiliary objective — the model has to think slightly further ahead, which improves its representations. Later, as the main model crystallises, MTP is kept around primarily for inference speedup; a smaller weight prevents it from pulling the trunk toward optimising for the wrong distribution.
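As a sketch, using the published thresholds but with function names that are mine:

```python
def mtp_lambda(tokens_seen: float) -> float:
    """DeepSeek-V3's published schedule: 0.3 for the first 10T tokens, 0.1 after."""
    return 0.3 if tokens_seen < 10e12 else 0.1

def total_loss(main_loss, mtp_losses, tokens_seen):
    """L_total = L_main + (lambda / D) * sum_k L_MTP^(k), as defined above.
    mtp_losses is the list of D per-module cross-entropies (D = 1 for V3)."""
    lam = mtp_lambda(tokens_seen)
    return main_loss + lam / len(mtp_losses) * sum(mtp_losses)
```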
At inference: discard, or speculate
Now for the payoff. A trained MTP model can be served in one of two modes, and the choice matters.
Mode 1: discard. At serving time, ignore the MTP modules entirely; use just the main head for next-token prediction.
You still get the pretraining quality boost (the MTP loss regularised the trunk's representations), but no inference speedup.
Best when: latency is already fine, the batch is large, and the MTP modules were trained but you don't want the verification complexity in serving.
Mode 2: self-speculate. Run the full chain per forward pass. Each module produces a candidate token; the next forward pass verifies them against the main-head distribution (rejection sampling, exactly as in speculative decoding).
Accepted tokens advance the sequence; the first rejected candidate is replaced with a main-head sample and the rest are discarded. Zero quality loss. Speed gain scales with acceptance rate and D.
Best when: latency-bound, batch 1–8, H100/H200 or MI300X serving.
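A greedy-acceptance sketch of one verify step, assuming the D draft tokens came out of the MTP chain on the previous pass; production engines such as SGLang do full rejection sampling to preserve the sampling distribution, and the model interface and names here are illustrative:

```python
import torch

@torch.no_grad()
def self_speculative_step(model, seq, draft_tokens):
    # seq: [1, T] accepted ids so far; draft_tokens: [1, D] candidates from the MTP chain.
    candidate = torch.cat([seq, draft_tokens], dim=-1)            # [1, T + D]
    logits = model(candidate).logits                              # one big forward pass verifies all drafts

    # Main-head prediction for each position that has a drafted successor.
    preds = logits[:, seq.size(-1) - 1 : -1, :].argmax(-1)        # [1, D]
    matches = (preds == draft_tokens).long().cumprod(-1)          # accept until the first mismatch
    n_accept = int(matches.sum())

    # The first rejected position is replaced by the main head's own token;
    # if every draft was accepted, take the prediction after the last draft instead.
    if n_accept < draft_tokens.size(-1):
        corrected = preds[:, n_accept : n_accept + 1]
    else:
        corrected = logits[:, -1:, :].argmax(-1)
    return torch.cat([seq, draft_tokens[:, :n_accept], corrected], dim=-1)
```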
Production numbers from SGLang + H200 TP8 (July 2025 blog): 1.8× throughput at batch 1, 1.5× at batch 32. The speedup compresses at higher batch because the workload moves from memory-bound to compute-bound — there are enough tokens in flight to saturate the matrix cores even without MTP, so the “free” tokens the chain produces don't reduce the critical path as much. On AMD MI300X (SGLang benchmarks late 2025), MTP delivered 1.25–2.11× on random prompts — the wide range reflects sensitivity to prompt distribution: code and math get the upper end; chatty small-talk gets the lower end.
Qwen3-Next (Alibaba, September 2025) is the inflection point. It's the first major non-DeepSeek model to ship native MTP, and it ships with MTP on by default in the vLLM-hosted serving path. That matters: MTP is no longer a DeepSeek-specific quirk but a generic technique a serving engineer is expected to know. Expect more 2026 releases to quietly adopt it.
The serving take
MTP is the second answer, after speculative decoding, to the question “how do we decode more than one token per memory round-trip?” Speculative decoding answers with a draft model; MTP answers by modifying the target architecture. Both stack cleanly: you can run speculative decoding on top of an MTP-trained model, using its MTP modules as the draft. Some recent SGLang deployments do exactly that, chaining the gains.
For the SLM practitioner: if you are training from scratch and can afford the 1–2% parameter overhead plus the joint loss, MTP is close to a free speedup at serving time plus a small quality bump during pretraining. If you are wrapping an existing model and cannot retrain, Medusa is the cheaper compromise. And if the target model is someone else's pretrained checkpoint on HuggingFace, classical speculative decoding with a small draft remains the pragmatic choice. The three techniques are not rivals so much as points on a single curve: how much architectural commitment are you willing to trade for how much speedup.
With the DeepSeek-V3 preset at batch 1: expected accepted tokens per forward pass ≈ 2.89, effective speedup ≈ 2.51×. The memory wall didn't move, but we learned to ask it for more per knock.
Three tiers. Three ways to test the same ideas.
Recall checks the shapes and schedules. Apply runs the speedup math on new numbers. Reason transfers MTP to scenarios the lesson didn't cover.