MLA: compressing the KV cache into a latent
At 128K context, a DeepSeek-V2-shape MHA cache would eat half a terabyte of HBM per sequence. DeepSeek-V2 cuts that by 93% — not by sharing heads more aggressively, but by factoring keys and values through a low-rank latent.
A quick recap, because MLA only makes sense as a third step
In From MHA to GQA we watched the KV-head count collapse from $n_h$ per-query heads down through $g < n_h$ shared groups and eventually to a single KV head (MQA). Each step shrank the per-token cache, but the sharing ratio had a ceiling — squeeze too hard and summarisation scores fell off a cliff. GQA with $g = 8$ became the 2023–2024 default precisely because it was the last safe rung.
By mid-2024 the binding constraint had moved. Context windows leapt from 4K to 128K. The per-token cost was already fine — the sequence-length multiplier was what dominated. A Llama-class model at 128K with GQA still needed tens of gigabytes of KV per sequence. At serving concurrency of 10–50 users, that was unworkable. Something more aggressive than head-sharing was needed.
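The "tens of gigabytes" claim is easy to verify with back-of-envelope arithmetic. A minimal sketch, assuming Llama-3-70B-like shapes (80 layers, 8 KV heads, head dim 128, fp16) — the exact numbers are illustrative, not taken from any one model card:

```python
# Back-of-envelope KV-cache size for a GQA Llama-class model at long context.
# Shape assumptions (Llama-3-70B-like): 80 layers, 8 KV heads, head dim 128,
# fp16 (2 bytes per element).
n_layers, n_kv_heads, d_head, bytes_per_elem = 80, 8, 128, 2
seq_len = 128 * 1024

# K and V each store n_kv_heads * d_head values per token, per layer.
per_token_bytes = 2 * n_kv_heads * d_head * bytes_per_elem * n_layers
total_gb = per_token_bytes * seq_len / 2**30

print(f"{per_token_bytes / 1024:.0f} KB/token, {total_gb:.0f} GB per 128K sequence")
# → 320 KB/token, 40 GB per 128K sequence
```

At 40 GB per sequence, even a modest batch of concurrent users exhausts an H100's HBM before the weights are counted.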
The MLA idea: factor K and V through a shared latent
Start with the attention score for one query $q_i$ and one key $k_j$:

$$\text{score}_{ij} = \frac{q_i^\top k_j}{\sqrt{d_h}}$$
In vanilla attention, we cache $k_j$ and $v_j$ for every head, every layer, every token. MLA's move is to notice that $k_j$ and $v_j$ across all heads can be reconstructed from a single, much smaller per-token summary. Define a down-projection $W^{DKV} \in \mathbb{R}^{d_c \times d}$, which takes a token's hidden state $h_j$ to the shared latent:

$$c_j^{KV} = W^{DKV} h_j$$
Then two up-projections, one for keys and one for values, reconstruct what the attention block actually needs:

$$k_j^C = W^{UK} c_j^{KV}, \qquad v_j^C = W^{UV} c_j^{KV}$$
Only $c_j^{KV}$ lives in the KV cache. Everything else — every per-head key, every per-head value, every layer — is rebuilt on the fly from this one shared latent.
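The factored path can be sketched in a few lines of numpy. A minimal sketch for one token, one layer, content path only (RoPE comes later); dims follow DeepSeek-V2's shapes ($d = 5120$, $n_h = 128$, $d_h = 128$, $d_c = 512$), and the random weights are stand-ins:

```python
import numpy as np

# MLA's factored KV path for one token, one layer (content part only).
# Dims are DeepSeek-V2-like: d_model=5120, n_h=128 heads, d_h=128, latent d_c=512.
rng = np.random.default_rng(0)
d_model, n_h, d_h, d_c = 5120, 128, 128, 512

W_dkv = rng.normal(size=(d_c, d_model)) * 0.02    # down-projection
W_uk  = rng.normal(size=(n_h * d_h, d_c)) * 0.02  # up-projection for K
W_uv  = rng.normal(size=(n_h * d_h, d_c)) * 0.02  # up-projection for V

h = rng.normal(size=d_model)         # hidden state of one token
c_kv = W_dkv @ h                     # the only thing that enters the KV cache
k = (W_uk @ c_kv).reshape(n_h, d_h)  # reconstructed per-head keys
v = (W_uv @ c_kv).reshape(n_h, d_h)  # reconstructed per-head values

print(c_kv.shape, k.shape, v.shape)  # → (512,) (128, 128) (128, 128)
```

The cache stores 512 values per token per layer; the 128 × 128 key and value blocks exist only transiently inside the attention kernel.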
The per-token cache is no longer $2\,n_h d_h$ values per layer. It's just $d_c$ (we drop the 2× because DeepSeek-V2's actual MLA caches a single joint latent, not separate K and V latents). With $d_c = 512$ across 60 layers in fp16, that's ~60 KB per token versus MHA's ~2 MB — a 30× cut, before RoPE bookkeeping.
The decoupled-RoPE problem (and its ugly-but-necessary fix)
There is one snag, and every tutorial on MLA either glosses over it or gets it wrong. RoPE is multiplicative and position-dependent. In a RoPE attention block, each key is rotated by a position-dependent matrix $R_j$ before the dot product:

$$\text{score}_{ij} \propto (R_i q_i)^\top (R_j k_j) = q_i^\top R_{j-i} k_j$$
If you try to absorb $W^{UK}$ into the query path (the trick that makes MLA runtime-free — we'll get there in the next section), the rotation $R_{j-i}$ sits between $q_i^\top$ and $W^{UK} c_j^{KV}$ and breaks the associativity. You can't just move $W^{UK}$ over — it no longer commutes with the position-dependent rotation.
DeepSeek-V2's fix: give RoPE its own little, un-compressed channel. The per-token state is split into two pieces:
- $c_j^{KV}$ — the fat, compressed, unrotated latent. Width $d_c = 512$. Goes through the up-projections for content.
- $k_j^R = \text{RoPE}(W^{KR} h_j)$ — a small, separately computed, rotated key slice. Width $d_h^R = 64$. Carries the positional signal on its own.
At score time, the two pieces are concatenated: content similarity (through the latent) plus positional similarity (through the RoPE slice), summed together:

$$q_i^\top k_j = (q_i^C)^\top k_j^C + (q_i^R)^\top k_j^R$$

The cache stores both, but $d_c + d_h^R = 576$ is still an order of magnitude smaller than the $2\,n_h d_h$ a full MHA cache needs per layer.
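The split score can be sketched directly. A minimal single-head sketch, assuming DeepSeek-V2-like widths ($d_c = 512$, $d_h^R = 64$, $d_h = 128$); the `rope` helper is a standard pairwise rotation, and all weights and vectors are random stand-ins:

```python
import numpy as np

# Decoupled-RoPE score: content similarity through the latent plus
# positional similarity through a small rotated slice.
rng = np.random.default_rng(1)
d_c, d_rope, d_h = 512, 64, 128

def rope(x, pos):
    # Standard RoPE rotation on dimension pairs, base 10000.
    half = x.shape[-1] // 2
    freqs = 10000.0 ** (-np.arange(half) / half)
    ang = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)])

W_uk = rng.normal(size=(d_h, d_c)) * 0.02

c_kv   = rng.normal(size=d_c)                    # cached latent (unrotated)
k_rope = rope(rng.normal(size=d_rope), pos=7)    # cached RoPE slice, token j
q_c    = rng.normal(size=d_h)                    # content query, one head
q_rope = rope(rng.normal(size=d_rope), pos=12)   # rotated query slice, token i

# Score = content part + positional part, scaled by the concatenated key width.
score = (q_c @ (W_uk @ c_kv) + q_rope @ k_rope) / np.sqrt(d_h + d_rope)
print(f"score = {score:.4f}")
```

Note that the content half never touches a rotation, which is exactly what keeps it eligible for the absorption trick below.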
The query-absorption trick — why MLA is free at inference
Here is the piece that flips MLA from clever to obviously the right answer. The naive way to run MLA is:
1. Read $c_j^{KV}$ from the cache.
2. Up-project: $k_j^C = W^{UK} c_j^{KV}$.
3. Dot: $q_i^\top k_j^C$.
Step 2 materialises a full key vector per head, per token — which is exactly the memory traffic we were trying to avoid. If MLA cost that much at inference, nothing would have been gained.
The saving rides on a single algebraic identity. Writing it out for the unrotated part:

$$q_i^\top k_j^C = q_i^\top \left(W^{UK} c_j^{KV}\right) = \left((W^{UK})^\top q_i\right)^\top c_j^{KV}$$
The rewrite on the right is just the associativity of the matrix multiply. But it has a completely different runtime profile. We no longer need to materialise $k_j^C$ at all. The factor $(W^{UK})^\top$ can be pre-merged into the query projection itself: define $\widetilde{W}^Q = (W^{UK})^\top W^Q$, and then the query is computed directly against the stored latent:

$$\tilde{q}_i = \widetilde{W}^Q h_i, \qquad \text{score}_{ij}^C = \tilde{q}_i^\top c_j^{KV}$$
At inference time, $W^{UK}$ never runs. Only $c_j^{KV}$ is loaded. The per-token compute is back to GQA-class, but the KV footprint is still the 30× smaller one. This is why MLA scales to 128K.
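The identity is worth checking numerically. A minimal sketch on toy dimensions (not DeepSeek-V2's), with a plain query projection $W^Q$ standing in for the full query path:

```python
import numpy as np

# Numerical check of the absorption identity on the unrotated path:
# q^T (W_uk c) == ((W_uk^T W_q) h)^T c, so W_uk can be pre-merged into the
# query projection and the key never has to be materialised.
rng = np.random.default_rng(2)
d_model, d_h, d_c = 256, 64, 32   # toy sizes for illustration

W_q  = rng.normal(size=(d_h, d_model))
W_uk = rng.normal(size=(d_h, d_c))

h = rng.normal(size=d_model)   # current token's hidden state
c = rng.normal(size=d_c)       # a cached latent from some earlier token

# Naive path: materialise the key, then dot.
q = W_q @ h
naive = q @ (W_uk @ c)

# Absorbed path: fold W_uk into the query projection, offline, once.
W_q_absorbed = W_uk.T @ W_q         # shape (d_c, d_model)
absorbed = (W_q_absorbed @ h) @ c   # query meets the stored latent directly

print("naive == absorbed:", np.allclose(naive, absorbed))
# → naive == absorbed: True
```

The absorbed path touches $d_c$ values per cached token instead of $d_h$ per head — same score, a fraction of the memory traffic.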
At training time, all projections are explicit. The down-projection produces $c_j^{KV}$; the up-projections $W^{UK}$ and $W^{UV}$ reconstruct per-head K and V; attention proceeds as normal. Training adds ~5% parameters (the extra projection matrices) but recycles the same gradient machinery.
At inference time, $W^{UK}$ is folded into the query projection; $W^{UV}$ is folded into the output projection. Only $c_j^{KV}$ and the small RoPE slice $k_j^R$ load from HBM per decode step. Bandwidth per token drops ~20×.
The real numbers, at DeepSeek-V2 scale
Set $L = 60$ layers, $n_h = 128$ heads, $d_h = 128$, $d_c = 512$, $d_h^R = 64$, fp16 (2 bytes per element). Plug a full 128K context in and the arithmetic is savage:

$$\text{MHA: } 2\,n_h d_h \cdot 2\,\text{B} \cdot L \approx 3.9\ \text{MB/token} \;\rightarrow\; \approx 480\ \text{GB at 128K}$$
$$\text{MLA: } (d_c + d_h^R) \cdot 2\,\text{B} \cdot L \approx 67.5\ \text{KB/token} \;\rightarrow\; \approx 8.4\ \text{GB at 128K}$$

A ~98% cut versus a same-shape MHA baseline. The paper's headline 93.3% figure is a slightly different comparison — DeepSeek-V2 (MLA) versus its immediate predecessor DeepSeek 67B, which already used GQA. Across every reasonable baseline the reduction lands in the same order-of-magnitude bracket: MLA cuts attention-cache memory by roughly one to two orders of magnitude relative to full MHA at long context.
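The same arithmetic as a runnable check, using the shape assumptions stated above:

```python
# DeepSeek-V2-scale cache arithmetic: 60 layers, 128 heads of dim 128,
# latent d_c=512 plus a 64-dim RoPE slice, fp16 (2 bytes), 128K context.
L, n_h, d_h, d_c, d_rope, B = 60, 128, 128, 512, 64, 2
seq = 128 * 1024

mha_per_tok = 2 * n_h * d_h * B * L   # K and V, every head, every layer
mla_per_tok = (d_c + d_rope) * B * L  # one latent + RoPE slice per layer

mha_gb = mha_per_tok * seq / 2**30
mla_gb = mla_per_tok * seq / 2**30
print(f"MHA: {mha_gb:.0f} GB   MLA: {mla_gb:.1f} GB   cut: {1 - mla_gb / mha_gb:.1%}")
# → MHA: 480 GB   MLA: 8.4 GB   cut: 98.2%
```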
The throughput story is equally dramatic. Decode is memory-bandwidth bound: each generated token requires reading the entire KV cache of every prior token. Shrinking the cache shrinks the per-token bandwidth. DeepSeek-V2's paper reports 5.76× higher maximum generation throughput than DeepSeek 67B, and the API saw steep per-token price cuts as the MLA-based models rolled out through 2024.