MLA: compressing the KV cache into a latent
At 128K context, a DeepSeek-V2-shape MHA cache would eat half a terabyte of HBM per sequence. DeepSeek-V2 cuts that by 93% — not by sharing heads more aggressively, but by factoring keys and values through a low-rank latent.
A quick recap, because MLA only makes sense as a third step
In From MHA to GQA we watched the KV-head count collapse from $n_h$ per-query heads down through $g < n_h$ shared groups and eventually to a single KV head (MQA). Each step shrank the per-token cache, but the sharing ratio had a ceiling — squeeze too hard and summarisation scores fell off a cliff. GQA with $g = 8$ became the 2023–2024 default precisely because it was the last safe rung.
By mid-2024 the binding constraint had moved. Context windows leapt from 4K to 128K. The per-token cost was already fine — the sequence-length multiplier was what dominated. A Llama-class model at 128K with GQA still needed tens of gigabytes of KV per sequence. At serving concurrency of 10–50 users, that was unworkable. Something more aggressive than head-sharing was needed.
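The "tens of gigabytes" claim is easy to verify with back-of-envelope arithmetic. A minimal sketch, assuming Llama-3-70B-like shapes (80 layers, 8 KV heads, head dim 128, fp16) — the exact numbers are illustrative, not taken from any one model card:

```python
# Back-of-envelope KV-cache size for a GQA Llama-class model at long context.
# Shape assumptions (Llama-3-70B-like): 80 layers, 8 KV heads, head dim 128,
# fp16 (2 bytes per element).
n_layers, n_kv_heads, d_head, bytes_per_elem = 80, 8, 128, 2
seq_len = 128 * 1024

# K and V each store n_kv_heads * d_head values per token, per layer.
per_token_bytes = 2 * n_kv_heads * d_head * bytes_per_elem * n_layers
total_gb = per_token_bytes * seq_len / 2**30

print(f"{per_token_bytes / 1024:.0f} KB/token, {total_gb:.0f} GB per 128K sequence")
# → 320 KB/token, 40 GB per 128K sequence
```

At 40 GB per sequence, even a modest batch of concurrent users exhausts an H100's HBM before the weights are counted.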
The MLA idea: factor K and V through a shared latent
Start with the attention score for one query $q_i$ and one key $k_j$:

$$\text{score}_{ij} = \frac{q_i^\top k_j}{\sqrt{d_h}}$$
In vanilla attention, we cache $k_j$ and $v_j$ for every head, every layer, every token. MLA's move is to notice that $k_j$ and $v_j$ across all heads can be reconstructed from a single, much smaller per-token summary. Define a down-projection $W^{DKV} \in \mathbb{R}^{d_c \times d}$, which takes a token's hidden state $h_j$ to the shared latent:

$$c_j^{KV} = W^{DKV} h_j$$
Then two up-projections, one for keys and one for values, reconstruct what the attention block actually needs:

$$k_j^C = W^{UK} c_j^{KV}, \qquad v_j^C = W^{UV} c_j^{KV}$$
Only $c_j^{KV}$ lives in the KV cache. Everything else — every per-head key, every per-head value, every layer — is rebuilt on the fly from this one shared latent.
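The factored path can be sketched in a few lines of numpy. A minimal sketch for one token, one layer, content path only (RoPE comes later); dims follow DeepSeek-V2's shapes ($d = 5120$, $n_h = 128$, $d_h = 128$, $d_c = 512$), and the random weights are stand-ins:

```python
import numpy as np

# MLA's factored KV path for one token, one layer (content part only).
# Dims are DeepSeek-V2-like: d_model=5120, n_h=128 heads, d_h=128, latent d_c=512.
rng = np.random.default_rng(0)
d_model, n_h, d_h, d_c = 5120, 128, 128, 512

W_dkv = rng.normal(size=(d_c, d_model)) * 0.02    # down-projection
W_uk  = rng.normal(size=(n_h * d_h, d_c)) * 0.02  # up-projection for K
W_uv  = rng.normal(size=(n_h * d_h, d_c)) * 0.02  # up-projection for V

h = rng.normal(size=d_model)         # hidden state of one token
c_kv = W_dkv @ h                     # the only thing that enters the KV cache
k = (W_uk @ c_kv).reshape(n_h, d_h)  # reconstructed per-head keys
v = (W_uv @ c_kv).reshape(n_h, d_h)  # reconstructed per-head values

print(c_kv.shape, k.shape, v.shape)  # → (512,) (128, 128) (128, 128)
```

The cache stores 512 values per token per layer; the 128 × 128 key and value blocks exist only transiently inside the attention kernel.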
The per-token cache is no longer $2\,n_h d_h$ values per layer. It's just $d_c$ (we drop the 2× because DeepSeek-V2's actual MLA caches a single joint latent, not separate K and V latents). With $d_c = 512$ across 60 layers in fp16, that's ~60 KB per token versus MHA's ~2 MB — a 30× cut, before RoPE bookkeeping.
The decoupled-RoPE problem (and its ugly-but-necessary fix)
There is one snag, and every tutorial on MLA either glosses over it or gets it wrong. RoPE is multiplicative and position-dependent. In a RoPE attention block, each key is rotated by a position-dependent matrix $R_j$ before the dot product:

$$\text{score}_{ij} \propto (R_i q_i)^\top (R_j k_j) = q_i^\top R_{j-i} k_j$$
If you try to absorb $W^{UK}$ into the query path (the trick that makes MLA runtime-free — we'll get there in the next section), the rotation $R_{j-i}$ sits between $q_i^\top$ and $W^{UK} c_j^{KV}$ and breaks the associativity. You can't just move $W^{UK}$ over — it no longer commutes with the position-dependent rotation.
DeepSeek-V2's fix: give RoPE its own little, un-compressed channel. The per-token state is split into two pieces:
- $c_j^{KV}$ — the fat, compressed, unrotated latent. Width $d_c = 512$. Goes through the up-projections for content.
- $k_j^R = \text{RoPE}(W^{KR} h_j)$ — a small, separately computed, rotated key slice. Width $d_h^R = 64$. Carries the positional signal on its own.
At score time, the two pieces are concatenated: content similarity (through the latent) plus positional similarity (through the RoPE slice), summed together:

$$q_i^\top k_j = (q_i^C)^\top k_j^C + (q_i^R)^\top k_j^R$$

The cache stores both, but $d_c + d_h^R = 576$ is still an order of magnitude smaller than the $2\,n_h d_h$ a full MHA cache needs per layer.
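The split score can be sketched directly. A minimal single-head sketch, assuming DeepSeek-V2-like widths ($d_c = 512$, $d_h^R = 64$, $d_h = 128$); the `rope` helper is a standard pairwise rotation, and all weights and vectors are random stand-ins:

```python
import numpy as np

# Decoupled-RoPE score: content similarity through the latent plus
# positional similarity through a small rotated slice.
rng = np.random.default_rng(1)
d_c, d_rope, d_h = 512, 64, 128

def rope(x, pos):
    # Standard RoPE rotation on dimension pairs, base 10000.
    half = x.shape[-1] // 2
    freqs = 10000.0 ** (-np.arange(half) / half)
    ang = pos * freqs
    x1, x2 = x[:half], x[half:]
    return np.concatenate([x1 * np.cos(ang) - x2 * np.sin(ang),
                           x1 * np.sin(ang) + x2 * np.cos(ang)])

W_uk = rng.normal(size=(d_h, d_c)) * 0.02

c_kv   = rng.normal(size=d_c)                    # cached latent (unrotated)
k_rope = rope(rng.normal(size=d_rope), pos=7)    # cached RoPE slice, token j
q_c    = rng.normal(size=d_h)                    # content query, one head
q_rope = rope(rng.normal(size=d_rope), pos=12)   # rotated query slice, token i

# Score = content part + positional part, scaled by the concatenated key width.
score = (q_c @ (W_uk @ c_kv) + q_rope @ k_rope) / np.sqrt(d_h + d_rope)
print(f"score = {score:.4f}")
```

Note that the content half never touches a rotation, which is exactly what keeps it eligible for the absorption trick below.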
The query-absorption trick — why MLA is free at inference
Here is the piece that flips MLA from clever to obviously the right answer. The naive way to run MLA is:
1. Read $c_j^{KV}$ from the cache.
2. Up-project: $k_j^C = W^{UK} c_j^{KV}$.
3. Dot: $q_i^\top k_j^C$.
Step 2 materialises a full key vector per head, per token — which is exactly the memory traffic we were trying to avoid. If MLA cost that much at inference, nothing would have been gained.
The saving rides on a single algebraic identity. Writing it out for the unrotated part:

$$q_i^\top k_j^C = q_i^\top \left(W^{UK} c_j^{KV}\right) = \left((W^{UK})^\top q_i\right)^\top c_j^{KV}$$
The rewrite on the right is just the associativity of the matrix multiply. But it has a completely different runtime profile. We no longer need to materialise $k_j^C$ at all. The factor $(W^{UK})^\top$ can be pre-merged into the query projection itself: define $\widetilde{W}^Q = (W^{UK})^\top W^Q$, and then the query is computed directly against the stored latent:

$$\tilde{q}_i = \widetilde{W}^Q h_i, \qquad \text{score}_{ij}^C = \tilde{q}_i^\top c_j^{KV}$$
At inference time, $W^{UK}$ never runs. Only $c_j^{KV}$ is loaded. The per-token compute is back to GQA-class, but the KV footprint is still the 30× smaller one. This is why MLA scales to 128K.
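The identity is worth checking numerically. A minimal sketch on toy dimensions (not DeepSeek-V2's), with a plain query projection $W^Q$ standing in for the full query path:

```python
import numpy as np

# Numerical check of the absorption identity on the unrotated path:
# q^T (W_uk c) == ((W_uk^T W_q) h)^T c, so W_uk can be pre-merged into the
# query projection and the key never has to be materialised.
rng = np.random.default_rng(2)
d_model, d_h, d_c = 256, 64, 32   # toy sizes for illustration

W_q  = rng.normal(size=(d_h, d_model))
W_uk = rng.normal(size=(d_h, d_c))

h = rng.normal(size=d_model)   # current token's hidden state
c = rng.normal(size=d_c)       # a cached latent from some earlier token

# Naive path: materialise the key, then dot.
q = W_q @ h
naive = q @ (W_uk @ c)

# Absorbed path: fold W_uk into the query projection, offline, once.
W_q_absorbed = W_uk.T @ W_q         # shape (d_c, d_model)
absorbed = (W_q_absorbed @ h) @ c   # query meets the stored latent directly

print("naive == absorbed:", np.allclose(naive, absorbed))
# → naive == absorbed: True
```

The absorbed path touches $d_c$ values per cached token instead of $d_h$ per head — same score, a fraction of the memory traffic.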
At training time, all projections are explicit. The down-projection produces $c_j^{KV}$; the up-projections $W^{UK}$ and $W^{UV}$ reconstruct per-head K and V; attention proceeds as normal. Training adds ~5% parameters (the extra projection matrices) but recycles the same gradient machinery.
At inference time, $W^{UK}$ is folded into the query projection; $W^{UV}$ is folded into the output projection. Only $c_j^{KV}$ and the small RoPE slice $k_j^R$ load from HBM per decode step. Bandwidth per token drops ~20×.
The real numbers, at DeepSeek-V2 scale
Set $L = 60$ layers, $n_h = 128$ heads, $d_h = 128$, $d_c = 512$, $d_h^R = 64$, fp16 (2 bytes per element). Plug a full 128K context in and the arithmetic is savage:

$$\text{MHA: } 2\,n_h d_h \cdot 2\,\text{B} \cdot L \approx 3.9\ \text{MB/token} \;\rightarrow\; \approx 480\ \text{GB at 128K}$$
$$\text{MLA: } (d_c + d_h^R) \cdot 2\,\text{B} \cdot L \approx 67.5\ \text{KB/token} \;\rightarrow\; \approx 8.4\ \text{GB at 128K}$$

A ~98% cut versus a same-shape MHA baseline. The paper's headline 93.3% figure is a slightly different comparison — DeepSeek-V2 (MLA) versus its immediate predecessor DeepSeek 67B, which already used GQA. Across every reasonable baseline the reduction lands in the same order-of-magnitude bracket: MLA cuts attention-cache memory by roughly one to two orders of magnitude relative to full MHA at long context.
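The same arithmetic as a runnable check, using the shape assumptions stated above:

```python
# DeepSeek-V2-scale cache arithmetic: 60 layers, 128 heads of dim 128,
# latent d_c=512 plus a 64-dim RoPE slice, fp16 (2 bytes), 128K context.
L, n_h, d_h, d_c, d_rope, B = 60, 128, 128, 512, 64, 2
seq = 128 * 1024

mha_per_tok = 2 * n_h * d_h * B * L   # K and V, every head, every layer
mla_per_tok = (d_c + d_rope) * B * L  # one latent + RoPE slice per layer

mha_gb = mha_per_tok * seq / 2**30
mla_gb = mla_per_tok * seq / 2**30
print(f"MHA: {mha_gb:.0f} GB   MLA: {mla_gb:.1f} GB   cut: {1 - mla_gb / mha_gb:.1%}")
# → MHA: 480 GB   MLA: 8.4 GB   cut: 98.2%
```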
The throughput story is equally dramatic. Decode is memory-bandwidth bound: each generated token requires reading the entire KV cache of every prior token. Shrinking the cache shrinks the per-token bandwidth. DeepSeek-V2's paper reports 5.76× higher maximum generation throughput than DeepSeek 67B, and the API saw steep per-token price cuts as the MLA-based models rolled out through 2024.