The cache you didn't know dominated everything
During autoregressive generation, every step requires attention over every previous token. Recomputing K and V for all prior tokens on every step would be quadratically wasteful: each of the T generated tokens would recompute projections for up to T prior tokens, for O(T²) total work per generation.
The fix is the KV cache: store K and V from every past token so you only compute them once. Generation per token drops to O(T): compute K and V once for the new token, then attend over the T cached keys. Much better.
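To make the mechanics concrete, here is a minimal single-head decode loop in NumPy. This is a toy sketch under assumed shapes and random weights, not any particular library's implementation:

```python
import numpy as np

d = 64                                    # toy head dimension (an assumption)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

k_cache, v_cache = [], []                 # grows by one entry per decoded token

def decode_step(x):
    """x: (d,) hidden state of the newest token."""
    q = x @ Wq
    k_cache.append(x @ Wk)                # O(1) new work: K/V for this token only
    v_cache.append(x @ Wv)
    K = np.stack(k_cache)                 # (T, d), read back from the cache
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)           # O(T) attention over cached keys
    probs = np.exp(scores - scores.max()) # numerically stable softmax
    probs /= probs.sum()
    return probs @ V                      # (d,) attention output

for _ in range(5):                        # five toy decode steps
    out = decode_step(rng.normal(size=d))
```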
For a 3B-class model (32 layers, 8 KV heads, d_head = 128, FP16), per-token KV is 2 × n_layers × n_kv × d_head × 2 bytes (the leading 2 for K and V, the trailing 2 for FP16 bytes) = 2 × 32 × 8 × 128 × 2 = 128 KB per token. Sounds small until you multiply by sequence length and concurrent users.
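As a sanity check, the same arithmetic in a few lines of Python (the helper name is mine, not from any library):

```python
def kv_bytes_per_token(n_layers, n_kv, d_head, bytes_per_elem=2):
    # 2 tensors (K and V) per layer, each n_kv * d_head elements wide
    return 2 * n_layers * n_kv * d_head * bytes_per_elem

per_token = kv_bytes_per_token(n_layers=32, n_kv=8, d_head=128)
print(per_token // 1024)            # 128 KB per token
print(per_token * 4096 // 2**20)    # 512 MB for one 4k-context sequence
```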
Why this is the binding constraint in production
Once the KV cache outgrows the weights, every additional concurrent user eats directly into your remaining slots. A 24 GB GPU with a 6 GB model leaves 18 GB for KV cache. At 512 MB per 4k-context sequence that is 36 slots on paper, but naive contiguous allocation wastes much of the budget to fragmentation, so in practice it's enough for maybe ~12 concurrent 4k-context sequences without any paged attention, continuous batching, or FP8 KV tricks. Those techniques all exist specifically to raise that ceiling.
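Plugging the numbers in (the 35% utilization factor below is an assumption on my part, chosen to be consistent with the 2–4× PagedAttention gains cited at the end):

```python
kv_budget = (24 - 6) * 2**30             # 18 GB left for KV after 6 GB of weights
per_seq = 4096 * 128 * 1024              # 512 MB per 4k-context sequence
print(kv_budget // per_seq)              # 36 slots on paper
print(int(kv_budget // per_seq * 0.35))  # ~12 after assumed fragmentation losses
```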
And memory is only half the story: every decode step has to read the accumulated KV back from HBM to compute attention. At sequence length 4,096, each forward pass pulls 512 MB per sequence just for KV reads, on top of the ~6 GB weight read. On an H100's 3.35 TB/s HBM3, that KV read costs about 0.15 ms per sequence per step, grows linearly with context, and multiplies across the batch; with a dozen concurrent sequences the KV traffic already rivals the ~1.8 ms weight read, which is why long-context decode gets progressively slower as the sequence grows. It also explains why GQA exists: the n_kv in the per-token formula above used to be 32, the full query-head count (full MHA), and shrinking it 4× is a straight 4× cut on both KV memory and KV bandwidth. Llama-2-70B ships 8 KV heads for exactly this reason; Shazeer's 2019 multi-query paper called the shot a full four years before it became production-standard.
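The same bandwidth arithmetic as a sketch, assuming the 3.35 TB/s figure and the 3B-class shapes above (the helper name is hypothetical):

```python
HBM_BYTES_PER_S = 3.35e12               # H100 HBM3, per the text

def kv_read_ms(seq_len, n_kv, n_layers=32, d_head=128, batch=1):
    # bytes of K and V read per decode step, in FP16
    kv_bytes = batch * seq_len * 2 * n_layers * n_kv * d_head * 2
    return 1e3 * kv_bytes / HBM_BYTES_PER_S

print(kv_read_ms(4096, n_kv=8))            # ~0.16 ms/step per sequence (GQA)
print(kv_read_ms(4096, n_kv=32))           # ~0.64 ms with full MHA: the 4x cut
print(kv_read_ms(4096, n_kv=8, batch=12))  # ~1.9 ms, rivaling the weight read
```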
The next lesson shows PagedAttention, which turns the KV cache from a contiguous block into a paged virtual memory system. The fragmentation savings alone yield 2–4× more concurrent sessions on the same hardware.