The cache you didn't know dominated everything
During autoregressive generation, every step requires attention over every previous token. Recomputing K and V for all prior tokens on every step would be quadratically wasteful: each of the T generated tokens would recompute projections for up to T prior tokens, for O(T²) total work per generation.
The fix is the KV cache: store K and V from every past token so you only compute them once. Generation per token drops to O(T): compute K and V once for the new token, then attend over the T cached keys. Much better.
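To make the mechanics concrete, here is a minimal single-head decode loop in NumPy. This is a toy sketch under assumed shapes and random weights, not any particular library's implementation:

```python
import numpy as np

d = 64                                    # toy head dimension (an assumption)
rng = np.random.default_rng(0)
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

k_cache, v_cache = [], []                 # grows by one entry per decoded token

def decode_step(x):
    """x: (d,) hidden state of the newest token."""
    q = x @ Wq
    k_cache.append(x @ Wk)                # O(1) new work: K/V for this token only
    v_cache.append(x @ Wv)
    K = np.stack(k_cache)                 # (T, d), read back from the cache
    V = np.stack(v_cache)
    scores = K @ q / np.sqrt(d)           # O(T) attention over cached keys
    probs = np.exp(scores - scores.max()) # numerically stable softmax
    probs /= probs.sum()
    return probs @ V                      # (d,) attention output

for _ in range(5):                        # five toy decode steps
    out = decode_step(rng.normal(size=d))
```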
For a 3B-class model (32 layers, 8 KV heads, d_head = 128, FP16), per-token KV is 2 × n_layers × n_kv × d_head × 2 bytes (the leading 2 for K and V, the trailing 2 for FP16 bytes) = 2 × 32 × 8 × 128 × 2 = 128 KB per token. Sounds small until you multiply by sequence length and concurrent users.
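As a sanity check, the same arithmetic in a few lines of Python (the helper name is mine, not from any library):

```python
def kv_bytes_per_token(n_layers, n_kv, d_head, bytes_per_elem=2):
    # 2 tensors (K and V) per layer, each n_kv * d_head elements wide
    return 2 * n_layers * n_kv * d_head * bytes_per_elem

per_token = kv_bytes_per_token(n_layers=32, n_kv=8, d_head=128)
print(per_token // 1024)            # 128 KB per token
print(per_token * 4096 // 2**20)    # 512 MB for one 4k-context sequence
```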
Why this is the binding constraint in production
Once the KV cache outgrows the weights, every additional concurrent user eats directly into your remaining slots. A 24 GB GPU with a 6 GB model leaves 18 GB for KV cache. At 512 MB per 4k-context sequence that is 36 slots on paper, but naive contiguous allocation wastes much of the budget to fragmentation, so in practice it's enough for maybe ~12 concurrent 4k-context sequences without any paged attention, continuous batching, or FP8 KV tricks. Those techniques all exist specifically to raise that ceiling.
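Plugging the numbers in (the 35% utilization factor below is an assumption on my part, chosen to be consistent with the 2–4× PagedAttention gains cited at the end):

```python
kv_budget = (24 - 6) * 2**30             # 18 GB left for KV after 6 GB of weights
per_seq = 4096 * 128 * 1024              # 512 MB per 4k-context sequence
print(kv_budget // per_seq)              # 36 slots on paper
print(int(kv_budget // per_seq * 0.35))  # ~12 after assumed fragmentation losses
```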
And memory is only half the story: every decode step has to read the accumulated KV back from HBM to compute attention. At sequence length 4,096, each forward pass pulls 512 MB per sequence just for KV reads, on top of the ~6 GB weight read. On an H100's 3.35 TB/s HBM3, that KV read costs about 0.15 ms per sequence per step, grows linearly with context, and multiplies across the batch; with a dozen concurrent sequences the KV traffic already rivals the ~1.8 ms weight read, which is why long-context decode gets progressively slower as the sequence grows. It also explains why GQA exists: the n_kv in the per-token formula above used to be 32, the full query-head count (full MHA), and shrinking it 4× is a straight 4× cut on both KV memory and KV bandwidth. Llama-2-70B ships 8 KV heads for exactly this reason; Shazeer's 2019 multi-query paper called the shot a full four years before it became production-standard.
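The same bandwidth arithmetic as a sketch, assuming the 3.35 TB/s figure and the 3B-class shapes above (the helper name is hypothetical):

```python
HBM_BYTES_PER_S = 3.35e12               # H100 HBM3, per the text

def kv_read_ms(seq_len, n_kv, n_layers=32, d_head=128, batch=1):
    # bytes of K and V read per decode step, in FP16
    kv_bytes = batch * seq_len * 2 * n_layers * n_kv * d_head * 2
    return 1e3 * kv_bytes / HBM_BYTES_PER_S

print(kv_read_ms(4096, n_kv=8))            # ~0.16 ms/step per sequence (GQA)
print(kv_read_ms(4096, n_kv=32))           # ~0.64 ms with full MHA: the 4x cut
print(kv_read_ms(4096, n_kv=8, batch=12))  # ~1.9 ms, rivaling the weight read
```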
The next lesson shows PagedAttention, which turns the KV cache from a contiguous block into a paged virtual memory system. The fragmentation savings alone yield 2–4× more concurrent sessions on the same hardware.