Act VIII · Serving the Model
lesson memory-wall · 15 min · 60 xp

Inside the GPU

The memory pyramid, the roofline, and why decode is bandwidth-bound — the prerequisite

A question most inference tutorials skip

Here is a clean, specific question. You have a 7-billion-parameter language model in FP16. You have an H100 SXM — Nvidia's top-of-the-line Hopper GPU: 989 TFLOPs of BF16 compute, 3.35 TB/s of HBM3 bandwidth. You want to decode one token. How long does it take, at best?

the naive answer — flops only

Decoding one token requires a forward pass through the model. For a dense transformer, the forward pass takes roughly 2N FLOPs, where N is the parameter count:

\text{FLOPs} = 2 \cdot 7 \times 10^9 = 1.4 \times 10^{10}
t_\text{compute} = \frac{1.4 \times 10^{10}}{989 \times 10^{12}\,\text{FLOPs/s}} \approx 14\,\mu\text{s}

Fourteen microseconds. Seventy thousand tokens per second. That would be incredible.

It is also completely wrong. Real H100 inference on a 7B model in FP16 is roughly 170 tokens per second, or about 5.9 milliseconds per token. That's 400 times slower than the compute bound.

the real answer — bytes

The compute model missed the actual binding constraint: to compute the forward pass, the GPU has to read every weight in the model from HBM into the cores. Once per token.

\text{bytes to read} = 7 \times 10^9 \cdot 2\,\text{bytes} = 14\,\text{GB}
t_\text{memory} = \frac{14\,\text{GB}}{3.35\,\text{TB/s}} \approx 4.2\,\text{ms}

The bandwidth bound is about 4.2 milliseconds. That is consistent — within scheduling overhead — with the real 5.9 ms number. The compute was never the bottleneck; the memory bus was.
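Both bounds are one division each. Here is the arithmetic as a minimal sketch, using only the H100 numbers quoted above (illustrative, not a benchmark):

```python
# Back-of-envelope decode-time bounds for a dense FP16 model on an H100 SXM.

PEAK_FLOPS = 989e12   # BF16 tensor-core peak, FLOPs/s
PEAK_BW    = 3.35e12  # HBM3 bandwidth, bytes/s

def decode_bounds(n_params, bytes_per_weight=2):
    """Return (compute-bound, memory-bound) seconds per decoded token."""
    flops = 2 * n_params                      # ~2 FLOPs per parameter
    bytes_read = n_params * bytes_per_weight  # every weight read once per token
    return flops / PEAK_FLOPS, bytes_read / PEAK_BW

t_compute, t_memory = decode_bounds(7e9)
print(f"compute bound: {t_compute * 1e6:.1f} us")  # ~14.2 us
print(f"memory bound:  {t_memory * 1e3:.2f} ms")   # ~4.18 ms
print(f"ratio: {t_memory / t_compute:.0f}x")       # ~295x
```

Swap in any parameter count or precision and the same two divisions give you the bounds for that configuration.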

This is the memory wall. It is the single most important fact in Act VIII. Every optimisation you are about to learn — PagedAttention, FlashAttention, KV-cache quantization, speculative decoding, chunked prefill — is a specific attack on this number. If you don't understand where the 4.2 ms comes from, the rest of the act reads like a pile of disconnected tricks.

historical note
1995 · William A. Wulf and Sally A. McKee
The term “memory wall” was coined by Wulf and McKee in a one-page SIGARCH paper. Their observation: processor speed was doubling every 18 months (Moore's law), but DRAM speed was improving at only 7% per year. They predicted the gap would eventually become the dominant performance bottleneck. They were right — and the gap is now so large that modern AI hardware is designed first around memory bandwidth and second around arithmetic capability.

The memory pyramid — a tour from register to HBM

Every GPU has a hierarchy of memory levels. Each level is faster but smaller than the one below it. The top levels sit inside the compute units; the bottom levels are physically separate chips connected by a bus. The further you go down the pyramid, the slower access gets, and the more the bytes cost you in time.

Click through the layers on the right — the numbers are the actual H100 figures. These are the constraints every inference engine is actually optimising against.

H100 memory hierarchy · click any level
[interactive pyramid: Registers → SRAM / shared memory → L2 cache → HBM3 (80 GB) → system RAM / PCIe]
HBM3 (80 GB) · level 4 of 5
capacity 80 GB · bandwidth 3.35 TB/s · latency ~500 cycles
Physically separate memory stacks, connected by a 1024-bit-wide bus per stack through a silicon interposer. This is where your model weights, your KV cache, and your activations all live.

The roofline — one picture, every bottleneck

Now we can state the memory wall precisely. A GPU has two peak numbers:

  • Peak compute π (FLOPs per second) — how fast it can multiply-accumulate if nothing else gets in the way.
  • Peak bandwidth β (bytes per second) — how fast it can move bytes from HBM to the compute units.

An operation has an arithmetic intensity I: FLOPs performed divided by bytes of data the op has to touch.

the roofline
\text{achievable FLOPs/s} = \min\!\big(\pi,\; \beta \cdot I\big)

If your intensity is low (memory-bound), you get β · I: performance grows linearly with intensity. If your intensity is high (compute-bound), you get π: the peak, a flat ceiling. The crossover happens at the ridge point: I* = π / β.

For an H100: π = 989 TFLOPs/s and β = 3.35 TB/s, so I* = 989/3.35 ≈ 295 FLOPs per byte. Any operation with arithmetic intensity below 295 is memory-bound on an H100. LLM decode has an intensity of about 1 (two FLOPs per FP16 weight byte). It is 295 times below the ridge point.
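The roofline is one `min`. A sketch with the H100 constants from the text:

```python
# The roofline in one function: achievable FLOPs/s = min(pi, beta * I).

PI   = 989e12   # H100 peak compute, FLOPs/s
BETA = 3.35e12  # H100 peak HBM bandwidth, bytes/s

def roofline(intensity, pi=PI, beta=BETA):
    """Achievable FLOPs/s at a given arithmetic intensity (FLOPs/byte)."""
    return min(pi, beta * intensity)

ridge = PI / BETA  # ~295 FLOPs/byte
print(f"ridge point: {ridge:.0f} FLOPs/byte")
print(f"decode (I=1):   {roofline(1) / 1e12:.2f} TFLOPs/s")   # bandwidth-limited slope
print(f"matmul (I=500): {roofline(500) / 1e12:.0f} TFLOPs/s") # flat compute roof
```

At I = 1 the model delivers only 3.35 TFLOPs/s out of a possible 989 — that gap is the memory wall drawn as a number.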

The interactive plot below is the roofline. Pick a GPU on the right. Each workload marker sits at its arithmetic intensity; its y-value is the achievable performance. Notice how LLM decode lives in the steep bandwidth-limited region and is at the mercy of β\beta, not π\pi.

roofline — NVIDIA H100 SXM (Hopper, 2023)
[interactive log–log plot: arithmetic intensity (FLOPs/byte) on the x-axis, achievable TFLOPs/s on the y-axis; markers for LLM decode, LLM prefill (batch=32), dense matmul (BLAS), and element-wise ops (ReLU)]
ridge point I* · 295 FLOPs/byte
decode is 295× under the ridge · prefill is 4.6× under the ridge

Decode vs prefill — different sides of the roofline

Look at where the four markers sit on the plot. Notice what happens as arithmetic intensity changes:

  • Element-wise ops (ReLU, add, normalise) have AI ≈ 0.25. Almost no compute per byte. They sit on the floor of the bandwidth-limited region.
  • LLM decode has AI ≈ 1. Decoding one token reads all weights once and does 2 FLOPs per parameter. In FP16, that's 2 FLOPs per 2 bytes = 1 FLOP/byte.
  • LLM prefill (batched over many tokens) has AI ≈ 64 or higher. The same weight read is amortised over many token computations, so AI scales with batch size and sequence length. Prefill is usually compute-bound.
  • Dense BLAS matmul (e.g., GEMM in training on big matrices) has AI ≈ 500. Deep in the compute-bound regime. GPUs are designed to run at peak here, not during LLM decode.
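The prefill point in the list above follows from a simple accounting: when n tokens share one read of the weights, the FLOPs per weight byte scale with n. A toy model of that amortisation (it ignores KV-cache and activation traffic, which is an assumption):

```python
# Effective arithmetic intensity when n tokens share one weight read.
# Each token does 2 FLOPs per parameter; the weight bytes are read once.

def effective_ai(tokens_per_read, bytes_per_weight=2):
    """FLOPs per weight byte, weights amortised over tokens_per_read tokens."""
    return 2 * tokens_per_read / bytes_per_weight

RIDGE = 295  # H100 ridge point, FLOPs/byte
for n in (1, 64, 512):
    ai = effective_ai(n)
    regime = "compute-bound" if ai >= RIDGE else "memory-bound"
    print(f"{n:4d} tokens per read: AI = {ai:5.0f} FLOPs/byte -> {regime}")
```

One token per read is decode (AI = 1); dozens per read is prefill; past ~295 tokens per read the workload crosses the ridge and the tensor cores, not the HBM bus, become the limit.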

Compute vs bandwidth — see it in wall-clock time

Now actually play with it. Pick a model size, a precision, and a GPU. Watch the compute time and the memory-read time. For decode, memory time is always dramatically larger. The ratio is your bandwidth-bound multiplier.

model 7B · precision FP16
Each level of quantization doubles effective decode throughput — by halving the bytes per weight, not by doing less math.
weights to read · 14.0 GB
memory-bound time · 4.18 ms
compute-bound time · 14.2 μs
you're bandwidth-bound by · 295×
where the time goes (one decoded token): nearly all of it is bandwidth (reading 14.0 GB); compute is a sliver. Every one of the 295 compute-times you see is time the GPU's cores sit idle, waiting for the next slab of weights to arrive from HBM. That idle time is what Act VIII's optimizations attack.
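The quantization claim above is pure arithmetic: in the bandwidth-bound regime, tokens per second is bandwidth divided by weight bytes per token. A sketch of those ceilings (real measured throughput lands somewhat below them):

```python
# Bandwidth-bound decode throughput ceilings for a 7B model on an H100.
# tokens/s = HBM bandwidth / bytes of weights read per token.

HBM_BW = 3.35e12  # bytes/s
N = 7e9           # parameters

for name, bytes_per_weight in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    tok_per_s = HBM_BW / (N * bytes_per_weight)
    print(f"{name}: {tok_per_s:5.0f} tok/s ceiling")
```

Halving bytes per weight doubles the ceiling each time: the FLOPs are unchanged, but the HBM pipe has half as much to carry.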

The physical GPU — where every byte actually lives

The memory pyramid is a useful abstraction, but it hides the geography. An H100 is a physical 814 mm² chip — if you hold one in your hand, you can see the pieces. Eight processing clusters ring the die. A split L2 cache sits in the middle. Five HBM3 memory stacks hug the outside, each connected by thousands of micro-wires through a silicon interposer. This layout is not cosmetic; it determines the topology of data movement.

historical note
2022 · NVIDIA's Hopper team
The Hopper GH100 die is built on TSMC's 4N process and packs 80 billion transistors onto 814 mm² — close to the reticle limit, among the largest dies ever shipped. The full die has 144 SMs, but yield is never 100%, so Nvidia disables 12 and ships the datacenter H100 SXM5 with 132 SMs — still more than any prior accelerator. One of the six possible HBM3 sites is disabled for the same yield reason, leaving five active stacks and the 80 GB capacity you know.
H100 SXM5 die · click a component
[die diagram: five HBM3 stacks (1–5) and the GPU die on a silicon interposer; a split 50 MB L2 cache in the middle of the die]
HBM3 stack · 5 active, 16 GB each, 80 GB total
Each stack is a vertically stacked set of DRAM dies bonded to the silicon interposer through thousands of through-silicon vias. Each stack has a 1024-bit data bus running at ~6.4 GT/s, giving ~819 GB/s per stack. Five active stacks × 819 GB/s ≈ 4.1 TB/s theoretical; 3.35 TB/s sustained. One of the six possible HBM3 sites on the package is disabled for yield.
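The per-stack figure is just bus width times transfer rate:

```python
# Per-stack HBM3 bandwidth from first principles: bus width x transfer rate.

bus_bits = 1024                       # data bus width per stack
rate = 6.4e9                          # ~6.4 GT/s per pin
per_stack = bus_bits / 8 * rate       # bytes/s
print(f"per stack: {per_stack / 1e9:.0f} GB/s")          # ~819 GB/s
print(f"5 stacks:  {5 * per_stack / 1e12:.2f} TB/s")     # ~4.10 TB/s theoretical
```

The gap between the ~4.1 TB/s theoretical total and the 3.35 TB/s sustained figure is refresh, protocol, and scheduling overhead in the DRAM itself.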

One Llama layer, one tile at a time

Now to answer the question you should have been asking all along: when the GPU “reads” the weights of a layer from HBM, what actually happens? Not all of the layer streams in at once — that would need gigabytes of SRAM, which we don't have. Instead, the matrix is tiled. A small tile of the weight matrix streams from HBM into an SM's shared memory, gets multiplied against a tile of the input, and produces a partial output. Meanwhile, the next tile is already arriving. The 132 SMs all do this in parallel, each owning its own tiles.

Here is the math made completely concrete, for one Q projection in one Llama-7B layer.

worked example — one Q projection, one Llama-7B layer, one decoded token, H100 SXM5
step 1 · the weight matrix
shape
4096 × 4096
elements
16.78M
FP16 bytes
33.55 MB (32 MiB)
step 2 · tile it into 128×128 pieces so each tile fits in an SM's shared memory
tile size
128 × 128
tile bytes
32KB
total tiles
1024
tiles per SM
7.8
4096 × 4096 weight matrix, tiled 32 × 32 = 1,024 tiles
step 3 · time to stream all 33.55 MB from HBM at 3.35 TB/s
t_\text{read} = \frac{33.55\,\text{MB}}{3.35\,\text{TB/s}} = \frac{33.55 \times 10^6}{3.35 \times 10^{12}}\,\text{s} \approx 10.02\,\mu\text{s}
All 132 SMs load their tiles in parallel through the L2 into their local SRAM. The bandwidth is shared, so the total time is dominated by the serial cost of streaming 33.55 MB through the HBM pipe, not by the per-tile cost.
step 4 · time to do the math on a single query vector
\text{FLOPs} = 2d^2 = 2 \cdot 4096^2 = 33.55 \times 10^6
t_\text{compute} = \frac{33.55 \times 10^6}{989 \times 10^{12}\,\text{FLOPs/s}} \approx 34\,\text{ns}
step 5 · the memory wall, in wall-clock time
read time
10.02μs
compute time
34ns
memory-bound by
295×
For every 34 ns of arithmetic, the SMs spend 10.02 μs waiting for bytes to arrive from HBM. And this is just one Q projection. A full Llama-7B forward pass runs ~7 projections of this kind per layer (several of them larger) × 32 layers, and the same memory wall applies to every one of them. Stacked up, the whole forward pass takes ~4.2 ms per decoded token — which is, within a constant factor, exactly what you measure on a real H100.
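The whole worked example fits in a few lines. A sketch reproducing steps 1–5:

```python
# One 4096 x 4096 FP16 Q projection, one decoded token, on an H100 SXM5.

D = 4096
BYTES = D * D * 2            # 33.55 MB of FP16 weights (step 1)
FLOPS = 2 * D * D            # one matrix-vector product (step 4)

t_read    = BYTES / 3.35e12  # stream the whole matrix from HBM (step 3)
t_compute = FLOPS / 989e12   # tensor-core math on one query vector (step 4)
print(f"read:    {t_read * 1e6:.2f} us")             # ~10.02 us
print(f"compute: {t_compute * 1e9:.0f} ns")          # ~34 ns
print(f"memory-bound by {t_read / t_compute:.0f}x")  # ~295x (step 5)
```

Note that the 295× ratio is the same as for the whole model: at AI = 1, every individual matmul inherits the full memory wall.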

Why the tile size is 128 and not 4096

The matrix is 4096 × 4096, but tiles are 128 × 128. Why? The answer sits in the memory hierarchy you just saw on the pyramid. An H100 SM has about 228 KB of shared memory (SRAM). A single 128 × 128 tile of FP16 weights is 128 × 128 × 2 bytes = 32 KB — it fits comfortably, with room to spare for input tiles, output accumulators, and software pipelining. A 256 × 256 tile would be 128 KB — that also fits, and larger tiles are common on Hopper. But you cannot fit the full 4096 × 4096 matrix in SRAM. It is 33 MB, two orders of magnitude too big. So you tile.

The tiling is what makes the data-flow pipeline possible. While tile t is being computed, the Tensor Memory Accelerator (TMA) is already streaming tile t+1 from HBM into a different SRAM buffer. When compute finishes tile t, the data for tile t+1 is already waiting. This is double buffering, and it's what keeps the 132 SMs fed. Without it, every SM would spend most of its time waiting for HBM.
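A toy timing model shows why the overlap matters: serialized, each tile costs load plus compute; double-buffered, each tile costs max(load, compute). The per-tile numbers below are derived from the worked example (an illustration, not a measurement):

```python
# Double buffering: overlapped per-tile time is max(load, compute),
# not load + compute. Numbers derived from the worked example above.

N_TILES   = 1024                   # 32 x 32 tiles of the 4096 x 4096 matrix
T_LOAD    = 32768 / 3.35e12        # ~9.8 ns to stream one 32 KB tile
T_COMPUTE = T_LOAD / 295           # compute is ~295x cheaper at AI = 1

serial    = N_TILES * (T_LOAD + T_COMPUTE)
pipelined = N_TILES * max(T_LOAD, T_COMPUTE) + T_COMPUTE  # overlap all but last
print(f"serial:    {serial * 1e6:.2f} us")
print(f"pipelined: {pipelined * 1e6:.2f} us")  # ~= pure HBM streaming time
```

With compute this cheap, pipelining only shaves a fraction of a percent here — the real point is that it hides compute entirely, so the matmul runs at exactly the HBM streaming time and never slower.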

Watch it move — data flow in motion

All the static diagrams are true, but they hide the one thing that makes the pipeline actually work: double buffering. While one tile is being computed in the tensor cores, the next tile is already streaming in from HBM. The pipeline is always full, which is the only reason the SMs don't spend their lives waiting for memory.

Press play below. Watch the copper tiles flow from HBM through L2 into SRAM and finally into the tensor cores, flash briefly while they compute, and exit as results. The dwell ratio is real — tiles spend most of their time in the HBM zone because that's where most of the wall-clock time is actually spent. Speed up the animation and you can see three or four tiles in flight at once — that's the pipeline filling.

H100 data-flow pipeline — 32 KB weight tiles, H100 SXM5
[animation: weight tiles (32 KB) flow HBM3 (80 GB · 3.35 TB/s) → L2 cache (50 MB · ~3 TB/s) → SRAM (228 KB · ~15 TB/s) → tensor cores (989 TFLOPs BF16), then results write back to HBM; counters track tiles processed, tiles in the pipeline, tiles computing now, and simulated time]

What the animation is actually telling you

Three observations worth writing down:

  • The HBM zone is wide on purpose. At any given moment, more tiles are in the HBM-to-L2 read phase than in any other phase, because that's the slowest hop in the pipeline. The visual proportions match the real time ratios: ~70% of a tile's lifetime is spent being streamed from HBM.
  • Multiple tiles are always in flight. While tile 5 is computing in the tensor cores, tile 6 is in SRAM waiting, tile 7 is streaming from L2, and tile 8 is streaming from HBM. This is double buffering. It's the only reason peak throughput is achievable.
  • The tensor cores flash fast. The “compute” phase — the actual math — is the shortest segment on screen. That visual is the memory wall. Compute is fast. Moving bytes is slow. Nothing in Act VIII changes that; everything in Act VIII is a different way of moving fewer bytes for the same amount of math.
A real H100 pipeline has dozens of tiles in flight simultaneously per SM, and 132 SMs all running this pipeline in parallel. The animation shows 6–10 tiles in a single pipeline for clarity. The mechanism scales — the parallelism just multiplies.

How a matmul actually runs on the hardware

The word “matmul” hides an enormous amount of mechanical work. Here is what happens when an H100 executes the matrix multiply Y = WX:

  1. The scheduler divides W into tiles sized to fit in one SM's SRAM (~228 KB of shared memory; 32 KB tiles in the example above).
  2. For each tile, the TMA (Tensor Memory Accelerator, a hardware DMA engine) starts streaming the tile from HBM into SRAM asynchronously.
  3. While one tile loads, the tensor cores multiply the previous tile. This is the single most important trick in GPU design: overlap data transfer with compute so the compute units never stall.
  4. The partial result stays in registers. When all tiles of a row are done, the result is written back to HBM.
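The four steps above have the same loop structure as a plain tiled matrix–vector product. Here is a pure-Python sketch on a tiny 4 × 4 example, with the hardware roles noted in comments:

```python
# The tiled schedule from the steps above: y = W @ x, computed tile by tile.
# Real hardware streams the tiles via the TMA and multiplies them on tensor
# cores; the loop structure is the same.

def tiled_matvec(W, x, tile=2):
    n = len(W)
    y = [0.0] * n                     # accumulators stay in "registers" (step 4)
    for i0 in range(0, n, tile):      # tile row: owns a slice of the output
        for j0 in range(0, n, tile):  # stream one tile of W from "HBM" (step 2)
            for i in range(i0, min(i0 + tile, n)):
                for j in range(j0, min(j0 + tile, n)):
                    y[i] += W[i][j] * x[j]  # multiply the loaded tile (step 3)
    return y                          # write results back to "HBM" (step 4)

W = [[1, 2, 3, 4],
     [5, 6, 7, 8],
     [9, 10, 11, 12],
     [13, 14, 15, 16]]
x = [1, 1, 1, 1]
print(tiled_matvec(W, x))  # [10.0, 26.0, 42.0, 58.0], same as an untiled matvec
```

The tiling changes the order in which bytes are touched, never the result — which is exactly why the hardware is free to pick tile sizes that fit its SRAM.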

Where this leads — Act VIII as a tour of the memory wall

Every lesson that follows this one attacks the memory wall from a different angle:

  • KV cache — what it is, why it is the biggest source of memory traffic in long-context decode.
  • PagedAttention — making the KV cache layout coalesced and recyclable.
  • Continuous batching — pack more tokens per weight read, increasing effective AI.
  • FlashAttention — keep intermediate tensors in SRAM, avoid HBM round-trips.
  • Speculative decoding — verify several proposed tokens in a single weight read, amortising the cost.
  • RadixAttention — reuse the KV bytes of shared prefixes instead of re-reading them.

With the roofline in mind, each one reads as a different way of shifting a red dot rightward on the plot you just played with.

comprehension check
comprehension · 1 / 4

Decoding a single token with a 7B FP16 model on an H100 takes about 4–6 ms. Why isn't it 14 µs (the compute-bound estimate)?