Microscale: a field journal for small language models

Act VIII · Region 08

Serving the Model

The hardware-level act: KV cache, paging, speculation

Watch the KV cache grow. See PagedAttention turn the GPU into a tiny operating system. See speculative decoding dance — drafts proposed and verified. This is where SLM serving becomes economical.

badge · KV Cache Master

1
Inside the GPU
Registers, L1, L2, HBM — the memory pyramid and roofline model that explain why LLM token generation is bottlenecked by bandwidth, not compute
15 min
60 xp
2
The KV cache
What the KV cache stores, why it dominates GPU memory at long contexts, and how quantization and eviction strategies reduce the footprint
10 min
50 xp
3
PagedAttention
How vLLM borrows OS paging to eliminate KV cache fragmentation — block tables, non-contiguous allocation, and near-zero memory waste
12 min
60 xp
4
Continuous batching
Why static batching wastes GPU cycles and how iteration-level scheduling in vLLM and TGI streams requests without idle slots
9 min
45 xp
5
FlashAttention
Never materialize the full attention matrix — how tiling and on-chip SRAM cut HBM reads while computing exact softmax
10 min
50 xp
6
Speculative decoding
A small draft model proposes tokens and the target model verifies them in one pass, typically giving 2-3× faster generation with no change to the target model's output distribution
11 min
55 xp
7
Forcing valid JSON: grammar-constrained decoding
How vLLM forces models to emit valid JSON by masking logits against a compiled grammar — XGrammar's near-zero overhead trick
12 min
50 xp
8
Predicting further ahead: MTP breaks the one-token-per-step contract
DeepSeek-V3 and Qwen3-Next predict 2-4 tokens in one forward pass — sequential modules that chain, acceptance rates, and the speed tradeoff
22 min
70 xp
9
RadixAttention
How SGLang's radix tree reuses cached prefixes across requests — achieving 50-99% cache hit rates for shared-prompt workloads
10 min
50 xp
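
As a small preview of lesson 2, the claim that the KV cache dominates GPU memory at long contexts can be checked with back-of-the-envelope arithmetic. The sketch below assumes a standard decoder-only transformer that caches one key and one value vector per layer, per KV head, per token. The function name `kv_cache_bytes` and the model shape used (32 layers, 8 KV heads, head dimension 128) are illustrative assumptions, not figures taken from the lessons.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV cache size in bytes.

    The leading factor of 2 counts the key tensor and the value tensor.
    bytes_per_elem=2 corresponds to fp16/bf16; use 1 for an 8-bit cache.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Hypothetical model shape, chosen only for illustration.
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)

per_token = kv_cache_bytes(seq_len=1, **shape)
full_context = kv_cache_bytes(seq_len=8192, **shape)
full_context_int8 = kv_cache_bytes(seq_len=8192, bytes_per_elem=1, **shape)

print(f"{per_token / 1024:.0f} KiB per token")                # 128 KiB per token
print(f"{full_context / 2**30:.2f} GiB at 8192 tokens")       # 1.00 GiB at 8192 tokens
print(f"{full_context_int8 / 2**30:.2f} GiB with 8-bit KV")   # 0.50 GiB with 8-bit KV
```

The footprint grows linearly with both context length and batch size, which is why quantizing or evicting cache entries, and avoiding fragmentation when allocating them, matter so much for serving cost.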