Microscale
Act VIII: Serving the Model
lesson radix-attention · 10 min · 50 xp

RadixAttention

Prefix trees for shared-prompt workloads

Shared prefixes are everywhere — stop prefilling them twice

Real-world LLM workloads have huge amounts of shared prefix: every request shares a system prompt; agent trajectories branch late from a common prefix; few-shot prompts repeat across many queries. Standard engines re-prefill those shared tokens on every request. Wasteful.

RadixAttention (SGLang, LMSYS 2024) stores the KV cache in a radix tree indexed by token prefix. When a new request arrives, walk down the tree matching tokens; for every match, reuse the cached KV block. Unmatched suffixes become new tree nodes. LRU eviction keeps the cache bounded.
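The match-then-insert walk can be sketched in a few lines. This is a hedged, minimal illustration, not SGLang's implementation: it uses a plain per-token trie (a real radix tree compresses runs of single-child nodes into one node, and real engines key on token blocks and store actual KV tensors — the `kv_block` string here is a placeholder handle):

```python
# Illustrative sketch of RadixAttention's cache lookup and insertion.
# Per-token trie nodes; placeholder KV handles instead of real tensors.
class RadixNode:
    def __init__(self):
        self.children = {}    # token -> RadixNode
        self.kv_block = None  # handle to the cached KV for this token

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Walk the tree; return how many leading tokens are already cached."""
        node, hits = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            hits += 1
        return hits

    def insert(self, tokens):
        """Add any unmatched suffix as new nodes; prefill happens only here."""
        node = self.root
        for t in tokens:
            if t not in node.children:
                child = RadixNode()
                child.kv_block = f"kv({t})"  # placeholder for a real KV block
                node.children[t] = child
            node = node.children[t]

cache = RadixCache()
req1 = ["sys:", "You", "are", "helpful.", "Q1"]
req2 = ["sys:", "You", "are", "helpful.", "Q2"]
print(cache.match_prefix(req1))  # 0 — cold cache, full prefill needed
cache.insert(req1)
print(cache.match_prefix(req2))  # 4 — the shared system prompt is reused
```

Eviction is omitted for brevity; in SGLang, LRU eviction removes leaf nodes first so a cached prefix is never orphaned from its ancestors.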

[Interactive demo: serving requests into the radix cache. Requests 1–3 each send "sys: You are helpful." followed by Q1/Q2/Q3; request 4 sends "sys: You are brief." followed by Q4. The prefix tree and running counters — tokens served, cache hits, hit rate (5 / 0 / 0% after request 1) — update as each request is served.]
Watch the hit rate climb as more requests with shared system prompts flow through. The radix tree's first four tokens ("sys: You are helpful.") are cached once and reused by every subsequent request. Request 4 diverges on the 4th token ("brief.") and the tree forks there — you can see the fork in the diagram above.
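The climbing counters can be reproduced directly. A small simulation, assuming the same illustrative word-level tokenization as the demo (real tokenizers split differently, and real caches count KV blocks, not tokens):

```python
# Replay the demo's four requests through a token-level prefix cache and
# track cumulative tokens served, cache hits, and hit rate.
def replay(requests):
    trie = {}
    served = hits = 0
    for toks in requests:
        # Match: walk the trie as far as the cached prefix reaches.
        node, matched = trie, 0
        for t in toks:
            if t not in node:
                break
            node = node[t]
            matched += 1
        # Insert: add the unmatched suffix as new nodes.
        node = trie
        for t in toks:
            node = node.setdefault(t, {})
        served += len(toks)
        hits += matched
        yield served, hits, hits / served

reqs = [
    ["sys:", "You", "are", "helpful.", "Q1"],
    ["sys:", "You", "are", "helpful.", "Q2"],
    ["sys:", "You", "are", "helpful.", "Q3"],
    ["sys:", "You", "are", "brief.",   "Q4"],
]
for served, hits, rate in replay(reqs):
    print(f"served={served:2d}  hits={hits:2d}  rate={rate:.0%}")
# rate climbs 0% -> 40% -> 53% -> 55%; request 4 hits only the
# 3-token "sys: You are" prefix before forking at "brief."
```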

When this wins big

  • Agent workloads — many rollouts from the same prompt for tree search or best-of-N
  • Chat products — every user shares a system prompt
  • Few-shot serving — same few-shot example bank across thousands of queries
  • Document-grounded assistants — same retrieved context shared across a session's turns

SGLang reports 5–10× speedups over vLLM on these workloads. For independent-request workloads with no prefix sharing, RadixAttention offers no benefit. Match the engine to the workload pattern.

Why this is really a storage trick — with scheduling strings attached

RadixAttention doesn't make attention faster. It makes the prefill you don't have to do infinitely faster — because you skip it entirely. That's why the latency win shows up in TTFT (time-to-first-token), not in throughput per decode step: once a prefix of length n is cached, a new request with that prefix starts decoding at token n+1 instead of paying the full O(n²) prefill. For a 2048-token system prompt on a 7B model, that's roughly 400–600 ms erased from TTFT on an H100.

The catch is that the scheduler has to know about the cache. Zheng et al. (SGLang, 2024) showed that a naive FIFO scheduler destroys hit rate — an LRU eviction can toss a hot prefix right before a burst of requests that would have reused it. SGLang's scheduler is prefix-aware: it reorders the request queue so sequences sharing a prefix are served together, pinning their common nodes in cache for the duration of the burst. That scheduler coupling is why you can't just "add RadixAttention to vLLM" as a library; prefix caching and scheduling are the same design decision. vLLM's own automatic prefix caching (APC) is a simpler hash-keyed block-table reuse, which gets most of the storage win but not the scheduling one.
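A toy illustration of the queue-reordering idea. The lexicographic sort over token lists below is a simple stand-in for a real prefix-aware policy (SGLang's scheduler is more sophisticated — it orders by matched prefix length against the live cache), but it shows the effect: requests sharing a prefix become adjacent in the schedule, so their common KV nodes stay hot between them.

```python
# Prefix-aware reordering vs FIFO (illustrative; not SGLang's scheduler).
# In FIFO arrival order, req4 ("brief.") lands between the "helpful."
# requests, and under memory pressure LRU could evict their shared prefix
# mid-burst. Grouping by prefix serves the sharers back to back.
waiting = [
    ("req1", ["sys:", "You", "are", "helpful.", "Q1"]),
    ("req4", ["sys:", "You", "are", "brief.",   "Q4"]),
    ("req2", ["sys:", "You", "are", "helpful.", "Q2"]),
    ("req3", ["sys:", "You", "are", "helpful.", "Q3"]),
]

# Lexicographic order on the token list clusters shared prefixes together.
scheduled = sorted(waiting, key=lambda r: r[1])
print([name for name, _ in scheduled])  # ['req4', 'req1', 'req2', 'req3']
```

After the sort, the three "helpful." requests run consecutively, so the four-token common prefix is matched twice in a row without any risk of eviction in between.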