Microscale
Act VIII: Serving the Model
lesson radix-attention · 10 min · 50 xp

RadixAttention

Prefix trees for shared-prompt workloads

Shared prefixes are everywhere — stop prefilling them twice

Real-world LLM workloads have huge amounts of shared prefix: every request shares a system prompt; agent trajectories branch late from a common prefix; few-shot prompts repeat across many queries. Standard engines re-prefill those shared tokens on every request. Wasteful.

RadixAttention (SGLang, LMSYS 2024) stores the KV cache in a radix tree indexed by token prefix. When a new request arrives, walk down the tree matching tokens; for every match, reuse the cached KV block. Unmatched suffixes become new tree nodes. LRU eviction keeps the cache bounded.
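The match-then-insert walk can be sketched in a few lines. This is a hedged, minimal illustration, not SGLang's implementation: it uses a plain per-token trie (a real radix tree compresses runs of single-child nodes into one node, and real engines key on token blocks and store actual KV tensors — the `kv_block` string here is a placeholder handle):

```python
# Illustrative sketch of RadixAttention's cache lookup and insertion.
# Per-token trie nodes; placeholder KV handles instead of real tensors.
class RadixNode:
    def __init__(self):
        self.children = {}    # token -> RadixNode
        self.kv_block = None  # handle to the cached KV for this token

class RadixCache:
    def __init__(self):
        self.root = RadixNode()

    def match_prefix(self, tokens):
        """Walk the tree; return how many leading tokens are already cached."""
        node, hits = self.root, 0
        for t in tokens:
            if t not in node.children:
                break
            node = node.children[t]
            hits += 1
        return hits

    def insert(self, tokens):
        """Add any unmatched suffix as new nodes; prefill happens only here."""
        node = self.root
        for t in tokens:
            if t not in node.children:
                child = RadixNode()
                child.kv_block = f"kv({t})"  # placeholder for a real KV block
                node.children[t] = child
            node = node.children[t]

cache = RadixCache()
req1 = ["sys:", "You", "are", "helpful.", "Q1"]
req2 = ["sys:", "You", "are", "helpful.", "Q2"]
print(cache.match_prefix(req1))  # 0 — cold cache, full prefill needed
cache.insert(req1)
print(cache.match_prefix(req2))  # 4 — the shared system prompt is reused
```

Eviction is omitted for brevity; in SGLang, LRU eviction removes leaf nodes first so a cached prefix is never orphaned from its ancestors.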

[Interactive demo: serving requests into the radix cache. Requests 1–3 each send "sys: You are helpful." followed by Q1/Q2/Q3; request 4 sends "sys: You are brief." followed by Q4. The prefix tree and running counters — tokens served, cache hits, hit rate (5 / 0 / 0% after request 1) — update as each request is served.]
Watch the hit rate climb as more requests with shared system prompts flow through. The radix tree's first four tokens ("sys: You are helpful.") are cached once and reused by every subsequent request. Request 4 diverges on the 4th token ("brief.") and the tree forks there — you can see the fork in the diagram above.
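The climbing counters can be reproduced directly. A small simulation, assuming the same illustrative word-level tokenization as the demo (real tokenizers split differently, and real caches count KV blocks, not tokens):

```python
# Replay the demo's four requests through a token-level prefix cache and
# track cumulative tokens served, cache hits, and hit rate.
def replay(requests):
    trie = {}
    served = hits = 0
    for toks in requests:
        # Match: walk the trie as far as the cached prefix reaches.
        node, matched = trie, 0
        for t in toks:
            if t not in node:
                break
            node = node[t]
            matched += 1
        # Insert: add the unmatched suffix as new nodes.
        node = trie
        for t in toks:
            node = node.setdefault(t, {})
        served += len(toks)
        hits += matched
        yield served, hits, hits / served

reqs = [
    ["sys:", "You", "are", "helpful.", "Q1"],
    ["sys:", "You", "are", "helpful.", "Q2"],
    ["sys:", "You", "are", "helpful.", "Q3"],
    ["sys:", "You", "are", "brief.",   "Q4"],
]
for served, hits, rate in replay(reqs):
    print(f"served={served:2d}  hits={hits:2d}  rate={rate:.0%}")
# rate climbs 0% -> 40% -> 53% -> 55%; request 4 hits only the
# 3-token "sys: You are" prefix before forking at "brief."
```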

When this wins big

  • Agent workloads — many rollouts from the same prompt for tree search or best-of-N
  • Chat products — every user shares a system prompt
  • Few-shot serving — same few-shot example bank across thousands of queries
  • Document-grounded assistants — same retrieved context shared across a session's turns

SGLang reports 5–10× speedups over vLLM on these workloads. For independent-request workloads with no prefix sharing, RadixAttention offers no benefit. Match the engine to the workload pattern.

Why this is really a storage trick — with scheduling strings attached

RadixAttention doesn't make attention faster. It makes the prefill you don't have to do infinitely faster — because you skip it entirely. That's why the latency win shows up in TTFT (time-to-first-token), not in throughput per decode step: once a prefix of length n is cached, a new request with that prefix starts decoding at token n+1 instead of paying the full O(n²) prefill. For a 2048-token system prompt on a 7B model, that's roughly 400–600 ms erased from TTFT on an H100.

The catch is that the scheduler has to know about the cache. Zheng et al. (SGLang, 2024) showed that a naive FIFO scheduler destroys hit rate — an LRU eviction can toss a hot prefix right before a burst of requests that would have reused it. SGLang's scheduler is prefix-aware: it reorders the request queue so sequences sharing a prefix are served together, pinning their common nodes in cache for the duration of the burst. That scheduler coupling is why you can't just "add RadixAttention to vLLM" as a library; prefix caching and scheduling are the same design decision. vLLM's own automatic prefix caching (APC) is a simpler hash-keyed block-table reuse, which gets most of the storage win but not the scheduling one.
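A toy illustration of the queue-reordering idea. The lexicographic sort over token lists below is a simple stand-in for a real prefix-aware policy (SGLang's scheduler is more sophisticated — it orders by matched prefix length against the live cache), but it shows the effect: requests sharing a prefix become adjacent in the schedule, so their common KV nodes stay hot between them.

```python
# Prefix-aware reordering vs FIFO (illustrative; not SGLang's scheduler).
# In FIFO arrival order, req4 ("brief.") lands between the "helpful."
# requests, and under memory pressure LRU could evict their shared prefix
# mid-burst. Grouping by prefix serves the sharers back to back.
waiting = [
    ("req1", ["sys:", "You", "are", "helpful.", "Q1"]),
    ("req4", ["sys:", "You", "are", "brief.",   "Q4"]),
    ("req2", ["sys:", "You", "are", "helpful.", "Q2"]),
    ("req3", ["sys:", "You", "are", "helpful.", "Q3"]),
]

# Lexicographic order on the token list clusters shared prefixes together.
scheduled = sorted(waiting, key=lambda r: r[1])
print([name for name, _ in scheduled])  # ['req4', 'req1', 'req2', 'req3']
```

After the sort, the three "helpful." requests run consecutively, so the four-token common prefix is matched twice in a row without any risk of eviction in between.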