Forcing valid JSON: grammar-constrained decoding
How vLLM forces models to emit valid JSON by masking logits against a compiled grammar — and how XGrammar does it with near-zero overhead.
*(State diagram: an FSM accepting `{"name":"[a-zA-Z]+","age":\d+}`)*

The failure mode
Ask a modern instruction-tuned SLM to emit JSON, and most of the time you'll get JSON. Most of the time. On Phi-4-mini the measured rate hovers around 91% — the other 9% is the model slipping into a stray explanation, a trailing comma, an unterminated quote, or a markdown fence wrapping the whole response.
Nine percent is invisible in a demo. It is catastrophic in a tool pipeline. If your agent parses model output to call a function, one bad response in ten means the whole chain halts — and if you retry blindly, latency and cost compound until someone pages the on-call. The contract the rest of your system needs is not “usually valid”; it is always valid. Prompt engineering cannot get you there, because the decoder has no idea it's supposed to be writing JSON — it's just rolling dice over a 128k-token vocabulary, one token at a time.
The input to the sampler is a logit vector with one entry per token in the vocabulary. The sampler's job is to pick a token. Nothing in that pipeline knows what a brace is, or what a schema is, or that your downstream parser is about to call json.loads() on the result.
The one-line fix: mask the logits
Before the softmax, for every token in the vocabulary, ask the grammar a single question: would emitting this token here produce a prefix that can still be extended to something valid? If yes, the logit is untouched. If no, the logit is set to −∞.
```python
# vLLM-style grammar-guided sampling, simplified
logits = model.forward(tokens)        # [vocab_size]
mask = grammar.allowed_at(state)      # bool[vocab_size]
logits = logits.masked_fill(~mask, float("-inf"))
probs = softmax(logits)
token = sample(probs)
state = grammar.advance(state, token)
```
That's the whole idea. Softmax-of-minus-infinity is zero, so the invalid tokens cannot be sampled at any temperature. The model's own probability ranking is preserved over the legal tokens — you're not distorting its preferences, you're just deleting the options that would violate the contract. The literature calls this logit masking, and almost every production system (vLLM, TensorRT-LLM, Outlines, XGrammar, llguidance) reduces to exactly this operation at the innermost loop.
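The "softmax-of-minus-infinity is zero" claim is easy to verify directly. A minimal pure-Python sketch with a toy four-token vocabulary (the vocabulary and mask here are made up for illustration):

```python
import math

def softmax(logits):
    # math.exp(-inf) == 0.0, so masked tokens get exactly zero probability
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 3.0]          # model's raw preferences
mask   = [True, False, True, False]    # grammar: tokens 1 and 3 are illegal here
masked = [x if ok else -math.inf for x, ok in zip(logits, mask)]

probs = softmax(masked)
assert probs[1] == 0.0 and probs[3] == 0.0
# the legal tokens keep their relative ranking: p(token 0) > p(token 2)
assert probs[0] > probs[2]
```

Note that token 3 had the highest raw logit: masking does not reroute probability by preference, it simply deletes the illegal options and renormalises over what remains.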
From JSON Schema to a context-free grammar
Your Pydantic model is not an FSM. It is nested, recursive, sometimes self-referential. The compiler's job is to walk the schema and emit a grammar — traditionally in Extended Backus–Naur Form — that a decoder can run in O(1) per token.
```python
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int
```

That class gets compiled to a JSON Schema, which gets compiled to a grammar that looks roughly like this:
```
root    ::= "{" ws "\"name\"" ws ":" ws string ws
            "," ws "\"age\"" ws ":" ws integer ws "}"
string  ::= "\"" char* "\""
char    ::= [^"\\] | "\\" (["\\/bfnrt] | "u" hex hex hex hex)
hex     ::= [0-9a-fA-F]
integer ::= "-"? ([0-9] | [1-9] [0-9]+)
ws      ::= [ \t\n]*
```

The string and integer rules are the hard part — they are unbounded, meaning the grammar has to accept any length. That's why a regex like `[a-zA-Z]+` needs a self-loop in the state diagram above: "letters" is not a state you leave after one character; it's a state you loop in until you see the closing quote.
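That self-loop is concrete enough to write down. A toy FSM for the fragment `"\"" [a-zA-Z]+ "\""` — purely illustrative, not XGrammar's actual automaton — looks like this:

```python
import string

# States for matching "\"" [a-zA-Z]+ "\"" — a toy sketch, not XGrammar's automaton.
OPEN, QUOTE, LETTERS, DONE = range(4)

def step(state, ch):
    """Return the next state, or None if `ch` is illegal here."""
    if state == OPEN:
        return QUOTE if ch == '"' else None
    if state == QUOTE:
        # [a-zA-Z]+ requires at least one letter before the closing quote
        return LETTERS if ch in string.ascii_letters else None
    if state == LETTERS:
        if ch in string.ascii_letters:
            return LETTERS          # the self-loop: stay until the closing quote
        return DONE if ch == '"' else None
    return None                     # DONE accepts nothing further

def accepts(s):
    state = OPEN
    for ch in s:
        state = step(state, ch)
        if state is None:
            return False
    return state == DONE

assert accepts('"Ada"')
assert not accepts('"Ada')     # unterminated quote
assert not accepts('"A1"')     # digit not allowed by [a-zA-Z]+
```

The LETTERS → LETTERS transition is the self-loop: the automaton has a fixed, finite number of states even though the strings it accepts are unbounded.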
XGrammar's caching trick
The naive implementation is: at every decode step, walk the grammar and check each of the 128k tokens against it. Even at 10ns per check that's 1.3ms per token — a catastrophic overhead against vLLM's sub-5ms per-token target on an H100.
XGrammar's insight — and it is the single idea that took grammar-constrained decoding from a 20% overhead to a measurable-but-small one — is that the allowed-token set for a given FSM state is deterministic. The state does not know which prompt produced it; it only knows which grammar symbol is next. So compute it once per state, cache the resulting bitmask, and on every subsequent visit to that state, the “check each token” step collapses to a bitmask AND.
- Naive: walk the regex / EBNF for every (state, token) pair. Overhead ≈ O(|V|) per decode step, with hidden regex-engine costs on top. Measured overhead on a 70B vLLM server: ~40% latency penalty on JSON-heavy workloads. Acceptable for offline batch; painful for real-time chat.
- Cached bitmasks: a per-state `Allowed[state] : Bitmask[|V|]` computed once, offline, per (grammar, tokenizer) pair. The decode step becomes `logits &= Allowed[s]` — a single vectorised op. Measured overhead on the XGrammar paper's own benchmark: roughly 1% TPOT (6.2 → 6.3 ms at batch 1; 9.0 → 9.1 ms at batch 16 on Llama-3.1-8B). Speedups versus baselines: up to ~3× over prior structured-output stacks on JSON Schema, and up to ~100× on nested CFG workloads where the pushdown automaton matters.
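The cache itself is a few lines of logic. A toy Python sketch of the idea — the vocabulary, states, and `allows` predicate are all stand-ins, not XGrammar's real API:

```python
# Toy per-state token-bitmask cache — a sketch of the caching idea,
# not XGrammar's implementation. Everything here is a stand-in.
VOCAB = ['{', '}', '"', ':', ',', 'name', '0', '1']

def allows(state, token):
    # Stand-in for the expensive grammar walk the naive path would
    # redo for all |V| tokens at every decode step.
    legal = {'start': {'{'}, 'key': {'"', 'name'}, 'value': {'0', '1', '"'}}
    return token in legal.get(state, set())

_cache = {}  # state -> bitmask, computed once per (grammar, tokenizer) pair

def allowed_bitmask(state):
    if state not in _cache:
        bits = 0
        for i, tok in enumerate(VOCAB):   # the expensive walk, done once
            if allows(state, tok):
                bits |= 1 << i
        _cache[state] = bits
    return _cache[state]

# At decode time the per-token check collapses to a single AND:
mask = allowed_bitmask('key')
assert mask & (1 << VOCAB.index('name'))    # 'name' is legal in state 'key'
assert not mask & (1 << VOCAB.index('}'))   # '}' is not
```

The second and every later visit to `'key'` skips the loop entirely — which is exactly why the amortised overhead shrinks toward the cost of one vectorised AND per step.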
Overhead, in actual milliseconds
The XGrammar paper benchmarks on Llama-3.1-8B on a single H100 and reports per-output-token times of 6.2 → 6.3 ms at batch 1 and 9.0 → 9.1 ms at batch 16 — unconstrained vs JSON-Schema-constrained. That is roughly 1% TPOT overhead: small, but not literally zero. Older Outlines deployments on the same workload show a larger gap; the gap narrowed substantially in the 2024 Outlines rewrite.
The reason the XGrammar number is essentially free is the CPU/GPU pipeline: while the GPU is busy with the forward pass for token *t*, the CPU has already computed and uploaded the bitmask for the grammar state it reached after sampling token *t−1*. The mask is there when the logits arrive. No bubble.
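The overlap pattern itself is ordinary producer/consumer scheduling. A toy sketch with `concurrent.futures` — the "GPU forward pass" and mask builder are placeholder functions, not vLLM internals:

```python
# Toy sketch of the CPU/GPU overlap. All names are stand-ins; the
# timings just make the mask build cheaper than the forward pass.
from concurrent.futures import ThreadPoolExecutor
import time

def gpu_forward():                    # stand-in for the model forward pass
    time.sleep(0.010)
    return [0.1] * 8                  # fake logits

def compute_bitmask(state):           # stand-in for the grammar mask build
    time.sleep(0.002)
    return (1 << 8) - 1               # fake "all tokens legal" mask

state = "start"                       # known as soon as the last token was sampled
with ThreadPoolExecutor(max_workers=1) as cpu:
    # Kick off the mask for the current grammar state, then run the
    # forward pass; the mask finishes while the GPU is still busy.
    mask_future = cpu.submit(compute_bitmask, state)
    logits = gpu_forward()
    mask = mask_future.result()       # already done: no pipeline bubble
```

The key enabling fact is in the previous paragraph: the grammar state for step *t* is fully determined the moment token *t−1* is sampled, so the mask build never has to wait on the logits.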
When grammar-constrained fails
The pitfall nobody talks about on the demo stage: a forced grammar can degrade the model's quality. Ask a base model to “summarise this article in JSON with a title and bullet_points” and turn on XGrammar, and you may get a perfectly valid JSON blob full of bad summaries. The model “wanted” to think in prose first — to reason, restate, hedge — and the grammar cut those escape hatches off. What's left is the model racing to fill in fields it hasn't finished thinking through.
The mechanism is straightforward: grammar masking removes tokens the model was about to sample with high probability, pushing distributions into regions of its output space that are out-of-distribution for the current prompt. You're still getting greedy (or temperature-sampled) decoding — but only within a narrow ribbon the grammar allows. If the model's natural completion lives outside that ribbon, you're sampling from the tails.
The production-tested mitigations:
- Train-time exposure. Fine-tune on data where the model has already seen the target schema in context. The natural completion distribution moves closer to the ribbon — grammar masking becomes a safety net rather than a straitjacket.
- Think-then-emit. Ask for a `reasoning` field in the schema, before any other field. The model gets to think in prose (inside string quotes), then commits to the structured fields once reasoning is done. Cheap, ugly, effective.
- Two-pass decode. Generate freely first, then run a second pass (often a tiny 1B model) to repackage the answer into the target schema. Throws away the "decode once" win but recovers quality; worth it for high-stakes generations.
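The think-then-emit trick is just field ordering. A sketch of such a schema as a plain JSON Schema dict — the field names are illustrative, and the ordering argument relies on the common (though not universal) behaviour that grammar compilers emit required properties in declaration order:

```python
# Think-then-emit schema sketch: `reasoning` is declared first so the
# model decodes its free-form prose before the structured fields.
import json

schema = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},   # decoded first: a place to think
        "title": {"type": "string"},
        "bullet_points": {
            "type": "array",
            "items": {"type": "string"},
        },
    },
    "required": ["reasoning", "title", "bullet_points"],
}

# Python dicts preserve insertion order, so the compiled grammar sees
# `reasoning` before `title` and `bullet_points`.
assert list(schema["properties"])[0] == "reasoning"
print(json.dumps(schema, indent=2))
```

By the time decoding reaches `title`, the reasoning prose is already in the context window, conditioning the remaining fields.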
What to take to production
- If you're on vLLM > 0.5, enable `guided_json` with the XGrammar backend. The overhead is below your p99 noise floor.
- Write your Pydantic schema with a `reasoning` field first if the task involves any computation. The model needs a place to think.
- Test the constrained output against real user prompts before you ship. Validity is free; quality is not, and grammar masking can silently degrade it.
- For agent tool-calling, constrained decoding is not optional. The alternative is a retry loop that looks fine until 3 AM on a Tuesday when your LLM provider updates a model and the 9% failure rate becomes 14%.
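For reference, here is what a `guided_json` request payload looks like against vLLM's OpenAI-compatible server. The field names follow vLLM's documented extensions, but the exact knobs have moved between releases, so check your version; the model name and prompt are placeholders:

```python
# Sketch of a guided_json payload for vLLM's OpenAI-compatible server.
# Field names follow vLLM's extensions; verify against your vLLM version.
import json

person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

payload = {
    "model": "microsoft/Phi-4-mini-instruct",       # placeholder model name
    "messages": [{"role": "user", "content": "Extract: Ada Lovelace, 36."}],
    # vLLM-specific extensions: constrain decoding to this JSON Schema
    "guided_json": person_schema,
    "guided_decoding_backend": "xgrammar",
}

body = json.dumps(payload)   # what you'd POST to /v1/chat/completions
assert json.loads(body)["guided_json"]["required"] == ["name", "age"]
```

With this in place, the `json.loads()` on the response can never raise on malformed output — the grammar guarantees it parses; whether it's a *good* answer is the quality question above.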