Forcing valid JSON: grammar-constrained decoding
How vLLM forces models to emit valid JSON by masking logits against a compiled grammar — and how XGrammar does it with near-zero overhead.
*(State diagram: an FSM accepting `{"name":"[a-zA-Z]+","age":\d+}`)*

The failure mode
Ask a modern instruction-tuned SLM to emit JSON, and most of the time you'll get JSON. Most of the time. On Phi-4-mini the measured rate hovers around 91% — the other 9% is the model slipping into a stray explanation, a trailing comma, an unterminated quote, or a markdown fence wrapping the whole response.
Nine percent is invisible in a demo. It is catastrophic in a tool pipeline. If your agent parses model output to call a function, one bad response in ten means the whole chain halts — and if you retry blindly, latency and cost compound until someone pages the on-call. The contract the rest of your system needs is not “usually valid”; it is always valid. Prompt engineering cannot get you there, because the decoder has no idea it's supposed to be writing JSON — it's just rolling dice over a 128k-token vocabulary, one token at a time.
The input to the sampler is a logit vector with one entry per token in the vocabulary. The sampler's job is to pick a token. Nothing in that pipeline knows what a brace is, or what a schema is, or that your downstream parser is about to call json.loads() on the result.
The one-line fix: mask the logits
Before the softmax, for every token in the vocabulary, ask the grammar a single question: would emitting this token here produce a prefix that can still be extended to something valid? If yes, the logit is untouched. If no, the logit is set to −∞.
```python
# vLLM-style grammar-guided sampling, simplified
logits = model.forward(tokens)        # [vocab_size]
mask = grammar.allowed_at(state)      # bool[vocab_size]
logits = logits.masked_fill(~mask, float("-inf"))
probs = softmax(logits)
token = sample(probs)
state = grammar.advance(state, token)
```
That's the whole idea. Softmax-of-minus-infinity is zero, so the invalid tokens cannot be sampled at any temperature. The model's own probability ranking is preserved over the legal tokens — you're not distorting its preferences, you're just deleting the options that would violate the contract. The literature calls this logit masking, and almost every production system (vLLM, TensorRT-LLM, Outlines, XGrammar, llguidance) reduces to exactly this operation at the innermost loop.
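The "softmax-of-minus-infinity is zero" claim is easy to verify directly. A minimal pure-Python sketch with a toy four-token vocabulary (the vocabulary and mask here are made up for illustration):

```python
import math

def softmax(logits):
    # math.exp(-inf) == 0.0, so masked tokens get exactly zero probability
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5, 3.0]          # model's raw preferences
mask   = [True, False, True, False]    # grammar: tokens 1 and 3 are illegal here
masked = [x if ok else -math.inf for x, ok in zip(logits, mask)]

probs = softmax(masked)
assert probs[1] == 0.0 and probs[3] == 0.0
# the legal tokens keep their relative ranking: p(token 0) > p(token 2)
assert probs[0] > probs[2]
```

Note that token 3 had the highest raw logit: masking does not reroute probability by preference, it simply deletes the illegal options and renormalises over what remains.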
From JSON Schema to a context-free grammar
Your Pydantic model is not an FSM. It is nested, recursive, sometimes self-referential. The compiler's job is to walk the schema and emit a grammar — traditionally in Extended Backus–Naur Form — that a decoder can run in O(1) per token.
```python
from pydantic import BaseModel

class Person(BaseModel):
    name: str
    age: int
```

That class gets compiled to a JSON Schema, which gets compiled to a grammar that looks roughly like this:
```
root    ::= "{" ws "\"name\"" ws ":" ws string ws
            "," ws "\"age\"" ws ":" ws integer ws "}"
string  ::= "\"" char* "\""
char    ::= [^"\\] | "\\" (["\\/bfnrt] | "u" hex hex hex hex)
hex     ::= [0-9a-fA-F]
integer ::= "-"? ([0-9] | [1-9] [0-9]+)
ws      ::= [ \t\n]*
```

The string and integer rules are the hard part — they are unbounded, meaning the grammar has to accept any length. That's why a regex like `[a-zA-Z]+` needs a self-loop in the state diagram above: "letters" is not a state you leave after one character; it's a state you loop in until you see the closing quote.
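That self-loop is concrete enough to write down. A toy FSM for the fragment `"\"" [a-zA-Z]+ "\""` — purely illustrative, not XGrammar's actual automaton — looks like this:

```python
import string

# States for matching "\"" [a-zA-Z]+ "\"" — a toy sketch, not XGrammar's automaton.
OPEN, QUOTE, LETTERS, DONE = range(4)

def step(state, ch):
    """Return the next state, or None if `ch` is illegal here."""
    if state == OPEN:
        return QUOTE if ch == '"' else None
    if state == QUOTE:
        # [a-zA-Z]+ requires at least one letter before the closing quote
        return LETTERS if ch in string.ascii_letters else None
    if state == LETTERS:
        if ch in string.ascii_letters:
            return LETTERS          # the self-loop: stay until the closing quote
        return DONE if ch == '"' else None
    return None                     # DONE accepts nothing further

def accepts(s):
    state = OPEN
    for ch in s:
        state = step(state, ch)
        if state is None:
            return False
    return state == DONE

assert accepts('"Ada"')
assert not accepts('"Ada')     # unterminated quote
assert not accepts('"A1"')     # digit not allowed by [a-zA-Z]+
```

The LETTERS → LETTERS transition is the self-loop: the automaton has a fixed, finite number of states even though the strings it accepts are unbounded.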
XGrammar's caching trick
The naive implementation is: at every decode step, walk the grammar and check each of the 128k tokens against it. Even at 10ns per check that's 1.3ms per token — a catastrophic overhead against vLLM's sub-5ms per-token target on an H100.
XGrammar's insight — and it is the single idea that took grammar-constrained decoding from a 20% overhead to a measurable-but-small one — is that the allowed-token set for a given FSM state is deterministic. The state does not know which prompt produced it; it only knows which grammar symbol is next. So compute it once per state, cache the resulting bitmask, and on every subsequent visit to that state, the “check each token” step collapses to a bitmask AND.
- Naive: walk the regex / EBNF for every (state, token) pair. Overhead ≈ O(|V|) per decode step, with hidden regex-engine costs on top. Measured overhead on a 70B vLLM server: ~40% latency penalty on JSON-heavy workloads. Acceptable for offline batch; painful for real-time chat.
- Cached bitmasks: a per-state `Allowed[state] : Bitmask[|V|]` computed once, offline, per (grammar, tokenizer) pair. The decode step becomes `logits &= Allowed[s]` — a single vectorised op. Measured overhead on the XGrammar paper's own benchmark: roughly 1% TPOT (6.2 → 6.3 ms at batch 1; 9.0 → 9.1 ms at batch 16 on Llama-3.1-8B). Speedups versus baselines: up to ~3× over prior structured-output stacks on JSON Schema, and up to ~100× on nested CFG workloads where the pushdown automaton matters.
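The cache itself is a few lines of logic. A toy Python sketch of the idea — the vocabulary, states, and `allows` predicate are all stand-ins, not XGrammar's real API:

```python
# Toy per-state token-bitmask cache — a sketch of the caching idea,
# not XGrammar's implementation. Everything here is a stand-in.
VOCAB = ['{', '}', '"', ':', ',', 'name', '0', '1']

def allows(state, token):
    # Stand-in for the expensive grammar walk the naive path would
    # redo for all |V| tokens at every decode step.
    legal = {'start': {'{'}, 'key': {'"', 'name'}, 'value': {'0', '1', '"'}}
    return token in legal.get(state, set())

_cache = {}  # state -> bitmask, computed once per (grammar, tokenizer) pair

def allowed_bitmask(state):
    if state not in _cache:
        bits = 0
        for i, tok in enumerate(VOCAB):   # the expensive walk, done once
            if allows(state, tok):
                bits |= 1 << i
        _cache[state] = bits
    return _cache[state]

# At decode time the per-token check collapses to a single AND:
mask = allowed_bitmask('key')
assert mask & (1 << VOCAB.index('name'))    # 'name' is legal in state 'key'
assert not mask & (1 << VOCAB.index('}'))   # '}' is not
```

The second and every later visit to `'key'` skips the loop entirely — which is exactly why the amortised overhead shrinks toward the cost of one vectorised AND per step.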
Overhead, in actual milliseconds
The XGrammar paper benchmarks on Llama-3.1-8B on a single H100 and reports per-output-token times of 6.2 → 6.3 ms at batch 1 and 9.0 → 9.1 ms at batch 16 — unconstrained vs JSON-Schema-constrained. That is roughly 1% TPOT overhead: small, but not literally zero. Older Outlines deployments on the same workload show a larger gap; the gap narrowed substantially in the 2024 Outlines rewrite.
The reason the XGrammar number is essentially free is the CPU/GPU pipeline: while the GPU is busy with the forward pass for token *t*, the CPU has already computed and uploaded the bitmask for the grammar state it reached after sampling token *t−1*. The mask is there when the logits arrive. No bubble.
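The overlap pattern itself is ordinary producer/consumer scheduling. A toy sketch with `concurrent.futures` — the "GPU forward pass" and mask builder are placeholder functions, not vLLM internals:

```python
# Toy sketch of the CPU/GPU overlap. All names are stand-ins; the
# timings just make the mask build cheaper than the forward pass.
from concurrent.futures import ThreadPoolExecutor
import time

def gpu_forward():                    # stand-in for the model forward pass
    time.sleep(0.010)
    return [0.1] * 8                  # fake logits

def compute_bitmask(state):           # stand-in for the grammar mask build
    time.sleep(0.002)
    return (1 << 8) - 1               # fake "all tokens legal" mask

state = "start"                       # known as soon as the last token was sampled
with ThreadPoolExecutor(max_workers=1) as cpu:
    # Kick off the mask for the current grammar state, then run the
    # forward pass; the mask finishes while the GPU is still busy.
    mask_future = cpu.submit(compute_bitmask, state)
    logits = gpu_forward()
    mask = mask_future.result()       # already done: no pipeline bubble
```

The key enabling fact is in the previous paragraph: the grammar state for step *t* is fully determined the moment token *t−1* is sampled, so the mask build never has to wait on the logits.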
When grammar-constrained fails
The pitfall nobody talks about on the demo stage: a forced grammar can degrade the model's quality. Ask a base model to “summarise this article in JSON with a title and bullet_points” and turn on XGrammar, and you may get a perfectly valid JSON blob full of bad summaries. The model “wanted” to think in prose first — to reason, restate, hedge — and the grammar cut those escape hatches off. What's left is the model racing to fill in fields it hasn't finished thinking through.
The mechanism is straightforward: grammar masking removes tokens the model was about to sample with high probability, pushing distributions into regions of its output space that are out-of-distribution for the current prompt. You're still getting greedy (or temperature-sampled) decoding — but only within a narrow ribbon the grammar allows. If the model's natural completion lives outside that ribbon, you're sampling from the tails.
The production-tested mitigations:
- Train-time exposure. Fine-tune on data where the model has already seen the target schema in context. The natural completion distribution moves closer to the ribbon — grammar masking becomes a safety net rather than a straitjacket.
- Think-then-emit. Ask for a `reasoning` field in the schema, before any other field. The model gets to think in prose (inside string quotes), then commits to the structured fields once reasoning is done. Cheap, ugly, effective.
- Two-pass decode. Generate freely first, then run a second pass (often a tiny 1B model) to repackage the answer into the target schema. Throws away the "decode once" win but recovers quality; worth it for high-stakes generations.
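The think-then-emit trick is just field ordering. A sketch of such a schema as a plain JSON Schema dict — the field names are illustrative, and the ordering argument relies on the common (though not universal) behaviour that grammar compilers emit required properties in declaration order:

```python
# Think-then-emit schema sketch: `reasoning` is declared first so the
# model decodes its free-form prose before the structured fields.
import json

schema = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},   # decoded first: a place to think
        "title": {"type": "string"},
        "bullet_points": {
            "type": "array",
            "items": {"type": "string"},
        },
    },
    "required": ["reasoning", "title", "bullet_points"],
}

# Python dicts preserve insertion order, so the compiled grammar sees
# `reasoning` before `title` and `bullet_points`.
assert list(schema["properties"])[0] == "reasoning"
print(json.dumps(schema, indent=2))
```

By the time decoding reaches `title`, the reasoning prose is already in the context window, conditioning the remaining fields.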
What to take to production
- If you're on vLLM > 0.5, enable `guided_json` with the XGrammar backend. The overhead is below your p99 noise floor.
- Write your Pydantic schema with a `reasoning` field first if the task involves any computation. The model needs a place to think.
- Test the constrained output against real user prompts before you ship. Validity is free; quality is not, and grammar masking can silently degrade it.
- For agent tool-calling, constrained decoding is not optional. The alternative is a retry loop that looks fine until 3 AM on a Tuesday when your LLM provider updates a model and the 9% failure rate becomes 14%.
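For reference, here is what a `guided_json` request payload looks like against vLLM's OpenAI-compatible server. The field names follow vLLM's documented extensions, but the exact knobs have moved between releases, so check your version; the model name and prompt are placeholders:

```python
# Sketch of a guided_json payload for vLLM's OpenAI-compatible server.
# Field names follow vLLM's extensions; verify against your vLLM version.
import json

person_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

payload = {
    "model": "microsoft/Phi-4-mini-instruct",       # placeholder model name
    "messages": [{"role": "user", "content": "Extract: Ada Lovelace, 36."}],
    # vLLM-specific extensions: constrain decoding to this JSON Schema
    "guided_json": person_schema,
    "guided_decoding_backend": "xgrammar",
}

body = json.dumps(payload)   # what you'd POST to /v1/chat/completions
assert json.loads(body)["guided_json"]["required"] == ["name", "age"]
```

With this in place, the `json.loads()` on the response can never raise on malformed output — the grammar guarantees it parses; whether it's a *good* answer is the quality question above.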