Gen AI Foundations · May 7, 2026 · 5 min · llm, deeplearning

Structured Output and Constrained Decoding

There are two ways to get JSON out of a language model. One asks nicely and hopes. The other makes invalid output literally impossible to produce. Most teams ship the first, get burned by the 2% of responses that come back with a trailing comma or a chatty "Sure, here's your JSON:" preamble, and never find out the second exists. So let's start with why the polite approach is doomed, then build up to the version that can't fail to parse.

The failure mode

You write a prompt: "Return a JSON object with name, age, and email." You parse the result with json.loads. It works. It works again. It works ninety-eight times. Then the model, sampling token by token from a probability distribution, picks a slightly-too-creative continuation and you get:

Here's the information you requested:
{ "name": "Ada", "age": "thirty-six", "email": "ada@" }

Three failures in one. A prose preamble that breaks parsing. An age as a word instead of a number. A truncated email. Each starts with a single low-probability token that the sampler happened to pick, and because generation is autoregressive, one bad token steers everything that comes after it. At scale, "usually valid" is a synonym for "a pager that goes off at 3 a.m."

The patch most people reach for is prompt-and-retry: add "respond with ONLY valid JSON," validate, and if it fails, ask again. This helps. It also costs extra calls, adds latency, and still has a nonzero failure rate, because you're negotiating with a probabilistic process instead of constraining it.
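To make the cost concrete, here is a minimal sketch of prompt-and-retry. The `flaky_model` function is a hypothetical stand-in for a real LLM call (it is not any provider's API), rigged so the first reply carries a chatty preamble:

```python
import json

def flaky_model(prompt: str, attempt: int) -> str:
    """Hypothetical stand-in for an LLM call: the first reply has a prose preamble."""
    if attempt == 0:
        return 'Sure, here\'s your JSON:\n{"name": "Ada", "age": 36}'
    return '{"name": "Ada", "age": 36}'

def prompt_and_retry(prompt: str, max_attempts: int = 3):
    """Return (parsed_object, calls_made); raise if no attempt parses."""
    for attempt in range(max_attempts):
        raw = flaky_model(prompt, attempt)
        try:
            return json.loads(raw), attempt + 1
        except json.JSONDecodeError:
            continue  # every retry is another full model call
    raise ValueError(f"no valid JSON after {max_attempts} attempts")

result, calls = prompt_and_retry("Return a JSON object with name and age.")
print(result, calls)
```

The loop recovers, but only by paying for a second call, and a real model could just as easily fail every attempt and land in the `raise`.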

The real fix: constrain the decoder

Recall the generation loop. At each step the model produces logits over the whole vocabulary, and a sampler picks one token. Constrained decoding inserts a step in between: before sampling, set the probability of every token that would violate your format to zero. Mask them out. Renormalize over what's left. Sample only from tokens that keep the output valid.

If the model has emitted { "age": and your schema says age is an integer, then at this position the only legal next tokens are digits (and maybe a minus sign or whitespace). Every other token — letters, quotes, braces — gets masked to zero probability. The model cannot write "thirty-six" because the tokens that would start that string were never on the table. Validity stops being something you check after the fact and becomes something the sampler is physically unable to break.
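The masking step can be sketched in a few lines of plain Python. This is a toy under stated assumptions: the five-token vocabulary and its logits are made up for illustration, and `constrained_sample` is a hypothetical helper, not a function from any library.

```python
import math
import random

def constrained_sample(logits, allowed, rng):
    """Sample one token after masking everything outside `allowed`."""
    # Dropping a token here is the same as giving it zero probability.
    legal = {tok: logit for tok, logit in logits.items() if tok in allowed}
    if not legal:
        raise ValueError("constraint leaves no legal token")
    # Softmax over the survivors only, which renormalizes them to sum to 1.
    peak = max(legal.values())
    weights = {tok: math.exp(logit - peak) for tok, logit in legal.items()}
    total = sum(weights.values())
    probs = {tok: w / total for tok, w in weights.items()}
    tokens = list(probs)
    choice = rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]
    return choice, probs

# Made-up logits after the prefix '{ "age": ' (the model prefers a quote or a word).
logits = {'"': 2.0, "thirty": 1.5, "3": 1.0, "4": 0.2, "{": -1.0}
allowed = set("0123456789-")  # what an integer field permits at this position

choice, probs = constrained_sample(logits, allowed, random.Random(0))
print(choice, probs)
```

Even though the quote and "thirty" had the highest raw logits, only "3" and "4" survive the mask, so the sampler can only ever return a digit.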

[Figure: a grammar masks every illegal token to zero probability before the sampler picks one, so the output stays valid.]

How does the system know which tokens are legal? It compiles your format into a state machine, a grammar. As tokens are emitted, the grammar advances through its states, and at each state it knows exactly which characters (and therefore which tokens) may come next. The clever engineering, worked out in the outlines library and the 2023 "Efficient Guided Generation" paper by Willard and Louf, is precomputing an index over the token vocabulary rather than checking character by character at every step, so the masking adds very little latency. Underneath, you can express the target as a regex, a JSON Schema, or a context-free grammar; llama.cpp's GBNF format has let local models do exactly this for a long time.
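Here is a toy version of that index idea, assuming the target format is the regex `-?[0-9]+` and a made-up seven-token vocabulary. It is a sketch in the spirit of the approach, not the outlines implementation; `step` and `build_index` are hypothetical names.

```python
DIGITS = "0123456789"

def step(state, ch):
    """Hand-written DFA for -?[0-9]+ : 0 = start, 1 = after '-', 2 = in digits."""
    if ch in DIGITS:
        return 2
    if ch == "-" and state == 0:
        return 1
    return None  # illegal character in this state

def build_index(vocab, states=(0, 1, 2)):
    """Precompute, per state, which whole tokens are legal and where they lead."""
    index = {}
    for state in states:
        allowed = {}
        for token in vocab:
            cur = state
            for ch in token:
                cur = step(cur, ch)
                if cur is None:
                    break
            if cur is not None:
                allowed[token] = cur
        index[state] = allowed
    return index

vocab = ["1", "23", "-", "-4", "thirty", "3a", "007"]
index = build_index(vocab)

# At decode time the mask is a dictionary lookup, not a character-level scan.
state = 0
for token in ["-", "23"]:
    assert token in index[state]   # the token survived the mask
    state = index[state][token]    # advance the grammar
print(index[0], state)
```

The work of walking characters happens once, up front. During generation each step only looks up `index[state]`, which is why the per-token overhead can stay small.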

The ergonomic layer: Pydantic and instructor

Hand-writing grammars is tedious. In Python, the comfortable pattern is to define your shape as a Pydantic model and let a library handle the schema, the calls, and the validation. instructor patches an LLM client so you can pass a response_model and get a typed object back:

import instructor
# EmailStr needs the optional email-validator package: pip install "pydantic[email]"
from pydantic import BaseModel, EmailStr
from anthropic import Anthropic

class Contact(BaseModel):
    name: str
    age: int
    email: EmailStr

client = instructor.from_anthropic(Anthropic())

contact = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=256,
    response_model=Contact,
    messages=[{"role": "user", "content": "Ada Lovelace, 36, ada@example.com"}],
)

print(contact.age + 1)   # 37 — it's a real int, not the string "36"

You get a Contact instance with age as an actual integer. Whether validity is enforced by true token masking or by the provider's own structured-output mode plus a validate-and-retry loop depends on the backend, but from your side the contract is a typed object, not a string you cross your fingers over. The same idea sits behind tool / function calling: the tool's parameter schema is a constraint on what the model may emit, which is why function calls tend to parse reliably when free-form JSON doesn't.

The catch nobody mentions

Constrained decoding guarantees your output parses. It guarantees nothing about whether it's correct. Mask the model into emitting an integer for age and it will hand you 0 or 42 rather than refuse — a syntactically perfect value that's semantically invented. Forcing structure can paper over the model's uncertainty instead of surfacing it.
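A small illustration of the gap between "parses" and "is correct". The input text, the model reply, and the `ungrounded_fields` helper are all made up for this sketch, and the substring check is deliberately crude; it only shows that a separate semantic check is a different job from parsing.

```python
import json

def ungrounded_fields(source_text, record):
    """Return the fields whose values never appear in the source text."""
    return [key for key, value in record.items() if str(value) not in source_text]

source = "Ada Lovelace, ada@example.com"  # note: no age anywhere
raw = '{"name": "Ada Lovelace", "age": 42, "email": "ada@example.com"}'

record = json.loads(raw)                  # parses without complaint
suspect = ungrounded_fields(source, record)
print(suspect)
```

The object is valid JSON with a real integer for age, and the check still flags that field, because nothing in the input supports it.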

There's a subtler cost too. Hard constraints change the path the model takes. If your schema demands the model commit to a category field before it generates its reasoning, you've forced an answer ahead of the thinking that should justify it — and quality can drop. The fix is ordering: let a reasoning or rationale field come first in the schema so the model "thinks" before it "decides," then constrain the final fields. Structure and chain-of-thought aren't enemies; you just have to sequence them.
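The ordering fix might look like this as a JSON Schema built in Python. The `rationale` and `category` field names come from the paragraph above; the enum values and the `make_schema` helper are assumptions for illustration. Whether a given backend emits fields in declaration order is also an assumption to verify for your stack.

```python
import json

def make_schema(fields):
    """Build a JSON Schema object whose properties keep the given order."""
    return {
        "type": "object",
        "properties": {name: spec for name, spec in fields},
        "required": [name for name, _ in fields],
        "additionalProperties": False,
    }

schema = make_schema([
    # Free-text reasoning first, so the model "thinks" before it commits.
    ("rationale", {"type": "string"}),
    # The tightly constrained decision comes last (enum values are hypothetical).
    ("category", {"type": "string", "enum": ["billing", "bug", "other"]}),
])

print(json.dumps(schema, indent=2))
```

Python dicts preserve insertion order, so the serialized schema lists `rationale` before `category`. With a Pydantic model the equivalent move is simply declaring the reasoning field first.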

My rule of thumb:

The shift in mindset is small but it changes how solid your pipeline feels overnight: stop asking the model for valid structure and start making invalid structure unreachable. The model is going to roll its weighted dice no matter what. Constrained decoding just takes the bad outcomes off the dice.
