
[RFC] Constrained decoding for extension/llm — is it worth doing? #19215

@kirklandsign

Description


🚀 The feature, motivation and pitch

TL;DR

Should ExecuTorch add constrained decoding (force model output to conform
to a JSON schema, regex, or grammar) to extension/llm?

This RFC scopes the question to necessity, not implementation — no
dependency choice, no design commitment. Just: do we want this, and if
so, with what priority?


1. Motivation

What is constrained decoding?

At each decode step, mask the logits of grammar-disallowed tokens to
-inf before sampling, and advance a per-sequence state machine after
each sampled token. The model is then guaranteed to emit text matching
the grammar — no parsing fallbacks, no retry-on-invalid-JSON.
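Concretely, the mask-then-advance mechanism fits in a few lines. Below is a
minimal, self-contained sketch, not a proposed API: `ToyConstraint`,
`mask_logits`, and `argmax` are invented names, and the "grammar" is a toy
three-state machine accepting the language `a b* c` over a four-token
vocabulary.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Toy constraint accepting the language "a b* c" over a 4-token vocab
// (ids: 0='a', 1='b', 2='c', 3='x'). States: 0 = expect 'a',
// 1 = expect 'b' or 'c', 2 = done (accepting).
struct ToyConstraint {
  int state = 0;
  std::vector<bool> allowed(size_t vocab_size) const {
    std::vector<bool> ok(vocab_size, false);
    if (state == 0) ok[0] = true;
    else if (state == 1) ok[1] = ok[2] = true;
    return ok;
  }
  void advance(int token) {
    if (state == 0 && token == 0) state = 1;
    else if (state == 1 && token == 2) state = 2;  // 'b' keeps state 1
  }
  bool can_finish() const { return state == 2; }
};

// Step (1): force disallowed tokens to -inf so the sampler cannot pick them.
void mask_logits(const ToyConstraint& c, float* logits, size_t vocab_size) {
  auto ok = c.allowed(vocab_size);
  for (size_t i = 0; i < vocab_size; ++i)
    if (!ok[i]) logits[i] = -std::numeric_limits<float>::infinity();
}

// Stand-in for any sampler; greedy here for determinism.
int argmax(const float* logits, size_t n) {
  size_t best = 0;
  for (size_t i = 1; i < n; ++i)
    if (logits[i] > logits[best]) best = i;
  return static_cast<int>(best);
}
```

Even if the raw model strongly prefers the out-of-grammar token 3 ('x') at
every step, masking makes it unsamplable, and advance() after each sampled
token keeps the state machine in sync with the output so far.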

Why this matters now for on-device LLMs

  1. Tool use / function calling. Modern instruction-tuned models
    (Llama 3.1+, Gemma 3+, Qwen 3, Phi-4) emit structured tool-call
    payloads. Without constrained decoding, ~5–15% of tool calls are
    syntactically invalid even on cloud-class models — much worse on
    small on-device models.
  2. Structured output for app integration. Mobile / edge developers
    building features around LLM output (form filling, data extraction,
    UI generation) need schema-conforming output to avoid client-side
    parser failures.
  3. Agent frameworks (LangChain, LlamaIndex, Anthropic Agent SDK)
    assume the underlying runtime can produce structured output reliably.

Where ExecuTorch sits today

Runtime                    Constrained decoding
vLLM                       ✅
sglang                     ✅
llama.cpp                  ✅
MLX-LM                     ✅
Google LiteRT-LM           ✅
MediaPipe LLM Inference    ✅
ExecuTorch                 ❌

ExecuTorch is currently the only major on-device LLM runtime without it.
For developers picking an on-device framework for agent / tool-use
workloads, this is a hard differentiator.


2. Use cases this would enable

2.1 JSON-schema-constrained tool calls

GenerationConfig config;
config.constraint = make_json_schema_constraint(R"({
  "type": "object",
  "properties": {
    "name": {"type": "string", "enum": ["get_weather", "send_message"]},
    "arguments": {"type": "object"}
  },
  "required": ["name", "arguments"]
})");
runner.generate(prompt, config, ...);
// Output guaranteed to be valid JSON matching the schema

2.2 Regex-constrained extraction

config.constraint = make_regex_constraint(R"(\d{3}-\d{3}-\d{4})");

2.3 EBNF / CFG for DSLs

For SQL fragments, custom action languages, or anything else with a
formal grammar.

2.4 Composability

Constrained decoding should compose with all existing sampler modes
(greedy, top-p, top-k, temperature) and with multimodal input: text
output is constrained regardless of what modalities come in.
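Composition with samplers is essentially free because masking happens on
logits before the sampler runs: a token masked to -inf has exactly zero
probability under any temperature (exp(-inf) == 0), so it can never survive
top-k/top-p truncation either. A small sketch to make that concrete
(`softmax_with_temperature` is illustrative, not ExecuTorch's sampler):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Temperature-scaled softmax. A token masked to -inf gets exactly
// probability 0 at every temperature, since exp(-inf) == 0; this is
// why logit masking composes with any downstream sampling strategy.
std::vector<float> softmax_with_temperature(std::vector<float> logits,
                                            float temp) {
  float max_logit = -std::numeric_limits<float>::infinity();
  for (float l : logits) max_logit = std::max(max_logit, l);
  std::vector<float> probs(logits.size());
  float sum = 0.f;
  for (size_t i = 0; i < logits.size(); ++i) {
    probs[i] = std::exp((logits[i] - max_logit) / temp);
    sum += probs[i];
  }
  for (float& p : probs) p /= sum;
  return probs;
}
```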


3. What the integration would look like (sketch only)

Just enough to make the discussion concrete — no API or backend
commitments here.

A new module

extension/llm/grammar/
├── constraint.h               # opaque per-sequence state machine
├── constrained_decoder.h      # mask_logits() + advance() + can_finish()
├── <backend>_constraint.cpp   # backend-specific impl (TBD)
└── tokenizer_adapter.h        # bridge to pytorch/tokenizers

One field added to GenerationConfig

struct GenerationConfig {
  // ... existing ...
  std::shared_ptr<Constraint> constraint;  // null = unconstrained
};

Two hook points in the runner decode loop

auto logits = decoder_.forward(prev_token);
if (constraint) cdec.mask_logits(logits.data(), vocab_size);   // ← (1)
auto next = sampler_.sample(logits.data());
if (constraint) cdec.advance(next);                            // ← (2)

One small extension to the tokenizer interface

The grammar engine needs the full vocabulary as byte sequences — likely
a single new method on the Tokenizer interface in pytorch/tokenizers.
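To illustrate why byte-level vocabulary access is needed, here is a
hypothetical sketch (the `VocabBytes` alias and `digit_only_tokens` helper
are invented for this example, not the pytorch/tokenizers API): given every
token id's raw bytes, the grammar engine can precompute, once at load time,
which token ids are admissible from a given grammar state, e.g. all tokens
made purely of ASCII digits for a `\d+` regex state.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical result of the proposed tokenizer method: one byte
// string per token id (index == token id).
using VocabBytes = std::vector<std::string>;

// Example precomputation a grammar engine could run at load time:
// find every token whose bytes are purely ASCII digits, i.e. the
// tokens admissible while matching a regex state like \d+.
std::vector<int32_t> digit_only_tokens(const VocabBytes& vocab) {
  std::vector<int32_t> out;
  for (int32_t id = 0; id < static_cast<int32_t>(vocab.size()); ++id) {
    const std::string& bytes = vocab[id];
    if (bytes.empty()) continue;
    bool all_digits = true;
    for (char ch : bytes)
      if (ch < '0' || ch > '9') { all_digits = false; break; }
    if (all_digits) out.push_back(id);
  }
  return out;
}
```

With per-state token sets precomputed like this, the decode-time
mask_logits() hook reduces to a cheap table lookup rather than re-matching
bytes against the grammar on every step.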

That's the entire surface area. The bulk of the work is whichever
grammar engine we pick — explicitly out of scope for this RFC.


Open questions for discussion

  1. Should we build this? Yes/no/wait.
  2. If yes, what priority? P0 (this quarter), P1 (next quarter),
    eventual.
  3. Who needs it most? Tool-use developers, structured-output
    integrators, on-device agent frameworks — please chime in with
    concrete use cases.
  4. Any blockers I'm missing? Licensing, model-export coupling, NPU
    path concerns, etc.

A follow-up RFC will cover backend choice and detailed API design once
we're aligned that this is worth doing.


cc @larryliu0820 @mergennachin @cccclai @helunwencser @jackzhxng
