
[RFC] Constrained decoding for extension/llm — is it worth doing? #19215

@kirklandsign

Description


🚀 The feature, motivation and pitch

TL;DR

Should ExecuTorch add constrained decoding (force model output to conform
to a JSON schema, regex, or grammar) to extension/llm?

This RFC scopes the question to necessity, not implementation — no
dependency choice, no design commitment. Just: do we want this, and if
so, with what priority?


1. Motivation

What is constrained decoding?

At each decode step, mask the logits of grammar-disallowed tokens to
-inf before sampling, and advance a per-sequence state machine after
each sampled token. The model is then guaranteed to emit text matching
the grammar — no parsing fallbacks, no retry-on-invalid-JSON.
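Concretely, the mask-then-advance mechanism fits in a few lines. Below is a
minimal, self-contained sketch, not a proposed API: `ToyConstraint`,
`mask_logits`, and `argmax` are invented names, and the "grammar" is a toy
three-state machine accepting the language `a b* c` over a four-token
vocabulary.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Toy constraint accepting the language "a b* c" over a 4-token vocab
// (ids: 0='a', 1='b', 2='c', 3='x'). States: 0 = expect 'a',
// 1 = expect 'b' or 'c', 2 = done (accepting).
struct ToyConstraint {
  int state = 0;
  std::vector<bool> allowed(size_t vocab_size) const {
    std::vector<bool> ok(vocab_size, false);
    if (state == 0) ok[0] = true;
    else if (state == 1) ok[1] = ok[2] = true;
    return ok;
  }
  void advance(int token) {
    if (state == 0 && token == 0) state = 1;
    else if (state == 1 && token == 2) state = 2;  // 'b' keeps state 1
  }
  bool can_finish() const { return state == 2; }
};

// Step (1): force disallowed tokens to -inf so the sampler cannot pick them.
void mask_logits(const ToyConstraint& c, float* logits, size_t vocab_size) {
  auto ok = c.allowed(vocab_size);
  for (size_t i = 0; i < vocab_size; ++i)
    if (!ok[i]) logits[i] = -std::numeric_limits<float>::infinity();
}

// Stand-in for any sampler; greedy here for determinism.
int argmax(const float* logits, size_t n) {
  size_t best = 0;
  for (size_t i = 1; i < n; ++i)
    if (logits[i] > logits[best]) best = i;
  return static_cast<int>(best);
}
```

Even if the raw model strongly prefers the out-of-grammar token 3 ('x') at
every step, masking makes it unsamplable, and advance() after each sampled
token keeps the state machine in sync with the output so far.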

Why this matters now for on-device LLMs

  1. Tool use / function calling. Modern instruction-tuned models
    (Llama 3.1+, Gemma 3+, Qwen 3, Phi-4) emit structured tool-call
    payloads. Without constrained decoding, ~5–15% of tool calls are
    syntactically invalid even on cloud-class models — much worse on
    small on-device models.
  2. Structured output for app integration. Mobile / edge developers
    building features around LLM output (form filling, data extraction,
    UI generation) need schema-conforming output to avoid client-side
    parser failures.
  3. Agent frameworks (LangChain, LlamaIndex, Anthropic Agent SDK)
    assume the underlying runtime can produce structured output reliably.

Where ExecuTorch sits today

Runtime                    Constrained decoding
vLLM                       ✅
sglang                     ✅
llama.cpp                  ✅
MLX-LM                     ✅
Google LiteRT-LM           ✅
MediaPipe LLM Inference    ✅
ExecuTorch                 ❌

ExecuTorch is currently the only major on-device LLM runtime without it.
For developers picking an on-device framework for agent / tool-use
workloads, this is a hard differentiator.


2. Use cases this would enable

2.1 JSON-schema-constrained tool calls

GenerationConfig config;
config.constraint = make_json_schema_constraint(R"({
  "type": "object",
  "properties": {
    "name": {"type": "string", "enum": ["get_weather", "send_message"]},
    "arguments": {"type": "object"}
  },
  "required": ["name", "arguments"]
})");
runner.generate(prompt, config, ...);
// Output guaranteed to be valid JSON matching the schema

2.2 Regex-constrained extraction

config.constraint = make_regex_constraint(R"(\d{3}-\d{3}-\d{4})");

2.3 EBNF / CFG for DSLs

For SQL fragments, custom action languages, or anything else with a
formal grammar.

2.4 Composability

Constrained decoding should compose with all existing sampler modes
(greedy, top-p, top-k, temperature) and with multimodal input: text
output is constrained regardless of what modalities come in.
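Composition with samplers is essentially free because masking happens on
logits before the sampler runs: a token masked to -inf has exactly zero
probability under any temperature (exp(-inf) == 0), so it can never survive
top-k/top-p truncation either. A small sketch to make that concrete
(`softmax_with_temperature` is illustrative, not ExecuTorch's sampler):

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Temperature-scaled softmax. A token masked to -inf gets exactly
// probability 0 at every temperature, since exp(-inf) == 0; this is
// why logit masking composes with any downstream sampling strategy.
std::vector<float> softmax_with_temperature(std::vector<float> logits,
                                            float temp) {
  float max_logit = -std::numeric_limits<float>::infinity();
  for (float l : logits) max_logit = std::max(max_logit, l);
  std::vector<float> probs(logits.size());
  float sum = 0.f;
  for (size_t i = 0; i < logits.size(); ++i) {
    probs[i] = std::exp((logits[i] - max_logit) / temp);
    sum += probs[i];
  }
  for (float& p : probs) p /= sum;
  return probs;
}
```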


3. What the integration would look like (sketch only)

Just enough to make the discussion concrete — no API or backend
commitments here.

A new module

extension/llm/grammar/
├── constraint.h               # opaque per-sequence state machine
├── constrained_decoder.h      # mask_logits() + advance() + can_finish()
├── <backend>_constraint.cpp   # backend-specific impl (TBD)
└── tokenizer_adapter.h        # bridge to pytorch/tokenizers

One field added to GenerationConfig

struct GenerationConfig {
  // ... existing ...
  std::shared_ptr<Constraint> constraint;  // null = unconstrained
};

Two hook points in the runner decode loop

auto logits = decoder_.forward(prev_token);
if (constraint) cdec.mask_logits(logits.data(), vocab_size);   // ← (1)
auto next = sampler_.sample(logits.data());
if (constraint) cdec.advance(next);                            // ← (2)

One small extension to the tokenizer interface

The grammar engine needs the full vocabulary as byte sequences — likely
a single new method on the Tokenizer interface in pytorch/tokenizers.
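To illustrate why byte-level vocabulary access is needed, here is a
hypothetical sketch (the `VocabBytes` alias and `digit_only_tokens` helper
are invented for this example, not the pytorch/tokenizers API): given every
token id's raw bytes, the grammar engine can precompute, once at load time,
which token ids are admissible from a given grammar state, e.g. all tokens
made purely of ASCII digits for a `\d+` regex state.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical result of the proposed tokenizer method: one byte
// string per token id (index == token id).
using VocabBytes = std::vector<std::string>;

// Example precomputation a grammar engine could run at load time:
// find every token whose bytes are purely ASCII digits, i.e. the
// tokens admissible while matching a regex state like \d+.
std::vector<int32_t> digit_only_tokens(const VocabBytes& vocab) {
  std::vector<int32_t> out;
  for (int32_t id = 0; id < static_cast<int32_t>(vocab.size()); ++id) {
    const std::string& bytes = vocab[id];
    if (bytes.empty()) continue;
    bool all_digits = true;
    for (char ch : bytes)
      if (ch < '0' || ch > '9') { all_digits = false; break; }
    if (all_digits) out.push_back(id);
  }
  return out;
}
```

With per-state token sets precomputed like this, the decode-time
mask_logits() hook reduces to a cheap table lookup rather than re-matching
bytes against the grammar on every step.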

That's the entire surface area. The bulk of the work is whichever
grammar engine we pick — explicitly out of scope for this RFC.


Open questions for discussion

  1. Should we build this? Yes/no/wait.
  2. If yes, what priority? P0 (this quarter), P1 (next quarter),
    eventual.
  3. Who needs it most? Tool-use developers, structured-output
    integrators, on-device agent frameworks — please chime in with
    concrete use cases.
  4. Any blockers I'm missing? Licensing, model-export coupling, NPU
    path concerns, etc.

A follow-up RFC will cover backend choice and detailed API design once
we're aligned that this is worth doing.


cc @larryliu0820 @mergennachin @cccclai @helunwencser @jackzhxng
