fix(flows): make paper_flow produce a valid Paper under strict structured output by savycompany · Pull Request #79 · LLMQuant/quant-mind

savycompany · 2026-06-05T22:06:00Z

Description

paper_flow currently cannot complete an end-to-end extraction against the OpenAI Agents SDK, because it sets the agent's output_type to the canonical knowledge.Paper — a store schema that is hostile to structured output:

Paper.nodes is a dict[UUID, TreeNode]. A dict serialises to additionalProperties, which the SDK's strict JSON-schema mode rejects:
UserError: additionalProperties should not be set for object types.
With strict mode off, models don't emit RFC-4122 UUIDs, so the UUID id fields fail Pydantic validation on free-form ids like "intro_node":
Input should be a valid UUID ... 'root'.

Either way the default flow raises before returning — no real paper extracts. This is the migration gap flagged in the README note (#71).
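Both failure modes can be reproduced without the SDK. The following is a minimal, stdlib-only sketch: `PAPER_LIKE_SCHEMA` is a hand-written approximation of the schema a mapping-typed field produces (a real schema generator may emit more keys), and `open_object_paths` and `parses_as_uuid` are hypothetical helpers, not part of quant-mind or the Agents SDK.

```python
import uuid

# Hand-written approximation (an assumption) of the JSON schema produced by a
# model with a mapping-typed "nodes" field. A map of arbitrary keys is
# expressed with a schema-valued "additionalProperties".
PAPER_LIKE_SCHEMA = {
    "type": "object",
    "properties": {
        "title": {"type": "string"},
        "nodes": {
            "type": "object",
            "additionalProperties": {
                "type": "object",
                "properties": {"title": {"type": "string"}},
                "additionalProperties": False,
            },
        },
    },
    "additionalProperties": False,
}


def open_object_paths(schema, path="$"):
    """Return paths of object schemas whose additionalProperties is set to
    anything other than False (the shape a strict mode would reject)."""
    found = []
    if isinstance(schema, dict):
        is_object = schema.get("type") == "object"
        if is_object and schema.get("additionalProperties", False) is not False:
            found.append(path)
        for key, value in schema.items():
            found.extend(open_object_paths(value, f"{path}.{key}"))
    elif isinstance(schema, list):
        for index, value in enumerate(schema):
            found.extend(open_object_paths(value, f"{path}[{index}]"))
    return found


def parses_as_uuid(value):
    """True if the string is accepted by uuid.UUID, False otherwise."""
    try:
        uuid.UUID(value)
    except ValueError:
        return False
    return True


offending = open_object_paths(PAPER_LIKE_SCHEMA)  # the "nodes" map
free_form_ok = parses_as_uuid("intro_node")       # a model-invented id
```

The first check flags only the map-typed field, and the second shows why a free-form id cannot satisfy a UUID-typed field even when strict mode is off.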

Approach

Keep the canonical store schema untouched — its UUID identity + flat dict map are load-bearing for dedup/indexing per docs/design/en/storage.md — and fix the issue at the apex (flows) layer, which is where the LLM boundary actually lives (and the only layer import-linter allows to import knowledge):

quantmind/flows/_paper_draft.py (new): PaperDraft / PaperDraftNode, a strict-schema-safe extraction target. Children are embedded (a real nested tree) rather than referenced by id, so the model never maintains a flat id map — eliminating dangling-reference / duplicate-id / missing-root failure modes. draft_to_paper() lifts a validated draft into a canonical Paper: assigns real UUIDs, wires parent_id / children_ids / position, and injects provenance (source, arxiv_id, authors, publication-date as_of) the flow already knows from the fetch layer instead of asking the model to author it.
quantmind/flows/paper.py: targets PaperDraft by default and lifts the result into a Paper. A caller-supplied output_type still bypasses the draft and is returned verbatim (isinstance(result, Paper) pass-through), preserving the existing override contract. The arxiv branch of _fetch_and_format now propagates published_at so as_of is the real cutoff.

As a side effect this also fixes empty Paper.arxiv_id / Paper.authors: previously the model was asked to author provenance and routinely left it blank.
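The lifting step described above could look roughly like the sketch below. It is not the PR's code: `DraftNode`, `FlatNode`, `FlatPaper`, and `draft_to_paper_sketch` are hypothetical stand-ins built on stdlib dataclasses, and every field name other than `parent_id`, `children_ids`, `position`, `arxiv_id`, and `authors` is an assumption.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from uuid import UUID, uuid4


@dataclass
class DraftNode:
    """Stand-in for PaperDraftNode: children are embedded, no ids at all."""
    title: str
    children: list[DraftNode] = field(default_factory=list)


@dataclass
class FlatNode:
    """Stand-in for the canonical tree node: UUID identity plus links."""
    id: UUID
    title: str
    parent_id: UUID | None
    position: int
    children_ids: list[UUID] = field(default_factory=list)


@dataclass
class FlatPaper:
    """Stand-in for the canonical Paper: flat UUID-keyed map plus provenance."""
    root_id: UUID
    nodes: dict[UUID, FlatNode]
    arxiv_id: str | None
    authors: list[str]


def draft_to_paper_sketch(draft_root, *, arxiv_id=None, authors=()):
    """Flatten a nested draft into a UUID-keyed map.

    Ids are assigned here, never by the model, and provenance comes from the
    caller (the fetch layer in the PR's description), not from the draft.
    """
    nodes: dict[UUID, FlatNode] = {}

    def visit(draft, parent_id, position):
        node = FlatNode(
            id=uuid4(), title=draft.title, parent_id=parent_id, position=position
        )
        nodes[node.id] = node
        for index, child in enumerate(draft.children):
            node.children_ids.append(visit(child, node.id, index))
        return node.id

    root_id = visit(draft_root, None, 0)
    return FlatPaper(
        root_id=root_id, nodes=nodes, arxiv_id=arxiv_id, authors=list(authors)
    )


# Usage with placeholder values (not real extraction output).
draft = DraftNode("Paper", [DraftNode("Introduction"), DraftNode("Method")])
paper = draft_to_paper_sketch(draft, arxiv_id="0000.00000", authors=["A. Author"])
```

Because every id is minted during the walk, a dangling reference or duplicate id cannot arise from model output, and there is exactly one root by construction.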

Verification

Full golden harness green: ruff format / ruff check / basedpyright / lint-imports / pytest — 248 passed, 90% coverage (15 new unit tests for the draft schema, converter, provenance injection, and the flow integration path).
End-to-end against the live Agents SDK (gpt-4o-mini, default path, no strict-schema overrides): a real arXiv paper (2605.20636) extracts into a 19-node Paper with UUID ids and a correct section/subsection tree, then survives a JSON store round-trip.
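For readers unfamiliar with what a JSON round-trip involves for a UUID-keyed map, here is a minimal sketch under stated assumptions: it uses only the standard library, and `dumps_nodes` / `loads_nodes` are hypothetical helpers that do not reflect how the quant-mind store actually serialises a Paper. JSON object keys must be strings, so UUIDs are stringified on the way out and parsed on the way back.

```python
from __future__ import annotations

import json
from uuid import UUID, uuid4


def _uuid_to_json(value):
    """Serialise an optional UUID as a string or None."""
    return None if value is None else str(value)


def _uuid_from_json(value):
    """Parse an optional UUID string back into a UUID or None."""
    return None if value is None else UUID(value)


def dumps_nodes(nodes: dict[UUID, dict]) -> str:
    """Encode a UUID-keyed node map; keys and parent_id become strings."""
    payload = {
        str(node_id): {
            "title": node["title"],
            "parent_id": _uuid_to_json(node["parent_id"]),
        }
        for node_id, node in nodes.items()
    }
    return json.dumps(payload)


def loads_nodes(text: str) -> dict[UUID, dict]:
    """Decode the payload produced by dumps_nodes back into UUID-typed data."""
    raw = json.loads(text)
    return {
        UUID(node_id): {
            "title": node["title"],
            "parent_id": _uuid_from_json(node["parent_id"]),
        }
        for node_id, node in raw.items()
    }


# Placeholder two-node tree (not real extraction output).
root_id, child_id = uuid4(), uuid4()
nodes = {
    root_id: {"title": "Paper", "parent_id": None},
    child_id: {"title": "Introduction", "parent_id": root_id},
}
restored = loads_nodes(dumps_nodes(nodes))
```

A round-trip check then amounts to comparing `restored` with `nodes` for equality.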

Checklist

The PR title starts with $CATEGORY(xx): xxx (fix(flows): ...)
Related issue is referenced in this PR: relates to the SDK migration (Tracking: Migrate Agent layer to OpenAI Agents SDK #71); happy to link a dedicated issue if preferred
The markdown and latex are rendered correctly.
The code in PR is well-documented.
The PR is complete and small (4 files, +427/-11)

🤖 Generated with Claude Code

paper_flow set the agent's output_type to the canonical knowledge.Paper, whose store schema is hostile to OpenAI structured output:

- Paper.nodes is a dict[UUID, TreeNode]; a dict serialises to additionalProperties, which the Agents SDK strict-JSON-schema mode rejects ("additionalProperties should not be set for object types").
- With strict mode off, models do not emit RFC-4122 UUIDs, so the UUID id fields fail Pydantic validation on free-form ids like "intro_node".

Either way the default flow raised before returning, so no real extraction completed end-to-end.

Introduce a strict-schema-safe LLM target in the apex layer and keep the canonical store schema untouched (UUID identity + dict map are load-bearing for dedup/indexing per the storage design doc):

- quantmind/flows/_paper_draft.py: PaperDraft/PaperDraftNode (nested children, no id bookkeeping) + draft_to_paper(), which assigns real UUIDs, wires parent_id/children_ids, and injects provenance (source, arxiv_id, authors, published-date as_of) the flow already knows from the fetch layer instead of trusting the model to author it.
- paper_flow now targets PaperDraft by default and lifts the result into a Paper. A caller-supplied output_type still bypasses the draft and is returned verbatim (isinstance(result, Paper) pass-through), so the existing override contract is preserved.

As a side effect this fixes empty Paper.arxiv_id/authors: previously the model was asked to author provenance and routinely left it blank.

Verified end-to-end against the live Agents SDK (gpt-4o-mini): a real arXiv paper now extracts into a 19-node Paper with UUID ids and survives a JSON store round-trip. Full verify harness green (248 tests, 90% cov).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

savycompany wants to merge 1 commit into LLMQuant:master from savycompany:fix/paper-flow-strict-schema

2 participants