fix(flows): make paper_flow produce a valid Paper under strict structured output #79

Open
savycompany wants to merge 1 commit into LLMQuant:master from savycompany:fix/paper-flow-strict-schema

@savycompany

Description

paper_flow currently cannot complete an end-to-end extraction against the OpenAI Agents SDK, because it sets the agent's output_type to the canonical knowledge.Paper — a store schema that is hostile to structured output:

  • Paper.nodes is a dict[UUID, TreeNode]. A dict serialises to additionalProperties, which the SDK's strict JSON-schema mode rejects:
    UserError: additionalProperties should not be set for object types.
  • With strict mode off, models don't emit RFC-4122 UUIDs, so the UUID id fields fail Pydantic validation on free-form ids like "intro_node":
    Input should be a valid UUID ... 'root'.
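
Both failure modes can be reproduced in miniature with the standard library alone. The sketch below is illustrative and is not the SDK's or Pydantic's code: `NODES_SCHEMA` is a hand-written approximation of the schema fragment a dict-typed field produces (the real generator's output may differ), and `open_object_paths` and `is_uuid` are hypothetical helper names.

```python
import uuid

# Hand-written approximation of the JSON-schema fragment for a field such as
# `nodes: dict[UUID, TreeNode]`. An open-ended mapping has no fixed property
# names, so its value type ends up under "additionalProperties".
NODES_SCHEMA = {
    "type": "object",
    "additionalProperties": {
        "type": "object",
        "properties": {"title": {"type": "string"}},
    },
}


def open_object_paths(schema, path="$"):
    """Return the paths of object schemas whose additionalProperties is set."""
    found = []
    if isinstance(schema, dict):
        is_object = schema.get("type") == "object"
        if is_object and schema.get("additionalProperties", False) is not False:
            found.append(path)
        for key, value in schema.items():
            found.extend(open_object_paths(value, f"{path}.{key}"))
    elif isinstance(schema, list):
        for index, item in enumerate(schema):
            found.extend(open_object_paths(item, f"{path}[{index}]"))
    return found


def is_uuid(text):
    """Return True if `text` parses as a UUID string."""
    try:
        uuid.UUID(text)
    except ValueError:
        return False
    return True


print(open_object_paths(NODES_SCHEMA))        # the mapping itself is flagged
print(is_uuid("intro_node"), is_uuid("root"))  # free-form ids do not parse
```

A strict-mode check in this spirit flags the mapping at the top level, and free-form ids of the kind models tend to emit are rejected as UUIDs.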

Either way the default flow raises before returning, so no real paper extraction completes. This is the migration gap flagged in the README note (#71).

Approach

Keep the canonical store schema untouched — its UUID identity + flat dict map are load-bearing for dedup/indexing per docs/design/en/storage.md — and fix the issue at the apex (flows) layer, which is where the LLM boundary actually lives (and the only layer import-linter allows to import knowledge):

  • quantmind/flows/_paper_draft.py (new): PaperDraft / PaperDraftNode, a strict-schema-safe extraction target. Children are embedded (a real nested tree) rather than referenced by id, so the model never maintains a flat id map — eliminating dangling-reference / duplicate-id / missing-root failure modes. draft_to_paper() lifts a validated draft into a canonical Paper: assigns real UUIDs, wires parent_id / children_ids / position, and injects provenance (source, arxiv_id, authors, publication-date as_of) the flow already knows from the fetch layer instead of asking the model to author it.
  • quantmind/flows/paper.py: targets PaperDraft by default and lifts the result into a Paper. A caller-supplied output_type still bypasses the draft and is returned verbatim (isinstance(result, Paper) pass-through), preserving the existing override contract. The arxiv branch of _fetch_and_format now propagates published_at so as_of is the real cutoff.
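
The lifting step described above can be sketched as a single tree walk. This is a minimal illustration under stated assumptions, not the code in `_paper_draft.py`: `DraftNode`, `FlatNode`, and `flatten` are hypothetical stand-ins for `PaperDraftNode`, the canonical tree node, and `draft_to_paper()`, and every field other than `parent_id`, `children_ids`, and `position` is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional
from uuid import UUID, uuid4


@dataclass
class DraftNode:
    """Nested extraction target: children are embedded, so there are no ids."""
    title: str
    children: list["DraftNode"] = field(default_factory=list)


@dataclass
class FlatNode:
    """Store-side node: identity and links are explicit."""
    id: UUID
    title: str
    parent_id: Optional[UUID]
    position: int
    children_ids: list[UUID] = field(default_factory=list)


def flatten(root: DraftNode) -> tuple[UUID, dict[UUID, FlatNode]]:
    """Assign fresh UUIDs and wire parent/child links; returns (root_id, nodes)."""
    nodes: dict[UUID, FlatNode] = {}

    def visit(draft: DraftNode, parent_id: Optional[UUID], position: int) -> UUID:
        node = FlatNode(
            id=uuid4(), title=draft.title, parent_id=parent_id, position=position
        )
        nodes[node.id] = node
        # Children are visited in order, so position mirrors the draft layout.
        for index, child in enumerate(draft.children):
            node.children_ids.append(visit(child, node.id, index))
        return node.id

    return visit(root, None, 0), nodes


draft = DraftNode("Paper", [DraftNode("Intro"), DraftNode("Method", [DraftNode("Data")])])
root_id, nodes = flatten(draft)
```

Because every id is generated during the walk, a dangling reference, a duplicate id, or a missing root cannot occur by construction; the model only has to get the nesting right.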

As a side effect this also fixes empty Paper.arxiv_id / Paper.authors: previously the model was asked to author provenance and routinely left it blank.
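
One way to picture provenance injection together with the override pass-through is sketched below. All names here (`Provenance`, `Draft`, `Record`, `lift`) are hypothetical, and the branch condition is an assumption: the description mentions an `isinstance(result, Paper)` check, but the actual logic in `paper.py` is not shown in this PR text.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class Provenance:
    """Facts the fetch step already knows (hypothetical shape)."""
    source: str
    arxiv_id: Optional[str]
    authors: list[str]
    published_at: Optional[date]


@dataclass
class Draft:
    """Model-authored content only; there are no provenance fields to leave blank."""
    title: str


@dataclass
class Record:
    """Stand-in for the canonical Paper."""
    title: str
    source: str
    arxiv_id: Optional[str]
    authors: list[str]
    as_of: Optional[date]


def lift(result, provenance: Provenance):
    """Lift a Draft into a Record; return any other result unchanged."""
    if not isinstance(result, Draft):
        # Caller-supplied output type: hand it back verbatim.
        return result
    return Record(
        title=result.title,
        source=provenance.source,
        arxiv_id=provenance.arxiv_id,
        authors=list(provenance.authors),
        as_of=provenance.published_at,  # publication date, not extraction time
    )


# Placeholder values, not taken from a real paper.
prov = Provenance("arxiv", "0000.00000", ["A. Author"], date(2020, 1, 1))
record = lift(Draft("Example title"), prov)
custom = lift({"summary": "caller-defined"}, prov)
```

Since the model never sees the provenance fields, it cannot leave them empty; they are filled from the fetch result or not at all.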

Verification

  • Full golden harness green: ruff format / ruff check / basedpyright / lint-imports / pytest; 248 passed, 90% coverage (15 new unit tests for the draft schema, converter, provenance injection, and the flow integration path).
  • End-to-end against the live Agents SDK (gpt-4o-mini, default path, no strict-schema overrides): a real arXiv paper (2605.20636) extracts into a 19-node Paper with UUID ids and a correct section/subsection tree, then survives a JSON store round-trip.
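
For readers unfamiliar with why a UUID-keyed map needs care in a JSON round-trip: JSON object keys must be strings, so the keys are stringified on the way out and parsed back on the way in. The sketch below shows only that idea; `dump_nodes` and `load_nodes` are hypothetical helpers, and the project's real store format is not shown in this PR.

```python
import json
from uuid import UUID, uuid4


def dump_nodes(nodes: dict[UUID, dict]) -> str:
    """Serialise a UUID-keyed map; JSON object keys must be strings."""
    return json.dumps({str(key): value for key, value in nodes.items()}, sort_keys=True)


def load_nodes(text: str) -> dict[UUID, dict]:
    """Parse the map back, restoring UUID keys."""
    return {UUID(key): value for key, value in json.loads(text).items()}


# Two illustrative nodes; parent links are stored as strings inside the values.
root_id, child_id = uuid4(), uuid4()
original = {
    root_id: {"title": "Paper", "parent_id": None},
    child_id: {"title": "Intro", "parent_id": str(root_id)},
}
restored = load_nodes(dump_nodes(original))
```

A round-trip is lossless in this sketch when `restored == original`, which holds here because the values contain only JSON-native types.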

Checklist

  • The PR title starts with $CATEGORY(xx): xxx (fix(flows): ...)
  • Related issue is referred in this PR — relates to the SDK migration (Tracking: Migrate Agent layer to OpenAI Agents SDK #71); happy to link a dedicated issue if preferred
  • The markdown and latex are rendered correctly.
  • The code in PR is well-documented.
  • The PR is complete and small (4 files, +427/-11)

🤖 Generated with Claude Code

paper_flow set the agent's output_type to the canonical knowledge.Paper,
whose store schema is hostile to OpenAI structured output:

- Paper.nodes is a dict[UUID, TreeNode]; a dict serialises to
  additionalProperties, which the Agents SDK strict-JSON-schema mode
  rejects ("additionalProperties should not be set for object types").
- With strict mode off, models do not emit RFC-4122 UUIDs, so the UUID
  id fields fail Pydantic validation on free-form ids like "intro_node".

Either way the default flow raised before returning, so no real
extraction completed end-to-end.

Introduce a strict-schema-safe LLM target in the apex layer and keep the
canonical store schema untouched (UUID identity + dict map are
load-bearing for dedup/indexing per the storage design doc):

- quantmind/flows/_paper_draft.py: PaperDraft/PaperDraftNode (nested
  children, no id bookkeeping) + draft_to_paper(), which assigns real
  UUIDs, wires parent_id/children_ids, and injects provenance (source,
  arxiv_id, authors, published-date as_of) the flow already knows from
  the fetch layer instead of trusting the model to author it.
- paper_flow now targets PaperDraft by default and lifts the result into
  a Paper. A caller-supplied output_type still bypasses the draft and is
  returned verbatim (isinstance(result, Paper) pass-through), so the
  existing override contract is preserved.

As a side effect this fixes empty Paper.arxiv_id/authors: previously the
model was asked to author provenance and routinely left it blank.

Verified end-to-end against the live Agents SDK (gpt-4o-mini): a real
arXiv paper now extracts into a 19-node Paper with UUID ids and survives
a JSON store round-trip. Full verify harness green (248 tests, 90% cov).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>