science: sync structured systematization output #111
Pull request overview
This PR syncs ASSERT’s “systematization” stage and viewer to a newer science-side structured JSON output shape, adding validation scoring alongside the behavior spec and updating taxonomy conversion and UI to consume the new fields.
Changes:
- Update the systematization prompt + runtime to produce/validate a structured JSON artifact (including `concept_spec` patterns and `validation` scores) and write it directly.
- Update systematization→taxonomy conversion to consume the structured artifact (instead of Markdown + `summary_items`) and embed the JSON into the converter prompt.
- Update viewer pages/modal to display pattern counts and render systematization sections (including validation) from the new artifact shape.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| viewer/src/routes/suite/[suite_id]/+page.svelte | Switches suite header count from legacy summary_items to structured concept_spec.patterns and updates label text. |
| viewer/src/routes/suite/[suite_id]/[run_id]/+page.svelte | Improves results header layout responsiveness and uses axis labels for grouping controls. |
| viewer/src/lib/SystematizationModal.svelte | Updates modal to read structured patterns and validation, and renders a synthesized Markdown view of key sections. |
| tests/test_systematization_stage.py | Updates stage tests/fixtures for structured systematization artifacts and new validation requirements. |
| tests/test_systematization_convert_stage.py | Updates conversion tests/fixtures to reflect structured artifact ingestion and prompt expectations. |
| prompts/validation_criteria.md | Adds the validation rubric content to be embedded into the systematization prompt. |
| prompts/systematization_single.md | Extends the output contract to include validation + nested slot components and adds deep-research/validation requirements. |
| p2m/stages/systematization.py | Implements structured schema/models, prompt assembly with embedded validation rubric, and stricter structured validation. |
| p2m/stages/systematization_convert.py | Loads/validates structured systematization JSON and feeds it (minus meta) into taxonomy conversion. |
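
As a rough illustration of the "minus meta" step described for `p2m/stages/systematization_convert.py`, the sketch below drops the top-level `meta` key before serializing the artifact for the converter prompt. It is not the code from this PR: the helper name `systematization_for_prompt` and the sample artifact contents are hypothetical, and it assumes `meta` is a top-level key.

```python
import json


def systematization_for_prompt(artifact: dict) -> str:
    """Serialize the artifact for prompt embedding, leaving out the top-level "meta" key."""
    # Build a filtered copy so the caller's artifact is not mutated.
    payload = {key: value for key, value in artifact.items() if key != "meta"}
    return json.dumps(payload, indent=2, ensure_ascii=False)


# Hypothetical sample artifact; only the key names come from this PR.
artifact = {
    "behavior": "Harmful advice",
    "concept_spec": {"patterns": []},
    "validation": [],
    "meta": {"note": "hypothetical run metadata"},
}
prompt_json = systematization_for_prompt(artifact)
```

Filtering into a copy keeps the on-disk artifact and the in-memory dict intact, so the viewer can still read `meta` from the same object.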

    -SYSTEMATIZATION_PROMPT = load_prompt_text("systematization_single.md")
    +VALIDATION_CRITERIA_PROMPT = load_prompt_text("validation_criteria.md")
    +SYSTEMATIZATION_PROMPT = load_prompt_text("systematization_single.md").replace(
    +    "{validation_criteria}", VALIDATION_CRITERIA_PROMPT

Fixed in the latest push by replacing the exact double-braced `validation_criteria` token and updating the prompt assertion. Re-ran the targeted systematization/viewer tests (14 passed) and `git diff --check`.
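
The token substitution described in this reply can be sketched roughly as follows. This is a minimal sketch, not the code in `p2m/stages/systematization.py`: the helper `embed_validation_criteria` and the sample template and rubric text are hypothetical, and it assumes the prompt file spells the placeholder with literal double braces.

```python
TOKEN = "{{validation_criteria}}"  # assumed literal placeholder text in the prompt file


def embed_validation_criteria(template: str, criteria: str) -> str:
    """Replace the exact double-braced token with the rubric text (hypothetical helper)."""
    if TOKEN not in template:
        # Fail loudly rather than shipping a prompt with no rubric embedded.
        raise ValueError(f"prompt template is missing the {TOKEN} placeholder")
    return template.replace(TOKEN, criteria)


# Hypothetical template and rubric lines, used only to exercise the helper.
template = "Score the output using:\n{{validation_criteria}}\nReturn JSON."
prompt = embed_validation_criteria(template, "- clarity\n- coverage")
```

Replacing only the inner single-braced form would leave a stray pair of braces around the rubric, which is why the sketch matches the whole double-braced token.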

    for index, lens in enumerate(parsed.stakeholder_lenses):
        _require_nonempty(lens.label, f"stakeholder_lenses[{index}].label")
        _require_nonempty(lens.expertise, f"stakeholder_lenses[{index}].expertise")
    if not parsed.validation:
        raise ValueError("systematization validation must include at least one item")
    for index, item in enumerate(parsed.validation):

Fixed in the latest push. Validation now requires exactly the six utility attributes, checks them case-insensitively, and rejects missing, duplicate, or unexpected attributes. Added focused missing/duplicate tests and re-ran the systematization/viewer targeted tests.
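
One way to express the "exactly the expected attributes, case-insensitive, no missing/duplicate/unexpected" rule from this reply is sketched below. It is an illustration under assumptions, not the PR's implementation: the function name, the `attribute` field name, and the six placeholder attribute names are all hypothetical, since the real attribute list is not shown in this thread.

```python
from collections import Counter


def check_validation_attributes(items: list[dict], expected: set[str]) -> None:
    """Require exactly one validation item per expected attribute, ignoring case."""
    wanted = {name.casefold() for name in expected}
    # "attribute" is an assumed field name for this sketch.
    seen = Counter(str(item.get("attribute", "")).strip().casefold() for item in items)

    missing = sorted(wanted - set(seen))
    unexpected = sorted(set(seen) - wanted)
    duplicates = sorted(name for name, count in seen.items() if count > 1)

    problems = []
    if missing:
        problems.append(f"missing: {missing}")
    if duplicates:
        problems.append(f"duplicate: {duplicates}")
    if unexpected:
        problems.append(f"unexpected: {unexpected}")
    if problems:
        raise ValueError("invalid validation attributes (" + "; ".join(problems) + ")")


# Placeholder names only; the real six utility attributes are not listed in this thread.
EXPECTED = {"alpha", "beta", "gamma", "delta", "epsilon", "zeta"}
items = [{"attribute": name.upper(), "score": 3} for name in sorted(EXPECTED)]
check_validation_attributes(items, EXPECTED)  # accepted: case differences are ignored
```

Collecting all three kinds of problem before raising gives a single error message that names every offending attribute, which tends to be easier to act on than failing at the first mismatch.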

    function systematizationPatternCount(systematization: Record<string, unknown> | null | undefined): number {
      const conceptSpec = systematization?.concept_spec;
      if (!conceptSpec || typeof conceptSpec !== 'object' || Array.isArray(conceptSpec)) return 0;
      const patterns = (conceptSpec as Record<string, unknown>).patterns;
      return Array.isArray(patterns) ? patterns.length : 0;
    }

    let summaryItemCount = $derived(systematizationPatternCount(data.systematization));

Fixed in the latest push by renaming the derived value to patternCount, matching the structured concept_spec.patterns source used by the suite header.

                     model_cfg=ModelConfig(name="azure/gpt-5.4"),
                 )

    -    async def test_run_systematization_to_taxonomy_rejects_missing_systematization(self) -> None:
    +    async def test_run_systematization_to_taxonomy_rejects_missing_systematization_field(self) -> None:
             with TemporaryDirectory() as tmp_dir:
                 tmp_path = Path(tmp_dir)
                 systematization_path = tmp_path / "systematization.json"
    -            systematization_path.write_text(
    -                json.dumps(
    -                    {
    -                        "behavior": "Harmful advice",
    -                        "summary_items": [],
    -                    }
    -                ),
    -                encoding="utf-8",
    -            )
    +            payload = _structured_systematization()
    +            del payload["validation"]
    +            systematization_path.write_text(json.dumps(payload), encoding="utf-8")

Fixed in the latest push by renaming the test to test_run_systematization_to_taxonomy_rejects_missing_validation, since the fixture intentionally deletes the validation field.
296520b to 309f361
Summary
Science-side comparison
Recent Omni `omni/measurements` changes since May 16 were:
- `e5025090` systematization: literature-grounded retrieval + validation criteria + structured output shape
- `36813a2f` viewer split of policy violation by permissibility
- `cb5b2874` wording clarification for policy violation on permissible requests

The viewer/permissibility split is already covered separately by #94. This PR ports the applicable systematization delta. It intentionally does not copy Omni wholesale because Omni's branch has different concept/policy naming and science-only repo assumptions.
Validation
- `/home/jakepresent/git/adaptive-eval-ms-import/.venv/bin/python -m pytest tests/test_systematization_stage.py tests/test_systematization_convert_stage.py tests/test_viewer_run_page_server.py -q`
- `npm --prefix viewer run check` reports only pre-existing unrelated diagnostics in `PrimerDropdown.svelte`, `PrimerPagination.svelte`, `routes/+page.svelte`, and `routes/new/+page.svelte`; no diagnostics in touched viewer files.
- `git diff --check`