science: sync structured systematization output by jakepresent · Pull Request #111 · responsibleai/ASSERT

jakepresent · 2026-05-27T16:15:38Z

Summary

Syncs the latest science-side systematization shape from Omni into ASSERT while preserving ASSERT's current behavior/taxonomy terminology.
Adds the validation-criteria rubric to the systematization prompt and requires the model to return structured validation scores alongside the behavior spec.
Stores the structured systematization JSON directly, feeds that structured artifact into taxonomy conversion, and updates the viewer systematization modal to read source patterns / validation from the new artifact shape.

Science-side comparison

Recent Omni omni/measurements changes since May 16 were:

e5025090 systematization: literature-grounded retrieval + validation criteria + structured output shape
36813a2f viewer split of policy violation by permissibility
cb5b2874 wording clarification for policy violation on permissible requests

The viewer/permissibility split is already covered separately by #94. This PR ports the applicable systematization delta. It intentionally does not copy Omni wholesale because Omni's branch has different concept/policy naming and science-only repo assumptions.

Validation

/home/jakepresent/git/adaptive-eval-ms-import/.venv/bin/python -m pytest tests/test_systematization_stage.py tests/test_systematization_convert_stage.py tests/test_viewer_run_page_server.py -q
npm --prefix viewer run check reports only pre-existing unrelated diagnostics in PrimerDropdown.svelte, PrimerPagination.svelte, routes/+page.svelte, and routes/new/+page.svelte; no diagnostics in touched viewer files.
git diff --check

Copilot

Pull request overview

This PR syncs ASSERT’s “systematization” stage and viewer to a newer science-side structured JSON output shape, adding validation scoring alongside the behavior spec and updating taxonomy conversion and UI to consume the new fields.

Changes:

Update the systematization prompt + runtime to produce/validate a structured JSON artifact (including concept_spec patterns and validation scores) and write it directly.
Update systematization→taxonomy conversion to consume the structured artifact (instead of Markdown + summary_items) and embed the JSON into the converter prompt.
Update viewer pages/modal to display pattern counts and render systematization sections (including validation) from the new artifact shape.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| viewer/src/routes/suite/[suite_id]/+page.svelte | Switches suite header count from legacy `summary_items` to structured `concept_spec.patterns` and updates label text. |
| viewer/src/routes/suite/[suite_id]/[run_id]/+page.svelte | Improves results header layout responsiveness and uses axis labels for grouping controls. |
| viewer/src/lib/SystematizationModal.svelte | Updates modal to read structured patterns and validation, and renders a synthesized Markdown view of key sections. |
| tests/test_systematization_stage.py | Updates stage tests/fixtures for structured systematization artifacts and new validation requirements. |
| tests/test_systematization_convert_stage.py | Updates conversion tests/fixtures to reflect structured artifact ingestion and prompt expectations. |
| prompts/validation_criteria.md | Adds the validation rubric content to be embedded into the systematization prompt. |
| prompts/systematization_single.md | Extends the output contract to include validation + nested slot components and adds deep-research/validation requirements. |
| p2m/stages/systematization.py | Implements structured schema/models, prompt assembly with embedded validation rubric, and stricter structured validation. |
| p2m/stages/systematization_convert.py | Loads/validates structured systematization JSON and feeds it (minus `meta`) into taxonomy conversion. |
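
For readers skimming the file summary, here is a rough sketch of what "feeds it (minus `meta`) into taxonomy conversion" could look like. It assumes the artifact is a JSON object with a top-level `meta` key. The function name `build_converter_prompt`, the `{systematization_json}` placeholder, and the shape of the `validation` items are hypothetical and not taken from the PR.

```python
import json
from typing import Any


def build_converter_prompt(template: str, artifact: dict[str, Any]) -> str:
    """Embed a structured artifact, without its `meta` block, into a prompt template."""
    # Drop run metadata so only the behavior spec and validation reach the model.
    payload = {key: value for key, value in artifact.items() if key != "meta"}
    rendered = json.dumps(payload, indent=2, ensure_ascii=False)
    # str.replace is used instead of str.format so literal JSON braces
    # elsewhere in the template are not treated as format fields.
    return template.replace("{systematization_json}", rendered)


# Hypothetical artifact; field contents are placeholders.
artifact = {
    "meta": {"model": "example-model"},
    "concept_spec": {"patterns": [{"label": "example pattern"}]},
    "validation": [{"attribute": "example attribute", "score": 3}],
}
prompt = build_converter_prompt("Convert this:\n{systematization_json}", artifact)
```

Filtering into a new dict rather than calling `del artifact["meta"]` leaves the loaded artifact untouched, which matters if the same object is reused after the prompt is built.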

jakepresent · 2026-05-28T19:06:02Z

-SYSTEMATIZATION_PROMPT = load_prompt_text("systematization_single.md")
+VALIDATION_CRITERIA_PROMPT = load_prompt_text("validation_criteria.md")
+SYSTEMATIZATION_PROMPT = load_prompt_text("systematization_single.md").replace(
+    "{validation_criteria}", VALIDATION_CRITERIA_PROMPT


Fixed in the latest push by replacing the exact double-braced `{{validation_criteria}}` token and updating the prompt assertion. Re-ran the targeted systematization/viewer tests (14 passed) plus git diff --check.
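
The original review comment is not reproduced in this thread, so the following is only a guess at the failure mode behind this fix. It assumes the prompt file contains a double-braced token while the reviewed code replaced the single-braced form. `TEMPLATE`, `RUBRIC`, and `render_prompt` are illustrative names, not the PR's code.

```python
TEMPLATE = "Score the spec using this rubric:\n{{validation_criteria}}\n"
RUBRIC = "1. Example criterion"

# The single-braced token matches the inner part of the double-braced one,
# so the outer braces are left behind around the rubric text.
single = TEMPLATE.replace("{validation_criteria}", RUBRIC)

# Replacing the exact double-braced token substitutes cleanly.
double = TEMPLATE.replace("{{validation_criteria}}", RUBRIC)


def render_prompt(template: str, rubric: str, token: str = "{{validation_criteria}}") -> str:
    """Substitute the rubric, failing loudly if the token is absent."""
    if token not in template:
        raise ValueError(f"prompt template is missing the {token} placeholder")
    return template.replace(token, rubric)
```

Checking for the token before replacing turns a silently unrendered prompt into an import-time or test-time error.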

jakepresent · 2026-05-28T19:06:04Z

+    for index, lens in enumerate(parsed.stakeholder_lenses):
+        _require_nonempty(lens.label, f"stakeholder_lenses[{index}].label")
+        _require_nonempty(lens.expertise, f"stakeholder_lenses[{index}].expertise")
+    if not parsed.validation:
+        raise ValueError("systematization validation must include at least one item")
+    for index, item in enumerate(parsed.validation):


Fixed in the latest push. Validation now requires exactly the six utility attributes, checks them case-insensitively, and rejects missing, duplicate, or unexpected attributes. Added focused missing/duplicate tests and re-ran the systematization/viewer targeted tests.
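
A minimal sketch of the rule described above (exactly the expected attributes, compared case-insensitively, rejecting missing, duplicate, or unexpected ones). The six attribute names below are placeholders because the thread does not list the real ones, and `check_validation_attributes` is a hypothetical helper, not the function in `p2m/stages/systematization.py`.

```python
from collections import Counter
from typing import Iterable

# Placeholder names; the real six utility attributes are not shown in this thread.
EXPECTED_ATTRIBUTES = frozenset(
    {"attribute_a", "attribute_b", "attribute_c", "attribute_d", "attribute_e", "attribute_f"}
)


def check_validation_attributes(
    attributes: Iterable[str], expected: frozenset = EXPECTED_ATTRIBUTES
) -> None:
    """Raise ValueError unless `attributes` matches `expected` exactly, ignoring case."""
    counts = Counter(name.strip().casefold() for name in attributes)
    seen = set(counts)
    wanted = {name.casefold() for name in expected}

    problems = []
    missing = sorted(wanted - seen)
    if missing:
        problems.append(f"missing: {missing}")
    duplicates = sorted(name for name, count in counts.items() if count > 1)
    if duplicates:
        problems.append(f"duplicate: {duplicates}")
    unexpected = sorted(seen - wanted)
    if unexpected:
        problems.append(f"unexpected: {unexpected}")

    if problems:
        raise ValueError("invalid systematization validation attributes; " + "; ".join(problems))
```

Collecting all three problem kinds before raising gives one error message per artifact instead of one per rerun.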

jakepresent · 2026-05-28T19:06:07Z

+	function systematizationPatternCount(systematization: Record<string, unknown> | null | undefined): number {
+		const conceptSpec = systematization?.concept_spec;
+		if (!conceptSpec || typeof conceptSpec !== 'object' || Array.isArray(conceptSpec)) return 0;
+		const patterns = (conceptSpec as Record<string, unknown>).patterns;
+		return Array.isArray(patterns) ? patterns.length : 0;
+	}
+
+	let summaryItemCount = $derived(systematizationPatternCount(data.systematization));


Fixed in the latest push by renaming the derived value to patternCount, matching the structured concept_spec.patterns source used by the suite header.

jakepresent · 2026-05-28T19:06:09Z

                    model_cfg=ModelConfig(name="azure/gpt-5.4"),
                )

-    async def test_run_systematization_to_taxonomy_rejects_missing_systematization(self) -> None:
+    async def test_run_systematization_to_taxonomy_rejects_missing_systematization_field(self) -> None:
        with TemporaryDirectory() as tmp_dir:
            tmp_path = Path(tmp_dir)
            systematization_path = tmp_path / "systematization.json"
-            systematization_path.write_text(
-                json.dumps(
-                    {
-                        "behavior": "Harmful advice",
-                        "summary_items": [],
-                    }
-                ),
-                encoding="utf-8",
-            )
+            payload = _structured_systematization()
+            del payload["validation"]
+            systematization_path.write_text(json.dumps(payload), encoding="utf-8")


Fixed in the latest push by renaming the test to test_run_systematization_to_taxonomy_rejects_missing_validation, since the fixture intentionally deletes the validation field.
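
To illustrate the naming point, here is a small self-contained sketch of a test whose name states the field its fixture removes. The required-field list, `_structured_fixture`, and `load_structured` are stand-ins invented for this example. The real test in `tests/test_systematization_convert_stage.py` is async and goes through the conversion stage.

```python
import unittest

# Hypothetical required fields; the actual schema lives in the PR's stage code.
REQUIRED_FIELDS = ("concept_spec", "validation")


def _structured_fixture() -> dict:
    """Return a fresh, minimal stand-in for a structured systematization artifact."""
    return {"concept_spec": {"patterns": []}, "validation": [{"attribute": "example"}]}


def load_structured(payload: dict) -> dict:
    """Reject payloads that lack any required top-level field."""
    for field in REQUIRED_FIELDS:
        if field not in payload:
            raise ValueError(f"systematization is missing required field: {field}")
    return payload


class RejectsMissingValidationTest(unittest.TestCase):
    def test_rejects_missing_validation(self) -> None:
        payload = _structured_fixture()
        del payload["validation"]  # the test name mirrors this deletion
        with self.assertRaisesRegex(ValueError, "validation"):
            load_structured(payload)
```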

Copilot AI review requested due to automatic review settings May 28, 2026 18:09

Copilot started reviewing on behalf of jakepresent May 28, 2026 18:09

Copilot AI reviewed May 28, 2026

This was referenced May 28, 2026:

fix(viewer): show scenario run model labels #120 (Merged)
feat(test-set): add configurable sampling methods #121 (Draft)

science: sync structured systematization output

309f361

jakepresent force-pushed the jake/science-sync-systematization branch from 296520b to 309f361 on May 29, 2026 22:03

tangym approved these changes May 29, 2026

jakepresent wants to merge 1 commit into main from jake/science-sync-systematization
