[codex] Refactor benchmark execution orchestration by glennmatlin · Pull Request #53 · eilab-gt/SuspensePerception

glennmatlin · 2026-04-21T21:41:27Z

Summary

Refactor the Thriller CLI path into reusable experiment preparation and run_config execution helpers.
Add a paper-style benchmark runner with planning, resume checks, manifests, preflight API support, and result-quality failure detection.
Update API/parsing behavior and optional adversarial dependencies so local tests and smoke execution do not require the full NLP stack unless those augmentations are used.
Expand tests around runner planning, output safety, parsing behavior, prompt generation, and API response extraction.
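
For reviewers, one common way to make the adversarial dependencies optional is to defer the import until an augmentation is actually requested. The sketch below is illustrative only and is not taken from this PR's diff: `require_optional`, `augment`, and the `"shorten"` augmentation are hypothetical names, and the stdlib `textwrap` module stands in for a heavy NLP package.

```python
import importlib
from types import ModuleType
from typing import Optional


def require_optional(module_name: str, feature: str) -> ModuleType:
    """Import a module on demand, with a clear error if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"{feature!r} requires the optional dependency {module_name!r}"
        ) from exc


def augment(text: str, augmentation: Optional[str] = None) -> str:
    """Apply a hypothetical augmentation; import its dependency lazily."""
    if augmentation is None:
        # No augmentation requested, so no optional import is attempted.
        return text
    if augmentation == "shorten":
        # `textwrap` is a stdlib stand-in for an optional NLP package.
        textwrap = require_optional("textwrap", "shorten")
        return textwrap.shorten(text, width=20, placeholder="...")
    raise ValueError(f"unknown augmentation: {augmentation!r}")
```

With this shape, tests and smoke runs that never request an augmentation never touch the optional import, and a missing package surfaces as a single readable error at the point of use.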

Why

The previous run path was hard to orchestrate safely for full benchmark runs: it mixed CLI-only behavior, prompt snapshot writes, API key attachment, output directory creation, and execution. The new runner gives us a repeat-aware, manifest-backed workflow that can dry-run plans, resume completed runs, and fail fast on partial or malformed outputs.
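
To make the workflow above concrete, here is a minimal sketch of a repeat-aware, manifest-backed plan with dry-run, resume, and fail-fast behavior. It is an assumption about the general shape, not the runner in this PR: `PlannedRun`, `plan_runs`, `is_complete`, `execute_plan`, the `manifest.json` layout, and the row-count quality check are all hypothetical.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, List, Sequence


@dataclass(frozen=True)
class PlannedRun:
    experiment: str
    repeat: int

    @property
    def run_id(self) -> str:
        return f"{self.experiment}-r{self.repeat}"


def plan_runs(experiments: Sequence[str], repeats: int) -> List[PlannedRun]:
    """Expand each experiment into one planned run per repeat."""
    return [PlannedRun(e, r) for e in experiments for r in range(repeats)]


def is_complete(run: PlannedRun, root: Path, expected_rows: int) -> bool:
    """A run counts as done only if its manifest is valid and complete."""
    manifest_path = root / run.run_id / "manifest.json"
    if not manifest_path.exists():
        return False
    try:
        manifest = json.loads(manifest_path.read_text())
    except json.JSONDecodeError:
        return False  # malformed manifest: treat as not completed
    return (
        isinstance(manifest, dict)
        and manifest.get("status") == "complete"
        and manifest.get("rows") == expected_rows
    )


def execute_plan(
    plan: Sequence[PlannedRun],
    root: Path,
    expected_rows: int,
    run_one: Callable[[PlannedRun], list],
    dry_run: bool = False,
) -> List[str]:
    """Run pending items; return the run ids that were (or would be) run."""
    pending = [r for r in plan if not is_complete(r, root, expected_rows)]
    if dry_run:
        return [r.run_id for r in pending]
    for run in pending:
        rows = run_one(run)
        if len(rows) != expected_rows:
            # Fail fast on partial output instead of recording success.
            raise RuntimeError(f"{run.run_id}: got {len(rows)} rows")
        run_dir = root / run.run_id
        run_dir.mkdir(parents=True, exist_ok=True)
        manifest = {"status": "complete", "rows": len(rows)}
        (run_dir / "manifest.json").write_text(json.dumps(manifest))
    return [r.run_id for r in pending]
```

Writing the manifest only after the quality check passes means a crashed or partial run leaves no "complete" marker, so a later invocation retries it rather than skipping it.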

Validation

uv run ruff check src tests conftest.py scripts/run_benchmark.py
uv run pytest tests/ -> 20 passed, 1 existing pydantic deprecation warning

Notes

Generated benchmark outputs and quarantined partial runs remain local and are not included in this PR.

Refactor benchmark execution orchestration

068ae37

glennmatlin wants to merge 1 commit into main from glenn/refactor-n-run
