[codex] Refactor benchmark execution orchestration#53

Draft
glennmatlin wants to merge 1 commit into main from glenn/refactor-n-run

@glennmatlin

Collaborator

Summary

  • Refactor the Thriller CLI path into reusable experiment preparation and run_config execution helpers.
  • Add a paper-style benchmark runner with planning, resume checks, manifests, preflight API support, and result-quality failure detection.
  • Update API/parsing behavior and optional adversarial dependencies so local tests and smoke execution do not require the full NLP stack unless those augmentations are used.
  • Expand tests around runner planning, output safety, parsing behavior, prompt generation, and API response extraction.
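One way to keep the NLP stack optional, as the third bullet describes, is to defer the import until an augmentation actually needs it. This is a minimal sketch, not the code in this PR: `load_optional` and its arguments are hypothetical names chosen for illustration.

```python
import importlib
from types import ModuleType


def load_optional(module_name: str, feature: str) -> ModuleType:
    """Import an optional dependency on first use (hypothetical helper).

    Raises a RuntimeError naming the feature that needs the package,
    so a missing extra only fails the code path that uses it.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise RuntimeError(
            f"'{feature}' requires the optional package '{module_name}'; "
            "install it to use this augmentation."
        ) from exc
```

With this shape, tests and smoke runs that never call an adversarial augmentation never trigger the import, so they pass without the heavier packages installed.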

Why

The previous run path was hard to orchestrate safely for full benchmark runs: it mixed CLI-only behavior, prompt snapshot writes, API key attachment, output directory creation, and execution. The new runner gives us a repeat-aware, manifest-backed workflow that can dry-run plans, resume completed runs, and fail fast on partial or malformed outputs.
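The workflow described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the runner added in this PR: `PlannedRun`, `plan_runs`, `is_complete`, `execute_plan`, the `manifest.json` filename, and the row-count quality check are all hypothetical.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Callable


@dataclass(frozen=True)
class PlannedRun:
    """One (task, repeat) cell of the benchmark plan (hypothetical type)."""
    task: str
    repeat: int

    @property
    def run_id(self) -> str:
        return f"{self.task}-r{self.repeat}"


def plan_runs(tasks: list[str], repeats: int) -> list[PlannedRun]:
    """Expand every task into `repeats` independent runs."""
    return [PlannedRun(t, r) for t in tasks for r in range(repeats)]


def is_complete(run_dir: Path, expected_rows: int) -> bool:
    """A run counts as done only if its manifest says so and the row count matches."""
    manifest_path = run_dir / "manifest.json"  # assumed filename
    if not manifest_path.exists():
        return False
    try:
        manifest = json.loads(manifest_path.read_text())
    except json.JSONDecodeError:
        return False  # malformed manifest: treat as incomplete
    return (
        isinstance(manifest, dict)
        and manifest.get("status") == "complete"
        and manifest.get("rows") == expected_rows
    )


def execute_plan(
    plan: list[PlannedRun],
    out_root: Path,
    expected_rows: int,
    run_one: Callable[[PlannedRun], list],
    dry_run: bool = False,
) -> dict[str, str]:
    """Run the plan, skipping completed runs and failing fast on partial output."""
    outcomes: dict[str, str] = {}
    for run in plan:
        run_dir = out_root / run.run_id
        if is_complete(run_dir, expected_rows):
            outcomes[run.run_id] = "skipped"
            continue
        if dry_run:
            outcomes[run.run_id] = "planned"  # no directories or files written
            continue
        run_dir.mkdir(parents=True, exist_ok=True)
        rows = run_one(run)
        if len(rows) != expected_rows:
            raise RuntimeError(
                f"{run.run_id}: expected {expected_rows} rows, got {len(rows)}"
            )
        manifest = {"status": "complete", "rows": len(rows)}
        (run_dir / "manifest.json").write_text(json.dumps(manifest))
        outcomes[run.run_id] = "ran"
    return outcomes
```

In this sketch the manifest is written only after the quality check passes, so a run that was interrupted or produced partial output is never mistaken for a completed one on resume.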

Validation

  • uv run ruff check src tests conftest.py scripts/run_benchmark.py
  • uv run pytest tests/ -> 20 passed, 1 existing pydantic deprecation warning

Notes

Generated benchmark outputs and quarantined partial runs remain local and are not included in this PR.
