Add opt-in concurrent evaluation (overlap eval with next solve)#11

Open
robdmac wants to merge 3 commits into SprocketLab:main from robdmac:concurrent-eval

@robdmac robdmac commented May 27, 2026

Currently, evaluation runs serially between checkpoints: a checkpoint's tests must finish before the agent starts the next one, even though eval results never feed back into the (blind) agent. This PR adds an opt-in --concurrent-evaluation flag that runs each checkpoint's eval concurrently with the next checkpoint's solve, so eval no longer blocks progress. On subprocess- or compile-heavy problems (such as test_translator), evaluation is ~57% of per-run wall-clock time, so overlapping it with solving approximately halves the run time.
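As a rough check on the "approximately halves" estimate, here is a back-of-the-envelope sketch. It assumes every checkpoint has the same solve and eval duration, which real runs will not satisfy exactly. The function name `overlapped_fraction` is illustrative and not part of this PR.

```python
def overlapped_fraction(eval_share: float, checkpoints: int) -> float:
    """Overlapped wall-clock time as a fraction of serial wall-clock time.

    Assumption: each checkpoint costs one time unit in total, split into
    `eval_share` for evaluation and the remainder for solving.
    """
    solve = 1.0 - eval_share
    serial = checkpoints * (solve + eval_share)
    # The first solve and the last eval have nothing to overlap with;
    # every step in between costs the longer of the two stages.
    overlapped = solve + (checkpoints - 1) * max(solve, eval_share) + eval_share
    return overlapped / serial


if __name__ == "__main__":
    for k in (1, 5, 10, 50):
        print(k, round(overlapped_fraction(0.57, k), 3))
```

Under these assumptions a 57% eval share with ten checkpoints gives about 0.61 of the serial time, and the ratio approaches 0.57 as the number of checkpoints grows.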

A single rolling background thread evaluates checkpoint N while checkpoint N+1 solves. The previous eval is joined before the next one starts, so solve stays at most one checkpoint ahead of eval, and at most one solve and one eval run at a time (in-flight containers are bounded to 2). Results land per checkpoint as each eval completes, rather than being batched at the end. The flag is off by default.
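The rolling-thread scheme described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `run_checkpoints`, `solve`, and `evaluate` are hypothetical names, and error propagation from the eval thread is omitted.

```python
import threading
from typing import Callable, Dict, List, Optional


def run_checkpoints(
    checkpoints: List[str],
    solve: Callable[[str], str],          # hypothetical: returns a snapshot
    evaluate: Callable[[str, str], int],  # hypothetical: returns a score
) -> Dict[str, int]:
    results: Dict[str, int] = {}
    pending: Optional[threading.Thread] = None

    def eval_into(name: str, snapshot: str) -> None:
        # Each result lands as soon as its own eval completes.
        results[name] = evaluate(name, snapshot)

    for name in checkpoints:
        # Solve runs while the previous checkpoint's eval is still in flight.
        snapshot = solve(name)
        # Join the previous eval before starting the next one, so at most
        # one eval runs at a time and solve is never more than one ahead.
        if pending is not None:
            pending.join()
        pending = threading.Thread(target=eval_into, args=(name, snapshot))
        pending.start()

    # Wait for the final checkpoint's eval before returning.
    if pending is not None:
        pending.join()
    return results
```

Because each eval is joined before the next one starts, evals complete in checkpoint order and only one thread writes to `results` at a time.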

Because eval never feeds the agent, scores are unchanged under the ANY_CASE pass policy. This was verified by re-evaluating recorded snapshots: pass counts and test_collection_hash are bit-for-bit identical at checkpoints.

Trade-off: the run can't early-stop on test failures, because a checkpoint's eval finishes during the next solve. Agent errors and rate limits still stop the run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

robdmac commented May 28, 2026

This can be hardened. Converting to draft.

@robdmac robdmac marked this pull request as draft May 28, 2026 15:29
@robdmac robdmac marked this pull request as ready for review May 28, 2026 17:43