Add opt-in concurrent evaluation (overlap eval with next solve) #11
Open
robdmac wants to merge 3 commits into
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Author: This can be hardened. Converting to draft.
Currently, evaluation runs serially between checkpoints: a checkpoint's tests must finish before the agent starts the next checkpoint, even though eval results never feed back into the (blind) agent. This PR adds an opt-in `--concurrent-evaluation` flag that runs each checkpoint's eval concurrently with the next checkpoint's solve, so eval no longer blocks progress. On subprocess/compile-heavy problems (such as test_translator), evaluation is ~57% of per-run wall-clock time, so overlapping it with the solve roughly halves run-time.
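To make the expected saving concrete, here is a back-of-the-envelope model. It is only a sketch: it assumes every checkpoint has the same solve and eval duration, which real runs will not, and the function names are hypothetical. The 0.43/0.57 split is derived from the ~57% eval share quoted above, not from separate measurements.

```python
def serial_time(n_checkpoints: int, solve_s: float, eval_s: float) -> float:
    """Wall-clock time when each eval must finish before the next solve starts."""
    return n_checkpoints * (solve_s + eval_s)


def overlapped_time(n_checkpoints: int, solve_s: float, eval_s: float) -> float:
    """Wall-clock time when eval N overlaps solve N+1 (idealized, uniform durations).

    The first solve and the last eval cannot overlap with anything; every
    step in between costs whichever of solve or eval is slower.
    """
    if n_checkpoints == 0:
        return 0.0
    return solve_s + (n_checkpoints - 1) * max(solve_s, eval_s) + eval_s


# Assumed split: eval is 57% of one checkpoint's time, solve is the other 43%.
n = 5
ratio = overlapped_time(n, 0.43, 0.57) / serial_time(n, 0.43, 0.57)
print(f"overlapped / serial for {n} checkpoints: {ratio:.2f}")
```

Under this model the ratio approaches `max(solve, eval) / (solve + eval)`, about 0.57 here, as the number of checkpoints grows. Short runs save less because the first solve and the last eval are never overlapped.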
A single rolling background thread evaluates checkpoint N while checkpoint N+1 solves. The previous eval is joined before the next one starts, so the solve stays at most one checkpoint ahead of eval, and at most one solve and one eval run at a time (in-flight containers are bounded to 2). Results land per checkpoint as each eval completes rather than being batched at the end. The flag is off by default.
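The rolling-thread scheme described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `run_pipelined`, `solve`, and `evaluate` are hypothetical names, and the sketch leaves out container management, logging, and the flag wiring.

```python
import threading
from typing import Callable, Dict, List, Optional


def run_pipelined(
    checkpoints: List[str],
    solve: Callable[[str], str],           # hypothetical: returns a snapshot id
    evaluate: Callable[[str, str], dict],  # hypothetical: returns an eval record
) -> Dict[str, dict]:
    """Evaluate checkpoint N in a background thread while N+1 solves."""
    results: Dict[str, dict] = {}
    errors: List[Exception] = []
    pending: Optional[threading.Thread] = None

    def eval_worker(name: str, snapshot: str) -> None:
        try:
            # Results land per checkpoint as soon as each eval finishes.
            results[name] = evaluate(name, snapshot)
        except Exception as exc:
            errors.append(exc)

    for name in checkpoints:
        snapshot = solve(name)  # runs while the previous eval is still going
        if pending is not None:
            pending.join()      # previous eval must finish before the next starts
        pending = threading.Thread(target=eval_worker, args=(name, snapshot))
        pending.start()

    if pending is not None:
        pending.join()          # drain the final eval
    if errors:
        raise errors[0]
    return results
```

Joining the previous thread before starting the next is what bounds the pipeline: there is never more than one eval thread alive, and the solve can never be more than one checkpoint ahead.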
Because eval results never feed back to the agent, scores are unchanged under the ANY_CASE pass policy. This was verified by re-evaluating recorded snapshots: pass counts and `test_collection_hash` are bit-for-bit identical at the checkpoints.
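A check of this kind could look roughly like the sketch below. The record layout is an assumption made for illustration: `test_collection_hash` is the field named above, while `pass_count`, `diff_results`, and the dict-of-dicts shape are hypothetical and may not match the repository's actual result format.

```python
from typing import Dict, List, Mapping, Sequence


def diff_results(
    serial: Mapping[str, Mapping[str, object]],
    concurrent: Mapping[str, Mapping[str, object]],
    keys: Sequence[str] = ("pass_count", "test_collection_hash"),
) -> List[str]:
    """Return human-readable mismatches between two runs, keyed by checkpoint."""
    mismatches: List[str] = []
    for checkpoint in sorted(set(serial) | set(concurrent)):
        a = serial.get(checkpoint)
        b = concurrent.get(checkpoint)
        if a is None or b is None:
            mismatches.append(f"{checkpoint}: missing in one run")
            continue
        for key in keys:
            if a.get(key) != b.get(key):
                mismatches.append(
                    f"{checkpoint}: {key} differs ({a.get(key)!r} vs {b.get(key)!r})"
                )
    return mismatches
```

An empty list means the two runs agree on every compared field at every checkpoint.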
Trade-off: the run can no longer stop early on test failures, because a checkpoint's eval finishes during the next solve. Agent errors and rate limits still stop the run.