fix: eval worktree CWD, fallback patch discovery, Triton wrapper dete… #118
Open
iraj465 wants to merge 1 commit into
Two fixes for the post-round evaluation pipeline:
1. commandment.py: Add `cd "${GEAK_WORK_DIR}" &&` before `exec python3`
in all run.sh variants. Without this, the eval worktree's Python CWD
was wrong, so `open('kernel.py')` and harness imports resolved from
the wrong directory during FULL_BENCHMARK verification. This caused
verified_speedup to always show ~1.0x even when the agent achieved
real speedups (e.g. 4.46x measured independently by AKA).
2. evaluation.py: When no per-task best_results.json exists, check for
best_patch.diff at the round root. Handles the case where
dispatch_tasks fails and the orchestrator LLM creates patches directly.
Made-with: Cursor
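To illustrate the first fix, here is a minimal sketch of why the working directory matters for a relative `open('kernel.py')`. The directory names `baseline` and `eval_worktree` and the helper `read_kernel_from` are hypothetical and exist only for this demonstration; the sketch does not reproduce GEAK's actual run.sh or harness.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def read_kernel_from(cwd: Path) -> str:
    """Start a child Python process in `cwd` and return what open('kernel.py') reads."""
    result = subprocess.run(
        [sys.executable, "-c", "print(open('kernel.py').read().strip())"],
        cwd=cwd,  # plays the role of `cd "<dir>" &&` before `exec python3`
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: two directories, each holding its own kernel.py.
    baseline = Path(tmp) / "baseline"
    worktree = Path(tmp) / "eval_worktree"
    for directory, body in (
        (baseline, "# baseline kernel"),
        (worktree, "# optimized kernel"),
    ):
        directory.mkdir()
        (directory / "kernel.py").write_text(body + "\n")

    # The same relative path resolves to a different file depending on the CWD.
    from_baseline = read_kernel_from(baseline)
    from_worktree = read_kernel_from(worktree)

print(from_baseline)
print(from_worktree)
```

If the child process starts in the wrong directory, it silently loads the other `kernel.py`, which is consistent with a benchmark that keeps reporting roughly 1.0x.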
d1aa1f8 to 7c422bd
It's WIP. Some aspects of the code quality could be better; open to suggestions.
7c422bd to 7053e00
Title: fix: eval worktree CWD and fallback patch discovery
Body:
Two fixes for the post-round evaluation pipeline:
1. commandment.py: Add `cd "${GEAK_WORK_DIR}" &&` before `exec python3` in all run.sh variants. Without this, the eval worktree's Python CWD was wrong, so `open('kernel.py')` and harness imports resolved from the wrong directory. This caused FULL_BENCHMARK verification to test the unpatched baseline kernel instead of the optimized one.
2. evaluation.py: When no per-task best_results.json exists, check for best_patch.diff at the round root. Handles the case where dispatch_tasks fails and the orchestrator LLM creates patches directly.
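The fallback described in the second fix could look roughly like the sketch below. It assumes a layout in which each task has its own subdirectory under the round directory; that layout, the function name `discover_round_result`, and the returned labels are all hypothetical and not taken from evaluation.py.

```python
import tempfile
from pathlib import Path
from typing import Optional, Tuple


def discover_round_result(round_dir: Path) -> Optional[Tuple[str, Path]]:
    """Return (source, path) for the round's best result, or None if nothing is found."""
    # Preferred source: a per-task best_results.json (assumed to live one level down).
    per_task = sorted(round_dir.glob("*/best_results.json"))
    if per_task:
        return ("per_task", per_task[0])

    # Fallback: a patch written directly at the round root.
    fallback = round_dir / "best_patch.diff"
    if fallback.is_file():
        return ("round_root_patch", fallback)

    return None


with tempfile.TemporaryDirectory() as tmp:
    round_dir = Path(tmp)

    # Nothing has been written yet.
    empty_case = discover_round_result(round_dir)

    # Only the round-root patch exists (the dispatch-failure case).
    (round_dir / "best_patch.diff").write_text("--- a/kernel.py\n+++ b/kernel.py\n")
    fallback_case = discover_round_result(round_dir)
    fallback_kind = fallback_case[0]
    fallback_name = fallback_case[1].name

    # Once a per-task result exists, it takes precedence over the fallback.
    task_dir = round_dir / "task_0"
    task_dir.mkdir()
    (task_dir / "best_results.json").write_text("{}")
    per_task_kind = discover_round_result(round_dir)[0]
```

Checking the per-task files first keeps the existing behavior unchanged whenever dispatch succeeds; the round-root patch is consulted only when no per-task result exists.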
Evidence: In AKA benchmark runs, GEAK internally reported verified_speedup=1.0x for refk_identity while AKA's independent re-evaluation measured 4.46x (baseline=0.0174ms, optimized=0.0039ms). The CWD fix ensures both measurements agree by running the correct kernel in the eval worktree.
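As a quick check of the numbers quoted above, the speedup is the ratio of the baseline time to the optimized time. The variable names here are illustrative only.

```python
# Timings quoted in the evidence above, in milliseconds.
baseline_ms = 0.0174
optimized_ms = 0.0039

# Speedup is baseline time divided by optimized time.
speedup = baseline_ms / optimized_ms
speedup_label = f"{speedup:.2f}x"
print(speedup_label)  # prints "4.46x"
```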
Tested on: 7 AKA Triton kernels completed the standard evaluation flow, all with compile=true and correct=true.