fix: eval worktree CWD, fallback patch discovery, Triton wrapper dete…#118

Open
iraj465 wants to merge 1 commit into main from fix/eval-worktree-cwd-and-patches

iraj465 (Collaborator) commented Apr 8, 2026

Title: fix: eval worktree CWD and fallback patch discovery

Body:

Two fixes for the post-round evaluation pipeline:

`commandment.py`: Add `cd "${GEAK_WORK_DIR}" &&` before `exec python3` in all `run.sh` variants. Without this, Python's CWD in the eval worktree was wrong, so `open('kernel.py')` and harness imports resolved from the wrong directory. As a result, FULL_BENCHMARK verification tested the unpatched baseline kernel instead of the optimized one.
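To illustrate the failure mode, here is a minimal sketch (not the code in `commandment.py`) showing that a relative `open('kernel.py')` in a child Python process resolves against that process's CWD. The directory names `baseline` and `eval_worktree` and the helper `read_kernel_from` are hypothetical, chosen only for this demonstration.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def read_kernel_from(cwd: Path) -> str:
    """Start a child Python in `cwd` and read 'kernel.py' by relative path."""
    result = subprocess.run(
        [sys.executable, "-c", "print(open('kernel.py').read(), end='')"],
        cwd=cwd,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical layout: one directory per kernel variant.
    baseline = Path(tmp) / "baseline"
    worktree = Path(tmp) / "eval_worktree"
    for directory, body in (
        (baseline, "# baseline kernel\n"),
        (worktree, "# optimized kernel\n"),
    ):
        directory.mkdir()
        (directory / "kernel.py").write_text(body)

    # Same command, different CWD: a different kernel.py is picked up.
    wrong_cwd_result = read_kernel_from(baseline)
    fixed_cwd_result = read_kernel_from(worktree)
```

Changing into the work directory before `exec python3` plays the same role as the `cwd=` argument here: it pins relative paths and imports to the worktree that holds the patched kernel.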

`evaluation.py`: When no per-task `best_results.json` exists, check for `best_patch.diff` at the round root. This handles the case where `dispatch_tasks` fails and the orchestrator LLM creates patches directly.
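The fallback could look roughly like the sketch below. It is not the code in `evaluation.py`: the function name `discover_round_artifacts` and the assumption that per-task results live one directory below the round root (for example `task_0/best_results.json`) are hypothetical. Only the file names `best_results.json` and `best_patch.diff` come from this PR.

```python
import tempfile
from pathlib import Path
from typing import List


def discover_round_artifacts(round_dir: Path) -> List[Path]:
    """Prefer per-task results; fall back to a patch at the round root."""
    # Assumed layout: <round_dir>/<task>/best_results.json
    per_task = sorted(round_dir.glob("*/best_results.json"))
    if per_task:
        return per_task
    # Fallback: a patch written directly at the round root.
    fallback = round_dir / "best_patch.diff"
    return [fallback] if fallback.is_file() else []


with tempfile.TemporaryDirectory() as tmp:
    round_dir = Path(tmp)
    empty_round = discover_round_artifacts(round_dir)

    (round_dir / "best_patch.diff").write_text("--- a/kernel.py\n+++ b/kernel.py\n")
    fallback_only = [p.name for p in discover_round_artifacts(round_dir)]

    task_dir = round_dir / "task_0"
    task_dir.mkdir()
    (task_dir / "best_results.json").write_text("{}")
    with_task_results = [p.name for p in discover_round_artifacts(round_dir)]
```

In this sketch per-task results take precedence, so the root-level patch is only consulted when the normal dispatch path produced nothing.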

Evidence: In AKA benchmark runs, GEAK internally reported verified_speedup=1.0x for refk_identity while AKA's independent re-evaluation measured 4.46x (baseline=0.0174ms, optimized=0.0039ms). The CWD fix ensures both measurements agree by running the correct kernel in the eval worktree.
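As a quick check of the arithmetic, the reported speedup is the ratio of baseline latency to optimized latency. The helper name `speedup` is illustrative only.

```python
def speedup(baseline_ms: float, optimized_ms: float) -> float:
    """Speedup as baseline latency divided by optimized latency."""
    return baseline_ms / optimized_ms


# Latencies quoted above for refk_identity.
measured = speedup(0.0174, 0.0039)
print(f"{measured:.2f}x")  # prints 4.46x
```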

Tested on: 7 AKA Triton kernels completed with the standard evaluation flow; all report compile=true, correct=true.

Two fixes for the post-round evaluation pipeline:

1. commandment.py: Add `cd "${GEAK_WORK_DIR}" &&` before `exec python3`
   in all run.sh variants. Without this, the eval worktree's Python CWD
   was wrong, so `open('kernel.py')` and harness imports resolved from
   the wrong directory during FULL_BENCHMARK verification. This caused
   verified_speedup to always show ~1.0x even when the agent achieved
   real speedups (e.g. 4.46x measured independently by AKA).

2. evaluation.py: When no per-task best_results.json exists, check for
   best_patch.diff at the round root. Handles the case where
   dispatch_tasks fails and the orchestrator LLM creates patches directly.

Made-with: Cursor
iraj465 force-pushed the fix/eval-worktree-cwd-and-patches branch from d1aa1f8 to 7c422bd on April 8, 2026 16:41

iraj465 (Collaborator, Author) commented Apr 8, 2026

This is WIP. Some aspects of code quality could be better; I'm open to suggestions.

iraj465 added the bug (Something isn't working) and WIP labels on Apr 8, 2026
Umangatamd force-pushed the fix/eval-worktree-cwd-and-patches branch from 7c422bd to 7053e00 on May 3, 2026 07:15