judge: embed seed test file in the prompt; reframe tests as joint review #17
Open
samkeen wants to merge 1 commit into
Closes #16.

The per-task seed test file is committed at seed time (0e3ba27) so it never appears in task_diff. Today's judge prompt receives the diff only, which produces two failure modes seen in session 20260526-132309-ab3927:

- false rejection on file-existence acceptance criteria
- worker buying past the judge with cosmetic edits to get the file into the diff (41 iterations, 448k tokens on T-001)

This commit lands two paired changes:

1. _build_judge_user_message embeds the matching `tests/test_t<NNN>_*.py` under a new section flagged as "already on disk, not in this diff". Reuses validators._task_test_glob so the judge sees exactly what pytest ran. Capped at JUDGE_TEST_FILE_MAX_CHARS (8k).

2. prompts/judge.md reframes the test file and implementation as joint subjects of review. Two new rejection categories (Weak test and Tests-pass-but-wrong), plus a per-AC↔assertion correspondence rule and an explicit "files in HEAD count as exists" rule covering the #16 symptom directly.

Intentionally not in this PR: the larger Gherkin / worker-writes-tests responsibility-matrix question. Want to see how this change shifts judge behaviour across a few real runs before deciding whether to redistribute authorship.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Closes #16. Lands two paired changes that fix the symptom (false rejection on file-existence ACs) and the deeper problem (judge can't verify the assertions actually pin down the criteria).
`_build_judge_user_message` globs `tests/test_t<NNN>_*.py` (reusing `validators._task_test_glob`, so the judge sees exactly what pytest ran) and embeds it under `## Seed acceptance test (already on disk, not in this diff)`. Capped at `JUDGE_TEST_FILE_MAX_CHARS = 8000`.

The new `_build_judge_user_message` is pure on `(task, worktree, diff)`, so it's testable without mocking the LLM client.

What's intentionally not in this PR
The larger responsibility-matrix question (Gherkin output from `prep-feature`, worker writes its own tests, etc.) stays open. The plan is to run two or three demo sessions with the new judge framing, look at what it rejects, and decide based on data whether the seeder's tests or the worker's gaming is the bottleneck. Phase 2 walks through that door with evidence, not a guess.

Test plan
- `pytest tests/`: 220 passed (7 new in `test_judge_seed_test_embed.py`)
- `ruff check tilth/ tests/`: clean

🤖 Generated with Claude Code