
judge: embed seed test file in the prompt; reframe tests as joint review#17

Open: samkeen wants to merge 1 commit into main from judge-sees-seed-test


samkeen commented May 27, 2026

Summary

Closes #16. Lands two paired changes that fix both the symptom (false rejection on file-existence ACs) and the deeper problem (the judge can't verify that the assertions actually pin down the criteria).

  • Embed the seed test in the judge prompt. `_build_judge_user_message` globs `tests/test_t<NNN>_*.py` (reusing `validators._task_test_glob`, so the judge sees exactly what pytest ran) and embeds the file under `## Seed acceptance test (already on disk, not in this diff)`. Capped at `JUDGE_TEST_FILE_MAX_CHARS = 8000`.
  • Reframe the judge prompt. Tests and implementation are now joint review subjects. There are two new rejection categories, Weak test and Tests-pass-but-wrong, plus a per-AC↔assertion correspondence rule and an explicit "files in HEAD count as exists" rule that directly resolves the #16 symptom (Judge can't review seed acceptance test files; rejects on file-existence criteria).

The new `_build_judge_user_message` is a pure function of `(task, worktree, diff)`, so it can be tested without mocking the LLM client.

What's intentionally not in this PR

The larger responsibility-matrix question (Gherkin output from prep-feature, worker writes its own tests, etc.) stays open. The plan is to run two or three demo sessions with the new judge framing, look at what it rejects on, and decide based on data whether the seeder's tests or the worker's gaming is the bottleneck. Phase 2 walks through that door with evidence, not a guess.

Test plan

  • pytest tests/ — 220 passed (7 new in test_judge_seed_test_embed.py)
  • ruff check tilth/ tests/ — clean
  • Run the demo session and observe whether the file-existence rejection goes away on T-001
  • Watch a session where the seed test is weak (e.g. only checks return code on an AC that names stderr too) and confirm the judge flags it
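To make the "weak seed test" case in the last item concrete, here is a hypothetical pair of tests for an acceptance criterion that names both the exit code and stderr. The command under test is an inline stand-in invented for this sketch, not anything from the repository.

```python
import subprocess
import sys

# Hypothetical command under test: exits 2 and writes an error to stderr.
CMD = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('error: missing input\\n'); sys.exit(2)",
]


def run_cmd() -> subprocess.CompletedProcess:
    return subprocess.run(CMD, capture_output=True, text=True)


def test_weak_only_checks_return_code():
    # Weak: the AC also names stderr, but nothing here pins it down.
    result = run_cmd()
    assert result.returncode == 2


def test_pins_return_code_and_stderr():
    # Stronger: one assertion per clause of the AC.
    result = run_cmd()
    assert result.returncode == 2
    assert "missing input" in result.stderr
```

Both tests pass against this stand-in, which is the point: the first would keep passing even if the error message disappeared, so a judge applying the per-AC↔assertion rule would be expected to flag it.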

🤖 Generated with Claude Code

Closes #16. The per-task seed test file is committed at seed time
(0e3ba27) so it never appears in task_diff. Today's judge prompt receives
the diff only, which produces two failure modes seen in session
20260526-132309-ab3927:

  - false rejection on file-existence acceptance criteria
  - worker getting past the judge by making cosmetic edits to force the
    file into the diff (41 iterations, 448k tokens on T-001)

This commit lands two paired changes:

1. _build_judge_user_message embeds the matching `tests/test_t<NNN>_*.py`
   under a new section flagged as "already on disk, not in this diff".
   Reuses validators._task_test_glob so the judge sees exactly what
   pytest ran. Capped at JUDGE_TEST_FILE_MAX_CHARS (8k).

2. prompts/judge.md reframes the test file and implementation as joint
   subjects of review. Two new rejection categories — Weak test and
   Tests-pass-but-wrong — plus a per-AC↔assertion correspondence rule
   and an explicit "files in HEAD count as exists" rule covering the #16
   symptom directly.

Intentionally not in this PR: the larger Gherkin / worker-writes-tests
responsibility-matrix question. Want to see how this change shifts judge
behaviour across a few real runs before deciding whether to redistribute
authorship.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>