judge: embed seed test file in the prompt; reframe tests as joint review by samkeen · Pull Request #17 · AlteredCraft/tilth

samkeen · 2026-05-27T00:12:19Z

Summary

Closes #16. Lands two paired changes that fix the symptom (false rejection on file-existence ACs) and the deeper problem (judge can't verify the assertions actually pin down the criteria).

Embed the seed test in the judge prompt. _build_judge_user_message globs tests/test_t<NNN>_*.py (reusing validators._task_test_glob, so the judge sees exactly what pytest ran) and embeds it under ## Seed acceptance test (already on disk, not in this diff). Capped at JUDGE_TEST_FILE_MAX_CHARS = 8000.
Reframe the judge prompt. Tests and implementation are now joint review subjects. Two new rejection categories (Weak test and Tests-pass-but-wrong), plus a per-AC↔assertion correspondence rule and an explicit "files in HEAD count as exists" rule that directly resolves the #16 symptom (Judge can't review seed acceptance test files; rejects on file-existence criteria).

The new _build_judge_user_message is pure on (task, worktree, diff) so it's testable without mocking the LLM client.

What's intentionally not in this PR

The larger responsibility-matrix question (Gherkin output from prep-feature, worker writes its own tests, etc.) stays open. The plan is to run two or three demo sessions with the new judge framing, look at what it rejects on, and decide based on data whether the seeder's tests or the worker's gaming is the bottleneck. Phase 2 walks through that door with evidence, not a guess.

Test plan

pytest tests/ — 220 passed (7 new in test_judge_seed_test_embed.py)
ruff check tilth/ tests/ — clean
Run the demo session and observe whether the file-existence rejection goes away on T-001
Watch a session where the seed test is weak (e.g. only checks return code on an AC that names stderr too) and confirm the judge flags it
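To make the "weak test" case concrete, here is a hedged illustration of the pattern the judge is expected to flag. The command, the exit code, and the error text are all hypothetical; the acceptance criterion assumed here is "exits with code 2 and reports the problem on stderr".

```python
import subprocess
import sys

# Hypothetical command under test: exits 2 and writes an error to stderr.
CMD = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('error: missing input\\n'); sys.exit(2)",
]


def run_cmd() -> subprocess.CompletedProcess:
    """Run the command and capture stdout/stderr as text."""
    return subprocess.run(CMD, capture_output=True, text=True)


def test_weak_only_checks_return_code():
    # Weak: still passes if stderr is empty, so the stderr half of the AC is unpinned.
    assert run_cmd().returncode == 2


def test_covers_return_code_and_stderr():
    # Stronger: one assertion per clause of the acceptance criterion.
    result = run_cmd()
    assert result.returncode == 2
    assert "error: missing input" in result.stderr
```

Both tests pass against this command, which is the point: a passing weak test says nothing about the stderr clause, so the per-AC↔assertion check has to happen in review.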

🤖 Generated with Claude Code

Closes #16.

The per-task seed test file is committed at seed time (0e3ba27) so it never appears in task_diff. Today's judge prompt receives the diff only, which produces two failure modes seen in session 20260526-132309-ab3927:

- false rejection on file-existence acceptance criteria
- worker buying its way past the judge with cosmetic edits to get the file into the diff (41 iterations, 448k tokens on T-001)

This commit lands two paired changes:

1. _build_judge_user_message embeds the matching `tests/test_t<NNN>_*.py` under a new section flagged as "already on disk, not in this diff". Reuses validators._task_test_glob so the judge sees exactly what pytest ran. Capped at JUDGE_TEST_FILE_MAX_CHARS (8k).
2. prompts/judge.md reframes the test file and implementation as joint subjects of review. Two new rejection categories (Weak test and Tests-pass-but-wrong), plus a per-AC↔assertion correspondence rule and an explicit "files in HEAD count as exists" rule covering the #16 symptom directly.

Intentionally not in this PR: the larger Gherkin / worker-writes-tests responsibility-matrix question. Want to see how this change shifts judge behaviour across a few real runs before deciding whether to redistribute authorship.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

samkeen wants to merge 1 commit into main from judge-sees-seed-test
