Judge can't review seed acceptance test files; rejects on file-existence criteria #16

@samkeen

What happens

In a session where the per-task seed test file (tests/test_t<NNN>_*.py) is committed to the session branch at prep time (via commit_seed, landed in 0e3ba27), the judge has no visibility into the test contents during per-task review. Its prompt receives:

  • task description
  • acceptance criteria
  • AGENTS.md
  • task_diff (= git diff HEAD + intent-to-add for untracked)

The seed test file is in HEAD by design, so it's not in task_diff. Two real consequences observed in session 20260526-132309-ab3927:

Symptom 1: false rejection on file-existence AC

When the seeder writes an acceptance criterion phrased as "tests/ directory exists with at least tests/test_t001_scaffold.py", the judge sees the diff, doesn't see the file, and rejects:

"The diff does not create the required tests/ directory or any file within it (e.g., tests/test_t001_scaffold.py)" (judge_verdict at iter=11 and iter=15)

Symptom 2: worker gets past the judge with cosmetic edits

After 2 rejects, the worker eventually touched the seed test file with a one-line cosmetic edit (PROJECT_ROOT = Path(__file__).resolve().parent.parent) just to get the file into the diff. T-001's accepted commit:

tests/test_t001_scaffold.py | 4 +++-

Real cost: 41 iterations and 448k tokens on a scaffold task (71% of the whole session's tokens) — vs. T-002/T-003 at 8 iterations each. The self-improvement step then proposed the cosmetic refactor as a "learning to add to AGENTS.md," compounding the harness dysfunction.

The deeper question

Beyond the false-rejection symptom there is a broader gap: even when the worker's diff is clean, the judge today can't verify whether the implementation actually satisfies what the seed test will assert. That's exactly the subjective check the judge exists for. Seeing the diff alone, without the test file, the judge has to infer from the AC whether the implementation matches, and AC are by design coarser than test assertions.

Three options for fixing it

| Option | What it does | Token cost per judge call | Catches symptom 1? | Catches deeper question? |
|---|---|---|---|---|
| A. File path only | Append to AC section: "tests/test_t001_scaffold.py is already in HEAD from the seed commit; presence satisfies any file-existence criterion." | ~20 tokens | yes | no |
| B. Path + test function signatures | Path + def test_*(...) lines + docstrings, no bodies | ~150 tokens | yes | partial (judge knows what's tested at a high level, not what's asserted) |
| C. Full file contents | Embed the full per-task seed test file under a new section labelled ## Seed acceptance test (already on disk, not in this diff) | ~600–1200 tokens | yes | yes |

Token costs were measured from the actual seed tests in session 20260526-132309: T-001 ~600, T-002 ~900, T-003 ~1200. The run had 5 judge calls in total, so Option C adds ~5k tokens over the current prompt. The diff itself can already run up to JUDGE_DIFF_MAX_CHARS, so this is a small fractional increase on a call that isn't cheap to begin with.
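
The arithmetic behind the ~5k figure can be bounded with a few lines of Python. This is a rough sketch: the per-task token counts and the 5 judge calls come from the figures above, but the split of calls across tasks is not stated here, so only lower and upper bounds are computed. The session total is an estimate derived from the 448k tokens (71% of the session) reported for T-001.

```python
# Approximate seed-test sizes in tokens, taken from the measurements above.
SEED_TEST_TOKENS = {"T-001": 600, "T-002": 900, "T-003": 1200}
JUDGE_CALLS = 5  # total judge calls in the session


def option_c_overhead_bounds(per_task_tokens: dict, judge_calls: int) -> tuple:
    """Bound the extra tokens Option C adds if every call embeds one seed test."""
    low = min(per_task_tokens.values()) * judge_calls
    high = max(per_task_tokens.values()) * judge_calls
    return low, high


low, high = option_c_overhead_bounds(SEED_TEST_TOKENS, JUDGE_CALLS)

# Estimated session total: T-001 used 448k tokens, reported as 71% of the session.
session_total = 448_000 / 0.71
worst_case_share = high / session_total

print(f"Option C overhead: {low} to {high} tokens")
print(f"Worst-case share of session tokens: {worst_case_share:.2%}")
```

Under these assumptions the overhead falls between 3,000 and 6,000 tokens, which is under 1% of the estimated session total even in the worst case.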

Recommendation

Option C. The narrow symptom (false rejection) is just one failure mode; the deeper "does the diff really satisfy the asserted behavior" question is what subjective judging is supposed to catch in the first place. Hiding the test file from the judge undermines its job.

Implementation outline:

  • In tilth/loop.py:_judge_task, after the "Acceptance criteria" section, glob worktree/tests/test_<task_id_normalised>_*.py (reuse validators._task_test_glob logic), read the file, embed under ## Seed acceptance test (already on disk, not in this diff).
  • Update tilth/prompts/judge.md to explain the new section and the "files in HEAD count as exists" rule for any AC mentioning file presence.
  • Test: assert the judge prompt includes the test contents when a matching file exists.

Related

  • Introduced by commit-the-seed-at-prep (commit 0e3ba27 — "prep: commit the seed bundle so per-task diffs start clean"). That commit fixed an opposite scope-creep bug; this issue is the dual.
  • Session evidence: sessions/20260526-132309-ab3927/events.jsonl (look for judge_verdict events at iter=11 and iter=15 of T-001).

Labels: bug
