Judge can't review seed acceptance test files; rejects on file-existence criteria

## What happens

In a session where the per-task seed test file (`tests/test_t<NNN>_*.py`) is committed to the session branch at prep time (via `commit_seed`, landed in 0e3ba27), the judge has no visibility into the test contents during per-task review. Its prompt receives:

- task description
- acceptance criteria
- AGENTS.md
- `task_diff` (= `git diff HEAD` + intent-to-add for untracked files)

The seed test file is in HEAD by design, so it's *not* in `task_diff`. Two real consequences observed in session `20260526-132309-ab3927`:
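
The blind spot can be reproduced in a throwaway repository. The sketch below assumes `task_diff` is built roughly as `git add --intent-to-add .` followed by `git diff HEAD`; the helper names (`git`, `demo_task_diff_paths`) are hypothetical and not taken from the tilth codebase.

```python
import subprocess
import tempfile
from pathlib import Path


def git(repo: Path, *args: str) -> str:
    """Run a git command inside `repo` and return its stdout."""
    cmd = [
        "git",
        "-c", "user.name=demo",
        "-c", "user.email=demo@example.com",
        "-c", "commit.gpgsign=false",
        *args,
    ]
    result = subprocess.run(cmd, cwd=repo, check=True, capture_output=True, text=True)
    return result.stdout


def demo_task_diff_paths() -> list[str]:
    """Return the paths a `git diff HEAD`-style task diff would contain."""
    with tempfile.TemporaryDirectory() as tmp:
        repo = Path(tmp)
        git(repo, "init", "--quiet")

        # Prep time: the seed test is committed, so it becomes part of HEAD.
        (repo / "tests").mkdir()
        (repo / "tests" / "test_t001_scaffold.py").write_text(
            "def test_scaffold():\n    assert True\n", encoding="utf-8"
        )
        git(repo, "add", ".")
        git(repo, "commit", "--quiet", "-m", "seed")

        # Task time: the worker creates a new, untracked file.
        (repo / "app.py").write_text("print('hello')\n", encoding="utf-8")

        # Assumed diff construction: mark untracked files, then diff against HEAD.
        git(repo, "add", "--intent-to-add", ".")
        return git(repo, "diff", "HEAD", "--name-only").split()
```

Only the worker's new file shows up in the result; the unmodified seed test never appears, which is what the judge trips over.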

### Symptom 1: false rejection on file-existence AC

When the seeder writes an acceptance criterion phrased as *"tests/ directory exists with at least tests/test_t001_scaffold.py"*, the judge sees the diff, doesn't see the file, and rejects:

> *"The diff does not create the required tests/ directory or any file within it (e.g., tests/test_t001_scaffold.py)"* — `judge_verdict` at iter=11 and iter=15

### Symptom 2: worker buys its way past the judge with a cosmetic edit

After two rejections, the worker eventually made a one-line cosmetic edit to the seed test file (`PROJECT_ROOT = Path(__file__).resolve().parent.parent`) solely to get the file into the diff. T-001's accepted commit:

```
tests/test_t001_scaffold.py | 4 +++-
```

Real cost: **41 iterations and 448k tokens on a scaffold task** (71% of the whole session's tokens), versus 8 iterations each for T-002 and T-003. The self-improvement step then proposed the cosmetic refactor as a "learning to add to AGENTS.md," compounding the harness dysfunction.

## The deeper question

Beyond the false-rejection symptom: even when the worker's diff *is* clean, the judge today can't verify *"does the implementation actually satisfy what the seed test will assert?"* That's exactly the subjective check the judge exists for. Seeing the diff alone, without the test file, the judge has to infer from the AC whether the implementation matches — and AC are by design coarser than test assertions.

## Three options for fixing it

| Option | What it does | Token cost per judge call | Catches symptom 1? | Catches deeper question? |
|---|---|---|---|---|
| **A. File path only** | Append to AC section: *"tests/test_t001_scaffold.py is already in HEAD from the seed commit; presence satisfies any file-existence criterion."* | ~20 tokens | ✅ | ❌ |
| **B. Path + test function signatures** | Path + `def test_*(...)` lines + docstrings, no bodies | ~150 tokens | ✅ | partial (judge knows what's tested at a high level, not what's asserted) |
| **C. Full file contents** | Embed the full per-task seed test file under a new section labelled `## Seed acceptance test (already on disk, not in this diff)` | ~600–1200 tokens | ✅ | ✅ |

Token costs are measured from the actual seed tests in session 20260526-132309: T-001 ~600, T-002 ~900, T-003 ~1200. The run had 5 judge calls in total, so Option C adds ~5k tokens over the current prompt. The diff itself can already be as large as `JUDGE_DIFF_MAX_CHARS`, so this is a small fractional increase on a call that isn't cheap to begin with.

## Recommendation

**Option C.** The narrow symptom (false rejection) is just one failure mode; the deeper "does the diff really satisfy the asserted behavior" question is what subjective judging is supposed to catch in the first place. Hiding the test file from the judge undermines its job.

Implementation outline:
- In `tilth/loop.py:_judge_task`, after the "Acceptance criteria" section, glob `worktree/tests/test_<task_id_normalised>_*.py` (reuse `validators._task_test_glob` logic), read the file, embed under `## Seed acceptance test (already on disk, not in this diff)`.
- Update `tilth/prompts/judge.md` to explain the new section and the *"files in HEAD count as exists"* rule for any AC mentioning file presence.
- Test: assert the judge prompt includes the test contents when a matching file exists.

## Related

- Introduced by commit-the-seed-at-prep (commit 0e3ba27 — "prep: commit the seed bundle so per-task diffs start clean"). That commit fixed an opposite scope-creep bug; this issue is the dual.
- Session evidence: `sessions/20260526-132309-ab3927/events.jsonl` (look for `judge_verdict` events at iter=11 and iter=15 of T-001).
