Judge rejections trigger long re-verification tails after the fix is already in

## What

When the judge rejects a task, the worker fixes the issue quickly, then burns several more iterations on redundant verification before declaring done again. Under the previous `TILTH_MAX_ITERATIONS_PER_TASK=8`, a single judge rejection late in a task could by itself push the run past the cap.

## Evidence from the 5/1 demo

`sessions/20260501-143858-f8e7a3/` — T-004 took 15 iterations total. Anatomy:

| Iter | Activity |
| ---- | --- |
| 1–6 | Implement `cmd_list`, run pytest |
| **7** | Validators pass → judge **rejects** (`lines[i-1]` indexing bug) |
| 8 | Re-read file. Reasoning trace cleanly identifies the bug. |
| 9 | `edit_file` — fix lands. |
| **10** | `bash` — mktemp+heredoc smoke test (`# Groceries` + open + done) |
| **11** | `bash` — same smoke test, minor variation |
| **12** | `bash` — same smoke test, third variation |
| **13** | `bash` — `pytest tests/test_t004_list.py` |
| **14** | `bash` — `pytest` across all task test files |
| 15 | No tool calls → validators run → judge **accepts** |

Iters 10–14 are five mostly redundant verifications: the fix at iter 9 was already correct. With the prior cap of 8 iterations, this task would have failed with `iter_cap` at iter 8, *before* the worker could even read the file in response to the rejection.

The cap has now been bumped to 32, which fixes the immediate symptom (the issue with `b45a6f5` / `ec5486d`).

## Why this happens

Two probable contributors:

1. `tilth/prompts/system.md` includes a self-review reminder:
   > Before your final summary, ask yourself: "What evidence do I have that the acceptance criteria are met?" If the answer is only "the code looks right," run the tests one more time.

   After being burned by a judge rejection, the worker is incentivised to over-verify.
2. The judge feedback message in `loop.py:552` says "Stop calling tools and respond with a summary only when the issue is resolved." It gives no cheap signal for what counts as resolved beyond "the issue."

## Proposed directions (not mutually exclusive)

1. **Per-rejection iteration bonus** instead of a global cap, e.g. `+5` iters per judge rejection. Keeps the global cap honest for well-behaved tasks; absorbs the verification tail when something legitimately surprises the worker.
2. **Tighten the judge-rejection feedback message** with explicit guidance: "Fix the issue and re-run only the acceptance tests. Do not add ad-hoc smoke tests. Declare done when the acceptance tests pass with the fix."
3. **Detect verification loops**. If the worker has run pytest twice with the same exit status and no intervening edit, inject "you've already verified; declare done" as feedback. (Probably too clever / brittle.)
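
To make direction (3) concrete, here is a minimal sketch of the check. It is not taken from `tilth/loop.py`: the `ToolCall` record, its field names, and the `looks_like_verification_loop` helper are all hypothetical, and identifying a pytest run by the command prefix is a simplifying assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class ToolCall:
    """Hypothetical record of one worker tool call (not tilth's real type)."""
    tool: str                          # e.g. "bash" or "edit_file"
    command: str = ""                  # shell command, if tool == "bash"
    exit_status: Optional[int] = None  # exit code, if known


def is_pytest_run(call: ToolCall) -> bool:
    # Assumption: a pytest run is a bash call whose command starts with "pytest".
    return call.tool == "bash" and call.command.strip().startswith("pytest")


def looks_like_verification_loop(history: Sequence[ToolCall]) -> bool:
    """True if the two most recent pytest runs share an exit status
    and no edit_file call happened between them (or after them)."""
    latest_status: Optional[int] = None
    seen_latest = False
    # Walk backwards from the most recent call.
    for call in reversed(history):
        if call.tool == "edit_file":
            # An edit invalidates earlier verification; re-running is fair.
            return False
        if is_pytest_run(call):
            if seen_latest:
                return call.exit_status == latest_status
            seen_latest = True
            latest_status = call.exit_status
    return False


# Shape of the T-004 tail after the rejection (exit statuses are assumed).
history = [
    ToolCall("edit_file"),
    ToolCall("bash", "bash smoke.sh", 0),
    ToolCall("bash", "pytest tests/test_t004_list.py", 0),
    ToolCall("bash", "pytest", 0),
]
print(looks_like_verification_loop(history))
```

Note that this heuristic only keys on pytest, so it would fire at the second pytest run in a trace shaped like T-004 but would not catch the three ad-hoc smoke tests before it, which is part of why it may be too brittle.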

(1) is the most defensible — it acknowledges that a judge rejection legitimately extends the work envelope, without normalising the long tail for everyone.
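
A rough sketch of how (1) could compute the cap. The function name, the `hard_ceiling` parameter, and the default of 5 are illustrative assumptions, not existing tilth settings.

```python
from typing import Optional


def effective_iteration_cap(
    base_cap: int,
    rejections: int,
    bonus_per_rejection: int = 5,
    hard_ceiling: Optional[int] = None,
) -> int:
    """Base cap plus a fixed bonus for each judge rejection so far.

    hard_ceiling (hypothetical) bounds the total so that repeated
    rejections cannot extend a task indefinitely.
    """
    cap = base_cap + rejections * bonus_per_rejection
    if hard_ceiling is not None:
        cap = min(cap, hard_ceiling)
    return cap


# T-004: one rejection, 15 iterations used in total.
print(effective_iteration_cap(8, rejections=1))                         # 13
print(effective_iteration_cap(8, rejections=1, bonus_per_rejection=7))  # 15
```

With a base of 8 and `+5` per rejection, T-004 would have had 13 iterations available but used 15. So either the bonus needs to be larger (7 would have just sufficed here), or (1) should be paired with (2) to shorten the tail.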

## Related

- `tilth/loop.py:_run_task` (the iteration cap, the judge-feedback message)
- `tilth/prompts/system.md` (the self-review reminder)
- `.env.example` (`TILTH_MAX_ITERATIONS_PER_TASK`)
