What
When the judge rejects a task, the worker fixes the issue quickly, then burns several more iterations on redundant verification before declaring done again. With the previous TILTH_MAX_ITERATIONS_PER_TASK=8, a single judge rejection late in a task could by itself exhaust the iteration cap.
Evidence from the 5/1 demo
sessions/20260501-143858-f8e7a3/ — T-004 took 15 iterations total. Anatomy:
| Iter | Activity |
| --- | --- |
| 1–6 | Implement `cmd_list`, run pytest |
| 7 | Validators pass → judge rejects (`lines[i-1]` indexing bug) |
| 8 | Re-read file. Reasoning trace cleanly identifies the bug. |
| 9 | `edit_file`: fix lands. |
| 10 | `bash`: mktemp+heredoc smoke test (`# Groceries` + open + done) |
| 11 | `bash`: same smoke test, minor variation |
| 12 | `bash`: same smoke test, third variation |
| 13 | `bash`: `pytest tests/test_t004_list.py` |
| 14 | `bash`: `pytest` across all task test files |
| 15 | No tool calls → validators run → judge accepts |
Iters 10–14 are five mostly-redundant verifications. The fix at iter 9 was already correct. With the prior cap of 8 iterations, this task would have failed iter_cap at iter 8 — before the worker could even read the file in response to the rejection.
The cap has now been bumped to 32, which fixes the immediate symptom (the issue with b45a6f5 / ec5486d).
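The verification tail can be counted mechanically from the trace. The sketch below transcribes iterations 8 to 15 from the table; the `read_file` label and the `verification_tail` helper are illustrative assumptions, not names taken from tilth.

```python
# Iterations 8-15 of T-004, transcribed from the table above.
# "read_file" is a placeholder label: the table only says "Re-read file".
# None means the worker made no tool call in that iteration.
TRACE = [
    (8, "read_file"),
    (9, "edit_file"),
    (10, "bash"),
    (11, "bash"),
    (12, "bash"),
    (13, "bash"),
    (14, "bash"),
    (15, None),
]


def verification_tail(trace, edit_tool="edit_file"):
    """Return the iterations after the last edit that still made a tool call."""
    edits = [it for it, tool in trace if tool == edit_tool]
    if not edits:
        return []
    last_edit = max(edits)
    return [it for it, tool in trace if it > last_edit and tool is not None]


tail = verification_tail(TRACE)
total_iterations = TRACE[-1][0]

print("verification tail:", tail)
for cap in (8, 32):
    # Only compares the total against each cap; it does not model how
    # tilth decides exactly when iter_cap fires.
    print(f"cap={cap}: fits={total_iterations <= cap}")
```

Run against this trace, the tail is the five `bash` iterations that follow the `edit_file` at iter 9, and the 15-iteration total exceeds the old cap of 8 while fitting under 32.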
Why this happens
Two probable contributors:
- tilth/prompts/system.md includes a self-review reminder:
  > Before your final summary, ask yourself: "What evidence do I have that the acceptance criteria are met?" If the answer is only "the code looks right," run the tests one more time.

  After a judge burn, the worker is incentivised to over-verify.
- The judge feedback message in loop.py:552 says "Stop calling tools and respond with a summary only when the issue is resolved." but doesn't give a cheap signal for what counts as resolved beyond "the issue."
Proposed directions (not mutually exclusive)
1. Per-rejection iteration bonus rather than a single larger global cap, e.g. +5 iters per judge rejection. Keeps the global cap honest for well-behaved tasks; absorbs the verification tail when something legitimately surprises the worker.
2. Tighten the judge-rejection feedback message with explicit guidance: "Fix the issue and re-run only the acceptance tests. Do not add ad-hoc smoke tests. Declare done when the acceptance tests pass with the fix."
3. Detect verification loops. If the worker has run pytest twice with the same exit status and no intervening edit, inject "you've already verified; declare done" as feedback. (Probably too clever / brittle.)
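The loop-detection rule can be stated precisely in a few lines. This is a minimal sketch under assumptions: the `Step` record, its field names, and the "command starts with `pytest`" heuristic are hypothetical and do not reflect how tilth actually records tool calls.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Step:
    """Hypothetical record of one tool call (not tilth's real type)."""
    tool: str                          # e.g. "bash" or "edit_file"
    command: str = ""                  # shell command for bash steps
    exit_status: Optional[int] = None  # exit status for bash steps


def is_pytest_run(step: Step) -> bool:
    # Crude heuristic: a bash step whose command begins with "pytest".
    return step.tool == "bash" and step.command.strip().startswith("pytest")


def in_verification_loop(steps: Sequence[Step]) -> bool:
    """True if the two most recent pytest runs share an exit status
    and no edit_file step occurred at or after the earlier of them."""
    latest_status = None
    seen_latest = False
    for step in reversed(steps):
        if step.tool == "edit_file":
            # An edit changes the code under test, so re-running is justified.
            return False
        if is_pytest_run(step):
            if not seen_latest:
                latest_status = step.exit_status
                seen_latest = True
            else:
                return step.exit_status == latest_status
    return False


# Illustrative history; the exit statuses are made up for the example.
history = [
    Step("edit_file"),
    Step("bash", "pytest tests/test_t004_list.py", 0),
    Step("bash", "pytest", 0),
]
print(in_verification_loop(history))
```

The prefix heuristic misses ad-hoc smoke tests like the mktemp+heredoc runs and anything invoked as `python -m pytest`, which illustrates why this direction is likely to be brittle.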
(1) is the most defensible — it acknowledges that a judge rejection legitimately extends the work envelope, without normalising the long tail for everyone.
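A per-rejection bonus could look roughly like the following. This is a sketch only: `effective_cap` and `cap_reached` are hypothetical helpers, the +5 default is the example figure from the list above, and the assumption that iterations are counted from 1 may not match `_run_task`.

```python
def effective_cap(base_cap: int, rejections: int, bonus: int = 5) -> int:
    """Iteration budget after granting `bonus` extra iterations per judge rejection."""
    if base_cap < 0 or rejections < 0 or bonus < 0:
        raise ValueError("base_cap, rejections and bonus must be non-negative")
    return base_cap + bonus * rejections


def cap_reached(iteration: int, base_cap: int, rejections: int, bonus: int = 5) -> bool:
    """True once the 1-based iteration number exceeds the current budget."""
    return iteration > effective_cap(base_cap, rejections, bonus)


# A well-behaved task keeps the base budget; one rejection extends it by the bonus.
print(effective_cap(8, 0))   # 8
print(effective_cap(8, 1))   # 13
print(cap_reached(15, 8, 1))  # True
```

With the figures from the T-004 session, a base of 8 plus one +5 bonus gives 13, still short of the 15 iterations that run used. The bonus would therefore need to be larger, or be paired with the tighter feedback message, to have covered that case under the old base cap.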
Related
tilth/loop.py:_run_task (the iteration cap, the judge-feedback message)
tilth/prompts/system.md (the self-review reminder)
.env.example (TILTH_MAX_ITERATIONS_PER_TASK)