Judge rejections trigger long re-verification tails after the fix is already in #11

@samkeen

What

When the judge rejects a task, the worker fixes the issue quickly, then burns several more iterations on redundant verification before declaring done again. With the previous TILTH_MAX_ITERATIONS_PER_TASK=8, a single judge rejection late in a task could by itself push the run past the cap.

Evidence from the 5/1 demo

sessions/20260501-143858-f8e7a3/ — T-004 took 15 iterations total. Anatomy:

| Iter | Activity |
|------|----------|
| 1–6 | Implement cmd_list, run pytest |
| 7 | Validators pass → judge rejects (lines[i-1] indexing bug) |
| 8 | Re-read file. Reasoning trace cleanly identifies the bug. |
| 9 | edit_file: fix lands. |
| 10 | bash: mktemp+heredoc smoke test (# Groceries + open + done) |
| 11 | bash: same smoke test, minor variation |
| 12 | bash: same smoke test, third variation |
| 13 | bash: pytest tests/test_t004_list.py |
| 14 | bash: pytest across all task test files |
| 15 | No tool calls → validators run → judge accepts |

Iters 10–14 are five mostly redundant verifications: the fix at iter 9 was already correct. With the prior cap of 8 iterations, this task would have failed with iter_cap at iter 8, before the worker could even read the file in response to the rejection.

The cap has now been bumped to 32, which fixes the immediate symptom (the issue with b45a6f5 / ec5486d).

Why this happens

Two probable contributors:

  1. tilth/prompts/system.md includes a self-review reminder:

    Before your final summary, ask yourself: "What evidence do I have that the acceptance criteria are met?" If the answer is only "the code looks right," run the tests one more time.

    After a judge burn, the worker is incentivised to over-verify.

  2. The judge feedback message at tilth/loop.py:552 says "Stop calling tools and respond with a summary only when the issue is resolved." It gives no cheap signal for what counts as resolved beyond "the issue."

Proposed directions (not mutually exclusive)

  1. Per-rejection iteration bonus instead of a global cap. e.g. +5 iters per judge rejection. Keeps the global cap honest for well-behaved tasks; absorbs the verification tail when something legitimately surprises the worker.
  2. Tighten the judge-rejection feedback message with explicit guidance: "Fix the issue and re-run only the acceptance tests. Do not add ad-hoc smoke tests. Declare done when the acceptance tests pass with the fix."
  3. Detect verification loops. If the worker has run pytest twice with the same exit status and no intervening edit, inject "you've already verified — declare done" as feedback. (Probably too clever / brittle.)
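
To make direction 3 concrete, here is a minimal sketch of such a detector. Everything in it is hypothetical: the class name, the `observe` signature, and the idea that the loop can see each tool call's name, command string, and exit status are assumptions for illustration, not the current tilth API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VerificationLoopDetector:
    """Hypothetical helper: flags repeated pytest runs with no edit in between."""

    last_exit_status: Optional[int] = None
    repeats: int = 0  # consecutive pytest runs with the same status since the last edit

    def observe(self, tool: str, command: str = "",
                exit_status: Optional[int] = None) -> bool:
        """Record one tool call; return True when the nudge should be injected."""
        if tool == "edit_file":
            # An edit invalidates earlier verification, so start counting again.
            self.last_exit_status = None
            self.repeats = 0
            return False
        if tool == "bash" and "pytest" in command:
            if exit_status == self.last_exit_status:
                self.repeats += 1
            else:
                self.last_exit_status = exit_status
                self.repeats = 1
            return self.repeats >= 2
        # Any other tool call (reads, ad-hoc smoke tests) is ignored.
        return False


# Replaying the tail of T-004 (the iter 14 command string is assumed):
detector = VerificationLoopDetector()
detector.observe("edit_file")                                            # iter 9
first = detector.observe("bash", "pytest tests/test_t004_list.py", 0)    # iter 13
second = detector.observe("bash", "pytest tests/", 0)                    # iter 14
print(first, second)  # False True
```

Note that this sketch would not have caught iters 10–12, because the smoke tests never invoke pytest. That gap is one illustration of why this direction looks brittle.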

(1) is the most defensible — it acknowledges that a judge rejection legitimately extends the work envelope, without normalising the long tail for everyone.
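
A rough sketch of how (1) could work, assuming a base cap of 8 and the +5 bonus mentioned above. The `IterationBudget` class and its methods are hypothetical names, and treating the cap as "fail once the 1-based iteration number exceeds it" is an assumption about how the loop counts.

```python
from dataclasses import dataclass


@dataclass
class IterationBudget:
    """Hypothetical per-task budget: a base cap plus a bonus per judge rejection."""

    base_cap: int = 8             # the earlier TILTH_MAX_ITERATIONS_PER_TASK value
    bonus_per_rejection: int = 5  # the "+5 iters" figure from direction 1
    rejections: int = 0

    def record_rejection(self) -> None:
        """Call once each time the judge rejects the task."""
        self.rejections += 1

    @property
    def cap(self) -> int:
        """Current cap: base plus the accumulated rejection bonus."""
        return self.base_cap + self.bonus_per_rejection * self.rejections

    def exhausted(self, iteration: int) -> bool:
        """True once the 1-based iteration number exceeds the current cap."""
        return iteration > self.cap


budget = IterationBudget()
budget.record_rejection()    # T-004: judge rejects at iter 7
print(budget.cap)            # 13
print(budget.exhausted(9))   # False: the fix at iter 9 fits
print(budget.exhausted(15))  # True: the full 15-iteration run does not
```

With these particular numbers, one rejection raises the cap to 13. That covers the fix at iter 9 but not all 15 iterations T-004 actually used, so the bonus size would need tuning, or (1) would need to be paired with the tighter feedback message in (2).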

Related

  • tilth/loop.py:_run_task (the iteration cap, the judge-feedback message)
  • tilth/prompts/system.md (the self-review reminder)
  • .env.example (TILTH_MAX_ITERATIONS_PER_TASK)
