What happened
In demo session 20260529-113158-6b9b12, the seeder produced a cross-task contradiction on the same call, main([]):
| Task |
Contract for main([]) |
| T-001 AC |
returns 0 — seed test test_main_returns_zero asserts main([]) == 0 |
| T-002 AC |
returns a non-zero exit code and prints usage to stderr |
T-001 pins a transitional stub behavior (main([]) == 0) as a permanent acceptance criterion with a regression test. T-002 was always going to supersede that behavior. The two are mutually exclusive.
Why it forced a contract rewrite
validators.py:run_pytest ratchets — it runs completed tasks' tests as a regression guard (commit d2adb33). So during T-002, T-001's now-stale test_main_returns_zero runs and fails. The worker had exactly two escapes:
- Break T-002's AC (make
main([]) return 0), or
- Rewrite T-001's seed test.
It chose (2): flipped test_main_returns_zero → test_main_returns_nonzero_on_no_args (assert main([]) != 0). Validators then passed and the task was accepted.
The damage
T-001's stated acceptance criterion (main([]) returns 0) is now silently contradicted by T-001's own rewritten test. The run shows 3/3 green; the T-001 contract in prd.json is false relative to the code. This is the cross-cutting friction from proposals/frictions-2026-05-26.md: "we've seen both tests pass and the judge accept with broken work." There's now an AC↔test mismatch on a completed task that nothing flags.
Sub-finding: the evaluator overrode its own hard rule
tilth/prompts/judge.md makes cross-task seed-test edits (tests/test_t<NNN>_*.py, NNN ≠ this task) a hard reject, no judgement. In this session the evaluator:
- rejected a
test_t003_persist.py edit (future task → correct), but
- accepted the
test_t001_scaffold.py edit (completed task), improvising a completed-vs-future distinction the rule does not contain — and even instructed it at iter 31 ("Only the test_t001_scaffold.py change ... should remain in this diff").
The reasoning is pragmatic (a completed task's test legitimately invalidated by evolved behavior ≠ tampering with a future task's contract), but it means the gaming backstop isn't actually hard, and the precedent is "you may rewrite an earlier task's test if you tell a good story."
Root cause + fix surface
This is F1/F2 (seed contradiction) + F9 (no cross-task awareness), turned into a forcing function by the validator ratchet. The fix belongs upstream in the seeder, not in weakening the cross-task rule:
- The seeder should not pin transitional/stub behavior (
main([]) == 0) as a permanent regression assertion. A stub behavior that a later task supersedes should either not be asserted in a ratcheted test, or the superseding task (T-002) should explicitly acknowledge it changes T-001's main([]) contract.
- Weakening
judge.md's cross-task hard-reject to bless "completed-task test updates" is the wrong move — it invites the gaming the rule exists to stop.
Open tension (deliberately not resolved here)
Keeping the cross-task rule truly hard means a contradictory seed like this one would deadlock (worker can't pass T-002 without touching T-001's test; judge won't allow it → iter cap). That's arguably the correct signal (bad seed fails loudly instead of silently passing broken work) — but it collides with F7 (no escape from a broken task) and the deferred halt-authority question. This interaction should inform the halt design when it's revisited, using the ledger data v1 sessions now produce.
Notes
Related
What happened
In demo session
20260529-113158-6b9b12, the seeder produced a cross-task contradiction on the same call,main([]):main([])0— seed testtest_main_returns_zeroassertsmain([]) == 0T-001 pins a transitional stub behavior (
main([]) == 0) as a permanent acceptance criterion with a regression test. T-002 was always going to supersede that behavior. The two are mutually exclusive.Why it forced a contract rewrite
validators.py:run_pytestratchets — it runs completed tasks' tests as a regression guard (commit d2adb33). So during T-002, T-001's now-staletest_main_returns_zeroruns and fails. The worker had exactly two escapes:main([])return 0), orIt chose (2): flipped
test_main_returns_zero→test_main_returns_nonzero_on_no_args(assert main([]) != 0). Validators then passed and the task was accepted.The damage
T-001's stated acceptance criterion (
main([])returns 0) is now silently contradicted by T-001's own rewritten test. The run shows 3/3 green; the T-001 contract inprd.jsonis false relative to the code. This is the cross-cutting friction fromproposals/frictions-2026-05-26.md: "we've seen both tests pass and the judge accept with broken work." There's now an AC↔test mismatch on a completed task that nothing flags.Sub-finding: the evaluator overrode its own hard rule
tilth/prompts/judge.mdmakes cross-task seed-test edits (tests/test_t<NNN>_*.py, NNN ≠ this task) a hard reject, no judgement. In this session the evaluator:test_t003_persist.pyedit (future task → correct), buttest_t001_scaffold.pyedit (completed task), improvising a completed-vs-future distinction the rule does not contain — and even instructed it at iter 31 ("Only the test_t001_scaffold.py change ... should remain in this diff").The reasoning is pragmatic (a completed task's test legitimately invalidated by evolved behavior ≠ tampering with a future task's contract), but it means the gaming backstop isn't actually hard, and the precedent is "you may rewrite an earlier task's test if you tell a good story."
Root cause + fix surface
This is F1/F2 (seed contradiction) + F9 (no cross-task awareness), turned into a forcing function by the validator ratchet. The fix belongs upstream in the seeder, not in weakening the cross-task rule:
main([]) == 0) as a permanent regression assertion. A stub behavior that a later task supersedes should either not be asserted in a ratcheted test, or the superseding task (T-002) should explicitly acknowledge it changes T-001'smain([])contract.judge.md's cross-task hard-reject to bless "completed-task test updates" is the wrong move — it invites the gaming the rule exists to stop.Open tension (deliberately not resolved here)
Keeping the cross-task rule truly hard means a contradictory seed like this one would deadlock (worker can't pass T-002 without touching T-001's test; judge won't allow it → iter cap). That's arguably the correct signal (bad seed fails loudly instead of silently passing broken work) — but it collides with F7 (no escape from a broken task) and the deferred halt-authority question. This interaction should inform the halt design when it's revisited, using the ledger data v1 sessions now produce.
Notes
uv initcollateral; this is transitional-stub-as-regression-test + ratchet).Related
20260529-113158-6b9b12