Seeder anti-pattern: transitional stub behavior pinned as a permanent regression test forces a later task to rewrite a completed task's seed test #23

@samkeen

What happened

In demo session 20260529-113158-6b9b12, the seeder produced a cross-task contradiction on the same call, main([]):

Task    Contract for main([])
T-001   AC: returns 0 (seed test test_main_returns_zero asserts main([]) == 0)
T-002   AC: returns a non-zero exit code and prints usage to stderr

T-001 pins a transitional stub behavior (main([]) == 0) as a permanent acceptance criterion with a regression test. T-002 was always going to supersede that behavior. The two are mutually exclusive.
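
To make the contradiction concrete, here is a minimal sketch of the two assertions side by side. main_stub and main_with_usage are hypothetical stand-ins for the T-001 and T-002 implementations, not code from the session, and the usage string and exit code 2 are assumptions. The point is only that no single main can satisfy both checks.

```python
import sys


def main_stub(argv):
    """Hypothetical T-001 scaffold: does nothing and reports success."""
    return 0


def main_with_usage(argv):
    """Hypothetical T-002 behavior: calling with no arguments is a usage error."""
    if not argv:
        print("usage: prog <command>", file=sys.stderr)  # assumed usage text
        return 2  # assumed non-zero exit code
    return 0


def satisfies_t001(main):
    # T-001 seed assertion: main([]) == 0
    return main([]) == 0


def satisfies_t002(main):
    # T-002 AC, exit-code half only: main([]) is non-zero
    return main([]) != 0


for candidate in (main_stub, main_with_usage):
    # Each candidate passes exactly one of the two checks, never both.
    print(candidate.__name__, satisfies_t001(candidate), satisfies_t002(candidate))
```

Whichever implementation is in place, one of the two checks fails, which is why the stub assertion cannot survive as a ratcheted regression test once T-002 lands.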

Why it forced a contract rewrite

validators.py:run_pytest ratchets — it runs completed tasks' tests as a regression guard (commit d2adb33). So during T-002, T-001's now-stale test_main_returns_zero runs and fails. The worker had exactly two escapes:

  1. Break T-002's AC (make main([]) return 0), or
  2. Rewrite T-001's seed test.

It chose (2): flipped test_main_returns_zero → test_main_returns_nonzero_on_no_args (assert main([]) != 0). Validators then passed and the task was accepted.
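
A rough sketch of the ratchet described above, assuming seed tests follow the tests/test_t<NNN>_*.py naming used in this repo. select_ratchet_tests is a hypothetical helper for illustration and is not the actual run_pytest code; test_t002_usage.py is an invented file name.

```python
import re

# Assumed naming convention: tests/test_t<NNN>_<name>.py
SEED_TEST = re.compile(r"^tests/test_t(\d+)_[^/]*\.py$")


def select_ratchet_tests(all_tests, completed, current):
    """Return test files for the current task plus every completed task."""
    wanted = set(completed) | {current}
    selected = []
    for path in all_tests:
        match = SEED_TEST.match(path)
        if match and int(match.group(1)) in wanted:
            selected.append(path)
    return selected


tests = [
    "tests/test_t001_scaffold.py",
    "tests/test_t002_usage.py",  # hypothetical file name
    "tests/test_t003_persist.py",
]

# While T-002 is in progress and T-001 is completed, T-001's test still runs.
print(select_ratchet_tests(tests, completed=[1], current=2))
```

Under this model, T-001's stale assertion is in the selected set for every later task, so the contradiction surfaces as soon as T-002 changes main([]).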

The damage

T-001's stated acceptance criterion (main([]) returns 0) is now silently contradicted by T-001's own rewritten test. The run shows 3/3 green; the T-001 contract in prd.json is false relative to the code. This is the cross-cutting friction from proposals/frictions-2026-05-26.md: "we've seen both tests pass and the judge accept with broken work." There's now an AC↔test mismatch on a completed task that nothing flags.

Sub-finding: the evaluator overrode its own hard rule

tilth/prompts/judge.md makes cross-task seed-test edits (tests/test_t<NNN>_*.py, NNN ≠ this task) a hard reject, no judgement. In this session the evaluator:

  • rejected a test_t003_persist.py edit (future task → correct), but
  • accepted the test_t001_scaffold.py edit (completed task), improvising a completed-vs-future distinction the rule does not contain. At iter 31 it even directed the worker to keep that edit ("Only the test_t001_scaffold.py change ... should remain in this diff").

The reasoning is pragmatic (a completed task's test legitimately invalidated by evolved behavior ≠ tampering with a future task's contract), but it means the gaming backstop isn't actually hard, and the precedent is "you may rewrite an earlier task's test if you tell a good story."
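
For comparison, the rule as written is mechanical enough to check without judgement. The following is a minimal sketch, assuming the diff is available as a list of changed paths; cross_task_edits is a hypothetical helper, not something judge.md or the evaluator currently implements, and the src/cli.py and test_t002_usage.py paths are invented.

```python
import re

# Pattern from the rule: tests/test_t<NNN>_*.py
SEED_TEST = re.compile(r"^tests/test_t(\d+)_[^/]*\.py$")


def cross_task_edits(changed_paths, current_task):
    """Return changed seed-test files whose task number differs from current_task."""
    offenders = []
    for path in changed_paths:
        match = SEED_TEST.match(path)
        if match and int(match.group(1)) != current_task:
            offenders.append(path)
    return offenders


diff = [
    "src/cli.py",                   # hypothetical source file
    "tests/test_t001_scaffold.py",  # completed task
    "tests/test_t002_usage.py",     # current task (hypothetical name)
    "tests/test_t003_persist.py",   # future task
]

# Both the completed-task and the future-task edits are flagged for T-002.
print(cross_task_edits(diff, current_task=2))
```

The check has no notion of completed versus future tasks, which matches the rule text: any non-empty result would be a reject.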

Root cause + fix surface

This is F1/F2 (seed contradiction) + F9 (no cross-task awareness), turned into a forcing function by the validator ratchet. The fix belongs upstream in the seeder, not in weakening the cross-task rule:

  • The seeder should not pin transitional/stub behavior (main([]) == 0) as a permanent regression assertion. A stub behavior that a later task supersedes should either not be asserted in a ratcheted test, or the superseding task (T-002) should explicitly acknowledge it changes T-001's main([]) contract.
  • Weakening judge.md's cross-task hard-reject to bless "completed-task test updates" is the wrong move — it invites the gaming the rule exists to stop.
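
One way the seeder-side fix could look is a lint pass over the seeded tasks before any worker runs. This is a sketch only: the task shape below (id, acs with call and expect, and an optional supersedes list) is a hypothetical schema invented for illustration and is not the real prd.json format.

```python
def find_unacknowledged_conflicts(tasks):
    """Flag calls whose expected result changes across tasks without a supersedes note."""
    seen = {}  # call -> (task id, expectation) from the most recent task asserting it
    conflicts = []
    for task in tasks:
        for ac in task["acs"]:
            call, expect = ac["call"], ac["expect"]
            if call in seen:
                prev_id, prev_expect = seen[call]
                acknowledged = prev_id in task.get("supersedes", [])
                if prev_expect != expect and not acknowledged:
                    conflicts.append((prev_id, task["id"], call))
            seen[call] = (task["id"], expect)
    return conflicts


seed = [
    {"id": "T-001", "acs": [{"call": "main([])", "expect": "== 0"}]},
    {"id": "T-002", "acs": [{"call": "main([])", "expect": "!= 0"}]},
]
print(find_unacknowledged_conflicts(seed))

# If T-002 explicitly declares that it changes T-001's contract, nothing is flagged.
seed[1]["supersedes"] = ["T-001"]
print(find_unacknowledged_conflicts(seed))
```

A check like this would catch the T-001/T-002 pair at seed time, leaving the cross-task hard-reject in judge.md untouched.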

Open tension (deliberately not resolved here)

Keeping the cross-task rule truly hard means a contradictory seed like this one would deadlock (worker can't pass T-002 without touching T-001's test; judge won't allow it → iter cap). That's arguably the correct signal (bad seed fails loudly instead of silently passing broken work) — but it collides with F7 (no escape from a broken task) and the deferred halt-authority question. This interaction should inform the halt design when it's revisited, using the ledger data v1 sessions now produce.
