Demo T-001 seed is a recurring F1 instance (description authorises a command whose collateral the AC doesn't ratify) #22

@samkeen

What this is

A tracker for the F1 friction ("the seed is the contract but nothing validates the seed", proposals/frictions-2026-05-26.md) as it shows up concretely in the demo's T-001. Not a proposal to patch the seeder for uv init specifically — see Explicitly out of scope below.

The shape

T-001 ("Bootstrap project skeleton"):

  • Description authorises uv init to create pyproject.toml.
  • uv init also creates README.md, .python-version, and a root main.py stub — none acknowledged by the description.
  • Acceptance criteria require todo_cli/__main__.py as the entry point (from todo_cli.__main__ import main succeeds). They say nothing about the collateral files, and nothing about cleaning up the now-orphaned root main.py.

So the seed relies on the worker inferring that the auto-generated root main.py is dead weight to remove, and that README.md / .python-version are fine to keep. The contract is underspecified, and nothing validates it.
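The gap can be made concrete as a set difference. The sketch below is illustrative only: `SeedTask`, its fields, and `unratified_collateral` are hypothetical names, not tilth's actual schema, and the file lists are copied from the bullets above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SeedTask:
    """Hypothetical, simplified view of a seeded task (not tilth's real schema)."""
    task_id: str
    authorised_outputs: frozenset  # files the description says the command creates
    ac_paths: frozenset            # files the acceptance criteria mention


def unratified_collateral(task: SeedTask, created: set) -> set:
    """Files a command created that neither the description nor the AC account for."""
    return set(created) - task.authorised_outputs - task.ac_paths


# T-001 as described in this issue.
t001 = SeedTask(
    task_id="T-001",
    authorised_outputs=frozenset({"pyproject.toml"}),
    ac_paths=frozenset({"todo_cli/__main__.py"}),
)

# Files this issue reports `uv init` creating.
uv_init_created = {"pyproject.toml", "README.md", ".python-version", "main.py"}

print(sorted(unratified_collateral(t001, uv_init_created)))
# ['.python-version', 'README.md', 'main.py']
```

All three leftover files are ones the worker currently has to guess about: two to keep, one to delete.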

Where seen

  • 20260526-173822-f411ea — original F1 example.
  • 20260527-174931-25fa4c — Phase 1 demo run. Evaluator rejected; worker thrashed iters 24-35 (incl. an attempted [tool.ruff] exclude at iter 33 — F12 gaming) before user interrupt.

Why we're tracking, not fixing in v1

The run-time symptom (the evaluator over-rejecting hygiene/scaffolding files) is being addressed separately by softening tilth/prompts/judge.md: the scope-creep bright line is redrawn at cross-task interference, and the evaluator applies judgement to hygiene/tooling/scaffolding collateral. That should let this seed succeed at run time with one corrective iteration (the worker removes the dead main.py and keeps README.md/.python-version).
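As a rough illustration of where the redrawn line sits, the sketch below sorts a changed path into one of three buckets. The judge is a prompt, not code, so this is only a mental model: `Call`, `classify`, and the idea that tasks own explicit path sets are assumptions, and `todo_cli/storage.py` is a made-up path standing in for another task's file.

```python
from enum import Enum


class Call(Enum):
    IN_SCOPE = "in scope"
    INTERFERENCE = "reject: cross-task interference"
    JUDGEMENT = "evaluator judgement"


def classify(path: str, own_paths: set, other_task_paths: set) -> Call:
    """Hypothetical decision rule: hard-reject only on cross-task interference."""
    if path in own_paths:
        return Call.IN_SCOPE
    if path in other_task_paths:
        return Call.INTERFERENCE
    # Hygiene / tooling / scaffolding collateral: no bright line, judge decides.
    return Call.JUDGEMENT


own = {"pyproject.toml", "todo_cli/__main__.py"}
others = {"todo_cli/storage.py"}  # hypothetical file belonging to a later task

for changed in ["pyproject.toml", "README.md", "main.py", "todo_cli/storage.py"]:
    print(f"{changed}: {classify(changed, own, others).value}")
```

Under this model README.md, .python-version, and the root main.py all land in the judgement bucket, which is where the evaluator can ask for main.py to be removed without rejecting the rest.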

But the seed itself is still unvalidated: the judge softening compensates at run time, but it doesn't make the contract self-consistent. The structural fix is contract negotiation at seed time (Anthropic harness pattern: evaluator reviews proposed AC before any code is written), which is v2 territory per proposals/v1-worker-evaluator-dialogue.md ("What v1 is deliberately not").
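A minimal sketch of what a seed-time review could look like, under heavy assumptions: `ProposedSeed`, its fields, and `review_seed` are hypothetical, and where the list of expected files comes from (evaluator knowledge, a dry run, something else) is deliberately left open, since hardcoding per-command knowledge is out of scope below.

```python
from dataclasses import dataclass, field


@dataclass
class ProposedSeed:
    """Hypothetical seed contract as the evaluator would see it before any code runs."""
    task_id: str
    expected_files: set                       # files the authorised commands are expected to create
    keep: set = field(default_factory=set)    # files the contract says stay
    remove: set = field(default_factory=set)  # files the contract says must be cleaned up


def review_seed(seed: ProposedSeed) -> list:
    """Return objections; an empty list means the contract accounts for every file."""
    objections = []
    for path in sorted(seed.expected_files - seed.keep - seed.remove):
        objections.append(f"{seed.task_id}: {path} is created but neither kept nor removed")
    for path in sorted(seed.keep & seed.remove):
        objections.append(f"{seed.task_id}: {path} is marked both keep and remove")
    return objections


created = {"pyproject.toml", "README.md", ".python-version", "main.py"}

# T-001 as currently seeded: only pyproject.toml is acknowledged.
as_seeded = ProposedSeed("T-001", expected_files=created, keep={"pyproject.toml"})

# One possible amended contract after negotiation.
amended = ProposedSeed(
    "T-001",
    expected_files=created,
    keep={"pyproject.toml", "README.md", ".python-version"},
    remove={"main.py"},
)

for objection in review_seed(as_seeded):
    print(objection)
print(review_seed(amended))  # []
```

The point is the ordering rather than the mechanism: objections are raised against the contract, and resolved, before the worker spends any iterations.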

Value as a fixture

This seed is a good reproducible anchor for:

  • Phase 3 (submit_case.work_arounds): does the worker now name the scaffolding collateral as a work-around, and does the evaluator accept/reject that specific claim instead of inferring it from the diff?
  • v2 contract negotiation: does the evaluator catch the description↔AC underspecification before the worker runs?

Explicitly out of scope

  • Hardcoding uv init awareness into tilth/seed/prompts.md (overfits — next time it's git init / npm init / cookiecutter).
  • A narrow "scaffolding-command discipline rule" in the seeder prompt — considered and dropped; the judge softening is the better intervention layer for v1, and seed validation is the real fix for v2.

Related
