
feat: gate pilot — LLM at batch decision boundaries#1

Draft
logpie wants to merge 2 commits into main from worktree-gate-pilot


@logpie logpie commented Mar 30, 2026

Summary

  • Adds a gate pilot that replaces simple replan() after batch failures with richer failure analysis, retry strategies, context routing, and skip recommendations
  • Stateless design: reads disk artifacts, returns structured JSON, orchestrator validates and applies
  • Falls back to replan() on failure. Set pilot: false in otto.yaml to disable. Zero overhead when no failures.
  • Codex-reviewed (3 rounds, APPROVED). 484 unit tests pass. 53 tasks across 18 e2e runs, 0 regressions.

Status: NOT validated on real failures

The pilot never fired during benchmarking because the coding agent passed all tasks. This is expected — the pilot's value is at i2p scale (8+ tasks, multiple batches, partial failures). Shipping as a safe no-op upgrade.

What's new

| File | What |
| --- | --- |
| otto/pilot.py | Gate pilot module: context assembly, LLM call, decision parsing |
| otto/orchestrator.py | Pilot at batch boundaries, fallback to replan, config flag |
| otto/runner.py | Pilot guidance separated in retry prompts |
| tests/test_pilot.py | 22 unit tests |
| tests/test_pilot_benchmark.py | 6 scenario tests |
| bench/pilot-benchmark.sh | A/B benchmark runner |
| bench/pressure/projects/pilot-test-* | 3 synthetic test projects |

Design docs

  • Spec: docs/superpowers/specs/2026-03-29-gate-pilot.md
  • Plan: docs/superpowers/plans/2026-03-29-gate-pilot-stage1.md
  • i2p spec: docs/superpowers/specs/2026-03-26-otto-intent-to-product.md

Test plan

  • 484 unit tests pass (0 new failures)
  • Codex adversarial review: 3 rounds, APPROVED
  • 18 e2e runs (6 projects × baseline/pilot): zero overhead, zero regressions
  • 4 real-world combined runs (ufo, humanize, camelcase, pre-commit)
  • 5-task greenfield run with merge conflicts
  • Pending: real-world pilot invocation. This needs a run where a batch has mixed pass/fail results and tasks still remaining. Monitor pilot.log on the next failure.

🤖 Generated with Claude Code

Adds a gate pilot that replaces the simple replan() call after batch
failures. The pilot reads disk artifacts (verify logs, QA verdicts,
task summaries, learnings) and returns structured decisions: failure
analysis, retry strategies, routed context for upcoming tasks, skip
recommendations, and re-batching.

Key design:
- Stateless: reconstructs context from files each invocation
- No telephone game: pilot makes system-level decisions, coding agents
  interpret their own errors directly
- Structured JSON output, orchestrator validates and applies
- Same model as planner (configurable via planner_model)
- Falls back to replan() on parse failure
- Config flag: pilot: false in otto.yaml to disable
- Zero overhead when no failures (pilot only invoked at batch boundary
  with failures + remaining tasks)
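
A minimal sketch of the validate-then-fall-back flow described above. The key names in `REQUIRED_KEYS`, the `decide` helper, and the callables passed to it are hypothetical; the PR does not spell out the actual schema or the orchestrator's function signatures.

```python
import json
from typing import Any, Callable, Optional

# Hypothetical schema: the real decision keys are not listed in this PR.
REQUIRED_KEYS = {"failure_analysis", "retry_strategies", "skip"}


def parse_pilot_decision(raw: str) -> Optional[dict[str, Any]]:
    """Return the decoded decision, or None when it is not a usable JSON object."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or not REQUIRED_KEYS <= data.keys():
        return None
    return data


def decide(
    failed: list[str],
    remaining: list[str],
    *,
    pilot_enabled: bool,
    call_pilot: Callable[[list[str], list[str]], str],
    replan: Callable[[list[str], list[str]], Any],
) -> Any:
    # Zero overhead: nothing runs unless there are failures AND remaining tasks.
    if not failed or not remaining:
        return None
    if pilot_enabled:
        decision = parse_pilot_decision(call_pilot(failed, remaining))
        if decision is not None:
            return decision
    # Pilot disabled, or its output failed validation: use the old path.
    return replan(failed, remaining)
```

Keeping the parser separate from the LLM call means the fallback path can be unit-tested without a model.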

Codex-reviewed: 3 rounds, all CRITICAL/IMPORTANT findings fixed, APPROVED.
Benchmark: 53 tasks across 18 runs, 0 regressions, 0 pilot overhead.
Pilot not yet validated on real failures — shipping as safe no-op upgrade
for i2p readiness. Will prove value at scale (5+ tasks, multiple batches).

New files:
- otto/pilot.py — context assembly, LLM invocation, decision parsing
- tests/test_pilot.py — 22 unit tests
- tests/test_pilot_benchmark.py — 6 scenario benchmark tests
- bench/pilot-benchmark.sh — A/B benchmark runner
- bench/pressure/projects/pilot-test-* — 3 synthetic test projects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

coderabbitai Bot commented Mar 30, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 6de94498-333a-48d3-b2f0-f8f8313d2328

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


Comment @coderabbitai help to get the list of available commands and usage tips.

Supersedes the gates + gate pilot approach. Simplified to 5 steps:
classify → plan → execute → verify → fix-or-replan.

Key decisions:
- Single-task is a valid plan (no forced decomposition)
- Product artifacts at project root (not otto_arch/)
- Persistent context.md accumulates across tasks
- Vertical slices over horizontal layers
- User journeys from user's perspective, not feature list
- Fix rounds continue while making progress, replan on planning failures
- Codex-reviewed design

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
logpie added a commit that referenced this pull request Apr 26, 2026
Closes 1 CRITICAL + 2 IMPORTANT findings from the mc-audit hunters.

CRITICAL closed:
- Logs appended unbounded into client state; 10MB log could lock the browser
  (codex-long-string-overflow #1). Replaced with bounded ring buffer
  (1MB max), separate unbounded totalBytes/totalLines counters, droppedBytes
  tracker for the elided-bytes header.
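
A rough Python sketch of the bounded-buffer idea; the actual change lives in the browser client, so this is an illustration rather than the shipped code. The class name, the byte-level trimming, and the reading of "1MB" as 1,000,000 bytes are assumptions.

```python
class BoundedLogBuffer:
    """Keep only the newest max_bytes of a log while counting everything seen."""

    def __init__(self, max_bytes: int = 1_000_000) -> None:
        self.max_bytes = max_bytes
        self._data = bytearray()
        self.total_bytes = 0    # unbounded counter: every byte ever appended
        self.total_lines = 0    # unbounded counter: every newline ever appended
        self.dropped_bytes = 0  # bytes elided from the front of the buffer

    def append(self, chunk: bytes) -> None:
        self.total_bytes += len(chunk)
        self.total_lines += chunk.count(b"\n")
        self._data += chunk
        overflow = len(self._data) - self.max_bytes
        if overflow > 0:
            # Drop the oldest bytes so memory stays bounded.
            del self._data[:overflow]
            self.dropped_bytes += overflow

    def text(self) -> str:
        # errors="replace" tolerates a multi-byte character cut by trimming.
        return self._data.decode("utf-8", errors="replace")
```

The counters stay exact even after trimming, which is what lets a header report how many bytes were elided.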

IMPORTANT closed:
- LogPane didn't distinguish "Live, polling" from "Final" state
  (codex-evidence-trustworthiness #5). Header now shows live polling cadence
  + last-update age, OR final size + line count.
- Missing log file rendered as generic "waiting for output"; fetch errors
  toasted while polling kept hammering (codex-error-empty-states #9). Now
  shows the path explicitly, plus an error state with Retry button. Polling
  pauses when log is missing/errored.

Polling resilience:
- Exponential backoff on consecutive errors: 1.2s → 2s → 5s → 15s → 30s.
- Resets to 1.2s on first successful read.
- Stops polling when run is terminal AND fully drained (uses new server
  `eof` field).
- Pauses polling when inspector is closed or tab is hidden
  (visibilitychange listener); resumes on visible.
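
The backoff schedule above can be sketched as a small state machine. The class and method names are hypothetical; the delays are the ones listed in this commit message.

```python
# Delay schedule from the commit message, in seconds.
BACKOFF_SECONDS = (1.2, 2.0, 5.0, 15.0, 30.0)


class PollBackoff:
    def __init__(self) -> None:
        self.consecutive_errors = 0

    def on_success(self) -> float:
        """A successful read resets the cadence to the base interval."""
        self.consecutive_errors = 0
        return BACKOFF_SECONDS[0]

    def on_error(self) -> float:
        """Each consecutive error moves one step along the schedule, capped at 30s."""
        self.consecutive_errors += 1
        index = min(self.consecutive_errors, len(BACKOFF_SECONDS) - 1)
        return BACKOFF_SECONDS[index]
```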

Server changes:
- `LogReadResult` gains `total_bytes` (file size at read time) and `eof`
  (whether next_offset == total_bytes after this slice). All three
  constructor sites populated. Lets the client render "Final · {size}"
  headers and detect drain without a second HEAD request.
- `LogsResponse` TS type updated.
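
A sketch of how a slice read might populate the two new fields. `total_bytes`, `eof`, and `next_offset` come from the commit message; the `text` field, the function name, and the default `limit` are assumptions.

```python
import os
from dataclasses import dataclass


@dataclass
class LogReadResult:
    text: str
    next_offset: int
    total_bytes: int  # file size at read time
    eof: bool         # True when next_offset == total_bytes


def read_log_slice(path: str, offset: int, limit: int = 65536) -> LogReadResult:
    with open(path, "rb") as fh:
        fh.seek(offset)
        data = fh.read(limit)
        # Stat after reading so a concurrently growing file cannot make
        # next_offset exceed the reported size.
        total = os.fstat(fh.fileno()).st_size
    next_offset = offset + len(data)
    return LogReadResult(
        text=data.decode("utf-8", errors="replace"),
        next_offset=next_offset,
        total_bytes=total,
        eof=next_offset == total,
    )
```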

Tests:
- `tests/browser/test_log_buffering.py` — 7 paired Playwright tests:
  - 5MB log renders <1.5MB DOM with elided-bytes header
  - Live state + polling header for active runs
  - Final state + line count for terminal runs
  - Missing-file path display + paused polling
  - Error backoff schedule (gap_first ≈ 2s, gap_second ≈ 5s)
  - Polling stops on inspector close
  - Polling stops on tab hidden
- Browser suite: 15 passed (7 new + 7 cluster A + 1 smoke)
- Default suite: 1076 passed (no regressions)
- npm run web:typecheck: clean
- npm run web:build: 277.82 kB JS / 33.34 kB CSS

Note: pre-existing basedpyright warnings in service.py around `_record_event`
calls (lines 393-544) are not introduced by this commit; they predate
cluster B and are flagged because basedpyright now analyzes the file when
it's touched.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request Apr 26, 2026
CLUSTER C — bundle/build integrity (closes 1 CRITICAL + 4 IMPORTANT + 1 NOTE):

CRITICAL closed:
- `app.py:35`: `otto web` served checked-in bundle without verifying it
  matches source — developer skips `npm run web:build` → silent stale UI
  (codex-packaging-bundle #1).

IMPORTANT closed:
- Python build never built frontend; `pip install -e .` shipped whatever
  static was in tree (#2). Vite plugin emits `build-stamp.json` on every
  build with source-hash + timestamp + git commit; FastAPI startup verifies
  freshness via `verify_bundle_freshness()`.
- `web:build` ran only Vite; `web:typecheck` advisory; no CI gates (#3).
  New `web:verify` script chains typecheck → build → committed-check.
- Default cache headers underused hashed assets (#4). New
  `_CacheHeaderStaticFiles` subclass: `no-store` for shell + index.html,
  `public, max-age=31536000, immutable` for `/static/assets/*`.
- Server didn't validate `index.html` referenced JS/CSS exist (#5).
  Startup parses index.html and asserts every referenced static path
  resolves; missing → fail-fast with `npm run web:build` guidance.
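
One way the source-hash freshness check could work, as a sketch only. The `source_hash` key, both function names, and the hashing scheme are assumptions; the real `verify_bundle_freshness()` and the fields of `build-stamp.json` are not shown in this commit message.

```python
import hashlib
import json
from pathlib import Path


def hash_sources(src_dir: Path) -> str:
    """Hash relative paths and contents of every file, in a stable order."""
    digest = hashlib.sha256()
    for path in sorted(p for p in src_dir.rglob("*") if p.is_file()):
        digest.update(path.relative_to(src_dir).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


def check_bundle_fresh(src_dir: Path, stamp_path: Path) -> tuple[bool, str]:
    if not stamp_path.is_file():
        return False, "build stamp missing; run `npm run web:build`"
    stamp = json.loads(stamp_path.read_text())
    # "source_hash" is a hypothetical key name for the stamped hash.
    if stamp.get("source_hash") != hash_sources(src_dir):
        return False, "bundle is stale; run `npm run web:build`"
    return True, "ok"
```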

NOTE closed:
- `[tool.setuptools.package-data]` was flat (`static/*`, `static/assets/*`);
  future nested assets (fonts/, images/, locale/) would silently miss
  wheels. Now `static/**/*` recursive glob.

Files: otto/web/bundle.py (new, 263 lines), otto/web/client/vite.config.ts
(build-stamp emitter), scripts/build_stamp.py (CLI for manual stamp),
scripts/check_bundle_committed.py (git-diff guard for CI), package.json
(web:verify script), pyproject.toml (recursive package_data), otto/web/app.py
(verify_bundle_freshness call + _CacheHeaderStaticFiles).
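
The cache-header policy can be pictured as a pure path-to-header function. This is a sketch, not the `_CacheHeaderStaticFiles` implementation; the function name and the exact shell paths are assumptions.

```python
from typing import Optional

IMMUTABLE = "public, max-age=31536000, immutable"  # one year, for hashed assets


def cache_control_for(path: str) -> Optional[str]:
    """Pick a Cache-Control value; None means leave the default untouched."""
    if path.startswith("/static/assets/"):
        # Hashed filenames change whenever content changes, so cache forever.
        return IMMUTABLE
    if path in ("/", "/index.html", "/static/index.html"):
        # The shell must always be re-fetched so it points at current assets.
        return "no-store"
    return None
```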

Tests: tests/test_web_bundle_freshness.py (5 tests) + tests/test_web_cache_headers.py
(2 tests, 8 actual checks). 15 new server-layer tests, all green.

CLUSTER D — history pagination (closes 1 CRITICAL + 2 IMPORTANT):

CRITICAL closed:
- `total_rows=247` displayed but only first page rendered; power user with
  200+ runs stuck on page 1 (heavy-user, codex-state-management #6,
  codex-long-string-overflow #3).

IMPORTANT closed:
- `/api/state` accepted `history_page` + `page_size`; client never sent
  `history_page` and rendered no controls.
- Page-size selector now lives in the UI (10/25/50/100, default 25);
  server clamps to [1, 200] to refuse stale URLs requesting unbounded slices.

Implementation:
- `MissionControlFilters.history_page_size: int | None` for per-request
  override (server clamps to safe range).
- App.tsx History pane: pagination footer (Page N of M · X runs · ←/→ ·
  jump-to + page-size selector). URL persists `hp` + `ps` query params.
- Filter changes reset page to 1.
- Stale deep-link `?hp=99` (out-of-range) → "Page 99 doesn't exist; jump
  to page 1" with reset button.
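
A small sketch of the clamping and out-of-range handling described above. The function names and the returned dictionary shape are hypothetical; the default of 25 and the [1, 200] clamp are taken from this commit message.

```python
import math
from typing import Optional


def clamp_page_size(requested: Optional[int], default: int = 25,
                    lo: int = 1, hi: int = 200) -> int:
    """Fall back to the default, and refuse unbounded or non-positive sizes."""
    if requested is None:
        return default
    return max(lo, min(hi, requested))


def resolve_page(total_rows: int, page: int, page_size: Optional[int] = None) -> dict:
    size = clamp_page_size(page_size)
    total_pages = max(1, math.ceil(total_rows / size))
    in_range = 1 <= page <= total_pages
    start = (page - 1) * size if in_range else 0
    end = min(start + size, total_rows) if in_range else 0
    # in_range=False lets the UI show a "page doesn't exist" reset prompt.
    return {"page": page, "total_pages": total_pages, "in_range": in_range,
            "start": start, "end": end}
```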

Files: otto/mission_control/{model.py,serializers.py} (history_page_size
plumbing), otto/web/app.py (param wiring), otto/web/client/src/{App.tsx,
api.ts,types.ts,styles.css} (pagination UI), tests/browser/test_history_pagination.py
(10 paired Playwright tests, all green).

Verification:
- Browser suite: 25 passed (8 cluster A + 7 cluster B + 10 cluster D + new
  smoke set from cluster C cache headers via tests/test_web_cache_headers.py)
- Default suite: 1091 passed (was 1076; +15 cluster C server tests)
- npm run web:typecheck: clean
- npm run web:build: 277.82 kB JS / 33.34 kB CSS (rebuilt with stamp)

Note: cluster C agent hit an API overload during its final summary step but
all files landed cleanly on disk; verified by independently running the new
test files before committing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request Apr 26, 2026
CLUSTER E — diff freshness contract (closes 1 CRITICAL + 1 IMPORTANT):

CRITICAL closed:
- Diff fetched once and held client-side without target/branch SHAs or
  merge-base; **the code merged could differ from the diff reviewed**
  (codex-evidence-trustworthiness #1).

IMPORTANT closed:
- Diff truncation was bare `truncated` suffix; user couldn't tell how much
  was hidden, no full-diff download path (codex-evidence-trustworthiness #2).

Implementation:
- `MissionControlService.diff()` enriches response with: `fetched_at`,
  `target_sha`, `branch_sha`, `merge_base`, `command`, `limit_chars`,
  `full_size_chars`, `shown_hunks`, `total_hunks`, `errors[]`.
- `MissionControlService._validate_expected_diff_shas` — merge action
  rejects with 409 + "Re-fetch the diff to confirm what will be merged"
  when `expected_target_sha` / `expected_branch_sha` differ from live HEAD.
- POST /api/runs/{id}/actions/{action} forwards SHAs from request body.
- DiffPane renders freshness header (captured-X-ago + target/branch/base
  short SHAs with full-SHA tooltip; warnings when SHAs are null) + Refresh
  button + truncation banner ("Showing N hunks of M · X KB of Y MB" + Copy
  diff command).
- Merge confirm dialog spells out "Land branch {short} @ {sha} into target
  {short} @ {sha}" with the actual SHAs.
- SPA passes SHAs from most-recent diff fetch on every merge POST.
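
A sketch of the SHA gate on the merge action. The exception class and the function signature are hypothetical; the 409 status, the error text, and the opt-in behaviour for callers that omit SHAs follow this commit message.

```python
from typing import Optional


class StaleDiffError(Exception):
    """Raised when the reviewed diff no longer matches live refs (maps to HTTP 409)."""
    status_code = 409


def validate_expected_diff_shas(
    expected_target_sha: Optional[str],
    expected_branch_sha: Optional[str],
    live_target_sha: str,
    live_branch_sha: str,
) -> None:
    # Callers that send no SHAs skip the gate (the opt-in behaviour noted
    # under the followups in this commit message).
    if expected_target_sha is None and expected_branch_sha is None:
        return
    stale = (
        (expected_target_sha is not None and expected_target_sha != live_target_sha)
        or (expected_branch_sha is not None and expected_branch_sha != live_branch_sha)
    )
    if stale:
        raise StaleDiffError("Re-fetch the diff to confirm what will be merged")
```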

CLUSTER F — boot-loading gate + first-run clarity (closes 2 CRITICAL +
~12 IMPORTANT):

CRITICAL closed:
- App.tsx tri-state boot gate (`loading | launcher | ready`) — main shell
  no longer renders before /api/projects returns; "New job" button can no
  longer be enabled with project undefined (codex-first-time-user #1).
- Pre-submit advanced-options summary in JobDialog: "Will run with: claude
  · sonnet · effort=high · verification=fast" outside Advanced details
  with "Edit" link (codex-first-time-user #2).

IMPORTANT closed:
- Launcher subhead: "Otto runs AI coding jobs in isolated git worktrees,
  then lets you review logs, diffs, and merge results."
- "Managed root" helper text explains current-repo isolation
- Empty project list: "Create your first Otto project below" + auto-focus
- First-run primary CTA: "Start first build" (reverts to "New job" once
  any run exists)
- Build/Improve/Certify dropdown options gain helper descriptions
- All commands now require non-empty intent or focus (was build-only)
- Dirty-project confirm lists up to 5 dirty files with "+N more"
- "Start queued job" CTA when watcher is stopped + jobs queued
- Empty detail copy: "Select a task card to review logs, code changes,
  verification, and next action."
- RunInspector tab labels: jargon-soft alternatives
- Recovery actions surfaced as primary contextual buttons (Retry / Resume
  / Cleanup) next to run header — Advanced still has full list
- HTTP-code-to-actionable-copy mapping in api.ts: 409/400/403/5xx →
  recovery messages (no more raw "HTTP 409")

Tests:
- tests/test_diff_freshness.py — 6 server tests
- tests/browser/test_diff_freshness.py — 5 browser tests
- tests/browser/test_first_run_clarity.py — 13 browser tests
- All green; default suite 1097 (was 1091; +6 cluster E server)
- Browser suite 47 (was 25; +5 cluster E, +17 cluster F)
- npm run web:typecheck clean; bundle rebuilt

Followups for orchestrator:
- SHA-mismatch refusal is opt-in by client (older callers omit SHAs and
  bypass the gate); consider promoting to power-user opt-out flag
- Pre-fetch diff inside runActionForRun("merge") so SHA gate is
  non-bypassable from SPA
- Related provenance gaps (proof drawer cache, ArtifactRef metadata,
  visual-evidence manifest, proof file digest) share same architectural
  fix; bundle into a future cluster

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request Apr 26, 2026
…/launcher/lost-connection)

============================================================
5 IMPORTANT closures
============================================================

W7-IMPORTANT-1 — iPhone submit button below the fold
  In devices["iPhone 14"] viewport (390x664), JobDialog submit at
  y=674 was clipped without scroll affordance. JobDialog scroll-
  container now uses max-height: calc(100dvh - 80px) and overflow-y:
  auto, so the submit button is always reachable: either by scrolling
  or via a floating bottom-bar that the dialog renders on narrow
  viewports.

W8-IMPORTANT-1 — JobDialog ignores Cmd+Enter from textarea
  Power-user shortcut was dead. Added onKeyDown to intent textarea:
  (Cmd|Ctrl)+Enter now triggers submit when validation passes.

W9-IMPORTANT-2 — Run double-rendered after terminal_outcome
  Live[]/history[] transition not atomic — same run_id appeared in
  both, UI rendered twice. Client-side dedupe: when computing rows
  to render, exclude live items whose run_id appears in history with
  terminal_outcome set.
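
  The dedupe rule can be sketched as follows. The fix itself is client-side; this Python version is only illustrative, and the row shape and the live-then-history ordering are assumptions.

```python
def rows_to_render(live: list[dict], history: list[dict]) -> list[dict]:
    """Drop live rows whose run already appears in history as terminal."""
    finished = {
        row["run_id"]
        for row in history
        if row.get("terminal_outcome") is not None
    }
    still_live = [row for row in live if row["run_id"] not in finished]
    return still_live + list(history)
```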

Codex first-time-user #4 — "Managed root" looked like current repo
  disappeared. Launcher panel adds: "Otto manages projects in
  isolated git worktrees so it never touches your other repos. Pick
  or create one below to start."

Codex error-empty-states #1 — Lost-connection banner
  When polling fails 3+ consecutive times, sticky banner appears:
  "Lost connection to Mission Control. Retrying every 5s..." with
  manual retry button. Auto-clears on first successful poll after.

============================================================
Tests added (14 new browser tests)
============================================================

- tests/browser/test_iphone_submit_button_reachable.py (3 tests)
- tests/browser/test_job_dialog_cmd_enter.py (3 tests)
- tests/browser/test_no_double_render_after_terminal.py (4 tests)
- tests/browser/test_launcher_managed_root_explanation.py (1 test)
- tests/browser/test_lost_connection_banner.py (3 tests)

============================================================
Test counts
============================================================
- Default: 1189 (no change; all UI fixes)
- Browser: 198 effective (was 184; +14 new)
- npm run web:typecheck: clean
- npm run web:build: clean

============================================================
Tally
============================================================
CRITICAL: 29 of 29 closed (100%)
IMPORTANT: ~113 of 132 closed (~86%)

Followups:
- ~19 NOTE-tier IMPORTANTs remain (paper cuts; deferred per
  severity-gate rule unless adjacent)
- 76 NOTE items deferred per policy
- Phase 3.5 R1-R14 actual recordings ($70-140, hours of real LLM)
  remain scaffolded but not captured

This effectively closes the bulk of Phase 4 IMPORTANT work. Branch
is in releasable state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request May 6, 2026
The architectural design doc (docs/intent-to-product-design.md) was
the canonical reference that drove the redesign — adding it as item
#1 in the read-in-this-order list ahead of progress.md / research.md
/ plan.md. Fixes the duplicate "4." numbering left by the prior edit.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request May 16, 2026
Three deterministic RED-first repros for the concurrency + recursive-
decomposition regime, exercising real code paths (only the agent step is
an injected callable; no LLM, no spec-compile, no 90-min run; <5s total):

- #1 depth-3 dual-subtree concurrent propagation: GREEN — owning-worktree
  propagation (regression guard for 89d4bad) + blocked-slice isolation
  with structured reason are sound. Banked as permanent regression.
- #2 5-way concurrent merge into one integration branch: RED — one-shot
  union repair against a moving integration target drops already-landed
  sibling contributions (final routes [route-0,route-2], expected all 5)
  instead of bounded re-entry. Composition of flock + union guard + seam
  re-entry under 5-way race is broken.
- #3 task_graph/spec-state concurrent terminal writes (32 children): RED —
  spec_state.append_event derives event_id from an unlocked line-count
  read (spec_state.py:328-333), so concurrent writers stale-read the same
  count and produce duplicate/skewed event_ids, violating the documented
  stable-unique contract amendments rely on (trigger_event_id linkage).
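
To make finding #3 concrete, here is a sketch of one possible remedy: read the line count and append the record under a single exclusive file lock. This is not the fix the author shipped (this commit changes no production code), the function signature and record layout are assumptions, and `fcntl` makes it Unix-only.

```python
import fcntl
import json
import os


def append_event(path: str, payload: dict) -> int:
    """Append one JSONL event; the id is derived from the line count under a lock."""
    with open(path, "a+b") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # blocks until no other writer holds it
        try:
            fh.seek(0)
            event_id = sum(1 for _ in fh) + 1  # count happens inside the lock
            record = json.dumps({"event_id": event_id, **payload}) + "\n"
            fh.seek(0, os.SEEK_END)
            fh.write(record.encode("utf-8"))
            fh.flush()
            os.fsync(fh.fileno())  # make the line durable before unlocking
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)
    return event_id
```

Because the count and the write share one critical section, two writers can no longer observe the same line count.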

No production code changed — triage/repro pass only. Fixes dispatched next.

Repros built by Codex; root causes confirmed against real spec_state code.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request May 16, 2026
…rm contract-write invariant

Plan-Gate-APPROVED redesign step S1 (builds on S0 primitive). Flips
repro scene #1 GREEN; scenes #2-#5 stay RED (S1 flips only #1).

- Scheduler-ordering invariant in `_process_children`: no `feature`
  child dispatches while a sibling `foundation` task for that parent is
  pending/in-flight/unverified or the parent foundation_contracts are
  absent/invalid — even without `depends_on` (scheduler invariant, not
  prompt).
- Isolation gate beside `check_route_registration_isolation`: pre
  feature-dispatch, every foundation_contract path must be exclusively
  owned; an overlapping/nested feature owned_path → re-enter the
  architect via the existing `_reenter_or_block_architect_contract`
  bounded machinery with `kind="shared_foundation_not_isolated"`;
  bounded exhaustion → structured terminal (no crash, no silent
  dispatch).
- Uniform contract-write invariant: a task may write a
  foundation_contract path only if it is the contract's owner_task_id
  or a contract_amendment for it (S1 emits a structured
  `foundation_contract_write_blocked` and does not advance the branch;
  S2 adds the amendment routing). ONE shared helper applied at ALL 8
  enumerated v5 commit/merge admission hooks (preflight repair, integ-
  agent commit, root-inline, subtree-prop repair, child-verify repair,
  scaffold repair, _merge_child_branch pre-commit dirty/untracked +
  pre-merge committed child-branch delta, merge-conflict repair).
- Plan-Gate must-have: `_task_entry_allows_upward_merge` /
  `_child_result_allows_upward_merge` now consult durable
  merge_blocked graph state so a stale in-memory `pass` cannot bypass
  the S1 gate.
- Secondary: legacy `detect_scope_violations` treats a newly-created
  critical shared-contract path as a violation (defense-in-depth; v5
  gate is primary).
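
The scheduler-ordering invariant can be sketched as a pure predicate. The `Child` dataclass, the `verified` flag, and the function name are hypothetical, and the sketch assumes a parent with no foundation sibling is never held.

```python
from dataclasses import dataclass


@dataclass
class Child:
    task_id: str
    task_role: str          # "foundation" or "feature"
    verified: bool = False  # stand-in for "merged and verified"


def may_dispatch_feature(child: Child, siblings: list[Child],
                         contracts_valid: bool) -> bool:
    if child.task_role != "feature":
        return True
    foundations = [s for s in siblings if s.task_role == "foundation"]
    if not foundations:
        # Assumption: pure-feature decomposition is not subject to the hold.
        return True
    # Hold until every foundation sibling is verified AND the parent's
    # foundation_contracts are present and valid. depends_on is irrelevant.
    return contracts_valid and all(f.verified for f in foundations)
```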

Verified: scene #1 GREEN, #2-#5 RED; 19 S1+S0 units GREEN; broad
v5/seam suite shows ONLY the 4 known pre-existing test_v5_phase2
git-worktree-rot failures (no new); ruff clean. S0 untouched.
Codex-implemented; Claude-reviewed (scope, no S2-S5 leak, scene-#1
not-gamed, stale-pass must-have, 8-hook enumeration).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request May 16, 2026
…duler hold, honest terminal

S1 Gate R1 (3 CRITICAL + 2 IMPORTANT; 8-hook enumeration + no-S2-leak
confirmed):

- CRITICAL-1: scheduler allowed feature dispatch when foundation
  contracts not yet declared even with a foundation sibling. Now an
  explicit task_role=="foundation" sibling HOLDS all feature dispatch
  until every foundation sibling is mergeable/verified AND valid parent
  foundation_contracts are present; pure-feature decomposition (no
  foundation sibling) still bypasses.
- CRITICAL-2: terminal-blocked foundation silently stranded features
  (dropped from ready + loop break). Now affected ready/pending feature
  children are honestly marked merge_blocked kind="foundation_unsatisfied"
  (no silent drop).
- CRITICAL-3 (real capstone root cause): isolation gate only caught
  exact contract-path overlap, NOT a feature owned_path nested under
  the foundation owner's broad tree (foundation owns backend/, contract
  backend/auth.py, feature owns backend/routers/auth.py). Now a
  foundation-owned tree covering a declared contract is exclusive;
  nested sibling feature paths are rejected via the existing
  shared_foundation_not_isolated architect re-entry. Repro scene #1
  strengthened to the nested capstone shape (verified RED on old
  db5b819, GREEN on fix — genuine, not gamed).
- IMPORTANT-4: `_task_entry_allows_upward_merge` now also rejects a
  stale verdict=="pass" carrying durable merge_blocked metadata (was
  only fixed in `_child_result_allows_upward_merge`).
- IMPORTANT-5: removed the over-broad any-contract_amendment write
  allow (S2 will reintroduce a bound amendment→contract allow);
  integration-of-record allow explicitly deferred, not left ambiguous.
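
The nested-tree case in CRITICAL-3 can be illustrated with a path-overlap check. The function names are hypothetical, and treating both nesting directions as a violation is an assumption about what "overlapping/nested" covers.

```python
from pathlib import PurePosixPath


def _covers(tree: str, path: str) -> bool:
    """True when path equals tree or lies somewhere beneath it."""
    t, p = PurePosixPath(tree), PurePosixPath(path)
    return p == t or t in p.parents


def isolation_violations(foundation_owned: list[str], contracts: list[str],
                         feature_owned: list[str]) -> list[str]:
    # A foundation-owned tree becomes exclusive once it covers a declared contract.
    exclusive = [t for t in foundation_owned
                 if any(_covers(t, c) for c in contracts)]
    exclusive += contracts  # contract paths themselves are always exclusive
    return [f for f in feature_owned
            if any(_covers(e, f) or _covers(f, e) for e in exclusive)]
```

With the paths from the commit message, `backend/routers/auth.py` is flagged because it sits under the exclusive `backend/` tree, even though it is not the contract path itself.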

Verified: scene #1 GREEN (nested shape), #2-#5 RED; 21 S1+S0 units
GREEN; broad suite only the 4 known pre-existing test_v5_phase2
git-worktree-rot failures; ruff clean. S0 untouched. Codex-fixed
(Codex-found via Impl Gate R1); Claude-reviewed + RED-on-old verified.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request May 16, 2026
…er / honest terminal)

S1 Gate R2 (2 CRITICAL; nested-tree fix + IMPORTANT-4/5 confirmed
correct R2, untouched). Both were the same disease — a feature child
silently stranded pending instead of an honest structured terminal
(the capstone hang class):

- CRITICAL-1: foundation PASSED but never produced valid
  foundation_contracts → features were silently held/dropped with no
  re-entry or terminalization (the S1 test masked it by injecting
  contracts externally between _process_children calls). Now: passed/
  mergeable foundation + absent/invalid contracts re-enters the
  architect via the existing bounded `_reenter_or_block_architect
  _contract` machinery (kind=foundation_contracts_missing_after_pass);
  on bounded exhaustion the dependent features are honestly
  merge_blocked — never left pending awaiting external metadata
  mutation. Test rewritten to assert the real runtime transition (no
  external contract injection).
- CRITICAL-2: terminalization only ran when ready_features non-empty,
  but a depends_on=["foundation"] feature is NOT ready when foundation
  failed (deps unsatisfied), so it stayed silently pending. Now
  terminal-foundation handling scans ALL same-parent unmerged feature
  siblings (not just ready) and marks each merge_blocked
  kind="foundation_unsatisfied"; covers foundation merge_blocked AND
  catastrophic/failed.

Tradeoff: terminal handling uses task-graph siblings as source of
truth (not orphan pending JSONL) — consistent with the existing
scheduler/metadata model, no new lifecycle channel.

Verified: scene #1 GREEN (nested capstone shape), #2-#5 RED; 22 S1+S0
units GREEN incl. 2 strengthened/added scheduler tests asserting real
runtime transitions; broad suite only the 4 known pre-existing
test_v5_phase2 failures; ruff clean. S0 + nested-tree fix untouched.
Codex-fixed (Codex-found via Impl Gate R2); Claude-reviewed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request May 16, 2026
Plan-Gate-APPROVED redesign step S2 (builds on S0/S1). Flips repro
scene #5 GREEN (full lifecycle); scenes #2/#3/#4 stay RED.

- _merge_child_branch union-feedback: when the union/conflict path is a
  declared foundation_contract the child does NOT own, no longer routes
  the repair to the leaf (the b15 scope-gate deadlock). Instead
  schedules a runnable task_role=="contract_amendment" task owned by the
  contract's owner_task_id, owned_paths=[contract], emits
  foundation_contract_amendment_repair.
- Net-new lifecycle (task_graph): set_contract_amendment_blocked
  records last_agent_verdict, CLEARS verdict/completed_at (un-non-
  runnable), sets non-terminal blocked_pending_contract_amendment +
  blocked_on_task_id; clear_contract_amendment_blocked_tasks clears ALL
  leaves blocked on an amendment (Plan-Gate must-have #3, not just the
  first).
- take_ready: new blocked_on_task_id gate (analogous to depends_on) —
  a leaf with an unsatisfied blocked_on is skipped, not dispatched/
  terminal.
- Amendment terminal-PASS → clear all blocked leaves + re-enqueue
  merge-only retry (scheduler re-entry w/ contract_amendment_retry_merge
  metadata, reuses pending/lease machinery, bypasses Lead, retries only
  _merge_child_branch). Amendment terminal-FAIL → each leaf honest
  merge_blocked (no silent hang).
- Reintroduced the BOUND contract_amendment write-allow S1 removed: an
  amendment may write only its bound contract (owner/path match via
  task metadata), not any contract.
- Blocked graph state authoritative over stale in-memory LeadResult(pass)
  (Plan-Gate must-have #2; composes with S1's hardening).
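
A simplified sketch of the new `take_ready` gate. The dictionary-based graph, the treatment of a passed blocker as satisfied, and the omission of leases are all assumptions made for illustration.

```python
def take_ready(tasks: dict[str, dict], in_flight: set[str]) -> list[str]:
    """Return ids of tasks that may be dispatched now."""

    def satisfied(task_id: str) -> bool:
        task = tasks.get(task_id)
        return task is not None and task.get("verdict") == "pass"

    ready = []
    for task_id, task in tasks.items():
        if task_id in in_flight or task.get("verdict") is not None:
            continue  # already running or already terminal
        if not all(satisfied(dep) for dep in task.get("depends_on", [])):
            continue
        blocker = task.get("blocked_on_task_id")
        if blocker is not None and not satisfied(blocker):
            continue  # skipped, but neither dispatched nor made terminal
        ready.append(task_id)
    return ready
```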

Verified: scene #5 strengthened to full-lifecycle assertion (verified
RED on pre-S2 78535d1, GREEN now — not gamed); scenes #1/#5 GREEN,
#2/#3/#4 RED; 27 S0+S1+S2 units GREEN; broad suite only the 4 known
pre-existing test_v5_phase2 failures; ruff clean. S0/S1 untouched.
Codex-implemented; Claude-reviewed (scope, lifecycle, RED-on-old).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request May 16, 2026
…tlement, bound writes, bounded churn

S2 Gate R1 (2 CRITICAL + 1 IMPORTANT + 1 NOTE) — the silent-hang /
double-merge class:

- CRITICAL-1: merge-only retry never persisted the restored terminal
  verdict (only in-memory) → after restart the graph had verdict=None
  and re-dispatched the leaf (double-merge); the scene masked it via a
  fake set_verdict. Now the restored `pass` is persisted ONLY after
  _merge_child_branch really succeeds AND a durable graph re-read shows
  no fresh block/retry/merge_blocked — idempotent, no restart
  double-merge.
- CRITICAL-2: an amendment _run_child CRASH set it catastrophic without
  running fail-settlement → blocked leaves kept blocked_on_task_id
  forever (take_ready skips them = silent hang). Now ANY amendment
  terminalization (crash/catastrophic/failed/merge_blocked) runs
  _settle_contract_amendment_dependents → every blocked leaf becomes
  honest merge_blocked.
- IMPORTANT-3: bound write-allow still let a contract_amendment task
  modify arbitrary NON-contract files (gate only flagged
  contract-overlapping paths). Now a contract_amendment task may write
  ONLY its bound contract path; any other changed path is rejected.
- NOTE-4: futile amendment churn was unbounded (pass-without-fix →
  schedule another amendment forever). Now bounded per (leaf, contract)
  (cap=2: initial + 1 retry, matching existing bounded-retry style) →
  honest structured merge_blocked on exhaustion.

+4 regressions: durable verdict after real (non-fake) retry + no
re-dispatch; amendment crash settles all blocked leaves; amendment
cannot write a non-contract file; futile-amendment bounded → terminal.

Verified: scenes #1/#5 GREEN, #2/#3/#4 RED; 30 S0+S1+S2 units GREEN;
broad suite only the 4 known pre-existing test_v5_phase2 failures; ruff
clean. S0/S1 untouched. Codex-fixed (Codex-found via Impl Gate R1);
Claude-reviewed. Tradeoff: amendment retry cap=2 (small bounded style).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request May 16, 2026
…durable in-progress state)

S2 Gate R2 (1 IMPORTANT; C2/I3/N4 confirmed correct R2). The
merge-only-retry flag was cleared BEFORE the merge, leaving a window
(verdict=None, blocked_on=None, retry=False) where a crash/restart or a
second runner could re-dispatch the leaf via take_ready (in-process
lease only) → double-merge class.

Fix: durable `contract_amendment_retry_in_progress` set atomically when
entering merge-only retry (no longer pre-clears
contract_amendment_retry_merge); `take_ready` treats it non-runnable so
empty-in-flight / crash-restart / second-runner cannot re-dispatch the
leaf as an ordinary task; cleared ONLY atomically (single graph
lock/write) with the terminal outcome — success persists `pass` + both
flags; merge_blocked persists terminal + flags; fresh re-block clears
stale retry flags atomically with blocked_on_task_id (preserving
last_agent_verdict). Fails-closed during the window; idempotent on
restart (resume/settle, never double-merge).

+1 regression: simulates the exact in-retry window pre-durable-pass and
asserts fresh take_ready(in_flight=set()) does NOT return the leaf.

Verified: scenes #1/#5 GREEN, #2/#3/#4 RED; 31 S0+S1+S2 units GREEN;
broad suite only the 4 known pre-existing test_v5_phase2 failures; ruff
clean. S0/S1 + R1 fixes (C2/I3/N4) untouched. Codex-fixed (Codex-found
via Impl Gate R2); Claude-reviewed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request May 16, 2026
…le-recovery

S2 Gate R3 (2 CRITICAL): the R2 durable-in-progress fix closed the
in-process window but (1) left the second-runner race open
(mark_..._in_progress wasn't compare-and-set; _run_child ignored its
return) and (2) introduced a crash/restart DEADLOCK (stale in-progress,
no recovery — traded double-merge for permanent stuck).

- Atomic claim: `mark_contract_amendment_retry_in_progress` is now a
  compare-and-set under the existing `_locked_graph()` fcntl.LOCK_EX —
  flips in_progress=True only if still retry-merge/unblocked/non-terminal
  and unclaimed-or-stale-with-budget; persists owner token/pid/host/
  heartbeat/claim-count/merge-context. `_run_child` consumes the return:
  False → does NOT run _merge_child_branch (yields to the owner; no
  double-merge, no terminalize-of-a-live-owner). One active merger at a
  time, cross-process.
- Bounded stale-recovery: stale = same-host owner pid gone OR
  heartbeat/start exceeds the bounded timeout. take_ready reopens ONLY
  stale retry-merge entries as merge-only retries (never ordinary Lead
  dispatch); remaining claim budget → reclaim+resume from durable
  contract_amendment_merge_context; budget exhausted → structured
  merge_blocked. Composes with N4's per-(leaf,contract) cap. Never
  deadlocks, never double-merges, never re-dispatches as ordinary.
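
The claim and staleness rules above can be sketched as follows. This is an illustration under stated assumptions, not the otto implementation: the graph is assumed to be a single JSON file, `locked_graph`, `try_claim`, `is_stale`, the record keys, `MAX_CLAIMS`, and the 15-minute `STALE_TIMEOUT_S` are all hypothetical, and the budget-exhausted `merge_blocked` terminal is left to the caller. It relies on `fcntl`, so it runs on Unix only.

```python
import fcntl
import json
import os
import socket
import time
from contextlib import contextmanager

STALE_TIMEOUT_S = 15 * 60  # assumed bound
MAX_CLAIMS = 2             # assumed claim budget


@contextmanager
def locked_graph(path):
    """Load the graph under an exclusive lock and write it back on exit."""
    with open(path, "r+") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            graph = json.load(fh)
            yield graph
            fh.seek(0)
            fh.truncate()
            json.dump(graph, fh)
            fh.flush()
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


def pid_alive(pid):
    try:
        os.kill(pid, 0)  # signal 0 only checks that the pid exists
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # exists, but owned by another user
    return True


def is_stale(claim, now):
    same_host = claim["host"] == socket.gethostname()
    if same_host and not pid_alive(claim["pid"]):
        return True
    return now - claim["heartbeat"] > STALE_TIMEOUT_S


def try_claim(path, leaf_id, owner, now=None):
    """Compare-and-set: True only for the single runner that wins the claim."""
    now = time.time() if now is None else now
    with locked_graph(path) as graph:
        leaf = graph[leaf_id]
        if not leaf.get("retry_merge") or leaf.get("blocked_on") or leaf.get("verdict"):
            return False  # not in a claimable merge-only retry state
        claim = leaf.get("claim")
        if claim is not None:
            if not is_stale(claim, now):
                return False  # a live owner holds it; yield
            if leaf.get("claim_count", 0) >= MAX_CLAIMS:
                return False  # budget exhausted; caller should terminalize
        leaf["claim"] = {"owner": owner, "pid": os.getpid(),
                         "host": socket.gethostname(), "heartbeat": now}
        leaf["claim_count"] = leaf.get("claim_count", 0) + 1
        leaf["retry_in_progress"] = True
        return True
```

The check and the write happen inside one lock scope, which is what makes the claim a compare-and-set; a caller that gets `False` must not run the merge.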

Net invariant: exactly one runner executes a leaf's merge-only retry at
a time; crash/restart always resolves to pass or honest merge_blocked
within bounded attempts.

+2 regressions: concurrent-claim race (exactly one wins, loser doesn't
merge); stale in-progress recovery (resume→pass or bounded→merge_blocked,
never ordinary, never stuck). R2 restart-window + durable-verdict
regressions still pass.

Verified: scenes #1/#5 GREEN, #2/#3/#4 RED; 33 S0+S1+S2 units GREEN;
broad suite only the 4 known pre-existing test_v5_phase2 failures; ruff
clean. R1 (C2/I3/N4) + S0/S1 untouched. Codex-fixed (Codex-found via
Impl Gate R3); Claude-reviewed. (Codex sub-docs
research/plan-s2-amendment-retry-recovery.md included.) Tradeoff: conservative
remote-host staleness handled via timeout.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request May 16, 2026
…(no false reclaim)

S2 Gate R4 (final round, 1 minimal must-fix; everything else confirmed
acceptable, residual NOTE-level). Heartbeat was written only at claim
time, never refreshed, but _merge_child_branch can legitimately run
~1800s > the 15-min stale timeout → a LIVE long-running retry owner was
falsely reclaimed by a second runner (the exact race R3 closed,
reopened by long merges).

Fix: owner-token-checked periodic heartbeat refresh (60s interval, well
under the 15-min stale window) wrapping the awaited _merge_child_branch
in the merge-only retry path. The refresher writes the heartbeat under
_locked_graph() ONLY when owner==this child_session_id AND
retry_in_progress AND retry_merge AND no terminal/blocked state landed
(re-checked each tick; stops if owner/state no longer matches).
try/finally cancels + awaits it (suppress CancelledError) on
success/merge_blocked/re-block/exception — no leaked task, no
post-terminal refresh. Dead/stalled owners still go stale and are
bounded-recovered via the existing timeout (unchanged).
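
A rough sketch of the refresh-and-cancel pattern described above, using only `asyncio`. The names `refresh_heartbeat` and `run_with_heartbeat` and the plain `state` dict are hypothetical stand-ins: the real refresher writes under `_locked_graph()`, which this sketch does not model.

```python
import asyncio
import contextlib
import time


async def refresh_heartbeat(state, owner, interval):
    """Refresh the heartbeat each tick while this owner still holds the retry."""
    while True:
        await asyncio.sleep(interval)
        still_owner = state.get("owner") == owner
        still_active = state.get("retry_in_progress") and not state.get("terminal")
        if not (still_owner and still_active):
            return  # ownership or state changed: stop refreshing
        state["heartbeat"] = time.time()


async def run_with_heartbeat(merge, state, owner, interval=60.0):
    """Await merge() while a background task keeps the heartbeat fresh."""
    refresher = asyncio.create_task(refresh_heartbeat(state, owner, interval))
    try:
        return await merge()
    finally:
        # Runs on success, failure, or cancellation: no leaked task and
        # no heartbeat write after the outcome has landed.
        refresher.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await refresher
```

With a 60s interval and a 15-minute stale window, a live owner would have to miss many consecutive ticks before a second runner could treat it as stale.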

+1 regression: live long-running heartbeating owner is NOT reclaimed by
a second claim (CAS still False); existing dead-owner stale-recovery
still recovers; R2/R3 regressions still pass.

Verified: scenes #1/#5 GREEN, #2/#3/#4 RED; 34 S0+S1+S2 units GREEN;
broad suite only the 4 known pre-existing test_v5_phase2 failures; ruff
clean. R1/R2/R3 + S0/S1 untouched. Codex-fixed (Codex-found via Impl
Gate R4); Claude-reviewed. Accepted NOTE-level residual: conservative
remote/unknown-host stale timeout (dead remote owner waits out the
bounded timeout before recovery).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request May 16, 2026
…loop (kills the 1799s hang)

Plan-Gate-APPROVED redesign step S4 (builds on S0/S1/S2). Directly
fixes the user's original pain: the 1799s leaf repair-agent timeout
that hung the iTracker capstone. Flips repro scene #2 GREEN;
#3/#4 stay RED.

- After a SCOPED conflict repair, `_merge_child_branch` runs
  integration smoke in DETECTION-ONLY mode — it no longer enters
  `_run_integration_smoke_preflight_with_repair`'s leaf repair loop for
  an out-of-scope / foundation clean-deploy failure. Both leaf-reachable
  entry points converted: the direct post-conflict path AND the
  stale-target
  `_repair_stale_target_and_retry_merge(run_smoke_preflight=True)` path.
  (Root/subtree integration smoke unchanged; not leaf.)
- An out-of-scope/foundation clean-deploy failure now emits a
  correctly-owned foundation_repair_needed / integration_repair_needed
  that creates a RUNNABLE graph task and S2-blocks the leaf (reuses S2's
  set_contract_amendment_blocked lifecycle / atomic-claim /
  stale-recovery; repair_route distinguishes integration_smoke_repair
  from foundation contract amendments). It never leaves a dangling event
  and never enters a 1799s leaf loop.
- In-scope failures keep existing scoped repair (no behavior change).
- v5_preflight_repair: scoped leaf conflict-repair prompts no longer
  demand the full acceptance oracle.

Repro scene #2 oracle refined (RED-first, verified RED on eae1f3a /
GREEN now — not weakened): asserts no leaf smoke-REPAIR loop from
either entry point, a runnable correctly-owned repair-need, leaf
S2-blocked (not merge_blocked), detection-only smoke allowed.

Verified: scenes #1/#2/#5 GREEN, #3/#4 RED; 38 S0-S2+S4 ownership
units GREEN; ruff clean; S0/S1/S2 untouched. Codex-implemented;
Claude-reviewed (RED-on-old verified; scope confirmed).

Pre-existing rot NOTE (NOT this redesign): committed
test_v5_architect_retry.py patches otto.v5_runner.check_scaffold_compiles
which was removed by e2329e9 (pre-session "agent-native repair Step
4") → AttributeError on 3 tests; plus the 4 test_v5_phase2
git-worktree-rot failures. Both predate + are unrelated to S0-S4 and
are entangled with the user's 4 uncommitted route-isolation dirty files
(deliberately NOT committed here).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request May 16, 2026
…nded pathless terminal, scoped in-scope fallback

S4 Gate R1 (1 CRITICAL + 1 IMPORTANT). The gate caught that S4 was
broken on the REAL path (tests used pathful fakes):

- CRITICAL: CleanOracleIssue.paths were dropped by
  preflight_issues_from_clean_oracle / PreflightIssue (no path field) /
  smoke serialization → S4's classifier saw REAL failures as pathless →
  always out-of-scope → empty-bound contract_amendment (rejects all
  writes) + cap-check-key('') vs
  increment-key('integration_smoke_repair') mismatch → cap never trips →
  the 1799s stuck-cycle
  re-emerged through S2 tasks. Fixed: added optional PreflightIssue.paths
  (legacy None preserved; constructors/consumers audited), threaded
  CleanOracleIssue.paths → PreflightIssue.paths →
  _preflight_issue_payload → _smoke_payload_paths, plus a robust
  fallback reading clean_oracle_result.issues[].paths. A genuinely
  pathless smoke failure now terminalizes as honest structured
  merge_blocked kind="integration_smoke_unrouteable" (never an
  empty-bound amendment, never uncapped). Single consistent normalized
  repair_path key used for BOTH the cap check and increment.
- IMPORTANT: the in-scope leaf smoke-repair fallback entered an
  UNRESTRICTED full-oracle loop (no allowed_paths/scope_policy → prompt
  demanded full acceptance oracle; commit hook only foundation-gated).
  Fixed: in-scope fallback now passes allowed_paths=leaf.owned_paths +
  scope_policy="allowed_paths", and the repair commit hook blocks any
  changed path outside that allowlist before the foundation-contract
  gate. A leaf smoke-repair can never widen beyond its owned paths.
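
The allowlist check in the second bullet can be sketched like this. It assumes owned paths are directory or file prefixes in POSIX form; `paths_outside_allowlist` is a hypothetical helper, and the real hook may match paths differently (for example with globs).

```python
import posixpath


def paths_outside_allowlist(changed_paths, allowed_paths):
    """Return the changed paths that fall outside every allowed prefix."""
    allowed = [posixpath.normpath(p) for p in allowed_paths]
    outside = []
    for raw in changed_paths:
        # Normalizing first stops "a/../b" style paths from slipping through.
        path = posixpath.normpath(raw)
        inside = any(path == a or path.startswith(a + "/") for a in allowed)
        if not inside:
            outside.append(path)
    return outside
```

A commit hook built on this would reject the repair commit whenever the returned list is non-empty, before any foundation-contract gate runs.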

+3 real-path regressions: clean-oracle serialization preserves paths
(RED on fa5c481 — old code had no PreflightIssue.paths, classifier
returned []); pathless smoke → bounded honest terminal (no empty
amendment); in-scope fallback packet + commit-hook enforce owned_paths
(inspects the real packet, not a monkeypatched call count).

Verified: scenes #1/#2/#5 GREEN, #3/#4 RED; 41 S0-S2+S4 ownership
units GREEN; ruff clean; S0/S1/S2 untouched. The 4 test_v5_phase2 +
committed test_v5_architect_retry check_scaffold_compiles-AttributeError
failures remain PRE-EXISTING rot (unrelated, entangled with the user's
uncommitted route-isolation work; deliberately not committed).
Codex-fixed (Codex-found via Impl Gate R1); Claude-reviewed.

Tradeoff: genuinely-pathless smoke failures terminalize immediately
(honest, actionable) rather than consuming retries against a synthetic
key.
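
To illustrate why a single key matters, here is a small sketch in which the cap check and the increment share one normalized key, and a pathless failure terminalizes without touching the counter. The function name, the return labels other than `integration_smoke_unrouteable` and `merge_blocked`, and the cap value are assumptions for illustration.

```python
from collections import Counter


def schedule_smoke_repair(attempts, repair_path, causal_paths, cap=2):
    """Decide the next step for a smoke failure under a bounded retry cap."""
    if not causal_paths:
        # Genuinely pathless: terminalize immediately, consume no retry.
        return "integration_smoke_unrouteable"
    key = repair_path.strip()  # the ONE key used for both check and increment
    if attempts[key] >= cap:
        return "merge_blocked"
    attempts[key] += 1
    return "repair_scheduled"
```

If the check read `attempts['']` while the increment wrote `attempts['integration_smoke_repair']`, the first lookup would stay at zero forever, which is the uncapped cycle described in the CRITICAL item above.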

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request May 16, 2026
…from broad compile inputs)

S4 Gate R2 (1 CRITICAL; R1 pathless/cap + in-scope-scoping confirmed
CLOSED). py_compile set CleanOracleIssue.paths = ALL compiled files
(command input set), not the causal failing file → S4's
all-paths-must-overlap scope check made a leaf-owned syntax error look
out-of-scope → misrouted an in-scope leaf bug to the wrong owner /
under-scoped repair (and the first-sorted-path fallback guessed an
arbitrary owner).

Fixed at both sides of the seam:
- Producer (otto/v5_clean_verify.py): new _py_compile_causal_paths
  parses the actual failing filename(s) from py_compile stderr/stdout;
  py_compile_failed.paths is now CAUSAL, not the broad input set. Audit:
  py_compile was the ONLY clean-oracle producer with the
  paths=command-input pattern; all others pass explicit/none.
- Router (otto/v5_runner.py): no first-sorted-path guess. The
  contract-amendment write gate now supports MULTIPLE bound paths and
  smoke-repair scheduling owns/binds ALL causal paths; if causal paths
  are empty or cannot all be bound to the selected route → honest
  integration_smoke_unrouteable terminal (never under-scoped, never
  arbitrary-owner).
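
As a sketch of the producer-side idea, the failing filename can be recovered from the traceback-style `File "...", line N` header that Python prints for a syntax error. The helper below is a hypothetical stand-in for `_py_compile_causal_paths`; the real parser may handle more output shapes.

```python
import re

# Matches the 'File "path", line N' header of a Python syntax-error report.
_FILE_LINE = re.compile(r'File "([^"]+)", line \d+')


def causal_paths_from_output(output):
    """Return the failing file paths named in compiler output, in order, deduplicated."""
    seen = []
    for match in _FILE_LINE.finditer(output):
        path = match.group(1)
        if path not in seen:
            seen.append(path)
    return seen
```

An empty result means the cause could not be determined from the output, which in the routing described above leads to the unrouteable terminal rather than a guess.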

Net invariant: leaf-owned causal failure stays in-scope (scoped leaf
repair, unchanged); foundation/out-of-scope causal failure routes to
the correct owner with ALL causal paths bound; indeterminate →
honest-terminal; broad non-causal input paths never drive
scope/routing.
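
The invariant can be read as a three-way decision over the causal paths. The sketch below assumes just two owners (the leaf and the foundation) and prefix-style ownership; `route_smoke_failure` and the `in_scope_leaf_repair` label are hypothetical, while `foundation_repair_needed` and `integration_smoke_unrouteable` are the names used in these messages.

```python
import posixpath


def _under(path, prefixes):
    path = posixpath.normpath(path)
    return any(path == p or path.startswith(p + "/")
               for p in (posixpath.normpath(x) for x in prefixes))


def route_smoke_failure(causal_paths, leaf_owned, foundation_owned):
    """Route by CAUSAL paths only; never guess an owner."""
    if not causal_paths:
        return "integration_smoke_unrouteable"
    if all(_under(p, leaf_owned) for p in causal_paths):
        return "in_scope_leaf_repair"
    if all(_under(p, foundation_owned) for p in causal_paths):
        return "foundation_repair_needed"
    # Mixed or unowned paths cannot all be bound to one route.
    return "integration_smoke_unrouteable"
```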

+ real py_compile_failed multi-input regressions (leaf-owned causal →
in-scope; foundation causal → routed+bound; indeterminate →
unrouteable). The leaf regression directly exercises the d91cece bug
(old paths=rel_files fails the causal-path assertion before routing).

Verified: scenes #1/#2/#5 GREEN, #3/#4 RED; 44 S0-S2+S4 ownership
units GREEN; broad suite only the known pre-existing test_v5_phase2 +
the S5-RED scene #3 (no new regression); ruff clean. S0/S1/S2 + S4-R1
untouched. Codex-fixed (Codex-found via Impl Gate R2); Claude-reviewed.
Tradeoff: broad compile inputs no longer kept as separate routing
evidence (still inspectable via the recorded oracle command).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request May 20, 2026
… decomposition)

Capstone run #1 evidence: root Lead decomposed CORRECTLY (5 children in
graph: foundation + 4 features, all submit_subtask duplicate:false) in
~237s, then the run died "run_budget_seconds exceeded during
root_decomposition (3000s)" at duration 642.2s — a 4.7x-early fire with
a false label.

Root cause (my P4 impl, not Codex's P1-P5 logic): both phase caps were
hard-set to 240s — `spec_compile` (v5_runner.py:4007) and
`root_decomposition` (:4059). 240s is far too tight: a real flat
compile of a 47-feature product is ~6-10min and decomposition ~4-5min.
The Lead emitted all 5 children at 237s and the 240s cap killed the
phase ~3s later as it returned. `_V5RunDeadlineExceeded` also always
reported `run_budget_seconds` (3000) regardless of which limit fired,
masking the real cause.

Fix: relax both phase caps 240 → 900s (generous hang-rails — healthy
spec ~6min / decomp ~4min never killed; a genuine >15min hang still
caught). run_budget_seconds remains the true total ceiling (P4 intent
preserved: per-phase caps are hang detectors, the run budget is the
real bound). `_await_with_run_deadline`/`_V5RunDeadlineExceeded` now
report the ACTUAL fired timeout and whether it was a phase_cap vs
run_budget — honest diagnostics so this can't mislabel again.
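
A minimal sketch of the honest-diagnostics idea, assuming the phase cap and the remaining run budget are combined into one `asyncio.wait_for` timeout. `DeadlineExceeded` and `await_with_deadline` are hypothetical names for illustration and do not reproduce `_await_with_run_deadline` or `_V5RunDeadlineExceeded`.

```python
import asyncio
import time


class DeadlineExceeded(Exception):
    def __init__(self, phase, kind, limit_s):
        super().__init__(f"{kind} of {limit_s}s exceeded during {phase}")
        self.phase = phase
        self.kind = kind        # "phase_cap" or "run_budget"
        self.limit_s = limit_s  # the limit that actually fired


async def await_with_deadline(coro, phase, phase_cap_s, run_started, run_budget_s):
    """Await coro under the tighter of the phase cap and the remaining run budget."""
    remaining = run_budget_s - (time.monotonic() - run_started)
    if phase_cap_s <= remaining:
        kind, limit = "phase_cap", phase_cap_s
    else:
        kind, limit = "run_budget", run_budget_s
    timeout = max(0.0, min(phase_cap_s, remaining))
    try:
        return await asyncio.wait_for(coro, timeout=timeout)
    except asyncio.TimeoutError:
        # Report whichever limit was binding, not always the run budget.
        raise DeadlineExceeded(phase, kind, limit) from None
```

Recording which limit was binding before awaiting means the error message cannot name the run budget when a phase cap fired.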

Verified: syntax OK; ruff clean; 12 P1-P5 tests GREEN (incl P4
wall-clock). Claude-fixed from real run-#1 logs (no Codex — out of
credit). basedpyright None.get@595-598 confirmed guarded (false
positive); remaining strict-typer flags in P1-P5 are non-runtime noise.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
logpie added a commit that referenced this pull request May 20, 2026
Run #6 (mib6-001619) was the first run to traverse the full pipeline
(compile→decomp→scaffold→persist→4-feature concurrent fan-out→integration).
It exposed the next-layer instance of the decomp-boundary class: 3 of 4
features merge_blocked at integration on a conflict in
`backend/tests/conftest.py`. The conflict packet showed `base: ""` — every
feature leaf independently CREATED its own conftest.py (each needs test
fixtures), and the conflict-repair agent timed out (399s) trying to reconcile
divergent creates of the same shared file.

lead.md's architect guidance isolated route/API/screen registration but said
nothing about shared TEST/BUILD infrastructure, which is the same kind of
shared registry: a file every feature would otherwise each create or edit.
Adds a rule (general, not conftest-specific): the scaffold MUST create shared
test/build bootstrap (conftest.py, tests/setup.*, jest.config.*, shared DB/
session fixtures, shared mocks, shared lint/type config) and list it in
shared_registry_files with leaf_edit:false; feature leaves add only their own
test_<feature>.* modules under the extension globs and import the shared
harness. Divergent independent creates of these files are the #1 integration
merge-conflict cause.

Prompt-only root-cause fix (no new deterministic validator predicate — that
would re-introduce the brittle-predicate anti-pattern this campaign is
removing). Pairs with 104522a (policy-label first-try gate) for the
run #7 <45min test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>