feat: gate pilot — LLM at batch decision boundaries by logpie · Pull Request #1 · logpie/otto

logpie · 2026-03-30T04:05:12Z

Summary

Adds a gate pilot that replaces simple replan() after batch failures with richer failure analysis, retry strategies, context routing, and skip recommendations
Stateless design: reads disk artifacts, returns structured JSON, orchestrator validates and applies
Falls back to replan() if the pilot's output fails to parse. Set pilot: false in otto.yaml to disable. Zero overhead when no batch fails.
Codex-reviewed (3 rounds, APPROVED). 484 unit tests pass. 53 tasks across 18 e2e runs, 0 regressions.

Status: NOT validated on real failures

The pilot never fired during benchmarking because the coding agent passed all tasks. This is expected — the pilot's value is at i2p scale (8+ tasks, multiple batches, partial failures). Shipping as a safe no-op upgrade.

What's new

File	What
`otto/pilot.py`	Gate pilot module — context assembly, LLM call, decision parsing
`otto/orchestrator.py`	Pilot at batch boundaries, fallback to replan, config flag
`otto/runner.py`	Pilot guidance separated in retry prompts
`tests/test_pilot.py`	22 unit tests
`tests/test_pilot_benchmark.py`	6 scenario tests
`bench/pilot-benchmark.sh`	A/B benchmark runner
`bench/pressure/projects/pilot-test-*`	3 synthetic test projects

Design docs

Spec: docs/superpowers/specs/2026-03-29-gate-pilot.md
Plan: docs/superpowers/plans/2026-03-29-gate-pilot-stage1.md
i2p spec: docs/superpowers/specs/2026-03-26-otto-intent-to-product.md

Test plan

484 unit tests pass (0 new failures)
Codex adversarial review: 3 rounds, APPROVED
18 e2e runs (6 projects × baseline/pilot): zero overhead, zero regressions
4 real-world combined runs (ufo, humanize, camelcase, pre-commit)
5-task greenfield run with merge conflicts
Pending: real-world pilot invocation — needs a run where batch has mixed pass/fail results with remaining tasks. Monitor pilot.log on next failure.

🤖 Generated with Claude Code

Adds a gate pilot that replaces the simple replan() call after batch failures. The pilot reads disk artifacts (verify logs, QA verdicts, task summaries, learnings) and returns structured decisions: failure analysis, retry strategies, routed context for upcoming tasks, skip recommendations, and re-batching.

Key design:
- Stateless: reconstructs context from files each invocation
- No telephone game: pilot makes system-level decisions, coding agents interpret their own errors directly
- Structured JSON output, orchestrator validates and applies
- Same model as planner (configurable via planner_model)
- Falls back to replan() on parse failure
- Config flag: pilot: false in otto.yaml to disable
- Zero overhead when no failures (pilot only invoked at batch boundary with failures + remaining tasks)

Codex-reviewed: 3 rounds, all CRITICAL/IMPORTANT findings fixed, APPROVED.

Benchmark: 53 tasks across 18 runs, 0 regressions, 0 pilot overhead.

Pilot not yet validated on real failures; shipping as safe no-op upgrade for i2p readiness. Will prove value at scale (5+ tasks, multiple batches).

New files:
- otto/pilot.py: context assembly, LLM invocation, decision parsing
- tests/test_pilot.py: 22 unit tests
- tests/test_pilot_benchmark.py: 6 scenario benchmark tests
- bench/pilot-benchmark.sh: A/B benchmark runner
- bench/pressure/projects/pilot-test-*: 3 synthetic test projects

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
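The invocation rule and the parse-failure fallback described above can be sketched roughly as follows. This is an illustration only, not the code in otto/pilot.py: `BatchResult`, `should_invoke_pilot`, `parse_pilot_decision`, and the `REQUIRED_KEYS` set are hypothetical names and shapes chosen for the example.

```python
from __future__ import annotations

import json
from dataclasses import dataclass


@dataclass
class BatchResult:
    """Hypothetical summary of one finished batch."""
    failed: list[str]
    remaining: list[str]


# Hypothetical decision keys; the real schema lives in the pilot spec.
REQUIRED_KEYS = {"failure_analysis", "retry_strategies", "skip"}


def should_invoke_pilot(batch: BatchResult, pilot_enabled: bool = True) -> bool:
    """The pilot runs only at a batch boundary with failures AND remaining tasks."""
    return pilot_enabled and bool(batch.failed) and bool(batch.remaining)


def parse_pilot_decision(raw: str) -> dict | None:
    """Return the validated decision, or None so the caller can fall back to replan()."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(decision, dict) or not REQUIRED_KEYS <= decision.keys():
        return None
    return decision
```

Returning `None` instead of raising keeps the fallback decision in the orchestrator, which matches the "orchestrator validates and applies" split described in the message.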

coderabbitai · 2026-03-30T04:05:19Z

Important

Review skipped

Draft detected.


Supersedes the gates + gate pilot approach. Simplified to 5 steps: classify → plan → execute → verify → fix-or-replan.

Key decisions:
- Single-task is a valid plan (no forced decomposition)
- Product artifacts at project root (not otto_arch/)
- Persistent context.md accumulates across tasks
- Vertical slices over horizontal layers
- User journeys from user's perspective, not feature list
- Fix rounds continue while making progress, replan on planning failures
- Codex-reviewed design

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Closes 1 CRITICAL + 2 IMPORTANT findings from the mc-audit hunters. CRITICAL closed: - Logs appended unbounded into client state; 10MB log could lock the browser (codex-long-string-overflow #1). Replaced with bounded ring buffer (1MB max), separate unbounded totalBytes/totalLines counters, droppedBytes tracker for the elided-bytes header. IMPORTANT closed: - LogPane didn't distinguish "Live, polling" from "Final" state (codex-evidence-trustworthiness #5). Header now shows live polling cadence + last-update age, OR final size + line count. - Missing log file rendered as generic "waiting for output"; fetch errors toasted while polling kept hammering (codex-error-empty-states #9). Now shows the path explicitly, plus an error state with Retry button. Polling pauses when log is missing/errored. Polling resilience: - Exponential backoff on consecutive errors: 1.2s → 2s → 5s → 15s → 30s. - Resets to 1.2s on first successful read. - Stops polling when run is terminal AND fully drained (uses new server `eof` field). - Pauses polling when inspector is closed or tab is hidden (visibilitychange listener); resumes on visible. Server changes: - `LogReadResult` gains `total_bytes` (file size at read time) and `eof` (whether next_offset == total_bytes after this slice). All three constructor sites populated. Lets the client render "Final · {size}" headers and detect drain without a second HEAD request. - `LogsResponse` TS type updated. 
Tests: - `tests/browser/test_log_buffering.py` — 7 paired Playwright tests: - 5MB log renders <1.5MB DOM with elided-bytes header - Live state + polling header for active runs - Final state + line count for terminal runs - Missing-file path display + paused polling - Error backoff schedule (gap_first ≈ 2s, gap_second ≈ 5s) - Polling stops on inspector close - Polling stops on tab hidden - Browser suite: 15 passed (7 new + 7 cluster A + 1 smoke) - Default suite: 1076 passed (no regressions) - npm run web:typecheck: clean - npm run web:build: 277.82 kB JS / 33.34 kB CSS Note: pre-existing basedpyright warnings in service.py around `_record_event` calls (lines 393-544) are not introduced by this commit; they predate cluster B and are flagged because basedpyright now analyzes the file when it's touched. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
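The polling backoff and the server-side `eof` flag from this commit can be sketched as below. The real client is TypeScript; Python is used here only for illustration. `next_poll_delay`, `read_log_slice`, `should_keep_polling`, and the `text` field are hypothetical; only `next_offset`, `total_bytes`, and `eof` are named in the commit message.

```python
from dataclasses import dataclass

# Delays from the commit message: 1.2s normally, then 2s, 5s, 15s, 30s
# as consecutive errors accumulate.
BACKOFF_SCHEDULE_S = (1.2, 2.0, 5.0, 15.0, 30.0)


def next_poll_delay(consecutive_errors: int) -> float:
    """Delay before the next poll; a successful read resets the count to 0."""
    index = min(max(consecutive_errors, 0), len(BACKOFF_SCHEDULE_S) - 1)
    return BACKOFF_SCHEDULE_S[index]


@dataclass
class LogReadResult:
    text: str          # hypothetical field name for the returned slice
    next_offset: int
    total_bytes: int   # file size at read time
    eof: bool          # True when next_offset == total_bytes after this slice


def read_log_slice(data: bytes, offset: int, limit: int) -> LogReadResult:
    """Read up to `limit` bytes starting at `offset` and report drain state."""
    offset = min(max(offset, 0), len(data))
    chunk = data[offset:offset + limit]
    next_offset = offset + len(chunk)
    return LogReadResult(
        text=chunk.decode("utf-8", errors="replace"),
        next_offset=next_offset,
        total_bytes=len(data),
        eof=next_offset == len(data),
    )


def should_keep_polling(run_is_terminal: bool, eof: bool) -> bool:
    """Stop only when the run is terminal AND the log is fully drained."""
    return not (run_is_terminal and eof)
```

With this schedule the gap after the first error is 2s and after the second is 5s, which is what the backoff test in the list above checks.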

CLUSTER C — bundle/build integrity (closes 1 CRITICAL + 4 IMPORTANT + 1 NOTE): CRITICAL closed: - `app.py:35`: `otto web` served checked-in bundle without verifying it matches source — developer skips `npm run web:build` → silent stale UI (codex-packaging-bundle #1). IMPORTANT closed: - Python build never built frontend; `pip install -e .` shipped whatever static was in tree (#2). Vite plugin emits `build-stamp.json` on every build with source-hash + timestamp + git commit; FastAPI startup verifies freshness via `verify_bundle_freshness()`. - `web:build` ran only Vite; `web:typecheck` advisory; no CI gates (#3). New `web:verify` script chains typecheck → build → committed-check. - Default cache headers underused hashed assets (#4). New `_CacheHeaderStaticFiles` subclass: `no-store` for shell + index.html, `public, max-age=31536000, immutable` for `/static/assets/*`. - Server didn't validate `index.html` referenced JS/CSS exist (#5). Startup parses index.html and asserts every referenced static path resolves; missing → fail-fast with `npm run web:build` guidance. NOTE closed: - `[tool.setuptools.package-data]` was flat (`static/*`, `static/assets/*`); future nested assets (fonts/, images/, locale/) would silently miss wheels. Now `static/**/*` recursive glob. Files: otto/web/bundle.py (new, 263 lines), otto/web/client/vite.config.ts (build-stamp emitter), scripts/build_stamp.py (CLI for manual stamp), scripts/check_bundle_committed.py (git-diff guard for CI), package.json (web:verify script), pyproject.toml (recursive package_data), otto/web/app.py (verify_bundle_freshness call + _CacheHeaderStaticFiles). Tests: tests/test_web_bundle_freshness.py (5 tests) + tests/test_web_cache_headers.py (2 tests, 8 actual checks). 15 new server-layer tests, all green. 
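As a rough sketch of the cache-header split and the index.html reference check described in cluster C: the functions below are hypothetical stand-ins, not the `_CacheHeaderStaticFiles` subclass or the startup check in otto/web/bundle.py. The sketch assumes that anything outside `/static/assets/` gets `no-store`, which the commit message only states for the shell and index.html.

```python
from html.parser import HTMLParser

SHELL_CACHE_CONTROL = "no-store"
ASSET_CACHE_CONTROL = "public, max-age=31536000, immutable"


def cache_control_for(path: str) -> str:
    """Hashed assets are immutable; everything else is never cached (assumption)."""
    if path.startswith("/static/assets/"):
        return ASSET_CACHE_CONTROL
    return SHELL_CACHE_CONTROL


class _RefCollector(HTMLParser):
    """Collects src/href attribute values that point under /static/."""

    def __init__(self) -> None:
        super().__init__()
        self.refs: list[str] = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("src", "href") and value and value.startswith("/static/"):
                self.refs.append(value)


def missing_static_refs(index_html: str, existing_paths: set[str]) -> list[str]:
    """Return referenced static paths that do not resolve to a known file."""
    collector = _RefCollector()
    collector.feed(index_html)
    return [ref for ref in collector.refs if ref not in existing_paths]
```

A startup check along these lines would fail fast when `missing_static_refs` returns a non-empty list, pointing the developer at `npm run web:build`.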
CLUSTER D — history pagination (closes 1 CRITICAL + 2 IMPORTANT): CRITICAL closed: - `total_rows=247` displayed but only first page rendered; power user with 200+ runs stuck on page 1 (heavy-user, codex-state-management #6, codex-long-string-overflow #3). IMPORTANT closed: - `/api/state` accepted `history_page` + `page_size`; client never sent `history_page` and rendered no controls. - Page-size selector now lives in the UI (10/25/50/100, default 25); server clamps to [1, 200] to refuse stale URLs requesting unbounded slices. Implementation: - `MissionControlFilters.history_page_size: int | None` for per-request override (server clamps to safe range). - App.tsx History pane: pagination footer (Page N of M · X runs · ←/→ · jump-to + page-size selector). URL persists `hp` + `ps` query params. - Filter changes reset page to 1. - Stale deep-link `?hp=99` (out-of-range) → "Page 99 doesn't exist; jump to page 1" with reset button. Files: otto/mission_control/{model.py,serializers.py} (history_page_size plumbing), otto/web/app.py (param wiring), otto/web/client/src/{App.tsx, api.ts,types.ts,styles.css} (pagination UI), tests/browser/test_history_pagination.py (10 paired Playwright tests, all green). Verification: - Browser suite: 25 passed (8 cluster A + 7 cluster B + 10 cluster D + new smoke set from cluster C cache headers via tests/test_web_cache_headers.py) - Default suite: 1091 passed (was 1076; +15 cluster C server tests) - npm run web:typecheck: clean - npm run web:build: 277.82 kB JS / 33.34 kB CSS (rebuilt with stamp) Note: cluster C agent hit an API overload during its final summary step but all files landed cleanly on disk; verified by independently running the new test files before committing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
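The page-size clamp and the stale deep-link check from cluster D amount to a few lines of arithmetic. A minimal sketch, with hypothetical function names (the real plumbing goes through `MissionControlFilters.history_page_size`):

```python
from __future__ import annotations

import math

DEFAULT_PAGE_SIZE = 25
MIN_PAGE_SIZE, MAX_PAGE_SIZE = 1, 200


def clamp_page_size(requested: int | None) -> int:
    """Clamp a per-request page size to [1, 200]; None means the default."""
    if requested is None:
        return DEFAULT_PAGE_SIZE
    return max(MIN_PAGE_SIZE, min(MAX_PAGE_SIZE, requested))


def total_pages(total_rows: int, page_size: int) -> int:
    """Number of pages, treating an empty history as a single empty page."""
    return max(1, math.ceil(total_rows / page_size))


def page_exists(page: int, total_rows: int, page_size: int) -> bool:
    """False for out-of-range deep links such as ?hp=99 on a 10-page history."""
    return 1 <= page <= total_pages(total_rows, page_size)
```

For the 247-run case mentioned above, the default page size of 25 gives 10 pages, so `?hp=99` is out of range and the UI would offer the jump back to page 1.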

CLUSTER E — diff freshness contract (closes 1 CRITICAL + 1 IMPORTANT): CRITICAL closed: - Diff fetched once and held client-side without target/branch SHAs or merge-base; **the code merged could differ from the diff reviewed** (codex-evidence-trustworthiness #1). IMPORTANT closed: - Diff truncation was bare `truncated` suffix; user couldn't tell how much was hidden, no full-diff download path (codex-evidence-trustworthiness #2). Implementation: - `MissionControlService.diff()` enriches response with: `fetched_at`, `target_sha`, `branch_sha`, `merge_base`, `command`, `limit_chars`, `full_size_chars`, `shown_hunks`, `total_hunks`, `errors[]`. - `MissionControlService._validate_expected_diff_shas` — merge action rejects with 409 + "Re-fetch the diff to confirm what will be merged" when `expected_target_sha` / `expected_branch_sha` differ from live HEAD. - POST /api/runs/{id}/actions/{action} forwards SHAs from request body. - DiffPane renders freshness header (captured-X-ago + target/branch/base short SHAs with full-SHA tooltip; warnings when SHAs are null) + Refresh button + truncation banner ("Showing N hunks of M · X KB of Y MB" + Copy diff command). - Merge confirm dialog spells out "Land branch {short} @ {sha} into target {short} @ {sha}" with the actual SHAs. - SPA passes SHAs from most-recent diff fetch on every merge POST. CLUSTER F — boot-loading gate + first-run clarity (closes 2 CRITICAL + ~12 IMPORTANT): CRITICAL closed: - App.tsx tri-state boot gate (`loading | launcher | ready`) — main shell no longer renders before /api/projects returns; "New job" button can no longer be enabled with project undefined (codex-first-time-user #1). - Pre-submit advanced-options summary in JobDialog: "Will run with: claude · sonnet · effort=high · verification=fast" outside Advanced details with "Edit" link (codex-first-time-user #2). 
IMPORTANT closed: - Launcher subhead: "Otto runs AI coding jobs in isolated git worktrees, then lets you review logs, diffs, and merge results." - "Managed root" helper text explains current-repo isolation - Empty project list: "Create your first Otto project below" + auto-focus - First-run primary CTA: "Start first build" (reverts to "New job" once any run exists) - Build/Improve/Certify dropdown options gain helper descriptions - All commands now require non-empty intent or focus (was build-only) - Dirty-project confirm lists up to 5 dirty files with "+N more" - "Start queued job" CTA when watcher is stopped + jobs queued - Empty detail copy: "Select a task card to review logs, code changes, verification, and next action." - RunInspector tab labels: jargon-soft alternatives - Recovery actions surfaced as primary contextual buttons (Retry / Resume / Cleanup) next to run header — Advanced still has full list - HTTP-code-to-actionable-copy mapping in api.ts: 409/400/403/5xx → recovery messages (no more raw "HTTP 409") Tests: - tests/test_diff_freshness.py — 6 server tests - tests/browser/test_diff_freshness.py — 5 browser tests - tests/browser/test_first_run_clarity.py — 13 browser tests - All green; default suite 1097 (was 1091; +6 cluster E server) - Browser suite 47 (was 25; +5 cluster E + +17 cluster F) - npm web:typecheck clean; bundle rebuilt Followups for orchestrator: - SHA-mismatch refusal is opt-in by client (older callers omit SHAs and bypass the gate); consider promoting to power-user opt-out flag - Pre-fetch diff inside runActionForRun("merge") so SHA gate is non-bypassable from SPA - Related provenance gaps (proof drawer cache, ArtifactRef metadata, visual-evidence manifest, proof file digest) share same architectural fix; bundle into a future cluster Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
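The SHA gate on the merge action can be sketched as below. This is not the body of `MissionControlService._validate_expected_diff_shas`; the `ShaMismatch` exception and the function signature are hypothetical. The sketch reproduces the opt-in behavior noted in the followups: a caller that omits the expected SHAs is not checked.

```python
from __future__ import annotations


class ShaMismatch(Exception):
    """Hypothetical error carrying the HTTP status the API would return."""

    def __init__(self, status_code: int, detail: str) -> None:
        super().__init__(detail)
        self.status_code = status_code
        self.detail = detail


def validate_expected_diff_shas(
    expected_target_sha: str | None,
    expected_branch_sha: str | None,
    live_target_sha: str,
    live_branch_sha: str,
) -> None:
    """Reject the merge with 409 if a reviewed SHA no longer matches the live ref."""
    checks = (
        ("target", expected_target_sha, live_target_sha),
        ("branch", expected_branch_sha, live_branch_sha),
    )
    for label, expected, live in checks:
        # Omitted SHAs skip the check (opt-in gate, per the followups above).
        if expected is not None and expected != live:
            raise ShaMismatch(
                409,
                f"{label} moved from {expected} to {live}. "
                "Re-fetch the diff to confirm what will be merged",
            )
```

Making the expected SHAs mandatory would turn this into the non-bypassable gate that the followups suggest.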

…/launcher/lost-connection)

============================================================
5 IMPORTANT closures
============================================================

W7-IMPORTANT-1: iPhone submit button below the fold
In devices["iPhone 14"] viewport (390x664), JobDialog submit at y=674 was clipped without scroll affordance. JobDialog scroll-container now uses max-height: 100dvh - 80px and overflow-y: auto; submit button always reachable via scroll OR the dialog renders a floating bottom-bar on narrow viewports.

W8-IMPORTANT-1: JobDialog ignores Cmd+Enter from textarea
Power-user shortcut was dead. Added onKeyDown to intent textarea: (Cmd|Ctrl)+Enter now triggers submit when validation passes.

W9-IMPORTANT-2: Run double-rendered after terminal_outcome
Live[]/history[] transition not atomic; same run_id appeared in both, UI rendered twice. Client-side dedupe: when computing rows to render, exclude live items whose run_id appears in history with terminal_outcome set.

Codex first-time-user #4: "Managed root" looked like current repo disappeared.
Launcher panel adds: "Otto manages projects in isolated git worktrees so it never touches your other repos. Pick or create one below to start."

Codex error-empty-states #1: Lost-connection banner
When polling fails 3+ consecutive times, sticky banner appears: "Lost connection to Mission Control. Retrying every 5s..." with manual retry button. Auto-clears on first successful poll after.

============================================================
Tests added (14 new browser tests)
============================================================
- tests/browser/test_iphone_submit_button_reachable.py (3 tests)
- tests/browser/test_job_dialog_cmd_enter.py (3 tests)
- tests/browser/test_no_double_render_after_terminal.py (4 tests)
- tests/browser/test_launcher_managed_root_explanation.py (1 test)
- tests/browser/test_lost_connection_banner.py (3 tests)

============================================================
Test counts
============================================================
- Default: 1189 (no change; all UI fixes)
- Browser: 198 effective (was 184; +14 new)
- npm web:typecheck: clean
- npm web:build: clean

============================================================
Tally
============================================================
CRITICAL: 29 of 29 closed (100%)
IMPORTANT: ~113 of 132 closed (~86%)

Followups:
- ~19 NOTE-tier IMPORTANTs remain (paper cuts; deferred per severity-gate rule unless adjacent)
- 76 NOTE items deferred per policy
- Phase 3.5 R1-R14 actual recordings ($70-140, hours of real LLM) remain scaffolded but not captured

This effectively closes the bulk of Phase 4 IMPORTANT work. Branch is in releasable state.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
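The client-side dedupe for W9-IMPORTANT-2 and the lost-connection threshold can be sketched as follows. The real fix lives in the TypeScript client; this Python version is only an illustration, and `rows_to_render` and `show_lost_connection_banner` are hypothetical names.

```python
def rows_to_render(live: list[dict], history: list[dict]) -> list[dict]:
    """Drop live rows whose run_id already appears in history with a terminal outcome."""
    finished_ids = {
        row["run_id"] for row in history if row.get("terminal_outcome")
    }
    visible_live = [row for row in live if row["run_id"] not in finished_ids]
    return visible_live + history


def show_lost_connection_banner(consecutive_poll_failures: int) -> bool:
    """The banner appears after 3+ consecutive failures and clears on success (count 0)."""
    return consecutive_poll_failures >= 3
```

Because the filter keys on `terminal_outcome`, a run that shows up in history without a terminal outcome would still render from the live list, which keeps in-flight rows visible during the non-atomic transition.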

The architectural design doc (docs/intent-to-product-design.md) was the canonical reference that drove the redesign — adding it as item #1 in the read-in-this-order list ahead of progress.md / research.md / plan.md. Fixes the duplicate "4." numbering left by the prior edit. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Three deterministic RED-first repros for the concurrency + recursive-decomposition regime, exercising real code paths (only the agent step is an injected callable; no LLM, no spec-compile, no 90-min run; <5s total):

- #1 depth-3 dual-subtree concurrent propagation: GREEN. Owning-worktree propagation (regression guard for 89d4bad) + blocked-slice isolation with structured reason are sound. Banked as permanent regression.
- #2 5-way concurrent merge into one integration branch: RED. One-shot union repair against a moving integration target drops already-landed sibling contributions (final routes [route-0,route-2], expected all 5) instead of bounded re-entry. Composition of flock + union guard + seam re-entry under 5-way race is broken.
- #3 task_graph/spec-state concurrent terminal writes (32 children): RED. spec_state.append_event derives event_id from an unlocked line-count read (spec_state.py:328-333), so concurrent writers stale-read the same count and produce duplicate/skewed event_ids, violating the documented stable-unique contract amendments rely on (trigger_event_id linkage).

No production code changed; triage/repro pass only. Fixes dispatched next.

Repros built by Codex; root causes confirmed against real spec_state code.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
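Repro #3 hinges on reading the line count and appending without holding a lock in between. The commit does not describe the eventual fix, so the following is only one possible way to close that race, assuming a POSIX host: hold an exclusive `flock` across both the count and the append. `append_event` here is a hypothetical stand-in, not `spec_state.append_event`.

```python
import fcntl
import json
from pathlib import Path


def append_event(log_path: Path, payload: dict) -> int:
    """Append one JSONL event; the id is the line count read under an exclusive lock."""
    # "ab+" allows reading the whole file while every write still appends.
    with open(log_path, "ab+") as handle:
        fcntl.flock(handle, fcntl.LOCK_EX)
        try:
            handle.seek(0)
            event_id = handle.read().count(b"\n") + 1
            record = {"event_id": event_id, **payload}
            handle.write(json.dumps(record).encode("utf-8") + b"\n")
            handle.flush()  # make the new line visible before releasing the lock
        finally:
            fcntl.flock(handle, fcntl.LOCK_UN)
    return event_id
```

Each caller opens its own file description, so the lock serializes writers across threads and processes alike; the count and the write can no longer interleave.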

…rm contract-write invariant Plan-Gate-APPROVED redesign step S1 (builds on S0 primitive). Flips repro scene #1 GREEN; scenes #2-#5 stay RED (S1 flips only #1). - Scheduler-ordering invariant in `_process_children`: no `feature` child dispatches while a sibling `foundation` task for that parent is pending/in-flight/unverified or the parent foundation_contracts are absent/invalid — even without `depends_on` (scheduler invariant, not prompt). - Isolation gate beside `check_route_registration_isolation`: pre feature-dispatch, every foundation_contract path must be exclusively owned; an overlapping/nested feature owned_path → re-enter the architect via the existing `_reenter_or_block_architect_contract` bounded machinery with `kind="shared_foundation_not_isolated"`; bounded exhaustion → structured terminal (no crash, no silent dispatch). - Uniform contract-write invariant: a task may write a foundation_contract path only if it is the contract's owner_task_id or a contract_amendment for it (S1 emits a structured `foundation_contract_write_blocked` and does not advance the branch; S2 adds the amendment routing). ONE shared helper applied at ALL 8 enumerated v5 commit/merge admission hooks (preflight repair, integ- agent commit, root-inline, subtree-prop repair, child-verify repair, scaffold repair, _merge_child_branch pre-commit dirty/untracked + pre-merge committed child-branch delta, merge-conflict repair). - Plan-Gate must-have: `_task_entry_allows_upward_merge` / `_child_result_allows_upward_merge` now consult durable merge_blocked graph state so a stale in-memory `pass` cannot bypass the S1 gate. - Secondary: legacy `detect_scope_violations` treats a newly-created critical shared-contract path as a violation (defense-in-depth; v5 gate is primary). Verified: scene #1 GREEN, #2-#5 RED; 19 S1+S0 units GREEN; broad v5/seam suite shows ONLY the 4 known pre-existing test_v5_phase2 git-worktree-rot failures (no new); ruff clean. S0 untouched. 
Codex-implemented; Claude-reviewed (scope, no S2-S5 leak, scene-#1 not-gamed, stale-pass must-have, 8-hook enumeration). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…duler hold, honest terminal S1 Gate R1 (3 CRITICAL + 2 IMPORTANT; 8-hook enumeration + no-S2-leak confirmed): - CRITICAL-1: scheduler allowed feature dispatch when foundation contracts not yet declared even with a foundation sibling. Now an explicit task_role=="foundation" sibling HOLDS all feature dispatch until every foundation sibling is mergeable/verified AND valid parent foundation_contracts are present; pure-feature decomposition (no foundation sibling) still bypasses. - CRITICAL-2: terminal-blocked foundation silently stranded features (dropped from ready + loop break). Now affected ready/pending feature children are honestly marked merge_blocked kind="foundation_unsatisfied" (no silent drop). - CRITICAL-3 (real capstone root cause): isolation gate only caught exact contract-path overlap, NOT a feature owned_path nested under the foundation owner's broad tree (foundation owns backend/, contract backend/auth.py, feature owns backend/routers/auth.py). Now a foundation-owned tree covering a declared contract is exclusive; nested sibling feature paths are rejected via the existing shared_foundation_not_isolated architect re-entry. Repro scene #1 strengthened to the nested capstone shape (verified RED on old db5b819, GREEN on fix — genuine, not gamed). - IMPORTANT-4: `_task_entry_allows_upward_merge` now also rejects a stale verdict=="pass" carrying durable merge_blocked metadata (was only fixed in `_child_result_allows_upward_merge`). - IMPORTANT-5: removed the over-broad any-contract_amendment write allow (S2 will reintroduce a bound amendment→contract allow); integration-of-record allow explicitly deferred, not left ambiguous. Verified: scene #1 GREEN (nested shape), #2-#5 RED; 21 S1+S0 units GREEN; broad suite only the 4 known pre-existing test_v5_phase2 git-worktree-rot failures; ruff clean. S0 untouched. Codex-fixed (Codex-found via Impl Gate R1); Claude-reviewed + RED-on-old verified. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
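The nested-path case in CRITICAL-3 can be illustrated with a small path check. This is a sketch under assumptions, not the gate next to `check_route_registration_isolation`: `isolation_violations` and `_covers` are hypothetical helpers, and paths are treated as plain POSIX-style strings.

```python
from pathlib import PurePosixPath


def _covers(tree: str, path: str) -> bool:
    """True if `path` equals `tree` or lies underneath it."""
    tree_p, path_p = PurePosixPath(tree), PurePosixPath(path)
    return path_p == tree_p or tree_p in path_p.parents


def isolation_violations(
    foundation_owned: list[str],
    contracts: list[str],
    feature_owned: list[str],
) -> list[str]:
    """Feature paths that overlap a contract or nest under a foundation tree covering one."""
    exclusive = set(contracts)
    # A foundation-owned tree that covers a declared contract is exclusive as a whole.
    exclusive.update(
        tree for tree in foundation_owned
        if any(_covers(tree, contract) for contract in contracts)
    )
    return [
        path for path in feature_owned
        if any(_covers(owner, path) or _covers(path, owner) for owner in exclusive)
    ]
```

With the shape from the commit message (foundation owns `backend/`, contract `backend/auth.py`, feature owns `backend/routers/auth.py`), an exact-overlap check finds nothing, while this check flags the feature path because it sits inside the exclusive `backend/` tree.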

…er / honest terminal)

S1 Gate R2 (2 CRITICAL; nested-tree fix + IMPORTANT-4/5 confirmed correct R2, untouched). Both were the same disease: a feature child silently stranded pending instead of an honest structured terminal (the capstone hang class):

- CRITICAL-1: foundation PASSED but never produced valid foundation_contracts → features were silently held/dropped with no re-entry or terminalization (the S1 test masked it by injecting contracts externally between _process_children calls). Now: passed/mergeable foundation + absent/invalid contracts re-enters the architect via the existing bounded `_reenter_or_block_architect_contract` machinery (kind=foundation_contracts_missing_after_pass); on bounded exhaustion the dependent features are honestly merge_blocked, never left pending awaiting external metadata mutation. Test rewritten to assert the real runtime transition (no external contract injection).
- CRITICAL-2: terminalization only ran when ready_features non-empty, but a depends_on=["foundation"] feature is NOT ready when foundation failed (deps unsatisfied), so it stayed silently pending. Now terminal-foundation handling scans ALL same-parent unmerged feature siblings (not just ready) and marks each merge_blocked kind="foundation_unsatisfied"; covers foundation merge_blocked AND catastrophic/failed.

Tradeoff: terminal handling uses task-graph siblings as source of truth (not orphan pending JSONL), consistent with the existing scheduler/metadata model, no new lifecycle channel.

Verified: scene #1 GREEN (nested capstone shape), #2-#5 RED; 22 S1+S0 units GREEN incl. 2 strengthened/added scheduler tests asserting real runtime transitions; broad suite only the 4 known pre-existing test_v5_phase2 failures; ruff clean. S0 + nested-tree fix untouched.

Codex-fixed (Codex-found via Impl Gate R2); Claude-reviewed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Plan-Gate-APPROVED redesign step S2 (builds on S0/S1). Flips repro scene #5 GREEN (full lifecycle); scenes #2/#3/#4 stay RED. - _merge_child_branch union-feedback: when the union/conflict path is a declared foundation_contract the child does NOT own, no longer routes the repair to the leaf (the b15 scope-gate deadlock). Instead schedules a runnable task_role=="contract_amendment" task owned by the contract's owner_task_id, owned_paths=[contract], emits foundation_contract_amendment_repair. - Net-new lifecycle (task_graph): set_contract_amendment_blocked records last_agent_verdict, CLEARS verdict/completed_at (un-non- runnable), sets non-terminal blocked_pending_contract_amendment + blocked_on_task_id; clear_contract_amendment_blocked_tasks clears ALL leaves blocked on an amendment (Plan-Gate must-have #3, not just the first). - take_ready: new blocked_on_task_id gate (analogous to depends_on) — a leaf with an unsatisfied blocked_on is skipped, not dispatched/ terminal. - Amendment terminal-PASS → clear all blocked leaves + re-enqueue merge-only retry (scheduler re-entry w/ contract_amendment_retry_merge metadata, reuses pending/lease machinery, bypasses Lead, retries only _merge_child_branch). Amendment terminal-FAIL → each leaf honest merge_blocked (no silent hang). - Reintroduced the BOUND contract_amendment write-allow S1 removed: an amendment may write only its bound contract (owner/path match via task metadata), not any contract. - Blocked graph state authoritative over stale in-memory LeadResult(pass) (Plan-Gate must-have #2; composes with S1's hardening). Verified: scene #5 strengthened to full-lifecycle assertion (verified RED on pre-S2 78535d1, GREEN now — not gamed); scenes #1/#5 GREEN, #2/#3/#4 RED; 27 S0+S1+S2 units GREEN; broad suite only the 4 known pre-existing test_v5_phase2 failures; ruff clean. S0/S1 untouched. Codex-implemented; Claude-reviewed (scope, lifecycle, RED-on-old). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
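The `blocked_on_task_id` gate in `take_ready` and the clear-all step can be sketched with an in-memory graph. The `Task` shape and both functions below are simplified, hypothetical versions; the real task_graph is durable and also tracks leases, metadata, and retry flags.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Task:
    task_id: str
    depends_on: list[str] = field(default_factory=list)
    blocked_on_task_id: str | None = None
    verdict: str | None = None  # None means non-terminal


def take_ready(tasks: dict[str, Task], in_flight: set[str]) -> list[str]:
    """Return dispatchable task ids; blocked leaves are skipped, not made terminal."""
    ready = []
    for task in tasks.values():
        if task.verdict is not None or task.task_id in in_flight:
            continue
        if any(tasks[dep].verdict != "pass" for dep in task.depends_on):
            continue
        if task.blocked_on_task_id is not None:
            continue  # gate analogous to depends_on
        ready.append(task.task_id)
    return ready


def clear_blocked(tasks: dict[str, Task], amendment_id: str) -> list[str]:
    """Unblock ALL leaves waiting on the amendment, not just the first one."""
    cleared = []
    for task in tasks.values():
        if task.blocked_on_task_id == amendment_id:
            task.blocked_on_task_id = None
            cleared.append(task.task_id)
    return cleared
```

In this sketch two leaves blocked on the same amendment both become ready once it passes and `clear_blocked` runs, which is the behavior Plan-Gate must-have #3 asks for.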

…tlement, bound writes, bounded churn S2 Gate R1 (2 CRITICAL + 1 IMPORTANT + 1 NOTE) — the silent-hang / double-merge class: - CRITICAL-1: merge-only retry never persisted the restored terminal verdict (only in-memory) → after restart the graph had verdict=None and re-dispatched the leaf (double-merge); the scene masked it via a fake set_verdict. Now the restored `pass` is persisted ONLY after _merge_child_branch really succeeds AND a durable graph re-read shows no fresh block/retry/merge_blocked — idempotent, no restart double-merge. - CRITICAL-2: an amendment _run_child CRASH set it catastrophic without running fail-settlement → blocked leaves kept blocked_on_task_id forever (take_ready skips them = silent hang). Now ANY amendment terminalization (crash/catastrophic/failed/merge_blocked) runs _settle_contract_amendment_dependents → every blocked leaf becomes honest merge_blocked. - IMPORTANT-3: bound write-allow still let a contract_amendment task modify arbitrary NON-contract files (gate only flagged contract-overlapping paths). Now a contract_amendment task may write ONLY its bound contract path; any other changed path is rejected. - NOTE-4: futile amendment churn was unbounded (pass-without-fix → schedule another amendment forever). Now bounded per (leaf, contract) (cap=2: initial + 1 retry, matching existing bounded-retry style) → honest structured merge_blocked on exhaustion. +4 regressions: durable verdict after real (non-fake) retry + no re-dispatch; amendment crash settles all blocked leaves; amendment cannot write a non-contract file; futile-amendment bounded → terminal. Verified: scenes #1/#5 GREEN, #2/#3/#4 RED; 30 S0+S1+S2 units GREEN; broad suite only the 4 known pre-existing test_v5_phase2 failures; ruff clean. S0/S1 untouched. Codex-fixed (Codex-found via Impl Gate R1); Claude-reviewed. Tradeoff: amendment retry cap=2 (small bounded style). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…durable in-progress state) S2 Gate R2 (1 IMPORTANT; C2/I3/N4 confirmed correct R2). The merge-only-retry flag was cleared BEFORE the merge, leaving a window (verdict=None, blocked_on=None, retry=False) where a crash/restart or a second runner could re-dispatch the leaf via take_ready (in-process lease only) → double-merge class. Fix: durable `contract_amendment_retry_in_progress` set atomically when entering merge-only retry (no longer pre-clears contract_amendment_retry_merge); `take_ready` treats it non-runnable so empty-in-flight / crash-restart / second-runner cannot re-dispatch the leaf as an ordinary task; cleared ONLY atomically (single graph lock/write) with the terminal outcome — success persists `pass` + both flags; merge_blocked persists terminal + flags; fresh re-block clears stale retry flags atomically with blocked_on_task_id (preserving last_agent_verdict). Fails-closed during the window; idempotent on restart (resume/settle, never double-merge). +1 regression: simulates the exact in-retry window pre-durable-pass and asserts fresh take_ready(in_flight=set()) does NOT return the leaf. Verified: scenes #1/#5 GREEN, #2/#3/#4 RED; 31 S0+S1+S2 units GREEN; broad suite only the 4 known pre-existing test_v5_phase2 failures; ruff clean. S0/S1 + R1 fixes (C2/I3/N4) untouched. Codex-fixed (Codex-found via Impl Gate R2); Claude-reviewed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…le-recovery

S2 Gate R3 (2 CRITICAL): the R2 durable-in-progress fix closed the in-process window but (1) left the second-runner race open (mark_..._in_progress wasn't compare-and-set; _run_child ignored its return) and (2) introduced a crash/restart DEADLOCK (stale in-progress, no recovery; traded double-merge for permanent stuck).

- Atomic claim: `mark_contract_amendment_retry_in_progress` is now a compare-and-set under the existing `_locked_graph()` fcntl.LOCK_EX. It flips in_progress=True only if still retry-merge/unblocked/non-terminal and unclaimed-or-stale-with-budget; persists owner token/pid/host/heartbeat/claim-count/merge-context. `_run_child` consumes the return: False → does NOT run _merge_child_branch (yields to the owner; no double-merge, no terminalize-of-a-live-owner). One active merger at a time, cross-process.
- Bounded stale-recovery: stale = same-host owner pid gone OR heartbeat/start exceeds the bounded timeout. take_ready reopens ONLY stale retry-merge entries as merge-only retries (never ordinary Lead dispatch); remaining claim budget → reclaim+resume from durable contract_amendment_merge_context; budget exhausted → structured merge_blocked. Composes with N4's per-(leaf,contract) cap. Never deadlocks, never double-merges, never re-dispatches as ordinary.

Net invariant: exactly one runner executes a leaf's merge-only retry at a time; crash/restart always resolves to pass or honest merge_blocked within bounded attempts.

+2 regressions: concurrent-claim race (exactly one wins, loser doesn't merge); stale in-progress recovery (resume→pass or bounded→merge_blocked, never ordinary, never stuck). R2 restart-window + durable-verdict regressions still pass.

Verified: scenes #1/#5 GREEN, #2/#3/#4 RED; 33 S0+S1+S2 units GREEN; broad suite only the 4 known pre-existing test_v5_phase2 failures; ruff clean. R1 (C2/I3/N4) + S0/S1 untouched.

Codex-fixed (Codex-found via Impl Gate R3); Claude-reviewed. (Codex sub-docs research/plan-s2-amendment-retry-recovery.md included.)

Tradeoff: conservative remote-host staleness handled via timeout.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
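A compare-and-set claim under an exclusive file lock, as described in this commit, might look roughly like the sketch below. It is illustrative only: the JSON layout, the `locked_graph` and `try_claim` names, the `STALE_AFTER_S` and `MAX_CLAIMS` values, and the heartbeat-only staleness test are all assumptions, and the real code additionally checks the owner pid and terminalizes to merge_blocked when the budget is exhausted.

```python
import fcntl
import json
import os
import time
from contextlib import contextmanager

STALE_AFTER_S = 15 * 60  # assumed stale window, for illustration
MAX_CLAIMS = 3           # hypothetical claim budget


@contextmanager
def locked_graph(path):
    """Hold an exclusive flock while the graph JSON is read, mutated and rewritten."""
    with open(path + ".lock", "w") as lock:
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            with open(path) as f:
                graph = json.load(f)
            yield graph
            # Persist via rename so readers never observe a half-written file.
            tmp = path + ".tmp"
            with open(tmp, "w") as f:
                json.dump(graph, f)
            os.replace(tmp, path)
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)


def try_claim(path, task_id, owner, now=None):
    """Compare-and-set: return True only for the single caller that wins the claim."""
    now = time.time() if now is None else now
    with locked_graph(path) as graph:
        task = graph[task_id]
        # Preconditions: still a merge-only retry, not terminal, not blocked.
        if not task.get("retry_merge") or task.get("verdict") or task.get("blocked_on"):
            return False
        if task.get("in_progress"):
            stale = now - task.get("heartbeat", 0.0) > STALE_AFTER_S
            if not stale or task.get("claims", 0) >= MAX_CLAIMS:
                return False  # live owner, or recovery budget exhausted
        task.update(in_progress=True, owner=owner, pid=os.getpid(),
                    heartbeat=now, claims=task.get("claims", 0) + 1)
        return True
```

The caller has to act on the return value: a runner that gets `False` skips the merge entirely, which is the `_run_child` change the message describes.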

…(no false reclaim)

S2 Gate R4 (final round, 1 minimal must-fix; everything else confirmed acceptable, residual NOTE-level). Heartbeat was written only at claim time, never refreshed, but _merge_child_branch can legitimately run ~1800s > the 15-min stale timeout → a LIVE long-running retry owner was falsely reclaimed by a second runner (the exact race R3 closed, reopened by long merges).

Fix: owner-token-checked periodic heartbeat refresh (60s interval, well under the 15-min stale window) wrapping the awaited _merge_child_branch in the merge-only retry path. The refresher writes the heartbeat under _locked_graph() ONLY when owner==this child_session_id AND retry_in_progress AND retry_merge AND no terminal/blocked state landed (re-checked each tick; stops if owner/state no longer matches). try/finally cancels + awaits it (suppress CancelledError) on success/merge_blocked/re-block/exception: no leaked task, no post-terminal refresh. Dead/stalled owners still go stale and are bounded-recovered via the existing timeout (unchanged).

+1 regression: live long-running heartbeating owner is NOT reclaimed by a second claim (CAS still False); existing dead-owner stale-recovery still recovers; R2/R3 regressions still pass.

Verified: scenes #1/#5 GREEN, #2/#3/#4 RED; 34 S0+S1+S2 units GREEN; broad suite only the 4 known pre-existing test_v5_phase2 failures; ruff clean. R1/R2/R3 + S0/S1 untouched.

Codex-fixed (Codex-found via Impl Gate R4); Claude-reviewed.

Accepted NOTE-level residual: conservative remote/unknown-host stale timeout (dead remote owner waits out the bounded timeout before recovery).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
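The refresher pattern in this commit (a background task that is cancelled and awaited in a `finally`) can be sketched with plain asyncio. The in-memory `state` dict stands in for the locked graph file, and `refresh_heartbeat` / `run_with_heartbeat` are hypothetical names; this is a sketch of the pattern under those assumptions, not the otto implementation.

```python
import asyncio
import contextlib
import time


async def refresh_heartbeat(state, owner, interval_s):
    """Bump the heartbeat each tick, but only while `owner` still holds the claim."""
    while True:
        await asyncio.sleep(interval_s)
        still_ours = (state.get("owner") == owner
                      and state.get("in_progress")
                      and not state.get("verdict"))
        if not still_ours:
            return  # ownership or state changed: stop refreshing
        state["heartbeat"] = time.time()  # real code would write under the graph lock


async def run_with_heartbeat(state, owner, merge, interval_s=60.0):
    """Run `merge()` while keeping the claim's heartbeat fresh."""
    refresher = asyncio.create_task(refresh_heartbeat(state, owner, interval_s))
    try:
        return await merge()
    finally:
        # Always stop the refresher so no task leaks and no refresh lands
        # after a terminal state has been written.
        refresher.cancel()
        with contextlib.suppress(asyncio.CancelledError):
            await refresher
```

Awaiting the cancelled task inside the `finally` is what guarantees the refresher has actually stopped before the caller goes on to persist a terminal outcome.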

…loop (kills the 1799s hang)

Plan-Gate-APPROVED redesign step S4 (builds on S0/S1/S2). Directly fixes the user's original pain: the 1799s leaf repair-agent timeout that hung the iTracker capstone. Flips repro scene #2 GREEN; #3/#4 stay RED.

- After a SCOPED conflict repair, `_merge_child_branch` runs integration smoke in DETECTION-ONLY mode: it no longer enters `_run_integration_smoke_preflight_with_repair`'s leaf repair loop for an out-of-scope / foundation clean-deploy failure. Both leaf-reachable entry points converted: the direct post-conflict path AND the stale-target `_repair_stale_target_and_retry_merge(run_smoke_preflight=True)` path. (Root/subtree integration smoke unchanged; not leaf.)
- An out-of-scope/foundation clean-deploy failure now emits a correctly-owned foundation_repair_needed / integration_repair_needed that creates a RUNNABLE graph task and S2-blocks the leaf (reuses S2's set_contract_amendment_blocked lifecycle / atomic-claim / stale-recovery; repair_route distinguishes integration_smoke_repair from foundation contract amendments). Never a dangling event, never a 1799s leaf loop.
- In-scope failures keep existing scoped repair (no behavior change).
- v5_preflight_repair: scoped leaf conflict-repair prompts no longer demand the full acceptance oracle.

Repro scene #2 oracle refined (RED-first, verified RED on eae1f3a / GREEN now; not weakened): asserts no leaf smoke-REPAIR loop from either entry point, a runnable correctly-owned repair-need, leaf S2-blocked (not merge_blocked), detection-only smoke allowed.

Verified: scenes #1/#2/#5 GREEN, #3/#4 RED; 38 S0-S2+S4 ownership units GREEN; ruff clean; S0/S1/S2 untouched.

Codex-implemented; Claude-reviewed (RED-on-old verified; scope confirmed).

Pre-existing rot NOTE (NOT this redesign): committed test_v5_architect_retry.py patches otto.v5_runner.check_scaffold_compiles which was removed by e2329e9 (pre-session "agent-native repair Step 4") → AttributeError on 3 tests; plus the 4 test_v5_phase2 git-worktree-rot failures. Both predate + are unrelated to S0-S4 and are entangled with the user's 4 uncommitted route-isolation dirty files (deliberately NOT committed here).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

…nded pathless terminal, scoped in-scope fallback

S4 Gate R1 (1 CRITICAL + 1 IMPORTANT). The gate caught that S4 was broken on the REAL path (tests used pathful fakes):

- CRITICAL: CleanOracleIssue.paths were dropped by preflight_issues_from_clean_oracle / PreflightIssue (no path field) / smoke serialization → S4's classifier saw REAL failures as pathless → always out-of-scope → empty-bound contract_amendment (rejects all writes) + cap-check-key('') vs increment-key('integration_smoke_repair') mismatch → cap never trips → the 1799s stuck-cycle re-emerged through S2 tasks. Fixed: added optional PreflightIssue.paths (legacy None preserved; constructors/consumers audited), threaded CleanOracleIssue.paths → PreflightIssue.paths → _preflight_issue_payload → _smoke_payload_paths, plus a robust fallback reading clean_oracle_result.issues[].paths. A genuinely pathless smoke failure now terminalizes as honest structured merge_blocked kind="integration_smoke_unrouteable" (never an empty-bound amendment, never uncapped). Single consistent normalized repair_path key used for BOTH the cap check and increment.
- IMPORTANT: the in-scope leaf smoke-repair fallback entered an UNRESTRICTED full-oracle loop (no allowed_paths/scope_policy → prompt demanded full acceptance oracle; commit hook only foundation-gated). Fixed: in-scope fallback now passes allowed_paths=leaf.owned_paths + scope_policy="allowed_paths", and the repair commit hook blocks any changed path outside that allowlist before the foundation-contract gate. A leaf smoke-repair can never widen beyond its owned paths.

+3 real-path regressions: clean-oracle serialization preserves paths (RED on fa5c481; old code had no PreflightIssue.paths, classifier returned []); pathless smoke → bounded honest terminal (no empty amendment); in-scope fallback packet + commit-hook enforce owned_paths (inspects the real packet, not a monkeypatched call count).

Verified: scenes #1/#2/#5 GREEN, #3/#4 RED; 41 S0-S2+S4 ownership units GREEN; ruff clean; S0/S1/S2 untouched. The 4 test_v5_phase2 + committed test_v5_architect_retry check_scaffold_compiles-AttributeError failures remain PRE-EXISTING rot (unrelated, entangled with the user's uncommitted route-isolation work; deliberately not committed).

Codex-fixed (Codex-found via Impl Gate R1); Claude-reviewed.

Tradeoff: genuinely-pathless smoke failures terminalize immediately (honest, actionable) rather than consuming retries against a synthetic key.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
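Two of the fixes in this commit lend themselves to a small illustration: using one normalized key for both the cap check and the increment, and rejecting changed paths outside an allowlist. The sketch below is a guess at the shape of such helpers; `RepairCap`, `outside_allowlist`, and the normalization rule are hypothetical and are not taken from otto.

```python
from collections import Counter
from pathlib import PurePosixPath


class RepairCap:
    """Bounded retry counter whose check and increment share one normalized key."""

    def __init__(self, limit):
        self.limit = limit
        self._counts = Counter()

    @staticmethod
    def _key(leaf_id, repair_path):
        # One normalization used everywhere, so the check can never look at a
        # different key than the one that gets incremented.
        return (leaf_id, repair_path.strip())

    def try_consume(self, leaf_id, repair_path):
        """Return True and count the attempt if the cap has not been reached."""
        key = self._key(leaf_id, repair_path)
        if self._counts[key] >= self.limit:
            return False
        self._counts[key] += 1
        return True


def outside_allowlist(changed_paths, allowed_paths):
    """Return the changed paths that are neither an allowed path nor nested under one.

    Paths are compared lexically; ".." segments are not resolved in this sketch.
    """
    allowed = [PurePosixPath(p) for p in allowed_paths]
    rejected = []
    for raw in changed_paths:
        path = PurePosixPath(raw)
        if not any(path == a or a in path.parents for a in allowed):
            rejected.append(raw)
    return rejected
```

A commit hook built on `outside_allowlist` would refuse the commit whenever the returned list is non-empty, before any other gate runs.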

…from broad compile inputs)

S4 Gate R2 (1 CRITICAL; R1 pathless/cap + in-scope-scoping confirmed CLOSED). py_compile set CleanOracleIssue.paths = ALL compiled files (command input set), not the causal failing file → S4's all-paths-must-overlap scope check made a leaf-owned syntax error look out-of-scope → misrouted an in-scope leaf bug to the wrong owner / under-scoped repair (and the first-sorted-path fallback guessed an arbitrary owner). Fixed at both sides of the seam:

- Producer (otto/v5_clean_verify.py): new _py_compile_causal_paths parses the actual failing filename(s) from py_compile stderr/stdout; py_compile_failed.paths is now CAUSAL, not the broad input set. Audit: py_compile was the ONLY clean-oracle producer with the paths=command-input pattern; all others pass explicit/none.
- Router (otto/v5_runner.py): no first-sorted-path guess. The contract-amendment write gate now supports MULTIPLE bound paths and smoke-repair scheduling owns/binds ALL causal paths; if causal paths are empty or cannot all be bound to the selected route → honest integration_smoke_unrouteable terminal (never under-scoped, never arbitrary-owner).

Net invariant: leaf-owned causal failure stays in-scope (scoped leaf repair, unchanged); foundation/out-of-scope causal failure routes to the correct owner with ALL causal paths bound; indeterminate → honest-terminal; broad non-causal input paths never drive scope/routing.

+ real py_compile_failed multi-input regressions (leaf-owned causal → in-scope; foundation causal → routed+bound; indeterminate → unrouteable). The leaf regression directly exercises the d91cece bug (old paths=rel_files fails the causal-path assertion before routing).

Verified: scenes #1/#2/#5 GREEN, #3/#4 RED; 44 S0-S2+S4 ownership units GREEN; broad suite only the known pre-existing test_v5_phase2 + the S5-RED scene #3 (no new regression); ruff clean. S0/S1/S2 + S4-R1 untouched.

Codex-fixed (Codex-found via Impl Gate R2); Claude-reviewed.

Tradeoff: broad compile inputs no longer kept as separate routing evidence (still inspectable via the recorded oracle command).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
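As a rough illustration of the producer-side idea (report only the file the compiler actually blamed, not every file passed on the command line), the sketch below pulls filenames out of the `File "...", line N` lines that Python prints for a SyntaxError. The `causal_paths` helper is hypothetical and is not the `_py_compile_causal_paths` from the commit; other message shapes would need additional patterns.

```python
import re

# Matches the location line Python prints for a syntax error, e.g.
#   File "pkg/leaf.py", line 3
_FILE_LINE = re.compile(r'^\s*File "(?P<path>[^"]+)", line \d+', re.MULTILINE)


def causal_paths(compiler_output: str) -> list[str]:
    """Return the distinct failing filenames named in the output, in first-seen order."""
    seen: list[str] = []
    for match in _FILE_LINE.finditer(compiler_output):
        path = match.group("path")
        if path not in seen:
            seen.append(path)
    return seen
```

An empty result means the failing file could not be determined, which under the routing rule above would lead to the unrouteable terminal rather than a guessed owner.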

… decomposition)

Capstone run #1 evidence: root Lead decomposed CORRECTLY (5 children in graph: foundation + 4 features, all submit_subtask duplicate:false) in ~237s, then the run died "run_budget_seconds exceeded during root_decomposition (3000s)" at duration 642.2s: a 4.7x-early fire with a false label.

Root cause (my P4 impl, not Codex's P1-P5 logic): both phase caps were hard-set to 240s: `spec_compile` (v5_runner.py:4007) and `root_decomposition` (:4059). 240s is far too tight: a real flat compile of a 47-feature product is ~6-10min and decomposition ~4-5min. The Lead emitted all 5 children at 237s and the 240s cap killed the phase ~3s later as it returned. `_V5RunDeadlineExceeded` also always reported `run_budget_seconds` (3000) regardless of which limit fired, masking the real cause.

Fix: relax both phase caps 240 → 900s (generous hang-rails: healthy spec ~6min / decomp ~4min never killed; a genuine >15min hang still caught). run_budget_seconds remains the true total ceiling (P4 intent preserved: per-phase caps are hang detectors, the run budget is the real bound). `_await_with_run_deadline`/`_V5RunDeadlineExceeded` now report the ACTUAL fired timeout and whether it was a phase_cap vs run_budget, giving honest diagnostics so this can't mislabel again.

Verified: syntax OK; ruff clean; 12 P1-P5 tests GREEN (incl P4 wall-clock). Claude-fixed from real run-#1 logs (no Codex; out of credit). basedpyright None.get@595-598 confirmed guarded (false positive); remaining strict-typer flags in P1-P5 are non-runtime noise.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
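One way to make a deadline error name the limit that actually fired is to decide up front which bound is tighter and carry that label into the exception. The sketch below assumes this approach; `DeadlineExceeded` and `await_with_deadline` are hypothetical stand-ins, not the `_V5RunDeadlineExceeded` / `_await_with_run_deadline` code from the commit.

```python
import asyncio
import time


class DeadlineExceeded(Exception):
    """Carries which limit fired (phase cap or run budget) and its size."""

    def __init__(self, phase, kind, limit_s):
        super().__init__(f"{kind} of {limit_s:g}s exceeded during {phase}")
        self.phase = phase
        self.kind = kind
        self.limit_s = limit_s


async def await_with_deadline(awaitable, *, phase, phase_cap_s, run_started, run_budget_s):
    """Await under the tighter of the phase cap and the remaining run budget."""
    remaining = run_budget_s - (time.monotonic() - run_started)
    if phase_cap_s <= remaining:
        kind, limit, timeout = "phase_cap", phase_cap_s, phase_cap_s
    else:
        kind, limit, timeout = "run_budget", run_budget_s, max(remaining, 0.0)
    try:
        return await asyncio.wait_for(awaitable, timeout)
    except asyncio.TimeoutError:
        # Report the limit that actually fired, not a fixed label.
        raise DeadlineExceeded(phase, kind, limit) from None
```

With this shape, a phase killed by its own cap is reported as a `phase_cap` timeout even when plenty of run budget remains.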

Run #6 (mib6-001619) was the first run to traverse the full pipeline (compile→decomp→scaffold→persist→4-feature concurrent fan-out→integration). It exposed the next-layer instance of the decomp-boundary class: 3 of 4 features merge_blocked at integration on a conflict in `backend/tests/conftest.py`. The conflict packet showed `base: ""`: every feature leaf independently CREATED its own conftest.py (each needs test fixtures), and the conflict-repair agent timed out (399s) trying to reconcile divergent creates of the same shared file.

lead.md's architect guidance isolated route/API/screen registration but said nothing about shared TEST/BUILD infrastructure, which is the same kind of shared registry: a file every feature would otherwise each create or edit.

Adds a rule (general, not conftest-specific): the scaffold MUST create shared test/build bootstrap (conftest.py, tests/setup.*, jest.config.*, shared DB/session fixtures, shared mocks, shared lint/type config) and list it in shared_registry_files with leaf_edit:false; feature leaves add only their own test_<feature>.* modules under the extension globs and import the shared harness. Divergent independent creates of these files are the #1 integration merge-conflict cause.

Prompt-only root-cause fix (no new deterministic validator predicate; that would re-introduce the brittle-predicate anti-pattern this campaign is removing). Pairs with 104522a (policy-label first-try gate) for the run #7 <45min test.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

logpie wants to merge 2 commits into main from worktree-gate-pilot

coderabbitai Bot commented Mar 30, 2026

Review skipped
