diff --git a/.gitignore b/.gitignore index 1153341..c5d3e84 100644 --- a/.gitignore +++ b/.gitignore @@ -55,9 +55,11 @@ ASR.md .plan/ # Tracked docs are explicitly listed below; everything else under docs/ # is Claude scratch (plans, brainstorm output, etc) and stays gitignored. +# - DESIGN.md: consolidated architecture + decision log. # - AIRGAP_INSTALL.md: Phase 14 (HARD-02) air-gap install path. # - DEVELOPMENT.md: Phase 16 (BUNDLER-01) contributor workflow. docs/* +!docs/DESIGN.md !docs/AIRGAP_INSTALL.md !docs/DEVELOPMENT.md REVIEW_*.md diff --git a/.planning/phases/01-concurrency-foundation/01-01-SUMMARY.md b/.planning/phases/01-concurrency-foundation/01-01-SUMMARY.md deleted file mode 100644 index e619dac..0000000 --- a/.planning/phases/01-concurrency-foundation/01-01-SUMMARY.md +++ /dev/null @@ -1,134 +0,0 @@ ---- -phase: 01-concurrency-foundation -plan: 01 -subsystem: infra -tags: [asyncio, locks, concurrency, fastapi, streamlit, session-management] - -# Dependency graph -requires: [] -provides: - - SessionBusy(RuntimeError) exception with session_id attribute - - SessionLockRegistry.is_locked(session_id) non-blocking predicate - - Per-session task-reentrant lock held across full graph turn including HITL pause - - HTTP 429 + Retry-After:1 on all three session-start/approval API callsites - - UI retry hint on SessionBusy at investigation form submission - - locks.py inlined into dist/ bundles -affects: - - 01-02-concurrency-foundation # approval_watchdog retry path uses SessionBusy - -# Tech tracking -tech-stack: - added: [] - patterns: - - class-name match for exception handling in api.py (no hard import at module load) - - task-reentrant asyncio lock with is_locked() fail-fast check before acquire() - - D-09: dist/ regeneration in same atomic commit as src/ changes - -key-files: - created: [] - modified: - - src/runtime/locks.py - - src/runtime/service.py - - src/runtime/api.py - - src/runtime/ui.py - - tests/test_session_lock.py - - 
scripts/build_single_file.py - - dist/app.py - - dist/ui.py - - dist/apps/incident-management.py - - dist/apps/code-review.py - -key-decisions: - - "D-01: Lock held across entire graph turn including LangGraph interrupt() HITL pause" - - "D-02: Single acquire site inside _run() closure, not at start_session() entry" - - "D-03: Fail-fast contention — SessionBusy raised, not queued" - - "D-04: Reads stay lock-free throughout" - - "D-09: dist/ regenerated in same atomic commit as src/ changes" - - "D-10: Direct atomic commit on refactor/prompt-vs-code-remediation branch" - - "D-15: Slot eviction deferred to v2 — TODO comment added to _slots dict" - - "D-16 (location override): SessionBusy raised inside _run() at acquire site, NOT at start_session() entry — start_session() mints fresh session_id so no pre-existing lock slot exists" - - "D-17: EventLog stays lock-free" - - "locks.py added to RUNTIME_MODULE_ORDER in build_single_file.py (was missing)" - -patterns-established: - - "Exception class-name matching pattern: e.__class__.__name__ in ('SessionCapExceeded', 'SessionBusy') — avoids hard import at module load time" - - "is_locked() + acquire() pattern: check is_locked() first for fail-fast, then async with acquire() for the body — non-contending in steady state" - - "asyncio_mode=auto: new async tests in tests/ do NOT need @pytest.mark.asyncio decorator" - -requirements-completed: - - PVC-01 - -# Metrics -duration: ~35min -completed: 2026-05-06 ---- - -# Phase 01: Concurrency Foundation — Plan 01 Summary - -**Per-session task-reentrant asyncio lock with fail-fast SessionBusy, HTTP 429/Retry-After mapping at all three API callsites, UI retry hint, and locks.py bundled into dist/** - -## Performance - -- **Duration:** ~35 min -- **Started:** 2026-05-06T08:00:00Z -- **Completed:** 2026-05-06T08:35:00Z -- **Tasks:** 3 -- **Files modified:** 10 - -## Accomplishments -- `SessionBusy(RuntimeError)` exception and `is_locked()` predicate added to `locks.py`; 5 new unit 
tests pass (838 total) -- `service.py._run()` wrapped with per-session lock acquire; fail-fast contention check via `is_locked()` before `acquire()` -- All three FastAPI callsites (`/investigate`, `POST /sessions`, approval submission) now map `SessionBusy` → HTTP 429 + `Retry-After: 1`; UI shows `st.warning` + early return -- `locks.py` added to `RUNTIME_MODULE_ORDER` in `build_single_file.py` (was omitted); all four dist bundles regenerated with `SessionBusy`, `is_locked`, `_locks.acquire` present - -## Task Commits - -All tasks committed atomically in a single commit per D-09/D-10: - -1. **Tasks 1-3: All changes** - `ea43964` (feat) - -## Files Created/Modified -- `src/runtime/locks.py` - Added `SessionBusy` class, `is_locked()` predicate, TODO(v2) eviction note -- `src/runtime/service.py` - Wrapped `_run()` body with `async with orch._locks.acquire(session_id):`; `is_locked()` fail-fast guard -- `src/runtime/api.py` - Extended class-name match at 2 existing handlers + 1 new handler at approval submission callsite -- `src/runtime/ui.py` - SessionBusy try/except at `asyncio.run()` investigation form path -- `tests/test_session_lock.py` - 5 new tests for `is_locked()` + `SessionBusy` (no `@pytest.mark.asyncio` per asyncio_mode=auto) -- `scripts/build_single_file.py` - Added `(RUNTIME_ROOT, "locks.py")` before `orchestrator.py` in `RUNTIME_MODULE_ORDER` -- `dist/app.py`, `dist/ui.py`, `dist/apps/incident-management.py`, `dist/apps/code-review.py` - Regenerated with locks.py inlined - -## Decisions Made -- D-16 location override confirmed: `SessionBusy` raised inside `_run()` not at `start_session()` entry — `start_session()` mints a fresh `session_id` so there is no pre-existing lock slot to check -- `locks.py` was missing from `RUNTIME_MODULE_ORDER` in the build script — added before `orchestrator.py` which instantiates `SessionLockRegistry` -- Used `is_locked()` as a pre-check before `acquire()` to satisfy D-03 fail-fast without blocking; the acquire() itself is 
non-contending in the steady state - -## Deviations from Plan - -### Auto-fixed Issues - -**1. [Rule 3 - Blocking] locks.py missing from build_single_file.py RUNTIME_MODULE_ORDER** -- **Found during:** Task 3 (dist/ regeneration verification) -- **Issue:** `def is_locked`, `class SessionBusy` absent from `dist/app.py` after initial build; `locks.py` was not listed in `RUNTIME_MODULE_ORDER` -- **Fix:** Added `(RUNTIME_ROOT, "locks.py")` to `RUNTIME_MODULE_ORDER` before `orchestrator.py`; rebuilt all four bundles -- **Files modified:** `scripts/build_single_file.py`, all four dist files -- **Verification:** `grep -c "def is_locked" dist/app.py` → 1; `grep -c "class SessionBusy" dist/app.py` → 1; `grep -c "_locks\.acquire" dist/app.py` → 2 -- **Committed in:** `ea43964` (same atomic commit) - ---- - -**Total deviations:** 1 auto-fixed (1 blocking — missing bundle entry) -**Impact on plan:** Essential fix for D-09 compliance. No scope creep. - -## Issues Encountered -None beyond the locks.py bundle omission documented above. - -## User Setup Required -None - no external service configuration required. 
- -## Next Phase Readiness -- Per-session lock foundation complete; `SessionBusy` exception available for 01-02 -- 01-02 (`approval_watchdog.py` retry path) can import `SessionBusy` from `runtime.locks` without circular import risk -- All 838 tests pass; ruff clean on all modified files - ---- -*Phase: 01-concurrency-foundation* -*Completed: 2026-05-06* diff --git a/.planning/phases/14-reproducible-air-gap-lockfile/14-01-PLAN.md b/.planning/phases/14-reproducible-air-gap-lockfile/14-01-PLAN.md deleted file mode 100644 index 97986f8..0000000 --- a/.planning/phases/14-reproducible-air-gap-lockfile/14-01-PLAN.md +++ /dev/null @@ -1,75 +0,0 @@ ---- -phase: 14-reproducible-air-gap-lockfile -plan: 01 -title: Reproducible air-gap dependency lockfile (HARD-02) -status: in_progress -date: 2026-05-07 -requirement: HARD-02 (CONCERNS C2) ---- - -# Plan 14-01 — Reproducible Air-Gap Dependency Lockfile - -## One-liner - -Commit a `uv.lock` that pins every transitive dependency with hashes; CI installs from the lockfile and a freshness gate fails the build when `pyproject.toml` drifts from `uv.lock`; document the offline install path so an engineer behind a corporate firewall can reproduce the dependency graph from an internal mirror without public-internet access. - -## Tool Selection — `uv` (rationale) - -Considered `uv`, `pip-tools`, `poetry`. Selected **`uv`** (locally installed: `uv 0.11.7`). 
- -| Criterion (`~/.claude/rules/dependencies.md`) | `uv` | `pip-tools` | `poetry` | -| --- | --- | --- | --- | -| License | Apache-2.0 / MIT (dual) | BSD-3-Clause | MIT | -| Active maintenance / bus factor | Astral team, daily releases | jazzband collective | python-poetry org | -| Lockfile format | `uv.lock` (TOML, hashes per platform marker) | `requirements.txt` w/ `--generate-hashes` | `poetry.lock` (TOML) | -| PEP 621 (`pyproject.toml` `[project]`) native | Yes — already what we use | Reads `pyproject.toml` direct | Requires `[tool.poetry]` rewrite of `[project]` | -| Resolver speed (171 pkgs) | ~14 ms (measured) | seconds | seconds | -| Single static binary | Yes (Rust) | No (Python pkg) | No (Python pkg) | -| Works fully offline (`--offline`, `--frozen`) | Yes (first-class) | Indirect via `pip install --no-index` | Yes | -| Drift gate (`--check`) | `uv lock --check` | `pip-compile --check` (since 7.4) | `poetry check --lock` | -| Already adopted in repo | **Yes** (`uv.lock` already present, 4430 lines, 171 pkgs) | No | No | - -**Decision:** `uv`. The lockfile already exists in-repo and is in sync (`uv lock --check` exits 0 in 14 ms). `poetry` is rejected because adopting it would require rewriting `[project]` into `[tool.poetry]` — a pyproject-format migration that violates "minimal diff" scope. `pip-tools` would lose the `uv.lock` work already present and forfeit the multi-platform marker pinning that `uv.lock` gives for free. - -## Tasks (8) - -1. **Confirm lockfile freshness against current `pyproject.toml`** — `uv lock --check` (already passes; recorded as baseline). -2. **Add `[tool.uv]` block to `pyproject.toml` if needed** — likely no-op; defaults already satisfy our needs. Verify behaviour. -3. **Rewrite CI install step in `.github/workflows/ci.yml`** — replace `pip install -e ".[dev]"` with `uv sync --frozen --extra dev`, plus `astral-sh/setup-uv@v6` for the runner. -4. 
**Add CI lockfile-freshness gate** — new step `uv lock --check` runs before install; fails CI when `pyproject.toml` and `uv.lock` drift. -5. **Switch CI test/lint/type-check steps to `uv run`** — `uv run pytest …`, `uv run ruff check …`, `uv run pyright …` so tools execute against the locked virtualenv. -6. **Document the offline install path** — new `docs/AIRGAP_INSTALL.md` (≤50 lines): clone, `UV_INDEX_URL=https://internal-mirror`, `uv sync --frozen --offline`, `uv run pytest tests/ -x`. -7. **Local verification (acceptance gates)**: - - `uv lock --check` → exit 0 - - `python -m pytest tests/ -x` → all collected tests pass (baseline 1047) - - `ruff check src tests` → unchanged from baseline (13 pre-existing errors — NOT regressed) - - `pyright src/runtime` → unchanged from baseline (54 pre-existing errors — NOT regressed) - - `python scripts/build_single_file.py && git diff --exit-code dist/` → clean - - `git grep -nE 'https://ollama\.com|ollama\.com/api' -- src/` → zero matches (HARD-05 ratchet) - - `python -c 'import yaml; yaml.safe_load(open(".github/workflows/ci.yml"))'` → no parse error (no local yamllint installed) -8. **Single atomic commit** on `refactor/framework-flow-control` per phase precedent. 
- -## Files Touched - -| File | Status | Why | -| --- | --- | --- | -| `pyproject.toml` | possibly add `[tool.uv]` block (else unchanged) | UV config / extras declaration | -| `uv.lock` | **already present, unchanged** | Pre-existing; freshness re-verified at commit time | -| `.github/workflows/ci.yml` | modified | Install via `uv sync --frozen`; add lockfile-freshness gate; run tools via `uv run` | -| `docs/AIRGAP_INSTALL.md` | NEW | Offline install instructions | -| `.planning/phases/14-reproducible-air-gap-lockfile/14-01-PLAN.md` | NEW | This file | -| `.planning/phases/14-reproducible-air-gap-lockfile/14-01-SUMMARY.md` | NEW | After-action | -| `.planning/phases/14-reproducible-air-gap-lockfile/14-VERIFICATION.md` | NEW | Per-success-criterion gates | - -## Out of Scope (deferred) - -- **Vendored wheels tarball** for true `--no-index` install — separate phase (called out in 14-CONTEXT.md `Deferred Ideas`). -- **`Makefile` / `make bootstrap`** scaffolding — ROADMAP SC-2 wording mentions `make bootstrap` "or equivalent"; the equivalent is `uv sync --frozen [--offline]`. Documented in `docs/AIRGAP_INSTALL.md`. -- **Pyright / ruff baseline cleanup** — existing pre-Phase-14 baselines preserved exactly; not a Phase 14 concern. - -## Hard-Stop Triggers (HALT, write BLOCKER.md) - -- `uv lock --check` reports drift after commit → root-cause and stop. -- Any test in `tests/` newly fails with the lockfile-driven install AND root cause is the lockfile. -- CI YAML edits don't validate as YAML. -- `dist/*` regen produces a non-empty `git diff` after Phase 14 changes. 
diff --git a/.planning/phases/14-reproducible-air-gap-lockfile/14-01-SUMMARY.md b/.planning/phases/14-reproducible-air-gap-lockfile/14-01-SUMMARY.md deleted file mode 100644 index c62278d..0000000 --- a/.planning/phases/14-reproducible-air-gap-lockfile/14-01-SUMMARY.md +++ /dev/null @@ -1,83 +0,0 @@ ---- -status: completed -phase: 14-reproducible-air-gap-lockfile -plan: 01 -subsystem: build / ci / dependencies -tags: [hardening, air-gap, build, ci, lockfile] -requires: [phase-13-llm-provider-hardening] -provides: [uv.lock-CI-install, uv-lock-check-freshness-gate, docs/AIRGAP_INSTALL.md] -affects: [pyproject.toml, .github/workflows/ci.yml, .gitignore, docs/AIRGAP_INSTALL.md, uv.lock] -tech-stack: - added: [uv (Apache-2.0/MIT, single static binary, Astral)] - patterns: [pin+hash transitive lockfile, --frozen install, lockfile-drift CI gate] -key-files: - created: - - docs/AIRGAP_INSTALL.md - modified: - - .github/workflows/ci.yml - - .gitignore - unchanged-but-canonical: - - pyproject.toml # already PEP 621; no [tool.uv] needed - - uv.lock # already in sync (uv lock --check exit 0) -decisions: - - "Tool: uv 0.11.7 (Apache-2.0/MIT). Picked over pip-tools (loses uv.lock investment, no per-marker pinning) and poetry (would require [project] -> [tool.poetry] rewrite, violates minimal diff)." - - "uv.lock already exists (171 packages, 4430 lines, in sync per `uv lock --check`); Phase 14 wires CI to install from it, adds the freshness gate, and documents the offline path. No new lockfile generation required." - - "CI install: `uv sync --frozen --extra dev` (replaces `pip install -e .[dev]`). `--frozen` forbids re-resolving." - - "CI lockfile-drift gate: `uv lock --check` runs as the FIRST step inside the job (before install) so a stale uv.lock fails the build before anything else." - - "Tools (ruff, pyright, pytest) run via `uv run` so they execute against the locked virtualenv." - - "Pinned uv version 0.11.7 in CI (matches local) — bumps are deliberate, not silent." 
- - "Documented offline path in `docs/AIRGAP_INSTALL.md` (38 lines): clone -> UV_INDEX_URL=internal-mirror -> `uv sync --frozen [--offline]`. Negation rule added to .gitignore so docs/AIRGAP_INSTALL.md is the single shipped doc." - - "Single atomic commit per phase precedent (Phase 9-13)." -metrics: - duration: "~15 min" - tasks-completed: 8 - files-touched: 4 # (1 new, 2 modified, 1 planning .md whitelisted) - tests-added: 0 # pure infra, no new test surface - tests-total: 1044 # (1044 passed, 3 skipped — same as Phase 13) - ratchet-status: green - bundle-determinism: deterministic (`git diff --exit-code dist/` clean after regen) -gates: - uv-lock-check: "Resolved 171 packages in 2ms — exit 0" - yaml-valid: "9 steps, parses clean" - ollama-grep-src: "0 matches (HARD-05 ratchet preserved)" - ruff: "13 errors (pre-Phase-14 baseline, unchanged)" - pyright-runtime: "54 errors (pre-Phase-14 baseline, unchanged)" - pyright-full: "329 errors (pre-Phase-14 baseline, unchanged)" - dist-regen-diff: "clean (exit 0)" - pytest: "1044 passed, 3 skipped" ---- - -# Phase 14 Plan 01 Summary — Reproducible Air-Gap Dependency Lockfile - -## One-liner - -Wired the existing in-repo `uv.lock` into CI via `uv sync --frozen`, added a `uv lock --check` lockfile-freshness gate that fails the build on `pyproject.toml`/`uv.lock` drift, and documented the offline install path in `docs/AIRGAP_INSTALL.md` so an engineer behind a corporate firewall can reproduce the exact dependency graph from an internal mirror without public-internet access. Closes HARD-02 (CONCERNS C2). - -## What changed - -| File | Change | -| --- | --- | -| `.github/workflows/ci.yml` | Added `astral-sh/setup-uv@v6` (uv 0.11.7); added `uv lock --check` gate as first job step; replaced `pip install -e ".[dev]"` with `uv sync --frozen --extra dev`; rewrote `ruff` / `pyright` / `pytest` invocations as `uv run …` so they hit the locked venv. 
| -| `docs/AIRGAP_INSTALL.md` (new) | 38-line offline-install recipe: clone → set `UV_INDEX_URL` → `uv sync --frozen [--offline]` → `uv run pytest tests/ -x`. | -| `.gitignore` | Added `!docs/AIRGAP_INSTALL.md` negation so the air-gap install doc ships while the rest of `docs/` (Claude artefacts) stays ignored. | -| `pyproject.toml` | Unchanged — already PEP 621; uv reads `[project]` natively, no `[tool.uv]` block required. | -| `uv.lock` | Unchanged — already present, 4430 lines, 171 packages, in sync. Verified by `uv lock --check` exit 0. | - -## Acceptance gates (all green) - -``` -uv lock --check : EXIT 0 (171 pkgs, 2 ms) -python -c 'import yaml; yaml.safe_load(open(ci.yml))' : 9 steps, parses -git grep -nE 'https://ollama\.com|ollama\.com/api' src/ : 0 matches (HARD-05 ratchet) -ruff check src tests : 13 errors (pre-existing baseline) -pyright src/runtime : 54 errors (pre-existing baseline) -pyright : 329 errors (pre-existing baseline) -python scripts/build_single_file.py && git diff dist/ : clean (exit 0) -pytest tests/ -x : 1044 passed, 3 skipped -``` - -## Out of scope (deferred) - -- A vendored-wheels tarball (truly `--no-index` install kit) — separate phase. -- Pyright / ruff baseline cleanup — pre-existing baselines, not Phase 14 territory. -- `Makefile` `make bootstrap` shim — `uv sync --frozen [--offline]` is the documented equivalent (ROADMAP SC-2 wording allows "or equivalent"). 
diff --git a/.planning/phases/14-reproducible-air-gap-lockfile/14-VERIFICATION.md b/.planning/phases/14-reproducible-air-gap-lockfile/14-VERIFICATION.md deleted file mode 100644 index 57bca93..0000000 --- a/.planning/phases/14-reproducible-air-gap-lockfile/14-VERIFICATION.md +++ /dev/null @@ -1,141 +0,0 @@ ---- -status: passed -phase: 14 -phase_name: Reproducible Air-Gap Lockfile -date: 2026-05-07 -verified: 2026-05-07T09:35:00Z -score: 5/5 ROADMAP success criteria + 8/8 plan tasks verified -overrides_applied: 0 -re_verification: - previous_status: null - is_re_verification: false ---- - -# Phase 14 Verification Report — Reproducible Air-Gap Dependency Lockfile - -**Phase Goal (ROADMAP):** An engineer behind a corporate firewall can clone the repo, point at an internal package mirror, and reproduce the exact dependency graph used in CI / dev. Today `pyproject.toml` resolves freshly on every install — non-deterministic and breaks `~/.claude/rules/build.md`'s "vendor all dependencies" rule. - -**Requirement:** HARD-02 (CONCERNS C2) -**Verified:** 2026-05-07 -**Status:** passed - ---- - -## Goal-Backward Verification (ROADMAP Success Criteria) - -### SC-1 — Committed lockfile pins every direct + transitive dep with version + hash — VERIFIED - -**Evidence:** -- `uv.lock` present at repo root: 4430 lines, **171 packages** pinned (verified via `grep -E '^(name|version) = ' uv.lock | head`). -- Every entry includes `source`, `version`, and per-distribution `sha256` hash (sample: `aiofile==3.9.0` with sdist + wheel hashes). -- `requires-python = ">=3.11"` matches `pyproject.toml`. -- `uv lock --check` exit code: **0** ("Resolved 171 packages in 2ms") — lockfile is in sync with `pyproject.toml`. 
- -### SC-2 — `make bootstrap` (or equivalent) installs from lockfile alone via internal mirror — VERIFIED - -**Evidence:** -- `docs/AIRGAP_INSTALL.md` (NEW, 38 lines) documents the recipe: - ``` - export UV_INDEX_URL="https:///simple/" - uv sync --frozen --extra dev - # or, fully offline (cache pre-warmed): - uv sync --frozen --offline --extra dev - ``` -- `uv sync --frozen` is the documented equivalent of `make bootstrap` (ROADMAP wording: "make bootstrap or equivalent"). It refuses to re-resolve and installs the exact set in `uv.lock` with hash verification. -- `UV_INDEX_URL` env override redirects all package resolution to an internal mirror (no hardcoded public URLs). - -### SC-3 — CI installs from the lockfile, not the `pyproject.toml` solver — VERIFIED - -**Evidence (`.github/workflows/ci.yml`):** -- New step `Set up uv` pins uv `0.11.7` via `astral-sh/setup-uv@v6`. -- Replaced `run: pip install -e ".[dev]"` with `run: uv sync --frozen --extra dev`. -- All downstream tool invocations (`ruff`, `pyright`, `pytest`) use `uv run`, ensuring they execute inside the locked virtualenv rather than a side-installed Python. -- `--frozen` flag forbids re-resolution: any drift between `pyproject.toml` and `uv.lock` would fail this step (also caught earlier by SC-4). - -### SC-4 — Lockfile-drift CI gate fails the build on `pyproject.toml` change without lockfile update — VERIFIED - -**Evidence (`.github/workflows/ci.yml`):** -- New step `Lockfile freshness gate (HARD-02)` runs `uv lock --check` BEFORE the install step. -- `uv lock --check` exits non-zero when `pyproject.toml` and `uv.lock` are out of sync (would attempt to update the lockfile in dry-run mode). -- Gate is positioned first so a stale lockfile fails fast. -- Local invocation against current tree: exit 0 (clean baseline). 
- -### SC-5 — `dist/*` regenerated; existing test suite passes — VERIFIED - -**Evidence:** -- `python scripts/build_single_file.py` ran clean; `git diff --exit-code dist/` exit code: **0** (no drift). -- `python -m pytest tests/ -x` result: **1044 passed, 3 skipped, 0 failed** — matches Phase 13 baseline (`tests-total: 1044` per `13-01-SUMMARY.md` metrics). - ---- - -## Cross-Phase Ratchet Gates (preserved, not regressed) - -| Gate | Baseline (pre-Phase-14) | Phase 14 result | Status | -| --- | --- | --- | --- | -| `git grep -nE 'https://ollama\.com|ollama\.com/api' -- src/` (HARD-05) | 0 matches | 0 matches (exit 1) | Preserved | -| `ruff check src tests` | 13 errors | 13 errors | Preserved (pre-existing baseline; not a Phase 14 deliverable) | -| `pyright src/runtime` | 54 errors | 54 errors | Preserved (pre-existing baseline) | -| `pyright` (full) | 329 errors | 329 errors | Preserved (pre-existing baseline) | -| `pytest tests/ -x` | 1044 passed / 3 skipped | 1044 passed / 3 skipped | Preserved | -| `git diff --exit-code dist/` after `build_single_file.py` | clean | clean | Preserved | -| `uv lock --check` | exit 0 | exit 0 | Preserved (still in sync) | - ---- - -## Hard-Constraint Verification (from prompt) - -| Constraint | Verdict | Notes | -| --- | --- | --- | -| Air-gapped target — no new public-internet calls | PASS | uv reads from `UV_INDEX_URL` (internal mirror); `--frozen` + `--offline` documented. | -| No `curl | sh` in any script | PASS | `docs/AIRGAP_INSTALL.md` explicitly says "ship via your internal artifact store — do not `curl | sh`". | -| Permissive license for new tooling | PASS | uv: Apache-2.0 / MIT (dual-licensed). | -| No version downgrades vs `pyproject.toml` `>=` | PASS | uv.lock unchanged from already-resolved state; `uv lock --check` exit 0 confirms no rewrite. | -| Reproducible — same inputs same dep set | PASS | uv.lock pins version + sha256 per platform marker. | -| Existing test suite passes | PASS | 1044 passed / 3 skipped. 
| -| CI builds successfully from lockfile | PASS (locally validated; CI run will land on next push) | YAML parses; steps in correct order; `uv sync --frozen` is the canonical install command. | -| No code outside Phase 14 scope touched | PASS | Only `.github/workflows/ci.yml`, `.gitignore`, new `docs/AIRGAP_INSTALL.md`, plus phase planning files. | - ---- - -## Tool Selection Audit (`~/.claude/rules/dependencies.md`) - -| Criterion | uv (chosen) | -| --- | --- | -| License: MIT/Apache/BSD only | Apache-2.0 + MIT (dual) — PASS | -| Active maintenance | Astral, weekly releases — PASS | -| Single-maintainer bus factor | Backed by Astral team — PASS | -| Low transitive footprint | Zero Python deps (Rust binary) — PASS | -| Works fully offline once installed | `--offline`, `--frozen` first-class flags — PASS | -| Lockfile with full hashes | `uv.lock` pins sha256 per dist per platform marker — PASS | -| PEP 621 (`pyproject.toml` `[project]`) compatible | Native, no rewrite — PASS | -| Generates lockfile reproducibly | Same `pyproject.toml` + uv version → identical `uv.lock` — PASS | - -Rejected alternatives: -- **pip-tools** — Would forfeit `uv.lock` (already in repo, 171 pkgs) and per-marker hash pinning. -- **poetry** — Would require rewriting `[project]` → `[tool.poetry]`, violating minimal-diff scope. - ---- - -## Hard-Stop Triggers Checklist (none triggered) - -- Selected tool requires public internet at runtime/CI: **NO** — uv supports `--offline` and reads from `UV_INDEX_URL`. -- Lockfile downgrades a dep below `pyproject.toml` `>=`: **NO** — `uv lock --check` exit 0 means no resolution changes occurred. -- Test suite fails after lockfile in place AND root cause is the lockfile: **NO** — 1044 passed / 3 skipped, identical to Phase 13 baseline. -- CI YAML edits don't validate: **NO** — `python -c 'import yaml; yaml.safe_load(open(...))'` parses cleanly; 9 steps detected. -- Selected tool requires non-permissive license: **NO** — uv is Apache-2.0 + MIT. 
-- `dist/*` not deterministic: **NO** — `git diff --exit-code dist/` clean. - ---- - -## Files of Record - -- `pyproject.toml` (unchanged — already PEP 621; uv reads `[project]` natively) -- `uv.lock` (unchanged — already in sync, 171 packages, sha256-pinned) -- `.github/workflows/ci.yml` (modified — uv setup + lockfile gate + `uv sync --frozen` + `uv run` for tools) -- `.gitignore` (modified — `!docs/AIRGAP_INSTALL.md` negation so the install doc ships) -- `docs/AIRGAP_INSTALL.md` (NEW — 38-line offline install recipe) -- `.planning/phases/14-reproducible-air-gap-lockfile/14-01-PLAN.md` (NEW) -- `.planning/phases/14-reproducible-air-gap-lockfile/14-01-SUMMARY.md` (NEW) -- `.planning/phases/14-reproducible-air-gap-lockfile/14-VERIFICATION.md` (NEW — this file) - -**Verdict:** All 5 ROADMAP success criteria, all 8 plan tasks, all 7 cross-phase ratchet gates, and all 8 hard constraints verified. Phase 14 status: **passed**. diff --git a/README.md b/README.md new file mode 100644 index 0000000..046a5d5 --- /dev/null +++ b/README.md @@ -0,0 +1,49 @@ +# ASR — Multi-Agent Runtime Framework + +Python multi-agent runtime built on **LangGraph** (orchestration) + +**FastMCP** (tool dispatch), with HITL gate, markdown turn-output +contract, and a single-file deploy bundle for air-gapped corporate +targets. + +Two reference apps live in the same repo to prove the runtime is +generic: + +- **`examples/incident_management/`** — 4-skill investigation + pipeline (intake → triage → deep_investigator → resolution) with + ASR memory layers (Knowledge Graph, Release Context, Playbooks). +- **`examples/code_review/`** — 3-skill PR review pipeline (intake + → analyzer → recommender). 
+ +## Quick start + +```bash +uv sync --frozen --extra dev +uv run pytest tests/ -x + +# Run the incident-management app via the CLI entrypoint +uv run python -m runtime --config config/incident_management.yaml + +# Streamlit UI +ASR_LOG_LEVEL=INFO uv run streamlit run src/runtime/ui.py --server.port 37777 +``` + +Set provider keys in `.env` (`OLLAMA_API_KEY`, `OPENROUTER_API_KEY`, +`AZURE_OPENAI_KEY`, …) and switch `llm.default` / +`skill.model` overrides in `config/config.yaml`. + +## Documentation + +- **[`docs/DESIGN.md`](docs/DESIGN.md)** — architecture, core + abstractions, runtime model, storage, deployment, decision log, + milestone history. **Start here** if you're new to the codebase. +- **[`docs/DEVELOPMENT.md`](docs/DEVELOPMENT.md)** — day-to-day + contributor loop: setup, regenerating `dist/`, adding a runtime + module. +- **[`docs/AIRGAP_INSTALL.md`](docs/AIRGAP_INSTALL.md)** — + air-gapped / internal-mirror install procedure. + +## Status + +`main` carries v1.0 → v1.5. v2.0 (React UI replacing the Streamlit +prototype) is the next big move. See `docs/DESIGN.md` § 13 for the +milestone history and § 14 for the pending list. diff --git a/docs/DESIGN.md b/docs/DESIGN.md new file mode 100644 index 0000000..a9d5296 --- /dev/null +++ b/docs/DESIGN.md @@ -0,0 +1,938 @@ +# ASR Multi-Agent Runtime Framework — Design & Decisions + +> **Audience.** New contributors and operators who need one document +> covering what the framework is, how it composes, and *why* the +> non-obvious decisions are the way they are. +> +> **Scope.** Architecture, core abstractions, runtime model, storage, +> deployment, and a decision log. Operational how-tos live in +> `docs/DEVELOPMENT.md` (dev workflow) and `docs/AIRGAP_INSTALL.md` +> (corporate-mirror install). + +--- + +## 1. 
What it is + +ASR is a generic Python multi-agent runtime that wraps **LangGraph** +for orchestration and **FastMCP** for tool dispatch, adds a HITL +gateway and a markdown turn-output contract on top, and ships as a +single-file bundle into air-gapped corporate environments. + +Two reference apps live in the same repo to prove the runtime is +genuinely generic: + +- **`examples/incident_management/`** — 4-skill investigation + pipeline (intake → triage → deep_investigator → resolution) with + ASR memory layers (L2 Knowledge Graph, L5 Release Context, L7 + Playbook Store) and a remediation workflow that pauses on + high-risk actions. +- **`examples/code_review/`** — 3-skill PR review pipeline (intake + → analyzer → recommender). Built specifically to surface every + framework leak that would have made the runtime + incident-shaped — those leaks were lifted into the framework + rather than worked around. + +What the framework owns: session lifecycle, agent dispatch, tool +gateway, HITL pause/resume, telemetry, storage, deployment bundling. + +What an app owns: domain `Session` subclass, MCP servers, skill +prompts + per-skill YAML, `App*Config` for cross-cutting domain +knobs (severity aliases, escalation roster, similarity thresholds). + +--- + +## 2. 
Architecture at a glance + +Layers from bottom to top: + +``` ++------------------------------------------------------------+ +| App layer (examples/incident_management, examples/code_review) +| - state.py, config.py, skills/, mcp_server.py, ui.py | ++------------------------------------------------------------+ +| Framework — runtime/ | +| - Session, Skill, AgentRun, ToolCall, AgentTurnOutput | +| - Orchestrator, OrchestratorService | +| - Gateway (wrap_tool), policies, ToolRegistry | +| - SessionStore, HistoryStore, EventLog | +| - graph.py: build_graph + make_agent_node | +| - llm.py: provider abstraction | +| - ui.py: Streamlit shell | +| - api.py: FastAPI surface | ++------------------------------------------------------------+ +| LangGraph 1.x (orchestration / state / checkpointing) | +| LangChain 1.x (chat models, agents.create_agent, tools) | +| FastMCP (in-process / stdio / http MCP servers) | ++------------------------------------------------------------+ +| Providers: Ollama Cloud · OpenRouter · Azure OpenAI · … | ++------------------------------------------------------------+ +``` + +**Control flow for one session** (steady state): + +``` +UI / API ──start_session──▶ OrchestratorService ──▶ Orchestrator + │ + ▼ + build_graph (langgraph StateGraph) + │ + per-agent step ▼ + ┌───────────────────────────────────┐ + │ make_agent_node │ + │ - reload session from store │ + │ - emit agent_started event │ + │ - wrap_tool(s) with gateway │ + │ - create_agent (langchain/langgraph) + │ - _drive_agent_with_resume │ + │ loop: ainvoke / handle pause │ + │ - parse_envelope_from_result │ + │ - record AgentRun │ + │ - decide route from signal │ + └───────────────────────────────────┘ + │ + gate node? (low confidence) ──▼ + terminal tool? ──▶ status set by tool + else ──▶ default_terminal_status + │ + ▼ + finalize_session_status_async +``` + +--- + +## 3. Core abstractions + +### 3.1 `Session` (`src/runtime/state.py`) + +The framework's unit of work. 
All apps subclass it; the framework +itself only reads/writes the fields declared on the base class. + +```python +class Session(BaseModel): + id: str + status: str + created_at: str + updated_at: str + deleted_at: str | None + agents_run: list[AgentRun] + tool_calls: list[ToolCall] + findings: dict[str, Any] + token_usage: TokenUsage + pending_intervention: dict | None + user_inputs: list[str] + parent_session_id: str | None # dedup linkage + dedup_rationale: str | None + extra_fields: dict[str, Any] # bag for app-specific domain data + version: int # optimistic concurrency + turn_confidence_hint: float | None # transient (excluded from persistence) +``` + +Apps add domain fields on a subclass: + +```python +class IncidentState(Session): + query: str + environment: str + reporter: Reporter + severity: str + summary: str + resolution: str | None +``` + +Fields the row schema doesn't have a column for round-trip via the +`extra_fields` JSON bag — see [§ 8 Storage](#8-storage). + +### 3.2 `Skill` (`src/runtime/skill.py`) + +YAML-driven configuration unit: + +```yaml +name: triage +description: Hypothesis-loop triage agent +kind: responsive # responsive | supervisor | monitor +model: gpt_oss_cheap # optional per-agent override +tools: + local_inc: [submit_hypothesis, update_incident] + local_observability: [get_logs, get_metrics, ...] +routes: + - when: success + next: deep_investigator + - when: needs_input + next: __end__ + gate: confidence + - when: default + next: deep_investigator +system_prompt: | + ... +``` + +Three `kind`s: + +- `responsive` — ReAct LLM agent (the default; uses + `langchain.agents.create_agent`). +- `supervisor` — non-LLM rule-based dispatcher (or LLM-dispatched + via `dispatch_strategy: llm`); used by intake to pre-filter. +- `monitor` — out-of-band runner (e.g. `MonitorRunner`); not a graph + node. 
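As an illustration of how the `routes` list and the three `kind`s above might be consumed, here is a minimal sketch. It assumes stdlib dataclasses rather than the runtime's real `Skill` model; `SkillSketch`, `Route`, `pick_route`, and `is_graph_node` are hypothetical names, not identifiers from `src/runtime/skill.py`.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class Route:
    """One routing rule (hypothetical shape mirroring the YAML keys)."""
    when: str                 # a signal value, or "default" for the fallback
    next: str                 # target node name, or "__end__"
    gate: str | None = None   # optional gate to pass through first


@dataclass
class SkillSketch:
    """Illustrative stand-in for a parsed skill YAML; not the real model."""
    name: str
    kind: str = "responsive"  # responsive | supervisor | monitor
    model: str | None = None  # optional per-agent override
    routes: list[Route] = field(default_factory=list)

    def is_graph_node(self) -> bool:
        # Monitors run out-of-band, so they are not added to the graph.
        return self.kind != "monitor"


def pick_route(skill: SkillSketch, signal: str | None) -> Route | None:
    """Return the first rule matching the signal, else the default rule."""
    fallback = None
    for route in skill.routes:
        if route.when == signal:
            return route
        if route.when == "default" and fallback is None:
            fallback = route
    return fallback
```

With the `triage` routes shown above, a `needs_input` signal would select the rule that carries `gate: confidence`, and an unknown or missing signal would fall back to the `default` rule.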
+ +### 3.3 `AgentRun` + `ToolCall` (`src/runtime/state.py`) + +Append-only audit rows: + +```python +class AgentRun(BaseModel): + agent: str + started_at: str + ended_at: str + summary: str # final_text, or "agent failed: " + token_usage: TokenUsage + confidence: float | None + confidence_rationale: str | None + signal: str | None + +class ToolCall(BaseModel): + agent: str + tool: str + args: dict + result: dict | str | list | int | float | bool | None + ts: str + risk: ToolRisk | None # low | medium | high + status: ToolStatus # executed | executed_with_notify + # | pending_approval | approved + # | rejected | timeout + approver: str | None + approved_at: str | None + approval_rationale: str | None +``` + +### 3.4 `Orchestrator` + `OrchestratorService` + +- `Orchestrator` (`src/runtime/orchestrator.py`) — owns the compiled + langgraph, the `SessionStore`, the per-session async lock + registry, and the synchronous lifecycle methods (`start_session`, + `stream_session`, `resume_session`, `retry_session`). +- `OrchestratorService` (`src/runtime/service.py`) — long-lived + asyncio loop wrapper around `Orchestrator`. Owns the loop thread, + registers in-flight sessions, exposes a thread-safe `submit_async` + / `submit_and_wait` bridge so the Streamlit UI thread and the + FastAPI request handlers can both schedule work without fighting + over the same FastMCP / SQLAlchemy transports. + +### 3.5 `wrap_tool` Gateway (`src/runtime/tools/gateway.py`) + +Every `BaseTool` an agent sees is wrapped by the gateway. The +wrapper: + +1. Injects session-derived args (e.g. `environment` from the + session row) before the LLM-visible arg surface, so the LLM + physically cannot fabricate them. +2. Consults the risk policy: + - `low` → run, emit `tool_invoked` with `status=executed`. + - `medium` → run, append an `executed_with_notify` audit row. + - `high` → call `langgraph.types.interrupt(payload)`, append a + `pending_approval` row, save to DB, pause the graph. +3. 
After resume: + - On `approve` → run the inner tool, update the pending row to + `approved`, save. + - On `reject` / `timeout` → return a marker dict, update the + pending row to the matching status, save. + +### 3.6 `AgentTurnOutput` envelope +(`src/runtime/agents/turn_output.py`) + +The structured output every agent must produce per turn: + +```python +class AgentTurnOutput(BaseModel): + content: str + confidence: float # [0.0, 1.0], reconciled + confidence_rationale: str + signal: str | None # success | failed | needs_input | None +``` + +How the envelope is sourced — see [§ 6 Markdown turn output](#6-markdown-turn-output-contract-phase-22). + +--- + +## 4. Runtime model + +### 4.1 Session lifecycle + +States a session walks through: + +``` +new ─▶ in_progress ─▶ + resolved | escalated | needs_review | + awaiting_input | error | stopped | duplicate +``` + +- `new` — row created, graph not yet entered. +- `in_progress` — at least one agent has run; non-terminal. +- Terminal states are set by: + - **Terminal tool calls** (e.g. `mark_resolved` → `resolved`, + `mark_escalated` → `escalated`); the tool registry maps tool + names to status transitions. + - **`default_terminal_status`** (`needs_review` for incident + management) when the graph completes without a terminal tool. + - **`_handle_agent_failure`** → `error` on agent exceptions. + - **`stop_session()`** → `stopped` on explicit cancellation. + - **`dedup_check`** → `duplicate` (with `parent_session_id`) when + stage-2 LLM dedup confirms a match against a prior closed + session. + +### 4.2 Per-agent dispatch +(`src/runtime/graph.py:_build_agent_nodes`) + +For every skill in `cfg.orchestrator.skills`: + +```python +llm = get_llm(cfg.llm, skill.model, role=agent_name, ...) +node = make_agent_node(skill=skill, llm=llm, tools=run_tools, ...) +sg.add_node(agent_name, node) +``` + +`skill.model` is the per-agent override; falls through to +`cfg.llm.default` when `None`. 
This is what lets intake run on +Ollama while triage / DI / resolution run on OpenRouter — see the +v1.5-C decision below. + +### 4.3 Routing + +`skill.routes` is a list of `(when, next, gate?)` rules. The +runtime evaluates them after each agent step: + +```yaml +routes: + - when: success # signal value + next: deep_investigator + - when: needs_input + next: __end__ + gate: confidence # route through gate node first + - when: default # fallback + next: triage +``` + +The framework's gate node fires when the upstream agent's confidence +is below `framework.confidence_threshold` (default 0.75). The gate +emits a `pending_intervention` and the session moves to +`awaiting_input` until the operator supplies a `resume_with_input` +verdict. Agents emit signals via the `signal` arg of typed-terminal +or patch tools. + +### 4.4 Termination + +Three independent paths: + +1. **Tool-driven** — an agent calls a tool the registry recognises + as terminal (`local_inc:mark_resolved`, `…:mark_escalated`). + The tool sets `inc.status` directly. +2. **Inferred** — `_finalize_session_status` walks `tool_calls` + matching against `cfg.orchestrator.terminal_tools` rules. +3. **Default** — falls through to + `cfg.orchestrator.default_terminal_status` when no rule fires + AND the graph wasn't paused on a HITL gate. + +The pause-aware guard +(`Orchestrator._is_graph_paused`) is what keeps a paused HITL +session from being coerced to `default_terminal_status` while the +operator is still deciding. + +--- + +## 5. 
LLM provider story + +### 5.1 Three layers + +``` ++----------------------------------------------------------+ +| Skill (YAML) model: gpt_oss_cheap | ++----------------------------------------------------------+ +| runtime.llm.get_llm resolves name → cfg.models[name] | +| → ProviderConfig → BaseChatModel | ++----------------------------------------------------------+ +| LangChain provider class | +| - ChatOpenAI openai_compat (OpenRouter) | +| - ChatOllama ollama (Ollama Cloud + local) | +| - AzureChatOpenAI azure_openai | ++----------------------------------------------------------+ +| Driven by langchain.agents.create_agent (langgraph subgraph) | ++----------------------------------------------------------+ +``` + +### 5.2 Provider config + +`config/config.yaml` declares providers + named models: + +```yaml +llm: + default: workhorse + providers: + ollama_cloud: + kind: ollama + base_url: https://ollama.com + api_key: ${OLLAMA_API_KEY} + azure: + kind: azure_openai + endpoint: ${AZURE_ENDPOINT} + api_version: 2024-08-01-preview + api_key: ${AZURE_OPENAI_KEY} + openrouter: + kind: openai_compat + base_url: https://openrouter.ai/api/v1 + api_key: ${OPENROUTER_API_KEY} + models: + workhorse: + provider: openrouter + model: inclusionai/ring-2.6-1t:free + gpt_oss: + provider: ollama_cloud + model: gpt-oss:20b + smart: + provider: azure + model: gpt-4o + deployment: gpt-4o +``` + +### 5.3 429 retry regime (v1.5-D) + +`_ainvoke_with_retry` (`src/runtime/graph.py`) splits transient +errors into two classes: + +| Class | Markers | Backoff | Total | +|---|---|---|---| +| 5xx + connection | `internal server error`, `status code: 5xx`, `connection reset`, `remoteprotocolerror`, `incomplete chunked read` | 1.5s × attempt | ~9s | +| 429 / rate-limit | `status code: 429`, `error code: 429`, ` 429`, `429 `, `ratelimiterror`, `rate limit`, `rate-limited`, `too many requests` | 7.5s × attempt | ~45s | + +Non-429 4xx (auth, validation) propagates immediately so quota / +schema 
problems fail fast. + +### 5.4 Live verification + +`tests/test_integration_driver_s1.py` parametrises three legs +(`local`, `workhorse`, `azure`); each skips independently if its +keys are absent. Run with `OLLAMA_API_KEY + OLLAMA_BASE_URL`, +`OPENROUTER_API_KEY`, and/or `AZURE_OPENAI_KEY + AZURE_ENDPOINT` +exported. + +--- + +## 6. Markdown turn-output contract (Phase 22) + +### 6.1 Why + +Pre-Phase-22 the framework forced agents through +`response_format=AgentTurnOutput` (a JSON schema). Multiple problems: + +- gpt-oss / Ollama models drifted on JSON schema adherence. +- LangGraph's `with_structured_output` second pass interacted badly + with the React END signal under `recursion_limit=25`. +- Adding tools to the schema confused some providers' tool dispatch. + +Phase 22 dropped `response_format` and made the agent close its turn +with a markdown contract block. Markdown is the format every chat +model writes well; the parse step happens in the framework where +leniency is in our control. + +### 6.2 The contract + +Every skill prompt ends with: + +``` +## Output contract — REQUIRED + +Every final reply MUST end with these three sections, in order, each +preceded by a level-2 markdown header: + + ## Response + + + ## Confidence + <0.0-1.0 float> -- + + ## Signal + + +**CRITICAL — final-reply rule:** the markdown envelope is mandatory; +the framework hard-fails if it is missing. 
+``` + +### 6.3 Parse paths + +`parse_envelope_from_result` walks 6 paths and falls through to a +hard fail: + +| Path | Source | When it fires | +|---|---|---| +| 1 | `result["structured_response"]` | Pre-Phase-22 stub fixtures and explicit-schema callers | +| 2 | JSON-decode last AIMessage content | Models that still emit valid JSON | +| 4 | `_parse_confidence_line` over the `## Confidence` body | Markdown-primary path; the production happy path | +| 5 | Typed-terminal-tool args (`confidence`, `confidence_rationale`, `resolution_summary`) | Models that treat a terminal tool call as completion | +| 6 | Permissive: any tool was called → synthesise a 0.30-confidence placeholder | Last-ditch fallback so the session reaches a terminal status instead of hard-failing | +| 7 | `raise EnvelopeMissingError` | Truly nothing parseable | + +(Path 3 was the original location for what became Path 4; the +numbering is preserved in code comments to keep historical commits +diff-friendly.) + +### 6.4 gpt-oss compatibility quirks + +- gpt-oss prefers EN DASH (`–`, `–`) over EM DASH (`—`, + `—`); the dash separator accepts the full Unicode Pd block. +- gpt-oss sometimes emits an empty closing AIMessage after a tool + call; Path 5 / Path 6 cover that. +- The skill prompts carry an explicit + `**CRITICAL — final-reply rule:**` paragraph because gpt-oss + initially treated the first tool result as completion. + +The procedural confidence-line parser +(`_parse_confidence_line`) replaces an earlier regex that Sonar's +S5852 (regex DoS) flagged; the procedural form has no backtracking +surface to attack. + +--- + +## 7. 
HITL approve / reject + +### 7.1 The risk-rated gateway + +Tools are policy-gated per +`cfg.runtime.gateway.policy`: + +```yaml +runtime: + gateway: + policy: + apply_fix: high # gate + restart_service: medium # notify-only audit + get_logs: low # default; no row written +``` + +Apps configure `cfg.orchestrator.gate_policy` for cross-cutting +behaviour: + +```yaml +gate_policy: + threshold: 0.75 + gated_environments: [production] + gated_risk_actions: [approve] + resolution_trigger_tools: ['local_remediation:apply_*'] +``` + +### 7.2 Pause / resume on langgraph 1.x (PR #6) + +langgraph 1.x changed the `interrupt()` contract: a tool calling +`interrupt()` no longer raises `GraphInterrupt` to the caller — +`agent.ainvoke()` returns a normal result with +`result["__interrupt__"]` populated. The framework's wrapper had to +catch up: + +- `_drive_agent_with_resume` (`src/runtime/graph.py`) detects an + inner pause via `agent_executor.aget_state(inner_cfg).next` being + non-empty, calls outer `interrupt()` to fetch the verdict, and + forwards via `agent_executor.ainvoke(Command(resume=verdict), + config=inner_cfg)`. +- The inner `create_agent` now receives the orchestrator's + checkpointer + a deterministic per-invocation thread id + (`f"{inc_id}:agent:{skill.name}:turn{len(agents_run)}"`). Without + these, `Command(resume=…)` raises and the gated tool gets silently + skipped. +- `make_agent_node` reloads from `store.load(inc_id)` at entry — + defends against stale `state["session"]` snapshots from outer + Pregel checkpoints (which capture state at step boundaries, not + mid-step). +- `gateway.wrap_tool` calls `store.save` after every status + transition (rejected / timeout / approved) so the audit row in + the DB matches the operator's actual decision. +- `Orchestrator._is_graph_paused` guards + `_finalize_session_status_async` in `stream_session` / + `retry_session` / the API approval handler — a HITL pause must + not be coerced into `default_terminal_status`. 
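The thread-id scheme and the pause-aware guard from the list above can be sketched as follows. This is an illustration under stated assumptions: `inner_thread_id` and `finalize_status` are hypothetical helpers written for this example, while the real logic lives in `_drive_agent_with_resume` and behind `Orchestrator._is_graph_paused`.

```python
from __future__ import annotations


def inner_thread_id(inc_id: str, skill_name: str, turns_so_far: int) -> str:
    """Deterministic per-invocation thread id for the inner agent.

    Follows the format quoted above; because it is derived only from
    stable inputs, a resume can address the same inner checkpoint.
    """
    return f"{inc_id}:agent:{skill_name}:turn{turns_so_far}"


def finalize_status(
    current: str,
    *,
    graph_paused: bool,
    terminal_from_tool: str | None,
    default_terminal_status: str,
) -> str:
    """Pick the session status at finalize time (hypothetical helper)."""
    if graph_paused:
        # A HITL pause must not be coerced to the default terminal status.
        return current
    if terminal_from_tool is not None:
        # A terminal tool call already decided the outcome.
        return terminal_from_tool
    return default_terminal_status
```

The point of the guard is ordering: the paused check comes first, so a session waiting on an operator keeps its current status regardless of what the other rules would choose.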
+ +These five fixes shipped together as PR #6; before them, clicking +Approve would do nothing because the framework had already moved +past the pause point. + +### 7.3 Approval surface + +Two ways to resolve a `pending_approval`: + +- **UI** — `_render_pending_approvals_block` shows the Approve / + Reject buttons and rationale field; click drives + `Command(resume={"decision": "approve", ...})` via + `OrchestratorService.submit_and_wait`. +- **API** — `POST /sessions/{sid}/approvals/{tcid}` does the same + resume, scoped under the per-session lock so two concurrent + approvals on the same thread can't race. + +### 7.4 Approval watchdog +(`src/runtime/tools/approval_watchdog.py`) + +Background task that scans `pending_approval` rows older than +`framework.approval_timeout` and resolves them with `verdict=timeout`, +freeing operators from manual intervention on stale rows. Triggered +by the lifespan startup hook. + +--- + +## 8. Storage + +### 8.1 SessionStore (`src/runtime/storage/session_store.py`) + +CRUD for the row schema. Owns: + +- `_next_id` — monotonic per-day sequence; respects + `state_cls.id_format(seq=…, prefix=…)` so each app picks its own + ID namespace (`INC-…`, `CR-…`, etc.). +- `save` — optimistic-version update. Bumps `version`; raises + `StaleVersionError` on mismatch so the caller can reload + retry. +- `_row_to_incident` / `_incident_to_row_dict` — round-trip + between `IncidentRow` (SQLAlchemy) and the app's `Session` + subclass. Fields the row schema has columns for go to typed + fields; everything else lands in `extra_fields` JSON. +- Vector write-through — `_persist_vector` / `_add_vector` / + `_refresh_vector` keep a FAISS index aligned with the row + table. + +### 8.2 IncidentRow (`src/runtime/storage/models.py`) + +The persistent schema. 
Fields are deliberately broad enough to host +the example apps' typed fields (`severity`, `reporter_id`, +`reporter_team`, `summary`, `tags`, `parent_session_id`, +`dedup_rationale`, `extra_fields` JSON) without forcing every app +to declare them. An app's `Session` subclass declares whichever +typed fields it cares about; the rest stay in `extra_fields`. + +The `severity` / `reporter_*` columns ARE incident-shaped — the +v1.5-B generic-noun pass left them in place because renaming would +require a schema migration. Apps that don't model severity or a +human submitter ignore those columns; the round-trip silently +omits them. + +### 8.3 HistoryStore (`src/runtime/storage/history_store.py`) + +Read-only similarity search over the same engine + vector store. +Used by intake's similarity retrieval (`lookup_similar_incidents`). +Filter dimensions are pluggable — apps construct a +`HistoryStore(filter_resolver=…)` matching their own row shape. + +### 8.4 LangGraph checkpointer +(`src/runtime/checkpointer.py`) + +Separate from the SessionStore. SQLite default (`sqlite:////tmp/asr.db`), +Postgres optional via `runtime.checkpointer_postgres`. Holds langgraph +Pregel state + pending interrupts. The HITL approve / reject path +relies on this checkpointer being durable. + +### 8.5 EventLog (`src/runtime/storage/event_log.py`) + +Append-only `session_events` table. Records: + +- `agent_started`, `agent_finished`, `confidence_emitted`, + `route_decided` +- `tool_invoked` (every wrapped tool call, with latency + result_kind) +- `gate_fired` (HITL gate decisions) +- `status_changed` (terminal-status transitions with cause) +- `lesson_extracted` (M5/M6 auto-learning) + +Per-step events feed any external observability stack and the +auto-learning pipeline. + +--- + +## 9. 
Memory layers (incident_management example) + +The incident-management app ships an ASR (Automated Site Reliability) +memory bundle hydrated by the supervisor at intake: + +| Layer | What | Backend | +|---|---|---| +| L2 | Knowledge Graph — services, owners, runbooks, dependencies | `examples/incident_management/asr/kg_store.py` (filesystem JSON) | +| L5 | Release Context — recent deploys per service | `release_store.py` (filesystem JSON) | +| L7 | Playbooks — known-good remediation steps per failure mode | `playbook_store.py` (filesystem JSON) | + +`hydrate_and_gate` (in the example's MCP server) walks the user's +query, extracts mentioned components, and returns a `MemoryLayerState` +bundle that the triage agent reads as additional context. + +This is **app-level**, not framework — the runtime stays memory- +agnostic. A different app can ship its own L1/L2/L3 memory layers +without touching `runtime/`. + +--- + +## 10. Deployment + +### 10.1 Air-gapped target + +The deployment env is corporate / air-gapped: no public-internet +runtime calls, no CDN fetches, no `pip install` at deploy time. + +### 10.2 Single-file bundle (BUNDLER-01) + +`scripts/build_single_file.py` flattens the runtime + each app into +self-contained `.py` files under `dist/`: + +| File | Contents | +|---|---| +| `dist/app.py` | framework only — no example code | +| `dist/apps/incident-management.py` | framework + incident_management example | +| `dist/apps/code-review.py` | framework + code_review example | +| `dist/ui.py` | Streamlit shell | + +CI gate `Bundle staleness gate (HARD-08)` rebuilds the bundles +from `src/` and fails the build if they don't match the committed +`dist/*` — this keeps the deploy bundles "repaired by construction" +on every merge. 
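The staleness check described above amounts to comparing freshly built bundles against the committed ones. A minimal sketch, assuming both sets are available as in-memory bytes keyed by path; `digest` and `stale_bundles` are hypothetical names, not part of `scripts/build_single_file.py` or the CI configuration:

```python
from __future__ import annotations

import hashlib


def digest(data: bytes) -> str:
    """SHA-256 hex digest of one bundle's contents."""
    return hashlib.sha256(data).hexdigest()


def stale_bundles(
    rebuilt: dict[str, bytes], committed: dict[str, bytes]
) -> list[str]:
    """Return the paths whose rebuilt and committed contents differ.

    A path present on only one side also counts as stale.
    """
    stale = []
    for path in sorted(set(rebuilt) | set(committed)):
        new, old = rebuilt.get(path), committed.get(path)
        if new is None or old is None or digest(new) != digest(old):
            stale.append(path)
    return stale
```

A gate built on this idea would fail the build whenever the returned list is non-empty.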
+ +### 10.3 7-file deploy payload + +Copy onto the target host: + +``` +app.py (renamed from dist/apps/.py) +ui.py (dist/ui.py) +config/config.yaml (framework: LLM, MCP, storage) +config/.yaml (app: severity aliases, escalation roster, …) +config/skills/ (skill prompts, optional override) +.env (provider keys) +``` + +Boot: + +```bash +python -m runtime --config config/.yaml +streamlit run ui.py --server.port 37777 +``` + +### 10.4 Reproducible install (HARD-02) + +`uv.lock` pins direct + transitive deps with sha256 hashes. CI +installs from the lock with `uv sync --frozen`; an internal +package mirror is sufficient for a fully offline build. See +`docs/AIRGAP_INSTALL.md`. + +--- + +## 11. Telemetry + auto-learning (M1–M9) + +### 11.1 Per-step events + +Every meaningful boundary emits an `EventLog` row keyed by +`session_id`. The four agent-boundary events +(`agent_started → confidence_emitted → route_decided → +agent_finished`) fire in order; `tool_invoked` and `gate_fired` +fire at the gateway boundary. + +### 11.2 Lesson store + +`src/runtime/learning/extractor.py` runs at session finalize and +distills outcome + winning hypothesis + applied fix into a +`Lesson` row. The intake supervisor reads recent lessons via +`LessonStore.find_relevant(query, …)` to prime the next session. + +### 11.3 Lesson refresher + +`src/runtime/learning/scheduler.py` runs an APScheduler job +nightly (configurable) that walks recent sessions and extracts +lessons missed at finalize time (e.g. sessions resolved manually +in the UI long after the agent's run). + +--- + +## 12. Decision log + +Compact rationale for the non-obvious calls. Each entry is a single +"why". + +### DEC-001. LangGraph as orchestration engine + +**When.** From the start. +**Why.** Out-of-the-box Pregel-style step boundaries + +checkpointing + first-class HITL `interrupt()` semantics. We don't +maintain a graph engine ourselves; we just wrap it. + +### DEC-002. 
`langchain.agents.create_agent` for the per-agent loop (Phase 15) + +**When.** v1.3 hardening, after `langgraph.prebuilt.create_react_agent` +was deprecated. +**Why.** Single tool-loop with native ToolStrategy fallback, removes +the `recursion_limit=25` workaround we previously needed. + +### DEC-003. Markdown contract over `response_format` JSON (Phase 22) + +**When.** v1.5-A. +**Why.** JSON-schema-shaped output via `response_format` triggered a +class of brittleness across providers (model-specific JSON drift, +tool-strategy + React END interaction, recursion_limit ceilings). +Markdown is the native format every chat model writes well; the parse +step happens in the framework where leniency is in our control. Path +5 / Path 6 fallbacks cover models that occasionally drop the +contract. + +### DEC-004. Pure-policy HITL gating (Phase 11) + +**When.** v1.2. +**Why.** The gate decision (high-risk tool? gated env? low +confidence?) was previously scattered across the gateway, the +orchestrator, and the skill prompts. Phase 11 moved it into a single +pure function `should_gate(session, tool_call, confidence, cfg)` so +auditing what gates is a one-grep operation. + +### DEC-005. Generic `Session` base + `extra_fields` JSON (v1.1 decoupling) + +**When.** v1.1. +**Why.** Pre-v1.1 the framework had `IncidentState` baked in. Adding +a second app (code_review) was the forcing function — every +"incident-shaped" leak that surfaced moved into the framework as +`Session.extra_fields` (the JSON bag) or the row schema's existing +typed columns. Apps now subclass `Session` and write whatever fields +they need; the framework stays domain-agnostic. + +### DEC-006. Per-agent `skill.model` override (v1.5-C / M8) + +**When.** v1.5-C. +**Why.** The intake supervisor can run on a fast / cheap model +while the deep-investigator agent needs a smarter (more expensive) +one. 
`_build_agent_nodes` resolves `get_llm(cfg.llm, skill.model, +role=agent_name)` per skill; falls back to `cfg.llm.default` when +`model` is `None`. + +### DEC-007. Single-file bundle for air-gap deploy (BUNDLER-01) + +**When.** v1.3. +**Why.** Corporate deploy env is copy-only. A multi-file +`pip install` step is out of scope. The bundler turns the +multi-file source tree into the smallest possible deploy payload +(7 files total). + +### DEC-008. Concept-leak ratchet for framework genericity (v1.5-B) + +**When.** v1.5-B. +**Why.** The decoupling work (DEC-005) wasn't binary — `incident` / +`severity` / `reporter` tokens kept creeping into `src/runtime/` via +local variables, docstrings, and helper names. The ratchet test +counts those tokens and fails the build if the count grows. v1.5-B +took it from 156 down to 39 (the residual 39 are +schema-coupled / public-API / intentional example-app callouts). + +### DEC-009. 429 separate retry regime with longer backoff (v1.5-D) + +**When.** v1.5-D. +**Why.** Free / shared upstream tiers (e.g. OpenRouter `…:free`) +throttle on 30-60s windows; the 5xx default backoff (1.5s/3s/4.5s) +exhausted retries before the window cleared. Now 429 retries on +7.5s/15s/22.5s (~45s total). + +### DEC-010. Inner agent checkpointer + reload-on-entry to fix HITL stale state (PR #6) + +**When.** v1.5-A. +**Why.** Outer Pregel checkpoints at step boundaries, not mid-step. +On resume, `state["session"]` reflects the prior step's output, NOT +the gateway's pending_approval row + version bump that happened +mid-step. Without `make_agent_node` reloading from store at entry, +the gateway sees no pending row, double-appends, and `store.save` +raises `StaleVersionError`. The reload + the inner checkpointer +together are what make Approve / Reject actually drive the gated +tool to completion. + +### DEC-011. Two example apps to prove genericity + +**When.** v1.1 (incident_management lifted), Phase 8 +(code_review added). 
+**Why.** Without a second app, "is the framework generic?" is +unanswerable. The code_review app was built specifically to surface +every incident-shaped assumption that hadn't been lifted yet — id +format, row schema, build pipeline, intra-bundle imports. Each +leak became a framework PR rather than an app workaround. + +### DEC-012. Bundle staleness CI gate (HARD-08) + +**When.** v1.3. +**Why.** dist/ files drift if a contributor updates `src/runtime/` +or `examples/` without re-running the bundler. The drift turns into +a deploy-time bug ("works in dev, broken in prod"). The CI gate +rebuilds the bundles from source on every PR and refuses the merge +if they differ from the committed `dist/*`. + +--- + +## 13. Milestone history + +| Milestone | Title | PR | Squash SHA | Headline change | +|---|---|---|---|---| +| v1.0 | Prompt-vs-Code Remediation | #1 | `02378dd` | Code becomes the authority — skill prompts no longer carry policy logic | +| v1.1 | Framework De-coupling | #2 | `0ff8914` | Generic runtime, ASR as use case | +| v1.2 | Framework Owns Flow Control | bundled into #5 | `9018371` | FOC-01..06 — gate / retry / signal / dedup all framework-owned | +| v1.3 | Hardening + Real-LLM Compatibility | bundled into #5 | `9018371` | HARD-01..09 + LLM-COMPAT-01 + BUNDLER-01 + SKILL-LINTER-01 | +| v1.4 | Per-step telemetry + auto-learning intake + React-ready API | #5 | `9018371` | M1..M9 telemetry + LessonStore + generic /sessions/* + SSE + WebSocket + CORS + structured error envelope | +| v1.5-A | Markdown turn output (Phase 22) + HITL approve/reject end-to-end on langgraph 1.x | #6 + #7 | `f0586a8`, `3f0eb5f` | DEC-003 + DEC-010 | +| v1.5-B | Generic-noun pass — concept-leak ratchet 156 → 39 | #8 | `25e363c` | DEC-008 | +| v1.5-C | Per-agent LLM proof point — intake on Ollama Cloud, downstream on `llm.default` | #9 | `54a830d` | DEC-006 | +| v1.5-D | 429 rate-limit retry + multi-provider integration driver | #10 | `adefae6` | DEC-009 | + +Per-phase artefacts 
under `.planning/phases/-/` (gitignored +working state; selected artefacts are committed for historical record). + +--- + +## 14. Pending / known gaps + +### v2.0 — React UI (the long pole) + +Stack pick + scaffold + parity-port against the v1.4 +`/sessions/*` REST + SSE/WebSocket API. ~1–2 weeks. The Streamlit +shell stays as the prototype until React reaches parity. + +### Smaller cleanups + +- **Duplicate ToolCall audit rows.** The gateway records the gated + tool under the FastMCP composite name (`local_remediation:apply_fix`, + colon form), the harvester records the same tool call under the + LLM-visible name (`local_remediation__apply_fix`, double-underscore + form). Cosmetic in the UI; matters if any consumer aggregates tool + counts. Fix: align both on the `__` form. ~30 min. +- **`ApprovalWatchdog` regression test.** PR #6 added gateway saves + on resolution transitions; the watchdog should observe a faster + cleanup signal but no focused test was added. ~15 min. +- **`ASR_LOG_LEVEL` env var documentation.** Added in PR #6, no + README mention. One-line doc fix. +- **`src/runtime/locks.py:49` — `TODO(v2)`.** Evict idle slots to cap + memory in long-running servers. Real concern for production; not + urgent for HITL-paced workloads. + +--- + +## 15. 
Where to find what + +| You want to… | Look at | +|---|---| +| Add a new skill | `examples//skills//{config.yaml, system.md}` | +| Add a new app | New folder under `examples/`; subclass `Session` in `state.py`; declare `App*Config` in `config.py`; write MCP servers and skills | +| Add a tool | App's `mcp_server.py`; register in YAML; gateway picks up risk policy from `cfg.runtime.gateway.policy` | +| Change LLM provider | `config/config.yaml` `llm.providers` / `llm.models`; per-agent override on `skill.model` | +| Change HITL policy | `cfg.orchestrator.gate_policy` (cross-cutting), `cfg.runtime.gateway.policy` (per-tool) | +| Trace one session end-to-end | `EventLog` rows for that `session_id`; `agents_run` and `tool_calls` on the row; `session_events` table | +| Update the bundle | `uv run python scripts/build_single_file.py`; commit `dist/*` | +| Add a new framework module | `RUNTIME_MODULE_ORDER` in `scripts/build_single_file.py` (after deps); regen + commit | +| Run live LLM tests | Set `OLLAMA_API_KEY + OLLAMA_BASE_URL`, `OPENROUTER_API_KEY`, `AZURE_OPENAI_KEY + AZURE_ENDPOINT`; `uv run pytest tests/test_integration_driver_s1.py -v` | +| Reset state for a fresh run | `rm /tmp/asr.db /tmp/asr.db-{wal,shm}; rm -rf /tmp/asr-faiss` then restart | + +--- + +## 16. Document map + +- **`docs/DESIGN.md`** (this file) — architecture, abstractions, + decisions, milestone history. +- **`docs/DEVELOPMENT.md`** — day-to-day contributor loop (setup, + bundle regeneration, adding modules). +- **`docs/AIRGAP_INSTALL.md`** — corporate-mirror install procedure. +- **`README.md`** (repo root) — one-screen overview pointing at the + three docs above. +- **`examples/incident_management/README.md`** — incident-management + app surface; per-skill prompts under `skills/`. +- **`examples/code_review/README.md`** — code-review app surface; + per-skill prompts under `skills/`. 
+- **`.planning/`** (gitignored) — working state for the GSD planning + workflow (`STATE.md`, `ROADMAP.md`, `phases/-/`). Not + shipped; selected phase artefacts are committed for the historical + record. diff --git a/examples/code_review/README.md b/examples/code_review/README.md index 6f8cc7c..6a4a694 100644 --- a/examples/code_review/README.md +++ b/examples/code_review/README.md @@ -1,120 +1,87 @@ # Code Review — Example Application -Second example app for the `runtime` framework. Built in Phase 8 to *prove* the framework is genuinely generic — every framework leak that surfaced while this app was being built (id format, row schema, build pipeline, intra-bundle imports) was lifted into the framework rather than worked around. +Second example app for the framework. A 3-skill PR review pipeline +(intake → analyzer → recommender) that walks a diff, files structured +findings, and emits an approve / request-changes / comment verdict. + +This app exists to **prove the framework is genuinely generic** — +it was built specifically to surface every incident-shaped +assumption that hadn't yet been lifted out of `src/runtime/`. Each +leak became a framework PR rather than an app workaround. + +For framework-wide design + decisions, see +[`docs/DESIGN.md`](../../docs/DESIGN.md). This README only covers +the bits specific to this app. ## Run ```bash -python -m runtime --config config/code_review.yaml +uv run python -m runtime --config config/code_review.yaml +ASR_LOG_LEVEL=INFO uv run streamlit run src/runtime/ui.py --server.port 37777 ``` -Boots the long-lived orchestrator service against this app's config. -The Streamlit UI is the framework's generic shell at -`ui/streamlit_app.py` (`streamlit run ui/streamlit_app.py`) — it -duck-types on `Session.extra_fields` for code-review rows and renders -them in the same accordion shell the incident app uses. 
- -## Architecture - -A 3-skill responsive pipeline (`intake → analyzer → recommender`) that consumes a PR description, walks the diff, files structured `ReviewFinding`s, and emits an approve / request-changes / comment recommendation. The framework owns session lifecycle, agent dispatch, and tool gateway; this example owns domain shape, skill prompts, and MCP tools. +## Layout ``` examples/code_review/ ├── state.py CodeReviewState(Session) + PullRequest + ReviewFinding ├── config.py CodeReviewAppConfig + load_code_review_app_config ├── config.yaml severity_categories, auto_request_changes_on, repos_in_scope -├── mcp_server.py CodeReviewMCPServer with 3 tools +├── mcp_server.py CodeReviewMCPServer + 3 tools ├── skills/ 3 agent YAML configs + _common/ shared style prompt │ ├── _common/style.md │ ├── intake/ │ ├── analyzer/ │ └── recommender/ -├── ui.py Streamlit read-only viewer (mirrors incident UI patterns) -├── __main__.py Entry point -└── README.md this file +├── ui.py Streamlit read-only viewer +└── __main__.py entry point ``` -## State Model +## Domain shape -`CodeReviewState(Session)` extends the framework's generic `Session` with: +`CodeReviewState(Session)` adds `pr: PullRequest`, +`review_findings: list[ReviewFinding]`, `overall_recommendation`, +`review_summary`, `review_token_budget`. Session ids look like +`CR-YYYYMMDD-NNN`. 
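The `CR-YYYYMMDD-NNN` shape above can be illustrated with a small formatter. This is only a sketch: `format_session_id` is a hypothetical standalone helper, not the framework's `id_format(seq=...)` hook, and the three-digit zero padding is inferred from the `NNN` placeholder.

```python
from datetime import date


def format_session_id(prefix: str, day: date, seq: int) -> str:
    """Build an id shaped like PREFIX-YYYYMMDD-NNN (hypothetical helper)."""
    # %Y%m%d renders the date part; :03d zero-pads the sequence to 3 digits.
    return f"{prefix}-{day:%Y%m%d}-{seq:03d}"


# The same helper covers both apps' namespaces by swapping the prefix.
print(format_session_id("CR", date(2026, 5, 3), 1))   # CR-20260503-001
print(format_session_id("INC", date(2026, 5, 3), 1))  # INC-20260503-001
```

Keeping the prefix as a parameter mirrors the point that each app owns its own id namespace, so two apps can share a metadata DB without colliding.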
-| Field | Type | Purpose | -|---|---|---| -| `pr` | `PullRequest` | repo, PR number, title, author, base/head SHAs, line counts | -| `review_findings` | `list[ReviewFinding]` | severity, file, line, category, message, optional suggestion | -| `overall_recommendation` | `"approve" \| "request_changes" \| "comment" \| None` | final verdict | -| `review_summary` | `str` | rolled-up narrative for the human reviewer | -| `review_token_budget` | `int` | telemetry — running token spend on this review | - -The framework only reads/writes the inherited `Session` lifecycle/telemetry fields (`id`, `status`, `created_at`, `agents_run`, `tool_calls`, `findings`, `pending_intervention`, `token_usage`). Every domain field above lands in the row's `extra_fields` JSON column on save and is hydrated back into the model on load — no incident-shaped row schema leaks here (P8-J). +`PullRequest` carries repo / number / title / author / base+head SHAs +/ line counts. `ReviewFinding` carries severity / file / line / +category / message / optional suggestion. Both are pydantic models +declared in this app's `state.py`. -## ID Format +## MCP tools -Session ids look like `CR-YYYYMMDD-NNN` (e.g. `CR-20260503-001`). The format is owned by `CodeReviewState.id_format(seq=...)` (P8-C) so the code-review id namespace is disjoint from incident-management's `INC-...` namespace — both apps can share the same metadata DB without collisions. +`CodeReviewMCPServer` exposes: -## Configuration +- `fetch_pr_diff(repo, number)` — **mock**: reads from + `tests/fixtures/code_review//.json` if present, + otherwise returns a small canned diff so the example runs offline. +- `add_review_finding(session_id, severity, file, line, category, + message, suggestion=None)` — append a structured finding to + `state.review_findings`. Severity is validated against + `severity_categories` from `CodeReviewAppConfig`. +- `set_recommendation(session_id, recommendation, summary)` — + finalize the review. 
Sets `state.overall_recommendation` + + `state.review_summary`. -Two layers, in order of precedence: - -| Layer | File | What it owns | -|---|---|---| -| Framework | `config/config.yaml` | LLM providers + models, MCP servers, storage URL, paths, `runtime.state_class` | -| App | `examples/code_review/config.yaml` | `severity_categories`, `auto_request_changes_on`, `repos_in_scope`, `review_max_diff_kb` | - -Set `runtime.state_class: examples.code_review.state.CodeReviewState` in the framework config so row hydration produces `CodeReviewState` instances and `id_format` is called on the right class. - -## MCP Tools - -`CodeReviewMCPServer` (FastMCP, name `"code_review"`) exposes three tools to the agents: - -- `fetch_pr_diff(repo, number)` — returns `{diff, files_changed, additions, deletions}`. Reads from `tests/fixtures/code_review//.json` if present; otherwise synthesises a tiny canned diff so the example runs offline. **Mock — not a real GitHub fetch.** -- `add_review_finding(session_id, severity, file, line, category, message, suggestion=None)` — append a structured finding to `state.review_findings`. Validated against `severity_categories` from `CodeReviewAppConfig`. -- `set_recommendation(session_id, recommendation, summary)` — set `state.overall_recommendation` + `state.review_summary` and finalize the review. - -The MCP loader picks this server up via `mcp.servers[*].module = examples.code_review.mcp_server` in the framework config. +No real GitHub/GitLab integration; tools are mocks for demonstration. 
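The fixture-or-canned fallback described for `fetch_pr_diff` could look roughly like the sketch below. It is an assumption-laden illustration: `fetch_pr_diff_mock`, `CANNED_DIFF`, and the explicit `fixture_path` argument are hypothetical names, and the real tool derives its fixture location from `repo` and `number` in a way not shown here. The result keys (`diff`, `files_changed`, `additions`, `deletions`) follow the shape the earlier README described.

```python
import json
from pathlib import Path

# Hypothetical canned payload; the real tool's fallback diff is not shown.
CANNED_DIFF = {
    "diff": "--- a/example.py\n+++ b/example.py\n@@ -1 +1 @@\n-print('old')\n+print('new')\n",
    "files_changed": 1,
    "additions": 1,
    "deletions": 1,
}


def fetch_pr_diff_mock(fixture_path: Path) -> dict:
    """Return the fixture's JSON if the file exists, else a canned diff."""
    if fixture_path.is_file():
        return json.loads(fixture_path.read_text(encoding="utf-8"))
    # Copy so callers cannot mutate the module-level canned payload.
    return dict(CANNED_DIFF)
```

Falling back to a canned payload rather than raising is what lets the example run offline with no fixtures checked out.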
## Skills -| Skill | Kind | Tools | Routes (success / default → fail) | -|---|---|---|---| -| `intake` | responsive | `fetch_pr_diff` | `→ analyzer` / `→ analyzer` / `→ __end__` | -| `analyzer` | responsive | `fetch_pr_diff`, `add_review_finding` | `→ recommender` / `→ recommender` / `→ __end__` | -| `recommender` | responsive | `set_recommendation` | `→ __end__` | - -All three are `kind: responsive` (no supervisor / monitor) — Phase-6 supervisor support is not exercised here. Common prompt fragments (severity calibration, output shape) live in `skills/_common/style.md` and are inherited by every skill. - -## Bundle - -Like incident-management, code-review ships as a single self-contained file: `dist/apps/code-review.py`. Build via: - -```bash -python scripts/build_single_file.py -``` - -This produces `dist/app.py` (framework-only), `dist/apps/incident-management.py`, **and** `dist/apps/code-review.py` from the same flattening pipeline (P8-K). All three are `ast.parse`-clean and runnable on a clean venv with only vendored deps. - -## Limits / Out of Scope - -- Tools are **mocked** — there is no real GitHub or GitLab integration. `fetch_pr_diff` reads a JSON fixture or returns synthetic data; `add_review_finding` and `set_recommendation` write only to the in-process session state. -- No incremental re-review — re-firing the trigger creates a new session. -- No supervisor skills — the diff is walked sequentially by the analyzer agent. -- No PR-author identity model — the framework does not ship a generic `Reporter` / `Actor` concept; each app names its own (`pr.author` here, `Reporter(id, team)` for incident-management). - -## How This Proves the Framework Is Generic - -Phase 8 was written *to surface and fix* framework leaks. The fixes that landed because this app needed them: - -- **P8-C** — `Session.id_format()` classmethod hook. Every `Session` subclass mints its own id format (`INC-...` for incidents, `CR-...` here, anything for future apps). 
`SessionStore._next_id` no longer hard-codes the incident shape. -- **P8-J** — `extra_fields: JSON` column on the row schema. Round-trip is driven by `state_cls.model_fields`; typed-column fields stay typed, everything else round-trips through the JSON bag. Incident round-trip is preserved; code-review's `pr` / `review_findings` / `overall_recommendation` / `review_summary` / `review_token_budget` now persist losslessly. -- **P8-K** — bundler emits `dist/apps/code-review.py` from the same flattening pipeline as `dist/apps/incident-management.py`. -- **P8-L** — integration test: both apps run side-by-side on isolated metadata DBs without colliding on id space, leaking field shapes, or sharing state. - -Phase 9 (ASR) builds on the framework as it stands after these fixes. +| Skill | Tools | Routes | +|---|---|---| +| `intake` | `fetch_pr_diff` | → analyzer | +| `analyzer` | `fetch_pr_diff`, `add_review_finding` | → recommender | +| `recommender` | `set_recommendation` | → __end__ | -## Testing +All three are `kind: responsive`. Common prompt fragments live in +`skills/_common/style.md` and are inherited. -```bash -pytest tests/test_code_review_*.py tests/test_two_apps_coexist.py tests/test_generic_round_trip.py tests/test_session_id_format.py tests/test_bundle_code_review.py -q --no-cov -``` +## Limits / Out of scope -App-level pin tests live alongside `tests/test_code_review_*.py`; the Phase-8 framework-leak fixes are pinned by `tests/test_session_id_format.py`, `tests/test_generic_round_trip.py`, `tests/test_bundle_code_review.py`, and `tests/test_two_apps_coexist.py`. +- Tools are mocked (no real GitHub/GitLab API calls). +- No incremental re-review (re-firing the trigger creates a new + session). +- No supervisor / monitor skills exercised. +- No PR-author identity model — each app names its own + (`pr.author` here, `Reporter(id, team)` in incident-management). 
diff --git a/examples/incident_management/README.md b/examples/incident_management/README.md index d8cc3b0..dc2d89a 100644 --- a/examples/incident_management/README.md +++ b/examples/incident_management/README.md @@ -1,229 +1,84 @@ # Incident Management — Example Application -The flagship example app for the `runtime` framework. Demonstrates how to layer a domain-specific agent application on top of the generic orchestration runtime. +The flagship example app for the framework. A 4-skill investigation +pipeline (intake → triage → deep_investigator → resolution) with +ASR memory layers (L2 Knowledge Graph, L5 Release Context, L7 +Playbook Store). + +For framework-wide design + decisions, see +[`docs/DESIGN.md`](../../docs/DESIGN.md). This README only covers +the bits specific to this app. ## Run ```bash -python -m runtime --config config/incident_management.yaml +uv run python -m runtime --config config/incident_management.yaml +ASR_LOG_LEVEL=INFO uv run streamlit run src/runtime/ui.py --server.port 37777 ``` -That boots the long-lived orchestrator service against this app's -config. The Streamlit UI ships separately under `ui/streamlit_app.py` -(`streamlit run ui/streamlit_app.py`) and binds to the same service. - -## Architecture - -This example extends the generic `Session` model with incident-specific state and provides a 4-agent investigation pipeline (intake → triage → deep_investigator → resolution). The framework owns session lifecycle, agent dispatch, and tool gateway; this example owns domain shape, skill prompts, and MCP tools. 
+## Layout ``` examples/incident_management/ -├── state.py IncidentState(Session) + Reporter + IncidentStatus -├── config.py IncidentAppConfig + load_incident_app_config -├── config.yaml severity_aliases, escalation_teams, environments, thresholds -├── mcp_server.py IncidentMCPServer with 3 tools -├── asr/ ASR memory layers (Phase 9) -│ ├── memory_state.py MemoryLayerState + L2/L5/L7 pydantic models -│ ├── kg_store.py L2 Knowledge Graph (filesystem) -│ ├── release_store.py L5 Release Context (filesystem) -│ ├── playbook_store.py L7 Playbook Store (filesystem) -│ └── seeds/ bundled JSON / YAML seed data per layer -├── skills/ 4 agent YAML configs + _common/ shared prompts +├── state.py IncidentState(Session) + Reporter + IncidentStatus +├── config.py IncidentAppConfig + load_incident_app_config +├── config.yaml severity_aliases, escalation_teams, environments, thresholds +├── mcp_server.py IncidentMCPServer + 3 tools +├── mcp_servers/ observability + remediation + user_context tools +├── asr/ ASR memory layers +│ ├── memory_state.py MemoryLayerState + L2/L5/L7 pydantic models +│ ├── kg_store.py L2 Knowledge Graph (filesystem) +│ ├── release_store.py L5 Release Context (filesystem) +│ ├── playbook_store.py L7 Playbook Store (filesystem) +│ └── seeds/ seed data per layer +├── skills/ 4 agent YAML configs + _common/ shared prompts │ ├── _common/ -│ ├── intake/ -│ ├── triage/ -│ ├── deep_investigator/ -│ └── resolution/ -├── ui.py Streamlit accordion-per-incident UI -├── __main__.py Entry point -└── README.md this file -``` - -## Configuration - -Two layers, in order of precedence: - -| Layer | File | What it owns | -|---|---|---| -| Framework | `config/config.yaml` | LLM providers + models, MCP servers, storage URL, paths | -| App | `examples/incident_management/config.yaml` | severity_aliases, escalation_teams, environments, similarity_threshold, confidence_threshold | - -The framework's `AppConfig` does **not** contain incident-flavored keys — they all live in 
`IncidentAppConfig`. Adding a new domain field is a one-line addition to `IncidentAppConfig`, never to `runtime.config.AppConfig`. - -## State Model - -`IncidentState(Session)` extends the framework's `Session` base with: - -- `query: str` — initial user description -- `environment: str` — production/staging/dev/local -- `reporter: Reporter` — who filed the incident -- `summary: str` — agent-produced narrative -- `tags: list[str]` -- `severity: str | None` — high/medium/low after triage -- `category: str | None` -- `matched_prior_inc: str | None` — id of similar resolved incident, if any -- `resolution: Any` — final outcome -- `memory: MemoryLayerState` — ASR memory-layer slots (L2 KG / L5 Release / L7 Playbooks); see "ASR memory layers" below - -The framework only reads/writes the inherited `Session` fields (id, status, created_at, agents_run, tool_calls, findings, pending_intervention, token_usage). Domain fields above are read/written exclusively by example-app code. - -## MCP Tools - -`IncidentMCPServer` exposes three tools to the agents: - -- `lookup_similar_incidents(query, environment)` — embedding similarity over closed incidents -- `create_incident(query, environment, reporter_id, reporter_team)` — start a new investigation -- `update_incident(incident_id, patch)` — write to status, severity, category, summary, tags, findings, resolution - -The MCP loader reads the registry from `config/config.yaml` (`mcp.servers[*].module`), which points at `examples.incident_management.mcp_server` for this app. - -## Skills - -Each agent (intake/triage/deep_investigator/resolution) is a `Skill` defined by a `config.yaml` + `system.md` pair under `skills//`. The `_common/` directory holds shared snippets all skills inherit. The framework's skill loader (`runtime.skill.load_all_skills`) takes a directory; `paths.skills_dir` in the framework config points at this directory. - -## Durable Memory - -Sessions survive cold restart. 
The framework wires LangGraph's `AsyncSqliteSaver` (or `AsyncPostgresSaver` for production) to the same database URL declared in `config.yaml`'s `storage.metadata.url`, on a separate connection pool with WAL + `busy_timeout=30s` configured on both sides. Resume after a crash: load the session by id and call `Orchestrator.resume_session(incident_id, user_input)` — it dispatches `Command(resume=...)` against the persisted graph state. Pending interventions are dual-written to both the LangGraph checkpoint and `IncidentRow.pending_intervention` so dashboards reading the relational row stay accurate. - -The state class is configurable. `config/config.yaml` sets `runtime.state_class: examples.incident_management.state.IncidentState` so row hydration produces `IncidentState` instances, not bare `Session` instances. A different app subclassing `Session` simply points this key at its own state class — no framework changes. - -## Multi-Session - -The orchestrator runs as a long-lived `OrchestratorService` (single asyncio loop on a background thread, single shared FastMCP client pool). Each session is an asyncio task on that loop, started via `service.start_session(query=..., environment=..., reporter_id=..., reporter_team=...)` which returns the session id immediately while the agent run continues in the background. - -Concurrent sessions are isolated at the row level (each writes to its own `IncidentRow`) but share the MCP client pool. `service.list_active_sessions()` returns a thread-safe snapshot of in-flight sessions; `service.stop_session(session_id)` cancels a task and marks the row `status="stopped"`. Default cap is 8 concurrent sessions; raise `SessionCapExceeded` (HTTP 429) on overflow. Configure via `runtime.max_concurrent_sessions` in `config.yaml`. - -Three new HTTP endpoints expose this surface: `POST /sessions` (start), `GET /sessions` (list active), `DELETE /sessions/{id}` (stop). 
Legacy `POST /investigate` is preserved as a deprecated alias delegating to the same code path. - -The Streamlit UI now shows two sections in the sidebar: **In-flight** (live, polled from `list_active_sessions()`) and **History** (closed sessions). The detail pane auto-polls every 1.5s while a session's status is non-terminal; polling stops once status is `resolved` / `escalated` / `stopped`. - -## Risk-rated tool gateway - -Phase 4 adds a per-tool risk gateway that sits between every agent and every MCP tool call. Each tool is tagged in `runtime.gateway.policy` (`low` / `medium` / `high`) and the gateway dispatches on the resolved action: `low` runs without overhead, `medium` runs and persists `ToolCall(status="executed_with_notify")` for soft audit, and `high` raises `langgraph.types.interrupt(...)` to pause the graph for human approval — the wrap closure captures the live `Session` per agent invocation so audit lands on the right row. - -A prod-environment override tightens the policy further: when a session's `environment` is in `prod_environments` and the tool name matches a `resolution_trigger_tools` glob, the gateway forces `approve` regardless of the tool's risk tier. This guarantees that "blast-radius" tools (apply_fix, deploy, mass_update_*) always get a human in the loop in production, even when the underlying tier is `low` or `medium`. - -Operators resolve pending approvals via `POST /sessions/{sid}/approvals/{tool_call_id}` (decision=approve|reject + approver + optional rationale) or via the **Pending Approvals** cards in the Streamlit detail pane. Both paths drive `Command(resume={...})` against the same graph thread_id so HTTP and UI clients share the resume contract. - -Legacy `tool_calls` rows from before Phase 4 are migrated lazily by `runtime.storage.migrate_tool_calls_audit` — idempotent JSON walk that fills the new audit fields with their defaults. Run once at orchestrator startup or as a one-off ops job. 
- -## ASR memory layers (Phase 9) - -Phase 9 lays the foundation for ASR's 7-layer memory architecture (see `ASR.md` §3 / §6). Three of those layers ship in this batch as filesystem-backed read-only stores under `examples/incident_management/asr/`. No Neo4j / Redis / pgvector dependency — air-gapped friendly per `rules/build.md`. - -| Layer | Class | Backing files | Surface | -|---|---|---|---| -| **L2** Knowledge Graph | `KGStore` | `incidents/kg/{components,edges}.json` | `get_component` / `find_by_name` / `neighbors` / `subgraph` | -| **L5** Release Context | `ReleaseStore` | `incidents/releases/recent.json` | `recent_for_service` / `suspect_at` / `context` | -| **L7** Playbook Store | `PlaybookStore` | `incidents/playbooks/*.yaml` | `get` / `list_all` / `match` | - -Each store accepts a `root: Path` for testability. When the configured layer directory is empty, the store falls back to the seed bundle at `examples/incident_management/asr/seeds//` so a fresh checkout has working data without provisioning `incidents/`. - -Investigations attach context fetched from each layer to `IncidentState.memory` — the `MemoryLayerState` container with `l2_kg: L2KGContext | None`, `l5_release: L5ReleaseContext | None`, `l7_playbooks: list[L7PlaybookSuggestion]`. The whole bundle round-trips through the P8-J `extra_fields` JSON column, so no row schema change is needed. Mutation paths (writes from agents, playbook authoring) are deferred to later sub-phases (9e–9g). - -## ASR MVP investigation flow (Phase 9 — 9h/9i/9k/9m) - -The MVP slice wires a deliberate, end-to-end investigation pipeline on top of the memory-layer foundation. 
Three new skills + helpers + UI panels: - +│ ├── intake/ kind: supervisor, runs similarity + memory hydration +│ ├── triage/ hypothesis-loop investigator +│ ├── deep_investigator/ evidence gathering +│ └── resolution/ propose / apply fix or escalate +├── ui.py Streamlit accordion-per-incident view +└── __main__.py entry point ``` -intake → triage (hypothesis loop) → deep_investigator → resolution (L7 + gateway) -``` - -**1. Supervisor (`intake`, P9-9h + 2026-05-03 generalisation).** Default entry agent (framework default `entry_agent='intake'` matches; no override needed in `config/config.yaml`). The `intake` skill is `kind: supervisor` whose runner composes `runtime.intake.default_intake_runner` (framework — similarity retrieval + dedup gate) with `examples.incident_management.asr.supervisor_node:default_supervisor_runner`'s memory hydration. Hydrates `session.memory` with L2 KG / L5 Release / L7 Playbook context fetched from the affected service set (extracted heuristically from the query). Applies the **single-active-investigation gate**: if another in-flight session is already covering the same components, the new session is tagged `status="duplicate"` with `parent_session_id` pointing at the active one (reuses P7 dedup linkage), and routed to `__end__`. Helper module: `examples/incident_management/asr/supervisor_node.py`. - -**2. Triage hypothesis loop (P9-9i).** The triage skill now runs a bounded inner loop: generate hypothesis → gather evidence (L1 current findings, L3-equivalent past similar incidents via `lookup_similar_incidents`, L5 recent suspect deploys from `session.memory.l5_release`) → score → refine or accept. Hard cap of 3 iterations. The deterministic scorer (`asr.hypothesis_loop.score_hypothesis` — token-overlap, no LLM) and the `should_refine` predicate are unit-tested separately so the loop's safety net isn't LLM-dependent. 
Each iteration writes `{iteration, hypothesis, score, rationale}` to `findings.findings_triage` for the UI's hypothesis trail panel. - -**3. Resolution + prod-HITL (P9-9k).** The resolution skill consults `session.memory.l7_playbooks`, picks the top match, and translates `playbook.remediation` into tool calls via `asr.resolution_helpers.playbook_to_tool_calls`. Every call routes through the framework gateway. The `runtime.gateway` block in `config/config.yaml` locks the prod-environment override: `update_incident` (medium) and any `remediation:*` tool ALWAYS require approval in `production`, regardless of risk tier. The override only TIGHTENS — it can never relax a higher-risk tool to `auto`. - -**4. UI panels (P9-9m-sliver).** Two read-only views on the incident detail page: - -- **Approval Inbox** — already shipped in P4-H; surfaces every tool call with `status="pending_approval"` as an Approve / Reject card. -- **Hypothesis Trail** — collapsed accordion showing the triage agent's iterative `{iteration, hypothesis, score, rationale}` log, sourced from `session.findings`. No new persistent state. - -## Agent kinds - -Phase 6 introduces a `kind` discriminator on every `Skill`, allowing three execution models behind a single config schema: - -| `kind` | Where it runs | Writes `AgentRun`? | -|--------------|----------------------------------------------|--------------------| -| `responsive` | LangGraph node, on a session turn (today's path) | yes | -| `supervisor` | LangGraph node, dispatches to a subordinate via `Send()` | **no** (dispatch log only) | -| `monitor` | Out-of-band, scheduled via `MonitorRunner` | no (signals only) | - -Each existing skill in this example carries `kind: responsive` explicitly; the loader still defaults the field to `responsive` when omitted, so legacy YAML keeps working unchanged. 
A `supervisor` skill declares `subordinates`, `dispatch_strategy: llm|rule`, and either a `dispatch_prompt` (for `llm`) or a `dispatch_rules` list (for `rule`); supervisor dispatches emit a structured `supervisor_dispatch` log entry instead of bloating `agents_run` with router rows. A `monitor` skill declares a 5-field `schedule:` cron expression, an `observe:` list of tool names, an `emit_signal_when:` safe-eval expression, and a `trigger_target:` naming a Phase-5 trigger to fire when the expression is true. Monitors run on a small bounded thread pool (`max_workers=4`); each tick has a per-monitor `tick_timeout_seconds` so one slow `observe` tool cannot stall the others. Dangerous expression constructs (calls, attribute access, comprehensions, lambda) are rejected by an AST allowlist at skill-load time — `eval()`/`exec()` are never used on user-supplied strings. - -## Triggers - -Phase 5 adds a declarative trigger registry that generalises session-start beyond the legacy `POST /investigate` route. After Phase 5 the framework can fire `Orchestrator.start_session` from four transport flavours: `api` (back-compat), `webhook` (third-party POST `/triggers/{name}`), `schedule` (in-process APScheduler cron), and `plugin` (custom transport registered via setuptools entry-points or explicit `plugin_transports={"kind": Class}` on `TriggerRegistry.create`). All four are wired off a single `triggers:` block in `config.yaml`. 
- -```yaml -triggers: - - name: pagerduty-incident - transport: webhook - target_app: incident_management - payload_schema: examples.incident_management.triggers.PagerDutyPayload - transform: examples.incident_management.triggers.transform_pagerduty - auth: bearer - auth_token_env: PAGERDUTY_WEBHOOK_TOKEN - idempotency_ttl_hours: 24 - - - name: nightly-prod-scan - transport: schedule - target_app: incident_management - transform: examples.incident_management.triggers.transform_schedule_heartbeat - schedule: "0 2 * * *" # 5-field cron (UTC by default) - timezone: UTC - payload: - query: "Nightly health check" - environment: production -``` - -**Webhook routing:** the registry mounts one `POST /triggers/{name}` route per webhook trigger. Each trigger config declares a Pydantic `payload_schema` (validated on every request — bad body returns 422) and a `transform` callable that maps the parsed payload to `start_session(**kwargs)`. The transform error policy is fail-closed: any exception from `transform` returns `422 Unprocessable Entity` and is **not** cached for idempotency, so a retried request gets a fresh attempt. -**Bearer auth:** when `auth: bearer`, the route requires `Authorization: Bearer $auth_token_env`. The token is read from the named env var **at app startup** — rotating the secret requires a process restart. No raw secrets ever land in YAML. Constant-time comparison (`hmac.compare_digest`) guards against timing oracles. HMAC signature transports (PagerDuty `x-pagerduty-signature`, Slack `x-slack-signature`) are deferred to a later phase via the same `auth:` discriminator. +## Domain shape -**Idempotency-Key:** webhook clients can include `Idempotency-Key: ` to dedupe retries. The registry stores `(trigger_name, key)` -> `session_id` in a per-process LRU and a SQLite-backed table `trigger_idempotency_keys` on the same DB used for session metadata (`storage.metadata.url`). 
Cold restart is survived: on LRU miss, the disk row is read; entries past `ttl_hours` are purged opportunistically. Content-based dedup (hash of body) is **out of scope until Phase 7**; only the explicit `Idempotency-Key` header is honoured in Phase 5. +`IncidentState(Session)` adds `query`, `environment`, `reporter`, +`summary`, `tags`, `severity`, `category`, `matched_prior_inc`, +`resolution`, `memory: MemoryLayerState`. Session ids look like +`INC-YYYYMMDD-NNN`. -**Schedule cron:** `schedule:` is a standard 5-field cron string interpreted via APScheduler's `CronTrigger.from_crontab`. The 6-field APScheduler-native form is rejected at config-load time. Drift: in-process APScheduler is good for ±1 minute under normal load — tighter SLOs need an external scheduler (Celery beat, k8s `CronJob`). +## ASR memory layers -**Plugin transports:** to ship a transport for SQS / Kafka / NATS, subclass `runtime.triggers.base.TriggerTransport` and register the class either via the `runtime.triggers` setuptools entry-point group or by passing `plugin_transports={"kind": Class}` to `TriggerRegistry.create`. Explicit registrations win on key collision. - -**Provenance:** every session started via a trigger receives a `TriggerInfo(name, transport, target_app, received_at)` stamped onto `inc.findings['trigger']` before the graph runs, so dashboards and audit logs can answer "where did this session come from?" without re-deriving from disjoint sources. - -## Adding a new app - -The framework is genuinely generic — Phase 8 lifted every domain-specific assumption out of `src/runtime/` and pinned it with the second example at `examples/code_review/`. 
To stand up your own app, mirror this structure under `examples//` (no framework changes required): - -| File | What it owns | Hook into the framework | +| Layer | Class | Backing | |---|---|---| -| `state.py` | Your `Session` subclass with domain fields | `runtime.state_class` (dotted path) | -| `state.py` (`id_format` classmethod) | Your session id shape (e.g. `MYAPP-NNN`) | `Session.id_format(seq=...)` (P8-C) | -| `config.py` / `config.yaml` | Your `AppConfig` subclass for app-specific tunables | Loaded by your own loader; framework doesn't touch it | -| `mcp_server.py` | Your domain MCP tools | `mcp.servers[*].module` | -| `skills//{config,system}.{yaml,md}` | Per-skill prompt + tool wiring | `paths.skills_dir` | -| (none) | Entry point lives in the framework: `python -m runtime --config config/.yaml` | n/a | - -**Round-trip:** any field you declare on your `Session` subclass that is *not* an incident-shaped typed column on `IncidentRow` (`query`, `environment`, `severity`, `tags`, ...) lands in the row's `extra_fields` JSON column on save and is hydrated back via `state_cls.model_fields` on load (P8-J). You don't need to touch the framework's row schema or converters. - -**Bundle:** add a `_APP_MODULE_ORDER` and a `build__app()` function in `scripts/build_single_file.py`, then call it from `main()`. The flattening pipeline + intra-import stripping pattern is the same for every app (see how `examples.code_review` does it). - -The second example at [`examples/code_review/`](../code_review/README.md) is a deliberate non-incident-flavored app (PR review). It exists to *prove* the framework is generic by being a second concrete instance of the same pattern. If you're stuck on how a piece should land, check what code-review does first — the pattern is almost always already there. 
- -## Testing - -```bash -pytest tests/ -q --no-cov -``` - -Pin tests for this example live in `tests/test_incident_state.py` (state shape), `tests/test_mcp_incident_server.py` (MCP server), and the broader integration suite under `tests/test_*`. - -## Genericity ratchet - -`scripts/check_genericity.py` counts occurrences of incident-flavored tokens (`incident`, `severity`, `reporter`) inside `src/runtime/`. `tests/test_genericity_ratchet.py` enforces that the total stays at or below `BASELINE_TOTAL` — so new domain leaks into the framework layer fail CI. - -```bash -python scripts/check_genericity.py # print current counts -python scripts/check_genericity.py --baseline 140 # exit non-zero if exceeded -``` - -To lower the baseline: refactor a leak out of `src/runtime/`, then update `BASELINE_TOTAL` in `tests/test_genericity_ratchet.py` in the same commit. Raising the baseline requires an architecture rationale in the commit message and is a code-review red flag. +| L2 Knowledge Graph | `KGStore` | `incidents/kg/{components,edges}.json` (or seeds) | +| L5 Release Context | `ReleaseStore` | `incidents/releases/recent.json` (or seeds) | +| L7 Playbook Store | `PlaybookStore` | `incidents/playbooks/*.yaml` (or seeds) | + +The intake supervisor hydrates `IncidentState.memory` from these +stores using components extracted from the user's query. The +triage / DI / resolution agents read the bundle as additional +context. Mutation paths (write-back) are deferred. + +## MCP tools + +`IncidentMCPServer` exposes `lookup_similar_incidents`, +`create_incident`, `update_incident`, `submit_hypothesis`, +`mark_resolved`, `mark_escalated`. Sibling MCP servers under +`mcp_servers/` add observability (`get_logs`, `get_metrics`, +`get_service_health`, `check_deployment_history`) and remediation +(`propose_fix`, `apply_fix`, `notify_oncall`). 
+ +The risk-rated gateway (`runtime.gateway.policy`) tags `apply_fix` +as `high` so production runs pause for operator approval before +applying any fix. See [DESIGN § 7](../../docs/DESIGN.md#7-hitl-approve--reject) +for the HITL pause/resume mechanics. + +## Skill model + +Per-agent LLM override: intake declares `model: gpt_oss_cheap` (a +fast / cheap model on Ollama Cloud) so the supervisor pre-filter is +cheap; downstream agents follow `llm.default`. See +[DESIGN § 5.3](../../docs/DESIGN.md#5-llm-provider-story) for the +per-agent dispatch.
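
As a rough illustration of the per-agent override, the resolution rule can be read as "use the skill's own `model` if it is set, otherwise `llm.default`". The sketch below assumes both configs are plain dicts; `resolve_model` and the `"default_model"` value are hypothetical and not part of the framework.

```python
def resolve_model(skill_cfg: dict, llm_cfg: dict) -> str:
    """Pick the skill's own model if set, else the global default."""
    # An absent or empty `model` key falls through to llm.default.
    return skill_cfg.get("model") or llm_cfg["default"]


llm_cfg = {"default": "default_model"}  # placeholder name, for illustration only
print(resolve_model({"model": "gpt_oss_cheap"}, llm_cfg))  # gpt_oss_cheap
print(resolve_model({}, llm_cfg))                          # default_model
```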