diff --git a/.gitignore b/.gitignore index c5d3e84..21a3985 100644 --- a/.gitignore +++ b/.gitignore @@ -58,10 +58,26 @@ ASR.md # - DESIGN.md: consolidated architecture + decision log. # - AIRGAP_INSTALL.md: Phase 14 (HARD-02) air-gap install path. # - DEVELOPMENT.md: Phase 16 (BUNDLER-01) contributor workflow. +# - 00-…-11-…: brownfield documentation set (per-topic). +# - adr/*.md: Architecture Decision Records. docs/* !docs/DESIGN.md !docs/AIRGAP_INSTALL.md !docs/DEVELOPMENT.md +!docs/00-project-overview.md +!docs/01-local-setup.md +!docs/02-architecture.md +!docs/03-code-map.md +!docs/04-main-flows.md +!docs/05-configuration.md +!docs/06-data-model.md +!docs/07-integrations.md +!docs/08-testing.md +!docs/09-build-deploy-release.md +!docs/10-known-risks-and-todos.md +!docs/11-agent-handoff.md +!docs/adr/ +!docs/adr/*.md REVIEW_*.md review_*.md .planning/ diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..e3776ff --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,199 @@ +# CLAUDE.md — project context for AI agents + +> Loaded automatically by Claude Code (and equivalent agents) for +> every session in this repo. Companion to +> [`docs/11-agent-handoff.md`](docs/11-agent-handoff.md), which has +> the longer "action card" format with explanations. + +## What this project is + +Generic Python multi-agent runtime framework on **LangGraph** +(orchestration) + **LangChain** (provider + agent factory) + +**FastMCP** (tools). Single-file deploy bundle for air-gapped +corporate environments. Two reference apps in `examples/`: +`incident_management` (flagship) and `code_review` (proves the +framework is generic). + +`main` is at v1.5 (see [`docs/DESIGN.md`](docs/DESIGN.md) § 13 for +milestone history). + +## Read these first + +In order: +1. [`docs/DESIGN.md`](docs/DESIGN.md) — long-form architecture + + 12 numbered DEC-NNN decisions + milestone history +2. 
[`docs/11-agent-handoff.md`](docs/11-agent-handoff.md) — top + 20 files to read, command allowlist / denylist, common traps +3. [`docs/02-architecture.md`](docs/02-architecture.md) — + quick-scan layered diagram +4. [`docs/04-main-flows.md`](docs/04-main-flows.md) — entry points + + failure modes per flow + +## Always-on commands + +```bash +# install / sync deps (uses uv.lock) +uv sync --frozen --extra dev + +# tests (full) +uv run pytest -x + +# tests (single file, fast) +uv run pytest tests/.py -xvs --no-cov + +# lint + type check + ratchets +uv run ruff check src/ tests/ +uv run pyright src/runtime +python scripts/check_genericity.py +uv run python scripts/lint_skill_prompts.py + +# regenerate single-file bundle (REQUIRED after touching src/runtime/ or examples/) +uv run python scripts/build_single_file.py + +# coverage gate +uv run pytest --cov=src/runtime --cov-fail-under=85 -x +``` + +## DO + +- Use `uv run pytest …` (NOT bare `pytest`) — pythonpath is in + `pyproject.toml`. +- Regenerate `dist/*` after ANY change to `src/runtime/` or + `examples/`. CI's "Bundle staleness gate (HARD-08)" fails + otherwise. +- Run `uv lock` and commit `uv.lock` if you change `pyproject.toml`. + CI's "Lockfile freshness gate (HARD-02)" fails otherwise. +- Work on a feature branch, open a PR, squash-merge. + Conventional-commit subjects: `feat(area): …`, `fix(area): …`, + `refactor(area): …`, `docs: …`, `build: …`, `chore(area): …`. +- Use `extra_fields` JSON for app-specific fields. Do NOT add + app-specific columns to `IncidentRow`. +- Use stub LLMs (`LLMConfig.stub()` + `EnvelopeStubChatModel` from + `tests/_envelope_helpers.py`) in tests. Live LLM tests are + env-gated. +- Re-read [`docs/DESIGN.md`](docs/DESIGN.md) § 12 (decision log) + before any architectural change. + +## DO NOT + +- Do NOT `pip install …` — bypasses uv lockfile. Use `uv add` + + `uv sync`. +- Do NOT edit `dist/*` directly — they're generated. 
+- Do NOT add `TODO`/`FIXME`/`HACK` comments — fix root cause or + open an issue. The only intentional `TODO(v2)` is in + `src/runtime/locks.py:49` (slot eviction; documented). +- Do NOT add `except Exception: pass` — Phase 18 / HARD-04 + removed all of these. Log + re-raise or catch a typed exception. +- Do NOT touch SQLAlchemy column names on `IncidentRow` — + destructive migration. Add to `extra_fields` instead. +- Do NOT commit anything in `.planning/` — gitignored; + local-only working state for the GSD planning workflow. +- Do NOT commit agent-generated `*.md` outside `docs/` unless + the user explicitly asks for them to ship. `docs/*` is gitignored + except for the explicit allowlist in `.gitignore`. +- Do NOT call live LLM providers in CI tests — keys are dummy in + `.github/workflows/ci.yml`. +- Do NOT introduce a public-internet runtime dependency in + `src/runtime/`. Air-gap is the deploy target. The hardcoded + `https://ollama.com` fallback was explicitly removed in + Phase 13 (HARD-05); don't re-introduce it. +- Do NOT force-push or rewrite history on `main` (or any branch + with collaborators). PRs only. +- Do NOT skip the bundle regeneration step ("I'll do it before + PR" leads to CI failures and time wasted on rebases). +- Do NOT bypass the concept-leak ratchet by raising + `BASELINE_TOTAL` without a rationale entry. Lowering it is + encouraged; raising requires architectural justification in the + commit message. + +## Architectural rules (load-bearing) + +See [`docs/11-agent-handoff.md`](docs/11-agent-handoff.md) § +"Architectural rules" for the 8 rules. Quick recap: + +1. Framework stays domain-agnostic +2. One source of truth per concern (`should_gate`, `should_retry`, + `_finalize_session_status`) +3. HITL pause is NOT an error +4. Append-only audit trails +5. The bundle is the deploy unit +6. Provider abstraction stays in `runtime.llm` +7. Tests use stubs by default +8. 
No public-internet runtime calls in air-gap path + +## Common traps (skim before debugging) + +- `pytest` (bare) → `ModuleNotFoundError: runtime`. Use `uv run pytest …`. +- Touching `src/` without regenerating `dist/` → CI bundle gate fails. +- Approving a HITL session created on pre-PR-#6 code → silent no-op. + Tell the user to start a fresh session. +- Live OpenRouter `:free` model 429s on first call → retry usually + works (v1.5-D 429 backoff is 7.5s/15s/22.5s). +- Streamlit `AssertionError: scope["type"] == "http"` storm under + Python 3.14 → cosmetic Starlette compat bug; HTTP traffic still + works. + +## Repo conventions + +- **Branches:** `feat/`, `fix/`, `refactor/`, `docs/`, `chore/`, + `build/`. Squash-merge into `main`. +- **Commits:** Conventional Commits style. Verbose body with the + "why" + key file references when non-trivial. +- **PRs:** Use `gh pr create` with title + body; CI runs lint / + type / test / sonar / bundle / skill-lint. Squash-merge with + `gh pr merge --squash --delete-branch --subject "…"`. +- **Tests:** `tests/test_*.py`. Async tests need no decorator + (`asyncio_mode=auto`). Stub LLMs from `tests/_envelope_helpers.py`. +- **Coverage:** ≥ 85% on `src/runtime/`. UI / `__main__` / + postgres saver / plugin transport are excluded + (`pyproject.toml:[tool.coverage.run].omit`). +- **Type-checker:** pyright fail-on-error (Phase 19 / HARD-03); + use `# pyright: ignore[] -- ` for legitimate + stub gaps. +- **Skill prompts:** `examples//skills//{config.yaml, system.md}`. + Must include the markdown turn-output contract block (see + `_common/output.md`). + +## Worktree workflow + +This repo is set up for parallel-agent worktrees under +`.claude/worktrees/`. If you're given the EnterWorktree tool: + +- Use it BEFORE making any code changes — keeps the user's main + checkout clean. 
+- After CI passes and the PR merges, ExitWorktree with + `action=remove, discard_changes=true` (the squashed commits are + on `main`; the original SHAs are dropped, content is preserved). + +If you're not given EnterWorktree, work in the main checkout but +let the user know. + +## Current state snapshot (as of last update) + +- Tests: 1265 passing, 8 skipped +- Coverage: 87.04% +- Concept-leak ratchet: 39 (down from 156 pre-v1.5-B) +- Ruff: clean +- SonarCloud quality gate: green +- Latest milestone: v1.5 (markdown turn output + HITL fix + + generic-noun pass + per-agent LLM + 429 retry) +- Next big move: v2.0 React UI (Streamlit retirement) + +## Where to find what + +| You want to … | Read | +|---|---| +| Understand the architecture | [`docs/DESIGN.md`](docs/DESIGN.md), [`docs/02-architecture.md`](docs/02-architecture.md) | +| Local setup | [`docs/01-local-setup.md`](docs/01-local-setup.md) | +| Find a file by purpose | [`docs/03-code-map.md`](docs/03-code-map.md) | +| Understand a flow end-to-end | [`docs/04-main-flows.md`](docs/04-main-flows.md) | +| Configure deployment | [`docs/05-configuration.md`](docs/05-configuration.md) | +| Inspect storage / data | [`docs/06-data-model.md`](docs/06-data-model.md) | +| External integrations | [`docs/07-integrations.md`](docs/07-integrations.md) | +| Run / write tests | [`docs/08-testing.md`](docs/08-testing.md) | +| Build / deploy / release | [`docs/09-build-deploy-release.md`](docs/09-build-deploy-release.md) | +| Risk / debt inventory | [`docs/10-known-risks-and-todos.md`](docs/10-known-risks-and-todos.md) | +| Action card for AI agents | [`docs/11-agent-handoff.md`](docs/11-agent-handoff.md) | +| Architectural baseline | [`docs/adr/0001-current-architecture.md`](docs/adr/0001-current-architecture.md) | +| Dev workflow (regenerate dist, add module) | [`docs/DEVELOPMENT.md`](docs/DEVELOPMENT.md) | +| Air-gap install | [`docs/AIRGAP_INSTALL.md`](docs/AIRGAP_INSTALL.md) | diff --git a/README.md b/README.md index 
046a5d5..0bb84cd 100644 --- a/README.md +++ b/README.md @@ -1,5 +1,15 @@ # ASR — Multi-Agent Runtime Framework +[![Python](https://img.shields.io/badge/python-3.11%2B-blue?style=for-the-badge&logo=python&logoColor=white)](https://www.python.org/) +[![LangGraph](https://img.shields.io/badge/LangGraph-1.x-orange?style=for-the-badge)](https://github.com/langchain-ai/langgraph) +[![FastMCP](https://img.shields.io/badge/FastMCP-2.x-purple?style=for-the-badge)](https://github.com/jlowin/fastmcp) +[![CI](https://img.shields.io/github/actions/workflow/status/RandomCodeSpace/asr/ci.yml?branch=main&style=for-the-badge&logo=github)](https://github.com/RandomCodeSpace/asr/actions/workflows/ci.yml) +[![Quality Gate](https://img.shields.io/sonar/quality_gate/RandomCodeSpace_asr?server=https%3A%2F%2Fsonarcloud.io&style=for-the-badge&logo=sonarcloud)](https://sonarcloud.io/project/overview?id=RandomCodeSpace_asr) +[![Coverage](https://img.shields.io/sonar/coverage/RandomCodeSpace_asr?server=https%3A%2F%2Fsonarcloud.io&style=for-the-badge&logo=sonarcloud)](https://sonarcloud.io/component_measures?id=RandomCodeSpace_asr&metric=coverage) +[![Tests](https://img.shields.io/badge/tests-1265%20passing-brightgreen?style=for-the-badge)](https://github.com/RandomCodeSpace/asr/actions) +[![Ruff](https://img.shields.io/badge/lint-ruff-261230?style=for-the-badge&logo=ruff)](https://github.com/astral-sh/ruff) +[![Pyright](https://img.shields.io/badge/types-pyright-yellow?style=for-the-badge)](https://github.com/microsoft/pyright) + Python multi-agent runtime built on **LangGraph** (orchestration) + **FastMCP** (tool dispatch), with HITL gate, markdown turn-output contract, and a single-file deploy bundle for air-gapped corporate diff --git a/docs/00-project-overview.md b/docs/00-project-overview.md new file mode 100644 index 0000000..4d9a1e5 --- /dev/null +++ b/docs/00-project-overview.md @@ -0,0 +1,97 @@ +# 00 — Project overview + +## What it does + +ASR is a generic Python multi-agent runtime 
framework that wraps +**LangGraph** (orchestration), **LangChain** (LLM provider +abstraction + agent factory), and **FastMCP** (tool dispatch). It +adds a risk-rated HITL gateway, a markdown turn-output contract, +per-step telemetry, an auto-learning lesson store, and a single-file +deploy bundle for air-gapped corporate targets. + +Two reference apps live in the same repo to prove the runtime is +genuinely generic: + +- **`examples/incident_management/`** — 4-skill investigation pipeline + (intake → triage → deep_investigator → resolution) with ASR memory + layers (L2 Knowledge Graph, L5 Release Context, L7 Playbook Store). +- **`examples/code_review/`** — 3-skill PR review pipeline (intake + → analyzer → recommender). Built specifically to surface every + framework leak that would have made the runtime + incident-shaped — those leaks were lifted into the framework. + +References: [`docs/DESIGN.md`](DESIGN.md), [`pyproject.toml`](../pyproject.toml). + +## Target users + +- **Operators** of internal SRE / on-call automation in regulated / + air-gapped corporate environments. The deployment story is a + copy-only 7-file payload (no `pip install` at deploy time, no + runtime CDN/internet calls). +- **Application authors** building domain-specific agent apps on top + of the framework. Add a folder under `examples//` with a + `Session` subclass, MCP servers, and skill prompts. +- **Framework contributors** working on the `src/runtime/` layer. 
+ +## Core features + +| Feature | Implemented in | +|---|---| +| LangGraph-driven multi-agent dispatch | `src/runtime/graph.py`, `src/runtime/agents/*.py` | +| LangChain-driven LLM provider abstraction (Ollama, Azure OpenAI, OpenAI-compat) | `src/runtime/llm.py` | +| FastMCP tool servers (in-process / stdio / http) | `src/runtime/mcp_loader.py` | +| Risk-rated HITL gateway with `interrupt()` / `Command(resume=…)` | `src/runtime/tools/gateway.py` | +| Markdown turn-output contract + 6-path parser + permissive fallback | `src/runtime/agents/turn_output.py` | +| Per-step telemetry events (agent_started, tool_invoked, gate_fired, etc.) | `src/runtime/storage/event_log.py` | +| Auto-learning lesson store + nightly refresher | `src/runtime/learning/extractor.py`, `src/runtime/learning/scheduler.py` | +| Two-stage dedup (embedding + LLM) | `src/runtime/dedup.py` | +| Optimistic-concurrency `SessionStore` over SQLAlchemy | `src/runtime/storage/session_store.py` | +| Read-only similarity store | `src/runtime/storage/history_store.py` | +| Trigger registry (api / webhook / schedule / plugin) | `src/runtime/triggers/` | +| Single-file deploy bundle (`dist/`) | `scripts/build_single_file.py` | +| Streamlit UI shell | `src/runtime/ui.py`, `ui/streamlit_app.py` | +| FastAPI surface (`/sessions/*`, SSE/WebSocket, approvals) | `src/runtime/api.py` | +| Concept-leak ratchet (CI-enforced framework genericity) | `tests/test_genericity_ratchet.py`, `scripts/check_genericity.py` | + +## Current status + +`main` is at v1.5 (last squash commit `b97ddb3`). 
All milestones +shipped: + +| Milestone | Title | PR | +|---|---|---| +| v1.0 | Prompt-vs-Code Remediation | #1 | +| v1.1 | Framework De-coupling (generic runtime) | #2 | +| v1.2 + v1.3 + v1.4 | FOC + HARD + telemetry + auto-learning + React-ready API | bundled into #5 | +| v1.5-A | Markdown turn output + HITL fix on langgraph 1.x | #6 / #7 | +| v1.5-B | Generic-noun pass (concept-leak ratchet 156 → 39) | #8 | +| v1.5-C | Per-agent LLM proof point | #9 | +| v1.5-D | 429 rate-limit retry + multi-provider integration driver | #10 | + +**1265 tests passing**, **87% coverage**, **ratchet at 39**, ruff +clean, SonarCloud quality gate green. See [`docs/DESIGN.md` § 13](DESIGN.md#13-milestone-history) +for the full history. + +## Production-ready vs experimental + +| Surface | Status | Notes | +|---|---|---| +| Framework runtime (`src/runtime/`) | **production** | Used in air-gapped corporate environments | +| `incident_management` example | **production** | Flagship use case | +| `code_review` example | **demo / proof-of-genericity** | Tools are mocks (no real GitHub/GitLab fetch) — `examples/code_review/README.md` | +| Streamlit UI | **prototype** | Stable but slated for replacement by React in v2.0 | +| FastAPI surface | **production-ready** | v1.4 added generic `/sessions/*` REST + SSE/WebSocket + CORS + structured error envelope | +| Postgres checkpointer | **optional / opt-in** | Default is SQLite; install with `pip install asr[postgres]` (`pyproject.toml:39`) | +| Trigger registry — webhook / schedule | **functional, lightly exercised** | Used by the example apps; no large-scale fan-in tested | +| Trigger registry — plugin transport | **stub** | `src/runtime/triggers/transports/plugin.py`. Inference: scaffold for future SQS/Kafka/NATS transports | +| ASR memory layers (incident_management) | **read-only** | Mutation paths (write-back) deferred per `examples/incident_management/README.md` | +| Auto-learning lesson refresher | **production** | Nightly APScheduler job, 
gated on config | + +## What's next + +- **v2.0 — React UI**, replacing the Streamlit prototype, parity-port + against the v1.4 `/sessions/*` API surface. The long pole. +- Smaller cleanups: duplicate `ToolCall` audit rows + (gateway colon-form vs harvester `__`-form), `ApprovalWatchdog` + regression test, `ASR_LOG_LEVEL` doc, `src/runtime/locks.py:49` + TODO. See [`docs/10-known-risks-and-todos.md`](10-known-risks-and-todos.md). diff --git a/docs/01-local-setup.md b/docs/01-local-setup.md new file mode 100644 index 0000000..cba0a50 --- /dev/null +++ b/docs/01-local-setup.md @@ -0,0 +1,167 @@ +# 01 — Local setup + +## Prerequisites + +- **Python 3.11** (`pyproject.toml:7` requires `>=3.11`; `pyrightconfig` / + CI also pin 3.11). The dev environment in this repo runs Python + 3.13 / 3.14 successfully — Inference: 3.11 is the *floor*, newer + 3.x versions work in practice. +- **`uv`** package manager `>= 0.11.7` (CI pins this exact version + in `.github/workflows/ci.yml`). Install via `pipx install uv` or + the `uv` binary; do not `curl | sh`. +- **git** (for branch / PR workflow). +- **Optional, for live LLM smoke**: provider API keys — + `OLLAMA_API_KEY`, `OPENROUTER_API_KEY`, `AZURE_OPENAI_KEY` (+ + `AZURE_ENDPOINT`, `AZURE_DEPLOYMENT`). Stub-mode tests do NOT + need any keys. +- **Optional, for postgres deployments**: run + `pip install asr[postgres]` to pull `langgraph-checkpoint-postgres` + and `psycopg-pool`. SQLite is the default and CI-tested path. + +## Install + +From a clean checkout: + +```bash +git clone +cd asr +uv sync --frozen --extra dev +``` + +`--frozen` forbids re-resolving — installs the exact set pinned in +`uv.lock` with hash verification (HARD-02 reproducibility gate). For +a fully air-gapped install with an internal mirror, see +[`docs/AIRGAP_INSTALL.md`](AIRGAP_INSTALL.md). + +`--extra dev` pulls the test runner, type checker, and linters per +`pyproject.toml:42-50`. + +## Run + +Two entry points share the same orchestrator service. 
+ +### CLI / API + +```bash +uv run python -m runtime --config config/incident_management.yaml +``` + +Boots the long-lived `OrchestratorService` and FastAPI surface (`/sessions/*` +REST, SSE, WebSocket). Source: `src/runtime/__main__.py`. + +### Streamlit UI + +```bash +ASR_LOG_LEVEL=INFO uv run streamlit run src/runtime/ui.py --server.port 37777 +``` + +`ASR_LOG_LEVEL` env var enables structured logs at the chosen level +(`DEBUG` / `INFO` / `WARNING` / `ERROR`). Source: +`src/runtime/ui.py:46-65` (`_maybe_configure_logging`). + +The UI binds to the same `OrchestratorService` instance as the CLI; +both can run in the same process (Streamlit script imports the +service lazily on first session). + +## Test + +```bash +# Full suite +uv run pytest -x + +# Without coverage (faster) +uv run pytest -x --no-cov + +# A single file or test +uv run pytest tests/test_interrupt_detection.py -x -v + +# With coverage gate (fails below 85%) +uv run pytest --cov=src/runtime --cov-fail-under=85 -x +``` + +Pytest config: `pyproject.toml:53-58` — `asyncio_mode = "auto"`, +`testpaths = ["tests"]`, `pythonpath = ["src", "."]`. + +Coverage omits: `src/runtime/ui.py`, +`src/runtime/__main__.py`, `src/runtime/checkpointer_postgres.py`, +`src/runtime/triggers/transports/plugin.py` +(`pyproject.toml:71-76`). + +## Lint + type check + +```bash +uv run ruff check src/ tests/ +uv run pyright src/runtime +``` + +CI runs both with `fail-on-error`. + +## Concept-leak ratchet + +```bash +python scripts/check_genericity.py # current count +python scripts/check_genericity.py --baseline 39 # exit non-zero if exceeded +``` + +Enforced by `tests/test_genericity_ratchet.py` — the count must stay +at or below `BASELINE_TOTAL` (currently 39). + +## Bundle regeneration + +After ANY change to `src/runtime/` or `examples/*/`: + +```bash +uv run python scripts/build_single_file.py +git add dist/ +``` + +CI's "Bundle staleness gate (HARD-08)" rebuilds and fails the build +if `dist/*` doesn't match. 
See `docs/DEVELOPMENT.md` for the full +flow. + +## Required services + +Default config uses local-only services: + +- **SQLite** at `/tmp/asr.db` (auto-created on first run) +- **FAISS** vector index at `/tmp/asr-faiss/` (auto-created) +- **Ollama Cloud** (when `llm.default` points there) — needs + `OLLAMA_API_KEY` + +To start fresh after testing: +```bash +rm /tmp/asr.db /tmp/asr.db-wal /tmp/asr.db-shm +rm -rf /tmp/asr-faiss +``` + +## Common setup issues + +| Symptom | Cause | Fix | +|---|---|---| +| `ModuleNotFoundError: runtime` when running tests | `pythonpath` not picked up | Run via `uv run pytest …` (NOT bare `pytest`); pytest reads `[tool.pytest.ini_options].pythonpath` from `pyproject.toml` | +| CI fails "Lockfile freshness gate" | `pyproject.toml` changed without `uv lock` | Run `uv lock` and commit `uv.lock` | +| CI fails "Bundle staleness gate" | `src/runtime/` or `examples/*/` changed without `dist/` regen | Run `uv run python scripts/build_single_file.py` and commit `dist/*` | +| Live LLM tests fail with `Error code: 402` (OpenRouter) | Account out of credits | Switch `llm.default` to `gpt_oss` (Ollama) or another model in `config/config.yaml` | +| Live LLM tests fail with `Connection error` (Azure) | `.env` `AZURE_ENDPOINT` is a placeholder or unreachable | Set a real Azure endpoint, or skip Azure leg (test gates on `AZURE_OPENAI_KEY` + `AZURE_ENDPOINT` per `tests/test_integration_driver_s1.py`) | +| Streamlit dies on every Python 3.14 WebSocket request with `AssertionError: scope["type"] == "http"` | Streamlit ≤ x + Starlette static-files compat bug under Python 3.14 | Cosmetic — HTTP traffic still works. Filter logs with `grep -v "AssertionError\|scope.*type"`. Inference: fixed in newer Streamlit; not yet retested. 
| +| `gpt-oss:20b` returns errors / no envelope | Model dropped the markdown contract | Path 6 permissive synthesis (`turn_output.py`) emits a 0.30-confidence placeholder so the session still finalizes; retry the session for a real envelope | + +## Environment variables + +| Var | Required when | Notes | +|---|---|---| +| `OLLAMA_API_KEY` | `ollama_cloud` provider used | `config/config.yaml` references via `${OLLAMA_API_KEY}` | +| `OPENROUTER_API_KEY` | OpenRouter provider used | | +| `AZURE_OPENAI_KEY` | Azure provider used | | +| `AZURE_ENDPOINT` | Azure provider used | Full URL incl. trailing `/` | +| `AZURE_DEPLOYMENT` | Azure provider used | Defaults to `gpt-4o` in test driver | +| `EXTERNAL_MCP_URL` | external HTTP MCP server configured | (see `tests/fixtures/sample_config.yaml`) | +| `EXT_TOKEN` | external HTTP MCP server with bearer auth | | +| `ASR_LOG_LEVEL` | optional | `DEBUG` / `INFO` / `WARNING` / `ERROR`; UI uses it via `force=True` basicConfig | +| `APP_CONFIG` | optional | overrides default `config/config.yaml` path; read by `src/runtime/ui.py:68` | +| `OLLAMA_LIVE` | optional | gates live-LLM smoke tests in `tests/test_llm_providers_smoke.py` | +| `OLLAMA_BASE_URL` | required for `tests/test_integration_driver_s1.py` | Typically `https://ollama.com` for cloud or `http://localhost:11434` for local | + +CI uses dummy values for the API keys (see `ci.yml` — they only need +to *exist* for the strict-mode `_interpolate` check; tests don't call +live providers). diff --git a/docs/02-architecture.md b/docs/02-architecture.md new file mode 100644 index 0000000..146f3fc --- /dev/null +++ b/docs/02-architecture.md @@ -0,0 +1,146 @@ +# 02 — Architecture + +> Companion to [`docs/DESIGN.md`](DESIGN.md), which carries the +> long-form design narrative. This file is the quick-scan summary. 
+ +## Major components + +``` ++------------------------------------------------------------+ +| App layer (examples/incident_management, examples/code_review) +| - state.py, config.py, skills/, mcp_server.py, ui.py | ++------------------------------------------------------------+ +| Framework — runtime/ | +| - Session, Skill, AgentRun, ToolCall, AgentTurnOutput | +| - Orchestrator, OrchestratorService | +| - Gateway (wrap_tool), policies, ToolRegistry | +| - SessionStore, HistoryStore, EventLog | +| - graph.py: build_graph + make_agent_node | +| - llm.py: provider abstraction | +| - ui.py: Streamlit shell | +| - api.py: FastAPI surface | ++------------------------------------------------------------+ +| LangGraph 1.x (orchestration / state / checkpointing) | +| LangChain 1.x (chat models, agents.create_agent, tools) | +| FastMCP (in-process / stdio / http MCP servers) | ++------------------------------------------------------------+ +| Providers: Ollama Cloud · OpenRouter · Azure OpenAI · … | ++------------------------------------------------------------+ +``` + +| Component | Source | Responsibility | +|---|---|---| +| `Session` (model) | `src/runtime/state.py:70-172` | Lifecycle + telemetry fields. Apps subclass. | +| `Skill` (config) | `src/runtime/skill.py` | YAML-driven agent declaration: kind, model, tools, routes, system_prompt | +| `Orchestrator` | `src/runtime/orchestrator.py` | Owns compiled langgraph + SessionStore + per-session lock | +| `OrchestratorService` | `src/runtime/service.py` | Long-lived asyncio loop wrapper. 
Thread-safe `submit_async` / `submit_and_wait` bridge | +| `make_agent_node` | `src/runtime/graph.py:539+` and `src/runtime/agents/responsive.py:49+` | Builds one langgraph node per skill | +| `_drive_agent_with_resume` | `src/runtime/graph.py:202+` | Drives `langchain.agents.create_agent` executor with HITL pause/resume | +| `wrap_tool` (Gateway) | `src/runtime/tools/gateway.py:224+` | Risk-rated tool wrapper; injects session-derived args; raises `interrupt()` on high-risk | +| `parse_envelope_from_result` | `src/runtime/agents/turn_output.py` | 6-path envelope parser (markdown-primary, with synthesis fallbacks) | +| `SessionStore` | `src/runtime/storage/session_store.py` | CRUD over `IncidentRow` + FAISS write-through | +| `HistoryStore` | `src/runtime/storage/history_store.py` | Read-only similarity search over the same engine | +| `EventLog` | `src/runtime/storage/event_log.py` | Append-only `session_events` table | +| `ApprovalWatchdog` | `src/runtime/tools/approval_watchdog.py` | Background task that times out stale `pending_approval` rows | + +## Request / data flow (one session, end-to-end) + +``` +UI / API ──start_session(query, environment, …)──▶ OrchestratorService + │ + ▼ + Orchestrator (per-session lock) + │ + new IncidentRow inserted ▼ + langgraph compiled graph (Pregel) + │ + ┌────────────────────────┴────────────────┐ + │ for each skill in topological order: │ + │ make_agent_node: │ + │ reload session from store │ + │ wrap tools (gateway) │ + │ create_agent (langgraph subgraph) │ + │ _drive_agent_with_resume: │ + │ inner.ainvoke(messages) │ + │ if __interrupt__: │ + │ raise GraphInterrupt → outer │ + │ pauses; UI sees pending_approval│ + │ else parse envelope, record │ + │ AgentRun, route on signal │ + │ gate node (low-confidence)? │ + │ route to next skill / __end__ │ + └─────────────────────────────────────────┘ + │ + finalize: terminal-tool match? ▼ + default_terminal_status? + agent failure? → status='error' + paused on HITL? 
→ SKIP finalize +``` + +Detailed contract: `docs/DESIGN.md` § 4 + § 7. + +## Storage choices + +| Store | Backend | Default URL / path | Owner | +|---|---|---|---| +| Session metadata | SQLAlchemy (SQLite or Postgres) | `sqlite:////tmp/asr.db` | `SessionStore`, `HistoryStore` | +| Vector similarity | FAISS (filesystem) | `/tmp/asr-faiss/` | `SessionStore._add_vector` | +| LangGraph checkpoints | `langgraph-checkpoint-sqlite` (default) or `langgraph-checkpoint-postgres` (opt-in) | Same SQLite DB as session metadata | `make_checkpointer` (`src/runtime/checkpointer.py`) | +| Event log | SQLAlchemy `session_events` table | Same SQLite DB | `EventLog.append` | +| Memory layers (incident_management only) | Filesystem JSON | `incidents/{kg,releases,playbooks}/` (or seed bundle) | `examples/incident_management/asr/*_store.py` | +| Lesson store (auto-learning) | SQLAlchemy `session_lessons` table | Same SQLite DB | `LessonStore` | + +The whole framework runs against ONE durable backend (SQLite or +Postgres) carrying four separate concerns. Apps don't get to choose +backends per-store — the storage URL is a single config knob. + +## External systems + +The runtime in production reaches out to: + +- **LLM providers** (variable): Ollama Cloud, Azure OpenAI, + OpenAI-compatible endpoints (OpenRouter, etc.). Configured per + `llm.providers` in `config/config.yaml`. Stub provider for tests. +- **MCP servers**: in-process (Python module) by default; `stdio` + and `http` transports also supported per `mcp.servers[*].transport`. + Schema: `MCPServerConfig` in `src/runtime/config.py`. +- **APScheduler** (in-process): drives nightly `LessonRefresher` + jobs and any `schedule:` triggers from the trigger registry. + +The runtime does NOT reach out to: + +- The public internet at boot or runtime in air-gapped deploys — + every provider URL is configurable; the hardcoded + `https://ollama.com` fallback was removed in Phase 13 (HARD-05). 
+- Any package mirror at deploy time — the deploy is copy-only; + `uv sync` runs at *install* time inside CI / the dev box. + +## Important tradeoffs + +| Decision | Trade | Where decided | +|---|---|---| +| LangGraph as orchestration engine | Don't maintain a graph engine; pay for langgraph version churn | `docs/DESIGN.md` DEC-001 | +| `langchain.agents.create_agent` for the per-agent loop | Single tool-loop with native ToolStrategy fallback; we're tied to langchain v1.x's agent API | `docs/DESIGN.md` DEC-002, Phase 15 | +| Markdown contract over `response_format` JSON | Lenient parsing in our code; 6 parse paths instead of 1 schema | DEC-003, Phase 22 | +| Pure-policy HITL gating | One source of truth (`should_gate`); everywhere else just calls it | DEC-004, Phase 11 | +| Generic `Session` + `extra_fields` JSON | Apps can extend without schema migrations; loses some type safety on app fields | DEC-005, v1.1 | +| Per-agent `skill.model` override | Cheap models for cheap agents; one config to think about | DEC-006, v1.5-C | +| Single-file bundle | Air-gap deployable; large files for review (~600KB each) | DEC-007, BUNDLER-01 | +| Concept-leak ratchet | CI gate keeps framework generic; some legitimate `incident` references look like leaks until cleaned | DEC-008, v1.5-B | +| 429 separate retry regime (longer backoff) | Free-tier OpenRouter survives transient throttles; non-429 4xx still fail fast | DEC-009, v1.5-D | +| Inner agent checkpointer + reload-on-entry | HITL Approve/Reject actually drives the gated tool; more state per agent invocation | DEC-010, PR #6 | + +## What this architecture is NOT + +- **Not a workflow engine** — agents are LLM-driven, not declarative + state machines. Routing is signal-based, not condition-tree. +- **Not multi-tenant by default** — one process, one orchestrator, + one storage URL. Multi-tenant deployments need a separate + process/DB per tenant. 
+- **Not horizontally scalable** — `OrchestratorService` is a + single-process / single-loop model. The lock registry + (`SessionLockRegistry`) prevents concurrent writes per session + but assumes one orchestrator per DB. +- **Not authenticated** — there's no built-in user authentication on + the FastAPI surface. Air-gap deploys live behind corporate + network controls; trigger webhook auth is bearer-token only. diff --git a/docs/03-code-map.md b/docs/03-code-map.md new file mode 100644 index 0000000..1fbb72d --- /dev/null +++ b/docs/03-code-map.md @@ -0,0 +1,169 @@ +# 03 — Code map + +> Approximate line counts and per-folder purpose. Verify with +> `find -name '*.py' -exec wc -l {} +`. + +## Top-level directories + +| Path | Purpose | +|---|---| +| `src/runtime/` | Framework code — the only thing the bundler reads to produce `dist/app.py` | +| `examples/incident_management/` | Flagship example app: SRE incident investigation pipeline | +| `examples/code_review/` | Second example app: PR review pipeline (proves framework genericity) | +| `tests/` | Pytest suite (149 test files; 1265 tests; 87% coverage on `src/runtime/`) | +| `scripts/` | Build + lint utilities | +| `config/` | Default framework config + per-app config + skill prompt directory | +| `docs/` | This documentation set + DESIGN narrative + dev / install how-tos | +| `dist/` | **Generated** by `scripts/build_single_file.py`; never hand-edit | +| `ui/` | Streamlit launcher shim (`streamlit_app.py`) | +| `.github/workflows/` | CI: lint + type-check + test + sonar (`ci.yml`) | +| `.planning/` | **Gitignored** local working state (GSD planning workflow); selected artifacts can be committed but rarely should be | + +Top-level files: + +| File | Purpose | +|---|---| +| `pyproject.toml` | Project metadata, deps, pytest/ruff/pyright/coverage config | +| `uv.lock` | Pinned dependency graph with hashes — reproducible installs | +| `pyrightconfig.json` | Pyright typing config (CI gate fails on errors per `ci.yml`) 
| +| `sonar-project.properties` | SonarCloud analysis config (sources, exclusions, CPD exclusions, coverage paths) | +| `README.md` | Repo intro pointing at `docs/DESIGN.md` | + +--- + +## `src/runtime/` (~18 200 lines total) + +### Top-level modules + +| File | LOC | Purpose | Related | +|---|---|---|---| +| `__main__.py` | ~70 | argparse-only CLI entry: `python -m runtime --config ` | `orchestrator.py`, `service.py`, `api.py` | +| `__init__.py` | 0 | empty | | +| `state.py` | 173 | `Session`, `AgentRun`, `ToolCall`, `TokenUsage` pydantic models | All app `Session` subclasses extend the model here | +| `state_resolver.py` | ~70 | Loads the app's `state_class` from a dotted path (`runtime.state_class` config) | Wave-2 generic-runtime decoupling | +| `skill.py` | ~520 | `Skill`, `RouteRule`, `DispatchRule`, skill loader (reads YAML + system.md per skill folder) | `examples/*/skills/*/config.yaml` | +| `config.py` | ~1100 | All pydantic config schemas: `AppConfig`, `LLMConfig`, `MCPConfig`, `OrchestratorConfig`, `GatewayConfig`, `GatePolicy`, etc. 
| `config/config.yaml` | +| `errors.py` | ~50 | Typed exceptions: `LLMTimeoutError`, `LLMConfigError`, `EnvelopeMissingError` | +| `llm.py` | ~600 | `get_llm`, `get_embedding` — provider abstraction + `StubChatModel` for tests | `langchain-openai`, `langchain-ollama` | +| `mcp_loader.py` | ~270 | Loads MCP servers per `mcp.servers[*]`, builds `ToolRegistry` | `fastmcp`, `langchain-mcp-adapters` | +| `orchestrator.py` | ~1400 | `Orchestrator` class — `start_session`, `stream_session`, `resume_session`, `retry_session`, `_finalize_session_status_async`, `_is_graph_paused` | `graph.py`, `service.py` | +| `service.py` | ~830 | `OrchestratorService` — long-lived asyncio loop; thread-safe bridge | Used by both UI + API | +| `api.py` | ~880 | FastAPI surface — `/sessions/*` REST + SSE + WebSocket + approvals | `service.py` | +| `api_dedup.py` | ~110 | API endpoint for retracting dedup matches | `dedup.py` | +| `graph.py` | ~1430 | LangGraph build (`build_graph`, `_build_agent_nodes`), `make_agent_node`, `_drive_agent_with_resume`, `_ainvoke_with_retry`, `parse_envelope_from_result` callers | `langgraph`, `langchain.agents.create_agent` | +| `intake.py` | ~250 | Default intake supervisor runner — similarity retrieval + dedup gate | `dedup.py`, `LessonStore` | +| `dedup.py` | ~430 | Two-stage dedup pipeline (embedding similarity + LLM stage 2) | `HistoryStore` | +| `similarity.py` | ~50 | Cosine similarity helper | +| `policy.py` | ~270 | Pure functions — `should_gate`, `should_retry`, gate decision dataclass | `gateway.py`, `orchestrator.py` | +| `locks.py` | ~120 | `SessionLockRegistry` — per-session asyncio locks; `SessionBusy` exception; D-01 contract | `orchestrator.py`, `service.py` | +| `checkpointer.py` | ~120 | LangGraph checkpointer factory (sqlite default); `make_checkpointer` | `langgraph-checkpoint-sqlite` | +| `checkpointer_postgres.py` | ~80 | Postgres checkpointer (lazy-imported; `pip install asr[postgres]`) | `langgraph-checkpoint-postgres` | +| `dedup.py` 
| ~430 | Two-stage dedup (embedding + LLM) | (above) | +| `terminal_tools.py` | ~80 | Maps terminal tool names → status transitions per `cfg.orchestrator.terminal_tools` | +| `skill_validator.py` | ~110 | Validates `skill.model` references against `LLMConfig.models` at orchestrator boot | + +### Subpackages + +| Path | Purpose | Key files | +|---|---|---| +| `agents/` | Agent-kind factories | `responsive.py` (default LLM agent — mirrors `graph.py:make_agent_node`), `supervisor.py` (rule/llm dispatch), `monitor.py` (out-of-band runner), `turn_output.py` (envelope parser + `AgentTurnOutput` model) | +| `tools/` | Gateway + arg-injection + watchdog | `gateway.py` (~830 lines — risk-rated wrap + interrupt/resume), `arg_injection.py` (session-derived args), `approval_watchdog.py` (~320 lines — stale-approval timeout), `__init__.py` | +| `storage/` | Persistence | `models.py` (SQLAlchemy `IncidentRow` + `EventRow` + `LessonRow`), `engine.py` (engine factory), `embeddings.py` (FAISS-backed embedder), `vector.py` (vector store), `session_store.py` (~660 lines — CRUD), `history_store.py` (~230 lines — read-only similarity), `event_log.py` (~135 lines), `lesson_store.py` (~150 lines), `migrations.py` (~210 lines), `checkpoint_gc.py` (~50 lines) | +| `learning/` | Auto-learning (M5/M6) | `extractor.py` (lesson extraction at finalize), `scheduler.py` (~160 lines — APScheduler nightly refresher) | +| `memory/` | App-overridable memory hooks | `session_state.py`, `hypothesis.py` (triage hypothesis loop), `knowledge_graph.py`, `release_context.py`, `playbook_store.py`, `resolution.py` — these are runtime-agnostic helpers; the L2/L5/L7 stores in `examples/incident_management/asr/` use them | +| `triggers/` | Trigger registry | `base.py` (TriggerTransport ABC), `config.py`, `registry.py` (~320 lines), `idempotency.py` (~210 lines), `auth.py` (bearer), `resolve.py`, `transports/api.py`, `transports/webhook.py` (~140 lines), `transports/schedule.py` (~85 lines), 
`transports/__init__.py` | + +--- + +## `examples/incident_management/` + +| File | Purpose | +|---|---| +| `__init__.py` | empty | +| `state.py` | `IncidentState(Session)` subclass — `query`, `environment`, `reporter`, `summary`, `tags`, `severity`, `category`, `matched_prior_inc`, `resolution`, `memory: MemoryLayerState` | +| `mcp_server.py` | `IncidentMCPServer` — `lookup_similar_incidents`, `create_incident`, `update_incident`, `submit_hypothesis`, `mark_resolved`, `mark_escalated`, `hydrate_and_gate` (memory hydration + dedup gate) | +| `mcp_servers/observability.py` | Observability tools: `get_logs`, `get_metrics`, `get_service_health`, `check_deployment_history` | +| `mcp_servers/remediation.py` | Remediation tools: `propose_fix`, `apply_fix` (gated `high`), `notify_oncall` | +| `mcp_servers/user_context.py` | User-context tools | +| `asr/` | L2 Knowledge Graph + L5 Release Context + L7 Playbook stores (filesystem-backed) | +| `skills/intake/` | Supervisor skill: rule-dispatch to triage; runs similarity + memory hydration | +| `skills/triage/` | Hypothesis-loop investigator | +| `skills/deep_investigator/` | Evidence gathering | +| `skills/resolution/` | Propose / apply fix or escalate | +| `skills/_common/` | Shared prompt fragments (output contract, confidence calibration) | + +Per-skill structure: `/config.yaml` + `/system.md`. + +--- + +## `examples/code_review/` + +| File | Purpose | +|---|---| +| `state.py` | `CodeReviewState(Session)` — `pr: PullRequest`, `review_findings: list[ReviewFinding]`, `overall_recommendation`, `review_summary`, `review_token_budget` | +| `mcp_server.py` | `CodeReviewMCPServer` — `fetch_pr_diff` (mock), `add_review_finding`, `set_recommendation` | +| `skills/intake/` `analyzer/` `recommender/` | 3-skill responsive pipeline | + +Demonstration / mock; the diff fetch reads `tests/fixtures/code_review//.json` if present. 
+ +--- + +## `tests/` (149 files) + +Test groups by topic (sample): + +| Pattern | Topic | +|---|---| +| `test_agent_node*.py`, `test_real_llm_tool_loop_termination.py`, `test_integration_driver_s1.py` | Agent runner contract, live-LLM smoke | +| `test_interrupt_detection.py`, `test_gateway_persist_resolution.py`, `test_orchestrator_pause_detection.py`, `test_approval_*.py` | HITL approve/reject end-to-end | +| `test_markdown_turn_output.py` | Phase 22 envelope parser (36 tests) | +| `test_ainvoke_retry_429.py` | 429 retry backoff regime | +| `test_per_agent_model_dispatch.py` | v1.5-C per-agent dispatch contract | +| `test_genericity_ratchet.py`, `test_concept_leak_ratchet.py` | Framework-leak counters | +| `test_session_store.py`, `test_incident_store.py`, `test_history_store.py`, `test_dedup_*.py` | Storage layer | +| `test_telemetry_integration.py`, `test_event_log.py` | Per-step events | +| `test_api*.py`, `test_approval_api.py`, `test_session_lock.py` | FastAPI surface + lock contract | +| `test_bundle_*.py`, `test_build_*.py` | Bundler + bundle completeness | +| `test_triggers/` | Trigger registry transports | +| `test_ui_*.py`, `test_render_*.py` | Streamlit UI helpers | +| `test_skill*.py` | Skill loader, model override resolution | + +Helpers: `tests/_envelope_helpers.py`, `tests/_policy_helpers.py`, +`tests/conftest.py` (if present), `tests/fixtures/` (sample +configs, mock PR diffs). + +--- + +## `scripts/` + +| Script | Purpose | +|---|---| +| `build_single_file.py` | The bundler. Reads `RUNTIME_MODULE_ORDER` + per-app order lists, flattens into `dist/`. **Must run after any change to `src/runtime/` or `examples/`** | +| `check_genericity.py` | Counts `incident` / `severity` / `reporter` tokens in `src/runtime/`. 
Powers the ratchet test | +| `lint_skill_prompts.py` | Phase 21 (SKILL-LINTER-01) — walks every `examples/*/skills/*/system.md` and asserts referenced tool names + arg fields exist in the inventory | +| `migrate_jsonl_to_sql.py` | One-off migration for legacy JSONL incident store → SQLAlchemy | +| `seed_demo_incidents.py` | Seeds the FAISS index + sqlite DB with demo data for UI walkthroughs | + +--- + +## `config/` + +| File | Purpose | +|---|---| +| `config.yaml` | Default framework config — LLM providers + models, MCP servers, storage URL, trigger registry, gateway policy | +| `config.yaml.example` | Annotated template for new deploys | +| `incident_management.yaml` | Incident-app composite config (framework + app keys) | +| `code_review.yaml`, `code_review.runtime.yaml` | Code-review composite config | +| `skills/` | Optional shared skill prompts (rare; usually skills live under `examples//skills/`) | + +--- + +## `docs/` + +| File | Purpose | +|---|---| +| `DESIGN.md` | Long-form architecture + decisions narrative | +| `DEVELOPMENT.md` | Day-to-day dev workflow | +| `AIRGAP_INSTALL.md` | Air-gap install procedure | +| `00-…` through `11-…` | This brownfield documentation set (you're reading it) | +| `adr/0001-…` | Architecture Decision Record | diff --git a/docs/04-main-flows.md b/docs/04-main-flows.md new file mode 100644 index 0000000..bfd2e36 --- /dev/null +++ b/docs/04-main-flows.md @@ -0,0 +1,288 @@ +# 04 — Main flows + +For each flow: **entry points**, **key files**, and **failure +modes**. Companion to `docs/DESIGN.md` § 2 (architecture overview) +and § 7 (HITL). + +--- + +## Auth / login + +**Status: not present in framework.** Air-gap deploys rely on +corporate network controls (the runtime never opens its own +auth surface). + +The only auth surface in the framework is **bearer-token +auth on webhook trigger endpoints** (`auth: bearer` in +`triggers:` config; token read from env var at startup; constant- +time comparison via `hmac.compare_digest`). 
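The bearer check described above can be sketched as follows. This is a minimal illustration, not the contents of `src/runtime/triggers/auth.py`: the helper names `load_token` and `is_authorized` are hypothetical, and the only call taken from the text is the standard-library `hmac.compare_digest`.

```python
from __future__ import annotations

import hmac
import os


def load_token(env_var: str) -> str:
    """Read the expected token once at startup; refuse to start if it is unset."""
    token = os.environ.get(env_var, "")
    if not token:
        # Hypothetical error type; the real code raises its own config error.
        raise RuntimeError(f"{env_var} is missing or empty; trigger refuses to start")
    return token


def is_authorized(authorization_header: str | None, expected_token: str) -> bool:
    """Return True only for a 'Bearer <token>' header whose token matches."""
    if not authorization_header:
        return False
    scheme, _, presented = authorization_header.partition(" ")
    if scheme.lower() != "bearer":
        return False
    # compare_digest avoids leaking the position of the first mismatching byte.
    return hmac.compare_digest(presented.encode(), expected_token.encode())
```

A caller would map a `False` result to `HTTP 401`. Rate limiting is deliberately absent, matching the note that the framework does not provide it.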
+ +Entry point: `src/runtime/triggers/auth.py`, +`src/runtime/triggers/transports/webhook.py`. + +Failure modes: +- Missing/empty token env → trigger refuses to start + (`LLMConfigError` analogue at config-load) +- Wrong bearer → `HTTP 401` +- Timing-safe comparison only — no rate limiting in the framework + +--- + +## Session lifecycle (request → terminal) + +Entry points (any of): + +- **CLI** — `python -m runtime --config ` (boots the FastAPI surface) +- **API** — `POST /sessions` (`src/runtime/api.py`) +- **Streamlit UI** — "Start Investigation" button (`src/runtime/ui.py`) +- **Webhook trigger** — `POST /triggers/{name}` (configured per `triggers:` block in YAML) +- **Schedule trigger** — APScheduler cron (in-process) +- **Plugin trigger** — custom transport via setuptools entry-point + +All entry points converge on +`OrchestratorService.start_session(query=…, environment=…, …)`, +which: + +1. Allocates the session ID synchronously on the loop +2. Inserts the row (`status='new'`) +3. Spawns an `asyncio.Task` for `Orchestrator.graph.ainvoke(...)` +4. Returns the session ID immediately (caller polls or streams) + +Key files: + +- `src/runtime/service.py:start_session` (entry point) +- `src/runtime/orchestrator.py:start_session` (per-session lock + graph kick-off) +- `src/runtime/graph.py:make_agent_node` (per-skill agent step) +- `src/runtime/agents/turn_output.py:parse_envelope_from_result` (envelope contract enforcement) +- `src/runtime/orchestrator.py:_finalize_session_status_async` (terminal status assignment) + +Per-step events emitted to `EventLog`: +`agent_started → tool_invoked* → confidence_emitted → route_decided +→ agent_finished` per agent; `gate_fired` at HITL boundaries; +`status_changed` on terminal transitions. 
+ +Failure modes: + +| What | Symptom | Where caught | +|---|---|---| +| LLM 5xx / connection reset | Retried 3× with 1.5s/3s/4.5s backoff | `_ainvoke_with_retry` | +| LLM 429 rate-limit | Retried 3× with 7.5s/15s/22.5s backoff | `_ainvoke_with_retry` | +| LLM 4xx (non-429) | Fail immediately → `_handle_agent_failure` → `status='error'` | `make_agent_node` exception arm | +| LLM dropped markdown contract (no envelope) | Path 5 (terminal-tool args) → Path 6 (permissive synthesis) → 0.30-confidence placeholder | `parse_envelope_from_result` | +| LLM dropped contract AND no tool calls | Hard fail → `EnvelopeMissingError` → `status='error'` | Path 7 | +| HITL high-risk tool gate fires | `interrupt()` raised, session stays `in_progress`, pending_approval row written | `gateway.wrap_tool` | +| Operator times out an approval | `ApprovalWatchdog` resolves with `verdict=timeout` | `tools/approval_watchdog.py` | +| Stale-version save (concurrent writers) | `StaleVersionError` raised; caller reloads + retries | `SessionStore.save` | +| Recursion limit hit on inner agent | LangGraph `GraphRecursionError` propagates → `_handle_agent_failure` | langgraph default bound | + +--- + +## HITL approve / reject (high-risk tool) + +Trigger: an agent calls a tool tagged `high` in +`runtime.gateway.policy` (or matching `gate_policy.resolution_trigger_tools` +in production env). + +Flow: + +``` +agent calls apply_fix + └─ gateway _arun + ├─ inject session-derived args (e.g. 
environment) + ├─ should_gate → GateDecision(gate=True, reason=…) + ├─ append ToolCall(status='pending_approval') + store.save + └─ langgraph.types.interrupt(payload) ← pauses inner agent + ↓ +inner.ainvoke returns with __interrupt__ in result dict + └─ _drive_agent_with_resume detects, raises GraphInterrupt + ↓ +outer Pregel pauses (state checkpointed) + └─ ainvoke returns with __interrupt__ on outer state + ↓ +finalize SKIPPED (Orchestrator._is_graph_paused → True) + ↓ +[UI / API: operator clicks Approve or POSTs to /approvals/{tcid}] + ↓ +graph.ainvoke(Command(resume={"decision": "approve", ...})) + └─ outer node re-runs + └─ _drive_agent_with_resume: aget_state(inner_cfg).next non-empty + └─ outer interrupt() → returns the verdict dict + └─ inner.ainvoke(Command(resume=verdict), config=inner_cfg) + └─ gateway _arun re-enters + └─ verdict == "approve" → run apply_fix + └─ update pending row → status='approved' + save + ↓ +inner agent finishes + └─ envelope parsed, AgentRun recorded, route to next node / END + ↓ +outer ainvoke returns + └─ finalize runs (no longer paused) → terminal status set +``` + +Key files: + +- `src/runtime/tools/gateway.py:_arun` (and `_run` mirror) — the + pause + resume entry points +- `src/runtime/graph.py:_drive_agent_with_resume` — the + langgraph 1.x `__interrupt__` plumbing +- `src/runtime/orchestrator.py:_is_graph_paused` — finalize guard +- `src/runtime/api.py:submit_approval_decision` — HTTP approval handler +- `src/runtime/ui.py:_submit_approval_via_service` — UI approval handler +- `src/runtime/tools/approval_watchdog.py` — stale-approval timeout + +Failure modes: + +| What | Symptom | Where caught | +|---|---|---| +| `Command(resume=…)` raises `Cannot use Command(resume=...) 
without checkpointer` | Inner agent missing checkpointer | Inner `create_agent` always gets `checkpointer=` per PR #6 | +| Stale `state["session"]` on resume → gateway double-appends → `StaleVersionError` | Outer Pregel checkpoint at step boundaries doesn't reflect mid-step gateway saves | `make_agent_node` reloads from store at entry per PR #6 | +| Operator approves but DB row stays `pending_approval` | Gateway didn't save after status transition | `_record_pending_resolution` saves after every transition (approved/rejected/timeout) per PR #6 | +| Session goes to `error` instead of resuming | Pre-PR-#6 langgraph 1.x silently swallowed `interrupt()` and finalized the session | Fixed by `_drive_agent_with_resume` | + +--- + +## Background jobs + +### `LessonRefresher` (auto-learning, M5/M6) + +Source: `src/runtime/learning/scheduler.py`. + +Runs an APScheduler job (default: nightly 02:00 UTC; configurable +via `learning.scheduler` block in YAML). For each session resolved +since the last run, extracts a `Lesson` row capturing the winning +hypothesis + applied fix. + +Entry: `LessonRefresher.start()` (called by lifespan hook in +`src/runtime/api.py`). + +Failure modes: +- Job exception → APScheduler logs and continues to next tick + (defensive `try/except` around the per-session extraction) +- Long-running extraction blocks subsequent ticks within the same + scheduler — bounded by per-session timeout + +### `ApprovalWatchdog` + +Source: `src/runtime/tools/approval_watchdog.py`. + +Polls the DB for `pending_approval` rows older than +`framework.approval_timeout`. Resolves them with +`verdict=timeout` so operators don't end up with permanently-paused +sessions. + +Entry: `ApprovalWatchdog.start()` (called by lifespan hook). 
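The watchdog's polling rule can be sketched in memory as below. This is an assumption-laden illustration rather than the code in `approval_watchdog.py`: `PendingCall`, `expire_stale`, and the `requested_at` field are hypothetical names, and the real implementation reads and saves rows through the session store instead of mutating a list.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class PendingCall:
    # Hypothetical stand-in for a persisted ToolCall row.
    tool_call_id: str
    status: str
    requested_at: datetime
    verdict: str | None = None


def expire_stale(calls: list[PendingCall], timeout_seconds: float, now: datetime) -> list[str]:
    """Resolve pending approvals older than the timeout; return their ids."""
    cutoff = now - timedelta(seconds=timeout_seconds)
    expired = []
    for call in calls:
        if call.status == "pending_approval" and call.requested_at < cutoff:
            call.status = "timeout"   # assumed status name for a timed-out row
            call.verdict = "timeout"  # mirrors verdict=timeout in the text
            expired.append(call.tool_call_id)
    return expired
```

With the sample `approval_timeout: 1800`, a row requested an hour ago is resolved on the next tick, while one requested a minute ago is left pending.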
+ +Failure modes: +- DB unreachable → logged, retried next tick +- Resolution race with concurrent operator approval → + `StaleVersionError`; watchdog reloads + retries + +--- + +## Data ingestion / sync + +### Trigger registry + +Source: `src/runtime/triggers/`. + +Three transport flavours configurable in `config.yaml`'s +`triggers:` block: + +| Transport | Entry | Per-trigger config | +|---|---|---| +| `webhook` | `POST /triggers/{name}` (FastAPI route) | `payload_schema`, `transform`, `auth`, `idempotency_ttl_hours` | +| `schedule` | APScheduler in-process cron | `schedule:` 5-field cron, `payload:` static | +| `plugin` | custom (`TriggerTransport` ABC, setuptools entry-point) | per-plugin | +| `api` | back-compat for `POST /investigate` | (deprecated alias) | + +Each trigger fires `OrchestratorService.start_session(...)` with +a synthetic payload. Provenance stamped on +`session.findings['trigger']` so dashboards can answer "where did +this come from?" + +Failure modes: + +| What | Symptom | Where caught | +|---|---|---| +| Transform raises | `HTTP 422 Unprocessable Entity`, NOT cached for idempotency | `transports/webhook.py` | +| Auth fails | `HTTP 401` | `triggers/auth.py` | +| Idempotency-Key replay | First request's `session_id` returned | `triggers/idempotency.py` | +| Schedule drift | ±1 minute under normal load (in-process APScheduler limit) | Inference: not measured; documented in legacy README | + +### Two-stage dedup pipeline + +Source: `src/runtime/dedup.py`. + +Stage 1: embedding similarity over closed sessions +(`HistoryStore.find_similar`). Stage 2: LLM judge confirms (or +rejects) the match. Confirmed matches mark the new session +`status='duplicate'` with `parent_session_id` linkage. + +Entry: `Orchestrator._run_dedup_check` called early in +`start_session`. 
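The two stages can be sketched as a single function. This is a simplified illustration under stated assumptions, not the code in `src/runtime/dedup.py`: `find_duplicate`, `cosine`, and the `judge` callable are hypothetical, vectors are plain Python lists, and the defaults mirror the sample `stage1_top_k: 5` and `stage1_threshold: 0.82` values.

```python
from __future__ import annotations

import math
from typing import Callable, Mapping, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def find_duplicate(
    query_vec: Sequence[float],
    closed: Mapping[str, Sequence[float]],  # session id -> embedding
    judge: Callable[[str], bool],           # stage 2: stand-in for the LLM judge
    top_k: int = 5,
    threshold: float = 0.82,
) -> str | None:
    """Return the id of a confirmed prior session, or None."""
    # Stage 1: rank closed sessions by similarity, keep the top_k above threshold.
    scored = sorted(((cosine(query_vec, vec), sid) for sid, vec in closed.items()), reverse=True)
    candidates = [sid for score, sid in scored[:top_k] if score >= threshold]
    # Stage 2: ask the judge to confirm each candidate in rank order.
    for sid in candidates:
        try:
            if judge(sid):
                return sid
        except Exception:
            # Degrade to "not a duplicate" so dedup never crashes intake.
            return None
    return None
```

A non-`None` result corresponds to marking the new session `status='duplicate'` and linking it to the returned parent.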
+ +Failure modes: +- LLM stage 2 throws → degrades to "not a duplicate" so dedup + never crashes intake (`Orchestrator._run_dedup_check` catches `Exception`) +- No similar sessions → returns False, normal flow proceeds + +--- + +## Deployment + +Source: `scripts/build_single_file.py`, +`docs/AIRGAP_INSTALL.md`, `docs/DEVELOPMENT.md`. + +**Build (CI / dev box):** +```bash +uv sync --frozen --extra dev +uv run python scripts/build_single_file.py +git add dist/ && git commit +``` + +**Deploy (target host, copy-only):** + +7-file payload: +``` +app.py (renamed from dist/apps/.py) +ui.py (dist/ui.py) +config/config.yaml (framework: LLM, MCP, storage) +config/.yaml (app: severity aliases, escalation roster, …) +config/skills/ (optional skill prompt overrides) +.env (provider keys) +``` + +Boot: +```bash +python -m runtime --config config/.yaml +streamlit run ui.py --server.port 37777 +``` + +CI gate `Bundle staleness gate (HARD-08)` rebuilds the bundles +from source on every PR and refuses the merge if `dist/*` differs +from a fresh build. Means `dist/*` on `main` is always +deploy-ready. 
+ +Failure modes: + +| What | Symptom | Where caught | +|---|---|---| +| New `src/runtime/` module not in `RUNTIME_MODULE_ORDER` | `tests/test_bundle_completeness.py` fails | Local pytest before push | +| Bundle drift (changed src without dist regen) | CI's "Bundle staleness gate" fails | CI | +| Bundle doesn't boot from a clean tmpdir | `tests/test_build_single_file.py` smoke check | Local | +| Lockfile drift | CI's "Lockfile freshness gate" fails | CI (`uv lock --check`) | + +--- + +## Error handling (cross-cutting patterns) + +| Pattern | Example | Source | +|---|---|---| +| Typed exception hierarchy | `LLMTimeoutError`, `LLMConfigError`, `EnvelopeMissingError`, `SessionBusy`, `StaleVersionError` | `src/runtime/errors.py`, `storage/session_store.py`, `locks.py`, `agents/turn_output.py` | +| Bounded retries on transient cloud errors | `_ainvoke_with_retry` (5xx + 429) | `src/runtime/graph.py` | +| Fail-fast on policy errors | `should_gate` raises before tool runs | `src/runtime/policy.py` | +| Defensive try/except around telemetry | EventLog failures NEVER break a tool call | `gateway.py` `_emit_invoked` | +| `_handle_agent_failure` for caught LLM exceptions | Marks session `error` + records failure agent_run | `src/runtime/graph.py` | +| Per-session async lock prevents concurrent writes | `SessionLockRegistry.acquire(session_id)` | `src/runtime/locks.py`, used by `service.py` + `api.py` | +| Optimistic concurrency on save | `version` column on `IncidentRow`; `StaleVersionError` on mismatch | `storage/session_store.py:save` | +| Silent-failure sweep (Phase 18 / HARD-04) | All `except Exception: pass` blocks replaced with logged re-raise or typed handler | `tests/test_silent_failure_sweep.py` (Inference: name based on phase) | diff --git a/docs/05-configuration.md b/docs/05-configuration.md new file mode 100644 index 0000000..9687f7c --- /dev/null +++ b/docs/05-configuration.md @@ -0,0 +1,285 @@ +# 05 — Configuration + +## Layered config + +Two layers, in order of 
precedence: + +| Layer | File(s) | Owns | +|---|---|---| +| **Framework** | `config/config.yaml` (or `${APP_CONFIG}`) | LLM providers + models, MCP servers, storage URL, gateway policy, framework knobs (confidence threshold, escalation roster, dedup), trigger registry, runtime tunables | +| **App** | `examples//config.yaml`, `config/.yaml` (composite) | Domain-specific knobs: severity aliases, escalation teams, environments, similarity thresholds | + +Source: `src/runtime/config.py` (~1100 lines) holds every pydantic +schema. Framework reads + validates at orchestrator boot via +`load_config(path)`. + +The framework's `AppConfig` does **not** contain incident-shaped +keys — they live on `IncidentAppConfig`. Adding a new domain +field is a one-line addition to `IncidentAppConfig`, never to +`runtime.config.AppConfig`. + +--- + +## Environment variables + +Used in `config.yaml` via `${VAR_NAME}` interpolation +(`src/runtime/config.py:_interpolate`). Strict-mode resolver +**fails at config-load** if a referenced var is missing — this +is by design, so missing keys can't silently fall through to +"use default model". + +| Var | Used by | Default | Notes | +|---|---|---|---| +| `OLLAMA_API_KEY` | `ollama_cloud` provider | none | Required if any `llm.providers.*.kind: ollama` entry references it | +| `OPENROUTER_API_KEY` | `openai_compat` provider via OpenRouter | none | | +| `AZURE_OPENAI_KEY` | `azure_openai` provider | none | | +| `AZURE_ENDPOINT` | `azure_openai` provider | none | Full URL incl. 
trailing `/` | +| `AZURE_DEPLOYMENT` | `smart` model in default config | `gpt-4o` (test driver default) | Per-deployment Azure name | +| `EXTERNAL_MCP_URL` | external HTTP MCP server | none | See `tests/fixtures/sample_config.yaml` | +| `EXT_TOKEN` | external HTTP MCP server bearer auth | none | | +| `ASR_LOG_LEVEL` | `src/runtime/ui.py:46-65` | unset (silent) | `DEBUG` / `INFO` / `WARNING` / `ERROR`; takes effect via `force=True` `logging.basicConfig` | +| `APP_CONFIG` | `src/runtime/ui.py:68` | `config/config.yaml` | Path override | +| `OLLAMA_LIVE` | `tests/test_llm_providers_smoke.py` | unset (skip) | Set to `1` to opt into live Ollama smoke | +| `OLLAMA_BASE_URL` | `tests/test_integration_driver_s1.py` | unset | Required for the integration driver `local` arm | + +CI config (`.github/workflows/ci.yml:71-83`) sets dummy values for +all the above so the strict `_interpolate` check passes — tests +don't call live providers. + +--- + +## Config file: `config/config.yaml` + +Top-level structure (see +`config/config.yaml.example` for an annotated template): + +```yaml +storage: + metadata: + url: "sqlite:////tmp/asr.db" # SQLAlchemy URL + pool_size: 5 # postgres only; sqlite uses NullPool + echo: false # SQL echo to stdout + vector: + backend: faiss # faiss | pgvector | none + path: "/tmp/asr-faiss" # FAISS only + collection_name: "incidents" + distance_strategy: cosine # cosine | euclidean | inner_product + +llm: + default: workhorse # name from llm.models below + providers: + ollama_cloud: + kind: ollama + base_url: https://ollama.com + api_key: ${OLLAMA_API_KEY} + azure: + kind: azure_openai + endpoint: ${AZURE_ENDPOINT} + api_version: 2024-08-01-preview + api_key: ${AZURE_OPENAI_KEY} + openrouter: + kind: openai_compat + base_url: https://openrouter.ai/api/v1 + api_key: ${OPENROUTER_API_KEY} + stub: + kind: stub # in-memory canned responses for tests + models: + workhorse: + provider: openrouter + model: inclusionai/ring-2.6-1t:free + temperature: 0.0 + gpt_oss: 
+ provider: ollama_cloud + model: gpt-oss:20b + temperature: 0.0 + gpt_oss_cheap: + provider: ollama_cloud + model: gpt-oss:20b + temperature: 0.4 + smart: + provider: azure + model: gpt-4o + deployment: gpt-4o + temperature: 0.0 + embedding: + provider: ollama_cloud + model: nomic-embed-text # single embedding model + +mcp: + servers: + - name: local_inc + transport: in_process # in_process | stdio | http | sse + module: examples.incident_management.mcp_server + category: incident_management + - name: local_observability + transport: in_process + module: examples.incident_management.mcp_servers.observability + category: observability + # ... + +runtime: + state_class: examples.incident_management.state.IncidentState + gateway: + policy: # tool_name -> low | medium | high + apply_fix: high + restart_service: medium + get_logs: low + max_concurrent_sessions: 8 # SessionCapExceeded → HTTP 429 + +orchestrator: + entry_agent: intake # name of the first skill in the graph + default_terminal_status: needs_review + signals: [success, failed, needs_input] + injected_args: + environment: state.environment # session-derived args injected before LLM-visible signature + terminal_tools: # tool_name -> status transition rules + - tool_name: mark_resolved + status: resolved + kind: terminal + - tool_name: mark_escalated + status: escalated + kind: escalation + extract_fields: { team: args.team } + patch_tools: [submit_hypothesis, update_incident] + default_llm_request_timeout: 120.0 + +framework: + confidence_threshold: 0.75 + escalation_teams: [payments-oncall, infra-oncall, ...] + approval_timeout: 1800 # seconds; ApprovalWatchdog timeout + intake_context: {} # generic intake bag + session_id_prefix: INC # apps override (CR for code-review) + +dedup: + enabled: true + stage1_top_k: 5 + stage1_threshold: 0.82 + stage2_model: workhorse + prompt_template: | # LLM judge prompt (defaultable) + ... 
+ +triggers: # optional; trigger registry transports + - name: pagerduty-incident + transport: webhook + target_app: incident_management + payload_schema: examples.incident_management.triggers.PagerDutyPayload + transform: examples.incident_management.triggers.transform_pagerduty + auth: bearer + auth_token_env: PAGERDUTY_WEBHOOK_TOKEN + idempotency_ttl_hours: 24 + +learning: + scheduler: + enabled: true + cron: "0 2 * * *" # nightly 02:00 UTC +``` + +Inference: not every block above is required for a minimal boot; +omitting `triggers` / `dedup` / `learning` is supported (they're +optional). + +--- + +## Per-skill config + +Each skill is a `/config.yaml` + `/system.md` +pair under `examples//skills/`. + +```yaml +# examples/incident_management/skills/triage/config.yaml +description: Hypothesis-loop triage agent +kind: responsive # responsive | supervisor | monitor +model: gpt_oss_cheap # optional per-agent override; falls back to llm.default +tools: + local_inc: + - submit_hypothesis + - update_incident + local_observability: + - get_logs + - get_metrics + - get_service_health + - check_deployment_history +routes: + - when: success + next: deep_investigator + - when: needs_input + next: __end__ + gate: confidence + - when: default + next: deep_investigator +``` + +The accompanying `system.md` is the system prompt template. It must +include the markdown turn-output contract block (see +`examples/incident_management/skills/_common/output.md`) — failure +to include it will trip the envelope parser unless gpt-oss +synthesises something Path 6 can salvage. + +--- + +## Feature flags + +There are no first-class feature flags. 
Toggles are config-driven: + +| Toggle | Mechanism | +|---|---| +| Disable dedup | `dedup.enabled: false` | +| Disable auto-learning scheduler | `learning.scheduler.enabled: false` | +| Disable HITL gating per env | `gate_policy.gated_environments: []` | +| Disable a tool's risk tier | Remove from `runtime.gateway.policy` (defaults to `auto`) | +| Disable a trigger | Remove from `triggers:` block; restart | +| Switch checkpointer to postgres | Install `asr[postgres]`; change `storage.metadata.url` to a postgres URL | + +--- + +## Secrets required (production) + +For a typical incident-management deploy: + +| Secret | Purpose | +|---|---| +| `OLLAMA_API_KEY` (or `OPENROUTER_API_KEY`, etc.) | LLM provider auth | +| `AZURE_OPENAI_KEY` + `AZURE_ENDPOINT` | If Azure provider used | +| Webhook bearer tokens (e.g. `PAGERDUTY_WEBHOOK_TOKEN`) | If webhook triggers configured | +| Postgres credentials in the SQLAlchemy URL | If `storage.metadata.url` points at postgres | + +**Do NOT commit secrets.** The framework reads them from env vars +via `${VAR_NAME}` interpolation; bind them via your deploy's +secret manager (k8s secret / docker `--env-file` / etc.). + +`.env` is gitignored at the repo root. CI uses dummy values. + +--- + +## Safe defaults + +The shipped `config/config.yaml.example` documents safe defaults: + +- `llm.default: stub_default` — runs without any LLM + provider keys (useful for first boot / smoke) +- `storage.metadata.url: sqlite:///incidents/incidents.db` — local + SQLite, no external service +- `vector.backend: faiss` — local FAISS, no external service +- No `triggers:` block — trigger registry off; only `POST /sessions` + works +- No `dedup:` block — dedup off +- No `learning.scheduler.enabled` block — scheduler off + +These give a working framework boot with zero external dependencies. +Production deploys swap in a real LLM provider and (optionally) +real triggers / dedup / scheduler. 
+ +--- + +## Validators + +`src/runtime/config.py` enforces: + +- `LLMConfig.default` must exist in `llm.models` +- Every `llm.models[*].provider` must exist in `llm.providers` +- Every `${VAR}` placeholder must resolve at config-load (strict) +- Every `skill.model` must exist in `llm.models` (skill-level + validator, separate from `LLMConfig`) + +Errors raise typed exceptions (`LLMConfigError`, `ValueError`) at +boot — the framework refuses to start with a misconfigured registry. diff --git a/docs/06-data-model.md b/docs/06-data-model.md new file mode 100644 index 0000000..9b96998 --- /dev/null +++ b/docs/06-data-model.md @@ -0,0 +1,292 @@ +# 06 — Data model + +## Storage backends in use + +| Concern | Backend | Default URL/path | Source | +|---|---|---|---| +| Session metadata | SQLAlchemy (SQLite default; Postgres optional via `asr[postgres]`) | `sqlite:////tmp/asr.db` | `src/runtime/storage/models.py`, `engine.py`, `session_store.py` | +| Vector similarity | FAISS (filesystem) | `/tmp/asr-faiss/` | `src/runtime/storage/vector.py`, `embeddings.py` | +| LangGraph checkpoints | `langgraph-checkpoint-sqlite` (default) or `langgraph-checkpoint-postgres` | Same SQLite DB as session metadata | `src/runtime/checkpointer.py` | +| Per-step events | SQLAlchemy `session_events` table | Same SQLite DB | `src/runtime/storage/event_log.py` | +| Lessons (auto-learning) | SQLAlchemy `session_lessons` table | Same SQLite DB | `src/runtime/storage/lesson_store.py` | +| Dedup retractions | SQLAlchemy `dedup_retractions` table | Same SQLite DB | `storage/session_store.py:un_duplicate` | +| Trigger idempotency keys | SQLAlchemy `trigger_idempotency_keys` table | Same SQLite DB | `src/runtime/triggers/idempotency.py` | +| Memory layers (incident_management) | Filesystem JSON / YAML | `incidents/{kg,releases,playbooks}/` (or seed bundle) | `examples/incident_management/asr/*_store.py` | + +All SQLAlchemy concerns share the **same engine** +(`storage.metadata.url`). 
One DB, one connection pool, five +logical tables. + +--- + +## Entities + +### `IncidentRow` — primary table + +Source: `src/runtime/storage/models.py`. + +```python +class IncidentRow(Base): + __tablename__ = "incidents" + id: str # PK; format: "-YYYYMMDD-NNN" + status: str # new | in_progress | resolved | escalated | + # needs_review | awaiting_input | error | + # stopped | duplicate + created_at: datetime + updated_at: datetime + deleted_at: datetime | None # soft delete + query: str + environment: str + reporter_id: str # incident-shaped column; apps without + reporter_team: str # the concept ignore (round-trip omits) + summary: str + severity: str | None # incident-shaped column + category: str | None # incident-shaped column + matched_prior_inc: str | None # FK to another row; dedup linkage + resolution: str | None + tags: list[str] # JSON + agents_run: list[AgentRun] # JSON; append-only audit + tool_calls: list[ToolCall] # JSON; append-only audit + findings: dict[str, Any] # JSON; per-agent finding bag + pending_intervention: dict | None # JSON; gate node payload when paused + user_inputs: list[str] # JSON + input_tokens: int # accumulated TokenUsage + output_tokens: int + total_tokens: int + parent_session_id: str | None # dedup linkage to confirmed parent + dedup_rationale: str | None # stage-2 LLM rationale text + extra_fields: dict[str, Any] # JSON; per-app extension bag + version: int # optimistic concurrency token +``` + +**Why so many incident-shaped columns?** History — the framework was +born incident-management-shaped. v1.1 (DEC-005) lifted the runtime +out of the incident shape, but renaming the schema columns would +have required a destructive migration. The columns are tolerated: an +app whose `Session` subclass doesn't declare `severity` or `reporter` +just leaves those columns NULL (round-trip silently omits them per +`_row_to_incident`). 
+ +The v1.5-B generic-noun pass (DEC-008) renamed local variables and +docstrings but **left the SQLAlchemy columns alone** — they would +require a migration. See `docs/DESIGN.md` § 8.2 for rationale. + +### `EventRow` — per-step telemetry + +Source: `src/runtime/storage/models.py`, `event_log.py`. + +```python +class EventRow(Base): + __tablename__ = "session_events" + id: int # autoincrement + session_id: str # FK to incidents.id + kind: EventKind # tool_invoked | gate_fired | + # agent_started | agent_finished | + # confidence_emitted | route_decided | + # status_changed | lesson_extracted | ... + payload: dict # JSON; per-event shape + ts: datetime +``` + +Append-only. Every meaningful boundary in the runtime emits a row. + +### `LessonRow` — auto-learning corpus + +Source: `src/runtime/storage/models.py`, `lesson_store.py`. + +```python +class LessonRow(Base): + __tablename__ = "session_lessons" + id: int + source_session_id: str # FK to incidents.id + title: str + body: str # extracted narrative + embedding: list[float] | None # JSON; for similarity lookup + metadata: dict # JSON + created_at: datetime + updated_at: datetime + deleted_at: datetime | None # soft delete (intake's "still relevant?" gate) +``` + +Built by `LessonExtractor` at session finalize; refreshed nightly by +`LessonRefresher` for sessions resolved manually after the fact. + +### `DedupRetractionRow` — operator un-duplicate audit + +Source: `src/runtime/storage/models.py`, `session_store.py:un_duplicate`. + +```python +class DedupRetractionRow(Base): + __tablename__ = "dedup_retractions" + id: int + session_id: str + original_match_id: str + retracted_at: datetime + retracted_by: str | None + note: str | None +``` + +### `TriggerIdempotencyRow` + +Source: `src/runtime/triggers/idempotency.py`. 
+ +```python +class TriggerIdempotencyRow(Base): + __tablename__ = "trigger_idempotency_keys" + trigger_name: str # PK part 1 + key: str # PK part 2 (Idempotency-Key header) + session_id: str # session minted by the original request + created_at: datetime +``` + +Inference: rows expire opportunistically per `idempotency_ttl_hours` +on each trigger config. + +--- + +## Pydantic models (in-memory; round-trip via `extra_fields`) + +The `Session` base class (`src/runtime/state.py:70-117`) corresponds +roughly to the typed columns on `IncidentRow`. Apps subclass to add +domain fields: + +```python +class IncidentState(Session): + query: str + environment: str + reporter: Reporter + summary: str + tags: list[str] + severity: str | None + category: str | None + matched_prior_inc: str | None + resolution: Any + memory: MemoryLayerState # ASR memory bundle (read-only) + +class CodeReviewState(Session): + pr: PullRequest + review_findings: list[ReviewFinding] + overall_recommendation: Literal["approve", "request_changes", "comment"] | None + review_summary: str + review_token_budget: int +``` + +Round-trip pattern (`SessionStore._row_to_incident` / +`_incident_to_row_dict`): + +- For each field declared on the state class: + - If `IncidentRow` has a typed column for it → write to that column + - Else → write to `extra_fields` JSON +- On load, fields with typed columns hydrate from those columns; + everything else reads from `extra_fields[name]`. + +This keeps row schema migrations rare — apps freely add domain +fields without touching the row schema. 
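
The round-trip rule above can be sketched with plain dictionaries. This is an illustration only: `TYPED_COLUMNS`, `split_state`, and `merge_row` are hypothetical names, and the column set is a small assumed subset of `IncidentRow`, not the real list used by `SessionStore`.

```python
from __future__ import annotations

from typing import Any

# Assumed subset of IncidentRow's typed columns, for illustration only.
TYPED_COLUMNS = {"query", "environment", "summary", "severity", "tags"}


def split_state(state: dict[str, Any]) -> dict[str, Any]:
    """Write fields with a typed column to that column, the rest to extra_fields."""
    row: dict[str, Any] = {"extra_fields": {}}
    for name, value in state.items():
        if name in TYPED_COLUMNS:
            row[name] = value
        else:
            row["extra_fields"][name] = value
    return row


def merge_row(row: dict[str, Any]) -> dict[str, Any]:
    """Inverse of split_state: hydrate typed columns, then extra_fields."""
    state = {k: v for k, v in row.items() if k != "extra_fields"}
    state.update(row.get("extra_fields", {}))
    return state
```

In this sketch a code-review field such as `review_summary` has no typed column, so it lands in `extra_fields` and comes back unchanged on load, without any schema change.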
+ +--- + +## Relationships + +``` +incidents (PK: id) + │ + ├──< session_events.session_id (one-to-many, append-only) + │ + ├──< session_lessons.source_session_id (one-to-many, soft-deletable) + │ + ├──< dedup_retractions.session_id (one-to-many) + │ + ├──> incidents.parent_session_id (self-FK; dedup linkage) + │ + └──> incidents.matched_prior_inc (self-FK; legacy linkage) + +trigger_idempotency_keys (PK: trigger_name + key) + │ + └──> incidents.id (loose ref; not enforced FK) + +LangGraph checkpointer state + └─ keyed by `configurable.thread_id` + (= session_id by default; bumped to ":retry-N" on retry) +``` + +--- + +## Migrations + +Source: `src/runtime/storage/migrations.py` (~210 lines). + +The framework runs **idempotent JSON-walk migrations** at orchestrator +boot, not Alembic. Pre-existing rows get their new fields filled with +defaults so the audit history reads consistently after a schema +extension. + +Two named migrations exist (Inference: based on tests + +`migrations.py` content): + +- `migrate_tool_calls_audit` — added when Phase 4 introduced the + risk-rated gateway audit fields (`risk`, `status`, `approver`, + `approved_at`, `approval_rationale`). Walks every `tool_calls` + JSON and fills missing audit fields with their pydantic defaults. +- `migrate_extra_fields` (Inference) — for the v1.1 decoupling + (DEC-005) extension column. + +There is no Alembic / SQLAlchemy migration framework — schema +changes are additive (new column, new table) and rely on +`Base.metadata.create_all(engine)` at boot for new tables. **Risk: +destructive schema changes (drop column, change type, rename) +require a hand-rolled migration script.** + +--- + +## Persistence assumptions + +- **Single writer per session** — enforced by `SessionLockRegistry` + (`src/runtime/locks.py`); `SessionBusy` raised on contention. +- **Optimistic concurrency on save** — every `SessionStore.save` + bumps `version` and rejects stale-version writes with + `StaleVersionError`. 
Caller's contract is reload + retry. +- **Append-only audit logs** — `agents_run`, `tool_calls`, + `session_events` are never updated in place (the gateway DOES + update individual `tool_calls[idx]` for status transitions, but + the rest of the row stays pristine). +- **Soft delete** — `deleted_at` column on `IncidentRow` and + `LessonRow`. Hard delete is rare; the `delete_session` API is a + soft delete + vector-store removal. +- **Dual write for pending intervention** — both LangGraph + checkpoint AND `IncidentRow.pending_intervention` are written + when a gate pauses, so dashboards reading the relational row + stay accurate. +- **No cross-session transactions** — the framework doesn't model + workflows that span multiple sessions (the `parent_session_id` + link is the only inter-session reference, and it's a passive + pointer). +- **Retry creates a new langgraph thread** — `Orchestrator.retry_session` + bumps the `active_thread_id` (e.g. `INC-…:retry-2`); the + original thread's checkpoint stays at the failed state so the + retry runs fresh. + +--- + +## Vector index + +FAISS is the default (`vector.backend: faiss`); pgvector and "none" +are also supported (`src/runtime/storage/vector.py`). Vectors are +written through on every `SessionStore.save` so the index stays +aligned with the row table. + +Index is keyed on `session_id`; each row carries a single embedding +of `_embed_source` (the session's query text, falling back to +`extra_fields["query"]`). + +--- + +## Backup / restore + +Inference: not formally documented. Practical recovery: + +- **SQLite**: copy `/tmp/asr.db` (and `*-wal`, `*-shm` if mid-write). +- **FAISS**: copy `/tmp/asr-faiss/` directory. +- The two MUST be backed up together — a vector index pointing at + rows that no longer exist will surface "ghost" similar-incidents + matches. The reverse (rows without vectors) silently degrades + similarity to "no matches". 
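
A paired backup of the two stores might look like the sketch below. It is not part of the repo: `backup_together` is a hypothetical helper, and it assumes the service is stopped or idle, because the SQLite snapshot and the FAISS directory copy are not atomic with respect to each other.

```python
import shutil
import sqlite3
from pathlib import Path


def backup_together(db_path: str, faiss_dir: str, dest_dir: str) -> None:
    """Snapshot the SQLite DB and copy the FAISS directory into dest_dir."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)

    # sqlite3's online backup API copies a consistent snapshot, so the
    # *-wal and *-shm sidecar files do not need to be copied by hand.
    src = sqlite3.connect(db_path)
    dst = sqlite3.connect(str(dest / Path(db_path).name))
    try:
        src.backup(dst)
    finally:
        dst.close()
        src.close()

    # Copy the vector index next to the DB snapshot so the two stay paired.
    shutil.copytree(faiss_dir, dest / Path(faiss_dir).name, dirs_exist_ok=True)
```

For the default paths this would be called as `backup_together("/tmp/asr.db", "/tmp/asr-faiss", "/some/backup/dir")`, where the destination directory is a placeholder.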
diff --git a/docs/07-integrations.md b/docs/07-integrations.md new file mode 100644 index 0000000..2701758 --- /dev/null +++ b/docs/07-integrations.md @@ -0,0 +1,196 @@ +# 07 — Integrations + +External systems the framework talks to, plus their dev / local +alternatives. + +--- + +## LLM providers + +Source: `src/runtime/llm.py:get_llm`. Each provider kind maps to a +LangChain chat-model class. + +| Provider kind | Production class | Auth | Local alternative | +|---|---|---|---| +| `ollama` | `langchain_ollama.ChatOllama` | `api_key` (Ollama Cloud) or none (local Ollama) | Run Ollama locally (`ollama serve`); set `base_url: http://localhost:11434` | +| `azure_openai` | `langchain_openai.AzureChatOpenAI` | `api_key`, `endpoint`, `deployment` | None — Azure is cloud-only. Use `stub` for tests. | +| `openai_compat` | `langchain_openai.ChatOpenAI` (with `base_url=`) | `api_key` | Any OpenAI-compatible endpoint (LM Studio, vLLM, OpenRouter, …) | +| `stub` | `runtime.llm.StubChatModel` | none | Built-in canned-response chat model for tests / smoke | + +Switching providers: edit `llm.providers` + `llm.models` in +`config/config.yaml`; per-skill override via `skill.model` in the +skill's YAML. + +429 retry: free / shared upstream tiers (e.g. OpenRouter `…:free`) +are protected by the rate-limit retry regime added in v1.5-D +(`_RATE_LIMIT_MARKERS` in `src/runtime/graph.py`). + +Live verification: `tests/test_integration_driver_s1.py` parametrises +three legs (`local`, `workhorse`, `azure`); each independently skips +on missing keys. `tests/test_llm_providers_smoke.py` is the +single-call smoke gated on `OLLAMA_LIVE=1`. + +--- + +## MCP servers + +Source: `src/runtime/mcp_loader.py`, +`src/runtime/config.py:MCPServerConfig`. 
+ +Four transports: + +| Transport | Connection | Use case | +|---|---|---| +| `in_process` | Loads a Python module that exports a `mcp = FastMCP(...)` instance | Default for example apps; zero network cost | +| `stdio` | Spawns a subprocess command, talks JSON-RPC over stdio | Wrapping a 3rd-party MCP CLI | +| `http` | Talks JSON-RPC over HTTP | Remote MCP server (often with bearer auth via `headers`) | +| `sse` | Server-sent events transport | Inference: present in `MCPServerConfig.transport` literal but not exercised in tests; status: scaffold | + +Configuration: + +```yaml +mcp: + servers: + - name: local_inc + transport: in_process + module: examples.incident_management.mcp_server + category: incident_management + - name: ext_metrics + transport: http + url: ${EXTERNAL_MCP_URL} + headers: + Authorization: "Bearer ${EXT_TOKEN}" + category: observability +``` + +The example apps' MCP servers all use `in_process` — the bundle +ships with the MCP code in the same process. The test fixture at +`tests/fixtures/sample_config.yaml` covers `http` + bearer auth. + +--- + +## Auth providers + +The framework does not integrate with external auth providers +(no SSO, OIDC, SAML, …). Air-gap deploys live behind corporate +network controls. + +The only auth touched by the framework: + +- **MCP server bearer auth** — `headers.Authorization: "Bearer + ${EXT_TOKEN}"` per server config. +- **Webhook trigger bearer auth** — `auth: bearer` + + `auth_token_env: ` per trigger config; constant-time + comparison via `hmac.compare_digest`. + +Both read tokens from env vars at process start; rotating a secret +requires a process restart. + +--- + +## Queues / messaging + +The framework has no built-in queue. 
The closest thing is the +**trigger registry** (`src/runtime/triggers/`), which can fire a +session start from: + +- HTTP POST (webhook) +- APScheduler cron (in-process) +- Custom plugin transport (entry-point or explicit registration) + +There is no SQS / Kafka / NATS / RabbitMQ integration shipped, but +the `TriggerTransport` ABC and `plugin_transports` kwarg on +`TriggerRegistry.create` exist for adding one. The +`src/runtime/triggers/transports/plugin.py` file is a stub — +Inference: scaffold for future SQS/Kafka work. + +--- + +## Observability / external services (referenced by the +incident_management example) + +Source: `examples/incident_management/mcp_servers/observability.py`, +`mcp_servers/remediation.py`, `mcp_servers/user_context.py`. + +The example app's MCP servers expose **mock** versions of operational +tools: + +| Tool | Purpose | Real backend (production) | Mock (this repo) | +|---|---|---|---| +| `get_logs(service, minutes)` | Recent logs | Datadog / Loki / Splunk | Returns canned WARN/ERROR/INFO lines | +| `get_metrics(service, minutes)` | CPU/latency/error-rate samples | Prometheus / Datadog | Returns canned numeric envelope | +| `get_service_health(env)` | Service-level health | Service registry / k8s health | Returns canned per-service health dict | +| `check_deployment_history(hours, env)` | Recent deploys | ArgoCD / Spinnaker / Octopus | Returns canned recent-release list | +| `notify_oncall(team, message)` | Page oncall | PagerDuty / Opsgenie | Returns synthesised page id | +| `apply_fix(proposal_id, env)` | Run a remediation script | Ansible / Salt / custom | Returns deterministic success/failure | +| `propose_fix(hypothesis, env)` | Generate a fix proposal | LLM-driven (this remains LLM-only in production) | Returns canned proposal_id | + +To wire real backends: replace the `_impl` body in the corresponding +`mcp_servers/.py` file with the real client call, keeping the +function signature stable (the LLM-visible tool surface comes from 
+the signature + docstring). + +--- + +## Code review tools + +`examples/code_review/mcp_server.py` ships **mocked**: + +- `fetch_pr_diff(repo, number)` — reads from + `tests/fixtures/code_review//.json` if present; + otherwise returns a tiny synthetic diff. +- `add_review_finding(...)` and `set_recommendation(...)` — + in-process state mutation only. + +There is no real GitHub or GitLab integration. To wire one up, +replace `fetch_pr_diff` with a `gh` API call or PyGithub / +python-gitlab client. + +--- + +## Memory layers (incident_management example) + +Source: `examples/incident_management/asr/`. + +| Layer | Backing files | Lifecycle | +|---|---|---| +| L2 Knowledge Graph | `incidents/kg/{components,edges}.json` (or seed bundle at `examples/incident_management/asr/seeds/kg/`) | Read-only; populated by ops, consumed by intake | +| L5 Release Context | `incidents/releases/recent.json` (or seed bundle) | Read-only; populated by deploy pipeline (out of scope), consumed by triage | +| L7 Playbook Store | `incidents/playbooks/*.yaml` (or seed bundle) | Read-only; authored by SREs, consumed by resolution | + +Filesystem-backed by design — no Neo4j / Redis / pgvector dependency +keeps the framework air-gap-friendly. When the configured layer +directory is empty, each store falls back to the bundled seeds so a +fresh checkout has working data. + +Mutation paths (write-back from agents, playbook authoring) are +deferred — Inference: planned for a later milestone. 
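
The seed fallback described above can be sketched as a small path resolver. `resolve_layer_dir` is a hypothetical name used only for illustration; the actual stores under `examples/incident_management/asr/` may implement the check differently.

```python
from pathlib import Path


def resolve_layer_dir(configured: str, seed: str) -> Path:
    """Return the configured directory if it holds any entries, else the seed dir."""
    configured_path = Path(configured)
    # A missing or empty directory both count as "empty" in this sketch.
    if configured_path.is_dir() and any(configured_path.iterdir()):
        return configured_path
    return Path(seed)
```

With this shape, a fresh checkout that has no `incidents/playbooks/` content would read from the bundled seed directory instead.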
+ +--- + +## CI / external services for development + +| Service | Purpose | Configuration | +|---|---|---| +| GitHub Actions | CI (lint / type-check / test / sonar / bundle freshness) | `.github/workflows/ci.yml` | +| SonarCloud | Code quality + coverage gate | `sonar-project.properties`, `SONAR_TOKEN` repo secret | +| CodeQL | Security analysis | Default GitHub setup; `.github/workflows/` (auto-generated) | +| Socket Security | Dependency security scan | Auto-detected on PRs | +| OpenRouter | Live LLM smoke (when keys present) | `OPENROUTER_API_KEY` repo secret (Inference: project owner controls) | + +CI does not call live LLM providers — the test suite is +stub-mode-only. Live integration smokes (`tests/test_integration_driver_s1.py`, +`tests/test_llm_providers_smoke.py`) are gated on env vars and skipped +in CI. + +--- + +## Where to override for local dev + +| Want to | Override | +|---|---| +| Use local Ollama instead of Ollama Cloud | `llm.providers.ollama.base_url: http://localhost:11434` | +| Use SQLite in `/var/lib/asr/` instead of `/tmp` | `storage.metadata.url: sqlite:////var/lib/asr/asr.db`, `storage.vector.path: /var/lib/asr/faiss` | +| Use Postgres instead of SQLite | `pip install asr[postgres]`; `storage.metadata.url: postgresql://…` | +| Skip MCP entirely for an integration test | Use `LLMConfig.stub()` + an empty `MCPConfig` (see `tests/_envelope_helpers.py`) | +| Test webhook trigger locally | Set `triggers:` in a local `config.yaml`; `curl -H 'Authorization: Bearer …' -X POST http://localhost:8000/triggers/` | diff --git a/docs/08-testing.md b/docs/08-testing.md new file mode 100644 index 0000000..a978dc8 --- /dev/null +++ b/docs/08-testing.md @@ -0,0 +1,169 @@ +# 08 — Testing + +## Framework + +**pytest** with `pytest-asyncio` (asyncio_mode=auto), `pytest-cov`, +`pytest-repeat` (for D-13 stability gate). Config in +`pyproject.toml:53-58`. 
+ +``` +[tool.pytest.ini_options] +asyncio_mode = "auto" +testpaths = ["tests"] +addopts = "-v --cov=src/runtime --cov-report=term-missing --cov-report=xml" +pythonpath = ["src", "."] +``` + +Coverage gate **fails below 85%** when run with +`--cov-fail-under=85`. Current coverage: **87.04%** (post v1.5). + +## How to run + +```bash +# Full suite, fail-fast +uv run pytest -x + +# Without coverage (faster iteration) +uv run pytest -x --no-cov + +# Single file +uv run pytest tests/test_interrupt_detection.py -x -v + +# Single test +uv run pytest tests/test_interrupt_detection.py::test_resume_forwards_verdict_to_inner_tool_and_completes -xvs + +# With coverage gate +uv run pytest --cov=src/runtime --cov-fail-under=85 -x + +# Stability check (50 iterations of one test — D-13 local gate) +uv run pytest tests/test_session_lock.py -x --count=50 + +# Live integration smoke (gated on env vars) +OLLAMA_API_KEY=... OLLAMA_BASE_URL=https://ollama.com \ + uv run pytest tests/test_integration_driver_s1.py -v +``` + +CI runs the full suite + coverage XML + JUnit XML for SonarCloud. + +## Suite structure + +149 test files; ~1265 tests; ~140s for the full suite. 
+ +### By topic + +| Topic | Sample files | +|---|---| +| Agent runner contract + live-LLM smoke | `test_agent_node*.py`, `test_real_llm_tool_loop_termination.py`, `test_integration_driver_s1.py`, `test_per_agent_model_dispatch.py` | +| HITL approve/reject + gateway | `test_interrupt_detection.py`, `test_gateway_persist_resolution.py`, `test_orchestrator_pause_detection.py`, `test_approval_*.py`, `test_gateway_*.py`, `test_interrupt_status_handling.py` | +| Markdown turn-output parser | `test_markdown_turn_output.py` (36 tests) | +| Retry behaviour | `test_ainvoke_retry_429.py` (5 tests) | +| Storage layer | `test_session_store.py`, `test_incident_store.py`, `test_history_store.py`, `test_dedup_*.py`, `test_event_log.py` | +| FastAPI surface + locks | `test_api*.py`, `test_approval_api.py`, `test_session_lock.py`, `test_retry_concurrency.py` | +| Triggers | `test_triggers/test_*.py` (transport per file) | +| Bundler + bundle | `test_build_*.py`, `test_bundle_*.py` | +| Genericity ratchets | `test_genericity_ratchet.py`, `test_concept_leak_ratchet.py` | +| Skill loader | `test_skill*.py` | +| Telemetry + auto-learning | `test_telemetry_integration.py`, `test_lesson_*.py` | +| UI helpers | `test_ui_*.py`, `test_render_*.py` | +| Memory layers (incident_management) | `test_asr_*.py`, `test_kg_store.py`, `test_release_store.py`, `test_playbook_store.py` | +| Per-app tests | `test_code_review_*.py`, `test_two_apps_coexist.py`, `test_session_id_format.py`, `test_generic_round_trip.py` | + +### Helpers + fixtures + +| File | Purpose | +|---|---| +| `tests/_envelope_helpers.py` | `EnvelopeStubChatModel` — pydantic stub LLM that emits the markdown contract, used across HITL + agent tests | +| `tests/_policy_helpers.py` | Helpers for building synthetic gate decisions | +| `tests/fixtures/sample_config.yaml` | Reference config for config-loader tests | +| `tests/fixtures/code_review//.json` | Mock PR diffs for the code-review example app | + +Conftest is implicit (no 
`tests/conftest.py` discovered; +fixtures defined per-file). + +## What's covered well + +- **Markdown envelope parser** — 36 tests covering 6 paths, + Unicode dash variants, gpt-oss empty-closing pattern, terminal-tool + args synthesis, permissive synthesis fallback. +- **HITL pause/resume on langgraph 1.x** — `test_interrupt_detection.py` + proves the GraphInterrupt re-raise + Command(resume) forwarding; + `test_gateway_persist_resolution.py` (10 tests) proves the DB row + reflects the verdict for both sync + async paths. +- **Retry regimes** — `test_ainvoke_retry_429.py` pins both backoff + windows (5xx and 429) plus fast-fail on non-transient errors. +- **Per-agent LLM dispatch** — `test_per_agent_model_dispatch.py` + proves `_build_agent_nodes` calls `get_llm` with `model_name=skill.model`. +- **Storage round-trip** — `test_generic_round_trip.py` proves + `extra_fields` JSON survives full save/load cycles for arbitrary + `Session` subclasses. +- **Optimistic concurrency** — `test_session_lock.py` (over 1000 + lines) covers the D-01 / D-20 contracts: per-session lock holds + across HITL pause; resume re-acquires cleanly; concurrent retry + is rejected. +- **API surface** — `test_api_react_surface.py` covers `/sessions/*` + + SSE + WebSocket + structured error envelope. +- **Two apps coexist** — `test_two_apps_coexist.py` proves an + incident session and a code-review session can share the same + metadata DB without collisions (per `Session.id_format`). + +## What's covered weakly or not at all + +| Gap | Why it matters | Where to start | +|---|---|---| +| `src/runtime/ui.py` (~1700 lines, 0% coverage) | Streamlit shell — exercised by manual smoke. Phase 20 (HARD-09) scaffolded `tests/test_ui_*.py` but UI parity coverage is a milestone. 
| `tests/test_ui_*.py` exists; extend with `streamlit.testing.v1.AppTest` | +| `src/runtime/__main__.py` | argparse-only CLI; covered by smoke only | Inference: low risk | +| `src/runtime/checkpointer_postgres.py` | Postgres saver; CI is sqlite-only | Run a postgres container in CI for a one-test postgres smoke | +| `src/runtime/triggers/transports/plugin.py` | Stub for future transports | n/a | +| `ApprovalWatchdog` × `gateway` saves on transition | I added gateway saves on transitions in PR #6; the watchdog should observe a faster cleanup signal but no focused test verifies that. ~15 min. | New test asserting the watchdog resolves a row faster after a gateway save | +| Live integration with all 3 providers green simultaneously | OpenRouter is out of credits and Azure has placeholder endpoint in this dev `.env` | Operator-side issue, not framework | +| `test_silent_failure_sweep.py` | Should assert no `except Exception: pass` survives | Inference: name based on Phase 18 / HARD-04; verify the test exists and passes | + +## Risky areas needing more tests + +1. **Multi-agent live runs against real providers** — only the + single-agent S1 driver is live-gated. Multi-agent E2E (intake → + triage → DI → resolution) only runs in stub mode. A live multi- + agent driver would catch provider-quirk regressions earlier. +2. **`HistoryStore` filter dimensions** — apps build their own + `filter_resolver`; the framework only tests the incident-shaped + one. A code-review-shaped filter test would prove the seam holds. +3. **`OrchestratorService.stop_session` mid-pause** — what happens + if the operator cancels a session that's currently `pending_approval`? + `test_session_lock.py` covers locks; explicit cancellation + semantics during HITL deserve a focused test. +4. **`migrations.py` rollback** — the migrations are forward-only + and idempotent. 
A backward-compat regression test (run the new + code against an old-shape DB) exists for `migrate_tool_calls_audit`; + adding similar tests for future migrations would lock the + contract. +5. **Trigger registry under concurrency** — `test_triggers/` + covers each transport in isolation; a fan-in test (50 webhooks + firing concurrently) would catch idempotency-key races. + +## CI gates + +`.github/workflows/ci.yml`: + +| Gate | Tool | Failure behavior | +|---|---|---| +| Lockfile freshness (HARD-02) | `uv lock --check` | Fails if `pyproject.toml` drift from `uv.lock` | +| Bundle staleness (HARD-08) | `python scripts/build_single_file.py && git diff --exit-code dist/` | Fails if `dist/` would change | +| Lint | `ruff check src/ tests/` | Fails on any rule violation | +| Type check (HARD-03) | `pyright src/runtime` | Fail-on-error since Phase 19 | +| Test + coverage | `pytest --cov=src/runtime --cov-report=xml --junitxml=junit.xml` | Default fail on test failure; coverage gate via SonarCloud | +| Skill-prompt-vs-schema lint (SKILL-LINTER-01) | `python scripts/lint_skill_prompts.py` | Fails if any skill prompt references a tool name / arg field that doesn't exist | +| SonarCloud scan | `SonarSource/sonarqube-scan-action@v8.0.0` | Quality gate (coverage / hotspots / duplications) reported back to the PR | + +## How to add a test + +1. Pick the file matching the topic (or create a new one if cross-cutting). +2. If async, no decorator needed (`asyncio_mode=auto`). +3. If you need a stub LLM, use `EnvelopeStubChatModel` from + `tests/_envelope_helpers.py` — it emits the markdown contract + automatically. +4. If you need a `Session` instance with a particular state class, + use `runtime.storage.session_store.SessionStore.create(...)` + over a tmp_path engine (see `_make_repo` patterns in existing + tests). +5. Run the new test with `-xvs` to iterate; then `-x` for the full + suite to catch regressions. 
diff --git a/docs/09-build-deploy-release.md b/docs/09-build-deploy-release.md new file mode 100644 index 0000000..5c060e5 --- /dev/null +++ b/docs/09-build-deploy-release.md @@ -0,0 +1,227 @@ +# 09 — Build / deploy / release + +## Build commands + +| Step | Command | Source | +|---|---|---| +| Install dependencies (frozen, hash-verified) | `uv sync --frozen --extra dev` | `uv.lock`, `pyproject.toml:42-50` | +| Regenerate single-file bundle | `uv run python scripts/build_single_file.py` | `scripts/build_single_file.py` | +| Lint | `uv run ruff check src/ tests/` | | +| Type-check | `uv run pyright src/runtime` | `pyrightconfig.json` | +| Test + coverage | `uv run pytest --cov=src/runtime --cov-fail-under=85` | `pyproject.toml:53-58` | +| Skill-prompt linter | `uv run python scripts/lint_skill_prompts.py` | | +| Concept-leak ratchet | `uv run python scripts/check_genericity.py --baseline 39` | | +| Lockfile freshness | `uv lock --check` | | + +The "build" of this project is **not a wheel** — wheels exist +(`pyproject.toml:[tool.hatch.build.targets.wheel]` declares +`packages = ["src/runtime", "examples"]`) but the deployed artifact +is the **single-file bundle** under `dist/`. Wheels are useful for +local `pip install -e .` development; the deployed shape is +copy-only. + +## Packaging — the bundler + +Source: `scripts/build_single_file.py`. Runs in three steps: + +1. Read `RUNTIME_MODULE_ORDER` (a list of `(root, relpath)` tuples + topologically ordered so each module's body sees its + dependencies' symbols already in scope). +2. For each module: read source, strip intra-bundle imports + (the bundle is one big namespace — `from runtime.config import X` + becomes a no-op when `X` is already defined above). +3. Concatenate + emit four bundles: + +| Output | Contents | +|---|---| +| `dist/app.py` (~660KB) | Framework only. Used to demonstrate the runtime stands on its own. | +| `dist/apps/incident-management.py` (~707KB) | Framework + `incident_management` example. 
The deployment ship target for the incident app. | +| `dist/apps/code-review.py` (~670KB) | Framework + `code_review` example. The second example, demonstrating genericity. | +| `dist/ui.py` (~68KB) | Streamlit shell. Sits next to whichever `app.py` you deployed and `from app import …` reaches into the deploy bundle's flattened namespace. | + +The bundler also runs an `ast.parse` smoke on each output so a +broken bundle fails the script (rather than failing at deploy). + +## CI/CD + +Source: `.github/workflows/ci.yml`. + +Single workflow `quality:` runs on every push to `main` and on every +PR. Steps: + +``` +checkout (fetch-depth: 0 for SonarCloud blame) + ↓ +setup-python @ 3.11 + ↓ +setup-uv @ 0.11.7 + ↓ +Lockfile freshness gate (uv lock --check) # HARD-02 + ↓ +Install deps (uv sync --frozen --extra dev) + ↓ +Bundle staleness gate (build + git diff --exit-code dist/) # HARD-08 + ↓ +Lint (ruff check src/ tests/) + ↓ +Type check (pyright src/runtime) # HARD-03 fail-on-error + ↓ +Test with coverage (pytest --cov= --cov-report=xml --junitxml=junit.xml) + ↓ +Skill-prompt-vs-schema lint (lint_skill_prompts.py) # SKILL-LINTER-01 + ↓ +SonarCloud Scan +``` + +Total CI time: ~2-3 minutes (most spent in test suite). 
+ +CI environment variables (dummy values for the +`_interpolate` strict check; tests don't call live providers): +- `OLLAMA_API_KEY=""` +- `OPENROUTER_API_KEY=""` +- `AZURE_OPENAI_KEY=""` +- `AZURE_DEPLOYMENT=""` +- `AZURE_ENDPOINT=https://ci-dummy.example/` +- `EXTERNAL_MCP_URL=https://ci-dummy.example/` +- `EXT_TOKEN=ci-dummy` + +## Quality gates + +Beyond CI's pass/fail, these soft gates guide PR review: + +| Gate | Source | Threshold | +|---|---|---| +| Coverage | SonarCloud `new_coverage` | ≥ 80% on new code | +| Duplications | SonarCloud `new_duplicated_lines_density` | < 3% (with `sonar.cpd.exclusions` for intentional sync/async + responsive/graph mirrors) | +| Reliability | SonarCloud `new_reliability_rating` | A (=1) | +| Security | SonarCloud `new_security_rating` | A (=1) | +| Maintainability | SonarCloud `new_maintainability_rating` | A (=1) | +| Hotspots reviewed | SonarCloud `new_security_hotspots_reviewed` | 100% | +| Concept-leak ratchet | `tests/test_genericity_ratchet.py` | ≤ `BASELINE_TOTAL` (currently 39) | +| Bundle freshness | `tests/test_bundle_completeness.py` + CI gate | exit-code clean | +| Type errors | `pyright` fail-on-error | zero new errors | +| Lockfile drift | `uv lock --check` | clean | +| Skill prompts | `scripts/lint_skill_prompts.py` | binary pass | + +## Containerisation + +There is **no Dockerfile** in the repo (verified via +`find . -name Dockerfile`). Inference: the deploy target is bare-VM +or systemd, not container. A container deploy would need a +hand-rolled `Dockerfile`: + +```dockerfile +FROM python:3.11-slim +WORKDIR /app +COPY dist/apps/incident-management.py app.py +COPY dist/ui.py ui.py +COPY config/ config/ +ENV PYTHONUNBUFFERED=1 +CMD ["python", "app.py", "--config", "config/incident_management.yaml"] +``` + +(Inference: above is illustrative; not tested in this repo.) + +## Deployment model — air-gap copy + +Source: `docs/AIRGAP_INSTALL.md`, +`docs/DEVELOPMENT.md`, `docs/DESIGN.md` § 10. 
+ +**The deploy target has NO public-internet access** at runtime. Three +phases: + +### Phase A — install dependencies (one-time, on the dev/CI box or behind an internal mirror) + +```bash +export UV_INDEX_URL="https:///simple/" +uv sync --frozen --extra dev # populates ~/.cache/uv from the mirror +# or fully offline if the cache is pre-warmed: +uv sync --frozen --offline --extra dev +``` + +### Phase B — copy the 7-file payload onto the target host + +``` +app.py (renamed from dist/apps/.py) +ui.py (dist/ui.py) +config/config.yaml (framework: LLM, MCP, storage) +config/.yaml (app: severity aliases, escalation roster, …) +config/skills/ (optional skill prompt overrides) +.env (provider keys; secrets manager preferred) +``` + +### Phase C — boot + +```bash +python -m runtime --config config/.yaml & +streamlit run ui.py --server.port 37777 & +``` + +Or systemd units; or k8s `Pod`s. The framework doesn't care. + +## Release flow + +Source: git history + `docs/DESIGN.md` § 13. + +The release pattern in this repo is **squash merge into `main`** via +GitHub PRs. Each milestone is a sequence of small PRs: + +``` +PR opened → CI runs (lint / type / test / sonar / bundle / skill-lint) + → all green → squash merge with verbose subject + → branch deleted + → main moves to the squash SHA +``` + +There is **no separate release branch**, no semver tags, and no +release notes infrastructure. The "release" is `main` itself. + +The milestone history (v1.0 → v1.5) is recorded in +`docs/DESIGN.md` § 13. New work goes on a feature branch (`feat/…`, +`fix/…`, `refactor/…`, `docs/…`); merge via PR. + +## Rollback + +Inference: not formally documented. Practical: + +- **Code rollback** — `git revert ` and merge a revert + PR. CI will re-run. +- **Bundle rollback** — copy the previous bundle from a known-good + `main` commit; the deploy is copy-only so rolling back is just + copying older files. +- **Schema rollback** — there's no Alembic. 
New columns / tables + added via `Base.metadata.create_all` are forward-only; + rolling back code that introduced a new column doesn't delete + the column from the DB (harmless — old code ignores it). New + rows in new tables are abandoned (also harmless). +- **Stuck session rollback** — operator can `DELETE /sessions/{sid}` + (soft delete) or set `status='stopped'` via `stop_session(sid)`. + +## Versioning + +`pyproject.toml:8` declares `version = "0.1.0"`. The version has +not been bumped even though the **product** milestones advanced from v1.0 to v1.5 — +Inference: the package version is independent of the milestone +labelling. There are no git tags pinning the milestones; the +squash SHAs in `docs/DESIGN.md` § 13 are the canonical reference. + +## Operational concerns + +- **Process lifecycle** — `OrchestratorService` runs a single + asyncio loop on a background thread. SIGTERM cancels in-flight + session tasks; the lifespan shutdown hook closes the FastMCP + + SQLAlchemy + checkpointer transports. +- **Session capacity** — `runtime.max_concurrent_sessions: 8` + (default); raises `SessionBusy → HTTP 429` on overflow. +- **Long-running approval** — `framework.approval_timeout` + (Inference: default is 1800 seconds) drives `ApprovalWatchdog`; sessions with + pending approvals beyond that age get auto-resolved with + `verdict=timeout`. +- **DB growth** — `EventLog` and `LessonStore` are append-only. + No automatic pruning. Operators should periodically GC closed + sessions via `delete_session(sid)` (soft delete) or run a + manual VACUUM on SQLite. Inference: not documented; needs a + runbook. +- **FAISS index growth** — vectors are written through on every + save and removed on `delete_session`. The index size scales + linearly with active sessions. 
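The session-capacity bullet above can be pictured with a minimal sketch. `CapacityGate`, `acquire`, `release`, and `active_count` are hypothetical names invented for illustration; only `SessionBusy` and the default cap of 8 come from the text, and the real admission logic in `OrchestratorService` may well differ.

```python
class SessionBusy(Exception):
    """Illustrative stand-in for the runtime's SessionBusy error."""


class CapacityGate:
    """Hypothetical admission gate; not the repo's implementation."""

    def __init__(self, max_concurrent: int = 8) -> None:
        self._max = max_concurrent
        self._active: set[str] = set()

    @property
    def active_count(self) -> int:
        return len(self._active)

    def acquire(self, session_id: str) -> None:
        # Re-admitting an already-active session is a no-op.
        if session_id in self._active:
            return
        # At the cap: refuse instead of queueing, so callers fail fast.
        if len(self._active) >= self._max:
            raise SessionBusy(f"{self._max} sessions already running")
        self._active.add(session_id)

    def release(self, session_id: str) -> None:
        # discard() tolerates releasing a session that was never admitted.
        self._active.discard(session_id)
```

An HTTP layer would presumably translate `SessionBusy` into the 429 response that the bullet describes; that mapping is outside this sketch.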
diff --git a/docs/10-known-risks-and-todos.md b/docs/10-known-risks-and-todos.md new file mode 100644 index 0000000..4c5c25d --- /dev/null +++ b/docs/10-known-risks-and-todos.md @@ -0,0 +1,146 @@ +# 10 — Known risks and TODOs + +## Source-code TODO/FIXME/HACK markers + +Verified via `grep -rnE "TODO|FIXME|XXX|HACK|DEPRECATED" src/ examples/` +on this branch (excluding `__pycache__` and the +`deprecated_kwargs` legitimate name). + +| File | Marker | What | +|---|---|---| +| `src/runtime/locks.py:49` | `TODO(v2)` | Evict idle slots in `SessionLockRegistry` to cap memory in long-running servers | +| `src/runtime/locks.py:53` | `TODO(v2)` | Same — placement note on `_slots: dict[str, _Slot]` | + +That's it. The codebase is otherwise free of TODO/FIXME debt — a +deliberate result of Phase 18 (HARD-04 silent-failure sweep) and the +overall "fix root cause, not workaround" project rule. + +## Hardcoded values worth flagging + +| Where | Value | Risk | +|---|---|---| +| `src/runtime/config.py` (default `MetadataConfig.url`) | `"sqlite:///incidents/incidents.db"` (relative path) | Default points at a relative path; CWD-dependent. The framework's actual default `config/config.yaml` overrides to `sqlite:////tmp/asr.db` (absolute). Operators who skip `config.yaml` get the relative-path default. | +| `src/runtime/config.py` (`storage.vector.path`) | `"incidents/faiss"` (relative) | Same as above | +| `src/runtime/llm.py` Phase 13 default request_timeout | `120.0` seconds | A 2-minute timeout is generous for LLM calls; some providers can hang longer on long-context responses. Per-provider override available | +| `runtime.locks.SessionLockRegistry` | unbounded dict | See `TODO(v2)` above | +| Bundle file sizes | ~660-700KB each | Large for code review. Inference: the flatten + intra-import-strip pattern is the only viable single-file deploy path. 
| +| `_RATE_LIMIT_MARKERS` in `src/runtime/graph.py` | string-match heuristic | If a provider invents a new 429 phrasing, retries fall back to fast-fail. Markers list comments out the variants observed in the wild. | + +## Weak / incomplete features + +### v1.5-D Azure leg of the integration driver +The `azure` parametrize arm in +`tests/test_integration_driver_s1.py` is wired but the dev +`.env` carries placeholder values for `AZURE_ENDPOINT`. Live +verification requires a real Azure deployment; framework code path +(`AzureChatOpenAI` construction) is intact. + +### Duplicate ToolCall audit rows +The HITL fix in PR #6 left a known cosmetic duplication: when the +gateway records a high-risk tool, it stores the row under the +FastMCP composite name (`local_remediation:apply_fix`, colon form), +while the harvester later records the same tool call under the +LLM-visible name (`local_remediation__apply_fix`, double-underscore +form). Two rows for one logical event. Cosmetic in the UI; matters +if any consumer aggregates tool counts. Fix: align both on the `__` +form (~30 min). Out of scope for v1.5; deferred. + +### `ApprovalWatchdog` regression test +PR #6 added gateway saves on resolution transitions. The watchdog +should observe a faster cleanup signal but no focused test verifies +that. Add a 1-test regression. ~15 min. + +### `ASR_LOG_LEVEL` env var documentation +Added in PR #6, mentioned in `docs/01-local-setup.md` and +`docs/05-configuration.md` of this brownfield set, but not in the +main `README.md` or `docs/DEVELOPMENT.md`. One-line note worth +adding for operator visibility. + +### Streamlit UI test coverage +`src/runtime/ui.py` is ~1700 lines, 0% coverage. Phase 20 (HARD-09) +scaffolded `tests/test_ui_*.py` with a few smoke tests but reaching +parity with backend coverage requires a dedicated UI-testing +milestone. Excluded from the coverage gate via +`pyproject.toml:[tool.coverage.run].omit`. 
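For the duplicate `ToolCall` rows described earlier in this section, aligning both recorders on the `__` form amounts to a small name normalisation. The sketch below is illustrative only: `llm_visible_name` is a hypothetical helper, not a function in the repo, and it assumes the composite name contains at most one server prefix.

```python
def llm_visible_name(tool_name: str) -> str:
    """Map a colon-form composite name to the double-underscore form.

    Hypothetical helper: names already in the `__` form (or with no
    server prefix at all) are returned unchanged.
    """
    server, sep, tool = tool_name.partition(":")
    if not sep:
        return tool_name
    return f"{server}__{tool}"
```

Applying the same helper on both the gateway and harvester paths would give one key per logical tool call, which is what any consumer aggregating tool counts needs.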
+ +### Trigger registry plugin transport +`src/runtime/triggers/transports/plugin.py` is a stub — +Inference: scaffold for future SQS / Kafka / NATS work. The +`TriggerTransport` ABC + `plugin_transports` kwarg on +`TriggerRegistry.create` are usable today by external code, but no +in-repo transport beyond api / webhook / schedule. + +### Postgres checkpointer +Optional via `pip install asr[postgres]`. CI is sqlite-only; the +postgres saver code (`src/runtime/checkpointer_postgres.py`) is +excluded from coverage. Production postgres deploys exist but +aren't exercised in the test suite. Risk: a postgres-specific bug +ships unnoticed. + +### ASR memory layer write-back +The L2 / L5 / L7 stores in `examples/incident_management/asr/` +are read-only. Mutation paths (write-back from agents, playbook +authoring) are deferred. Inference: planned for a future +milestone; no roadmap entry confirms this. + +### Dedup pipeline LLM error handling +`Orchestrator._run_dedup_check` catches all `Exception` from the +stage-2 LLM and degrades to "not a duplicate". Defensive but +silently masks a misconfigured stage-2 model. Inference: a typed +error path with logging would make ops triage faster. + +## Security-sensitive areas + +| Area | What to audit | +|---|---| +| `src/runtime/config.py:_interpolate` | Strict mode requires every `${VAR}` to exist; misses VAR-injection if `os.environ` itself is compromised. Standard env-var posture. | +| `src/runtime/triggers/auth.py` | Bearer token is read from env var at process start; rotation requires restart. `hmac.compare_digest` used. No HMAC-signature transport (PagerDuty / Slack) yet — `auth: bearer` only. | +| `src/runtime/tools/gateway.py` (HITL gate) | The risk policy is config-driven (`runtime.gateway.policy`) — operators MUST configure `apply_fix`-class tools as `high` for production environments to enforce HITL. The framework defaults to `auto` for unlisted tools. 
| +| `src/runtime/tools/gateway.py:_record_pending_resolution` | Verdict dict from operator → `Command(resume=verdict)` → tool args. Trust boundary: the operator is trusted; a malicious approver could pass arbitrary `rationale` text but cannot inject tool args (the gateway re-injects from session-derived state). | +| `src/runtime/dedup.py` (LLM stage 2) | Operator-supplied `query` text is interpolated into the LLM prompt. Standard prompt-injection surface — the LLM verdict can be steered by adversarial query content. Currently used only for soft routing (`status='duplicate'`); a misclassification doesn't escalate privileges. | +| `src/runtime/api.py` | NO authentication on `/sessions/*` endpoints. Air-gap deploys live behind corporate network controls. Webhook triggers have bearer auth via the trigger registry. | +| `src/runtime/intake.py` (similarity retrieval) | `query` text is embedded and matched against historical sessions. Low risk — the retrieved lessons are framing context, not authoritative. | +| Vector store (FAISS) | Local files. No encryption at rest; relies on filesystem permissions. Ops should chmod `/tmp/asr-faiss/` appropriately. | + +## Migration risks + +| Migration | Risk | +|---|---| +| Schema additive (new column, new table) | Low — `Base.metadata.create_all` at boot handles new tables; new columns get hand-rolled idempotent JSON-walk migrations under `migrations.py`. | +| Schema destructive (drop column, rename, change type) | High — there is no Alembic. A destructive change requires a one-shot script + a documented downtime window. None planned. | +| `extra_fields` JSON field reshape | Medium — apps store domain fields here. Renaming a field on the app's `Session` subclass without a `SessionStore` migration breaks load. Mitigation: app authors own their migrations. | +| FAISS index format change | Low — re-indexing is idempotent (delete the index file; the next save rebuilds). 
| +| Bundle format change | Low — `dist/*` is regenerated from source on every PR (HARD-08 gate). Bundle drift is mechanical. | +| `langgraph` major version bump | High — PR #6 caught a breaking semantic change in `interrupt()` between langgraph 0.x and 1.x. Future major bumps (2.x?) need similar smoke tests; the `_drive_agent_with_resume` helper is the most exposed surface. | +| `langchain` major version bump | High — `langchain.agents.create_agent` is the agent factory. A signature change there cascades through `make_agent_node`. | +| Provider model deprecation (e.g. OpenRouter free-tier model removed) | Low — config swap; no code change. The 429 retry helps with transient throttles, not deprecations. | + +## Concurrency / race risks + +| Risk | Mitigation | +|---|---| +| Concurrent session writes (UI + API approval simultaneously) | `SessionLockRegistry` enforces single writer per session; second writer gets `SessionBusy → HTTP 429`. | +| Concurrent retry on a session in `error` | `_retries_in_flight` set in `Orchestrator` rejects second retry. | +| Approval race with `ApprovalWatchdog` timeout | `StaleVersionError` → both reload, one wins. Watchdog re-checks before resolving. | +| LangGraph thread_id collision on retry | `retry_session` bumps `active_thread_id` to `:retry-N`; original thread stays at terminated checkpoint. | +| Stale state on HITL resume | PR #6 fix: `make_agent_node` reloads from store at entry. Past pain point — see `docs/DESIGN.md` DEC-010. | + +## Operational risks + +| Risk | Mitigation | +|---|---| +| `/tmp` filling up (SQLite + FAISS in `/tmp` per default config) | Operators should override `storage.metadata.url` and `storage.vector.path` in production to a persistent path. | +| Long-running orchestrator memory growth | `SessionLockRegistry` `TODO(v2)` — slots accumulate; add eviction. | +| Provider key rotation requires restart | Env vars read at process start. No SIGHUP reload. 
| +| Single-process limit | One `OrchestratorService` per host; `runtime.max_concurrent_sessions: 8` cap. Multi-host deploys need a separate orchestrator per host (and a separate metadata DB OR strict per-session locking via a shared lock service — not implemented). | +| Bundle drift on hand-edited `dist/` | CI catches via "Bundle staleness gate (HARD-08)". | +| Lockfile drift after `pip install` instead of `uv sync` | Operators MUST use `uv sync --frozen`; CI catches via `uv lock --check`. | + +## Documentation drift risks + +| Risk | Mitigation | +|---|---| +| Docs reference outdated test counts / coverage / ratchet baseline | `docs/00-project-overview.md` snapshots current values; refresh on milestone landings. | +| `.planning/` (gitignored) used as canonical state | Don't — the canonical state is `docs/DESIGN.md` § 13 and the git history. | +| `.env` placeholder vs real values mismatch | Operators must populate per-deploy; CI uses dummy values. | +| Skill prompts reference removed tool args | `scripts/lint_skill_prompts.py` (Phase 21 / SKILL-LINTER-01) catches as a CI gate. | diff --git a/docs/11-agent-handoff.md b/docs/11-agent-handoff.md new file mode 100644 index 0000000..54303b2 --- /dev/null +++ b/docs/11-agent-handoff.md @@ -0,0 +1,230 @@ +# 11 — Agent handoff + +> Designed for AI coding agents picking this project up cold. If +> you're a human, this works for you too. + +## Project summary in 20 lines + +ASR is a generic Python multi-agent runtime framework. It wraps +**LangGraph** (orchestration / checkpointing) and **LangChain** +(`langchain.agents.create_agent` for the per-agent loop; +`Chat{OpenAI,Ollama}` and `AzureChatOpenAI` for provider abstraction). +Tools come from **FastMCP** servers (in-process / stdio / http). +A risk-rated **HITL gateway** wraps every tool — high-risk calls +raise `langgraph.types.interrupt(payload)` to pause the graph for +operator approval; resume via `Command(resume=verdict)`. 
Agent +output uses a **markdown contract block** (`## Response / ## +Confidence / ## Signal`) parsed by a 6-path lenient parser with +synthesis fallbacks for misbehaving models. + +Two reference apps live in `examples/`: `incident_management` (4-skill +SRE investigation pipeline with ASR memory layers) and `code_review` +(3-skill PR review pipeline; mocked tools). Apps subclass `Session` +to add domain fields; the framework stays generic — a CI ratchet +(`tests/test_genericity_ratchet.py`) keeps it that way. + +The deploy target is air-gapped corporate environments. The deploy +artifact is a single-file bundle under `dist/` (not a wheel) plus a +handful of YAML configs and `.env`. The bundler script +(`scripts/build_single_file.py`) flattens `src/runtime` + an example +app into one `.py` file; CI's "Bundle staleness gate" rebuilds on +every PR and refuses the merge if `dist/` would change. + +`main` is at v1.5; 1265 tests passing; 87% coverage; ruff clean; +SonarCloud green; concept-leak ratchet at 39. v2.0 (React UI +replacing the Streamlit prototype) is the next big move. + +## Top 20 files to read first + +In order — each builds on the previous. + +1. **`README.md`** — repo intro + quick start +2. **`docs/DESIGN.md`** — long-form architecture + decision log + (12 numbered DEC-NNN entries) + milestone history (v1.0 → v1.5) +3. **`docs/02-architecture.md`** — quick-scan summary of the layers +4. **`pyproject.toml`** — deps, pytest/ruff/pyright/coverage config +5. **`config/config.yaml.example`** — annotated config template +6. **`src/runtime/state.py`** — `Session`, `AgentRun`, `ToolCall`, + `TokenUsage` pydantic models +7. **`src/runtime/skill.py`** — `Skill` (YAML-driven agent declaration) +8. **`src/runtime/orchestrator.py`** — `Orchestrator` class + lifecycle + methods (`start_session`, `stream_session`, `resume_session`, + `_finalize_session_status_async`, `_is_graph_paused`) +9. 
**`src/runtime/service.py`** — `OrchestratorService` long-lived + loop wrapper + thread-safe bridge +10. **`src/runtime/graph.py`** — `build_graph`, `make_agent_node`, + `_drive_agent_with_resume`, `_ainvoke_with_retry`, + `parse_envelope_from_result` callers +11. **`src/runtime/agents/turn_output.py`** — markdown envelope + parser, 6-path fallback chain +12. **`src/runtime/tools/gateway.py`** — `wrap_tool` (~830 LOC) — + risk-rated tool wrapper with HITL pause/resume +13. **`src/runtime/llm.py`** — `get_llm` provider abstraction +14. **`src/runtime/storage/session_store.py`** — CRUD + FAISS + write-through + optimistic-version save +15. **`src/runtime/api.py`** — FastAPI `/sessions/*` REST + SSE + + WebSocket + approvals +16. **`examples/incident_management/state.py`** — example + `IncidentState(Session)` subclass +17. **`examples/incident_management/mcp_server.py`** — example MCP + server pattern +18. **`tests/test_interrupt_detection.py`** — proves the HITL fix + end-to-end (read this for the resume contract) +19. **`scripts/build_single_file.py`** — the bundler (the deploy + pipeline) +20. 
**`.github/workflows/ci.yml`** — CI gates (lint / type / test / + sonar / bundle / skill-lint) + +## Commands future agents SHOULD use + +| Goal | Command | +|---|---| +| Install / sync deps | `uv sync --frozen --extra dev` | +| Run full test suite | `uv run pytest -x` | +| Run single test fast | `uv run pytest tests/.py:: -xvs --no-cov` | +| Lint | `uv run ruff check src/ tests/` | +| Type check | `uv run pyright src/runtime` | +| Coverage gate | `uv run pytest --cov=src/runtime --cov-fail-under=85 -x` | +| Regenerate single-file bundle | `uv run python scripts/build_single_file.py` | +| Concept-leak ratchet check | `python scripts/check_genericity.py` | +| Skill-prompt linter | `uv run python scripts/lint_skill_prompts.py` | +| Lockfile freshness | `uv lock --check` | +| Boot CLI | `uv run python -m runtime --config config/incident_management.yaml` | +| Boot Streamlit UI | `ASR_LOG_LEVEL=INFO uv run streamlit run src/runtime/ui.py --server.port 37777` | +| Reset local state | `rm /tmp/asr.db /tmp/asr.db-*; rm -rf /tmp/asr-faiss` | +| Inspect session events | `sqlite3 /tmp/asr.db "SELECT kind, datetime(ts), substr(payload,1,200) FROM session_events WHERE session_id='' ORDER BY ts;"` | +| Inspect a session row | `sqlite3 /tmp/asr.db "SELECT id, status, version FROM incidents WHERE id='';"` | +| Live integration smoke | `OLLAMA_API_KEY=… OLLAMA_BASE_URL=https://ollama.com uv run pytest tests/test_integration_driver_s1.py -v` | +| Open a PR | `gh pr create --base main --head --title "…" --body "…"` | +| Watch CI | `gh pr checks --watch` | +| Squash merge | `gh pr merge --squash --delete-branch --subject "…"` | + +## Commands future agents SHOULD AVOID + +| Avoid | Why | Use instead | +|---|---|---| +| `pip install …` | Bypasses uv lockfile; CI's "Lockfile freshness gate" will fail | `uv add ` then `uv sync` | +| `pytest …` (bare) | Doesn't pick up `pythonpath` from `pyproject.toml` | `uv run pytest …` | +| Editing `dist/*` directly | Bundles are generated; hand-edits 
get clobbered + CI's "Bundle staleness gate" fails | Edit `src/runtime/` or `examples/`, regenerate via `scripts/build_single_file.py` | +| `git commit` without bundle regen after touching `src/runtime/` or `examples/` | CI's bundle gate fails | Run `scripts/build_single_file.py`, `git add dist/` | +| `git push --force` to `main` (or any shared branch) | Rewrites history for everyone | Use a feature branch + PR | +| `git push origin --delete ` for branches you didn't create | Destructive on shared state | Confirm with the owner | +| Adding a `TODO` to source | Project rule is "fix root cause, not workaround"; the only `TODO(v2)` in the repo is intentional | Open an issue or write the fix | +| Adding `except Exception: pass` | Phase 18 (HARD-04) explicitly removed all of these | Log + re-raise, or catch a typed exception | +| Touching schema columns on `IncidentRow` | Requires a migration; v1.5-B (DEC-008) explicitly left the incident-shaped columns alone | Use `extra_fields` JSON for app-specific data | +| Calling live LLM providers in tests | CI uses dummy keys; live tests are env-gated and skipped | Use `LLMConfig.stub()` + `EnvelopeStubChatModel` | +| Renaming `incident` → `session` in source code without bumping the ratchet test | `tests/test_genericity_ratchet.py` enforces the count downward only | Update `BASELINE_TOTAL` in the same commit with rationale comment (see history at `tests/test_genericity_ratchet.py:60-86`) | +| Writing agent-generated `*.md` outside `docs/` and committing | `docs/*` is gitignored except for explicit allowlist | Add to the allowlist in `.gitignore` if it's a real deliverable; otherwise keep it local | + +## Architectural rules + +These are **load-bearing** — if you're tempted to violate one, stop +and re-read `docs/DESIGN.md` § 12 (decision log). + +1. **The framework stays domain-agnostic.** Apps subclass `Session` + for domain data; framework code references `Session` and + `extra_fields`, never app-specific fields. 
The concept-leak + ratchet enforces this on `incident` / `severity` / `reporter` + tokens. +2. **One source of truth per concern.** Gate decisions: + `policy.should_gate`. Retry policy: `policy.should_retry`. + Status finalization: `_finalize_session_status`. Don't reimplement. +3. **HITL pause is NOT an error.** `GraphInterrupt` and the + `__interrupt__` field on the result dict signal a checkpointed + pending_approval, not a failure. `_handle_agent_failure` must NOT + fire; finalize must NOT run while paused. See PR #6. +4. **Append-only audit trails.** `agents_run`, `tool_calls`, + `session_events` are never updated in place (the gateway's + per-row pending→approved transition IS in-place but is the only + exception, and it persists via `_record_pending_resolution`). +5. **The bundle is the deploy unit.** `dist/*` is regenerated, not + hand-edited. Every PR touching `src/runtime/` or `examples/` + commits a fresh bundle. +6. **Provider abstraction stays in `src/runtime/llm.py`.** Apps + declare provider config; the framework owns the provider class + selection (`langchain_openai.ChatOpenAI` vs + `langchain_openai.AzureChatOpenAI` vs `langchain_ollama.ChatOllama`). +7. **Tests use stubs by default.** Live LLM tests are env-gated; + the suite must run cleanly in CI without any provider keys. +8. **No public-internet calls at deploy time.** Air-gap is the + target. The `https://ollama.com` hardcoded fallback was + explicitly removed in Phase 13 (HARD-05); don't re-introduce. 
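Rule 3 above is easy to get wrong, so here is a minimal sketch of the branching it implies. It covers only the `__interrupt__` result-dict signal, not the `GraphInterrupt` exception path. The function names and the returned labels are hypothetical; the repo's own `_is_graph_paused` and `_handle_agent_failure` may be shaped differently.

```python
from typing import Any, Mapping, Optional


def is_paused(result: Mapping[str, Any]) -> bool:
    """True when the result carries a non-empty `__interrupt__` entry."""
    return bool(result.get("__interrupt__"))


def next_action(
    result: Mapping[str, Any], error: Optional[BaseException] = None
) -> str:
    """Pick the follow-up step for one agent turn (labels are illustrative)."""
    if error is not None:
        return "handle_failure"   # a real exception was raised
    if is_paused(result):
        return "await_approval"   # checkpointed pause: no failure, no finalize
    return "finalize"             # the turn ran to completion
```

The point of the ordering is that a paused result never reaches the finalize branch and never reaches the failure branch.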
+ +## Coding conventions + +| Convention | Example | +|---|---| +| Pydantic v2 BaseModel for every config / state | `src/runtime/state.py:Session` | +| Async first; sync wrappers as needed | `OrchestratorService.submit_async` is async; `submit_and_wait` wraps for sync callers | +| Type-hint everything; pyright fail-on-error gate | `src/runtime/graph.py` | +| Skill prompts as `system.md` not Python strings | `examples/*/skills//system.md` | +| Tools registered via `@mcp.tool()` decorator on FastMCP server | `examples/incident_management/mcp_server.py` | +| Per-line `# pyright: ignore[] -- ` for legitimate stub gaps | `src/runtime/orchestrator.py` (multiple) | +| String constants for envelope keys / status values | Avoid bare strings — use `runtime.state.ToolStatus` Literal or named constants | +| `_private_helper(*, kw=…)` for keyword-only args inside the framework | `src/runtime/graph.py:make_agent_node` | +| Test files mirror source: `src/runtime/X.py` → `tests/test_X.py` | Most do; some are topical (`test_interrupt_detection.py` ≠ one source file) | +| Conventional-commit subjects | `feat(retry): 429 rate-limit retry…`, `fix(hitl): …`, `refactor(v1.5-B): …`, `docs: …`, `build: …`, `chore(config): …` | +| Atomic commits per logical change; squash-merge into main | git history shows the pattern | + +## Common traps + +1. **`pytest` (bare) doesn't pick up the `pythonpath`** → `ModuleNotFoundError: runtime`. Use `uv run pytest …`. +2. **Touching `src/runtime/` or `examples/` without regenerating `dist/`** → CI bundle gate fails. Always run `uv run python scripts/build_single_file.py && git add dist/` before committing. +3. **Adding a kwarg to a framework function without checking callers** → `incident=` rename in v1.5-B caught the example app's `_record_success_run(incident=…)` call. Run `git grep -nE "\\("` before any signature change. +4. 
**Approving a HITL session that was created on pre-PR-#6 code** → that session's checkpoint is poisoned (langgraph 1.x semantic mismatch). The Approve button silently no-ops. Tell the user to start a fresh session. +5. **Live OpenRouter `:free` model rate limits** → first call may 429. The v1.5-D 429 retry (7.5s/15s/22.5s) clears most short-window throttles; persistent 429 means quota exhaustion. +6. **Azure connection error** → check `.env` `AZURE_ENDPOINT` is a real URL, not a placeholder like `noop`. +7. **Pyright complains about langchain stubs** → use `# pyright: ignore[] -- ` per line; don't disable the gate. +8. **Streamlit `AssertionError: scope["type"] == "http"` storm under Python 3.14** → cosmetic Starlette compat bug; HTTP traffic still works. Filter logs. +9. **`StaleVersionError` on HITL resume** → was a real bug pre-PR-#6 (stale `state["session"]`); now mitigated by `make_agent_node` reload-on-entry. If you see it again, check whether you accidentally bypassed the reload. +10. **Two ToolCall rows for one apply_fix** → known cosmetic duplication (gateway colon-form vs harvester `__`-form). Documented as a small follow-up. 
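The 429 schedule in trap 5 (7.5s/15s/22.5s) is consistent with a linear backoff, which the sketch below reproduces. The marker strings and both function names are assumptions made for illustration; the actual `_RATE_LIMIT_MARKERS` list and retry helper in `src/runtime/graph.py` are not shown in this document.

```python
# Assumed marker phrases; the repo's real list may differ.
RATE_LIMIT_MARKERS = ("429", "rate limit", "too many requests")


def is_rate_limited(message: str) -> bool:
    """String-match heuristic: does the error text look like a throttle?"""
    lowered = message.lower()
    return any(marker in lowered for marker in RATE_LIMIT_MARKERS)


def backoff_delays(base_seconds: float = 7.5, attempts: int = 3) -> list[float]:
    """Linear backoff: base, 2 * base, 3 * base, and so on."""
    return [base_seconds * n for n in range(1, attempts + 1)]
```

With the defaults, `backoff_delays()` yields `[7.5, 15.0, 22.5]`, a total wait of 45 seconds before the retries are exhausted.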
+ +## Current unfinished work + +From `docs/00-project-overview.md` § "What's next" and +`docs/10-known-risks-and-todos.md`: + +| Item | Effort | Priority | +|---|---|---| +| **v2.0 — React UI** replacing Streamlit; parity-port against `/sessions/*` API | ~1–2 weeks | High | +| Duplicate ToolCall audit rows (gateway colon vs harvester `__`) | ~30 min | Low (cosmetic) | +| `ApprovalWatchdog` regression test (covers PR #6 saves) | ~15 min | Medium | +| `ASR_LOG_LEVEL` env var doc in main README | ~5 min | Low | +| `src/runtime/locks.py:49` — `TODO(v2)` slot eviction | ~1-2h | Low (relevant for long-running servers) | + +**Environment-side (operator, not framework):** + +- OpenRouter `workhorse` returns 402 on paid models — out of credits +- Azure live verification needs a real `AZURE_ENDPOINT` (`.env` placeholder) + +## Recommended next tasks + +In order of value × effort: + +1. **Update `.planning/STATE.md` + `.planning/ROADMAP.md`** (gitignored, + local) to reflect v1.5 fully shipped. ~5 min. +2. **Land the smaller cleanups together as a single "v1.5 polish" PR**: + `ApprovalWatchdog` test + duplicate ToolCall fix + `ASR_LOG_LEVEL` + doc. ~1h total. Closes the loop on v1.5. +3. **Brainstorm v2.0 React UI** — invoke `superpowers:brainstorming`. + Stack pick (Next.js / Vite + React / Remix?), state management, + API client codegen from `/sessions/*` OpenAPI? +4. **Scaffold v2.0 React UI** in a new top-level `web/` directory. + Don't touch `src/runtime/` until the parity-port surfaces a real + missing API. +5. **Build a multi-agent live driver** that runs intake → triage → + resolution against a real provider end-to-end. Catch provider-quirk + regressions earlier than the single-agent S1 driver. +6. **Postgres CI smoke** — one test against a postgres container so + the optional checkpointer doesn't drift unnoticed. + +## Where DESIGN.md and this handoff differ + +`docs/DESIGN.md` is the **prose narrative** — read it once, top-to- +bottom, to build the mental model. 
This handoff is the **action card** +— skim it at the start of each new session to remember what to do +and what to avoid. + +The 12 numbered files in this `docs/` directory (00 through 11) are +the **per-topic reference**: jump to whichever one matches your +current question. diff --git a/docs/adr/0001-current-architecture.md b/docs/adr/0001-current-architecture.md new file mode 100644 index 0000000..98b2cf4 --- /dev/null +++ b/docs/adr/0001-current-architecture.md @@ -0,0 +1,209 @@ +# ADR 0001: Current architecture + +**Status:** Accepted (snapshot of `main` as of v1.5, post-PR #11) + +**Date:** 2026-05-14 + +**Context:** This ADR captures the architectural baseline that +v1.5 ships. It is a synthesis of the twelve numbered decisions in +`docs/DESIGN.md` § 12 (DEC-001 through DEC-012). Future ADRs +should be written for new decisions that supersede or refine this +baseline. + +--- + +## Decision + +The framework's architecture composes three external layers +(LangGraph, LangChain, FastMCP) with a generic runtime + two +example apps, deployed as a single-file bundle into air-gapped +corporate environments. + +### Layer composition + +| Layer | Provided by | Owned by us | +|---|---|---| +| Provider clients | `langchain-openai`, `langchain-ollama` | NO | +| Agent factory (per-skill ReAct loop) | `langchain.agents.create_agent` (which is itself a langgraph subgraph) | NO | +| Graph orchestration / checkpointing / `interrupt()` | `langgraph` 1.x | NO | +| MCP tool servers | `fastmcp` | NO | +| **Framework abstractions** (`Session`, `Skill`, `Orchestrator`, gateway, telemetry, storage, bundling, HITL plumbing) | THIS REPO (`src/runtime/`) | YES | +| **Apps** (state subclass, MCP servers, skill prompts) | THIS REPO (`examples/`) or external | YES (examples) / external (downstream apps) | + +### Decision summary + +Reference: each is detailed in `docs/DESIGN.md` § 12. 
+ +| ID | Decision | Why | +|---|---|---| +| DEC-001 | LangGraph as orchestration engine | Out-of-the-box Pregel-style step boundaries + checkpointing + first-class HITL `interrupt()` | +| DEC-002 | `langchain.agents.create_agent` as the per-agent loop (Phase 15) | Single tool-loop; AutoStrategy → ToolStrategy fallback; removed the `recursion_limit=25` workaround | +| DEC-003 | Markdown turn-output contract over `response_format` JSON (Phase 22) | JSON schema brittleness across providers; markdown is what every chat model writes well; parse leniency under our control | +| DEC-004 | Pure-policy HITL gating (Phase 11) | One source of truth (`should_gate`); auditing what gates is one grep | +| DEC-005 | Generic `Session` base + `extra_fields` JSON (v1.1) | Apps extend without schema migrations; framework stays domain-agnostic | +| DEC-006 | Per-agent `skill.model` override (v1.5-C / M8) | Cheap models for cheap agents; one config knob | +| DEC-007 | Single-file bundle for air-gap deploy (BUNDLER-01) | Copy-only deploy; no `pip install` at deploy time | +| DEC-008 | Concept-leak ratchet (v1.5-B) | CI-enforced framework genericity; downward-only count | +| DEC-009 | 429 separate retry regime (v1.5-D) | Free upstream tiers (OpenRouter `…:free`) need 30-60s windows; 5xx default backoff exhausts in 9s | +| DEC-010 | Inner agent checkpointer + reload-on-entry (PR #6) | langgraph 1.x `__interrupt__` semantics + outer Pregel step-boundary checkpointing → reload defends against stale state | +| DEC-011 | Two example apps to prove genericity | Without a second app, "is the framework generic?" is unanswerable | +| DEC-012 | Bundle staleness CI gate (HARD-08) | dist drift = deploy-time bugs; CI rebuilds + diff every PR | + +--- + +## Consequences + +### Positive + +- **Air-gap deployable** — copy-only 7-file payload; no runtime + internet dependencies; reproducible installs via `uv.lock`. 
+- **Genuinely generic** — two distinct example apps prove the + decoupling; CI ratchet keeps it that way. +- **HITL is first-class** — risk-rated gateway, durable pause via + langgraph checkpointer, two approval surfaces (UI + API), watchdog + for stale approvals. +- **Per-step observability** — `EventLog` rows for every + meaningful boundary, drives the auto-learning lesson store and + any external observability stack. +- **Provider-agnostic** — Ollama / Azure / OpenAI-compatible via + one config knob; per-skill override. +- **Resilient to provider quirks** — markdown contract + Path 5/6 + synthesis fallbacks; 429 backoff regime; provider timeout + + retry on 5xx. + +### Negative + +- **Two heavy upstream dependencies** (`langgraph`, `langchain`) + with histories of breaking semantic changes (PR #6 caught one; + more likely on future major bumps). +- **Single-process model** — `OrchestratorService` is one asyncio + loop on one host. Multi-host / multi-tenant deploys need + separate orchestrators per tenant. +- **No built-in auth on the FastAPI surface** — relies on corporate + network controls. Webhook triggers have bearer auth only. +- **Schema migrations are ad-hoc** — no Alembic. Additive changes + use `Base.metadata.create_all`; destructive changes need + hand-rolled scripts. +- **Concept-leak residue** — 39 tokens still on the `incident` / + `severity` / `reporter` axis after v1.5-B, mostly schema-coupled + columns + legacy `/incidents/*` URL routes that would require + destructive migration to remove. Documented in + `docs/DESIGN.md` § 12 DEC-008. +- **Bundle files are large** (~660-700KB each). Code review on + `dist/*` is impractical; reviewers focus on `src/runtime/` + diffs and trust the bundle gate. +- **Streamlit UI is a prototype** — slated for replacement by a + React UI (v2.0, not started). Adds a transitional cost. 
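The concept-leak ratchet referred to in the list above can be pictured as a token count compared against a fixed baseline. This is a rough sketch under stated assumptions: the three tokens and the baseline of 39 come from the text, while `count_leaks`, `ratchet_ok`, and the case-insensitive substring matching are hypothetical and may not match what `scripts/check_genericity.py` actually does.

```python
import re

# Tokens named in the text; the matching rule itself is an assumption.
LEAK_TOKENS = ("incident", "severity", "reporter")
_PATTERN = re.compile("|".join(LEAK_TOKENS), re.IGNORECASE)


def count_leaks(sources: dict[str, str]) -> int:
    """Count token occurrences across a {path: file_text} mapping."""
    return sum(len(_PATTERN.findall(text)) for text in sources.values())


def ratchet_ok(total: int, baseline: int) -> bool:
    """Downward-only ratchet: the count may shrink but never grow."""
    return total <= baseline
```

A test built on this idea would fail as soon as a change pushed the total above the recorded baseline, which is the downward-only behaviour the text describes.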
+
+### Neutral
+
+- **No queue / messaging integration shipped** — trigger registry +
+  plugin transport ABC exists, but no SQS/Kafka/NATS in-tree.
+- **No container Dockerfile** — Inference: bare-VM / systemd
+  deploy assumed.
+- **No semver tags** — `pyproject.toml` declares `0.1.0`; the
+  v1.0 → v1.5 milestone labels are documentation-level, not git
+  tags. Squash SHAs in `docs/DESIGN.md` § 13 are the canonical
+  references.
+
+---
+
+## Alternatives considered
+
+### Build a graph engine ourselves
+
+Rejected (DEC-001 implicitly). LangGraph's Pregel + checkpointer +
+interrupt semantics are exactly what HITL needs. Owning the
+orchestration engine would cost us a year of work for a
+similarly-shaped result.
+
+### Stay on `langgraph.prebuilt.create_react_agent`
+
+Rejected in Phase 15 (DEC-002). The prebuilt was deprecated; the
+`recursion_limit=25` workaround we needed to avoid infinite loops
+was a symptom of the prebuilt's interaction with our
+structured-output post-pass. `langchain.agents.create_agent` runs a
+single tool-loop with native ToolStrategy fallback, removing the
+workaround.
+
+### Stay on `response_format=AgentTurnOutput` JSON envelope
+
+Rejected in Phase 22 (DEC-003). `response_format` triggered three
+classes of brittleness: model-specific JSON drift, tool-strategy +
+React END interaction, recursion-limit ceilings. Markdown is the
+native format every chat model writes well; the parse step now
+happens in our code where leniency is in our control.
+
+### Keep `IncidentState` as the only state class
+
+Rejected in v1.1 (DEC-005). Adding a second app (code_review) was
+the forcing function — every "incident-shaped" leak that surfaced
+during code-review's build moved into the framework rather than
+becoming an app workaround. The concept-leak ratchet (DEC-008,
+v1.5-B) keeps this honest.
+
+### Multi-file deploy (zip / tarball / wheel + venv)
+
+Rejected for BUNDLER-01 (DEC-007).
+Air-gap target is copy-only;
+multi-file `pip install` at deploy time is out of scope. The
+bundler turns the multi-file source tree into the smallest
+possible deploy payload (7 files).
+
+### Use Alembic for schema migrations
+
+Considered, rejected (Inference). Schema changes have been purely
+additive so far. When a destructive change becomes necessary,
+adding Alembic at that point is straightforward. Until then, the
+pydantic + JSON-bag pattern keeps schema changes rare.
+
+### Multi-agent supervisor as the entry point (instead of intake)
+
+Considered (Phase 6 introduced `kind: supervisor`). The
+incident-management example app uses a supervisor for intake
+(rule-based dispatch); other apps use a `responsive` skill at entry
+(`code_review` does). The framework supports both patterns equally.
+
+---
+
+## Open questions to revisit in future ADRs
+
+These are decisions the v1.5 baseline does NOT take a strong
+position on:
+
+1. **Multi-host orchestration.** When does the single-process model
+   stop scaling? Does the answer involve a shared lock service, a
+   queue between orchestrators, or just "shard by app"?
+2. **Authentication on the FastAPI surface.** Air-gap defers this;
+   if v2.0 React UI is hosted on a corporate intranet with SSO,
+   we'll need at least a JWT verification layer. ADR 0002?
+3. **Postgres CI coverage.** The `asr[postgres]` extra ships but
+   no CI test exercises it. A Postgres container in CI would
+   close the gap; cost is CI time + workflow complexity.
+4. **Trigger fan-in transports.** SQS / Kafka / NATS plugin
+   transports exist as scaffold — no production user yet. When
+   the first arrives, the plugin transport ABC may need refining.
+5. **React UI architecture.** Stack pick (Next.js? Vite +
+   React Router?), state management (TanStack Query?), API codegen
+   from a generated OpenAPI spec? ADR 0003 territory.
+6. **Lesson-store pruning.** `LessonRow` is append-only; soft delete
+   exists but there's no automatic GC.
+   At what corpus size do
+   intake's relevance lookups slow down enough to need pruning?
+7. **Dual-write inconsistency between
+   `IncidentRow.pending_intervention` and the langgraph
+   checkpointer.** Currently both are written when a gate pauses;
+   the race window between the two writes is tolerated (operator
+   dashboards may briefly disagree). Worth a focused test or a
+   transactional wrapper?
+
+---
+
+## Related documents
+
+- `docs/DESIGN.md` — long-form architecture narrative + decision
+  rationale + milestone history
+- `docs/00-project-overview.md` — what / who / status
+- `docs/02-architecture.md` — quick-scan summary of the layers +
+  data flow
+- `docs/04-main-flows.md` — entry points + failure modes per flow
+- `docs/06-data-model.md` — entities + relationships +
+  persistence assumptions
+- `docs/10-known-risks-and-todos.md` — what's pending
+- `docs/11-agent-handoff.md` — action card for AI agents