From acd63753ddf312103c731ca964ab2ca09b536931 Mon Sep 17 00:00:00 2001 From: Memorysaver Date: Mon, 15 Jun 2026 21:08:18 +0800 Subject: [PATCH 1/8] =?UTF-8?q?docs(research):=20loop=20engineering=20?= =?UTF-8?q?=C3=97=20AEP=20autonomy=20gap=20analysis?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Web research on loop engineering (5 building blocks, ReAct, Ralph loop) mapped against current AEP workflow. Scorecard + gap classification (G1 fresh-context, G2 recovery ladder, G4 post-merge guard, G5 telemetry reflect, G6 self-feeding discovery, G7 hygiene) with priority ordering. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../research/loop-engineering-autonomy-gap.md | 121 ++++++++++++++++++ 1 file changed, 121 insertions(+) create mode 100644 docs/research/loop-engineering-autonomy-gap.md diff --git a/docs/research/loop-engineering-autonomy-gap.md b/docs/research/loop-engineering-autonomy-gap.md new file mode 100644 index 0000000..dfae4ac --- /dev/null +++ b/docs/research/loop-engineering-autonomy-gap.md @@ -0,0 +1,121 @@ +# Loop Engineering 研究 × AEP 自主開發差距分析 + +> **狀態:** 研究 + 比對成果(非實作計畫)。記錄「loop engineering 業界文章」與「AEP 現況」的對照,標出離完全自主開發 (fully autonomous development) 還缺的元素。 +> **日期:** 2026-06-15 **分支:** `research/loop-engineering-autonomy-gap` + +--- + +## 0. TL;DR + +AEP 已經是一個**成熟的外層迴圈 (outer loop)** — 在 loop engineering 的「目標定義、工具環境、誠實驗證 (gen/eval)、host-agnostic 執行、制度記憶」這幾項上甚至**領先**業界文章。缺口集中在三處: + +1. **內層迴圈的 context 紀律** — 單一 workspace agent 跑完 Phase 0–13,有 context rot 風險,缺 Ralph 式 fresh-context-per-task。 +2. **自主復原能力** — 卡關時只會「重試同一招 → 升級給人」,缺「換策略」的階梯。 +3. **合併後的生產回饋迴圈未閉環** — 缺自動 rollback / telemetry 驅動的 reflect / 自我餵食的工作發掘。這也是讓無人值守自主**安全**的前提。 + +--- + +## 1. 研究:Loop Engineering 是什麼 + +**定義(MindStudio):** 「設計不只回應一次、而是 act → observe → decide → repeat 直到目標真正達成的 AI 系統」。各家 coding agent(Claude Code、Devin、Codex)品質差異「通常不是底層模型,而是 loop 設計」。 + +### 五大構件 + +1. 
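**Clear Goal / Task Definition**: define "done" so that it is evaluable and decomposable into testable subtasks. "Without a termination condition, the agent either never stops or stops at random."
+2. **Tool Set for Environment Interaction**: code execution / fs / shell / docs lookup / test runner. "If the agent cannot run its own code, the loop is just guessing."
+3. **Context Management**: summarize earlier iterations, keep a structured action log, prune irrelevant context. Otherwise the loop hits the token limit or loses its memory.
+4. **Termination Logic**: success / failure / escalation conditions. "Without explicit termination logic, the loop becomes a resource black hole."
+5. **Error Handling & Recovery**: distinguish recoverable from blocking errors and **adjust strategy** by error type. "Retrying the same action after the same error is not learning, it is spinning in place."
+
+### Base pattern: ReAct (Reason + Act)
+
+Understand the goal → attempt an action → run and observe → reason → revise and retry → repeat until done.
+
+### Common loop shapes
+
+Retry Loop / Plan-Execute-Verify / Explore-Narrow / Human-in-the-Loop.
+
+### Implementation checklist (6 items)
+
+Define termination first → give structured feedback (not raw output) → periodically summarize the running log into working memory → set a strict tool-call budget per iteration → test the failure paths → use a "genuinely unsolvable" task to verify that the exit condition fires.
+
+### Anti-patterns
+
+No exit / strategy stagnation / context overflow / vague goals / missing tool access.
+
+### Ralph Loop (Geoffrey Huntley): key principles
+
+- **Re-malloc the entire context window every iteration**, deliberately "wasting" it to avoid **context rot / compaction** (beyond 60–70% the agent enters the "Dumb Zone").
+- **single-task per iteration**: no multi-stage planning, which shrinks the failure domain.
+- **context isolation**: separate tasks get separate contexts, so specs do not contaminate each other.
+- **Architectural back-pressure**: pre-commit hooks, property-based tests, automated deployment, CDC, restricted write permissions, audit log.
+- **Safety engineering makes aggressive autonomy safe**: Huntley's agent runs with full sudo and pushes straight to master, relying on a complete test suite plus < 30s rollback as protective redundancy.
+- Core: **verification precedes autonomy**, **engineering trumps coding**.
+
+---
+
+## 2. 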
對照記分卡 + +| Loop-engineering 元素 | AEP 現況 | 評級 | 證據 | +| ----------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------- | -------------------------------------- | +| ① Clear Goal / Task Definition | `product-context.yaml`、stories、OpenSpec specs、`contracts.md`、`dispatch_score`/`readiness_score` | ✅ 強 | `dispatch/SKILL.md`、`build` Phase 1–3 | +| ② Tool Set for Environment | workspace agent 全工具、dev server、`ports.env`、test runners | ✅ 強 | `build` Phase 4/6、`init.sh` | +| ③ Context Management(外層) | CHECK 委派 Haiku/`codex exec`(避免 orchestrator 膨脹)、`lessons.md` 注入(cap 2000 tokens) | ✅ 強 | `tick-protocol.md` CHECK→ACT | +| ③ Context Management(內層) | **單一 workspace agent 跑完 Phase 0–13**,task-by-task 但**同一 context**;`init.sh` 只在 reset 後被動復原 | ⚠️ 缺口 G1 | `build/SKILL.md` 全 13 phase | +| ④ Termination Logic | goal driver layer 邊界自停、`--max-turns 200`、`layer_complete` 條件、escalation triggers | ✅ 大致強(但「unsolvable」其實只是「打到上限」) | `tick-protocol.md` ⑦ | +| ⑤ Error Handling & Recovery | gen/eval loop、stuck detection + liveness、orphan 再領養、retry 計數 | ⚠️ 部分(缺「換策略」) | `tick-protocol.md` ④⑤ | +| Ralph:single-task fresh context | 一個 task 一個 commit,但 context 不重置 | ⚠️ 缺口 G1 | — | +| Ralph:architectural back-pressure | 有 CI / gen-eval / contracts | ⚠️ 部分(缺 post-merge 自動 rollback / property tests / audit log) | 缺口 G4 | +| verification precedes autonomy | gen/eval 分離、`feature-verification.json`「只有 evaluator 能改」 | ✅ 強 | `gen-eval/SKILL.md` | +| per-iteration budget / 結構化回饋 / 測失敗路徑 | 有 cost 追蹤、signals;codex 有 `token_budget` | ⚠️ 部分(預算非全 backend 硬約束) | 缺口 G7 | +| 「discovers → assigns → verifies → persists → hands off」自我餵食 | assigns/verifies/persists 有;**discovers 靠人工** envision/reflect | ⚠️ 缺口 G6 | `reflect` Step 1 手動問人 | + +--- + +## 3. 
缺口分類 + +### Bucket 1:執行/戰術層缺口(可自動化) + +- **G1 — 內層 fresh-context per task**(防 context rot;Ralph malloc)。現況 lead 一路跑完 13 phase。 +- **G2 — 換策略復原階梯**(genuine adaptation)。現況 eval FAIL 同一 generator 同思路再修,5 輪打滿升級人;缺「重讀 spec → 換做法 → 拆 story → 換 agent → 才找人」。 +- **G3 — 設計歧義 / 視覺品質自主評斷**。`auto_design` 只是自動跑互動式 `/aep-design`;視覺品質明文「agent 無法判斷」靠 `.5` polish layer 人工。 +- **G4 — Post-merge guard & 自動 rollback ⭐ 安全關鍵**。現況 merge 後即 wrap,無生產健康監控 / 自動 revert / canary / audit log。 +- **G5 — Telemetry 驅動 reflect / outcome 評估**。現況 `reflect` 逐一問人,outcome contract 明文 pause 等人工判斷。 +- **G6 — 自我餵食工作發掘 ("discovers")**。新工作只能從人工 envision/reflect 進入。 +- **G7 — Loop hygiene**。per-phase token/tool-call 硬預算、termination 區分「打到上限」vs「真正無解」。 + +### Bucket 2:技術上可自動化、但**建議保留人工**(待拍板) + +| 關卡 | 可自動化途徑 | 建議 | 理由 | +| --------------------------- | --------------------------------------- | ---------------------------------- | ---------------------------- | +| S1 產品願景 `/aep-envision` | telemetry + 市場訊號生成 hypothesis | **保留人工** | 高風險、定義「做什麼」是戰略 | +| S2 架構決策 `/aep-map` | gen/eval 架構提案 | **人工核准**(agent 提案、人按鈕) | 難回滾、跨層影響 | +| S3 Outcome contract 判定 | 量化指標可自動 (見 G5);質化保留 | **混合** | — | +| S4 成本/優先序權衡 | `dispatch_score` 已自動;殘留為預算上限 | **policy 化**(預算寫成 config) | — | + +### Bucket 3:本質保留人工 + +最終問責 / 生產事故價值判斷、倫理與商業風險決策。 + +--- + +## 4. 建議優先序(供後續實作規劃) + +| 優先 | 缺口 | 為何 | +| ------ | -------------------------------------- | ------------------------------------------------ | +| **P0** | G4 post-merge guard、G2 復原階梯 | 沒 G4 → 無人值守不安全;沒 G2 → 一直 spin 回找人 | +| **P1** | G1 fresh-context、G5 telemetry reflect | 內層品質/規模 + 閉合外層回饋迴圈 | +| **P2** | G6 自我餵食、G3 視覺自主、G7 hygiene | 推向真正連續自主 | + +> 逐檔案的實作藍圖(新增 reference / config flag / 修改點)已草擬,待決定推進哪些缺口後再展開為正式 spec。複用既有抽象:`executor.spawn/nudge/check`、gen/eval 協定、signals、`product-context.yaml` config、autopilot state schema。 + +--- + +## Sources + +- [What Is Loop Engineering? — MindStudio](https://www.mindstudio.ai/blog/what-is-loop-engineering-ai-coding-agents) +- [What Is Agentic Coding? 
— MindStudio](https://www.mindstudio.ai/blog/what-is-agentic-coding) +- [Mastering Ralph loops (Geoffrey Huntley) — LinearB](https://linearb.io/blog/ralph-loop-agentic-engineering-geoffrey-huntley) +- [The Ralph Wiggum Loop — codecentric](https://www.codecentric.de/en/knowledge-hub/blog/the-ralph-wiggum-loop-autonomous-code-generation-with-a-fresh-context) +- [Loop Engineering — Cobus Greyling](https://cobusgreyling.medium.com/loop-engineering-62926dd6991c) +- [snarktank/ralph](https://github.com/snarktank/ralph) · [vercel-labs/ralph-loop-agent](https://github.com/vercel-labs/ralph-loop-agent) From 1b0d58b66197618fd12eb020c1c2b17264ab8e0a Mon Sep 17 00:00:00 2001 From: Memorysaver Date: Mon, 15 Jun 2026 22:44:33 +0800 Subject: [PATCH 2/8] docs(research): add codex/claude-code compatibility verdict All 7 gap-fill methods (G1-G7) confirmed cross-host compatible via the executor abstraction. Resolved two caveats: G3 visual evaluator (Codex confirmed multimodal), G7 unifies on --max-turns (drop codex-only token_budget as primary). G1 standardizes on exec/headless one-shot per task to avoid nesting limits. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../research/loop-engineering-autonomy-gap.md | 22 +++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/docs/research/loop-engineering-autonomy-gap.md b/docs/research/loop-engineering-autonomy-gap.md index dfae4ac..359bbb9 100644 --- a/docs/research/loop-engineering-autonomy-gap.md +++ b/docs/research/loop-engineering-autonomy-gap.md @@ -111,6 +111,28 @@ Retry Loop / Plan-Execute-Verify / Explore-Narrow / Human-in-the-Loop。 --- +## 5. 
Codex / Claude Code 相容性判定 + +對照 executor 抽象層(`detect/spawn/spawn_evaluator/nudge/liveness/gate/check/monitor/present/teardown` + 檔案式 signals,本就用來吸收 host 差異)。**結論:7 個方法全數雙邊相容**,無任一只能在單一 host 跑。 + +| 方法 | 依賴機制 | Claude Code | Codex | 判定 | +| ----------------------------------- | ------------------------------------------------------------------------------- | ---------------- | ------------------- | ---------------- | +| G2 換策略復原階梯 | protocol 文字 + `executor.spawn` 開 fresh generator | ✅ | ✅ | 全相容 | +| G4 post-merge guard / auto-rollback | autopilot tick 讀 signals + `bash`/`gh pr revert`;back-pressure 為 git/CI 設定 | ✅ | ✅ | 全相容 | +| G5 telemetry 驅動 reflect | `bash`/`curl`/`jq` + 分類 prompt | ✅ | ✅ | 全相容 | +| G6 自我餵食 `/aep-watch` | `/loop`(Claude) 或 `codex exec` cron(Codex) 驅動 — 相容性矩陣 cron 列雙邊 ✅ | ✅ | ✅ | 全相容 | +| G1 per-task fresh context | 每 task 呼叫 `executor.spawn` 開 worktree-bound 新 worker | ✅ team/headless | ✅ subagent/exec | 全相容(見註 1) | +| G3 視覺品質 evaluator | 餵 screenshot 給 vision model 評分 | ✅ 原生多模態 | ✅ 多模態(已確認) | 全相容 | +| G7 per-phase 預算硬牆 | 用量上限 | ✅ `--max-turns` | ✅ `--max-turns` | 全相容(見註 2) | + +**註 1(G1 巢狀):** workspace agent 內再 spawn per-task 子 agent。為避開 Workflow 工具單層巢狀限制與 `spawn_agent` 無 cwd 參數的問題,統一以 **exec / headless one-shot per task** 表達(OS process / 一次性 subagent,worktree 由 cwd 或 prompt 契約綁定),雙邊皆成立。 + +**註 2(G7 決策):** 統一以 **`--max-turns`(turn 數)** 作為唯一的 per-phase / runaway 預算機制 —— 雙邊原生都有(`autopilot/SKILL.md:185,646`)。**不採用** Codex 專屬的 `token_budget` 當主要約束,避免 host 不對稱;Codex 的 `token_budget` 至多作為可選的次要保險。預算抽象因此是 host-agnostic 的單一 knob。 + +**註 3(G3 視覺):** Codex 確認為多模態,可吃 screenshot;視覺 evaluator 維度雙邊一致。截圖擷取可走既有 webapp-testing / agent-browser 工具,圖檔再交給各 host 的多模態 evaluator。 + +--- + ## Sources - [What Is Loop Engineering? 
— MindStudio](https://www.mindstudio.ai/blog/what-is-loop-engineering-ai-coding-agents) From 46803a05cb5b9b5d292b518ddcd680a622fa1659 Mon Sep 17 00:00:00 2001 From: Memorysaver Date: Mon, 15 Jun 2026 23:09:10 +0800 Subject: [PATCH 3/8] docs(research): reject G1 (per-task fresh context) Spawn granularity in AEP is the story (one worker per story per round); deliberately not subdividing into per-task fresh contexts. G1 moved to a "Rejected" record with rationale; scorecard, gap buckets, priority, and compatibility tables updated. Gaps now G2-G7 (6 methods). Co-Authored-By: Claude Opus 4.8 (1M context) --- .../research/loop-engineering-autonomy-gap.md | 39 ++++++++++--------- 1 file changed, 21 insertions(+), 18 deletions(-) diff --git a/docs/research/loop-engineering-autonomy-gap.md b/docs/research/loop-engineering-autonomy-gap.md index 359bbb9..7a3381e 100644 --- a/docs/research/loop-engineering-autonomy-gap.md +++ b/docs/research/loop-engineering-autonomy-gap.md @@ -7,11 +7,12 @@ ## 0. TL;DR -AEP 已經是一個**成熟的外層迴圈 (outer loop)** — 在 loop engineering 的「目標定義、工具環境、誠實驗證 (gen/eval)、host-agnostic 執行、制度記憶」這幾項上甚至**領先**業界文章。缺口集中在三處: +AEP 已經是一個**成熟的外層迴圈 (outer loop)** — 在 loop engineering 的「目標定義、工具環境、誠實驗證 (gen/eval)、host-agnostic 執行、制度記憶」這幾項上甚至**領先**業界文章。缺口集中在兩處: -1. **內層迴圈的 context 紀律** — 單一 workspace agent 跑完 Phase 0–13,有 context rot 風險,缺 Ralph 式 fresh-context-per-task。 -2. **自主復原能力** — 卡關時只會「重試同一招 → 升級給人」,缺「換策略」的階梯。 -3. **合併後的生產回饋迴圈未閉環** — 缺自動 rollback / telemetry 驅動的 reflect / 自我餵食的工作發掘。這也是讓無人值守自主**安全**的前提。 +1. **自主復原能力** — 卡關時只會「重試同一招 → 升級給人」,缺「換策略」的階梯。 +2. 
**合併後的生產回饋迴圈未閉環** — 缺自動 rollback / telemetry 驅動的 reflect / 自我餵食的工作發掘。這也是讓無人值守自主**安全**的前提。 + +> **註:** Ralph 式「per-task fresh context」(原 G1)**經評估後否決** —— AEP 的 spawn 單位是 **story**(每輪一個 worker 對一個 story),不在 story 內再細分 per-task context。詳見 §3。 --- @@ -61,10 +62,10 @@ Retry Loop / Plan-Execute-Verify / Explore-Narrow / Human-in-the-Loop。 | ① Clear Goal / Task Definition | `product-context.yaml`、stories、OpenSpec specs、`contracts.md`、`dispatch_score`/`readiness_score` | ✅ 強 | `dispatch/SKILL.md`、`build` Phase 1–3 | | ② Tool Set for Environment | workspace agent 全工具、dev server、`ports.env`、test runners | ✅ 強 | `build` Phase 4/6、`init.sh` | | ③ Context Management(外層) | CHECK 委派 Haiku/`codex exec`(避免 orchestrator 膨脹)、`lessons.md` 注入(cap 2000 tokens) | ✅ 強 | `tick-protocol.md` CHECK→ACT | -| ③ Context Management(內層) | **單一 workspace agent 跑完 Phase 0–13**,task-by-task 但**同一 context**;`init.sh` 只在 reset 後被動復原 | ⚠️ 缺口 G1 | `build/SKILL.md` 全 13 phase | +| ③ Context Management(內層) | **單一 workspace agent 跑完 Phase 0–13**,task-by-task 但**同一 context**;`init.sh` 只在 reset 後被動復原 | ✅ 設計選擇(story-based spawn,不細分 per-task;否決 G1,見 §3) | `build/SKILL.md` 全 13 phase | | ④ Termination Logic | goal driver layer 邊界自停、`--max-turns 200`、`layer_complete` 條件、escalation triggers | ✅ 大致強(但「unsolvable」其實只是「打到上限」) | `tick-protocol.md` ⑦ | | ⑤ Error Handling & Recovery | gen/eval loop、stuck detection + liveness、orphan 再領養、retry 計數 | ⚠️ 部分(缺「換策略」) | `tick-protocol.md` ④⑤ | -| Ralph:single-task fresh context | 一個 task 一個 commit,但 context 不重置 | ⚠️ 缺口 G1 | — | +| Ralph:single-task fresh context | 一個 task 一個 commit,但 context 不重置 | ➖ 不採用(spawn 單位為 story,非 per-task;否決 G1) | — | | Ralph:architectural back-pressure | 有 CI / gen-eval / contracts | ⚠️ 部分(缺 post-merge 自動 rollback / property tests / audit log) | 缺口 G4 | | verification precedes autonomy | gen/eval 分離、`feature-verification.json`「只有 evaluator 能改」 | ✅ 強 | `gen-eval/SKILL.md` | | per-iteration budget / 結構化回饋 / 測失敗路徑 | 有 cost 追蹤、signals;codex 有 
`token_budget` | ⚠️ 部分(預算非全 backend 硬約束) | 缺口 G7 | @@ -76,7 +77,6 @@ Retry Loop / Plan-Execute-Verify / Explore-Narrow / Human-in-the-Loop。 ### Bucket 1:執行/戰術層缺口(可自動化) -- **G1 — 內層 fresh-context per task**(防 context rot;Ralph malloc)。現況 lead 一路跑完 13 phase。 - **G2 — 換策略復原階梯**(genuine adaptation)。現況 eval FAIL 同一 generator 同思路再修,5 輪打滿升級人;缺「重讀 spec → 換做法 → 拆 story → 換 agent → 才找人」。 - **G3 — 設計歧義 / 視覺品質自主評斷**。`auto_design` 只是自動跑互動式 `/aep-design`;視覺品質明文「agent 無法判斷」靠 `.5` polish layer 人工。 - **G4 — Post-merge guard & 自動 rollback ⭐ 安全關鍵**。現況 merge 後即 wrap,無生產健康監控 / 自動 revert / canary / audit log。 @@ -97,15 +97,19 @@ Retry Loop / Plan-Execute-Verify / Explore-Narrow / Human-in-the-Loop。 最終問責 / 生產事故價值判斷、倫理與商業風險決策。 +### 已否決 + +- **G1 — 內層 fresh-context per task**(原 P1)。**否決理由:** AEP 的 spawn 單位是 **story**(每輪一個 worker 對應一個 story),刻意不在 story 內再切 per-task fresh context。Ralph 的 per-task malloc 適用於「單執行緒一路做」的模型;AEP 已用 story 粒度 + worktree 隔離 + 階段性 signals 達到 context 邊界控制,再細分一層與「每輪 spawn 都 story-based」的設計原則衝突,且徒增 spawn/lessons 捕捉的複雜度。Context rot 的殘餘風險改由 §G2(換策略時開 fresh generator)與既有 `init.sh` 復原機制承接。 + --- ## 4. 
建議優先序(供後續實作規劃) -| 優先 | 缺口 | 為何 | -| ------ | -------------------------------------- | ------------------------------------------------ | -| **P0** | G4 post-merge guard、G2 復原階梯 | 沒 G4 → 無人值守不安全;沒 G2 → 一直 spin 回找人 | -| **P1** | G1 fresh-context、G5 telemetry reflect | 內層品質/規模 + 閉合外層回饋迴圈 | -| **P2** | G6 自我餵食、G3 視覺自主、G7 hygiene | 推向真正連續自主 | +| 優先 | 缺口 | 為何 | +| ------ | ------------------------------------ | ------------------------------------------------ | +| **P0** | G4 post-merge guard、G2 復原階梯 | 沒 G4 → 無人值守不安全;沒 G2 → 一直 spin 回找人 | +| **P1** | G5 telemetry reflect | 閉合外層回饋迴圈 | +| **P2** | G6 自我餵食、G3 視覺自主、G7 hygiene | 推向真正連續自主 | > 逐檔案的實作藍圖(新增 reference / config flag / 修改點)已草擬,待決定推進哪些缺口後再展開為正式 spec。複用既有抽象:`executor.spawn/nudge/check`、gen/eval 協定、signals、`product-context.yaml` config、autopilot state schema。 @@ -113,7 +117,7 @@ Retry Loop / Plan-Execute-Verify / Explore-Narrow / Human-in-the-Loop。 ## 5. Codex / Claude Code 相容性判定 -對照 executor 抽象層(`detect/spawn/spawn_evaluator/nudge/liveness/gate/check/monitor/present/teardown` + 檔案式 signals,本就用來吸收 host 差異)。**結論:7 個方法全數雙邊相容**,無任一只能在單一 host 跑。 +對照 executor 抽象層(`detect/spawn/spawn_evaluator/nudge/liveness/gate/check/monitor/present/teardown` + 檔案式 signals,本就用來吸收 host 差異)。**結論:保留的 6 個方法(G2–G7,G1 已否決)全數雙邊相容**,無任一只能在單一 host 跑。 | 方法 | 依賴機制 | Claude Code | Codex | 判定 | | ----------------------------------- | ------------------------------------------------------------------------------- | ---------------- | ------------------- | ---------------- | @@ -121,15 +125,14 @@ Retry Loop / Plan-Execute-Verify / Explore-Narrow / Human-in-the-Loop。 | G4 post-merge guard / auto-rollback | autopilot tick 讀 signals + `bash`/`gh pr revert`;back-pressure 為 git/CI 設定 | ✅ | ✅ | 全相容 | | G5 telemetry 驅動 reflect | `bash`/`curl`/`jq` + 分類 prompt | ✅ | ✅ | 全相容 | | G6 自我餵食 `/aep-watch` | `/loop`(Claude) 或 `codex exec` cron(Codex) 驅動 — 相容性矩陣 cron 列雙邊 ✅ | ✅ | ✅ | 全相容 | -| G1 per-task fresh context | 每 task 呼叫 `executor.spawn` 開 worktree-bound 新 worker 
| ✅ team/headless | ✅ subagent/exec | 全相容(見註 1) | | G3 視覺品質 evaluator | 餵 screenshot 給 vision model 評分 | ✅ 原生多模態 | ✅ 多模態(已確認) | 全相容 | -| G7 per-phase 預算硬牆 | 用量上限 | ✅ `--max-turns` | ✅ `--max-turns` | 全相容(見註 2) | +| G7 per-phase 預算硬牆 | 用量上限 | ✅ `--max-turns` | ✅ `--max-turns` | 全相容(見註 1) | -**註 1(G1 巢狀):** workspace agent 內再 spawn per-task 子 agent。為避開 Workflow 工具單層巢狀限制與 `spawn_agent` 無 cwd 參數的問題,統一以 **exec / headless one-shot per task** 表達(OS process / 一次性 subagent,worktree 由 cwd 或 prompt 契約綁定),雙邊皆成立。 +> G1(per-task fresh context)已否決,故不列入相容性評估,見 §3。 -**註 2(G7 決策):** 統一以 **`--max-turns`(turn 數)** 作為唯一的 per-phase / runaway 預算機制 —— 雙邊原生都有(`autopilot/SKILL.md:185,646`)。**不採用** Codex 專屬的 `token_budget` 當主要約束,避免 host 不對稱;Codex 的 `token_budget` 至多作為可選的次要保險。預算抽象因此是 host-agnostic 的單一 knob。 +**註 1(G7 決策):** 統一以 **`--max-turns`(turn 數)** 作為唯一的 per-phase / runaway 預算機制 —— 雙邊原生都有(`autopilot/SKILL.md:185,646`)。**不採用** Codex 專屬的 `token_budget` 當主要約束,避免 host 不對稱;Codex 的 `token_budget` 至多作為可選的次要保險。預算抽象因此是 host-agnostic 的單一 knob。 -**註 3(G3 視覺):** Codex 確認為多模態,可吃 screenshot;視覺 evaluator 維度雙邊一致。截圖擷取可走既有 webapp-testing / agent-browser 工具,圖檔再交給各 host 的多模態 evaluator。 +**註 2(G3 視覺):** Codex 確認為多模態,可吃 screenshot;視覺 evaluator 維度雙邊一致。截圖擷取可走既有 webapp-testing / agent-browser 工具,圖檔再交給各 host 的多模態 evaluator。 --- From 34dc2a67cd42d034cd4ca1cc2e1a67bd93b55977 Mon Sep 17 00:00:00 2001 From: Memorysaver Date: Mon, 15 Jun 2026 23:24:04 +0800 Subject: [PATCH 4/8] docs(research): G4 host-aware dogfood validation design Post-deploy staging/prod validation with host-aware method selection: Claude Code auto-detects agent-browser; Codex uses native in-app browser+computer-use (desktop) or Playwright scripts (headless codex-exec, since computer-use is desktop-only). URL resolution = config first, CI fallback. Integration: upgrade Phase 6 + new post-deploy step. Issues auto-create stories via reflect classifier (links G6). 
Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/research/g4-dogfood-validation-design.md | 136 ++++++++++++++++++
 .../research/loop-engineering-autonomy-gap.md |   2 +-
 2 files changed, 137 insertions(+), 1 deletion(-)
 create mode 100644 docs/research/g4-dogfood-validation-design.md

diff --git a/docs/research/g4-dogfood-validation-design.md b/docs/research/g4-dogfood-validation-design.md
new file mode 100644
index 0000000..b69fdcb
--- /dev/null
+++ b/docs/research/g4-dogfood-validation-design.md
@@ -0,0 +1,136 @@
+# G4: Host-aware Dogfood Validation, Default Design
+
+> **Status:** Design spec (to be expanded into skill changes). A sub-design of **G4** in §3 of [loop-engineering-autonomy-gap](./loop-engineering-autonomy-gap.md): post-deploy validation on staging/production, using each host's native method.
+> **Date:** 2026-06-15 **Branch:** `research/loop-engineering-autonomy-gap`
+
+---
+
+## Context
+
+G4 is the "post-merge production feedback loop". Today, AEP's dogfood (`/aep-build` Phase 6) runs only against **local localhost** (`BASE_URL` from `ports.env`) and **requires agent-browser to be installed**, otherwise the whole phase is skipped; there is **no post-deploy validation on staging/production at all**. This design adds two things:
+
+1. **Host-aware dogfood method**: Claude Code automatically decides whether to use agent-browser; Codex uses its native browser / computer-use.
+2. 
**Post-deploy validation on staging/production**: add a post-deploy dogfood step whose target URL comes from config or CI.
+
+---
+
+## Decisions (finalized)
+
+| Item | Decision |
+| --- | --- |
+| staging/prod URL source | **config first, fallback to CI**: `topology.routing.deploy_targets.{staging_url,production_url}`; if missing, read it from CI/deploy output (e.g. a preview URL) |
+| Integration points | **New G4 post-deploy step + Phase 6 upgraded to host-aware** (both coexist) |
+| When issues are found | **Automatically create a story and send it to dispatch** (via the `/aep-reflect` classifier, linked to G6) |
+
+---
+
+## Research basis (host-native capabilities)
+
+- **Claude Code:** agent-browser is the native browser tool (CDP-driven Chrome, accessibility-tree `@eN` refs, screenshot `--annotate`, video, auth vault), and `/agent-browser:dogfood` is already the exploratory testing flow used by Phase 6. The health check `agent_browser_healthy()` (`agent-browser navigate about:blank`) already exists in `testing-guide`.
+- **Codex:** computer-use (native to GPT-5.4: screenshots + mouse/keyboard + writing Playwright) and the in-app browser (Atlas) are **desktop app only**; `codex exec` (headless) has **neither**, so it can only write and run Playwright scripts or fall back to the agent-browser CLI. → Codex therefore needs two paths, desktop and headless.
+
+Sources: [OpenAI Codex app](https://developers.openai.com/codex/app), [GPT-5.4 - OpenAI](https://openai.com/index/introducing-gpt-5-4/), [Codex superapp - MacStories](https://www.macstories.net/news/openai-unveils-codex-superapp-update-with-computer-use-automations-built-in-browser-and-more/), [Codex for Chrome - eigent.ai](https://www.eigent.ai/blog/codex-for-chrome).
+
+---
+
+## Default selection logic (dogfood method detection)
+
+In the spirit of `executor.detect()`, add one more layer of method detection (host × mode):
+
+```
+dogfood_method():
+  resolve HOST + mode via executor.detect()
+
+  if HOST == claude:                                    # any mode
+    if agent_browser_healthy(): return "agent-browser"  # /agent-browser:dogfood
+    else: return "degrade"                              # non-UI→API/curl; UI→human-eval
+
+  if HOST == codex:
+    if mode == codex-subagent and computer_use_enabled: # desktop app
+      return "codex-native"                             # in-app browser + computer-use
+    else:                                               # codex-exec / headless
+      if playwright_available(): return "playwright-script"  # GPT-5.4 writes these natively
+      elif agent_browser_healthy(): return "agent-browser"   # CLI fallback
+      else: return "degrade"
+```
+
+| Host / mode | Default native method | Detection | Fallback |
+| --- | --- | --- | --- |
+| Claude Code (any mode) | `/agent-browser:dogfood` | `agent_browser_healthy()` | non-UI→API/curl; UI→human-eval |
+| Codex desktop (codex-subagent) | native in-app browser + computer-use | desktop + computer-use enabled | Playwright skill → agent-browser |
+| Codex headless (codex-exec) | write and run Playwright scripts | playwright available/installable | agent-browser CLI → API checks |
+
+> Every method emits a report in the same format (the severity/category/repro template from `/agent-browser:dogfood`), so the downstream classifier stays host-agnostic.
+
+---
+
+## Target URL resolution
+
+```
+target_url(env):                       # env ∈ {local, staging, production}
+  if env == local:                     # unchanged from today
+    source .dev-workflow/ports.env → return $BASE_URL
+  else:
+    u = product-context: topology.routing.deploy_targets._url
+    if u: return u                     # config first
+    else: return <preview/deploy URL read from the CI/deploy step output>   # fallback CI
+```
+
+---
+
+## Integration points
+
+### (1) Upgrade Phase 6 (local, pre-merge)
+
+`/aep-build` Phase 6 replaces "skip if agent-browser is not installed" with a call to `dogfood_method()`: Claude→agent-browser, Codex→native. `env=local`, with the URL taken from `ports.env`. The report is still written to `.dev-workflow/dogfood-.md`.
+
+### (2) New G4 post-deploy step (staging/prod, post-merge)
+
+Added after wrap in the autopilot tick (or in `post-merge-guard`): merge → (trigger/wait for deploy) → `target_url(staging|production)` → run validation via `dogfood_method()` → write the report. This preserves the orchestrator boundary (read signals/reports + run gh/CLI, never read workspace code).
+
+---
+
+## Behavior when issues are found
+
+- **Issues found by dogfood** → fed to the `/aep-reflect` classifier → a bug/refinement story is created automatically in `product-context.yaml` → dispatch (linked to G6 self-feeding).
+- **Hard regressions (health signals)** → handled separately by the `auto_revert` policy of the G4 post-merge guard (conservative by default: alert first, revert only after human confirmation). The two paths stay separate: dogfood finds UX/functional issues and creates stories; the guard finds service-level regressions and decides on rollback.
+
+---
+
+## Config additions (product-context.yaml)
+
+```yaml
+topology:
+  routing:
+    deploy_targets:
+      staging_url: "https://staging.example.com" # optional; falls back to CI if missing
+      production_url: 
"https://example.com" + dogfood: + method: auto # auto | agent-browser | codex-native | playwright + post_deploy_env: staging # staging | production | none + on_issue: create_story # create_story | escalate +``` + +--- + +## 實作時會動到的檔案(待展開) + +| 檔案 | 變更 | +| ------------------------------------------------------------------------ | --------------------------------------------------------------------------------------------------- | +| `agentic-development-workflow/build/SKILL.md` | Phase 6 改呼叫 `dogfood_method()`(host-aware),不再「沒裝就 skip」 | +| 新 `patterns/.../references/dogfood-validation.md` | `dogfood_method()` 偵測 + `target_url()` 解析 + 報告格式 | +| `patterns/executor/references/codex-native.md` | 新增 codex-subagent 用 in-app browser / computer-use 做 dogfood 的 recipe;codex-exec 用 Playwright | +| `patterns/autopilot/references/tick-protocol.md` + `post-merge-guard.md` | 新增 post-deploy dogfood 步驟 | +| `product-context/reflect/SKILL.md` | 接收 dogfood 報告 → 分類 → 建 story(已有分類器,補來源) | +| `project-setup/testing-guide/SKILL.md` | 重用既有 `agent_browser_healthy()`;補 playwright 偵測 | + +--- + +## Verification(實作後) + +1. **Claude Code**:裝/不裝 agent-browser 各跑一次 Phase 6 → 確認自動選 agent-browser / 正確 degrade。 +2. **Codex 桌面**:codex-subagent 跑 post-deploy → 確認用 in-app browser + computer-use 驗證 staging URL。 +3. **Codex headless**:codex-exec 跑 → 確認改寫並跑 Playwright 腳本(無 computer-use 時)。 +4. **URL 解析**:設 `deploy_targets.staging_url` → 用之;移除 → 確認 fallback 從 CI 輸出取得。 +5. **on_issue**:故意留一個 UX bug → 確認自動在 `product-context.yaml` 建出 bug story 並進 dispatch。 +6. 
**boundary**: confirm that the post-deploy step only reads reports/signals and runs the CLI, and never reads workspace code.
diff --git a/docs/research/loop-engineering-autonomy-gap.md b/docs/research/loop-engineering-autonomy-gap.md
index 7a3381e..83ec2bc 100644
--- a/docs/research/loop-engineering-autonomy-gap.md
+++ b/docs/research/loop-engineering-autonomy-gap.md
@@ -79,7 +79,7 @@ Retry Loop / Plan-Execute-Verify / Explore-Narrow / Human-in-the-Loop。
 
 - **G2: Change-strategy recovery ladder** (genuine adaptation). Today, after an eval FAIL the same generator retries with the same approach, and after 5 rounds it escalates to a human; what is missing is "re-read the spec → change approach → split the story → switch agent → only then ask a human".
 - **G3: Autonomous judgment of design ambiguity / visual quality**. `auto_design` merely runs the interactive `/aep-design` automatically; visual quality is explicitly documented as "the agent cannot judge" and relies on the manual `.5` polish layer.
-- **G4: Post-merge guard & automatic rollback ⭐ safety-critical**. Today wrap follows immediately after merge, with no production health monitoring / automatic revert / canary / audit log.
+- **G4: Post-merge guard & automatic rollback ⭐ safety-critical**. Today wrap follows immediately after merge, with no production health monitoring / automatic revert / canary / audit log. **The validation aspect (post-deploy staging/prod dogfood, host-aware) now has an expanded design: [g4-dogfood-validation-design.md](./g4-dogfood-validation-design.md).**
 - **G5: Telemetry-driven reflect / outcome evaluation**. Today `reflect` asks a human item by item, and the outcome contract explicitly pauses for human judgment.
 - **G6: Self-feeding work discovery ("discovers")**. New work can only enter through manual envision/reflect.
 - **G7: Loop hygiene**. Hard per-phase token/tool-call budgets, and termination that distinguishes "hit the cap" from "genuinely unsolvable".
From 3c611cf45fc6373664ad46b4d961726502f07fb3 Mon Sep 17 00:00:00 2001
From: Memorysaver
Date: Tue, 16 Jun 2026 00:11:27 +0800
Subject: [PATCH 5/8] =?UTF-8?q?feat(aep-v2):=20autonomy=20loop=20=E2=80=94?=
 =?UTF-8?q?=20recovery=20ladder,=20post-deploy=20guard/dogfood,=20telemetr?=
 =?UTF-8?q?y=20reflect,=20self-feeding=20watch,=20visual=20eval,=20full-au?=
 =?UTF-8?q?to=20switch?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Implements the retained loop-engineering gaps (G2–G7) plus the A1 full-auto
master switch, all defaulting to human-in-the-loop (opt-in only).

- G2 recovery ladder: gen-eval/references/recovery-ladder.md; build Phase 5
  and autopilot tick ④ climb same-fix → re-ground → fresh
  native-bg-subagent → decompose before the eval_not_converging human gate.
- G4 host-aware dogfood + post-merge guard:
  executor/references/dogfood-validation.md (dogfood_method()/target_url(),
  Claude=agent-browser, Codex=native/Playwright),
  autopilot/references/post-merge-guard.md + tick Step ③.5; build Phase 6
  host-aware; on-issue → reflect story; hard regression → conservative
  auto_revert (default off).
- G5 telemetry reflect: reflect/references/telemetry-ingestion.md; reflect
  Step 1 auto-ingestion + Step 2.75 quantitative outcome auto-eval; tick
  layer-completion.
- G6 self-feeding discovery: new /aep-watch skill (registered in
  marketplace.json).
- G3 visual evaluator: Visual Design dimension in gen-eval scoring +
  evaluator contract.
- G7 loop hygiene: unified --max-turns budget; cap = possibly-unsolvable.
- A1 full_auto master switch (default false) gates strategic pauses; config
  keys added to product-context schema (all 3 templates). Quick-reference
  updated.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .claude-plugin/marketplace.json               |   5 +-
 docs/skills-quick-reference.md                |  18 +-
 .../build/SKILL.md                            |  37 ++-
 skills/patterns/autopilot/SKILL.md            |  26 ++
 .../autopilot/references/post-merge-guard.md  | 195 ++++++++++++
 .../autopilot/references/tick-protocol.md     |  41 ++-
 .../executor/references/dogfood-validation.md | 171 +++++++++++
 .../gen-eval/references/agent-contracts.md    |   4 +-
 .../gen-eval/references/recovery-ladder.md    | 113 +++++++
 .../gen-eval/references/scoring-framework.md  |  32 +-
 .../templates/product-context-schema.yaml     |  19 ++
 skills/product-context/dispatch/SKILL.md      |  16 +
 .../templates/product-context-schema.yaml     |  19 ++
 .../map/templates/product-context-schema.yaml |  19 ++
 skills/product-context/reflect/SKILL.md       |   8 +
 skills/product-context/watch/SKILL.md         | 289 ++++++++++++++++++
 16 files changed, 975 insertions(+), 37 deletions(-)
 create mode 100644 skills/patterns/autopilot/references/post-merge-guard.md
 create mode 100644 skills/patterns/executor/references/dogfood-validation.md
 create mode 100644 skills/patterns/gen-eval/references/recovery-ladder.md
 create mode 100644 skills/product-context/watch/SKILL.md

diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json
index bb6e591..d2ae612 100644
--- a/.claude-plugin/marketplace.json
+++ b/.claude-plugin/marketplace.json
@@ -11,7 +11,7 @@
   "plugins": [
     {
       "name": "product-context",
-      "description": "Product-level planning and iteration: envision, map, dispatch, validate, calibrate, reflect.",
+      "description": "Product-level planning and iteration: envision, map, dispatch, validate, calibrate, reflect, watch.",
       "source": "./",
       "strict": false,
       "skills": [
@@ -20,7 +20,8 @@
         "./skills/product-context/dispatch",
         "./skills/product-context/validate",
         "./skills/product-context/calibrate",
-        "./skills/product-context/reflect"
+        "./skills/product-context/reflect",
+        "./skills/product-context/watch"
       ]
     },
     {
diff --git a/docs/skills-quick-reference.md 
b/docs/skills-quick-reference.md index c62fd81..05c343a 100644 --- a/docs/skills-quick-reference.md +++ b/docs/skills-quick-reference.md @@ -1,6 +1,6 @@ # AEP Skills — Quick Reference -A cheat sheet for all 16 AEP skills. For precise term definitions, see the [Glossary](glossary.md). For a guided first-hour introduction to the mental models behind these skills, see the [Orientation Guide](orientation.md). +A cheat sheet for all 17 AEP skills. For precise term definitions, see the [Glossary](glossary.md). For a guided first-hour introduction to the mental models behind these skills, see the [Orientation Guide](orientation.md). --- @@ -20,13 +20,14 @@ CONTROL PLANE (human + AI) EXECUTION PLANE (agents build) ### Product Discovery (Control Plane) -| Skill | When to use | Input | Output | Session | -| --------------- | ----------------------------------------------- | ------------------------------------------------------ | ---------------------------------------------------------------- | ------- | -| `/aep-envision` | New product idea, revisit direction | Product idea (vague or refined) | `product-context.yaml` with `opportunity` + `product` | Main | -| `/aep-map` | After `/aep-envision` — decompose into stories | `product-context.yaml` with product section | Architecture, stories, topology, layer gates, cost added to YAML | Main | -| `/aep-validate` | Check quality of any artifact before proceeding | Any artifact (product context, design, code, document) | Scoring dimensions + findings | Main | -| `/aep-dispatch` | Pick what to build next | `product-context.yaml` with stories | OpenSpec change + story status updated + handoff | Main | -| `/aep-reflect` | After shipping — close the feedback loop | Observations from user testing, errors, cost data | Classified feedback + updated YAML | Main | +| Skill | When to use | Input | Output | Session | +| --------------- | -------------------------------------------------------- | 
-------------------------------------------------------------- | ---------------------------------------------------------------- | ------- | +| `/aep-envision` | New product idea, revisit direction | Product idea (vague or refined) | `product-context.yaml` with `opportunity` + `product` | Main | +| `/aep-map` | After `/aep-envision` — decompose into stories | `product-context.yaml` with product section | Architecture, stories, topology, layer gates, cost added to YAML | Main | +| `/aep-validate` | Check quality of any artifact before proceeding | Any artifact (product context, design, code, document) | Scoring dimensions + findings | Main | +| `/aep-dispatch` | Pick what to build next | `product-context.yaml` with stories | OpenSpec change + story status updated + handoff | Main | +| `/aep-reflect` | After shipping — close the feedback loop | Observations from user testing, errors, cost data | Classified feedback + updated YAML | Main | +| `/aep-watch` | Continuously ingest errors/telemetry → auto-file stories | Telemetry/bug-tracker/error sources (`topology.routing.watch`) | New bug/refinement stories in YAML → dispatch | Main | ### Feature Execution (Execution Plane) @@ -67,6 +68,7 @@ CONTROL PLANE (human + AI) EXECUTION PLANE (agents build) "Ready to start coding" → /aep-launch "Feature is done, PR merged" → /aep-wrap "What did we learn?" 
→ /aep-reflect +"Auto-file work from telemetry" → /aep-watch "Capture process learnings" → /aep-workflow-feedback "Pull learnings from downstreams" → /aep-workflow-feedback "I want hands-free mode" → /aep-autopilot diff --git a/skills/agentic-development-workflow/build/SKILL.md b/skills/agentic-development-workflow/build/SKILL.md index 46d1118..c1d9904 100644 --- a/skills/agentic-development-workflow/build/SKILL.md +++ b/skills/agentic-development-workflow/build/SKILL.md @@ -9,6 +9,8 @@ Autonomous feature implementation inside an isolated git worktree on a fresh `fe > **Phase numbering note:** Phases 1-3 (explore, propose, review) were completed on main via `/aep-design`. This skill begins at Phase 0 (workspace init) and continues from Phase 4 (implementation). +> **Loop hygiene (G7):** Each phase runs under a unified `--max-turns` runaway budget. Hitting the cap is **not** completion — treat it as "possibly unsolvable → escalate" (Human-Gate Protocol, distinct from a genuine clean finish). This keeps a stuck phase (e.g. a non-converging Phase 5 loop) from silently burning turns and reading as done. + **Where this fits:** ``` @@ -366,6 +368,18 @@ convergence rules are identical across modes. #### Evaluation round +For each round N (starting at 1, max 5), the generator's response to a FAIL escalates along the **change-strategy recovery ladder** (`.claude/skills/aep-gen-eval/references/recovery-ladder.md`) rather than retrying the same way every round: + +| Eval round | Rung | Generator move | +| ---------- | ---- | -------------- | +| 1–2 | **Same fix** | Same generator fixes the FAIL items in place (current default). | +| 3 | **Re-ground** | Same generator re-reads the FULL spec + design + contracts from scratch, then re-attempts. | +| 4 | **Different approach** | Spawn a **fresh `native-bg-subagent` generator** told "the previous approach failed on X; take a different design path" — not anchored on the stuck solution (it inherits the existing worktree). 
| +| 5 | **Decompose** | Split the story into sub-tasks; attempt the **smallest viable slice** and surface the proposed split. | +| after 5 | **Human gate** | Ladder exhausted → escalate with type `eval_not_converging`. | + +Track the rung with `eval_round` + `recovery_rung` in `status.json` (see the ladder's State Tracking). **Generator≠evaluator separation holds** — the evaluator only scores; re-grounding, a fresh generator, and decomposition are all generator-side moves. **Skip the ladder and escalate immediately** on a hard-failure / security FAIL (auth-model gap, data-exposure risk), a spec contradiction, or a missing external dependency — these need human judgment, not a different approach. See the ladder file for full rung rationale and the rung-4 fresh-generator spawn contract (`native-bg-subagent` + post-spawn liveness probe). + For each round N (starting at 1, max 5): 1. **Write eval-request** — create `.dev-workflow/signals/eval-request.md` per the format in `eval-protocol.md` (Signal Files section). @@ -421,10 +435,10 @@ For each round N (starting at 1, max 5): 5. **Read the response.** Legacy only: close the evaluator pane (`tmux kill-pane -t :.1`). Native evaluators have already exited. -6. **Fix FAIL items** — add follow-up commits addressing each FAIL item, then loop back to step 1 with round N+1. Do not rewrite history; the PR review should see the fix as new commits on top. +6. **Fix FAIL items per the recovery rung** — add follow-up commits addressing each FAIL item, then loop back to step 1 with round N+1. On rounds 1–2 fix in place; on round 3 re-ground (re-read the full spec/design/contracts from scratch first); on round 4 spawn a fresh `native-bg-subagent` generator with a different approach; on round 5 decompose to the smallest viable slice. Do not rewrite history; the PR review should see the fix as new commits on top. (Hard-failure/security FAILs skip the ladder and escalate immediately — see the rung table above.) -7. 
**Max 5 rounds** — if not converging, escalate to human via the **human-gate - protocol** (see below) and the convergence rules in `eval-protocol.md`. +7. **Max 5 rounds** — once the ladder is exhausted (not converging after round 5), escalate to human via the **human-gate + protocol** (see below) with type `eval_not_converging`, recording the ladder history. See the convergence rules in `eval-protocol.md` and `recovery-ladder.md`. The evaluator also updates `.dev-workflow/feature-verification.json` with pass/fail results per the field ownership rules in `eval-protocol.md`. @@ -488,21 +502,26 @@ do not guess and do not silently stall. Raise a gate: ## Phase 6: Browser Testing (Dogfood) -> Skip if `agent-browser` is not installed. **Light mode:** Skip this phase. +> **Light mode:** Skip this phase. Otherwise **do not skip just because `agent-browser` is absent** — pick a host-aware method and degrade (see below). + +**Pick the method, host-aware.** Call `dogfood_method()` from `.claude/skills/aep-executor/references/dogfood-validation.md` to select the right **native** validation tool for this host/mode: + +- **Claude Code** (any mode) — `/agent-browser:dogfood` if `agent_browser_healthy()`; otherwise **degrade** (non-UI changes → API/curl checks; UI changes → human-eval) rather than skipping. +- **Codex** — native in-app browser + computer-use (codex-subagent desktop), else a Playwright script (codex-exec headless), falling back to the agent-browser CLI, then API checks. 
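
The host-aware choice in the two bullets above can be sketched in Python. This is a minimal illustration under stated assumptions, not the contents of `dogfood-validation.md`: the signature and the boolean parameters are hypothetical stand-ins for `executor.detect()` and the health probes, and raising on an unknown host is an assumption of this sketch.

```python
def dogfood_method(host: str, mode: str, *, agent_browser_ok: bool,
                   computer_use: bool = False, playwright_ok: bool = False) -> str:
    """Pick a validation method for the host; degrade instead of skipping.

    Hypothetical signature: the booleans stand in for agent_browser_healthy(),
    the desktop computer-use check, and playwright_available().
    """
    if host == "claude":  # any mode
        return "agent-browser" if agent_browser_ok else "degrade"
    if host == "codex":
        # Desktop subagent with computer-use: native in-app browser.
        if mode == "codex-subagent" and computer_use:
            return "codex-native"
        # Headless (or desktop without computer-use): Playwright script,
        # then the agent-browser CLI, then API checks ("degrade").
        if playwright_ok:
            return "playwright-script"
        if agent_browser_ok:
            return "agent-browser"
        return "degrade"
    # Assumption of this sketch: unknown hosts are an error, not a silent skip.
    raise ValueError(f"unknown host: {host!r}")
```

Note that no branch returns a "skip" value: a missing tool only changes which method runs.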
-**Port configuration:** Source `.dev-workflow/ports.env` to get the correct URLs: +**Target URL stays local.** Resolve via `target_url(local)` from `dogfood-validation.md` — source `.dev-workflow/ports.env` and use `$BASE_URL`: ```bash -source .dev-workflow/ports.env +source .dev-workflow/ports.env # target_url(local) → $BASE_URL ``` -Use agent-browser to systematically explore and test the application: +If the selected method is `/agent-browser:dogfood`, run it against `$BASE_URL`: ``` /agent-browser:dogfood ``` -Document results in `.dev-workflow/dogfood-.md`. +Whatever the method, emit the unified severity/category/repro report format (see `dogfood-validation.md` → Unified report format) so the downstream classifier stays host-agnostic. Document results in `.dev-workflow/dogfood-.md`. > **Signal update:** Update `.dev-workflow/signals/status.json` with `"phase": 6, "phase_name": "dogfood-testing"`. @@ -731,7 +750,7 @@ REMOTE_URL=$(git remote get-url origin) - **Use `git push --force-with-lease`, never `--force`** — the `lease` variant fails safely if someone else pushed to the same branch since your last fetch. - **Signal updates are required** — update `.dev-workflow/signals/status.json` at the start and end of every phase. Check `.dev-workflow/signals/feedback.md` for main session feedback at phase boundaries. - **Generator must not modify verification data** — never modify `verification_steps` or `passes` in `feature-verification.json`. Only `commit_sha` is generator-writable. The evaluator or human updates `passes` / `evaluated_by` / `round`. -- **Evaluator loop max 5 rounds** — if the generator-evaluator loop hasn't converged after 5 rounds, escalate to human. +- **Evaluator loop max 5 rounds, climbing the recovery ladder** — the generator escalates its strategy per round (same fix → re-ground → fresh generator → decompose) per `recovery-ladder.md`; only once the ladder is exhausted (after round 5) does it escalate to human as `eval_not_converging`. 
Hard-failure/security FAILs skip the ladder and escalate on the first occurrence. - **Raise human gates, don't guess** — decisions only the human can make go through the Human-Gate Protocol (`needs-human.md` + `blocked_on: "human"` + your mode's transport). Silent stalls read as stuck; unrecorded guesses read as scope drift. --- diff --git a/skills/patterns/autopilot/SKILL.md b/skills/patterns/autopilot/SKILL.md index 944b795..f263527 100644 --- a/skills/patterns/autopilot/SKILL.md +++ b/skills/patterns/autopilot/SKILL.md @@ -41,6 +41,8 @@ driver remains available as a fallback (`--loop`). │ tick ⑤ detect stuck workspaces │ │ tick ⑥ dispatch new work (/aep-launch) │ │ tick ⑦ write state + SURFACE status + WAIT │ + │ post-merge-guard monitor deploy health, │ + │ revert regressions │ └─────────────────────────────────────────────┘ │ goal evaluator reads the surfaced status line: │ "is layer N complete, or is autopilot paused?" @@ -208,6 +210,26 @@ Verify these conditions before proceeding: - **Stories available:** At least one story must be `ready` or `in_progress` - **Validated:** Product context should have passed `/aep-validate` (both passes) +### `full_auto` — strategic master switch + +`topology.routing.full_auto` (default **false**) is the master switch over the +**strategic** human gates — the "what to build" / architecture layer. With the +default, those gates stay with the human: + +- **`full_auto: false` (default):** strategic pauses hold — ambiguous / low-readiness + stories escalate to a human for design (the design-escalation pause below), and + the qualitative outcome-contract evaluation pauses for human judgment before a + layer advances. +- **`full_auto: true` (explicit opt-in only):** those strategic pauses auto-proceed + via agent judgment instead of waiting for a human. 
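
As a rough sketch of the switch described above, assuming the product context has already been parsed into a dictionary: `strategic_gate` and its return labels are hypothetical names used only for illustration.

```python
def strategic_gate(product_context: dict) -> str:
    """Decide what autopilot does when it reaches a strategic human gate."""
    topology = product_context.get("topology") or {}
    routing = topology.get("routing") or {}
    # full_auto defaults to false; only an explicit `true` removes the pause.
    if routing.get("full_auto", False) is True:
        return "auto_proceed"
    return "pause_for_human"
```

Comparing against the literal `True` keeps a missing key, or a stray non-boolean value, on the conservative side.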
+ +`full_auto` sits **above** the finer-grained flags under `topology.routing` +(`auto_design`, `auto_outcome_eval`, `watch.auto_create`): `full_auto: true` +**implies** all of them. The default keeps humans in control of the strategic +layer; turning `full_auto` on removes those pauses only when the user explicitly +opts in. See the per-flag behavior in **Design Escalation** below and in +`aep-dispatch` (readiness-based routing). + ### Start Protocol 1. Create `.dev-workflow/` if it doesn't exist: @@ -388,6 +410,10 @@ The 7-step protocol below is the **content of the CHECK prompt** (steps ①② ⑤⑥-scoring ⑦ = analysis + state write) plus the **ACT items** it emits (③ wrap, ④b/④c nudges, ⑥ launch, escalations). Full detail in `references/tick-protocol.md`. +> **Post-merge guard:** after a story wraps and merges, a post-deploy guard step +> monitors deploy health and can revert regressions — see +> `references/post-merge-guard.md`. + **Before every tick, re-read the "STOP — Orchestrator Boundaries" section above.** **Summary (annotated `[CHECK]` analysis vs `[ACT]` orchestrator action):** diff --git a/skills/patterns/autopilot/references/post-merge-guard.md b/skills/patterns/autopilot/references/post-merge-guard.md new file mode 100644 index 0000000..44e5e59 --- /dev/null +++ b/skills/patterns/autopilot/references/post-merge-guard.md @@ -0,0 +1,195 @@ +# Post-Merge Guard Protocol + +The post-merge monitoring window that runs **after** a story is merged and wrapped. Today autopilot wraps a merged story and forgets it; this guard keeps watching the deployed result for a bounded window, runs the host-aware dogfood against the live environment, and — only when explicitly enabled — can revert a hard service regression. It is the safety net that makes unattended autonomy survivable: the difference between "merged and walked away" and "merged, verified the deploy is healthy, and rolled back if it wasn't". 
+ +> **BOUNDARY REMINDER:** This step is an **orchestrator** action, identical in posture to the rest of the tick. It reads CI/health signals, reads dogfood reports, and runs `gh` / deploy / CLI commands — it **NEVER** reads workspace source code, **NEVER** spawns reviewers or evaluators from main, and **NEVER** forms code-quality opinions. The dogfood itself runs via `dogfood_method()` (see `dogfood-validation.md`) using the host's native browser tooling, producing a signals-only report the orchestrator consumes. See SKILL.md "STOP — Orchestrator Boundaries". + +--- + +## Where this runs + +The guard is a **post-deploy step that runs after Step ③ wrap** in the [tick protocol](./tick-protocol.md#step--wrap-completed-workspaces). When a story is merged (④a detects `MERGED`) and wrapped (③ removes its worktree), the story is **not** forgotten: its `guard_state` is opened and subsequent ticks drive it through the monitoring window below until the window closes (healthy) or fires (regression / dogfood issue). + +``` +③ wrap completed → open guard_state for the merged story + │ + ▼ + ┌─ POST-MERGE GUARD (per merged story, across ticks) ───────────────┐ + │ 1. trigger/await deploy (deploy_status: pending→deploying │ + │ →deployed | failed) │ + │ 2. open monitoring window (window_min, default 15) │ + │ 3. each tick within window: │ + │ • read health_signals (CI / error-rate / health endpoint) │ + │ • run host-aware dogfood against target_url(staging|prod) │ + │ 4. classify findings → ONE of two issue paths (below) │ + │ 5. window elapsed, all green → close guard_state (healthy) │ + └───────────────────────────────────────────────────────────────────┘ +``` + +The guard never blocks dispatch — Steps ④/⑤/⑥ continue normally for in-flight workspaces while a merged story's window is open. The guard is signals-only and adds no per-tick workspace-code reads, so the orchestrator boundary and the `<60s` tick budget hold. 
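
The per-tick window check in the diagram above can be sketched as follows. This is a hedged illustration: `window_phase` and its return labels are hypothetical, and only `deploy_status`, `window_opened_at`, and `window_min` (default 15) come from the protocol.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def window_phase(deploy_status: str, window_opened_at: Optional[datetime],
                 now: datetime, window_min: int = 15) -> str:
    """Report where a merged story sits in the post-merge guard lifecycle."""
    if deploy_status == "failed":
        return "deploy_failed"
    # The window only opens once the deploy has reached "deployed".
    if deploy_status != "deployed" or window_opened_at is None:
        return "awaiting_deploy"
    if now > window_opened_at + timedelta(minutes=window_min):
        return "elapsed"  # all green so far: the guard can close as healthy
    return "open"         # keep reading health signals and dogfood this tick
```

Because the function is pure and takes `now` as an argument, a re-fired tick with the same inputs yields the same phase.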
+ +--- + +## Step PG.1: Trigger / Await Deploy + +After wrap, advance the merged story's deploy lifecycle. The host-native deploy trigger is project-specific; the guard treats it as a CLI/CI signal, never as code: + +- **CI-driven deploy** (most projects): the merge to the integration branch already triggered the pipeline. Poll status: + ```bash + gh run list --branch "$BASE" --limit 1 --json status,conclusion,databaseId + gh run view --json status,conclusion,jobs --jq '.status,.conclusion' + ``` +- **Explicit deploy**: if the project declares a deploy command/workflow, dispatch it once and record the run id, then poll as above. + +Set `guard_state.deploy_status` accordingly: `pending` → `deploying` → `deployed` (CI success + deploy URL resolvable) or `failed`. A **failed deploy** is itself a hard regression — go straight to the [auto-revert / escalate path](#path-2-hard-service-regression--auto_revert-policy). + +The monitoring window (PG.2/PG.3) opens only once `deploy_status == "deployed"`. Until then the guard waits across ticks (idempotent — see [state](#state--idempotency)). + +--- + +## Step PG.2: Open the Monitoring Window + +Once deployed, open a window of `topology.routing.post_merge_guard.window_min` minutes (default **15**). Record `window_opened_at`. Each subsequent tick that falls inside the window runs PG.3; once `now > window_opened_at + window_min` with no firing condition met, the window closes and the guard records the story **healthy** and clears its `guard_state`. + +--- + +## Step PG.3: Watch Health Signals + Run Host-Aware Dogfood + +Within the open window, each tick performs two independent reads: + +### (a) Health signals + +Read every signal named in `topology.routing.post_merge_guard.health_signals`. 
These are service-level, signals-only probes — no workspace code: + +| Signal kind | How the orchestrator reads it (examples) | +| ------------------- | ----------------------------------------------------------------------------------------- | +| `ci_status` | `gh run view --json status,conclusion` for the post-merge pipeline | +| `health_endpoint` | `curl -fsS --max-time 5 ` (e.g. `/healthz`, `/readyz`) → expect 2xx | +| `error_rate` | query the project's metrics/log source for error-rate over the window vs. a baseline | +| `latency_p95` | same source — p95 latency vs. baseline threshold | +| `smoke_check` | a declared CLI/API smoke command exiting 0 | + +A signal is **red** when it fails its declared threshold (non-2xx health, CI `failure`, error-rate above baseline + margin, etc.). One transient red is not a regression — require the red to persist across **2 consecutive ticks** (or match a declared confirm rule) before treating it as confirmed, to avoid reverting on a deploy-warmup blip. + +### (b) Host-aware dogfood + +Run the dogfood validation against the deployed environment: + +``` +method = dogfood_method() # host × mode detection (see dogfood-validation.md) +url = target_url(post_deploy_env) # staging | production, from deploy_targets / CI +run dogfood(method, url) → report (severity/category/repro, signals-only) +``` + +`post_deploy_env` comes from `topology.routing.dogfood.post_deploy_env` (`staging` | `production` | `none`). `target_url()` resolves config-first then CI fallback (see `dogfood-validation.md`). The dogfood report uses the unified `/agent-browser:dogfood` severity/category/repro template, so the downstream classifier is host-agnostic. + +--- + +## Step PG.4: Two Issue Paths — Kept Strictly Separate + +The design fixes two **distinct** failure shapes (`g4-dogfood-validation-design.md` → "發現問題時的行為"). Do not conflate them: a dogfood UX finding is **never** a revert, and a service regression is **never** a new backlog story. 
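
A minimal sketch of the confirm rule and the two-path split follows. The function names and return labels are hypothetical, and checking the regression path first is an assumption of this sketch rather than something the protocol spells out.

```python
CONFIRM_TICKS = 2  # a red signal must persist this many consecutive ticks


def update_streak(streak: int, is_red: bool) -> int:
    """Count consecutive red ticks for one signal; any green tick resets it."""
    return streak + 1 if is_red else 0


def classify(red_streaks: dict[str, int], deploy_failed: bool,
             dogfood_issues: int) -> str:
    """Route a tick's findings to exactly one outcome."""
    confirmed_red = any(n >= CONFIRM_TICKS for n in red_streaks.values())
    # Assumption: a service regression takes precedence over UX findings.
    if deploy_failed or confirmed_red:
        return "path2_regression"   # auto_revert policy applies
    if dogfood_issues > 0:
        return "path1_story"        # file a story, never revert
    return "no_finding"
```

A single red tick leaves the streak at 1, so a deploy-warmup blip alone never reaches the regression path.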
+ +### Path 1: Dogfood-found UX / functional issues → create story (NOT a revert) + +The deploy is healthy at the service level, but the dogfood surfaced a UX or functional defect (broken flow, visual regression, wrong copy, dead link). This is feedback, not an outage. + +- Feed the dogfood report to the **`/aep-reflect` classifier**, which classifies severity/category and **auto-creates a bug/refinement story** in `product-context.yaml` (links the G6 self-feeding loop). +- The new story enters the normal dispatch queue — Step ⑥ picks it up on a later tick by `readiness_score`. +- **Never revert** for a Path-1 finding. The merged change stays; the fix ships as its own story. +- Record `guard_state.dogfood = {report_path, issues_created:[story_ids]}`. + +### Path 2: Hard service regression → `auto_revert` policy + +A health signal is **confirmed red** (or the deploy failed). The deployed service is degraded — users are affected now. Behavior is governed by `topology.routing.post_merge_guard.auto_revert`: + +> **DEFAULT IS CONSERVATIVE — `auto_revert: false`.** With auto-revert off (the default), the guard **warns and escalates only**: it adds a `post_merge_regression` escalation, pauses if the story is on the critical path, and waits for a human to confirm the revert. Automatic reverting is **opt-in** and presumes the architectural back-pressure below is in place. + +- **`auto_revert: false` (default):** add escalation, do not touch the merge. + ```json + { + "type": "post_merge_regression", + "story_id": "", + "reason": "Health signal '' red for 2 consecutive ticks after merge of ", + "details": "", + "expected_human_action": "Investigate the deployed regression. If confirmed, revert with `gh pr revert ` (or revert the merge commit) and redeploy; then run /aep-reflect to log the incident.", + "created_at": "", + "acknowledged": false + } + ``` +- **`auto_revert: true` (opt-in) and regression confirmed:** + 1. 
**Revert** — `gh pr revert ` (opens/auto-merges a revert PR per repo policy) or revert the merge commit on `$BASE` and push. This is the **one** sanctioned exception to "never act on the merge" — it is a *recovery* action, gated behind explicit opt-in, not a normal merge. + 2. **Record an incident** — write `.dev-workflow/incidents/-.md` (or append to `autopilot-history.jsonl` with `type: incident`): the red signals, readings, the reverted PR, and the deploy outcome. + 3. **Feed `/aep-reflect`** — hand the incident to the reflect classifier so the regression becomes a learning + a follow-up story (root-cause / guard hardening), closing the loop the same way Path 1 does for UX issues. + 4. Set `guard_state.reverted = true` so no later tick reverts the same story twice (see [state](#state--idempotency)). + +--- + +## Architectural Back-Pressure (prerequisites for safe auto-revert) + +`auto_revert: true` is only as safe as the scaffolding that makes a revert clean and a regression detectable. Document these as **prerequisites** the project should have before enabling auto-revert; they are scaffold-level recommendations, not steps the guard performs: + +- **Pre-commit hooks** — lint/typecheck/format/secret-scan at commit time, so obviously-broken changes never reach the merge that the guard would have to revert. +- **Property-based tests** — broaden coverage beyond example-based tests so regressions are caught by signals (and by Phase 5 eval) rather than only in production. +- **Feature-flag / canary gating** — ship merged code dark or to a canary slice; a regression then degrades a fraction of traffic and a "revert" can be a flag flip, far safer and faster than a code revert. +- **Audit log** — append-only record of every guard action (deploy triggered, signals read, revert performed, incident filed) so auto-revert decisions are reconstructable and reviewable. + +Without these, prefer the default `auto_revert: false` (warn + escalate). 
The guard should note in its escalation when prerequisites appear absent. + +--- + +## Config + +```yaml +topology: + routing: + post_merge_guard: + window_min: 15 # monitoring window length, minutes (default 15) + auto_revert: false # OPT-IN. false = warn + escalate only (conservative default) + health_signals: # service-level, signals-only probes watched during the window + - ci_status # post-merge pipeline conclusion + - health_endpoint # 2xx from /healthz (URL from deploy_targets / CI) + - error_rate # error-rate over window vs. baseline + # - latency_p95 + # - smoke_check +``` + +Reuses `topology.routing.deploy_targets.{staging_url,production_url}` and `topology.routing.dogfood.{post_deploy_env,on_issue}` from the G4 dogfood design — the guard does not duplicate URL/method config. + +--- + +## State & Idempotency + +The guard records its progress **per merged story** so a re-fired tick never double-acts (double-deploys, double-reverts, double-files an incident). Add a `guard_state` entry keyed by `story_id` (alongside `workspaces` in `autopilot-state.json`; see `state-schema.md`): + +```json +{ + "story_id": "PROJ-003", + "pr_number": 412, + "merged_at": "", + "deploy_status": "deployed", // pending | deploying | deployed | failed + "window_opened_at": "", + "health": { "ci_status": "green", "health_endpoint": "green", "error_rate": "green" }, + "red_streak": { "error_rate": 0 }, // consecutive red ticks per signal (confirm rule) + "dogfood": { "report_path": null, "issues_created": [] }, + "reverted": false, + "incident_path": null, + "last_action": "watching", // watching | dogfood_ran | story_created | escalated | reverted | closed + "closed_at": null +} +``` + +Idempotency rules: + +- **Deploy once** — only trigger a deploy if `deploy_status == "pending"`; otherwise poll. +- **Revert once** — never revert if `reverted == true`; the confirmed-red check is short-circuited once reverted. 
+- **One escalation per regression** — guard against duplicate `post_merge_regression` escalations for the same `story_id` while unacknowledged. +- **Close cleanly** — when the window elapses with all-green (or after Path-1 story creation / Path-2 revert + incident), set `last_action` accordingly, set `closed_at`, and drop the `guard_state` entry on the next tick. + +--- + +## Cross-References + +- [tick-protocol.md](./tick-protocol.md) — Step ③ wrap (the guard opens immediately after wrap); this guard is the new post-deploy step that runs across subsequent ticks. +- `dogfood-validation.md` — `dogfood_method()` host × mode detection, `target_url(env)` resolution, and the unified report format the guard consumes. +- `/aep-reflect` — the classifier both issue paths feed: Path 1 (UX/functional → new story) and Path 2 (incident → learning + follow-up story). +- [state-schema.md](./state-schema.md) — where `guard_state` lives in `autopilot-state.json`. diff --git a/skills/patterns/autopilot/references/tick-protocol.md b/skills/patterns/autopilot/references/tick-protocol.md index 77c31ee..aded908 100644 --- a/skills/patterns/autopilot/references/tick-protocol.md +++ b/skills/patterns/autopilot/references/tick-protocol.md @@ -1,6 +1,6 @@ # Tick Protocol -The 7-step state machine executed on each autopilot tick. Each tick is idempotent — running it twice with no external state change produces the same result and takes no duplicate actions. +The 7-step state machine executed on each autopilot tick (with a ③.5 post-merge guard sub-step between wrap and guide-completion). Each tick is idempotent — running it twice with no external state change produces the same result and takes no duplicate actions. 
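
To make the idempotency property concrete, here is a small sketch using the post-merge guard's rules (deploy once, revert once, one escalation per regression). The helper name, the action labels, and the `open_escalations` set are hypothetical; only the `guard_state` fields are taken from the guard's state entry.

```python
def next_guard_actions(guard_state: dict, confirmed_red: bool,
                       auto_revert: bool, open_escalations: set) -> list:
    """Return the guard actions still owed for one merged story."""
    actions = []
    # Deploy once: only a "pending" deploy may be triggered; otherwise poll.
    if guard_state["deploy_status"] == "pending":
        actions.append("trigger_deploy")
    # Revert once: the confirmed-red check is short-circuited after a revert.
    if confirmed_red and not guard_state["reverted"]:
        if auto_revert:
            actions.append("revert")
        # One escalation per regression while it is still unacknowledged.
        elif guard_state["story_id"] not in open_escalations:
            actions.append("escalate")
    return actions
```

Once the caller records the outcome in `guard_state` (or in the open escalations), calling the function again with unchanged signals returns an empty list.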
**Target duration:** <60 seconds of work per tick (under the goal driver the turn then waits the per-tick floor — step ⑦ — before ending) @@ -13,7 +13,7 @@ this tick each turn until the layer completes; loop driver (fallback) — **EXECUTION MODEL — CHECK → ACT** (see SKILL.md "Execution model"). A tick is two halves: - **CHECK** — steps ①②⑤, the read-only/scoring parts of ④⑥, and the ⑦ state write. These run in a cheap, context-isolated agent via `executor.check()` (Claude Code Haiku subagent / Codex `codex exec`) and produce an **action list**. The CHECK reads signals only — never workspace code. -- **ACT** — the orchestrator performs the emitted actions: ③ wrap, ④/⑤ nudges, ⑥ launch, escalations. +- **ACT** — the orchestrator performs the emitted actions: ③ wrap, ③.5 post-merge guard (dogfood / reflect / revert), ④/⑤ nudges, ⑥ launch, escalations. The action-list schema is `{summary, state_written, actions[]}`, each action `{type, workspace, story_id, message, reason}` (full schema in `aep-executor/references/backends.md`). The step recipes below are both the content of the CHECK prompt and the templates the ACT executes. @@ -122,6 +122,24 @@ After wrapping, **skip to step ⑦** (write state). The next tick will handle di --- +## Step ③.5: Post-Merge Guard + +For each recently-merged story (one Step ③ wrapped within the monitoring window — default applies per `post-merge-guard.md`), run the **post-merge guard**. The detail lives in `references/post-merge-guard.md`; this step defers to it. Within the monitoring window: + +1. **Watch deploy health** — read deploy/CI signals and `gh` only (no workspace code, no `gh pr merge`). The orchestrator boundary holds. +2. **Run host-aware dogfood** — exercise the merged change per the host-aware recipe in `post-merge-guard.md`. + +Two issue paths: + +- **Dogfood UX / functional issue** → route the finding through the `/aep-reflect` classifier, which auto-creates a follow-up story. 
+- **Hard regression** (deploy health breaks / CI red on the integration branch) → apply the `post_merge_guard.auto_revert` policy: + - **DEFAULT (conservative, `auto_revert: false`)** → **warn + escalate** for human decision; do not revert. + - **`auto_revert: true`** (opt-in) → revert the merge. + +Emit any follow-up (reflect story / escalation / revert) as an action; never read workspace source — signals / CI / `gh` only. + +--- + ## Step ④: Guide Completion **This is the most important step. ALL actions here use `executor.nudge()` (delivered via the workspace's mode transport). NEVER spawn Agent tools for review. NEVER call `gh pr merge`. Workspace agents own code review and merging.** @@ -219,7 +237,7 @@ executor.nudge(, "Code has changed since your last evaluation. Re-run Phase 5 code review on the current state before proceeding with the PR. Write a new eval-request.md and spawn a fresh evaluator.") ``` -**Escalation:** No eval-response after 6 ticks (30 min) post-trigger → add escalation with type `"eval_not_converging"`. +**Escalation:** No eval-response after 6 ticks (30 min) post-trigger → before escalating, the workspace must climb the **recovery ladder** (`../../gen-eval/references/recovery-ladder.md`): nudge it to work the ladder's rungs (re-scope, decompose, relax non-essential criteria, etc.) first. Only emit the `"eval_not_converging"` escalation **after the ladder is exhausted** — i.e. the workspace has reported the ladder spent without a PASS. 
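
For orientation, the round-to-rung mapping from the build skill's recovery-ladder table (rounds 1–2 same fix, round 3 re-ground, round 4 different approach, round 5 decompose, then the human gate) can be sketched as below. The function name and string labels are hypothetical, and `hard_failure` stands in for the cases that skip the ladder (security FAIL, spec contradiction, missing external dependency).

```python
def recovery_rung(eval_round: int, hard_failure: bool = False) -> str:
    """Map an evaluation round to the generator's recovery move."""
    if eval_round < 1:
        raise ValueError("eval rounds start at 1")
    # Hard failures need human judgment, not a different approach.
    if hard_failure:
        return "human_gate"
    if eval_round <= 2:
        return "same_fix"
    if eval_round == 3:
        return "re_ground"
    if eval_round == 4:
        return "different_approach"
    if eval_round == 5:
        return "decompose"
    # Ladder exhausted: escalate as eval_not_converging.
    return "human_gate"
```

Under this reading, the orchestrator would emit `eval_not_converging` only when the mapping reaches the human gate.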
### Sub-step ④c: Guide to Merge @@ -387,12 +405,12 @@ Before scoring individual stories, check for `compile_mode: grouped_change`: For the top-scored story (or group), use `readiness_score` for routing: - **readiness_score >= 0.7** → dispatch to `/aep-launch` -- **readiness_score 0.5–0.7** → check `topology.routing.auto_design`: - - If `auto_design: true` → route through `/aep-design` automatically (no pause), then `/aep-launch` - - If `auto_design: false` → **ESCALATE** (pause for human design input) -- **readiness_score < 0.5** → check `topology.routing.auto_design`: - - If `auto_design: true` → route through `/aep-design` automatically (no pause), then `/aep-launch` - - If `auto_design: false` → **ESCALATE** (pause for human design input) +- **readiness_score 0.5–0.7** → check `topology.routing.full_auto` / `auto_design`: + - If `full_auto: true` (master switch) **or** `auto_design: true` → auto-route through the **non-interactive design resolver** (`/aep-design`, no pause), then `/aep-launch` + - Otherwise → **ESCALATE** (pause for human design input) +- **readiness_score < 0.5** → check `topology.routing.full_auto` / `auto_design`: + - If `full_auto: true` (master switch) **or** `auto_design: true` → auto-route through the **non-interactive design resolver** (`/aep-design`, no pause), then `/aep-launch` + - Otherwise → **ESCALATE** (pause for human design input) - **`attempt_count >= 2`** → always **ESCALATE** regardless of readiness (repeated failures need human attention) If escalation triggers: follow the pause protocol from the main SKILL.md. Do not dispatch. @@ -443,7 +461,10 @@ If all stories in the active layer are completed (after wraps): 1. Suggest running the layer gate integration test 2. If gate passes: update `layer_gates[layer].status: passed` -3. **Outcome contract check:** If `product.layers[active_layer].outcome_contract` exists, add an escalation requesting the user to evaluate the outcome contract before advancing. 
Autopilot **pauses** — outcome evaluation requires human judgment (user testing, analytics, qualitative assessment). The user runs `/aep-reflect` which evaluates outcome contracts in Step 2.75. After `/aep-reflect` completes, resume autopilot. +3. **Outcome contract check:** If `product.layers[active_layer].outcome_contract` exists, decide whether to auto-evaluate or pause: + - **Quantitative auto-eval:** If `topology.routing.auto_outcome_eval: quantitative` **and** the contract's metric is quantitative (a measurable threshold) → auto-evaluate it via `../../../product-context/reflect/references/telemetry-ingestion.md` (ingest the telemetry, compare against the threshold) and **advance without pausing** when it passes. If the metric is qualitative, fall through to the pause rule below. + - **Qualitative / default pause:** Otherwise (no `auto_outcome_eval`, a qualitative metric, etc.) → **pause** and add an escalation requesting the user to evaluate the outcome contract before advancing — **UNLESS** `topology.routing.full_auto: true`, in which case auto-evaluate via the telemetry-ingestion recipe and advance without pause. Outcome evaluation otherwise requires human judgment (user testing, analytics, qualitative assessment). The user runs `/aep-reflect` which evaluates outcome contracts in Step 2.75. After `/aep-reflect` completes, resume autopilot. + - Default (no `auto_outcome_eval` / `full_auto` false) preserves the current human pause. 4. If no outcome contract or outcome evaluation passes: advance to next layer 5. If gate fails: add escalation, pause autopilot (layer gate failures require human judgment) 6. 
If all layers complete: stop autopilot, notify human diff --git a/skills/patterns/executor/references/dogfood-validation.md b/skills/patterns/executor/references/dogfood-validation.md new file mode 100644 index 0000000..0cd8691 --- /dev/null +++ b/skills/patterns/executor/references/dogfood-validation.md @@ -0,0 +1,171 @@ +# Host-aware Dogfood Validation — `dogfood_method()` & `target_url()` + +Dogfood/validation picks the right **native** method per host, both locally +(pre-merge, `/aep-build` Phase 6) and on staging/production (post-deploy). This +closes gap **G4b**: until now Phase 6 ran only against localhost and only if +`agent-browser` happened to be installed (else the whole phase was skipped), and +there was no post-deploy validation at all. + +Detection reuses `executor.detect()` for HOST + mode — read +[`backends.md`](backends.md) first. The two functions here add a **method** +layer on top of that: which validation tool to drive (`dogfood_method()`) and +which URL to point it at (`target_url()`). All methods emit one unified report +format so the downstream classifier is host-agnostic. + +--- + +## Table of Contents + +1. [`dogfood_method()` — host × mode selection](#dogfood_method--host--mode-selection) +2. [`target_url(env)` — URL resolution](#target_urlenv--url-resolution) +3. [Unified report format](#unified-report-format) +4. [Config block](#config-block) +5. [Post-deploy worker & boundary (v1.8.0)](#post-deploy-worker--boundary-v180) +6. [Cross-references](#cross-references) + +--- + +## `dogfood_method()` — host × mode selection + +`executor.detect()` resolves HOST + mode; this adds a method probe on top. Each +host uses its **native** capability first and degrades only when that is +unavailable. `agent_browser_healthy()` (`agent-browser navigate about:blank`) +lives in `testing-guide`; `playwright_available()` is the analogous +write-and-run probe for Codex headless. 
+
+```
+dogfood_method():
+  resolve HOST + mode via executor.detect()
+
+  if HOST == claude:                                     # any mode
+    if agent_browser_healthy(): return "agent-browser"   # /agent-browser:dogfood
+    else: return "degrade"                               # non-UI → API/curl; UI → human-eval
+
+  if HOST == codex:
+    if mode == codex-subagent and computer_use_enabled:  # desktop app
+      return "codex-native"                              # in-app browser + computer-use
+    else:                                                # codex-exec / headless
+      if playwright_available(): return "playwright-script"  # GPT-5.4 writes + runs it
+      elif agent_browser_healthy(): return "agent-browser"   # CLI fallback
+      else: return "degrade"                                 # API checks
+```
+
+| Host / mode | Native method (default) | Detection | Fallback |
+| ------------------------------ | -------------------------------------- | ------------------------------- | --------------------------------- |
+| **Claude Code** (any mode) | `/agent-browser:dogfood` | `agent_browser_healthy()` | non-UI → API/curl; UI → human-eval |
+| **Codex desktop** (codex-subagent) | native in-app browser + computer-use (GPT-5.4 multimodal) | desktop + computer-use enabled | Playwright skill → agent-browser CLI |
+| **Codex headless** (codex-exec) | write + run a Playwright script | `playwright_available()` | agent-browser CLI → API checks |
+
+> **Why Codex splits two ways.** Computer-use and the in-app (Atlas) browser are
+> **desktop-only**. `codex exec` (headless) has neither, so it writes and runs a
+> Playwright script (GPT-5.4 does this natively) and falls back to the
+> agent-browser CLI, then to API/curl checks.
+
+---
+
+## `target_url(env)` — URL resolution
+
+```
+target_url(env):                 # env ∈ {local, staging, production}
+  if env == local:               # unchanged from current Phase 6
+    source .dev-workflow/ports.env → return $BASE_URL
+  else:
+    u = topology.routing.deploy_targets._url   # product-context.yaml
+    if u: return u               # config first
+    else: return                 # fallback CI (e.g. preview URL)
+```
+
+- **`env=local`** — source `.dev-workflow/ports.env`, return `$BASE_URL` (the
+  Phase 6 status quo; `ports.env` is written by the workspace-setup hook).
+- **`env=staging|production`** — read
+  `topology.routing.deploy_targets._url` first; if unset, read the URL the
+  CI/deploy step printed (e.g. a Vercel/Netlify preview URL or deploy output).
+
+---
+
+## Unified report format
+
+Every method — `/agent-browser:dogfood`, `codex-native`, `playwright-script`,
+the degrade paths — emits the **same** severity / category / repro structure as
+`/agent-browser:dogfood`, so the downstream classifier never branches on host.
+Reports are written to `.dev-workflow/dogfood-.md` (local) or the
+post-deploy report path (staging/prod), one entry per finding:
+
+```markdown
+##
+
+**Severity:** blocker | major | minor
+**Category:** UX | logic | visual | edge-case | accessibility | performance
+**Repro:**
+**Observed:** **Expected:**
+**Evidence:**
+```
+
+**On issue** → route per `topology.routing.dogfood.on_issue` (default
+`create_story`): feed the report to the `/aep-reflect` classifier → create a
+bug/refinement story in `product-context.yaml` → dispatch (the G6 self-feeding
+loop). Set `escalate` instead to surface to the human rather than auto-filing.
+
+> Hard service regressions (health signals) are a **separate** path — they go
+> through the autopilot post-merge guard's revert policy, not this story-filing
+> path. Dogfood finds UX/functional issues and files stories; the guard finds
+> service regressions and decides rollback.
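The `dogfood_method()` and `target_url(env)` pseudocode in this file can be sketched as runnable Python. This is only an illustration under stated assumptions: the probes are injected as callables rather than real `agent-browser` or Playwright checks, the `base_url` and `ci_url` parameters are hypothetical stand-ins for `$BASE_URL` and the CI/deploy output, the `<env>_url` key lookup is inferred from the `staging_url` / `production_url` keys in the config block, and raising on an unknown host is a choice the pseudocode does not specify.

```python
from typing import Callable, Mapping, Optional


def dogfood_method(
    host: str,
    mode: str,
    *,
    agent_browser_healthy: Callable[[], bool],
    playwright_available: Callable[[], bool],
    computer_use_enabled: bool = False,
) -> str:
    """Pick a validation method for a host/mode pair, mirroring the pseudocode."""
    if host == "claude":  # any mode
        return "agent-browser" if agent_browser_healthy() else "degrade"
    if host == "codex":
        if mode == "codex-subagent" and computer_use_enabled:  # desktop app
            return "codex-native"
        # codex-exec / headless: Playwright first, then the CLI, then API checks
        if playwright_available():
            return "playwright-script"
        if agent_browser_healthy():
            return "agent-browser"
        return "degrade"
    # Assumption: the pseudocode does not cover other hosts, so fail loudly.
    raise ValueError(f"unsupported host: {host!r}")


def target_url(
    env: str,
    *,
    base_url: str,
    deploy_targets: Mapping[str, Optional[str]],
    ci_url: Optional[str] = None,
) -> Optional[str]:
    """Resolve the URL to validate: local port file, then config, then CI output."""
    if env == "local":
        return base_url  # stands in for $BASE_URL from .dev-workflow/ports.env
    if env not in ("staging", "production"):
        raise ValueError(f"unknown env: {env!r}")
    configured = deploy_targets.get(f"{env}_url")  # staging_url / production_url
    return configured or ci_url  # config first, CI/deploy output as fallback
```

Injecting the probes keeps the selection logic testable without a browser installed; a real implementation would presumably wire them to the health checks the document names.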
+ +--- + +## Config block + +Added under `topology.routing` in `product-context.yaml`: + +```yaml +topology: + routing: + deploy_targets: + staging_url: "https://staging.example.com" # optional; missing → fallback CI + production_url: "https://example.com" + dogfood: + method: auto # auto | agent-browser | codex-native | playwright + post_deploy_env: staging # staging | production | none + on_issue: create_story # create_story | escalate +``` + +- **`method`** — `auto` (default) defers to `dogfood_method()`; the explicit + values pin a method (parallels the `aep.executor-backend` pin). +- **`post_deploy_env`** — which environment the post-deploy step validates; + `none` disables post-deploy dogfood. +- **`on_issue`** — `create_story` (default) or `escalate`. + +--- + +## Post-deploy worker & boundary (v1.8.0) + +When the post-deploy step needs a worker to run validation (e.g. a Codex +headless Playwright run, or a Claude `/agent-browser:dogfood` pass), it is +spawned as **`native-bg-subagent`** and confirmed live by the mandatory +**post-spawn liveness probe** before being treated as running — never trust a +flag or roster (see [`backends.md`](backends.md) → Post-Spawn Liveness Probe). + +Screenshots captured by any method feed **each host's multimodal evaluator**: +Claude evaluates natively; Codex is confirmed multimodal (GPT-5.4). This keeps +the visual judgment in-host rather than crossing back to the orchestrator. + +The **orchestrator boundary holds.** The post-deploy step reads reports and +signals and runs CLIs (`gh`, deploy tooling, `target_url` resolution) — it never +reads workspace code. The validation worker is bound to its worktree (or runs +against the deployed URL); the main session stays at arm's length, consistent +with the autopilot orchestrator boundary. + +--- + +## Cross-references + +- [`backends.md`](backends.md) — `executor.detect()` (HOST + mode), the + `native-bg-subagent` default, and the Post-Spawn Liveness Probe. 
+- `agentic-development-workflow/build/SKILL.md` **Phase 6** — local (pre-merge) + dogfood; calls `dogfood_method()` with `env=local` instead of skipping when + agent-browser is absent. +- `patterns/autopilot/references/post-merge-guard.md` — the post-deploy step + invokes `target_url(staging|production)` + `dogfood_method()` after merge + + deploy; hard regressions go through the guard's revert policy. +- `product-context/reflect` (`/aep-reflect`) — the host-agnostic classifier that + turns a unified dogfood report into a bug/refinement story. diff --git a/skills/patterns/gen-eval/references/agent-contracts.md b/skills/patterns/gen-eval/references/agent-contracts.md index 8ca2d19..d5b17f6 100644 --- a/skills/patterns/gen-eval/references/agent-contracts.md +++ b/skills/patterns/gen-eval/references/agent-contracts.md @@ -74,6 +74,7 @@ The evaluator independently assesses work against specifications. It has NO know | Context | Evaluator does | |---------|---------------| | **Code review** (build) | Tests running application, reviews code, scores dimensions | +| **UI work** (build) | Additionally receives screenshot(s) of the running app and scores Visual Design against the calibration/design-system spec (multimodal) | | **Artifact validation** (validate) | Checks claims against codebase, verifies file paths, API shapes | | **Design review** | Verifies technical feasibility against actual code | | **Document review** | Confirms factual claims, tests commands | @@ -84,8 +85,9 @@ The evaluator independently assesses work against specifications. 
It has NO know - **MUST** score against the dimension scale definitions (not gut feel) - **MUST** apply hard failure thresholds strictly - **MUST** provide actionable fix suggestions for every finding +- **MUST**, for UI work, receive screenshot(s) of the running app (captured host-aware per `executor/references/dogfood-validation.md`) and score the **Visual Design** dimension against the project's `calibration/.yaml` / design-system spec using its multimodal vision (Claude natively; Codex GPT-5.4) - **MUST NOT** rationalize problems away ("this is probably fine because...") -- **MUST NOT** implement fixes (role contamination) +- **MUST NOT** implement fixes or capture the screenshot itself (generator ≠ evaluator — the dogfood/capture step produces the image; the evaluator only judges it) - **CAN** update `passes`, `evaluated_by`, `round` in feature-verification.json ### Evaluator output format diff --git a/skills/patterns/gen-eval/references/recovery-ladder.md b/skills/patterns/gen-eval/references/recovery-ladder.md new file mode 100644 index 0000000..fddcd49 --- /dev/null +++ b/skills/patterns/gen-eval/references/recovery-ladder.md @@ -0,0 +1,113 @@ +# Change-Strategy Recovery Ladder + +When the Phase 5 gen/eval loop FAILs, the default behavior is for the **same generator to retry the same way** — fix the FAIL items, re-request evaluation, repeat. After `max_rounds` (default 5) it escalates to a human. The failure mode this guards against is **strategy stagnation**: the generator keeps applying the approach that already failed, burning rounds without exploring a genuinely different path. + +This reference defines an escalating recovery ladder. Each rung tries something **structurally different** from the last, so the system exhausts real strategy changes **before** a human gate — not five copies of the same attempt. + +> The evaluator never climbs this ladder. 
Generator≠evaluator separation still holds: the evaluator scores; the generator (or a fresh generator) is the only role that "tries a new approach." A re-grounded read, a fresh generator, and a decomposition are all generator-side moves. + +--- + +## Table of Contents + +1. [The Ladder](#the-ladder) +2. [When to Skip the Ladder](#when-to-skip-the-ladder) +3. [State Tracking](#state-tracking) +4. [Spawning a Fresh Generator (Rung 4)](#spawning-a-fresh-generator-rung-4) +5. [Cross-References](#cross-references) + +--- + +## The Ladder + +Round numbers are tunable per project; the **shape** is what matters — each rung is a strictly larger change of strategy than the one below it. + +| Eval round | Rung | Strategy | +| ---------- | ---- | -------- | +| 1–2 | **Same fix** | Same generator fixes the FAIL items normally. Current default behavior. | +| 3 | **Re-ground** | Same generator re-reads the FULL spec + design + contracts **from scratch** and re-attempts. | +| 4 | **Different approach** | Spawn a **fresh generator** told "the previous approach failed on X; take a different design path." Not anchored on the stuck solution. | +| 5 | **Decompose** | Split the story into smaller sub-stories / sub-tasks; attempt the **smallest viable slice**. Surface the proposed split. | +| after 5 | **Human gate** | Ladder exhausted → escalate with type `eval_not_converging`. | + +### Round 1–2 — Same fix (current behavior) + +The generator reads the latest `eval-response-.md`, fixes the FAIL items in place, updates `eval-request.md`, and re-requests evaluation. This is the cheapest rung and resolves most failures (typical convergence is 2–3 rounds). No strategy change is warranted yet — the first couple of FAILs are usually ordinary bugs, not a stuck approach. + +### Round 3 — Re-ground + +Context may have rotted: the generator has been editing for several rounds and its working memory of the spec has drifted. 
Before fixing again, the generator **re-reads the full source of truth from scratch** — the spec, the design doc, and the contracts — rather than reasoning from its in-context summary. It then re-attempts the FAIL items against that fresh reading. This catches the common case where the FAIL persists because the generator has been solving the wrong problem. + +### Round 4 — Different approach (fresh generator) + +Re-grounding didn't converge, which suggests the generator is **anchored** on a design path that cannot satisfy the spec. The stuck generator cannot reliably unstick itself — it will keep returning to the same solution. So spawn a **fresh generator** that has none of the prior context except an explicit framing: + +> The previous approach failed on **X** (cite the persistent FAIL findings). Do **not** continue that approach. Re-read the spec/design/contracts and take a **different design path**. + +The fresh generator works in the **existing worktree** (the prior commits remain; it can revert or rework them). See [Spawning a Fresh Generator](#spawning-a-fresh-generator-rung-4) for the host-agnostic spawn contract. + +### Round 5 — Decompose + +If even a fresh approach FAILs, the story is likely **too large to land as one unit**. The generator (fresh or original) proposes a split into smaller sub-stories / sub-tasks and attempts the **smallest viable slice** — the thinnest piece that can PASS on its own. The proposed split is **surfaced**, not silently applied: write it to `eval-request.md` and the human-gate record so the human (and the autopilot) can see the story has been re-shaped. Landing one slice and deferring the rest is a legitimate outcome of this rung. + +### After Round 5 — Human gate + +Only once every rung has been tried does the loop escalate. This is the `eval_not_converging` escalation (`needs-human.md` + `blocked_on: human` in `status.json`; see `eval-protocol.md` → needs-human gate record). 
The escalation should record the **ladder history** — which rungs were attempted and why each failed — so the human inherits a genuinely-explored problem, not five identical attempts. + +--- + +## When to Skip the Ladder + +The ladder is for **convergence** failures — the generator can't get the work to PASS. Some FAILs are not convergence problems and **escalate immediately**, skipping all rungs: + +- **Hard-failure / security FAIL that needs human judgment** — e.g. an auth-model gap, a data-exposure risk, or any finding whose fix requires a product/security decision the agent is not authorized to make. Trying "a different approach" on a security boundary is worse than asking. Escalate on the first such FAIL. +- **Spec contradiction** — the FAIL is caused by the spec itself being internally inconsistent or wrong. No generator strategy can fix a contradictory spec; this needs a human to amend the spec. +- **Missing external dependency / access** — the work cannot proceed without something outside the worktree (a credential, an unbuilt upstream service). Decomposing won't help. + +In these cases, escalate with the appropriate type immediately and note that the ladder was deliberately skipped. + +--- + +## State Tracking + +Which rung we're on is **derived**, not free-standing — it follows the eval round count plus an explicit marker so a recovering agent (after a context reset) lands on the right rung: + +- **`eval_round`** in `.dev-workflow/signals/status.json` is the primary driver (round 3 ⇒ re-ground, round 4 ⇒ fresh generator, etc.). +- **`recovery_rung`** in `status.json` records the rung explicitly — one of `same_fix` | `reground` | `fresh_generator` | `decompose` — so the rung is unambiguous even if rounds and rungs are re-tuned, and so the autopilot can read intent without re-deriving it. A fresh generator (rung 4) reads `recovery_rung` to learn it must take a different path rather than resume the stuck one. 
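The round-to-rung derivation described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the default round numbers from the ladder table (which the document says are tunable per project); the `escalation_type` key is hypothetical and is not a field the document defines for `status.json`.

```python
from typing import Optional

# Default mapping from the ladder table; projects may re-tune the round numbers.
_RUNG_BY_ROUND = {
    1: "same_fix",
    2: "same_fix",
    3: "reground",
    4: "fresh_generator",
    5: "decompose",
}


def rung_for_round(eval_round: int) -> Optional[str]:
    """Return the rung for a failing eval round, or None once the ladder is exhausted."""
    if eval_round < 1:
        raise ValueError("eval_round starts at 1")
    return _RUNG_BY_ROUND.get(eval_round)


def record_fail(status: dict, eval_round: int) -> dict:
    """Return a copy of status.json contents updated for a FAIL in the given round."""
    rung = rung_for_round(eval_round)
    updated = dict(status, eval_round=eval_round, eval_result="fail")
    if rung is None:
        # Ladder exhausted: park for the human gate.
        updated["blocked_on"] = "human"
        updated["escalation_type"] = "eval_not_converging"  # hypothetical key
    else:
        updated["recovery_rung"] = rung
        updated["blocked_on"] = None
    return updated
```

Storing the rung explicitly, instead of recomputing it from `eval_round` on every read, matches the stated goal that a recovering agent can land on the right rung even if the mapping is later re-tuned.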
+
+```json
+{
+  "phase": 5,
+  "eval_round": 4,
+  "recovery_rung": "fresh_generator",
+  "eval_result": "fail",
+  "blocked_on": null,
+  "updated_at": "2026-06-16T12:00:00Z"
+}
+```
+
+The workspace owns this state and advances its own rung — the autopilot only observes it and nudges (see [Cross-References](#cross-references)). The autopilot does **not** climb the ladder on the workspace's behalf.
+
+---
+
+## Spawning a Fresh Generator (Rung 4)
+
+The v1.8.0 spawn contract for the fresh generator (host-agnostic; same rules as any executor spawn):
+
+1. **Mode:** `native-bg-subagent` — spawned via the **Agent tool** with `run_in_background: true`, **no team**. It runs as an in-process background subagent.
+2. **Worktree:** it inherits the **EXISTING** worktree (`.feature-workspaces/`). The prior generator's commits are present; the fresh generator may revert, rework, or build on them — but its prompt forbids resuming the stuck approach.
+3. **Liveness:** it MUST pass the post-spawn liveness probe — `skills/patterns/executor/scripts/spawn-liveness-probe.sh `. A spawn call returning is **not** evidence the worker started; the probe confirms worktree activity, and the caller separately confirms the subagent process exists (`TaskList` shows ``). If the probe fails, tear down and re-spawn.
+4. **Gate-and-park:** like any generator, the fresh generator **gates and parks for human input** when it hits a decision it can't resolve — it does not invent product/security answers.
+
+The fresh generator is still a generator: the evaluator role is untouched, and the generator≠evaluator boundary is preserved across the swap.
+
+---
+
+## Cross-References
+
+| Where | What it covers |
+| ----- | -------------- |
+| `/aep-build` Phase 5 | Runs the multi-round gen/eval loop; this ladder governs what the generator does on each FAIL round. |
+| `eval-protocol.md` → Convergence Rules / needs-human gate | `max_rounds`, the escalation format, and the `needs-human.md` + `blocked_on` gate record the ladder feeds into. |
+| `aep-autopilot` tick-protocol Step ④ | The orchestrator observes `eval_round` / `recovery_rung`, nudges a stalled workspace, and emits the `eval_not_converging` escalation once the ladder is exhausted. It only nudges — the workspace runs its own loop and climbs its own ladder. |
+| `aep-executor` `scripts/spawn-liveness-probe.sh` | Post-spawn liveness probe the rung-4 fresh generator MUST pass. |
diff --git a/skills/patterns/gen-eval/references/scoring-framework.md b/skills/patterns/gen-eval/references/scoring-framework.md
index 8a762e2..affd1b3 100644
--- a/skills/patterns/gen-eval/references/scoring-framework.md
+++ b/skills/patterns/gen-eval/references/scoring-framework.md
@@ -86,6 +86,22 @@ Conventions, maintainability, performance?
 | 4 | Clean, consistent code with proper error handling and clear structure |
 | 5 | Exemplary — clear abstractions, well-named, efficient, follows all project conventions |
 
+### 6. Visual Design (1–5)
+
+Does a screenshot of the running UI match the project's design system? Evaluated by feeding a screenshot of the running app to the **multimodal evaluator**, scored against the project's design-system / calibration spec (`calibration/.yaml`, e.g. `calibration/visual-design.yaml`) — spacing rhythm, visual hierarchy, brand/token consistency, alignment, and overall polish.
+ +| Score | Definition | +|-------|-----------| +| 1 | Off-brand or visually broken — wrong colors/fonts, overlapping elements, no consistent spacing | +| 2 | Recognizable but inconsistent — ad-hoc spacing, mismatched tokens, weak hierarchy | +| 3 | Follows the design system loosely — on-brand but uneven spacing/alignment, generic polish | +| 4 | Consistent with the design system — correct tokens, clear hierarchy, aligned, minor polish gaps | +| 5 | Pixel-faithful to the design system — consistent tokens, deliberate hierarchy and spacing, fully aligned, production-grade polish | + +> **Hard failure:** Visual Design < 3 for the `.5` polish layer — a screenshot that does not match the design system blocks the polish layer from passing. +> +> **Host-aware capture:** The screenshot is captured per `executor/references/dogfood-validation.md` — `/agent-browser:dogfood` on Claude, native in-app browser / computer-use on Codex (GPT-5.4 multimodal), or a Playwright script — and the resulting image is fed to the in-host multimodal evaluator (Claude natively; Codex GPT-5.4). This keeps the visual judgment in-host and removes the human dependency for routine design-system checks. 
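The hard-failure rule for Visual Design fits the same threshold check as the other dimensions. The sketch below is illustrative only: it assumes the default thresholds quoted in the presets (any dimension below 3, Completeness below 4), takes scores as a plain dict keyed by dimension name, and handles the "UI / `.5` layers only" condition by leaving Visual Design out of the dict for non-UI slices. The function name and return shape are hypothetical.

```python
from typing import Dict, List


def hard_failures(
    scores: Dict[str, int],
    *,
    floor: int = 3,
    completeness_floor: int = 4,
) -> List[str]:
    """List every dimension whose 1-5 score falls below its hard-failure threshold."""
    failures = []
    for dimension, score in scores.items():
        # Completeness has a stricter floor than the other dimensions.
        limit = completeness_floor if dimension == "Completeness" else floor
        if score < limit:
            failures.append(f"{dimension} {score} < {limit}")
    return failures
```

An empty list means no hard failure was triggered; whether the story passes overall still depends on the rest of the scoring framework.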
+ --- ## Hard Failure Thresholds @@ -108,12 +124,13 @@ Select the preset matching the artifact type, then adjust with the user during e ### UI-heavy (forms, dashboards, layouts) ``` -Dimensions: Completeness, Correctness, UX Quality, Originality, Accessibility -Weight: UX Quality (high), Originality (high) +Dimensions: Completeness, Correctness, UX Quality, Visual Design, Originality, Accessibility +Weight: UX Quality (high), Visual Design (high), Originality (high) De-weight: Code Quality (still check but don't hard-fail) -Add: Originality — penalize generic "AI slop" (purple gradients, card layouts) +Add: Visual Design — score a screenshot against calibration/visual-design.yaml (multimodal) + Originality — penalize generic "AI slop" (purple gradients, card layouts) Accessibility — WCAG AA compliance, keyboard navigation, screen readers -Hard fail: UX Quality < 3, Completeness < 4 +Hard fail: UX Quality < 3, Visual Design < 3, Completeness < 4 ``` ### API-only (endpoints, services, integrations) @@ -151,11 +168,12 @@ Hard fail: Data Integrity < 4, Completeness < 4 ### Mixed / Full-stack ``` -Dimensions: Completeness, Correctness, UX Quality, Security, Code Quality +Dimensions: Completeness, Correctness, UX Quality, Visual Design, Security, Code Quality Weight: All equal (default) -Add: None — use the 5 defaults +Add: Visual Design — when the feature ships UI, score a screenshot against + calibration/visual-design.yaml (multimodal); omit for non-UI slices Adjust: Weight toward the area the user identifies as highest risk -Hard fail: Default thresholds (any < 3, Completeness < 4) +Hard fail: Default thresholds (any < 3, Completeness < 4); Visual Design < 3 on UI/.5 layers ``` --- diff --git a/skills/product-context/_shared/templates/product-context-schema.yaml b/skills/product-context/_shared/templates/product-context-schema.yaml index 4f918eb..b431c2f 100644 --- a/skills/product-context/_shared/templates/product-context-schema.yaml +++ 
b/skills/product-context/_shared/templates/product-context-schema.yaml @@ -397,6 +397,25 @@ topology: autonomous: false # true = /aep-autopilot can dispatch without human confirmation auto_design: false # true = skip /aep-design, go straight to /aep-launch for ambiguous stories skip_human_eval: none # none | backend | all — which stories skip human eval in /aep-wrap + # ─── v2 autonomy (all default to human-in-the-loop; opt-in only) ─── + full_auto: false # master switch — true automates the strategic gates (design escalation, qualitative outcome eval); implies auto_design + auto_outcome_eval + watch.auto_create + auto_outcome_eval: none # none | quantitative — quantitative layer outcome contracts auto-evaluate from telemetry (see reflect/references/telemetry-ingestion.md) + deploy_targets: # post-deploy dogfood targets (G4); omit → fall back to CI/deploy output + staging_url: null + production_url: null + dogfood: # host-aware post-deploy validation (see executor/references/dogfood-validation.md) + method: auto # auto | agent-browser | codex-native | playwright + post_deploy_env: none # none | staging | production + on_issue: create_story # create_story | escalate + post_merge_guard: # G4a — watch merged stories' deploy health (see autopilot/references/post-merge-guard.md) + window_min: 15 + auto_revert: false # conservative default: warn + escalate only; true = auto `gh pr revert` on confirmed regression + health_signals: [] # e.g. 
["ci", "error_rate", "health_endpoint"] + telemetry_sources: [] # G5 — read-only error-log/analytics/monitoring sources (reference env/secret store; never embed secrets) + watch: # G6 /aep-watch self-feeding discovery + sources: [] + interval: 30m + auto_create: false # surface proposed stories for confirmation; true (or full_auto) = auto-create + dispatch handoffs: - from: implementer to: evaluator diff --git a/skills/product-context/dispatch/SKILL.md b/skills/product-context/dispatch/SKILL.md index da56aeb..0e2391a 100644 --- a/skills/product-context/dispatch/SKILL.md +++ b/skills/product-context/dispatch/SKILL.md @@ -523,6 +523,22 @@ Use the `readiness_score` computed in Step 3: - **readiness_score 0.5–0.7** → present to user for decision (`/aep-launch` or `/aep-design`) - **readiness_score < 0.5** → route to `/aep-design` (spec needs refinement) +#### Full-auto / auto-design routing (medium/low readiness) + +Routing of an under-ready story depends on two `topology.routing` flags — the +`full_auto` master switch (default **false**) and the finer-grained `auto_design` +(default **false**). `full_auto` sits **above** `auto_design`: `full_auto: true` +implies `auto_design: true`. + +- **`full_auto: true` OR `auto_design: true`** → a medium/low-readiness story + (readiness < 0.7) is resolved by a **non-interactive gen/eval design resolver** — + a design agent that refines the spec without a human, then routes to + `/aep-launch` — instead of escalating to interactive `/aep-design` (the G3 human + gate). No strategic pause. +- **`full_auto: false` AND `auto_design: false` (default)** → keep escalating: a + medium/low-readiness story routes to interactive `/aep-design` for human design + refinement before launch. The strategic "what to build" gate stays with the human. 
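The routing rule above reduces to a small decision function. This is a sketch, not the skill's implementation: the return labels are hypothetical names for the three destinations, and it follows the full-auto / auto-design section in sending every under-ready story (readiness < 0.7) to interactive `/aep-design` when both flags are off.

```python
def route_story(
    readiness_score: float,
    *,
    full_auto: bool = False,
    auto_design: bool = False,
) -> str:
    """Route a story by readiness and the two topology.routing flags."""
    if readiness_score >= 0.7:
        return "/aep-launch"  # well-specified: skip design
    if full_auto or auto_design:
        # full_auto implies auto_design: refine the spec without a human, then launch.
        return "design-resolver"  # hypothetical label for the non-interactive resolver
    return "/aep-design"  # default: the human keeps the "what to build" gate
```

Because `full_auto` implies `auto_design`, the two flags collapse into a single `or`; keeping them as separate parameters mirrors the config so the finer-grained switch stays usable on its own.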
+ ### Well-specified (readiness >= 0.7) → skip to /aep-launch - 3+ specific, testable acceptance criteria diff --git a/skills/product-context/envision/templates/product-context-schema.yaml b/skills/product-context/envision/templates/product-context-schema.yaml index 4f918eb..b431c2f 100644 --- a/skills/product-context/envision/templates/product-context-schema.yaml +++ b/skills/product-context/envision/templates/product-context-schema.yaml @@ -397,6 +397,25 @@ topology: autonomous: false # true = /aep-autopilot can dispatch without human confirmation auto_design: false # true = skip /aep-design, go straight to /aep-launch for ambiguous stories skip_human_eval: none # none | backend | all — which stories skip human eval in /aep-wrap + # ─── v2 autonomy (all default to human-in-the-loop; opt-in only) ─── + full_auto: false # master switch — true automates the strategic gates (design escalation, qualitative outcome eval); implies auto_design + auto_outcome_eval + watch.auto_create + auto_outcome_eval: none # none | quantitative — quantitative layer outcome contracts auto-evaluate from telemetry (see reflect/references/telemetry-ingestion.md) + deploy_targets: # post-deploy dogfood targets (G4); omit → fall back to CI/deploy output + staging_url: null + production_url: null + dogfood: # host-aware post-deploy validation (see executor/references/dogfood-validation.md) + method: auto # auto | agent-browser | codex-native | playwright + post_deploy_env: none # none | staging | production + on_issue: create_story # create_story | escalate + post_merge_guard: # G4a — watch merged stories' deploy health (see autopilot/references/post-merge-guard.md) + window_min: 15 + auto_revert: false # conservative default: warn + escalate only; true = auto `gh pr revert` on confirmed regression + health_signals: [] # e.g. 
["ci", "error_rate", "health_endpoint"] + telemetry_sources: [] # G5 — read-only error-log/analytics/monitoring sources (reference env/secret store; never embed secrets) + watch: # G6 /aep-watch self-feeding discovery + sources: [] + interval: 30m + auto_create: false # surface proposed stories for confirmation; true (or full_auto) = auto-create + dispatch handoffs: - from: implementer to: evaluator diff --git a/skills/product-context/map/templates/product-context-schema.yaml b/skills/product-context/map/templates/product-context-schema.yaml index 4f918eb..b431c2f 100644 --- a/skills/product-context/map/templates/product-context-schema.yaml +++ b/skills/product-context/map/templates/product-context-schema.yaml @@ -397,6 +397,25 @@ topology: autonomous: false # true = /aep-autopilot can dispatch without human confirmation auto_design: false # true = skip /aep-design, go straight to /aep-launch for ambiguous stories skip_human_eval: none # none | backend | all — which stories skip human eval in /aep-wrap + # ─── v2 autonomy (all default to human-in-the-loop; opt-in only) ─── + full_auto: false # master switch — true automates the strategic gates (design escalation, qualitative outcome eval); implies auto_design + auto_outcome_eval + watch.auto_create + auto_outcome_eval: none # none | quantitative — quantitative layer outcome contracts auto-evaluate from telemetry (see reflect/references/telemetry-ingestion.md) + deploy_targets: # post-deploy dogfood targets (G4); omit → fall back to CI/deploy output + staging_url: null + production_url: null + dogfood: # host-aware post-deploy validation (see executor/references/dogfood-validation.md) + method: auto # auto | agent-browser | codex-native | playwright + post_deploy_env: none # none | staging | production + on_issue: create_story # create_story | escalate + post_merge_guard: # G4a — watch merged stories' deploy health (see autopilot/references/post-merge-guard.md) + window_min: 15 + auto_revert: false # conservative 
default: warn + escalate only; true = auto `gh pr revert` on confirmed regression + health_signals: [] # e.g. ["ci", "error_rate", "health_endpoint"] + telemetry_sources: [] # G5 — read-only error-log/analytics/monitoring sources (reference env/secret store; never embed secrets) + watch: # G6 /aep-watch self-feeding discovery + sources: [] + interval: 30m + auto_create: false # surface proposed stories for confirmation; true (or full_auto) = auto-create + dispatch handoffs: - from: implementer to: evaluator diff --git a/skills/product-context/reflect/SKILL.md b/skills/product-context/reflect/SKILL.md index 0b7ec80..f9cf201 100644 --- a/skills/product-context/reflect/SKILL.md +++ b/skills/product-context/reflect/SKILL.md @@ -48,6 +48,8 @@ Collect observations from all sources. Read product definition (from `product/in Ask the user one source at a time. Don't rush — the quality of classification depends on the quality of input. +**Automated ingestion (optional):** Automated sources — error logs, analytics, monitoring — can be pulled in directly per `references/telemetry-ingestion.md`, normalized into the same observation format Step 2 classifies. Configure endpoints under `topology.routing.telemetry_sources`. Automation **augments** the interactive sources above; it does not replace them — ingested records are merged with the human input before classification, and the human still reviews each classification. 
+ --- ## Step 2: Classify Each Observation @@ -140,6 +142,12 @@ If the completed layer has an `outcome_contract` defined in `product.layers[]`: summary: "Layer N outcome contract: [passed/failed] — [metric] was [actual] vs target [target]" ``` +**Auto-evaluation (optional, opt-in):** The pause above can be skipped per `references/telemetry-ingestion.md`: + +- If `topology.routing.auto_outcome_eval: quantitative` **and** the success metric is quantitative (a numeric target measurable from analytics/monitoring) → fetch the actual value per `references/telemetry-ingestion.md`, apply `keep_if`/`otherwise` mechanically, and record the result in the changelog — no pause. If the metric can't be fetched, fall back to the human pause. +- **Qualitative** metrics still pause for the human as described above — **unless** `topology.routing.full_auto: true`, in which case the agent evaluates the qualitative metric by its own judgment and applies the decision rule with no pause. +- Default (`auto_outcome_eval: none`, `full_auto: false`) preserves the current human-in-the-loop behavior exactly. + If no outcome contract exists for the completed layer, skip this step. --- diff --git a/skills/product-context/watch/SKILL.md b/skills/product-context/watch/SKILL.md new file mode 100644 index 0000000..6f82372 --- /dev/null +++ b/skills/product-context/watch/SKILL.md @@ -0,0 +1,289 @@ +--- +name: aep-watch +description: Continuously ingest bug trackers, error streams, and telemetry; classify each finding with the /aep-reflect classifier; dedupe against existing stories; and auto-create bug/refinement stories into product-context.yaml so reflect→dispatch becomes a self-feeding loop. Use when the user says "watch", "monitor for new work", "ingest errors", "auto-create stories from telemetry", "keep an eye on the bug tracker", "/aep-watch", or wants new work to enter the backlog without manually running /aep-envision or /aep-reflect. Runs from the MAIN workspace only. 
+--- + +# Watch + +Self-feeding work discovery. `/aep-watch` is a continuous/scheduled monitor that +**discovers** new work: it pulls from configured sources (bug trackers, error +streams, telemetry), classifies each finding with the **same classifier as +`/aep-reflect`**, dedupes against the existing backlog, and writes new +bug/refinement stories into `product-context.yaml`. Those stories then flow into +`/aep-dispatch` (or autopilot picks them up) — closing the loop so the system +keeps finding work to do without a human running `/aep-envision` or `/aep-reflect` +by hand. + +``` +sources → [ /aep-watch: pull → classify → dedupe → write stories ] → product-context.yaml + │ + ▼ + /aep-dispatch (or /aep-autopilot) +``` + +`/aep-reflect` is the **human-in-the-loop** feedback classifier you run after +shipping. `/aep-watch` is its **always-on** sibling: same classification logic, +no human prompting each finding — it is what makes the loop *continuous*. + +**Where this fits:** + +``` +/aep-envision → /aep-map → /aep-validate + → /aep-watch (continuous monitor — discovers + ingests new work) + → /aep-dispatch → … → /aep-wrap → /aep-reflect → loop + ▲ /aep-watch feeds the same stories section /aep-dispatch reads +``` + +**Session:** Main workspace only (like `/aep-autopilot`) — respects the orchestrator boundary. +**Driver:** `/loop ` (Claude Code) or `codex exec` cron/launchd (Codex). +**Input:** Sources configured in `topology.routing.watch`. +**Output:** New `bug` / `refinement` stories appended to the `stories` section of `product-context.yaml` (or surfaced as proposals for confirmation — see Config). + +--- + +## STOP — Orchestrator Boundary + +`/aep-watch` runs from the **main workspace only** and is an **orchestrator**, not +an executor. Like `/aep-autopilot`, it never reads, reviews, edits, or evaluates +**workspace code**. It only reads: + +- the configured sources (via their APIs/feeds — see Step 1), +- `product-context.yaml` (to dedupe and to write stories). 
+ +If a finding needs investigation that requires reading code, that happens inside +a **workspace agent** after the story is dispatched — never in the watch session. + +```bash +# Main workspace guard +pwd | grep -q '.feature-workspaces' && echo "ABORT: Run /aep-watch from main workspace only" && exit 1 +# Abort (not just warn) when the product context has not been created yet +[ -f product-context.yaml ] || { echo "ABORT: Run /aep-envision and /aep-map first"; exit 1; } +``` + +Any worker `/aep-watch` spawns (e.g. a cheap CHECK delegate to fetch + classify a +batch) is a **`native-bg-subagent`** on Claude Code, gated by the standard +**post-spawn liveness probe** (`scripts/spawn-liveness-probe.sh`): confirm the +agent exists AND shows activity before counting it; on failure, tear down and +fall back to `native-bg-subagent`. The watch session itself does **not** read +workspace code. + +--- + +## Config + +Watch is driven entirely by `topology.routing.watch` in `product-context.yaml`: + +```yaml +topology: + routing: + full_auto: false # A1 master switch (see below) + watch: + sources: # what to pull from — see references/telemetry-ingestion.md + - type: bug_tracker # e.g. github_issues, linear, jira, sentry, datadog, log_stream + query: "is:open label:bug" + - type: error_stream + dsn: "" + - type: telemetry + metric: "error_rate" + threshold: 0.02 + interval: 30m # poll cadence for the /loop or cron driver + auto_create: false # write stories directly vs. surface proposals + since: null # high-water mark — last ingested timestamp (watch maintains this) +``` + +**Confirmation policy (default conservative):** + +- **`full_auto: false` (default)** — watch **surfaces proposed stories** for human + confirmation. It writes them to a `watch_proposals` block (under + `topology.routing.watch`) and prints them; nothing enters the `stories` section + until the human approves. `auto_create: true` lets watch write stories directly + even when `full_auto` is off (a per-watch opt-in, narrower than the master switch). 
+- **`topology.routing.full_auto: true` (A1 master switch)** — watch **auto-creates + AND lets dispatch run** without confirmation: it writes new stories straight into + the `stories` section, and `/aep-dispatch` / `/aep-autopilot` pick them up on the + next tick. No human gate per finding. + +> **Resolution:** auto-create when `full_auto: true` **OR** `watch.auto_create: true`; +> otherwise surface proposals. When in doubt, surface — recreating noise as stories +> is worse than a confirmation prompt. + +--- + +## The Watch Loop + +Each tick runs the same four-step body. **Idempotent** — re-running with no new +source data produces no new stories (the dedupe + `since` high-water mark guarantee it). + +``` +① PULL → fetch new findings from each configured source (since high-water mark) +② CLASSIFY → run each finding through the /aep-reflect Step 2 classifier +③ DEDUPE → drop findings that already map to an existing story +④ WRITE → create bug/refinement stories (or surface proposals) +``` + +### Step 1: Pull from Sources + +For each entry in `watch.sources`, pull findings created/updated since +`watch.since`. **Reuse the ingestion format and per-source adapters defined in +`references/telemetry-ingestion.md`** (the same source contract `/aep-reflect` +Step 1 draws on) — do not invent a new finding shape here. Each finding normalizes to: + +```yaml +- source: "sentry" + external_id: "ISSUE-4821" # stable id used for dedupe + title: "TypeError in checkout flow" + detail: "..." # stack/message/metric summary + signal: error_stream # bug_tracker | error_stream | telemetry + count: 142 # occurrences / affected users (priority input) + first_seen: "" + last_seen: "" +``` + +Advance `watch.since` to the newest `last_seen` only **after** the tick completes +successfully (so a failed tick re-pulls rather than dropping findings). 
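The high-water-mark rule in Step 1 (advance `watch.since` to the newest `last_seen`, but only after a successful tick) can be sketched as below. This is a minimal illustration, not the skill's implementation: the function name `next_high_water_mark`, the helper `parse_ts`, and the assumption that `last_seen` is an ISO 8601 string are all hypothetical and not defined by the patch.

```python
from datetime import datetime


def parse_ts(value):
    """Parse an ISO 8601 timestamp, treating a trailing 'Z' as UTC (assumed format)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def next_high_water_mark(since, findings, tick_succeeded):
    """Return the value to store in watch.since after a tick.

    Hypothetical helper: a failed tick keeps the old mark so the next tick
    re-pulls the same findings instead of dropping them.
    """
    if not tick_succeeded:
        return since
    newest = since
    for finding in findings:
        last_seen = finding["last_seen"]
        # Never move the mark backwards; only a strictly newer timestamp wins.
        if newest is None or parse_ts(last_seen) > parse_ts(newest):
            newest = last_seen
    return newest
```

Keeping the mark unchanged on failure is what makes a re-run safe: the same findings are pulled again and the dedupe step filters anything already written.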
+ +### Step 2: Classify Each Finding + +Classify every finding using the **exact same classifier as `/aep-reflect` +Step 2** — bug / refinement / discovery / opportunity shift / process. **Do not +duplicate that logic here**; apply `/aep-reflect`'s "Classify Each Observation" +rules (see `../reflect/SKILL.md` → Step 2). Watch only acts autonomously on the +two categories it can safely turn into work: + +| Classification | Watch action | +| --------------------- | ------------------------------------------------------------------------- | +| **Bug** | Create a bug story (Step 4). | +| **Refinement** | Create a refinement story in the next layer (Step 4). | +| **Discovery** | Do NOT auto-create. Surface for `/aep-reflect` → `/aep-envision`/`/aep-map`. | +| **Opportunity shift** | Do NOT auto-create. Always escalate to a human — this changes the bet. | +| **Process / Calibration** | Do NOT auto-create. Surface for `/aep-reflect`. | + +Discoveries, opportunity shifts, calibrations, and process findings **always** +go to a human regardless of `full_auto` — they change product intent or workflow, +which watch must never decide autonomously. + +### Step 3: Dedupe Against Existing Stories + +Before creating anything, check the finding against the current `stories` section +of `product-context.yaml` (and existing `watch_proposals`). Skip a finding when: + +- a story already records this `source` + `external_id` (watch stamps + `watch_origin: { source, external_id }` on every story it creates), **or** +- an open story's `title`/description clearly covers the same issue + (same error signature, same endpoint, same metric). + +If a matching story exists but is `completed`/`closed` and the issue has +**recurred** (new occurrences after `completed_at`), do not silently recreate — +add a note and surface as a regression for human attention. Never recreate work. 
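The Step 3 dedupe rules can be sketched roughly as follows. The sketch covers only the exact `watch_origin` match and the recurrence check; the looser "an open story's title clearly covers the same issue" rule is a judgment call and is deliberately left out. The function name `dedupe_decision`, its return labels, and the ISO 8601 timestamp format are assumptions made for illustration.

```python
from datetime import datetime

# Statuses treated as finished work (taken from the Step 3 wording).
CLOSED_STATUSES = {"completed", "closed"}


def _ts(value):
    """Parse an ISO 8601 timestamp, treating a trailing 'Z' as UTC (assumed format)."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def dedupe_decision(finding, stories):
    """Return 'skip', 'regression', or 'create' for one normalized finding.

    Hypothetical helper: the first story whose watch_origin matches the
    finding's source + external_id decides the outcome.
    """
    key = (finding["source"], finding["external_id"])
    for story in stories:
        origin = story.get("watch_origin") or {}
        if (origin.get("source"), origin.get("external_id")) != key:
            continue
        completed_at = story.get("completed_at")
        recurred = (
            story.get("status") in CLOSED_STATUSES
            and completed_at is not None
            and _ts(finding["last_seen"]) > _ts(completed_at)
        )
        # A recurrence is surfaced to a human, never silently recreated.
        return "regression" if recurred else "skip"
    return "create"
```

Returning a distinct `regression` label, rather than folding it into `create`, keeps the "never recreate work" rule visible at the call site.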
+ +### Step 4: Write Stories (or Surface Proposals) + +For each surviving **bug** / **refinement** finding, build a story: + +```yaml +- id: "watch--" + title: "" + description: " (auto-discovered by /aep-watch from )" + type: bug # or refinement + status: pending + priority: high # bugs: high; tune by count/severity (see below) + layer: # bug → current layer; refinement → next layer + module: # leave unset if the source doesn't localize it + watch_origin: + source: "" + external_id: "" + discovered_at: "" +``` + +**Priority / layer rules (mirror `/aep-reflect`):** + +- **Bug** → `priority: high`, `status: pending`, in the **current/active layer** + (escalate to `critical` when `count` or severity is high, e.g. crash affecting + many users / error_rate over threshold). +- **Refinement** → `status: pending` in the **next layer**. +- Leave `module` / `files_affected` unset when the source can't localize them; + dispatch's readiness score will route these through `/aep-design` first. + +**Then, per the confirmation policy:** + +- **Auto-create** (`full_auto: true` OR `watch.auto_create: true`): append the + story to the `stories` section. It is now a normal pending story — + `/aep-dispatch` scores it and `/aep-autopilot` picks it up on the next tick. +- **Surface** (default): append the story object to `topology.routing.watch.watch_proposals` + instead, and print it. The human runs `/aep-reflect` (or confirms inline) to + promote proposals into `stories`. 
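Taken together, the Step 2 table and the confirmation policy amount to a small routing decision, sketched below. The function name `route_finding` and the destination labels are hypothetical; the classification strings are assumed to be lowercase versions of the categories in the Step 2 table.

```python
# Only these classifications may become stories without a human (Step 2 table).
AUTO_CREATABLE = {"bug", "refinement"}


def route_finding(classification, full_auto=False, auto_create=False):
    """Decide where a classified finding goes.

    Hypothetical helper returning one of:
      'human'           - discovery / opportunity shift / process / calibration
      'stories'         - appended to the stories section
      'watch_proposals' - surfaced for confirmation (the conservative default)
    """
    if classification not in AUTO_CREATABLE:
        # These change product intent or workflow, so full_auto never applies.
        return "human"
    if full_auto or auto_create:
        return "stories"
    return "watch_proposals"
```

The classification check comes first on purpose, so that turning on `full_auto` can never route a discovery or an opportunity shift into the backlog.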
+ +**Validate + commit** (same guardrails as reflect/dispatch — see +`../reflect/references/yaml-guardrails.md`): + +```bash +npx js-yaml product-context.yaml > /dev/null && echo "YAML OK" +# Resolve $BASE (integration branch): override → develop → main +BASE=$(git config --get aep.integration-branch 2>/dev/null || true) +[ -z "$BASE" ] && { git show-ref --verify --quiet refs/heads/develop \ + || git show-ref --verify --quiet refs/remotes/origin/develop; } && BASE=develop +BASE=${BASE:-main} +git pull --ff-only origin "$BASE" +git add product-context.yaml +git commit -m "chore: watch — auto-discovered N stories from " +git push origin "$BASE" +``` + +Append a `changelog` entry (`type: watch`) summarizing findings ingested, +classified, deduped, and created vs. proposed. + +--- + +## Driver + +`/aep-watch` is a continuous/scheduled monitor — the same driver matrix as +`/aep-autopilot` (executor `detect()` + the driver × backend matrix in +`.claude/skills/aep-executor/references/backends.md`): + +- **Claude Code — `/loop `** (long-lived, in-session): + + ``` + /loop 30m /aep-watch tick + ``` + + Use `watch.interval` for ``. The session stays alive, so any spawned + CHECK delegate is a session-bound **native-bg-subagent**. + +- **Codex — `codex exec` cron/launchd** (ephemeral, OS-scheduled): schedule + `/aep-watch tick` externally (e.g. `launchd` `StartInterval`, cron, or a + `while … sleep` loop), one cheap one-shot per tick. Workers must be OS-bound + (codex-exec). AEP prints the snippet; it does not install the scheduler. + +`/aep-watch tick` runs one pass of the four-step loop and exits. `/aep-watch stop` +cancels the driver (`/loop` cancel, or remove the cron/launchd job). + +--- + +## Guardrails + +- **Main workspace only** — refuse to run if `pwd` contains `.feature-workspaces`. +- **Never read workspace code** — watch reads sources + `product-context.yaml` only; + any code investigation happens inside a dispatched workspace agent. 
+- **Reuse, don't duplicate, the reflect classifier** — Step 2 applies + `/aep-reflect` Step 2; if classification logic changes, it changes there. +- **Conservative by default** — surface proposals unless `full_auto: true` (A1) + or `watch.auto_create: true`. When in doubt, surface. +- **Only bugs and refinements are auto-creatable** — discoveries, opportunity + shifts, calibrations, and process findings always go to a human. +- **Always dedupe** — never recreate work that already has a story; stamp + `watch_origin` so future ticks recognize it. +- **Spawned workers are native-bg-subagent + liveness probe** — never trust + "state says active"; confirm via the probe, fall back on failure. +- **Advance the high-water mark only on success** — a failed tick re-pulls. + +--- + +## Cross-References + +- `../reflect/SKILL.md` — **Step 2 classifier** (bug / refinement / discovery / …), + reused here verbatim; the human-in-the-loop counterpart to watch. +- `references/telemetry-ingestion.md` — source adapters + normalized finding format + used by Step 1 (shared with `/aep-reflect` Step 1). +- `../dispatch/SKILL.md` — consumes the stories watch creates (scoring, readiness, WIP). +- `../../patterns/autopilot/SKILL.md` — the orchestrator pattern, driver matrix, + liveness probe, and main-workspace boundary watch mirrors; autopilot picks up + watch-created stories on its next tick. From 16af148cc0c7f1e29cbd9d6ec569583e419b648a Mon Sep 17 00:00:00 2001 From: Memorysaver Date: Tue, 16 Jun 2026 00:20:55 +0800 Subject: [PATCH 6/8] fix(aep-v2): address design review MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - B1: author telemetry-ingestion.md in canonical _shared/references/ (was created in a build-generated dir and wiped by build-skills.sh); rebuild materializes it into reflect/ + watch/ — G5 + watch ingestion now resolve. - S1: add guard_state entry to autopilot state-schema (post-merge-guard idempotency). 
- S3: register post_merge_regression in the escalation type enum. - S2: document recovery_rung in eval-protocol status.json fields. - S4: schema health_signals example ci → ci_status (matches the guard's key). - S5: skill count 16 → 17 in README + orientation; add /aep-watch to orientation table. - N1: brief Codex dogfood recipe pointer in codex-native.md. - oxfmt markdown reformatting. Co-Authored-By: Claude Opus 4.8 (1M context) --- README.md | 2 +- docs/orientation.md | 7 +- .../build/SKILL.md | 14 +- .../autopilot/references/post-merge-guard.md | 16 +- .../autopilot/references/state-schema.md | 48 +++- .../executor/references/codex-native.md | 17 ++ .../executor/references/dogfood-validation.md | 10 +- .../gen-eval/references/agent-contracts.md | 63 +++-- .../gen-eval/references/eval-protocol.md | 7 + .../gen-eval/references/recovery-ladder.md | 26 +- .../gen-eval/references/scoring-framework.md | 256 +++++++++--------- .../_shared/references/telemetry-ingestion.md | 94 +++++++ .../templates/product-context-schema.yaml | 2 +- .../references/telemetry-ingestion.md | 94 +++++++ .../references/telemetry-ingestion.md | 94 +++++++ .../templates/product-context-schema.yaml | 2 +- .../map/references/telemetry-ingestion.md | 94 +++++++ .../map/templates/product-context-schema.yaml | 2 +- .../reflect/references/telemetry-ingestion.md | 94 +++++++ skills/product-context/watch/SKILL.md | 44 +-- .../watch/references/.aep-generated | 1 + .../references/orchestration-patterns.md | 203 ++++++++++++++ .../watch/references/telemetry-ingestion.md | 94 +++++++ .../watch/references/yaml-guardrails.md | 112 ++++++++ skills/project-setup/onboard/SKILL.md | 6 +- 25 files changed, 1177 insertions(+), 225 deletions(-) create mode 100644 skills/product-context/_shared/references/telemetry-ingestion.md create mode 100644 skills/product-context/dispatch/references/telemetry-ingestion.md create mode 100644 skills/product-context/envision/references/telemetry-ingestion.md create mode 100644 
skills/product-context/map/references/telemetry-ingestion.md create mode 100644 skills/product-context/reflect/references/telemetry-ingestion.md create mode 100644 skills/product-context/watch/references/.aep-generated create mode 100644 skills/product-context/watch/references/orchestration-patterns.md create mode 100644 skills/product-context/watch/references/telemetry-ingestion.md create mode 100644 skills/product-context/watch/references/yaml-guardrails.md diff --git a/README.md b/README.md index 8e4979d..111469c 100644 --- a/README.md +++ b/README.md @@ -615,7 +615,7 @@ These aren't rules we invented — they're patterns extracted from Anthropic's e ## Getting Started -**Brand new to AEP?** Start with the [Orientation Guide](docs/orientation.md) for a 10-minute tour of the mental models, the 16 skills, and the four paths — then run `/aep-onboard`. +**Brand new to AEP?** Start with the [Orientation Guide](docs/orientation.md) for a 10-minute tour of the mental models, the 17 skills, and the four paths — then run `/aep-onboard`. **New to this plugin?** diff --git a/docs/orientation.md b/docs/orientation.md index 401d35a..bae3bdc 100644 --- a/docs/orientation.md +++ b/docs/orientation.md @@ -1,6 +1,6 @@ # AEP Orientation Guide -**A 10-minute first-hour tour for new users.** Read this before (or right after) running `/aep-onboard`. When you finish, you'll know what AEP is, the three mental models that drive every skill, what each of the 16 skills does, and which of four concrete paths matches your situation. +**A 10-minute first-hour tour for new users.** Read this before (or right after) running `/aep-onboard`. When you finish, you'll know what AEP is, the three mental models that drive every skill, what each of the 17 skills does, and which of four concrete paths matches your situation. For precise definitions of every term used here, see the [Glossary](glossary.md). For a one-page decision tree, see the [Skills Quick Reference](skills-quick-reference.md). 
@@ -94,7 +94,7 @@ More: [README.md "The Feature Lifecycle"](../README.md) and [skills/agentic-deve --- -## 3. The 16 Skills at a Glance +## 3. The 17 Skills at a Glance | Skill | Plugin | Session | Purpose | | ------------------------ | ---------------------------- | --------- | --------------------------------------------------------------- | @@ -107,6 +107,7 @@ More: [README.md "The Feature Lifecycle"](../README.md) and [skills/agentic-deve | `/aep-dispatch` | product-context | Main | Pick next story + create OpenSpec change + hand off | | `/aep-calibrate` | product-context | Main | Human alignment checkpoint for any quality dimension | | `/aep-reflect` | product-context | Main | Classify feedback + update context (close the loop) | +| `/aep-watch` | product-context | Main | Ingest telemetry/errors → auto-file stories (self-feeding loop) | | `/aep-design` | agentic-development-workflow | Main | Interactive feature design (explore + propose + review) | | `/aep-launch` | agentic-development-workflow | Main | Spawn autonomous workspace + optional evaluator | | `/aep-build` | agentic-development-workflow | Workspace | Implement → test → PR → merge (autonomous) | @@ -237,4 +238,4 @@ One-line pointers so you know what to look up when you hit an unfamiliar term. F --- -**You're done with orientation.** The rest of AEP is discoverable from the three mental models, the 16-skill table, and the four paths. When in doubt, reach for the decision tree in the quick reference — it covers the common forks. +**You're done with orientation.** The rest of AEP is discoverable from the three mental models, the 17-skill table, and the four paths. When in doubt, reach for the decision tree in the quick reference — it covers the common forks. 
diff --git a/skills/agentic-development-workflow/build/SKILL.md b/skills/agentic-development-workflow/build/SKILL.md index c1d9904..dcc62b0 100644 --- a/skills/agentic-development-workflow/build/SKILL.md +++ b/skills/agentic-development-workflow/build/SKILL.md @@ -370,13 +370,13 @@ convergence rules are identical across modes. For each round N (starting at 1, max 5), the generator's response to a FAIL escalates along the **change-strategy recovery ladder** (`.claude/skills/aep-gen-eval/references/recovery-ladder.md`) rather than retrying the same way every round: -| Eval round | Rung | Generator move | -| ---------- | ---- | -------------- | -| 1–2 | **Same fix** | Same generator fixes the FAIL items in place (current default). | -| 3 | **Re-ground** | Same generator re-reads the FULL spec + design + contracts from scratch, then re-attempts. | -| 4 | **Different approach** | Spawn a **fresh `native-bg-subagent` generator** told "the previous approach failed on X; take a different design path" — not anchored on the stuck solution (it inherits the existing worktree). | -| 5 | **Decompose** | Split the story into sub-tasks; attempt the **smallest viable slice** and surface the proposed split. | -| after 5 | **Human gate** | Ladder exhausted → escalate with type `eval_not_converging`. | +| Eval round | Rung | Generator move | +| ---------- | ---------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| 1–2 | **Same fix** | Same generator fixes the FAIL items in place (current default). | +| 3 | **Re-ground** | Same generator re-reads the FULL spec + design + contracts from scratch, then re-attempts. 
| +| 4 | **Different approach** | Spawn a **fresh `native-bg-subagent` generator** told "the previous approach failed on X; take a different design path" — not anchored on the stuck solution (it inherits the existing worktree). | +| 5 | **Decompose** | Split the story into sub-tasks; attempt the **smallest viable slice** and surface the proposed split. | +| after 5 | **Human gate** | Ladder exhausted → escalate with type `eval_not_converging`. | Track the rung with `eval_round` + `recovery_rung` in `status.json` (see the ladder's State Tracking). **Generator≠evaluator separation holds** — the evaluator only scores; re-grounding, a fresh generator, and decomposition are all generator-side moves. **Skip the ladder and escalate immediately** on a hard-failure / security FAIL (auth-model gap, data-exposure risk), a spec contradiction, or a missing external dependency — these need human judgment, not a different approach. See the ladder file for full rung rationale and the rung-4 fresh-generator spawn contract (`native-bg-subagent` + post-spawn liveness probe). diff --git a/skills/patterns/autopilot/references/post-merge-guard.md b/skills/patterns/autopilot/references/post-merge-guard.md index 44e5e59..60795b2 100644 --- a/skills/patterns/autopilot/references/post-merge-guard.md +++ b/skills/patterns/autopilot/references/post-merge-guard.md @@ -61,13 +61,13 @@ Within the open window, each tick performs two independent reads: Read every signal named in `topology.routing.post_merge_guard.health_signals`. These are service-level, signals-only probes — no workspace code: -| Signal kind | How the orchestrator reads it (examples) | -| ------------------- | ----------------------------------------------------------------------------------------- | -| `ci_status` | `gh run view --json status,conclusion` for the post-merge pipeline | -| `health_endpoint` | `curl -fsS --max-time 5 ` (e.g. 
`/healthz`, `/readyz`) → expect 2xx | -| `error_rate` | query the project's metrics/log source for error-rate over the window vs. a baseline | -| `latency_p95` | same source — p95 latency vs. baseline threshold | -| `smoke_check` | a declared CLI/API smoke command exiting 0 | +| Signal kind | How the orchestrator reads it (examples) | +| ----------------- | ------------------------------------------------------------------------------------ | +| `ci_status` | `gh run view --json status,conclusion` for the post-merge pipeline | +| `health_endpoint` | `curl -fsS --max-time 5 ` (e.g. `/healthz`, `/readyz`) → expect 2xx | +| `error_rate` | query the project's metrics/log source for error-rate over the window vs. a baseline | +| `latency_p95` | same source — p95 latency vs. baseline threshold | +| `smoke_check` | a declared CLI/API smoke command exiting 0 | A signal is **red** when it fails its declared threshold (non-2xx health, CI `failure`, error-rate above baseline + margin, etc.). One transient red is not a regression — require the red to persist across **2 consecutive ticks** (or match a declared confirm rule) before treating it as confirmed, to avoid reverting on a deploy-warmup blip. @@ -117,7 +117,7 @@ A health signal is **confirmed red** (or the deploy failed). The deployed servic } ``` - **`auto_revert: true` (opt-in) and regression confirmed:** - 1. **Revert** — `gh pr revert ` (opens/auto-merges a revert PR per repo policy) or revert the merge commit on `$BASE` and push. This is the **one** sanctioned exception to "never act on the merge" — it is a *recovery* action, gated behind explicit opt-in, not a normal merge. + 1. **Revert** — `gh pr revert ` (opens/auto-merges a revert PR per repo policy) or revert the merge commit on `$BASE` and push. This is the **one** sanctioned exception to "never act on the merge" — it is a _recovery_ action, gated behind explicit opt-in, not a normal merge. 2. 
**Record an incident** — write `.dev-workflow/incidents/-.md` (or append to `autopilot-history.jsonl` with `type: incident`): the red signals, readings, the reverted PR, and the deploy outcome. 3. **Feed `/aep-reflect`** — hand the incident to the reflect classifier so the regression becomes a learning + a follow-up story (root-cause / guard hardening), closing the loop the same way Path 1 does for UX issues. 4. Set `guard_state.reverted = true` so no later tick reverts the same story twice (see [state](#state--idempotency)). diff --git a/skills/patterns/autopilot/references/state-schema.md b/skills/patterns/autopilot/references/state-schema.md index 9f5afe3..b0a6265 100644 --- a/skills/patterns/autopilot/references/state-schema.md +++ b/skills/patterns/autopilot/references/state-schema.md @@ -55,6 +55,18 @@ Machine-readable state file. Read and written by the autopilot tick. } ], + "guard_state": { + "PROJ-002": { + "pr_number": 142, + "deploy_status": "deployed", + "monitor_until": "2026-04-01T10:40:00Z", + "health": { "ci_status": "green", "error_rate": "ok", "health_endpoint": "ok" }, + "dogfood_done": true, + "reverted": false, + "escalated": false + } + }, + "stats": { "stories_completed": 3, "stories_failed": 0, @@ -121,16 +133,32 @@ Machine-readable state file. Read and written by the autopilot tick. 
#### Escalation Entry -| Field | Type | Description | -| ----------------------- | ------------ | -------------------------------------------------------------------------------------------------------- | -| `type` | enum | `"design_needed"`, `"stuck"`, `"failed"`, `"layer_gate_failed"`, `"eval_not_converging"`, `"human_gate"` | -| `story_id` | string | Related story ID | -| `workspace` | string\|null | Workspace name (null if not yet launched) | -| `reason` | string | One-line reason | -| `details` | string | Detailed explanation of why escalation triggered | -| `expected_human_action` | string | What the human should do | -| `created_at` | string | ISO8601 timestamp | -| `acknowledged` | boolean | Whether human has seen this | +| Field | Type | Description | +| ----------------------- | ------------ | ----------------------------------------------------------------------------------------------------------------------------------- | +| `type` | enum | `"design_needed"`, `"stuck"`, `"failed"`, `"layer_gate_failed"`, `"eval_not_converging"`, `"human_gate"`, `"post_merge_regression"` | +| `story_id` | string | Related story ID | +| `workspace` | string\|null | Workspace name (null if not yet launched) | +| `reason` | string | One-line reason | +| `details` | string | Detailed explanation of why escalation triggered | +| `expected_human_action` | string | What the human should do | +| `created_at` | string | ISO8601 timestamp | +| `acknowledged` | boolean | Whether human has seen this | + +#### `guard_state` Entry (post-merge guard, keyed by story_id) + +Persists the post-merge guard's per-story state so a tick is idempotent +(deploy-once, dogfood-once, revert-once, escalate-once). See +`references/post-merge-guard.md`. 
+ +| Field | Type | Description | +| --------------- | ------- | --------------------------------------------------------------------------------- | +| `pr_number` | number | Merged PR being guarded | +| `deploy_status` | enum | `"pending"`, `"deployed"`, `"failed"` | +| `monitor_until` | string | ISO8601 — end of the `window_min` monitoring window | +| `health` | object | Last reading per configured `health_signals` key (e.g. `ci_status`, `error_rate`) | +| `dogfood_done` | boolean | Whether host-aware post-deploy dogfood has run for this story | +| `reverted` | boolean | Whether a revert was already issued (guards against double-revert) | +| `escalated` | boolean | Whether a `post_merge_regression` escalation was already emitted | --- diff --git a/skills/patterns/executor/references/codex-native.md b/skills/patterns/executor/references/codex-native.md index 3323e6e..0637632 100644 --- a/skills/patterns/executor/references/codex-native.md +++ b/skills/patterns/executor/references/codex-native.md @@ -207,3 +207,20 @@ Same as codex-subagent: `codex exec --cd ` with the evaluator prompt. The exec process exits on its own when the build completes; nothing to kill. Then the common worktree removal from `backends.md`. + +--- + +## Dogfood / post-deploy validation (Codex) + +Host-aware dogfood (`dogfood_method()`) for Codex resolves by mode, per +[`dogfood-validation.md`](dogfood-validation.md): + +- **codex-subagent** (desktop, GPT-5.4 multimodal): use the **native in-app + browser + computer-use** to drive the app and capture screenshots (computer-use + is desktop-only). Fallback: the Playwright skill, then agent-browser CLI. +- **codex-exec** (headless): **write and run a Playwright script** (no computer-use + off the desktop app). Fallback: agent-browser CLI → API/curl checks. + +Screenshots feed the multimodal evaluator's Visual Design dimension +(`aep-gen-eval/references/scoring-framework.md`). 
Full selection + `target_url()` +resolution live in `dogfood-validation.md`. diff --git a/skills/patterns/executor/references/dogfood-validation.md b/skills/patterns/executor/references/dogfood-validation.md index 0cd8691..0a029d6 100644 --- a/skills/patterns/executor/references/dogfood-validation.md +++ b/skills/patterns/executor/references/dogfood-validation.md @@ -50,11 +50,11 @@ dogfood_method(): else: return "degrade" # API checks ``` -| Host / mode | Native method (default) | Detection | Fallback | -| ------------------------------ | -------------------------------------- | ------------------------------- | --------------------------------- | -| **Claude Code** (any mode) | `/agent-browser:dogfood` | `agent_browser_healthy()` | non-UI → API/curl; UI → human-eval | +| Host / mode | Native method (default) | Detection | Fallback | +| ---------------------------------- | --------------------------------------------------------- | ------------------------------ | ------------------------------------ | +| **Claude Code** (any mode) | `/agent-browser:dogfood` | `agent_browser_healthy()` | non-UI → API/curl; UI → human-eval | | **Codex desktop** (codex-subagent) | native in-app browser + computer-use (GPT-5.4 multimodal) | desktop + computer-use enabled | Playwright skill → agent-browser CLI | -| **Codex headless** (codex-exec) | write + run a Playwright script | `playwright_available()` | agent-browser CLI → API checks | +| **Codex headless** (codex-exec) | write + run a Playwright script | `playwright_available()` | agent-browser CLI → API checks | > **Why Codex splits two ways.** Computer-use and the in-app (Atlas) browser are > **desktop-only**. 
`codex exec` (headless) has neither, so it writes and runs a @@ -97,7 +97,7 @@ post-deploy report path (staging/prod), one entry per finding: **Severity:** blocker | major | minor **Category:** UX | logic | visual | edge-case | accessibility | performance **Repro:** -**Observed:** **Expected:** +**Observed:** **Expected:** **Evidence:** ``` diff --git a/skills/patterns/gen-eval/references/agent-contracts.md b/skills/patterns/gen-eval/references/agent-contracts.md index d5b17f6..fb0dfe4 100644 --- a/skills/patterns/gen-eval/references/agent-contracts.md +++ b/skills/patterns/gen-eval/references/agent-contracts.md @@ -17,13 +17,13 @@ Role definitions and prompt templates for generator and evaluator agents. The co ## Role Separation Principle -| Rule | Rationale | -|------|-----------| -| Generator MUST NOT evaluate its own output | Agents consistently praise their own work | -| Evaluator MUST NOT see generator's self-assessment | Anchoring bias corrupts independent evaluation | -| Generator MUST NOT modify evaluator's scores or findings | Data integrity of evaluation results | -| Evaluator MUST NOT implement fixes | Role contamination — evaluator becomes invested in the fix | -| Both agents receive the SAME spec/requirements | Ensures evaluation is against the spec, not the generator's interpretation | +| Rule | Rationale | +| -------------------------------------------------------- | -------------------------------------------------------------------------- | +| Generator MUST NOT evaluate its own output | Agents consistently praise their own work | +| Evaluator MUST NOT see generator's self-assessment | Anchoring bias corrupts independent evaluation | +| Generator MUST NOT modify evaluator's scores or findings | Data integrity of evaluation results | +| Evaluator MUST NOT implement fixes | Role contamination — evaluator becomes invested in the fix | +| Both agents receive the SAME spec/requirements | Ensures evaluation is against the spec, not the generator's 
interpretation | --- @@ -33,12 +33,12 @@ Role definitions and prompt templates for generator and evaluator agents. The co The generator produces or validates an artifact by attempting to use it. In different contexts: -| Context | Generator does | -|---------|---------------| -| **Code review** (build) | Implements tasks, then self-checks completeness (but cannot score quality) | -| **Artifact validation** (validate) | Walks through each item mentally, identifies gaps and ambiguities | -| **Design review** | Attempts to implement the design mentally, finds missing details | -| **Document review** | Follows the document's instructions step by step | +| Context | Generator does | +| ---------------------------------- | -------------------------------------------------------------------------- | +| **Code review** (build) | Implements tasks, then self-checks completeness (but cannot score quality) | +| **Artifact validation** (validate) | Walks through each item mentally, identifies gaps and ambiguities | +| **Design review** | Attempts to implement the design mentally, finds missing details | +| **Document review** | Follows the document's instructions step by step | ### Generator constraints @@ -54,12 +54,14 @@ The generator produces a structured artifact or a findings list: ```markdown ## Assessment of [item] + **Can implement?** Yes/No **Missing details:** + - [specific gap that would cause guesswork] -**Dependency gaps:** + **Dependency gaps:** - [what this item needs but doesn't declare] -**Assumption mismatches:** + **Assumption mismatches:** - [implicit assumption that could be wrong] ``` @@ -71,13 +73,13 @@ The generator produces a structured artifact or a findings list: The evaluator independently assesses work against specifications. It has NO knowledge of the generator's internal reasoning or self-assessment. 
-| Context | Evaluator does | -|---------|---------------| -| **Code review** (build) | Tests running application, reviews code, scores dimensions | -| **UI work** (build) | Additionally receives screenshot(s) of the running app and scores Visual Design against the calibration/design-system spec (multimodal) | -| **Artifact validation** (validate) | Checks claims against codebase, verifies file paths, API shapes | -| **Design review** | Verifies technical feasibility against actual code | -| **Document review** | Confirms factual claims, tests commands | +| Context | Evaluator does | +| ---------------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- | +| **Code review** (build) | Tests running application, reviews code, scores dimensions | +| **UI work** (build) | Additionally receives screenshot(s) of the running app and scores Visual Design against the calibration/design-system spec (multimodal) | +| **Artifact validation** (validate) | Checks claims against codebase, verifies file paths, API shapes | +| **Design review** | Verifies technical feasibility against actual code | +| **Document review** | Confirms factual claims, tests commands | ### Evaluator constraints @@ -96,7 +98,9 @@ The evaluator independently assesses work against specifications. It has NO know # Evaluation Round ## Findings + ### [PASS/FAIL]: [Finding title] ([Dimension]: [Score]) + - Steps to reproduce: [concrete steps] - Expected: [what should happen] - Actual: [what actually happens] @@ -104,14 +108,17 @@ The evaluator independently assesses work against specifications. It has NO know - Fix: [specific, actionable suggestion] ## Scores + - [Dimension 1]: [Score] — [justification referencing scale definition] - [Dimension 2]: [Score] — [justification] -... + ... 
## Result: PASS / FAIL + [If FAIL: which thresholds were violated, what must be fixed] ## Verification Updates + [Which items in feature-verification.json were updated] ``` @@ -137,18 +144,22 @@ A specialized evaluator that checks whether an artifact is compatible with the d # Protocol Compatibility Report ## Required fields check + - [field]: present / MISSING - [field]: present / MISSING (required by [downstream skill]) ## Structural validation + - DAG validity: PASS / FAIL ([details]) - Cross-references: PASS / FAIL ([broken refs]) - Scoring compatibility: PASS / FAIL ([missing inputs]) ## File conflict analysis + - [file]: modified by [story A] and [story B] in same slice ## Summary + [N] required fixes, [M] warnings ``` @@ -161,12 +172,14 @@ What each agent receives determines the quality of evaluation. Too much context ### Generator context **Include:** + 1. The artifact being validated — full content 2. The artifact's purpose — what downstream consumer uses it 3. Technical constraints — stack, conventions, existing patterns 4. Dependencies — what this artifact builds on **Exclude:** + - Full codebase (evaluator's job) - History of how the artifact was created - Other artifacts not directly consumed @@ -175,12 +188,14 @@ What each agent receives determines the quality of evaluation. Too much context ### Evaluator context **Include:** + 1. The artifact being validated — full content 2. The original spec/requirements — NOT the generator's interpretation 3. Read access to the codebase — package.json, schemas, configs, source 4. The specific claims to verify — file paths, versions, API signatures **Exclude:** + - Generator's self-assessment or findings - Product vision or business context (unless evaluating product artifacts) - Other evaluator's findings (if running multiple evaluators) @@ -188,11 +203,13 @@ What each agent receives determines the quality of evaluation. Too much context ### Protocol Checker context **Include:** + 1. 
The artifact — specifically the section being checked 2. The downstream protocol specification — exact field requirements, format rules 3. Structural constraints — DAG rules, naming conventions **Exclude:** + - The codebase (not relevant for protocol checking) - Quality dimensions (not its role) - Business context diff --git a/skills/patterns/gen-eval/references/eval-protocol.md b/skills/patterns/gen-eval/references/eval-protocol.md index 7e7728e..e2607da 100644 --- a/skills/patterns/gen-eval/references/eval-protocol.md +++ b/skills/patterns/gen-eval/references/eval-protocol.md @@ -196,11 +196,18 @@ Used in multi-round mode (workspace context). Files live in `.dev-workflow/signa "phase_name": "code-review", "eval_round": 2, "eval_result": "fail", + "recovery_rung": "reground", "completion_pct": 75, "updated_at": "2026-03-30T12:00:00Z" } ``` +`recovery_rung` (optional) tracks which rung of the +[recovery ladder](recovery-ladder.md) the generator is on when `eval_result` +keeps failing — `same_fix` | `reground` | `fresh_generator` | `decompose`. The +autopilot tick reads it to know the ladder is being climbed before it emits an +`eval_not_converging` escalation. + ### needs-human.md (worker writes — the human-gate record) When the loop cannot converge (or any decision needs the human), the worker diff --git a/skills/patterns/gen-eval/references/recovery-ladder.md b/skills/patterns/gen-eval/references/recovery-ladder.md index fddcd49..7302847 100644 --- a/skills/patterns/gen-eval/references/recovery-ladder.md +++ b/skills/patterns/gen-eval/references/recovery-ladder.md @@ -22,13 +22,13 @@ This reference defines an escalating recovery ladder. Each rung tries something Round numbers are tunable per project; the **shape** is what matters — each rung is a strictly larger change of strategy than the one below it. -| Eval round | Rung | Strategy | -| ---------- | ---- | -------- | -| 1–2 | **Same fix** | Same generator fixes the FAIL items normally. 
Current default behavior. | -| 3 | **Re-ground** | Same generator re-reads the FULL spec + design + contracts **from scratch** and re-attempts. | -| 4 | **Different approach** | Spawn a **fresh generator** told "the previous approach failed on X; take a different design path." Not anchored on the stuck solution. | -| 5 | **Decompose** | Split the story into smaller sub-stories / sub-tasks; attempt the **smallest viable slice**. Surface the proposed split. | -| after 5 | **Human gate** | Ladder exhausted → escalate with type `eval_not_converging`. | +| Eval round | Rung | Strategy | +| ---------- | ---------------------- | --------------------------------------------------------------------------------------------------------------------------------------- | +| 1–2 | **Same fix** | Same generator fixes the FAIL items normally. Current default behavior. | +| 3 | **Re-ground** | Same generator re-reads the FULL spec + design + contracts **from scratch** and re-attempts. | +| 4 | **Different approach** | Spawn a **fresh generator** told "the previous approach failed on X; take a different design path." Not anchored on the stuck solution. | +| 5 | **Decompose** | Split the story into smaller sub-stories / sub-tasks; attempt the **smallest viable slice**. Surface the proposed split. | +| after 5 | **Human gate** | Ladder exhausted → escalate with type `eval_not_converging`. | ### Round 1–2 — Same fix (current behavior) @@ -105,9 +105,9 @@ The fresh generator is still a generator: the evaluator role is untouched, and t ## Cross-References -| Where | What it covers | -| ----- | -------------- | -| `/aep-build` Phase 5 | Runs the multi-round gen/eval loop; this ladder governs what the generator does on each FAIL round. | -| `eval-protocol.md` → Convergence Rules / needs-human gate | `max_rounds`, the escalation format, and the `needs-human.md` + `blocked_on` gate record the ladder feeds into. 
| -| `aep-autopilot` tick-protocol Step ④ | The orchestrator observes `eval_round` / `recovery_rung`, nudges a stalled workspace, and emits the `eval_not_converging` escalation once the ladder is exhausted. It only nudges — the workspace runs its own loop and climbs its own ladder. | -| `aep-executor` `scripts/spawn-liveness-probe.sh` | Post-spawn liveness probe the rung-4 fresh generator MUST pass. | +| Where | What it covers | +| --------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `/aep-build` Phase 5 | Runs the multi-round gen/eval loop; this ladder governs what the generator does on each FAIL round. | +| `eval-protocol.md` → Convergence Rules / needs-human gate | `max_rounds`, the escalation format, and the `needs-human.md` + `blocked_on` gate record the ladder feeds into. | +| `aep-autopilot` tick-protocol Step ④ | The orchestrator observes `eval_round` / `recovery_rung`, nudges a stalled workspace, and emits the `eval_not_converging` escalation once the ladder is exhausted. It only nudges — the workspace runs its own loop and climbs its own ladder. | +| `aep-executor` `scripts/spawn-liveness-probe.sh` | Post-spawn liveness probe the rung-4 fresh generator MUST pass. | diff --git a/skills/patterns/gen-eval/references/scoring-framework.md b/skills/patterns/gen-eval/references/scoring-framework.md index affd1b3..88e5573 100644 --- a/skills/patterns/gen-eval/references/scoring-framework.md +++ b/skills/patterns/gen-eval/references/scoring-framework.md @@ -30,73 +30,73 @@ Evaluate each dimension on a 1–5 scale. Score honestly — the value of evalua Does the implementation cover all tasks and specs? 
-| Score | Definition | -|-------|-----------| -| 1 | Multiple tasks unimplemented or stubbed out | -| 2 | Most tasks attempted but significant gaps remain | -| 3 | All tasks addressed but some have missing edge cases or incomplete flows | -| 4 | All tasks fully implemented with minor omissions | -| 5 | Every task, edge case, and spec requirement implemented and verified | +| Score | Definition | +| ----- | ------------------------------------------------------------------------ | +| 1 | Multiple tasks unimplemented or stubbed out | +| 2 | Most tasks attempted but significant gaps remain | +| 3 | All tasks addressed but some have missing edge cases or incomplete flows | +| 4 | All tasks fully implemented with minor omissions | +| 5 | Every task, edge case, and spec requirement implemented and verified | ### 2. Correctness (1–5) Does the implementation work as specified? Are edge cases handled? -| Score | Definition | -|-------|-----------| -| 1 | Core functionality broken — primary flows fail | -| 2 | Main flows work but secondary flows or error paths fail | -| 3 | Flows work under normal conditions but break on edge cases | -| 4 | All flows work correctly with minor edge case gaps | -| 5 | All flows work correctly including error states, empty states, and boundary conditions | +| Score | Definition | +| ----- | -------------------------------------------------------------------------------------- | +| 1 | Core functionality broken — primary flows fail | +| 2 | Main flows work but secondary flows or error paths fail | +| 3 | Flows work under normal conditions but break on edge cases | +| 4 | All flows work correctly with minor edge case gaps | +| 5 | All flows work correctly including error states, empty states, and boundary conditions | ### 3. UX Quality (1–5) Is the interface intuitive, responsive, and accessible? 
-| Score | Definition | -|-------|-----------| -| 1 | Interface is confusing — users cannot complete basic tasks without guessing | -| 2 | Interface works but has unintuitive interactions or missing feedback | -| 3 | Functional UX with standard patterns but nothing polished | -| 4 | Clean, intuitive UX with proper loading states, error messages, and responsive layout | -| 5 | Polished UX with thoughtful transitions, accessibility, and delight details | +| Score | Definition | +| ----- | ------------------------------------------------------------------------------------- | +| 1 | Interface is confusing — users cannot complete basic tasks without guessing | +| 2 | Interface works but has unintuitive interactions or missing feedback | +| 3 | Functional UX with standard patterns but nothing polished | +| 4 | Clean, intuitive UX with proper loading states, error messages, and responsive layout | +| 5 | Polished UX with thoughtful transitions, accessibility, and delight details | ### 4. Security (1–5) Input validation, auth checks, data exposure? 
-| Score | Definition | -|-------|-----------| -| 1 | Critical vulnerabilities — SQL injection, XSS, or auth bypass possible | -| 2 | Major gaps — missing input validation on user-facing endpoints | -| 3 | Basic validation present but inconsistent; some endpoints lack auth checks | -| 4 | Solid validation and auth coverage with minor gaps in edge cases | -| 5 | Comprehensive validation, parameterized queries, proper auth on all routes, no data leaks | +| Score | Definition | +| ----- | ----------------------------------------------------------------------------------------- | +| 1 | Critical vulnerabilities — SQL injection, XSS, or auth bypass possible | +| 2 | Major gaps — missing input validation on user-facing endpoints | +| 3 | Basic validation present but inconsistent; some endpoints lack auth checks | +| 4 | Solid validation and auth coverage with minor gaps in edge cases | +| 5 | Comprehensive validation, parameterized queries, proper auth on all routes, no data leaks | ### 5. Code Quality (1–5) Conventions, maintainability, performance? 
-| Score | Definition | -|-------|-----------| -| 1 | Inconsistent patterns, duplicated logic, no error handling | -| 2 | Works but fragile — magic numbers, unclear naming, mixed conventions | -| 3 | Acceptable quality following basic conventions; some areas need cleanup | -| 4 | Clean, consistent code with proper error handling and clear structure | -| 5 | Exemplary — clear abstractions, well-named, efficient, follows all project conventions | +| Score | Definition | +| ----- | -------------------------------------------------------------------------------------- | +| 1 | Inconsistent patterns, duplicated logic, no error handling | +| 2 | Works but fragile — magic numbers, unclear naming, mixed conventions | +| 3 | Acceptable quality following basic conventions; some areas need cleanup | +| 4 | Clean, consistent code with proper error handling and clear structure | +| 5 | Exemplary — clear abstractions, well-named, efficient, follows all project conventions | ### 6. Visual Design (1–5) Does a screenshot of the running UI match the project's design system? Evaluated by feeding a screenshot of the running app to the **multimodal evaluator**, scored against the project's design-system / calibration spec (`calibration/.yaml`, e.g. `calibration/visual-design.yaml`) — spacing rhythm, visual hierarchy, brand/token consistency, alignment, and overall polish. 
-| Score | Definition | -|-------|-----------| -| 1 | Off-brand or visually broken — wrong colors/fonts, overlapping elements, no consistent spacing | -| 2 | Recognizable but inconsistent — ad-hoc spacing, mismatched tokens, weak hierarchy | -| 3 | Follows the design system loosely — on-brand but uneven spacing/alignment, generic polish | -| 4 | Consistent with the design system — correct tokens, clear hierarchy, aligned, minor polish gaps | -| 5 | Pixel-faithful to the design system — consistent tokens, deliberate hierarchy and spacing, fully aligned, production-grade polish | +| Score | Definition | +| ----- | --------------------------------------------------------------------------------------------------------------------------------- | +| 1 | Off-brand or visually broken — wrong colors/fonts, overlapping elements, no consistent spacing | +| 2 | Recognizable but inconsistent — ad-hoc spacing, mismatched tokens, weak hierarchy | +| 3 | Follows the design system loosely — on-brand but uneven spacing/alignment, generic polish | +| 4 | Consistent with the design system — correct tokens, clear hierarchy, aligned, minor polish gaps | +| 5 | Pixel-faithful to the design system — consistent tokens, deliberate hierarchy and spacing, fully aligned, production-grade polish | > **Hard failure:** Visual Design < 3 for the `.5` polish layer — a screenshot that does not match the design system blocks the polish layer from passing. 
> @@ -184,65 +184,65 @@ When evaluating product context, architecture, or design artifacts (not code): ### Completeness (1–5) -| Score | Definition | -|-------|-----------| -| 1 | Major sections missing, enums undefined, no defaults specified | -| 2 | Sections present but sparse — many fields lack values or constraints | -| 3 | All sections present with some gaps in specificity | -| 4 | Comprehensive with minor omissions (e.g., a missing enum value) | -| 5 | Every field specified, all enums listed, all defaults documented | +| Score | Definition | +| ----- | -------------------------------------------------------------------- | +| 1 | Major sections missing, enums undefined, no defaults specified | +| 2 | Sections present but sparse — many fields lack values or constraints | +| 3 | All sections present with some gaps in specificity | +| 4 | Comprehensive with minor omissions (e.g., a missing enum value) | +| 5 | Every field specified, all enums listed, all defaults documented | ### Consistency (1–5) -| Score | Definition | -|-------|-----------| -| 1 | Field names conflict across sections, broken references | -| 2 | Some naming mismatches, a few invalid cross-references | -| 3 | Generally consistent with isolated inconsistencies | -| 4 | Consistent naming and valid references with minor style variations | -| 5 | Perfectly consistent naming, all cross-references valid, uniform conventions | +| Score | Definition | +| ----- | ---------------------------------------------------------------------------- | +| 1 | Field names conflict across sections, broken references | +| 2 | Some naming mismatches, a few invalid cross-references | +| 3 | Generally consistent with isolated inconsistencies | +| 4 | Consistent naming and valid references with minor style variations | +| 5 | Perfectly consistent naming, all cross-references valid, uniform conventions | ### Implementability (1–5) -| Score | Definition | -|-------|-----------| -| 1 | Stories cannot be implemented — critical 
technical details missing | -| 2 | Most stories implementable but several have ambiguous acceptance criteria | -| 3 | All stories have a path to implementation with some guesswork needed | -| 4 | Clear implementation path with minor ambiguities | -| 5 | Every story is unambiguous — an implementer agent could build it without questions | +| Score | Definition | +| ----- | ---------------------------------------------------------------------------------- | +| 1 | Stories cannot be implemented — critical technical details missing | +| 2 | Most stories implementable but several have ambiguous acceptance criteria | +| 3 | All stories have a path to implementation with some guesswork needed | +| 4 | Clear implementation path with minor ambiguities | +| 5 | Every story is unambiguous — an implementer agent could build it without questions | ### Security (1–5) -| Score | Definition | -|-------|-----------| -| 1 | No security considerations in the design | -| 2 | Security mentioned but critical gaps (e.g., no auth model, PII unaddressed) | -| 3 | Basic security covered but edge cases missing | -| 4 | Comprehensive security design with minor gaps | -| 5 | Security-first design with threat model, data lineage, and compliance considerations | +| Score | Definition | +| ----- | ------------------------------------------------------------------------------------ | +| 1 | No security considerations in the design | +| 2 | Security mentioned but critical gaps (e.g., no auth model, PII unaddressed) | +| 3 | Basic security covered but edge cases missing | +| 4 | Comprehensive security design with minor gaps | +| 5 | Security-first design with threat model, data lineage, and compliance considerations | ### Downstream Compatibility (1–5) -| Score | Definition | -|-------|-----------| -| 1 | Artifact cannot be consumed by downstream skills (missing required fields) | -| 2 | Most fields present but format mismatches prevent consumption | -| 3 | Consumable with minor fixups needed | -| 4 
| Fully compatible with minor cosmetic issues | -| 5 | Perfect compatibility — downstream skills can consume without any transformation | +| Score | Definition | +| ----- | -------------------------------------------------------------------------------- | +| 1 | Artifact cannot be consumed by downstream skills (missing required fields) | +| 2 | Most fields present but format mismatches prevent consumption | +| 3 | Consumable with minor fixups needed | +| 4 | Fully compatible with minor cosmetic issues | +| 5 | Perfect compatibility — downstream skills can consume without any transformation | ### Walking Skeleton Validity (1–5) Does Layer 0 represent the thinnest possible end-to-end user journey? -| Score | Definition | -|-------|-----------| -| 1 | Layer 0 has gold-plated features, infrastructure-only stories, or no clear user journey | -| 2 | A user journey exists but includes unnecessary scope — some stories could move to Layer 1+ | -| 3 | Mostly minimal but 1-2 stories feel over-scoped for a walking skeleton | -| 4 | Genuinely thin path with one minor luxury that could be deferred | -| 5 | The absolute minimum — a user can complete the crudest possible journey, nothing more | +| Score | Definition | +| ----- | ------------------------------------------------------------------------------------------ | +| 1 | Layer 0 has gold-plated features, infrastructure-only stories, or no clear user journey | +| 2 | A user journey exists but includes unnecessary scope — some stories could move to Layer 1+ | +| 3 | Mostly minimal but 1-2 stories feel over-scoped for a walking skeleton | +| 4 | Genuinely thin path with one minor luxury that could be deferred | +| 5 | The absolute minimum — a user can complete the crudest possible journey, nothing more | > "Build a skeleton that can walk before building a perfect leg." — Jeff Patton @@ -250,37 +250,37 @@ Does Layer 0 represent the thinnest possible end-to-end user journey? 
Does each layer add meaningful new user capability in the right order? -| Score | Definition | -|-------|-----------| -| 1 | Layers are arbitrary groupings with no clear progression of user value | -| 2 | Some layers add user value, but ordering doesn't match priority | -| 3 | Layers generally progress from core to enrichment, with 1-2 misplacements | -| 4 | Clear value progression — each layer unlocks a meaningful new user capability | -| 5 | Optimal ordering — users get the highest-value capabilities earliest, each layer builds naturally on the previous | +| Score | Definition | +| ----- | ----------------------------------------------------------------------------------------------------------------- | +| 1 | Layers are arbitrary groupings with no clear progression of user value | +| 2 | Some layers add user value, but ordering doesn't match priority | +| 3 | Layers generally progress from core to enrichment, with 1-2 misplacements | +| 4 | Clear value progression — each layer unlocks a meaningful new user capability | +| 5 | Optimal ordering — users get the highest-value capabilities earliest, each layer builds naturally on the previous | ### Vision Alignment (1–5) Do all stories trace back to the opportunity brief and product vision? 
-| Score | Definition | -|-------|-----------| -| 1 | Multiple stories serve no user need — pure technical infrastructure or scope creep | -| 2 | Most stories serve the vision but some are "nice to have" that crept in | -| 3 | All stories connect to user needs but some are indirect | -| 4 | Clear traceability from each story to the opportunity brief | -| 5 | Every story directly serves a stated user need, with explicit mapping to JTBD | +| Score | Definition | +| ----- | ---------------------------------------------------------------------------------- | +| 1 | Multiple stories serve no user need — pure technical infrastructure or scope creep | +| 2 | Most stories serve the vision but some are "nice to have" that crept in | +| 3 | All stories connect to user needs but some are indirect | +| 4 | Clear traceability from each story to the opportunity brief | +| 5 | Every story directly serves a stated user need, with explicit mapping to JTBD | ### INVEST Compliance (1–5) Do stories follow the INVEST criteria (Independent, Negotiable, Valuable, Estimable, Small, Testable)? 
-| Score | Definition | -|-------|-----------| -| 1 | Stories are coupled, vague, and untestable — they are task lists, not stories | -| 2 | Some stories meet INVEST but many are too large or have hidden dependencies | -| 3 | Most stories are independent and testable but some are oversized or bundled | -| 4 | Stories are well-formed with minor violations (e.g., one L story that should be split) | -| 5 | Every story is independent, delivers observable value, has clear acceptance criteria, and is right-sized | +| Score | Definition | +| ----- | -------------------------------------------------------------------------------------------------------- | +| 1 | Stories are coupled, vague, and untestable — they are task lists, not stories | +| 2 | Some stories meet INVEST but many are too large or have hidden dependencies | +| 3 | Most stories are independent and testable but some are oversized or bundled | +| 4 | Stories are well-formed with minor violations (e.g., one L story that should be split) | +| 5 | Every story is independent, delivers observable value, has clear acceptance criteria, and is right-sized | ### Story Mapping Hard Failure Thresholds @@ -296,23 +296,23 @@ When evaluating structured documents (RFCs, migration plans, runbooks): ### Accuracy (1–5) -| Score | Definition | -|-------|-----------| -| 1 | Multiple factual errors — wrong file paths, incorrect API signatures, outdated versions | -| 2 | Some claims incorrect or unverifiable | -| 3 | Mostly accurate with a few unverified claims | -| 4 | All verifiable claims checked and correct | -| 5 | Every claim verified against current codebase/documentation | +| Score | Definition | +| ----- | --------------------------------------------------------------------------------------- | +| 1 | Multiple factual errors — wrong file paths, incorrect API signatures, outdated versions | +| 2 | Some claims incorrect or unverifiable | +| 3 | Mostly accurate with a few unverified claims | +| 4 | All verifiable claims 
checked and correct | +| 5 | Every claim verified against current codebase/documentation | ### Executability (1–5) -| Score | Definition | -|-------|-----------| -| 1 | Cannot be followed — missing steps, wrong commands, undefined prerequisites | -| 2 | Followable with significant guesswork required | -| 3 | Can be followed but some steps need interpretation | -| 4 | Clear step-by-step with minor assumptions | -| 5 | Fully executable — every command correct, every prerequisite listed | +| Score | Definition | +| ----- | --------------------------------------------------------------------------- | +| 1 | Cannot be followed — missing steps, wrong commands, undefined prerequisites | +| 2 | Followable with significant guesswork required | +| 3 | Can be followed but some steps need interpretation | +| 4 | Clear step-by-step with minor assumptions | +| 5 | Fully executable — every command correct, every prerequisite listed | ### Completeness (1–5) @@ -386,14 +386,14 @@ Generator must fix both issues before re-evaluation. These are common evaluator failure modes — watch for them: -| Anti-Pattern | What Happens | Why It's Wrong | -|-------------|-------------|---------------| -| **Surface testing** | Only test the happy path | Bugs hide in error paths and edge cases | -| **Rationalization** | "This is probably fine because..." | If you found a problem, score it honestly | -| **Score inflation** | Everything gets 4-5 | Compare against scale definitions, not gut feel | -| **Scope creep** | "It would be nice if..." | Only evaluate against the spec, not wishlist items | -| **Premature approval** | Passing after finding only minor issues | Minor issues compound — evaluate the whole surface first | -| **Self-persuasion** | Identifying a problem then arguing it away | The problem exists. 
Score accordingly | +| Anti-Pattern | What Happens | Why It's Wrong | +| ---------------------- | ------------------------------------------ | -------------------------------------------------------- | +| **Surface testing** | Only test the happy path | Bugs hide in error paths and edge cases | +| **Rationalization** | "This is probably fine because..." | If you found a problem, score it honestly | +| **Score inflation** | Everything gets 4-5 | Compare against scale definitions, not gut feel | +| **Scope creep** | "It would be nice if..." | Only evaluate against the spec, not wishlist items | +| **Premature approval** | Passing after finding only minor issues | Minor issues compound — evaluate the whole surface first | +| **Self-persuasion** | Identifying a problem then arguing it away | The problem exists. Score accordingly | --- @@ -415,10 +415,12 @@ Create a `validation-criteria.md` in your project's `.dev-workflow/` directory: # Project Validation Criteria ## Additional dimensions + - API Design: Check for consistent naming, proper status codes, error format - Data Privacy: Verify PII handling, encryption, deletion cascade ## Project-specific hard failures + - Any endpoint missing Zod validation → Security FAIL - Any database change missing migration → Completeness FAIL ``` diff --git a/skills/product-context/_shared/references/telemetry-ingestion.md b/skills/product-context/_shared/references/telemetry-ingestion.md new file mode 100644 index 0000000..0f91bc1 --- /dev/null +++ b/skills/product-context/_shared/references/telemetry-ingestion.md @@ -0,0 +1,94 @@ +# Telemetry Ingestion & Outcome Auto-Evaluation + +How `/aep-reflect` (and `/aep-watch`) pull real-world signals automatically, and +how a layer's **quantitative** outcome contract is evaluated without a human. +This augments the interactive reflect flow — it never replaces human review by +default. (Gap G5.) 
+ +> **Authoring note:** this file is canonical in +> `skills/product-context/_shared/references/`; `scripts/build-skills.sh` +> materializes it into each consuming skill's `references/`. + +--- + +## 1. Automated source ingestion + +Pull from read-only sources with `bash`/`curl`/`jq` and reduce each to the +**normalized observation record** the reflect Step 2 classifier consumes: + +```json +{ + "source": "error_stream | analytics | monitoring | bug_tracker", + "signal": "one-line description of what was observed", + "evidence": "url | query | sample (no secrets)", + "story_ref": "", + "suggested_class": "bug | refinement | discovery | opportunity_shift | process | null" +} +``` + +`suggested_class` is a hint only — the reflect Step 2 classifier (and the human, +unless `full_auto`) makes the final call. Ingested records are **merged** with +interactive input before classification; automation augments, never replaces. + +### Source config + +Endpoints live under `topology.routing.telemetry_sources` (a list). Each entry: + +```yaml +telemetry_sources: + - kind: error_stream # error_stream | analytics | monitoring | bug_tracker + endpoint: "https://…/api/…?since={since}" # {since} = last-ingest high-water mark + token_env: SENTRY_TOKEN # NAME of an env var / secret — never the secret itself + metric_map: # for analytics/monitoring: outcome-metric name → query + activation_rate: "SELECT … " +``` + +**Safety:** access is **read-only**; reference credentials by env-var / secret-store +name only — **never embed secrets in the repo or in `product-context.yaml`**. + +--- + +## 2. Outcome-contract auto-evaluation + +A layer's `outcome_contract` carries a `success_metric` (`type` + `target`) and a +`decision_rule` (`keep_if` / `otherwise`). 
Evaluate per +`topology.routing.auto_outcome_eval`: + +| Metric `type` | `auto_outcome_eval: quantitative` | default (`none`) | +| ---------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------ | +| **quantitative** (numeric, measurable from a source) | fetch actual value via the matching `telemetry_sources` query, apply `keep_if`/`otherwise` mechanically, record result — **no pause** | human pause (current behavior) | +| **qualitative** | human pause — **unless** `full_auto: true` (then agent-judgment auto-eval) | human pause | + +On a **fetch failure or ambiguity**, fall back to the human pause (fail safe, not +fail open). Record every auto-evaluation in the `changelog`: + +```yaml +- date: YYYY-MM-DD + type: outcome_evaluation + summary: "Layer N: = vs target → passed|failed (auto)" +``` + +--- + +## 3. `full_auto` interaction (A1) + +`topology.routing.full_auto` (default **false**) is the master switch. It only +changes the **qualitative** path: + +| `full_auto` | `auto_outcome_eval` | quantitative outcome | qualitative outcome | +| --------------- | ---------------------- | -------------------- | ---------------------------- | +| false (default) | none | human pause | human pause | +| false | quantitative | auto-eval | human pause | +| true | (implied quantitative) | auto-eval | **agent-judgment auto-eval** | + +Default keeps humans in the loop; only an explicit `full_auto: true` removes the +qualitative pause. 
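
The two tables above reduce to a small decision function. The following is a minimal sketch, not the actual AEP implementation: the function name `resolve_outcome_path`, the `fetch_ok` flag, and the returned labels are hypothetical names chosen for illustration. It assumes that `full_auto: true` implies quantitative auto-evaluation, as the "(implied quantitative)" row suggests, and that any fetch failure or unknown metric type falls back to the human pause.

```python
def resolve_outcome_path(metric_type, auto_outcome_eval="none",
                         full_auto=False, fetch_ok=True):
    """Return which evaluation path a layer's outcome contract takes.

    Hypothetical helper: the labels "auto_eval", "agent_judgment" and
    "human_pause" are illustrative, not part of any AEP schema.
    """
    if metric_type == "quantitative":
        # Assumption: full_auto implies quantitative auto-evaluation.
        auto_enabled = full_auto or auto_outcome_eval == "quantitative"
        # Fail safe, not fail open: a failed or ambiguous fetch pauses.
        if auto_enabled and fetch_ok:
            return "auto_eval"
        return "human_pause"
    if metric_type == "qualitative":
        # Only an explicit full_auto removes the qualitative pause.
        return "agent_judgment" if full_auto else "human_pause"
    # Unknown metric type: treat as ambiguous and keep the human in the loop.
    return "human_pause"


if __name__ == "__main__":
    for metric in ("quantitative", "qualitative"):
        for mode in ("none", "quantitative"):
            print(metric, mode, resolve_outcome_path(metric, mode))
```

Keeping the fallback as the final return means any configuration the function does not recognise lands on the conservative path, which matches the default described above.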
+ +--- + +## Cross-references + +- `/aep-reflect` Step 1 (Gather Feedback) and Step 2.75 (Evaluate Outcome Contracts) +- `/aep-watch` (reuses the normalized observation record for its ingest step) +- `aep-autopilot` `references/tick-protocol.md` — Step ⑥ Layer Completion (what the + auto-eval lets advance without a pause) diff --git a/skills/product-context/_shared/templates/product-context-schema.yaml b/skills/product-context/_shared/templates/product-context-schema.yaml index b431c2f..684e7f5 100644 --- a/skills/product-context/_shared/templates/product-context-schema.yaml +++ b/skills/product-context/_shared/templates/product-context-schema.yaml @@ -410,7 +410,7 @@ topology: post_merge_guard: # G4a — watch merged stories' deploy health (see autopilot/references/post-merge-guard.md) window_min: 15 auto_revert: false # conservative default: warn + escalate only; true = auto `gh pr revert` on confirmed regression - health_signals: [] # e.g. ["ci", "error_rate", "health_endpoint"] + health_signals: [] # e.g. ["ci_status", "error_rate", "health_endpoint"] telemetry_sources: [] # G5 — read-only error-log/analytics/monitoring sources (reference env/secret store; never embed secrets) watch: # G6 /aep-watch self-feeding discovery sources: [] diff --git a/skills/product-context/dispatch/references/telemetry-ingestion.md b/skills/product-context/dispatch/references/telemetry-ingestion.md new file mode 100644 index 0000000..0f91bc1 --- /dev/null +++ b/skills/product-context/dispatch/references/telemetry-ingestion.md @@ -0,0 +1,94 @@ +# Telemetry Ingestion & Outcome Auto-Evaluation + +How `/aep-reflect` (and `/aep-watch`) pull real-world signals automatically, and +how a layer's **quantitative** outcome contract is evaluated without a human. +This augments the interactive reflect flow — it never replaces human review by +default. (Gap G5.) 
+ +> **Authoring note:** this file is canonical in +> `skills/product-context/_shared/references/`; `scripts/build-skills.sh` +> materializes it into each consuming skill's `references/`. + +--- + +## 1. Automated source ingestion + +Pull from read-only sources with `bash`/`curl`/`jq` and reduce each to the +**normalized observation record** the reflect Step 2 classifier consumes: + +```json +{ + "source": "error_stream | analytics | monitoring | bug_tracker", + "signal": "one-line description of what was observed", + "evidence": "url | query | sample (no secrets)", + "story_ref": "", + "suggested_class": "bug | refinement | discovery | opportunity_shift | process | null" +} +``` + +`suggested_class` is a hint only — the reflect Step 2 classifier (and the human, +unless `full_auto`) makes the final call. Ingested records are **merged** with +interactive input before classification; automation augments, never replaces. + +### Source config + +Endpoints live under `topology.routing.telemetry_sources` (a list). Each entry: + +```yaml +telemetry_sources: + - kind: error_stream # error_stream | analytics | monitoring | bug_tracker + endpoint: "https://…/api/…?since={since}" # {since} = last-ingest high-water mark + token_env: SENTRY_TOKEN # NAME of an env var / secret — never the secret itself + metric_map: # for analytics/monitoring: outcome-metric name → query + activation_rate: "SELECT … " +``` + +**Safety:** access is **read-only**; reference credentials by env-var / secret-store +name only — **never embed secrets in the repo or in `product-context.yaml`**. + +--- + +## 2. Outcome-contract auto-evaluation + +A layer's `outcome_contract` carries a `success_metric` (`type` + `target`) and a +`decision_rule` (`keep_if` / `otherwise`). 
Evaluate per +`topology.routing.auto_outcome_eval`: + +| Metric `type` | `auto_outcome_eval: quantitative` | default (`none`) | +| ---------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------ | +| **quantitative** (numeric, measurable from a source) | fetch actual value via the matching `telemetry_sources` query, apply `keep_if`/`otherwise` mechanically, record result — **no pause** | human pause (current behavior) | +| **qualitative** | human pause — **unless** `full_auto: true` (then agent-judgment auto-eval) | human pause | + +On a **fetch failure or ambiguity**, fall back to the human pause (fail safe, not +fail open). Record every auto-evaluation in the `changelog`: + +```yaml +- date: YYYY-MM-DD + type: outcome_evaluation + summary: "Layer N: = vs target → passed|failed (auto)" +``` + +--- + +## 3. `full_auto` interaction (A1) + +`topology.routing.full_auto` (default **false**) is the master switch. It only +changes the **qualitative** path: + +| `full_auto` | `auto_outcome_eval` | quantitative outcome | qualitative outcome | +| --------------- | ---------------------- | -------------------- | ---------------------------- | +| false (default) | none | human pause | human pause | +| false | quantitative | auto-eval | human pause | +| true | (implied quantitative) | auto-eval | **agent-judgment auto-eval** | + +Default keeps humans in the loop; only an explicit `full_auto: true` removes the +qualitative pause. 
+ +--- + +## Cross-references + +- `/aep-reflect` Step 1 (Gather Feedback) and Step 2.75 (Evaluate Outcome Contracts) +- `/aep-watch` (reuses the normalized observation record for its ingest step) +- `aep-autopilot` `references/tick-protocol.md` — Step ⑥ Layer Completion (what the + auto-eval lets advance without a pause) diff --git a/skills/product-context/envision/references/telemetry-ingestion.md b/skills/product-context/envision/references/telemetry-ingestion.md new file mode 100644 index 0000000..0f91bc1 --- /dev/null +++ b/skills/product-context/envision/references/telemetry-ingestion.md @@ -0,0 +1,94 @@ +# Telemetry Ingestion & Outcome Auto-Evaluation + +How `/aep-reflect` (and `/aep-watch`) pull real-world signals automatically, and +how a layer's **quantitative** outcome contract is evaluated without a human. +This augments the interactive reflect flow — it never replaces human review by +default. (Gap G5.) + +> **Authoring note:** this file is canonical in +> `skills/product-context/_shared/references/`; `scripts/build-skills.sh` +> materializes it into each consuming skill's `references/`. + +--- + +## 1. Automated source ingestion + +Pull from read-only sources with `bash`/`curl`/`jq` and reduce each to the +**normalized observation record** the reflect Step 2 classifier consumes: + +```json +{ + "source": "error_stream | analytics | monitoring | bug_tracker", + "signal": "one-line description of what was observed", + "evidence": "url | query | sample (no secrets)", + "story_ref": "", + "suggested_class": "bug | refinement | discovery | opportunity_shift | process | null" +} +``` + +`suggested_class` is a hint only — the reflect Step 2 classifier (and the human, +unless `full_auto`) makes the final call. Ingested records are **merged** with +interactive input before classification; automation augments, never replaces. + +### Source config + +Endpoints live under `topology.routing.telemetry_sources` (a list). 
Each entry: + +```yaml +telemetry_sources: + - kind: error_stream # error_stream | analytics | monitoring | bug_tracker + endpoint: "https://…/api/…?since={since}" # {since} = last-ingest high-water mark + token_env: SENTRY_TOKEN # NAME of an env var / secret — never the secret itself + metric_map: # for analytics/monitoring: outcome-metric name → query + activation_rate: "SELECT … " +``` + +**Safety:** access is **read-only**; reference credentials by env-var / secret-store +name only — **never embed secrets in the repo or in `product-context.yaml`**. + +--- + +## 2. Outcome-contract auto-evaluation + +A layer's `outcome_contract` carries a `success_metric` (`type` + `target`) and a +`decision_rule` (`keep_if` / `otherwise`). Evaluate per +`topology.routing.auto_outcome_eval`: + +| Metric `type` | `auto_outcome_eval: quantitative` | default (`none`) | +| ---------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------ | +| **quantitative** (numeric, measurable from a source) | fetch actual value via the matching `telemetry_sources` query, apply `keep_if`/`otherwise` mechanically, record result — **no pause** | human pause (current behavior) | +| **qualitative** | human pause — **unless** `full_auto: true` (then agent-judgment auto-eval) | human pause | + +On a **fetch failure or ambiguity**, fall back to the human pause (fail safe, not +fail open). Record every auto-evaluation in the `changelog`: + +```yaml +- date: YYYY-MM-DD + type: outcome_evaluation + summary: "Layer N: = vs target → passed|failed (auto)" +``` + +--- + +## 3. `full_auto` interaction (A1) + +`topology.routing.full_auto` (default **false**) is the master switch. 
It only +changes the **qualitative** path: + +| `full_auto` | `auto_outcome_eval` | quantitative outcome | qualitative outcome | +| --------------- | ---------------------- | -------------------- | ---------------------------- | +| false (default) | none | human pause | human pause | +| false | quantitative | auto-eval | human pause | +| true | (implied quantitative) | auto-eval | **agent-judgment auto-eval** | + +Default keeps humans in the loop; only an explicit `full_auto: true` removes the +qualitative pause. + +--- + +## Cross-references + +- `/aep-reflect` Step 1 (Gather Feedback) and Step 2.75 (Evaluate Outcome Contracts) +- `/aep-watch` (reuses the normalized observation record for its ingest step) +- `aep-autopilot` `references/tick-protocol.md` — Step ⑥ Layer Completion (what the + auto-eval lets advance without a pause) diff --git a/skills/product-context/envision/templates/product-context-schema.yaml b/skills/product-context/envision/templates/product-context-schema.yaml index b431c2f..684e7f5 100644 --- a/skills/product-context/envision/templates/product-context-schema.yaml +++ b/skills/product-context/envision/templates/product-context-schema.yaml @@ -410,7 +410,7 @@ topology: post_merge_guard: # G4a — watch merged stories' deploy health (see autopilot/references/post-merge-guard.md) window_min: 15 auto_revert: false # conservative default: warn + escalate only; true = auto `gh pr revert` on confirmed regression - health_signals: [] # e.g. ["ci", "error_rate", "health_endpoint"] + health_signals: [] # e.g. 
["ci_status", "error_rate", "health_endpoint"] telemetry_sources: [] # G5 — read-only error-log/analytics/monitoring sources (reference env/secret store; never embed secrets) watch: # G6 /aep-watch self-feeding discovery sources: [] diff --git a/skills/product-context/map/references/telemetry-ingestion.md b/skills/product-context/map/references/telemetry-ingestion.md new file mode 100644 index 0000000..0f91bc1 --- /dev/null +++ b/skills/product-context/map/references/telemetry-ingestion.md @@ -0,0 +1,94 @@ +# Telemetry Ingestion & Outcome Auto-Evaluation + +How `/aep-reflect` (and `/aep-watch`) pull real-world signals automatically, and +how a layer's **quantitative** outcome contract is evaluated without a human. +This augments the interactive reflect flow — it never replaces human review by +default. (Gap G5.) + +> **Authoring note:** this file is canonical in +> `skills/product-context/_shared/references/`; `scripts/build-skills.sh` +> materializes it into each consuming skill's `references/`. + +--- + +## 1. Automated source ingestion + +Pull from read-only sources with `bash`/`curl`/`jq` and reduce each to the +**normalized observation record** the reflect Step 2 classifier consumes: + +```json +{ + "source": "error_stream | analytics | monitoring | bug_tracker", + "signal": "one-line description of what was observed", + "evidence": "url | query | sample (no secrets)", + "story_ref": "", + "suggested_class": "bug | refinement | discovery | opportunity_shift | process | null" +} +``` + +`suggested_class` is a hint only — the reflect Step 2 classifier (and the human, +unless `full_auto`) makes the final call. Ingested records are **merged** with +interactive input before classification; automation augments, never replaces. + +### Source config + +Endpoints live under `topology.routing.telemetry_sources` (a list). 
Each entry: + +```yaml +telemetry_sources: + - kind: error_stream # error_stream | analytics | monitoring | bug_tracker + endpoint: "https://…/api/…?since={since}" # {since} = last-ingest high-water mark + token_env: SENTRY_TOKEN # NAME of an env var / secret — never the secret itself + metric_map: # for analytics/monitoring: outcome-metric name → query + activation_rate: "SELECT … " +``` + +**Safety:** access is **read-only**; reference credentials by env-var / secret-store +name only — **never embed secrets in the repo or in `product-context.yaml`**. + +--- + +## 2. Outcome-contract auto-evaluation + +A layer's `outcome_contract` carries a `success_metric` (`type` + `target`) and a +`decision_rule` (`keep_if` / `otherwise`). Evaluate per +`topology.routing.auto_outcome_eval`: + +| Metric `type` | `auto_outcome_eval: quantitative` | default (`none`) | +| ---------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------ | +| **quantitative** (numeric, measurable from a source) | fetch actual value via the matching `telemetry_sources` query, apply `keep_if`/`otherwise` mechanically, record result — **no pause** | human pause (current behavior) | +| **qualitative** | human pause — **unless** `full_auto: true` (then agent-judgment auto-eval) | human pause | + +On a **fetch failure or ambiguity**, fall back to the human pause (fail safe, not +fail open). Record every auto-evaluation in the `changelog`: + +```yaml +- date: YYYY-MM-DD + type: outcome_evaluation + summary: "Layer N: = vs target → passed|failed (auto)" +``` + +--- + +## 3. `full_auto` interaction (A1) + +`topology.routing.full_auto` (default **false**) is the master switch. 
It only +changes the **qualitative** path: + +| `full_auto` | `auto_outcome_eval` | quantitative outcome | qualitative outcome | +| --------------- | ---------------------- | -------------------- | ---------------------------- | +| false (default) | none | human pause | human pause | +| false | quantitative | auto-eval | human pause | +| true | (implied quantitative) | auto-eval | **agent-judgment auto-eval** | + +Default keeps humans in the loop; only an explicit `full_auto: true` removes the +qualitative pause. + +--- + +## Cross-references + +- `/aep-reflect` Step 1 (Gather Feedback) and Step 2.75 (Evaluate Outcome Contracts) +- `/aep-watch` (reuses the normalized observation record for its ingest step) +- `aep-autopilot` `references/tick-protocol.md` — Step ⑥ Layer Completion (what the + auto-eval lets advance without a pause) diff --git a/skills/product-context/map/templates/product-context-schema.yaml b/skills/product-context/map/templates/product-context-schema.yaml index b431c2f..684e7f5 100644 --- a/skills/product-context/map/templates/product-context-schema.yaml +++ b/skills/product-context/map/templates/product-context-schema.yaml @@ -410,7 +410,7 @@ topology: post_merge_guard: # G4a — watch merged stories' deploy health (see autopilot/references/post-merge-guard.md) window_min: 15 auto_revert: false # conservative default: warn + escalate only; true = auto `gh pr revert` on confirmed regression - health_signals: [] # e.g. ["ci", "error_rate", "health_endpoint"] + health_signals: [] # e.g. 
["ci_status", "error_rate", "health_endpoint"] telemetry_sources: [] # G5 — read-only error-log/analytics/monitoring sources (reference env/secret store; never embed secrets) watch: # G6 /aep-watch self-feeding discovery sources: [] diff --git a/skills/product-context/reflect/references/telemetry-ingestion.md b/skills/product-context/reflect/references/telemetry-ingestion.md new file mode 100644 index 0000000..0f91bc1 --- /dev/null +++ b/skills/product-context/reflect/references/telemetry-ingestion.md @@ -0,0 +1,94 @@ +# Telemetry Ingestion & Outcome Auto-Evaluation + +How `/aep-reflect` (and `/aep-watch`) pull real-world signals automatically, and +how a layer's **quantitative** outcome contract is evaluated without a human. +This augments the interactive reflect flow — it never replaces human review by +default. (Gap G5.) + +> **Authoring note:** this file is canonical in +> `skills/product-context/_shared/references/`; `scripts/build-skills.sh` +> materializes it into each consuming skill's `references/`. + +--- + +## 1. Automated source ingestion + +Pull from read-only sources with `bash`/`curl`/`jq` and reduce each to the +**normalized observation record** the reflect Step 2 classifier consumes: + +```json +{ + "source": "error_stream | analytics | monitoring | bug_tracker", + "signal": "one-line description of what was observed", + "evidence": "url | query | sample (no secrets)", + "story_ref": "", + "suggested_class": "bug | refinement | discovery | opportunity_shift | process | null" +} +``` + +`suggested_class` is a hint only — the reflect Step 2 classifier (and the human, +unless `full_auto`) makes the final call. Ingested records are **merged** with +interactive input before classification; automation augments, never replaces. + +### Source config + +Endpoints live under `topology.routing.telemetry_sources` (a list). 
Each entry: + +```yaml +telemetry_sources: + - kind: error_stream # error_stream | analytics | monitoring | bug_tracker + endpoint: "https://…/api/…?since={since}" # {since} = last-ingest high-water mark + token_env: SENTRY_TOKEN # NAME of an env var / secret — never the secret itself + metric_map: # for analytics/monitoring: outcome-metric name → query + activation_rate: "SELECT … " +``` + +**Safety:** access is **read-only**; reference credentials by env-var / secret-store +name only — **never embed secrets in the repo or in `product-context.yaml`**. + +--- + +## 2. Outcome-contract auto-evaluation + +A layer's `outcome_contract` carries a `success_metric` (`type` + `target`) and a +`decision_rule` (`keep_if` / `otherwise`). Evaluate per +`topology.routing.auto_outcome_eval`: + +| Metric `type` | `auto_outcome_eval: quantitative` | default (`none`) | +| ---------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------ | +| **quantitative** (numeric, measurable from a source) | fetch actual value via the matching `telemetry_sources` query, apply `keep_if`/`otherwise` mechanically, record result — **no pause** | human pause (current behavior) | +| **qualitative** | human pause — **unless** `full_auto: true` (then agent-judgment auto-eval) | human pause | + +On a **fetch failure or ambiguity**, fall back to the human pause (fail safe, not +fail open). Record every auto-evaluation in the `changelog`: + +```yaml +- date: YYYY-MM-DD + type: outcome_evaluation + summary: "Layer N: = vs target → passed|failed (auto)" +``` + +--- + +## 3. `full_auto` interaction (A1) + +`topology.routing.full_auto` (default **false**) is the master switch. 
It only +changes the **qualitative** path: + +| `full_auto` | `auto_outcome_eval` | quantitative outcome | qualitative outcome | +| --------------- | ---------------------- | -------------------- | ---------------------------- | +| false (default) | none | human pause | human pause | +| false | quantitative | auto-eval | human pause | +| true | (implied quantitative) | auto-eval | **agent-judgment auto-eval** | + +Default keeps humans in the loop; only an explicit `full_auto: true` removes the +qualitative pause. + +--- + +## Cross-references + +- `/aep-reflect` Step 1 (Gather Feedback) and Step 2.75 (Evaluate Outcome Contracts) +- `/aep-watch` (reuses the normalized observation record for its ingest step) +- `aep-autopilot` `references/tick-protocol.md` — Step ⑥ Layer Completion (what the + auto-eval lets advance without a pause) diff --git a/skills/product-context/watch/SKILL.md b/skills/product-context/watch/SKILL.md index 6f82372..3266589 100644 --- a/skills/product-context/watch/SKILL.md +++ b/skills/product-context/watch/SKILL.md @@ -23,7 +23,7 @@ sources → [ /aep-watch: pull → classify → dedupe → write stories ] → p `/aep-reflect` is the **human-in-the-loop** feedback classifier you run after shipping. `/aep-watch` is its **always-on** sibling: same classification logic, -no human prompting each finding — it is what makes the loop *continuous*. +no human prompting each finding — it is what makes the loop _continuous_. **Where this fits:** @@ -75,19 +75,19 @@ Watch is driven entirely by `topology.routing.watch` in `product-context.yaml`: ```yaml topology: routing: - full_auto: false # A1 master switch (see below) + full_auto: false # A1 master switch (see below) watch: - sources: # what to pull from — see references/telemetry-ingestion.md - - type: bug_tracker # e.g. github_issues, linear, jira, sentry, datadog, log_stream + sources: # what to pull from — see references/telemetry-ingestion.md + - type: bug_tracker # e.g. 
github_issues, linear, jira, sentry, datadog, log_stream query: "is:open label:bug" - type: error_stream dsn: "" - type: telemetry metric: "error_rate" threshold: 0.02 - interval: 30m # poll cadence for the /loop or cron driver - auto_create: false # write stories directly vs. surface proposals - since: null # high-water mark — last ingested timestamp (watch maintains this) + interval: 30m # poll cadence for the /loop or cron driver + auto_create: false # write stories directly vs. surface proposals + since: null # high-water mark — last ingested timestamp (watch maintains this) ``` **Confirmation policy (default conservative):** @@ -129,11 +129,11 @@ Step 1 draws on) — do not invent a new finding shape here. Each finding normal ```yaml - source: "sentry" - external_id: "ISSUE-4821" # stable id used for dedupe + external_id: "ISSUE-4821" # stable id used for dedupe title: "TypeError in checkout flow" - detail: "..." # stack/message/metric summary - signal: error_stream # bug_tracker | error_stream | telemetry - count: 142 # occurrences / affected users (priority input) + detail: "..." # stack/message/metric summary + signal: error_stream # bug_tracker | error_stream | telemetry + count: 142 # occurrences / affected users (priority input) first_seen: "" last_seen: "" ``` @@ -149,13 +149,13 @@ duplicate that logic here**; apply `/aep-reflect`'s "Classify Each Observation" rules (see `../reflect/SKILL.md` → Step 2). Watch only acts autonomously on the two categories it can safely turn into work: -| Classification | Watch action | -| --------------------- | ------------------------------------------------------------------------- | -| **Bug** | Create a bug story (Step 4). | -| **Refinement** | Create a refinement story in the next layer (Step 4). | -| **Discovery** | Do NOT auto-create. Surface for `/aep-reflect` → `/aep-envision`/`/aep-map`. | -| **Opportunity shift** | Do NOT auto-create. Always escalate to a human — this changes the bet. 
| -| **Process / Calibration** | Do NOT auto-create. Surface for `/aep-reflect`. | +| Classification | Watch action | +| ------------------------- | ---------------------------------------------------------------------------- | +| **Bug** | Create a bug story (Step 4). | +| **Refinement** | Create a refinement story in the next layer (Step 4). | +| **Discovery** | Do NOT auto-create. Surface for `/aep-reflect` → `/aep-envision`/`/aep-map`. | +| **Opportunity shift** | Do NOT auto-create. Always escalate to a human — this changes the bet. | +| **Process / Calibration** | Do NOT auto-create. Surface for `/aep-reflect`. | Discoveries, opportunity shifts, calibrations, and process findings **always** go to a human regardless of `full_auto` — they change product intent or workflow, @@ -183,11 +183,11 @@ For each surviving **bug** / **refinement** finding, build a story: - id: "watch--" title: "" description: " (auto-discovered by /aep-watch from )" - type: bug # or refinement + type: bug # or refinement status: pending - priority: high # bugs: high; tune by count/severity (see below) - layer: # bug → current layer; refinement → next layer - module: # leave unset if the source doesn't localize it + priority: high # bugs: high; tune by count/severity (see below) + layer: # bug → current layer; refinement → next layer + module: # leave unset if the source doesn't localize it watch_origin: source: "" external_id: "" diff --git a/skills/product-context/watch/references/.aep-generated b/skills/product-context/watch/references/.aep-generated new file mode 100644 index 0000000..146dc47 --- /dev/null +++ b/skills/product-context/watch/references/.aep-generated @@ -0,0 +1 @@ +Generated by scripts/build-skills.sh from skills/product-context/_shared/. Do not edit; edit _shared/ and rebuild. 
diff --git a/skills/product-context/watch/references/orchestration-patterns.md b/skills/product-context/watch/references/orchestration-patterns.md new file mode 100644 index 0000000..b48b249 --- /dev/null +++ b/skills/product-context/watch/references/orchestration-patterns.md @@ -0,0 +1,203 @@ +# Orchestration Patterns + +Detailed patterns for the control plane's orchestrator — state management, context assembly, layer gating, and failure handling. Read this when setting up or debugging the execution pipeline. + +--- + +## Work Graph as State Machine + +The work graph is a live state machine. Every story node holds a status and transitions based on events. + +### State Transitions + +``` +pending → ready (all dependency stories reach 'completed') +ready → in_progress (orchestrator dispatches to agent) +in_progress → in_review (agent submits PR) +in_review → completed (verification passes) +in_review → in_progress (verification fails, retry initiated) +in_progress → failed (retry limit exceeded, escalated) +pending → blocked (a dependency story enters 'failed') +any → deferred (user explicitly postpones) +``` + +### Orchestrator Loop + +The orchestrator is event-driven, not polling-based: + +1. **Event received** (story completed, PR submitted, verification result, failure). +2. **Update state** of the affected story in the work graph. +3. **Cascade check**: Does this transition unlock new stories? (completed → check dependents). Does it block stories? (failed → mark dependents as blocked). +4. **Dispatch**: For each newly `ready` story, run conflict detection, assemble context, dispatch to agent per routing rules. +5. **Layer check**: Are all stories in the current layer `completed`? If yes, trigger Integration Gate. +6. **Alert check**: Any cost anomalies? Any critical path blockages? Notify user if needed. + +### Concurrency Control + +- Maximum parallel agents is configurable. Start with 5–10. 
+- Two stories with overlapping "Files Likely Affected" must not run in parallel — serialize them. +- If two parallel stories produce merge conflicts, the later PR rebases on the merged one and re-verifies. + +--- + +## Context Assembly + +### The Problem Context Assembly Solves + +An agent's output quality is directly proportional to the relevance and precision of its input context. Too little context → the agent guesses. Too much context → the agent gets confused or hits token limits. Context assembly is the art of giving each agent exactly what it needs and nothing more. + +### Assembly Rules + +For each agent role, the Agent Topology document defines a **context window composition** — the ordered list of what goes in. The orchestrator follows this list mechanically: + +1. **Read the composition spec** for the target agent role. +2. **Prune the Context Document** to the sections listed in the spec. +3. **Extract the relevant System Map slice** — the story's module and its adjacent interfaces only. Do not include unrelated modules. +4. **Collect dependency artifacts** — for each completed dependency, extract the public interface (types, exports, API surface). Do not include internal implementation unless the composition spec explicitly requires it. +5. **Validate the package** — all required fields present, no references to missing artifacts. +6. **Measure the package** — if it exceeds the target token budget for the role, escalate for manual pruning or split the story. + +### Common Assembly Failures + +- **Missing dependency artifact**: A dependency is marked `completed` but its output artifact is not found. This usually means the previous agent's output contract was not enforced. Fix: add post-completion validation in the handoff contract. +- **Stale interface contract**: The System Map was amended but the context package still references the old version. Fix: always read interface contracts from the latest System Map, not from cached copies. 
+- **Context overflow**: The assembled package exceeds the agent's token budget. Fix: either prune more aggressively (summarize dependency artifacts instead of including full source) or split the story into smaller units. + +--- + +## Layer Gating + +### Gate Design + +Each layer has an Integration Gate — tests that verify stories work together. The gate is NOT the sum of individual story tests. It tests emergent behavior at integration boundaries. + +**Layer 0 gate** is the most important test in the pipeline. It executes the exact user journey from the Context Document's Layer 0 MVP Contract. If the walking skeleton doesn't work end-to-end, something is architecturally wrong. + +**Subsequent layer gates** test: + +1. All previous layer journeys still work (regression). +2. New capabilities added in this layer work end-to-end. +3. Interface contracts honored under realistic conditions (not just mocks). + +### Gate Failure Protocol + +``` +Gate fails + → Identify failure boundary (which module interface) + → Check: implementation vs contract mismatch? + → Implementation wrong: create fix story → Phase 4 + → Contract wrong: trigger Architecture Review → Phase 2 + → Assess impact on completed stories + → May require re-execution of affected stories +``` + +Gate failure on a contract issue is the most expensive failure in the pipeline because it can invalidate already-completed work. This is why Phase 2 (System Map approval) is a human-reviewed gate — catching contract errors early prevents cascading rework. + +--- + +## Failure Handling + +### Why Fresh-Agent Retry Works + +When an agent fails and retries, it carries the full reasoning trajectory from its first attempt. If that trajectory led to a dead end, the retry often follows the same path — the agent is stuck in its own logic. A fresh agent receives only the structured failure log, not the reasoning. It approaches the problem without the stuck trajectory. 
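
The handoff described above can be sketched roughly as follows. This is a minimal illustration, not the orchestrator's actual code: the `FailureLog` field names follow the Failure Log Schema in this file, while `AttemptRecord`, `FreshAgentBrief`, and `buildFreshAgentBrief` are hypothetical names introduced only for this example. The point it shows is that the fresh agent's input is derived from the structured log alone, and the failed attempt's reasoning transcript is never copied across.

```typescript
// Illustrative sketch only. FailureLog field names follow the Failure Log Schema
// in this file; AttemptRecord, FreshAgentBrief, and buildFreshAgentBrief are
// hypothetical names introduced for this example.
interface FailureLog {
  story_id: string;
  attempt_number: number;
  agent_role: string;
  approach_summary: string;
  failure_point: string;
  error_output: string;
  hypothesis: string;
  not_tried: string[];
}

// What the orchestrator holds after a failed attempt (assumed shape).
interface AttemptRecord {
  failureLog: FailureLog;
  reasoningTranscript: string; // full trajectory of the failed attempt
}

// What the fresh agent is given (assumed shape).
interface FreshAgentBrief {
  storyId: string;
  attemptNumber: number;
  alreadyTried: string;
  failedAt: string;
  errorOutput: string;
  priorHypothesis: string;
  startingPoints: string[];
}

function buildFreshAgentBrief(previous: AttemptRecord): FreshAgentBrief {
  const log = previous.failureLog;
  // previous.reasoningTranscript is deliberately not read here: the fresh
  // agent should not inherit the trajectory that led to the dead end.
  return {
    storyId: log.story_id,
    attemptNumber: log.attempt_number + 1,
    alreadyTried: log.approach_summary,
    failedAt: log.failure_point,
    errorOutput: log.error_output,
    priorHypothesis: log.hypothesis,
    // Approaches the previous agent considered but did not attempt.
    startingPoints: [...log.not_tried],
  };
}
```

Keeping the brief a plain data object makes it easy to check, before dispatch, that nothing from the transcript leaked into it.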
+ +The **failure log's "what was NOT tried" field** is the highest-value signal for the fresh agent. It provides starting points the previous agent considered but did not explore. + +### Failure Log Schema + +``` +{ + story_id: string, + attempt_number: number, + agent_role: string, + + approach_summary: string, // What the agent tried to do + failure_point: string, // Which verification step failed + error_output: string, // Exact error messages or test failures + hypothesis: string, // Agent's best guess about root cause + not_tried: string[], // Alternative approaches considered but not attempted + + context_issues?: string, // Any problems with the context package + time_spent_seconds: number, + tokens_used: number +} +``` + +### Cascade Prevention + +When a story fails: + +1. Mark direct dependents as `blocked`. +2. Continue executing non-blocked stories in the same layer. +3. If the failed story is on the **critical path** → alert user immediately (entire layer is blocked). +4. If NOT on critical path → other work continues. User addresses failure asynchronously. +5. When the failed story is eventually resolved (fixed or deferred), unblock dependents and resume normal dispatch. + +### Escalation Format + +When a story reaches human escalation, present: + +1. The story spec (what was being attempted). +2. All failure logs from all attempts (what happened). +3. The fresh agent's failure log specifically (the most informed analysis). +4. Current impact: which stories are blocked, is this on the critical path? +5. Suggested options: fix the story, simplify the story, defer it, or modify the architecture. + +--- + +## State Persistence + +The orchestrator's state must survive crashes. + +### Storage Options + +- **File-based (JSON in repo)**: Simple, version-controlled. Sufficient for most MVP projects. Limitation: does not support concurrent orchestrators. +- **SQLite**: Supports querying ("show all failed stories") and concurrent access. 
Better for larger projects. +- **External store (Redis, Postgres)**: For production-grade orchestration with multiple concurrent sessions. + +For MVP-stage projects, start with JSON in the repo. Upgrade when the limitation matters. + +### State Snapshot Schema + +``` +{ + project_id: string, + current_layer: number, + stories: { + [story_id]: { + status: "pending" | "ready" | "in_progress" | "in_review" | "completed" | "failed" | "blocked" | "deferred", + assigned_agent?: string, + attempt_count: number, + last_updated: ISO8601, + failure_logs?: FailureLog[], + pr_url?: string, + completed_at?: ISO8601 + } + }, + layer_gates: { + [layer_number]: { + status: "not_started" | "running" | "passed" | "failed", + test_results?: TestResult[], + completed_at?: ISO8601 + } + }, + cost_summary: { + total_cost_usd: number, + cost_by_layer: { [layer]: number }, + cost_by_role: { [role]: number }, + cost_by_story: { [story_id]: number } + }, + last_updated: ISO8601 +} +``` + +### State Inspection + +The user should be able to query the current state at any time: + +- Progress per layer: completed / in_progress / pending / failed / blocked +- Critical path status: what is the next bottleneck? +- Cost breakdown: where is the money going? +- Blocked stories: what is waiting on what? + +Provide a simple CLI command or dashboard that reads the state file and renders this overview. diff --git a/skills/product-context/watch/references/telemetry-ingestion.md b/skills/product-context/watch/references/telemetry-ingestion.md new file mode 100644 index 0000000..0f91bc1 --- /dev/null +++ b/skills/product-context/watch/references/telemetry-ingestion.md @@ -0,0 +1,94 @@ +# Telemetry Ingestion & Outcome Auto-Evaluation + +How `/aep-reflect` (and `/aep-watch`) pull real-world signals automatically, and +how a layer's **quantitative** outcome contract is evaluated without a human. +This augments the interactive reflect flow — it never replaces human review by +default. (Gap G5.) 
+ +> **Authoring note:** this file is canonical in +> `skills/product-context/_shared/references/`; `scripts/build-skills.sh` +> materializes it into each consuming skill's `references/`. + +--- + +## 1. Automated source ingestion + +Pull from read-only sources with `bash`/`curl`/`jq` and reduce each to the +**normalized observation record** the reflect Step 2 classifier consumes: + +```json +{ + "source": "error_stream | analytics | monitoring | bug_tracker", + "signal": "one-line description of what was observed", + "evidence": "url | query | sample (no secrets)", + "story_ref": "", + "suggested_class": "bug | refinement | discovery | opportunity_shift | process | null" +} +``` + +`suggested_class` is a hint only — the reflect Step 2 classifier (and the human, +unless `full_auto`) makes the final call. Ingested records are **merged** with +interactive input before classification; automation augments, never replaces. + +### Source config + +Endpoints live under `topology.routing.telemetry_sources` (a list). Each entry: + +```yaml +telemetry_sources: + - kind: error_stream # error_stream | analytics | monitoring | bug_tracker + endpoint: "https://…/api/…?since={since}" # {since} = last-ingest high-water mark + token_env: SENTRY_TOKEN # NAME of an env var / secret — never the secret itself + metric_map: # for analytics/monitoring: outcome-metric name → query + activation_rate: "SELECT … " +``` + +**Safety:** access is **read-only**; reference credentials by env-var / secret-store +name only — **never embed secrets in the repo or in `product-context.yaml`**. + +--- + +## 2. Outcome-contract auto-evaluation + +A layer's `outcome_contract` carries a `success_metric` (`type` + `target`) and a +`decision_rule` (`keep_if` / `otherwise`). 
Evaluate per +`topology.routing.auto_outcome_eval`: + +| Metric `type` | `auto_outcome_eval: quantitative` | default (`none`) | +| ---------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------ | +| **quantitative** (numeric, measurable from a source) | fetch actual value via the matching `telemetry_sources` query, apply `keep_if`/`otherwise` mechanically, record result — **no pause** | human pause (current behavior) | +| **qualitative** | human pause — **unless** `full_auto: true` (then agent-judgment auto-eval) | human pause | + +On a **fetch failure or ambiguity**, fall back to the human pause (fail safe, not +fail open). Record every auto-evaluation in the `changelog`: + +```yaml +- date: YYYY-MM-DD + type: outcome_evaluation + summary: "Layer N: = vs target → passed|failed (auto)" +``` + +--- + +## 3. `full_auto` interaction (A1) + +`topology.routing.full_auto` (default **false**) is the master switch. It only +changes the **qualitative** path: + +| `full_auto` | `auto_outcome_eval` | quantitative outcome | qualitative outcome | +| --------------- | ---------------------- | -------------------- | ---------------------------- | +| false (default) | none | human pause | human pause | +| false | quantitative | auto-eval | human pause | +| true | (implied quantitative) | auto-eval | **agent-judgment auto-eval** | + +Default keeps humans in the loop; only an explicit `full_auto: true` removes the +qualitative pause. 
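
The two tables above can be read as a single decision function. The sketch below is one possible encoding, under stated assumptions: `resolveEvalPath`, `RoutingFlags`, and the `EvalPath` labels are hypothetical names for illustration, and the `fetchOk` parameter stands in for "the telemetry fetch succeeded and was unambiguous". Only the flag names `full_auto` and `auto_outcome_eval` and their defaults come from this document.

```typescript
// Illustrative sketch only. Flag names and defaults come from this document;
// resolveEvalPath, RoutingFlags, and the EvalPath labels are hypothetical.
type MetricType = "quantitative" | "qualitative";
type AutoOutcomeEval = "none" | "quantitative";
type EvalPath = "auto_eval" | "agent_judgment_auto_eval" | "human_pause";

interface RoutingFlags {
  full_auto?: boolean; // default false
  auto_outcome_eval?: AutoOutcomeEval; // default "none"
}

function resolveEvalPath(
  routing: RoutingFlags,
  metricType: MetricType,
  fetchOk: boolean = true,
): EvalPath {
  const fullAuto = routing.full_auto ?? false;
  // full_auto implies quantitative auto-evaluation.
  const autoEval: AutoOutcomeEval = fullAuto
    ? "quantitative"
    : (routing.auto_outcome_eval ?? "none");

  if (metricType === "quantitative") {
    if (autoEval !== "quantitative") return "human_pause";
    // Fail safe, not fail open: a failed or ambiguous fetch goes to a human.
    return fetchOk ? "auto_eval" : "human_pause";
  }

  // Qualitative metrics pause for a human unless full_auto is explicitly on.
  return fullAuto ? "agent_judgment_auto_eval" : "human_pause";
}
```

With no flags set, every call returns `human_pause`, which matches the human-in-the-loop default described above.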
+ +--- + +## Cross-references + +- `/aep-reflect` Step 1 (Gather Feedback) and Step 2.75 (Evaluate Outcome Contracts) +- `/aep-watch` (reuses the normalized observation record for its ingest step) +- `aep-autopilot` `references/tick-protocol.md` — Step ⑥ Layer Completion (what the + auto-eval lets advance without a pause) diff --git a/skills/product-context/watch/references/yaml-guardrails.md b/skills/product-context/watch/references/yaml-guardrails.md new file mode 100644 index 0000000..4e455d7 --- /dev/null +++ b/skills/product-context/watch/references/yaml-guardrails.md @@ -0,0 +1,112 @@ +# YAML Guardrails for product-context.yaml + +Every skill that writes to `product-context.yaml` must validate the file before committing. Invalid YAML silently breaks the dashboard and blocks all downstream consumers. + +## Validation Command + +Run this after every edit to `product-context.yaml`: + +```bash +npx js-yaml product-context.yaml > /dev/null && echo "YAML OK" +``` + +If the project has the `@agentic-engineering-patterns/api` package, use the actual loader for deeper validation (Zod schema + preprocessing): + +```bash +npx tsx -e " + const { loadProductContext } = require('@agentic-engineering-patterns/api/lib/product-context-loader'); + loadProductContext(process.env.PRODUCT_CONTEXT_PATH || './product-context.yaml'); + console.log('YAML + schema OK'); +" +``` + +**If validation fails, fix the YAML before committing.** Do not commit broken YAML under any circumstances. + +## Common YAML Pitfalls in product-context.yaml + +These are the patterns that most frequently break the parser when agents write to the file. + +### 1. List items ending with a colon + +A trailing colon makes YAML interpret the item as a mapping key. If the next lines are indented, YAML expects a value — and fails. 
+ +```yaml +# BROKEN — YAML treats this as a mapping key +acceptance_criteria: + - Generate page redesigned for multi-step video workflow: + - Intent prompt input + - Multi-step progress display + +# FIXED — quote the entire item, flatten sub-items +acceptance_criteria: + - "Generate page redesigned for multi-step video workflow: intent prompt input, multi-step progress display" +``` + +**Rule:** Never end a list item with `:` followed by indented sub-items. Either quote the item or flatten the sub-list. + +### 2. Embedded double quotes inside list items + +YAML interprets `"text"` as a quoted string boundary. Content after the closing quote is invalid. + +```yaml +# BROKEN — YAML sees "Complete Your Profile" as the full string, then chokes on the rest +- "Complete Your Profile" guard includes link to /profile + +# FIXED — wrap in double quotes, use single quotes inside +- "'Complete Your Profile' guard includes link to /profile" + +# ALSO FIXED — escape inner quotes +- "\"Complete Your Profile\" guard includes link to /profile" +``` + +**Rule:** If a list item contains embedded double quotes, wrap the entire value in double quotes and use single quotes (or escaped quotes) inside. + +### 3. Colons in the middle of list items + +A colon followed by a space (`: `) triggers YAML key-value parsing. + +```yaml +# BROKEN — YAML tries to parse "Dashboard" as a key +- Dashboard: creator dashboard showing recent generations + +# WORKS (preprocessor handles this) — but quoting is safer +- "Dashboard: creator dashboard showing recent generations" +``` + +**Rule:** The `preprocessYaml` function in the loader auto-quotes most of these, but when writing new content, prefer explicit quoting for items containing `: `. + +### 4. 
Special characters: @, {, } + +```yaml +# BROKEN — @ is a YAML tag indicator, { starts a flow mapping +- @mention the user +- Use {variable} interpolation + +# FIXED +- "@mention the user" +- "Use {variable} interpolation" +``` + +**Rule:** Quote list items containing `@`, `{`, or `}`. + +### 5. Nested sub-lists under string items + +YAML list items are scalar values — they cannot have children unless the item is a mapping key. + +```yaml +# BROKEN — a string item cannot have sub-items +- Main feature description + - Sub-feature A + - Sub-feature B + +# FIXED — flatten into one item or use a mapping structure +- "Main feature description: Sub-feature A, Sub-feature B" +``` + +## Pre-commit Checklist + +Before committing any change to `product-context.yaml`: + +1. Run the validation command above +2. If adding `acceptance_criteria`, `description`, or any free-text list: scan for colons, quotes, and special characters +3. If the validation command is not available (e.g., no Node.js), at minimum review list items for the patterns above diff --git a/skills/project-setup/onboard/SKILL.md b/skills/project-setup/onboard/SKILL.md index 3391eba..f9c8e71 100644 --- a/skills/project-setup/onboard/SKILL.md +++ b/skills/project-setup/onboard/SKILL.md @@ -13,7 +13,7 @@ Set up your environment for agentic TypeScript development AND get oriented to h > **Returning user?** If you've run `/aep-onboard` before and you're just re-verifying your environment, skip to Phase 1. -Before installing tools, get the mental model. AEP is not a "command runner" — it's a workflow that separates _thinking_ (what to build) from _doing_ (building it). Installing the tools without understanding this will leave you staring at a blank terminal wondering which of 16 skills to run first. +Before installing tools, get the mental model. AEP is not a "command runner" — it's a workflow that separates _thinking_ (what to build) from _doing_ (building it). 
Installing the tools without understanding this will leave you staring at a blank terminal wondering which of 17 skills to run first. **The three mental models you need:** @@ -25,7 +25,7 @@ Before installing tools, get the mental model. AEP is not a "command runner" — **v2 split-mode (good to know):** Some projects store product context in two files — `product/index.yaml` (stable intent: opportunity, personas, capabilities, constraints) + `product-context.yaml` (mutable state: architecture, stories, cost, changelog). All skills auto-detect which mode a project uses. If you see only `product-context.yaml`, that's v1 single-file mode and it works exactly the same way. See [docs/aep-v2-improvement-guideline.md](../../../docs/aep-v2-improvement-guideline.md). -**Next step:** for the full 10-minute first-hour guide — including a table of all 16 skills, four concrete paths (new product / existing project / single feature / hands-free), and a glossary shortlist — read **[docs/orientation.md](../../../docs/orientation.md)**. Then come back to Phase 1. +**Next step:** for the full 10-minute first-hour guide — including a table of all 17 skills, four concrete paths (new product / existing project / single feature / hands-free), and a glossary shortlist — read **[docs/orientation.md](../../../docs/orientation.md)**. Then come back to Phase 1. --- @@ -364,7 +364,7 @@ Pointers for going deeper. 
None of these are required reading — check what's r **Mental models & concepts** -- [docs/orientation.md](../../../docs/orientation.md) — the canonical first-hour guide (mental models + 16 skills + four paths) +- [docs/orientation.md](../../../docs/orientation.md) — the canonical first-hour guide (mental models + 17 skills + four paths) - [README.md "Why This Exists"](../../../README.md#why-this-exists) — the full argument for spec-precision-over-execution-speed - [docs/glossary.md](../../../docs/glossary.md) — precise definitions for every AEP term (ubiquitous language) From c43288dd3aea4d763c3ce794a6e5aa000f585268 Mon Sep 17 00:00:00 2001 From: Memorysaver Date: Tue, 16 Jun 2026 00:22:15 +0800 Subject: [PATCH 7/8] =?UTF-8?q?chore:=20release=20v2.0.0=20(autonomy=20loo?= =?UTF-8?q?p=20=E2=80=94=20G2=E2=80=93G7=20+=20full=5Fauto)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit --- .claude-plugin/marketplace.json | 2 +- CHANGELOG.md | 45 +++++++++++++++++++++++++++++++++ 2 files changed, 46 insertions(+), 1 deletion(-) diff --git a/.claude-plugin/marketplace.json b/.claude-plugin/marketplace.json index d2ae612..b86d7f0 100644 --- a/.claude-plugin/marketplace.json +++ b/.claude-plugin/marketplace.json @@ -5,7 +5,7 @@ "email": "mho@looplia.run" }, "metadata": { - "version": "1.8.0", + "version": "2.0.0", "description": "Skills for product planning, project scaffolding, and agentic development workflows." }, "plugins": [ diff --git a/CHANGELOG.md b/CHANGELOG.md index d4f7f09..92b5355 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -21,6 +21,51 @@ bug fixes → **patch**; removing or breaking a skill contract → **major**. _Nothing yet._ +## [2.0.0] - 2026-06-16 + +The **autonomy loop** release. Closes the loop-engineering gaps identified in +`docs/research/loop-engineering-autonomy-gap.md` (G2–G7) and adds a `full_auto` +master switch. 
Every new capability defaults to **human-in-the-loop** — autonomy +is opt-in via `topology.routing` flags. + +### Added + +- **`/aep-watch` skill** (G6) — continuously ingests telemetry / error streams / + bug trackers, classifies findings with the `/aep-reflect` classifier, and + auto-files bug/refinement stories so reflect→dispatch becomes self-feeding. +- **Change-strategy recovery ladder** (G2) — `gen-eval/references/recovery-ladder.md`; + on repeated eval FAIL the build climbs same-fix → re-ground → fresh + `native-bg-subagent` generator → decompose **before** the `eval_not_converging` + human gate. +- **Host-aware post-deploy dogfood** (G4b) — `executor/references/dogfood-validation.md`: + `dogfood_method()` (Claude → agent-browser; Codex → native in-app browser / + computer-use, or Playwright headless) + `target_url()` (config-first, CI fallback). +- **Post-merge guard** (G4a) — `autopilot/references/post-merge-guard.md` + tick + Step ③.5: monitors merged stories' deploy health; dogfood issues → reflect story; + hard regression → conservative `auto_revert` (default off, warn + escalate). +- **Telemetry-driven reflect** (G5) — `reflect/references/telemetry-ingestion.md`: + automated source ingestion + quantitative outcome-contract auto-evaluation. +- **Visual Design evaluator dimension** (G3) — vision-model scoring of screenshots + against the design system, for both Claude and Codex (multimodal). +- **`full_auto` master switch** (A1) — `topology.routing.full_auto` (default false) + gates the strategic human pauses (design escalation, qualitative outcome eval); + implies `auto_design` + `auto_outcome_eval` + `watch.auto_create`. New config keys + added to the product-context schema. + +### Changed + +- `/aep-build` Phase 5 climbs the recovery ladder; Phase 6 dogfood is host-aware + (degrades instead of skipping when agent-browser is absent). 
+- `/aep-reflect` Step 1 supports automated ingestion; Step 2.75 auto-evaluates + quantitative outcome contracts (qualitative still pauses unless `full_auto`). +- `/aep-autopilot` gains the post-merge guard step and `full_auto`-aware routing; + loop hygiene unified on `--max-turns` (G7). + +### Fixed + +- Carries forward the v1.8.0 executor fix (claude-team removed; `native-bg-subagent` + default + post-spawn liveness probe). Every new spawn path uses it. + ## [1.8.0] - 2026-06-15 ### Changed From 94c76a2687afcc2782c11b96d1f6e9c8fa5e6b00 Mon Sep 17 00:00:00 2001 From: Memorysaver Date: Tue, 16 Jun 2026 01:02:59 +0800 Subject: [PATCH 8/8] feat(aep-v2): telemetry source determination (hybrid metric-driven + coverage guard) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Closes the v2 telemetry gap: consumers shipped without a way to decide/wire sources. - Coverage rule + coverage_check() helper in telemetry-ingestion.md (canonical _shared/references): a source is needed iff a quantitative success_metric or health_signal requires it. - /aep-map gains a Telemetry Binding step (the decision owner): bind each needed signal to a detected/declared source via metric_map; flag the unmeasurable. - /aep-scaffold audit detects the observability stack (Sentry/Datadog/PostHog/ OTel/health endpoint) → candidate telemetry_sources. - /aep-watch (Step 0 precondition), /aep-reflect Step 2.75, and post-merge guard run coverage_check() and BLOCK the auto path when the map binding is incomplete ("run /aep-map observability step") — never silently no-op. - schema documents telemetry_sources[].metric_map + the coverage rule. Folded into the unreleased v2.0.0 (PR #11). oxfmt + build-skills in sync. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- CHANGELOG.md | 6 +++ .../autopilot/references/post-merge-guard.md | 4 +- .../autopilot/references/tick-protocol.md | 2 +- .../_shared/references/telemetry-ingestion.md | 50 ++++++++++++++++++- .../templates/product-context-schema.yaml | 3 +- .../references/telemetry-ingestion.md | 50 ++++++++++++++++++- .../references/telemetry-ingestion.md | 50 ++++++++++++++++++- .../templates/product-context-schema.yaml | 3 +- skills/product-context/map/SKILL.md | 24 +++++++++ .../map/references/telemetry-ingestion.md | 50 ++++++++++++++++++- .../map/templates/product-context-schema.yaml | 3 +- skills/product-context/reflect/SKILL.md | 2 +- .../reflect/references/telemetry-ingestion.md | 50 ++++++++++++++++++- skills/product-context/watch/SKILL.md | 15 ++++++ .../watch/references/telemetry-ingestion.md | 50 ++++++++++++++++++- skills/project-setup/scaffold/SKILL.md | 20 ++++++++ 16 files changed, 364 insertions(+), 18 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 92b5355..5069874 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -45,6 +45,12 @@ is opt-in via `topology.routing` flags. hard regression → conservative `auto_revert` (default off, warn + escalate). - **Telemetry-driven reflect** (G5) — `reflect/references/telemetry-ingestion.md`: automated source ingestion + quantitative outcome-contract auto-evaluation. +- **Telemetry source determination** — projects decide sources via a hybrid + metric-driven rule: `/aep-scaffold`/`/aep-onboard` detect the observability stack + (candidate sources); `/aep-map` binds each quantitative `success_metric` + + `health_signal` to a source (`metric_map`); a shared `coverage_check()` lets + `/aep-watch`, `/aep-reflect`, and the post-merge guard **block auto when the + binding is incomplete** instead of silently no-op'ing. - **Visual Design evaluator dimension** (G3) — vision-model scoring of screenshots against the design system, for both Claude and Codex (multimodal). 
- **`full_auto` master switch** (A1) — `topology.routing.full_auto` (default false) diff --git a/skills/patterns/autopilot/references/post-merge-guard.md b/skills/patterns/autopilot/references/post-merge-guard.md index 60795b2..36427e3 100644 --- a/skills/patterns/autopilot/references/post-merge-guard.md +++ b/skills/patterns/autopilot/references/post-merge-guard.md @@ -59,7 +59,9 @@ Within the open window, each tick performs two independent reads: ### (a) Health signals -Read every signal named in `topology.routing.post_merge_guard.health_signals`. These are service-level, signals-only probes — no workspace code: +Read every signal named in `topology.routing.post_merge_guard.health_signals`. These are service-level, signals-only probes — no workspace code. + +> **Coverage precondition.** Run `coverage_check(health_signals)` (`../../../product-context/reflect/references/telemetry-ingestion.md` §1.5) first: a signal like `error_rate` / `latency_p95` that needs a metrics source must be **bound** (the `/aep-map` Telemetry Binding step wired a `telemetry_sources` entry / `health_url`). An **unbound** signal is reported as "telemetry binding incomplete — run /aep-map", **not** treated as green — never infer health from a signal you can't actually read. (`ci_status` / `health_endpoint` / `smoke_check` are self-describing and need no binding.) | Signal kind | How the orchestrator reads it (examples) | | ----------------- | ------------------------------------------------------------------------------------ | diff --git a/skills/patterns/autopilot/references/tick-protocol.md b/skills/patterns/autopilot/references/tick-protocol.md index aded908..9f79712 100644 --- a/skills/patterns/autopilot/references/tick-protocol.md +++ b/skills/patterns/autopilot/references/tick-protocol.md @@ -462,7 +462,7 @@ If all stories in the active layer are completed (after wraps): 1. Suggest running the layer gate integration test 2. If gate passes: update `layer_gates[layer].status: passed` 3. 
**Outcome contract check:** If `product.layers[active_layer].outcome_contract` exists, decide whether to auto-evaluate or pause: - - **Quantitative auto-eval:** If `topology.routing.auto_outcome_eval: quantitative` **and** the contract's metric is quantitative (a measurable threshold) → auto-evaluate it via `../../../product-context/reflect/references/telemetry-ingestion.md` (ingest the telemetry, compare against the threshold) and **advance without pausing** when it passes. If the metric is qualitative, fall through to the pause rule below. + - **Quantitative auto-eval:** If `topology.routing.auto_outcome_eval: quantitative` **and** the contract's metric is quantitative (a measurable threshold) → first run `coverage_check([metric])` (`../../../product-context/reflect/references/telemetry-ingestion.md` §1.5); if the metric isn't bound to a telemetry source (the `/aep-map` Telemetry Binding step wasn't done) → **pause** and escalate "run /aep-map observability step" (do not claim auto-coverage). If covered → auto-evaluate via the telemetry-ingestion recipe (ingest the telemetry, compare against the threshold) and **advance without pausing** when it passes. If the metric is qualitative, fall through to the pause rule below. - **Qualitative / default pause:** Otherwise (no `auto_outcome_eval`, a qualitative metric, etc.) → **pause** and add an escalation requesting the user to evaluate the outcome contract before advancing — **UNLESS** `topology.routing.full_auto: true`, in which case auto-evaluate via the telemetry-ingestion recipe and advance without pause. Outcome evaluation otherwise requires human judgment (user testing, analytics, qualitative assessment). The user runs `/aep-reflect` which evaluates outcome contracts in Step 2.75. After `/aep-reflect` completes, resume autopilot. - Default (no `auto_outcome_eval` / `full_auto` false) preserves the current human pause. 4. 
If no outcome contract or outcome evaluation passes: advance to next layer diff --git a/skills/product-context/_shared/references/telemetry-ingestion.md b/skills/product-context/_shared/references/telemetry-ingestion.md index 0f91bc1..3dca538 100644 --- a/skills/product-context/_shared/references/telemetry-ingestion.md +++ b/skills/product-context/_shared/references/telemetry-ingestion.md @@ -48,11 +48,57 @@ name only — **never embed secrets in the repo or in `product-context.yaml`**. --- +## 1.5 Deciding which sources to wire (the coverage rule) + +You don't list telemetry for its own sake — **a source is needed _iff_ some +declared signal requires it.** The decision is **hybrid**: + +1. **Metric-driven (what signals do we need?)** — enumerate every **quantitative** + `success_metric` across `product.layers[].outcome_contract` plus every + `topology.routing.post_merge_guard.health_signals` entry. That set _is_ the + demand for telemetry. +2. **Inventory (which tool provides each?)** — `/aep-scaffold`'s audit detects the + project's observability stack (Sentry, Datadog, PostHog, OpenTelemetry, log + drains, `/healthz`-style endpoints) and records **candidate** `telemetry_sources` + (kind + endpoint + `token_env`, no `metric_map` yet); you can also add candidates + by hand. +3. **Bind (`/aep-map`)** — for each needed signal, attach it to a candidate source + by adding a `metric_map: { : "" }` entry. A needed + signal with no measurable source is **flagged**, not ignored: make the metric + qualitative, or record it `unmeasured` — never leave a quantitative metric + silently un-sourced. + +### `coverage_check(needed)` — the guard helper + +Consumers that rely on telemetry (`/aep-watch`, `/aep-reflect` Step 2.75, +`/aep-autopilot`) call this **before** trusting auto behavior. 
It is pure +config inspection — no network: + +``` +coverage_check(needed_signals): + missing = [] + for sig in needed_signals: # quantitative success_metric names + health_signals + if no telemetry_sources[*].metric_map has key == sig + (and, for a health_signal, no source/endpoint provides it): + missing.append(sig) + return { covered: missing == [], missing } +``` + +**On `covered == false`:** surface +`"telemetry binding incomplete for — run /aep-map (observability step)"` +and **block the auto path** (watch refuses to claim auto-coverage; reflect falls +back to the human pause; autopilot pauses). Missing wiring must **block auto, +never silently no-op** — that's the v2 human-in-the-loop default. + +--- + ## 2. Outcome-contract auto-evaluation A layer's `outcome_contract` carries a `success_metric` (`type` + `target`) and a -`decision_rule` (`keep_if` / `otherwise`). Evaluate per -`topology.routing.auto_outcome_eval`: +`decision_rule` (`keep_if` / `otherwise`). **Precondition:** run +`coverage_check([success_metric])` (§1.5) first — if the metric isn't bound to a +source, take the human-pause path (the binding is incomplete; do not auto-eval). 
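
As a rough illustration of the `coverage_check` helper from §1.5, here is a minimal TypeScript sketch. It is written under assumptions rather than taken from the loader: the `TelemetrySource` shape mirrors the `telemetry_sources` config in this file, `coverageCheck` and `CoverageResult` are hypothetical names, and treating `ci_status`, `health_endpoint`, and `smoke_check` as needing no binding is one reading of the post-merge guard's note that those signals are self-describing.

```typescript
// Illustrative sketch only. TelemetrySource mirrors the telemetry_sources
// config in this file; coverageCheck and CoverageResult are hypothetical names.
interface TelemetrySource {
  kind: "error_stream" | "analytics" | "monitoring" | "bug_tracker";
  endpoint: string;
  token_env: string; // name of an env var, never the secret itself
  metric_map?: Record<string, string>;
}

interface CoverageResult {
  covered: boolean;
  missing: string[];
}

// Assumption: signals the post-merge guard describes as self-describing
// are treated as covered without a metric_map entry.
const SELF_DESCRIBING_SIGNALS = new Set<string>([
  "ci_status",
  "health_endpoint",
  "smoke_check",
]);

// Pure config inspection: no network calls are made.
function coverageCheck(
  neededSignals: string[],
  sources: TelemetrySource[],
): CoverageResult {
  const missing = neededSignals.filter((signal) => {
    if (SELF_DESCRIBING_SIGNALS.has(signal)) return false;
    const bound = sources.some(
      (source) =>
        source.metric_map !== undefined &&
        Object.prototype.hasOwnProperty.call(source.metric_map, signal),
    );
    return !bound;
  });
  return { covered: missing.length === 0, missing };
}
```

A caller would take the auto path only when `covered` is true, and otherwise surface the `missing` names in the "telemetry binding incomplete" message.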
+When covered, evaluate per `topology.routing.auto_outcome_eval`: | Metric `type` | `auto_outcome_eval: quantitative` | default (`none`) | | ---------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------ | diff --git a/skills/product-context/_shared/templates/product-context-schema.yaml b/skills/product-context/_shared/templates/product-context-schema.yaml index 684e7f5..12d6891 100644 --- a/skills/product-context/_shared/templates/product-context-schema.yaml +++ b/skills/product-context/_shared/templates/product-context-schema.yaml @@ -411,7 +411,8 @@ topology: window_min: 15 auto_revert: false # conservative default: warn + escalate only; true = auto `gh pr revert` on confirmed regression health_signals: [] # e.g. ["ci_status", "error_rate", "health_endpoint"] - telemetry_sources: [] # G5 — read-only error-log/analytics/monitoring sources (reference env/secret store; never embed secrets) + telemetry_sources: [] # G5 — read-only signal sources. Detected by /aep-scaffold audit (or set by hand); /aep-map binds each needed quantitative success_metric + health_signal via metric_map (coverage rule: reflect/references/telemetry-ingestion.md §1.5). token_env only — never embed secrets. + # - { kind: error_stream, endpoint: "https://…?since={since}", token_env: SENTRY_TOKEN, metric_map: { error_rate: "" } } watch: # G6 /aep-watch self-feeding discovery sources: [] interval: 30m diff --git a/skills/product-context/dispatch/references/telemetry-ingestion.md b/skills/product-context/dispatch/references/telemetry-ingestion.md index 0f91bc1..3dca538 100644 --- a/skills/product-context/dispatch/references/telemetry-ingestion.md +++ b/skills/product-context/dispatch/references/telemetry-ingestion.md @@ -48,11 +48,57 @@ name only — **never embed secrets in the repo or in `product-context.yaml`**. 
--- +## 1.5 Deciding which sources to wire (the coverage rule) + +You don't list telemetry for its own sake — **a source is needed _iff_ some +declared signal requires it.** The decision is **hybrid**: + +1. **Metric-driven (what signals do we need?)** — enumerate every **quantitative** + `success_metric` across `product.layers[].outcome_contract` plus every + `topology.routing.post_merge_guard.health_signals` entry. That set _is_ the + demand for telemetry. +2. **Inventory (which tool provides each?)** — `/aep-scaffold`'s audit detects the + project's observability stack (Sentry, Datadog, PostHog, OpenTelemetry, log + drains, `/healthz`-style endpoints) and records **candidate** `telemetry_sources` + (kind + endpoint + `token_env`, no `metric_map` yet); you can also add candidates + by hand. +3. **Bind (`/aep-map`)** — for each needed signal, attach it to a candidate source + by adding a `metric_map: { : "" }` entry. A needed + signal with no measurable source is **flagged**, not ignored: make the metric + qualitative, or record it `unmeasured` — never leave a quantitative metric + silently un-sourced. + +### `coverage_check(needed)` — the guard helper + +Consumers that rely on telemetry (`/aep-watch`, `/aep-reflect` Step 2.75, +`/aep-autopilot`) call this **before** trusting auto behavior. It is pure +config inspection — no network: + +``` +coverage_check(needed_signals): + missing = [] + for sig in needed_signals: # quantitative success_metric names + health_signals + if no telemetry_sources[*].metric_map has key == sig + (and, for a health_signal, no source/endpoint provides it): + missing.append(sig) + return { covered: missing == [], missing } +``` + +**On `covered == false`:** surface +`"telemetry binding incomplete for — run /aep-map (observability step)"` +and **block the auto path** (watch refuses to claim auto-coverage; reflect falls +back to the human pause; autopilot pauses). 
Missing wiring must **block auto, +never silently no-op** — that's the v2 human-in-the-loop default. + +--- + ## 2. Outcome-contract auto-evaluation A layer's `outcome_contract` carries a `success_metric` (`type` + `target`) and a -`decision_rule` (`keep_if` / `otherwise`). Evaluate per -`topology.routing.auto_outcome_eval`: +`decision_rule` (`keep_if` / `otherwise`). **Precondition:** run +`coverage_check([success_metric])` (§1.5) first — if the metric isn't bound to a +source, take the human-pause path (the binding is incomplete; do not auto-eval). +When covered, evaluate per `topology.routing.auto_outcome_eval`: | Metric `type` | `auto_outcome_eval: quantitative` | default (`none`) | | ---------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------ | diff --git a/skills/product-context/envision/references/telemetry-ingestion.md b/skills/product-context/envision/references/telemetry-ingestion.md index 0f91bc1..3dca538 100644 --- a/skills/product-context/envision/references/telemetry-ingestion.md +++ b/skills/product-context/envision/references/telemetry-ingestion.md @@ -48,11 +48,57 @@ name only — **never embed secrets in the repo or in `product-context.yaml`**. --- +## 1.5 Deciding which sources to wire (the coverage rule) + +You don't list telemetry for its own sake — **a source is needed _iff_ some +declared signal requires it.** The decision is **hybrid**: + +1. **Metric-driven (what signals do we need?)** — enumerate every **quantitative** + `success_metric` across `product.layers[].outcome_contract` plus every + `topology.routing.post_merge_guard.health_signals` entry. That set _is_ the + demand for telemetry. +2. 
**Inventory (which tool provides each?)** — `/aep-scaffold`'s audit detects the + project's observability stack (Sentry, Datadog, PostHog, OpenTelemetry, log + drains, `/healthz`-style endpoints) and records **candidate** `telemetry_sources` + (kind + endpoint + `token_env`, no `metric_map` yet); you can also add candidates + by hand. +3. **Bind (`/aep-map`)** — for each needed signal, attach it to a candidate source + by adding a `metric_map: { : "" }` entry. A needed + signal with no measurable source is **flagged**, not ignored: make the metric + qualitative, or record it `unmeasured` — never leave a quantitative metric + silently un-sourced. + +### `coverage_check(needed)` — the guard helper + +Consumers that rely on telemetry (`/aep-watch`, `/aep-reflect` Step 2.75, +`/aep-autopilot`) call this **before** trusting auto behavior. It is pure +config inspection — no network: + +``` +coverage_check(needed_signals): + missing = [] + for sig in needed_signals: # quantitative success_metric names + health_signals + if no telemetry_sources[*].metric_map has key == sig + (and, for a health_signal, no source/endpoint provides it): + missing.append(sig) + return { covered: missing == [], missing } +``` + +**On `covered == false`:** surface +`"telemetry binding incomplete for — run /aep-map (observability step)"` +and **block the auto path** (watch refuses to claim auto-coverage; reflect falls +back to the human pause; autopilot pauses). Missing wiring must **block auto, +never silently no-op** — that's the v2 human-in-the-loop default. + +--- + ## 2. Outcome-contract auto-evaluation A layer's `outcome_contract` carries a `success_metric` (`type` + `target`) and a -`decision_rule` (`keep_if` / `otherwise`). Evaluate per -`topology.routing.auto_outcome_eval`: +`decision_rule` (`keep_if` / `otherwise`). 
**Precondition:** run +`coverage_check([success_metric])` (§1.5) first — if the metric isn't bound to a +source, take the human-pause path (the binding is incomplete; do not auto-eval). +When covered, evaluate per `topology.routing.auto_outcome_eval`: | Metric `type` | `auto_outcome_eval: quantitative` | default (`none`) | | ---------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------ | diff --git a/skills/product-context/envision/templates/product-context-schema.yaml b/skills/product-context/envision/templates/product-context-schema.yaml index 684e7f5..12d6891 100644 --- a/skills/product-context/envision/templates/product-context-schema.yaml +++ b/skills/product-context/envision/templates/product-context-schema.yaml @@ -411,7 +411,8 @@ topology: window_min: 15 auto_revert: false # conservative default: warn + escalate only; true = auto `gh pr revert` on confirmed regression health_signals: [] # e.g. ["ci_status", "error_rate", "health_endpoint"] - telemetry_sources: [] # G5 — read-only error-log/analytics/monitoring sources (reference env/secret store; never embed secrets) + telemetry_sources: [] # G5 — read-only signal sources. Detected by /aep-scaffold audit (or set by hand); /aep-map binds each needed quantitative success_metric + health_signal via metric_map (coverage rule: reflect/references/telemetry-ingestion.md §1.5). token_env only — never embed secrets. 
+ # - { kind: error_stream, endpoint: "https://…?since={since}", token_env: SENTRY_TOKEN, metric_map: { error_rate: "" } } watch: # G6 /aep-watch self-feeding discovery sources: [] interval: 30m diff --git a/skills/product-context/map/SKILL.md b/skills/product-context/map/SKILL.md index 6dbacbb..e8a7441 100644 --- a/skills/product-context/map/SKILL.md +++ b/skills/product-context/map/SKILL.md @@ -133,6 +133,30 @@ For each layer that has an `outcome_contract` defined (see `product.layers[].out The outcome contract is evaluated by `/aep-reflect` after layer completion. It answers: "did this layer achieve what we hypothesized?" +### Telemetry Binding (observability) + +This is where the project **decides its telemetry sources** — metric-driven, then +inventory (see the coverage rule in `references/telemetry-ingestion.md` §1.5). + +1. **Collect the needed signals:** every **quantitative** `success_metric` + (`type` ∈ `task_completion_rate | time_to_complete | error_rate | satisfaction_score`) + across the layers, plus any `topology.routing.post_merge_guard.health_signals` + you intend to monitor. That set is the demand for telemetry. +2. **Bind each to a source:** start from the **candidate `telemetry_sources`** + detected by `/aep-scaffold`'s audit (or ask the user which tool provides + each — Sentry / Datadog / PostHog / analytics / health endpoint). For each needed + signal add a `metric_map: { : "" }` entry on the + matching source, and fill its `endpoint` + `token_env` (name only — never the + secret). +3. **Flag the unmeasurable:** a quantitative `success_metric` with no source either + becomes **qualitative** (it will pause for human judgment in `/aep-reflect`) or is + recorded `unmeasured` — never leave a quantitative metric silently un-sourced. + +Write the result to `topology.routing.telemetry_sources` (+ `health_signals`). 
+`/aep-reflect`, `/aep-watch`, and `/aep-autopilot` run `coverage_check()` against +this before trusting any auto path; an incomplete binding **blocks auto**, it does +not silently no-op. + ### Capability Maps (multi-journey products) If `product/index.yaml` exists (created by `/aep-envision` for multi-journey products), also write per-capability `map.yaml` files: diff --git a/skills/product-context/map/references/telemetry-ingestion.md b/skills/product-context/map/references/telemetry-ingestion.md index 0f91bc1..3dca538 100644 --- a/skills/product-context/map/references/telemetry-ingestion.md +++ b/skills/product-context/map/references/telemetry-ingestion.md @@ -48,11 +48,57 @@ name only — **never embed secrets in the repo or in `product-context.yaml`**. --- +## 1.5 Deciding which sources to wire (the coverage rule) + +You don't list telemetry for its own sake — **a source is needed _iff_ some +declared signal requires it.** The decision is **hybrid**: + +1. **Metric-driven (what signals do we need?)** — enumerate every **quantitative** + `success_metric` across `product.layers[].outcome_contract` plus every + `topology.routing.post_merge_guard.health_signals` entry. That set _is_ the + demand for telemetry. +2. **Inventory (which tool provides each?)** — `/aep-scaffold`'s audit detects the + project's observability stack (Sentry, Datadog, PostHog, OpenTelemetry, log + drains, `/healthz`-style endpoints) and records **candidate** `telemetry_sources` + (kind + endpoint + `token_env`, no `metric_map` yet); you can also add candidates + by hand. +3. **Bind (`/aep-map`)** — for each needed signal, attach it to a candidate source + by adding a `metric_map: { : "" }` entry. A needed + signal with no measurable source is **flagged**, not ignored: make the metric + qualitative, or record it `unmeasured` — never leave a quantitative metric + silently un-sourced. 
+ +### `coverage_check(needed)` — the guard helper + +Consumers that rely on telemetry (`/aep-watch`, `/aep-reflect` Step 2.75, +`/aep-autopilot`) call this **before** trusting auto behavior. It is pure +config inspection — no network: + +``` +coverage_check(needed_signals): + missing = [] + for sig in needed_signals: # quantitative success_metric names + health_signals + if no telemetry_sources[*].metric_map has key == sig + (and, for a health_signal, no source/endpoint provides it): + missing.append(sig) + return { covered: missing == [], missing } +``` + +**On `covered == false`:** surface +`"telemetry binding incomplete for — run /aep-map (observability step)"` +and **block the auto path** (watch refuses to claim auto-coverage; reflect falls +back to the human pause; autopilot pauses). Missing wiring must **block auto, +never silently no-op** — that's the v2 human-in-the-loop default. + +--- + ## 2. Outcome-contract auto-evaluation A layer's `outcome_contract` carries a `success_metric` (`type` + `target`) and a -`decision_rule` (`keep_if` / `otherwise`). Evaluate per -`topology.routing.auto_outcome_eval`: +`decision_rule` (`keep_if` / `otherwise`). **Precondition:** run +`coverage_check([success_metric])` (§1.5) first — if the metric isn't bound to a +source, take the human-pause path (the binding is incomplete; do not auto-eval). 
+When covered, evaluate per `topology.routing.auto_outcome_eval`: | Metric `type` | `auto_outcome_eval: quantitative` | default (`none`) | | ---------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------ | diff --git a/skills/product-context/map/templates/product-context-schema.yaml b/skills/product-context/map/templates/product-context-schema.yaml index 684e7f5..12d6891 100644 --- a/skills/product-context/map/templates/product-context-schema.yaml +++ b/skills/product-context/map/templates/product-context-schema.yaml @@ -411,7 +411,8 @@ topology: window_min: 15 auto_revert: false # conservative default: warn + escalate only; true = auto `gh pr revert` on confirmed regression health_signals: [] # e.g. ["ci_status", "error_rate", "health_endpoint"] - telemetry_sources: [] # G5 — read-only error-log/analytics/monitoring sources (reference env/secret store; never embed secrets) + telemetry_sources: [] # G5 — read-only signal sources. Detected by /aep-scaffold audit (or set by hand); /aep-map binds each needed quantitative success_metric + health_signal via metric_map (coverage rule: reflect/references/telemetry-ingestion.md §1.5). token_env only — never embed secrets. 
+ # - { kind: error_stream, endpoint: "https://…?since={since}", token_env: SENTRY_TOKEN, metric_map: { error_rate: "" } } watch: # G6 /aep-watch self-feeding discovery sources: [] interval: 30m diff --git a/skills/product-context/reflect/SKILL.md b/skills/product-context/reflect/SKILL.md index f9cf201..3df9259 100644 --- a/skills/product-context/reflect/SKILL.md +++ b/skills/product-context/reflect/SKILL.md @@ -144,7 +144,7 @@ If the completed layer has an `outcome_contract` defined in `product.layers[]`: **Auto-evaluation (optional, opt-in):** The pause above can be skipped per `references/telemetry-ingestion.md`: -- If `topology.routing.auto_outcome_eval: quantitative` **and** the success metric is quantitative (a numeric target measurable from analytics/monitoring) → fetch the actual value per `references/telemetry-ingestion.md`, apply `keep_if`/`otherwise` mechanically, and record the result in the changelog — no pause. If the metric can't be fetched, fall back to the human pause. +- If `topology.routing.auto_outcome_eval: quantitative` **and** the success metric is quantitative (a numeric target measurable from analytics/monitoring) → first run `coverage_check([metric])` (`references/telemetry-ingestion.md` §1.5): if the metric isn't bound to a telemetry source (the `/aep-map` Telemetry Binding step wasn't done), **fall back to the human pause** and note "run /aep-map observability step". If covered → fetch the actual value per `references/telemetry-ingestion.md`, apply `keep_if`/`otherwise` mechanically, and record the result in the changelog — no pause. (A fetch failure also falls back to the human pause.) - **Qualitative** metrics still pause for the human as described above — **unless** `topology.routing.full_auto: true`, in which case the agent evaluates the qualitative metric by its own judgment and applies the decision rule with no pause. - Default (`auto_outcome_eval: none`, `full_auto: false`) preserves the current human-in-the-loop behavior exactly. 
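The guarded auto-evaluation path described in the `reflect/SKILL.md` bullets above can be sketched in Python. This is a minimal illustration under stated assumptions, not the skill's actual implementation: the config is assumed to be already loaded into plain dicts, `outcome_eval_path` and its arguments are hypothetical names, the health-signal special case from the `coverage_check` pseudocode (a source or endpoint providing the signal without a `metric_map` key) is left out, and `full_auto` is not modeled.

```python
from __future__ import annotations

from typing import Any


def coverage_check(needed_signals: list[str],
                   telemetry_sources: list[dict[str, Any]]) -> dict[str, Any]:
    """Pure config inspection, no network.

    Simplification: a signal counts as covered only when some source's
    metric_map has it as a key (health-signal endpoints are not considered).
    """
    bound: set[str] = set()
    for source in telemetry_sources:
        # Candidate sources recorded by the audit may have no metric_map yet.
        bound.update((source.get("metric_map") or {}).keys())
    missing = [sig for sig in needed_signals if sig not in bound]
    return {"covered": not missing, "missing": missing}


def outcome_eval_path(routing: dict[str, Any],
                      metric_name: str,
                      metric_is_quantitative: bool) -> str:
    """Hypothetical helper: return "auto" or "human_pause" for one metric."""
    if routing.get("auto_outcome_eval", "none") != "quantitative":
        return "human_pause"  # default keeps the human in the loop
    if not metric_is_quantitative:
        return "human_pause"  # qualitative metrics still pause (full_auto not modeled)
    check = coverage_check([metric_name], routing.get("telemetry_sources", []))
    # An incomplete binding blocks the auto path instead of silently no-op'ing.
    return "auto" if check["covered"] else "human_pause"


# Illustrative config; the query string is a placeholder, not a real query.
routing = {
    "auto_outcome_eval": "quantitative",
    "telemetry_sources": [
        {"kind": "error_stream", "token_env": "SENTRY_TOKEN",
         "metric_map": {"error_rate": "example-query"}},
        {"kind": "analytics", "token_env": "POSTHOG_TOKEN"},  # candidate, unbound
    ],
}
print(outcome_eval_path(routing, "error_rate", True))
print(outcome_eval_path(routing, "time_to_complete", True))
```

With this sample config the first call takes the auto path because `error_rate` is bound, while `time_to_complete` falls back to the human pause because no source maps it. Returning the `missing` list alongside `covered` lets a caller name the unbound signals in the "telemetry binding incomplete" message.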
diff --git a/skills/product-context/reflect/references/telemetry-ingestion.md b/skills/product-context/reflect/references/telemetry-ingestion.md index 0f91bc1..3dca538 100644 --- a/skills/product-context/reflect/references/telemetry-ingestion.md +++ b/skills/product-context/reflect/references/telemetry-ingestion.md @@ -48,11 +48,57 @@ name only — **never embed secrets in the repo or in `product-context.yaml`**. --- +## 1.5 Deciding which sources to wire (the coverage rule) + +You don't list telemetry for its own sake — **a source is needed _iff_ some +declared signal requires it.** The decision is **hybrid**: + +1. **Metric-driven (what signals do we need?)** — enumerate every **quantitative** + `success_metric` across `product.layers[].outcome_contract` plus every + `topology.routing.post_merge_guard.health_signals` entry. That set _is_ the + demand for telemetry. +2. **Inventory (which tool provides each?)** — `/aep-scaffold`'s audit detects the + project's observability stack (Sentry, Datadog, PostHog, OpenTelemetry, log + drains, `/healthz`-style endpoints) and records **candidate** `telemetry_sources` + (kind + endpoint + `token_env`, no `metric_map` yet); you can also add candidates + by hand. +3. **Bind (`/aep-map`)** — for each needed signal, attach it to a candidate source + by adding a `metric_map: { : "" }` entry. A needed + signal with no measurable source is **flagged**, not ignored: make the metric + qualitative, or record it `unmeasured` — never leave a quantitative metric + silently un-sourced. + +### `coverage_check(needed)` — the guard helper + +Consumers that rely on telemetry (`/aep-watch`, `/aep-reflect` Step 2.75, +`/aep-autopilot`) call this **before** trusting auto behavior. 
It is pure +config inspection — no network: + +``` +coverage_check(needed_signals): + missing = [] + for sig in needed_signals: # quantitative success_metric names + health_signals + if no telemetry_sources[*].metric_map has key == sig + (and, for a health_signal, no source/endpoint provides it): + missing.append(sig) + return { covered: missing == [], missing } +``` + +**On `covered == false`:** surface +`"telemetry binding incomplete for — run /aep-map (observability step)"` +and **block the auto path** (watch refuses to claim auto-coverage; reflect falls +back to the human pause; autopilot pauses). Missing wiring must **block auto, +never silently no-op** — that's the v2 human-in-the-loop default. + +--- + ## 2. Outcome-contract auto-evaluation A layer's `outcome_contract` carries a `success_metric` (`type` + `target`) and a -`decision_rule` (`keep_if` / `otherwise`). Evaluate per -`topology.routing.auto_outcome_eval`: +`decision_rule` (`keep_if` / `otherwise`). **Precondition:** run +`coverage_check([success_metric])` (§1.5) first — if the metric isn't bound to a +source, take the human-pause path (the binding is incomplete; do not auto-eval). +When covered, evaluate per `topology.routing.auto_outcome_eval`: | Metric `type` | `auto_outcome_eval: quantitative` | default (`none`) | | ---------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------ | diff --git a/skills/product-context/watch/SKILL.md b/skills/product-context/watch/SKILL.md index 3266589..e792b5c 100644 --- a/skills/product-context/watch/SKILL.md +++ b/skills/product-context/watch/SKILL.md @@ -114,12 +114,27 @@ Each tick runs the same four-step body. **Idempotent** — re-running with no ne source data produces no new stories (the dedupe + `since` high-water mark guarantee it). 
``` +⓪ PRECHECK → verify the /aep-map telemetry binding is complete (coverage_check) ① PULL → fetch new findings from each configured source (since high-water mark) ② CLASSIFY → run each finding through the /aep-reflect Step 2 classifier ③ DEDUPE → drop findings that already map to an existing story ④ WRITE → create bug/refinement stories (or surface proposals) ``` +### Step 0: Precondition — verify the map binding + +`/aep-watch` consumes telemetry sources, so first confirm `/aep-map` actually +**bound** them — don't silently watch nothing. Run `coverage_check()` (the helper +in `references/telemetry-ingestion.md` §1.5) over the signals this watch needs: +each `topology.routing.watch.sources[]` entry (and any `metric`/`error_stream` it +relies on) must resolve to a wired `topology.routing.telemetry_sources` entry with +a `metric_map`. + +- **Covered** → proceed to Step 1. +- **Not covered** (sources empty, or a referenced metric has no `metric_map`) → + **do not claim auto-coverage.** Surface: + `"telemetry binding incomplete for — run /aep-map (Telemetry Binding step) before /aep-watch can ingest it"`, skip the uncovered sources, and (if nothing is covered) stop the tick with that message. A missing binding **blocks**; it never silently no-ops. + ### Step 1: Pull from Sources For each entry in `watch.sources`, pull findings created/updated since diff --git a/skills/product-context/watch/references/telemetry-ingestion.md b/skills/product-context/watch/references/telemetry-ingestion.md index 0f91bc1..3dca538 100644 --- a/skills/product-context/watch/references/telemetry-ingestion.md +++ b/skills/product-context/watch/references/telemetry-ingestion.md @@ -48,11 +48,57 @@ name only — **never embed secrets in the repo or in `product-context.yaml`**. --- +## 1.5 Deciding which sources to wire (the coverage rule) + +You don't list telemetry for its own sake — **a source is needed _iff_ some +declared signal requires it.** The decision is **hybrid**: + +1. 
**Metric-driven (what signals do we need?)** — enumerate every **quantitative** + `success_metric` across `product.layers[].outcome_contract` plus every + `topology.routing.post_merge_guard.health_signals` entry. That set _is_ the + demand for telemetry. +2. **Inventory (which tool provides each?)** — `/aep-scaffold`'s audit detects the + project's observability stack (Sentry, Datadog, PostHog, OpenTelemetry, log + drains, `/healthz`-style endpoints) and records **candidate** `telemetry_sources` + (kind + endpoint + `token_env`, no `metric_map` yet); you can also add candidates + by hand. +3. **Bind (`/aep-map`)** — for each needed signal, attach it to a candidate source + by adding a `metric_map: { : "" }` entry. A needed + signal with no measurable source is **flagged**, not ignored: make the metric + qualitative, or record it `unmeasured` — never leave a quantitative metric + silently un-sourced. + +### `coverage_check(needed)` — the guard helper + +Consumers that rely on telemetry (`/aep-watch`, `/aep-reflect` Step 2.75, +`/aep-autopilot`) call this **before** trusting auto behavior. It is pure +config inspection — no network: + +``` +coverage_check(needed_signals): + missing = [] + for sig in needed_signals: # quantitative success_metric names + health_signals + if no telemetry_sources[*].metric_map has key == sig + (and, for a health_signal, no source/endpoint provides it): + missing.append(sig) + return { covered: missing == [], missing } +``` + +**On `covered == false`:** surface +`"telemetry binding incomplete for — run /aep-map (observability step)"` +and **block the auto path** (watch refuses to claim auto-coverage; reflect falls +back to the human pause; autopilot pauses). Missing wiring must **block auto, +never silently no-op** — that's the v2 human-in-the-loop default. + +--- + ## 2. Outcome-contract auto-evaluation A layer's `outcome_contract` carries a `success_metric` (`type` + `target`) and a -`decision_rule` (`keep_if` / `otherwise`). 
Evaluate per -`topology.routing.auto_outcome_eval`: +`decision_rule` (`keep_if` / `otherwise`). **Precondition:** run +`coverage_check([success_metric])` (§1.5) first — if the metric isn't bound to a +source, take the human-pause path (the binding is incomplete; do not auto-eval). +When covered, evaluate per `topology.routing.auto_outcome_eval`: | Metric `type` | `auto_outcome_eval: quantitative` | default (`none`) | | ---------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------- | ------------------------------ | diff --git a/skills/project-setup/scaffold/SKILL.md b/skills/project-setup/scaffold/SKILL.md index 38b2b9b..1bc777e 100644 --- a/skills/project-setup/scaffold/SKILL.md +++ b/skills/project-setup/scaffold/SKILL.md @@ -629,10 +629,30 @@ grep -q '.dev-workflow/' .gitignore 2>/dev/null && echo "[x]" || echo "[ ] MISSI printf " %-45s" ".feature-workspaces/ in .gitignore:" grep -q '.feature-workspaces/' .gitignore 2>/dev/null && echo "[x]" || echo "[ ] MISSING" + +# Observability stack (candidate telemetry sources for /aep-map binding) +echo "--- Observability (telemetry source candidates) ---" +deps="$(cat package.json 2>/dev/null) $(cat pyproject.toml 2>/dev/null)" +for probe in "sentry:error_stream" "datadog:monitoring" "posthog:analytics" "amplitude:analytics" "@opentelemetry:monitoring" "newrelic:monitoring"; do + tool="${probe%%:*}"; kind="${probe##*:}" + printf " %-45s" "$tool ($kind):" + echo "$deps" | grep -qi "$tool" && echo "[detected]" || echo "[ ]" +done +printf " %-45s" "health endpoint (/healthz|/readyz|/health):" +grep -rqiE '/(healthz|readyz|health)\b' . --include='*.ts' --include='*.js' --include='*.py' 2>/dev/null && echo "[detected]" || echo "[ ]" ``` Show the user the results. Only proceed to fill gaps for items marked `[ ] MISSING`. 
+**Observability → telemetry candidates.** For each `[detected]` tool, record a +**candidate** entry under `topology.routing.telemetry_sources` (`kind` + a +`token_env` name for its API key — never the secret; leave `endpoint`/`metric_map` +for `/aep-map` to bind). These are just candidates: `/aep-map`'s Telemetry Binding +step ties each needed `success_metric` / `health_signal` to one of them (coverage +rule in `aep-reflect/references/telemetry-ingestion.md` §1.5). If nothing is +detected, that's fine — note it so `/aep-map` knows quantitative metrics may need a +tool added or must stay qualitative. + --- ## Phase 3E: Fill Gaps