From e02bd05bc9662e570ff4ec5fc35307679a601d02 Mon Sep 17 00:00:00 2001
From: Brian O'Kelley <bokelley@scope3.com>
Date: Sat, 6 Jun 2026 12:29:15 -0400
Subject: [PATCH] chore: strengthen Argus review checks

---
 .agents/roles/code-reviewer.md            | 13 +++++
 .claude/agents/code-reviewer.md           | 13 +++++
 .github/ai-review/expert-adcp-reviewer.md | 64 ++++++++++++++++-------
 .github/pull_request_template.md          |  3 ++
 4 files changed, 75 insertions(+), 18 deletions(-)

diff --git a/.agents/roles/code-reviewer.md b/.agents/roles/code-reviewer.md
index 10f34a7..f1c26af 100644
--- a/.agents/roles/code-reviewer.md
+++ b/.agents/roles/code-reviewer.md
@@ -19,6 +19,19 @@ You are a code reviewer for the AdCP monorepo. You review changes for:
 - Are there any TODO / FIXME / `console.log` / debug prints left in?
 - Does the PR title follow conventional-commits format?
 
+## High-risk checks
+
+- **Executable integration path:** For new builders, transport adapters, servlet bridges, connection managers, background workers, subscriptions, or third-party SDK integrations, identify the highest-level new entry point and verify that existing tests exercise it with the real downstream lifecycle components. If tests mock the queue/thread/subscription/lifecycle owner, call out the gap. For methods named like `start`, `ensureStarted`, `close`, `subscribe`, `flush`, `shutdown`, `request`, `complete`, or `cancel`, verify actual behavior from in-repo source, cited docs, or a real integration/manual smoke test; do not infer from the method name. If a pinned pre-1.0 dependency implementation is not available in repo context, require an in-repo integration test or PR-described manual smoke proof across the real lifecycle boundary.
+- **Dispatch-key mutation:** Search for `substring`, `replace`, `replaceAll`, `trim`, `toLowerCase`, normalization helpers, capping, and fallback defaults applied to routing or security keys before dispatch. Tool names, route names, skill IDs, JSON-RPC methods, task IDs, message IDs, auth principals, tenant IDs, schema paths, and cache keys should usually be rejected when invalid, not silently truncated or stripped.
+- **Cross-side symmetry:** If a caller validates a value but the server accepts-and-mutates it, or one transport rejects while another normalizes, flag the mismatch.
+- **Workspace artifacts:** Flag committed local coordination files such as `.wt-*`, `.context/**`, `.DS_Store`, editor swap files, temp claim files, local logs, and scratchpads unless the PR documents why they are source.
+
+Severity defaults:
+
+- Pinned or pre-1.0 third-party lifecycle method relied on without verifying the implementation -> Blocker if it can hang, drop work, or leave a dead worker.
+- Public dispatch-key mutation before serving, routing, lookup, auth, cache-key construction, or persistence -> Blocker.
+- Committed workspace artifacts (`.wt-claim`, `.wt-*`, `.context/**`, `.local/**`) -> Blocker unless the PR explicitly documents why they are source.
+
 ## How to report back
 
 Structured bullet list:
diff --git a/.claude/agents/code-reviewer.md b/.claude/agents/code-reviewer.md
index 10f34a7..f1c26af 100644
--- a/.claude/agents/code-reviewer.md
+++ b/.claude/agents/code-reviewer.md
@@ -19,6 +19,19 @@ You are a code reviewer for the AdCP monorepo. You review changes for:
 - Are there any TODO / FIXME / `console.log` / debug prints left in?
 - Does the PR title follow conventional-commits format?
 
+## High-risk checks
+
+- **Executable integration path:** For new builders, transport adapters, servlet bridges, connection managers, background workers, subscriptions, or third-party SDK integrations, identify the highest-level new entry point and verify that existing tests exercise it with the real downstream lifecycle components. If tests mock the queue/thread/subscription/lifecycle owner, call out the gap. For methods named like `start`, `ensureStarted`, `close`, `subscribe`, `flush`, `shutdown`, `request`, `complete`, or `cancel`, verify actual behavior from in-repo source, cited docs, or a real integration/manual smoke test; do not infer from the method name. If a pinned pre-1.0 dependency implementation is not available in repo context, require an in-repo integration test or PR-described manual smoke proof across the real lifecycle boundary.
+- **Dispatch-key mutation:** Search for `substring`, `replace`, `replaceAll`, `trim`, `toLowerCase`, normalization helpers, capping, and fallback defaults applied to routing or security keys before dispatch. Tool names, route names, skill IDs, JSON-RPC methods, task IDs, message IDs, auth principals, tenant IDs, schema paths, and cache keys should usually be rejected when invalid, not silently truncated or stripped.
+- **Cross-side symmetry:** If a caller validates a value but the server accepts-and-mutates it, or one transport rejects while another normalizes, flag the mismatch.
+- **Workspace artifacts:** Flag committed local coordination files such as `.wt-*`, `.context/**`, `.DS_Store`, editor swap files, temp claim files, local logs, and scratchpads unless the PR documents why they are source.
+
+Severity defaults:
+
+- Pinned or pre-1.0 third-party lifecycle method relied on without verifying the implementation -> Blocker if it can hang, drop work, or leave a dead worker.
+- Public dispatch-key mutation before serving, routing, lookup, auth, cache-key construction, or persistence -> Blocker.
+- Committed workspace artifacts (`.wt-claim`, `.wt-*`, `.context/**`, `.local/**`) -> Blocker unless the PR explicitly documents why they are source.
+
 ## How to report back
 
 Structured bullet list:
diff --git a/.github/ai-review/expert-adcp-reviewer.md b/.github/ai-review/expert-adcp-reviewer.md
index 6501f97..a327ca8 100644
--- a/.github/ai-review/expert-adcp-reviewer.md
+++ b/.github/ai-review/expert-adcp-reviewer.md
@@ -148,31 +148,31 @@ Read the PR description's test plan. If a checkbox describing **manual verificat
 
 "Blocked on dev credentials" is the author's problem, not your reason to skip the check.
 
----
+### 4. Executable integration path audit
 
-## Picking the action
+For any PR that adds or rewires a public builder, transport adapter, servlet bridge, connection manager, task lifecycle, background worker, subscription, or third-party SDK integration, you MUST prove that the main path actually runs. Static review of the architectural story is not enough.
 
-Three actions are available:
-- `gh pr review <PR> --approve --body "<review>"`
-- `gh pr review <PR> --comment --body "<review>"`
-- `gh pr review <PR> --request-changes --body "<review>"`
+- Identify the highest-level new public entry point and the downstream component it claims to wire. Examples: `FooServerBuilder.build().onMessageSend(...)`, `TransportClient.callTool(...)`, `Servlet.doPost(...)`, `ConnectionManager.getOrConnect(...)`.
+- Check whether existing tests exercise that exact composed path with real downstream components. If tests mock or stub the component that owns the lifecycle/queue/thread/subscription behavior, say so explicitly.
+- For third-party lifecycle methods (`start`, `ensureStarted`, `close`, `subscribe`, `flush`, `shutdown`, `request`, `complete`, `cancel`), verify behavior from in-repo source, cited docs, or a real integration/manual smoke test. Do not trust method names.
+- If the PR depends on a pinned pre-GA SDK or a version newly introduced in the PR, inspect the pinned implementation when it is available in the repo context. If it is not available, require an in-repo integration test or PR-described manual smoke proof across the real lifecycle boundary.
+- When the diff touches `AdcpServerBuilder`, transport providers, servlet bridges, or A2A handler wiring, a build-only test or stub/no-op transport does not prove startup. Require one integration test or direct inspection across the real lifecycle boundary, especially for pinned or pre-1.0 SDK lifecycle methods.
+- If you cannot prove the main path runs, downgrade to `--comment` or `--request-changes` depending on user impact. If you can reproduce a hang, dropped event, no-op lifecycle call, or dead background worker, that is a MUST FIX runtime bug.
 
-**Decision tree (apply in order):**
+### 5. Dispatch-key mutation audit
 
-1. MUST FIX issue found (per the section above) → `--request-changes`. Stop.
-2. PR has any of these labels → `--comment`. Append the label note.
-   - `do-not-auto-approve`, `wip`, `needs-human-review`, `security`, `breaking-change`
-3. Otherwise, your judgment. Verdict ratio target is ~85% approve. Clean, contained change with no MUST FIX issue → `--approve`. Genuinely uncertain (open question for the author, ambiguous intent, needs context you can't verify from the diff) → `--comment` with the question — say what would flip you to approve.
+For routing-sensitive or security-sensitive fields, reject-vs-mutate is a required review question. Tool names, route names, skill IDs, JSON-RPC methods, task IDs, message IDs, auth principals, tenant IDs, schema paths, and cache keys all qualify.
 
-**Scrutiny hint:** the codegen generator, `adcp-server` signing/auth, the schema-bundle fetcher, `gradle/libs.versions.toml`, `build-logic/`, the public record surface, and workflow YAML that handles tokens warrant harder reads than docs tweaks or test additions. **But "docs" is not a synonym for "small."** A ROADMAP.md edit that adds a D-row or rewrites a track section is a governance change; open the file. The largest-file rule applies. Scrutiny is not blocking — if you read it carefully and it's clean, approve. Sensitive areas get more *scrutiny*, not more *blocking*.
+- Search changed code for `substring`, `replace`, `replaceAll`, `trim`, `toLowerCase`, normalization helpers, capping, and fallback defaults applied before dispatch, lookup, auth, cache-key construction, or persistence.
+- Truncating or stripping an invalid dispatch key before use is suspect. Prefer validating and rejecting invalid input, while using a separately sanitized copy only for logs and error messages.
+- Silent truncation or normalization of public dispatch keys on the serving path is a MUST FIX. Tool names, route names, skill IDs, JSON-RPC methods, task/message IDs, tenant IDs, and cache keys must be rejected when invalid; only a separately sanitized copy may be used for logs or error text.
+- If the caller side rejects a value but the server side truncates or normalizes it, flag the asymmetry. Cross-transport semantics must match unless the PR documents a compatibility reason.
 
-**Notes to append (only when downgrading to `--comment`):**
+### 6. Workspace artifact scan
 
-Label hold:
-```
----
-*Held for human approval: PR has label `<label>`.*
-```
+Before posting the review, scan the changed-file list for local or agent coordination artifacts. Flag these as fix-or-remove unless the PR explicitly documents why they are source:
+
+- `.wt-claim`, `.wt-*`, `.context/**`, `.local/**`, `.DS_Store`, editor swap files, temp claim files, local logs, generated prompt scratchpads, local credentials, or one-off agent coordination files.
 
 ---
 
@@ -201,6 +201,8 @@ Every other PR runs `code-reviewer`. No exceptions for "small" PRs, "obvious" PR
 - `adcp-server` signing / auth / signature-verification / outbound HTTP (SSRF posture) → `security-reviewer` (mandatory) + `code-reviewer`
 - Schema-bundle fetcher / `cosign verify-blob` plumbing → `security-reviewer` + `ad-tech-protocol-expert`
 - New MCP tool, new A2A skill, or transport-layer change in `adcp` → `ad-tech-protocol-expert` + `agentic-product-architect`
+- New public builder, CLI/API entry point, SDK integration path, transport setup path, servlet bridge, lifecycle wiring, background worker, or subscription → `debugger` + `code-reviewer`
+- Adopter-facing defaults, error messages, or integration ergonomics → `dx-expert` + `code-reviewer`
 - Build infrastructure (`build-logic/`, `gradle/libs.versions.toml`, root `build.gradle.kts`, `settings.gradle.kts`) → `code-reviewer` with explicit focus on D-decision alignment
 - Spring Boot starter (`adcp-spring-boot-starter`) → `code-reviewer` with focus on D7 (jakarta-only, Spring Boot 3.x floor)
 - Reactor / Mutiny / Kotlin bridge modules → `code-reviewer` + (for Kotlin) note the D14 thin-extension-only constraint
@@ -220,6 +222,32 @@ Every other PR runs `code-reviewer`. No exceptions for "small" PRs, "obvious" PR
 
 ---
 
+## Picking the action
+
+Three actions are available:
+- `gh pr review <PR> --approve --body "<review>"`
+- `gh pr review <PR> --comment --body "<review>"`
+- `gh pr review <PR> --request-changes --body "<review>"`
+
+**Decision tree (apply in order):**
+
+1. MUST FIX issue found (per the section above) → `--request-changes`. Stop.
+2. PR has any of these labels → `--comment`. Append the label note.
+   - `do-not-auto-approve`, `wip`, `needs-human-review`, `security`, `breaking-change`
+3. Otherwise, your judgment. Verdict ratio target is ~85% approve. Clean, contained change with no MUST FIX issue → `--approve`. Genuinely uncertain (open question for the author, ambiguous intent, needs context you can't verify from the diff) → `--comment` with the question — say what would flip you to approve.
+
+**Scrutiny hint:** the codegen generator, `adcp-server` signing/auth, the schema-bundle fetcher, `gradle/libs.versions.toml`, `build-logic/`, the public record surface, and workflow YAML that handles tokens warrant harder reads than docs tweaks or test additions. **But "docs" is not a synonym for "small."** A ROADMAP.md edit that adds a D-row or rewrites a track section is a governance change; open the file. The largest-file rule applies. Scrutiny is not blocking — if you read it carefully and it's clean, approve. Sensitive areas get more *scrutiny*, not more *blocking*.
+
+**Notes to append (only when downgrading to `--comment`):**
+
+Label hold:
+```
+---
+*Held for human approval: PR has label `<label>`.*
+```
+
+---
+
 ## Workflow
 
 1. Fetch PR metadata: `gh pr view $PR_NUMBER --json title,labels,additions,deletions,changedFiles,files,body`
diff --git a/.github/pull_request_template.md b/.github/pull_request_template.md
index 0459815..4aca00d 100644
--- a/.github/pull_request_template.md
+++ b/.github/pull_request_template.md
@@ -29,3 +29,6 @@ At minimum: `./gradlew build` green.
 - [ ] `./gradlew build` green
 - [ ] New tests added for new behavior
 - [ ] Public API surface change → changeset added
+- [ ] Builder/transport/lifecycle/third-party SDK change → integration or manual smoke test proves the real composed path runs (not only stubs/mocks)
+- [ ] Invalid/overlong public names or IDs are rejected, not silently truncated or normalized server-side
+- [ ] No workspace-local artifacts committed (`.wt-claim`, `.wt-*`, `.context/`, `.local/`, `.DS_Store`)