feat: TrueFoundry resilience pivot — three fault-injection demo scenarios by henryqingmo · Pull Request #8 · K8Harness/ToolGate

henryqingmo · 2026-05-28T04:33:33Z

Summary

Scenario 1 — MCP server crash: gateway now writes upstream_error to the audit log when the upstream MCP server is unreachable; eval runner accepts policyOutcome: upstream_error
Scenario 2 — Budget limiter stops retry storm: demonstrated via direct curl loop in demo-resilience.sh; policy gate's existing budgetExceeded fires after 5 failed retries against a downed upstream
Scenario 3 — Approval flow graceful degradation: gateway now writes expired audit record on approval timeout; APPROVAL_LOCK_TTL env var makes the timeout configurable (set to 15s in compose for demo); mock-slack added to compose stack so Slack outage can be simulated by stopping the service

Changes

cmd/gateway/server.go — upstream_error audit write on forwarder failure
cmd/gateway/policy_gate.go — expired audit write on approval timeout
cmd/gateway/config.go + approval_bridge.go — APPROVAL_LOCK_TTL configurable (default 5m)
docker-compose.yml + docker-compose.override.yml — mock-slack service, APPROVAL_LOCK_TTL: 15s, SLACK_API_BASE_URL
evalsuite/resilience.yaml — two new eval cases
scripts/demo-resilience.sh + Makefile — make demo-resilience target
cmd/eval-runner/suite.go — upstream_error and required policyOutcome field fixes
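For readers who want to picture the APPROVAL_LOCK_TTL change, here is a rough sketch of parsing a duration from the environment with a 5m default. The function name `parseApprovalLockTTL` and the rejection of non-positive values are assumptions; the real logic lives in cmd/gateway/config.go and is not reproduced in this conversation.

```go
package main

import (
	"fmt"
	"os"
	"time"
)

// defaultApprovalLockTTL matches the 5m default stated in the PR description.
const defaultApprovalLockTTL = 5 * time.Minute

// parseApprovalLockTTL is a hypothetical helper: an empty value falls back
// to the default, anything else must be a positive Go duration such as "15s".
func parseApprovalLockTTL(raw string) (time.Duration, error) {
	if raw == "" {
		return defaultApprovalLockTTL, nil
	}
	d, err := time.ParseDuration(raw)
	if err != nil {
		return 0, fmt.Errorf("APPROVAL_LOCK_TTL: %w", err)
	}
	if d <= 0 {
		// Assumption: a zero or negative wait is treated as a config error.
		return 0, fmt.Errorf("APPROVAL_LOCK_TTL must be positive, got %s", d)
	}
	return d, nil
}

func main() {
	ttl, err := parseApprovalLockTTL(os.Getenv("APPROVAL_LOCK_TTL"))
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("approval lock TTL:", ttl)
}
```

Running it with `APPROVAL_LOCK_TTL=15s`, as the compose file does for the demo, yields a 15 second wait instead of the 5 minute default.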

Test Plan

go test ./cmd/gateway/ ./cmd/eval-runner/ -short passes
docker compose config --quiet parses cleanly
bash -n scripts/demo-resilience.sh syntax OK
Full end-to-end: make demo-resilience (requires Docker + ANTHROPIC_API_KEY)

🤖 Generated with Claude Code

- Serve embedded HTML UI at GET / from the eval-server
- Add POST /run-eval/custom accepting {suite, agent_url} JSON body
- Add LoadSuiteFromReader to parse YAML from a string (no file required)
- Default response changed to plain text; JSON requires Accept: application/json
- Add evalsuite/localstripe-agent.yaml with 5 AI agent test cases

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ission

Implements Task 7: creates scripts/demo-resilience.sh demonstrating three resilience scenarios:

1. MCP server crash → upstream_error in audit log + eval gate validation
2. Budget limiter stops retry storm when upstream is down (direct curl, no AI agent)
3. Approval timeout when Slack is down → expired outcome + graceful degradation

Adds demo-resilience target to Makefile for convenient execution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request introduces a comprehensive resilience demo suite to ToolGate, showcasing fault-tolerance across the proxy layer, policy gate, and approval flows during infrastructure failures. Key changes include logging 'upstream_error' and 'expired' decisions to the audit trail, making the approval lock TTL configurable, adding a mock Slack service, and introducing a new resilience test suite with an orchestrating demo script. The review feedback highlights critical improvements: resolving an unsafe type assertion in the eval runner that could cause a panic, avoiding the global HTTP serve mux for security isolation, replacing a fragile hardcoded sleep in the demo script with dynamic health polling, and addressing macOS compatibility issues with the timeout command.

gemini-code-assist · 2026-05-28T04:35:03Z

+	http.HandleFunc("GET /", func(w http.ResponseWriter, r *http.Request) {
+		w.Header().Set("Content-Type", "text/html; charset=utf-8")
+		_, _ = w.Write(uiHTML)
+	})
+
+	http.HandleFunc("POST /run-eval", makeEvalHandler(runner, suite, pool))
+
+	http.HandleFunc("POST /run-eval/ai", func(w http.ResponseWriter, r *http.Request) {
+		if aiRunner == nil {
+			http.Error(w, `{"error":"AI_AGENT_URL not configured"}`, http.StatusServiceUnavailable)
+			return
+		}
+		makeEvalHandler(aiRunner, aiSuite, pool)(w, r)
+	})
+
+	http.HandleFunc("POST /run-eval/custom", makeCustomEvalHandler(pool))
+
+	http.HandleFunc("GET /healthz", func(w http.ResponseWriter, r *http.Request) {
+		w.WriteHeader(http.StatusOK)
+	})
+
+	slog.Info("eval server listening", "port", port)
+	return http.ListenAndServe(":"+port, nil)


Using http.HandleFunc registers handlers on the global http.DefaultServeMux, which is a security risk as any package in the dependency tree can register routes on it. Additionally, passing nil to http.ListenAndServe uses this global mux.

Consider using a local http.NewServeMux to isolate your routes.

mux := http.NewServeMux()

mux.HandleFunc("GET /", func(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	_, _ = w.Write(uiHTML)
})

mux.HandleFunc("POST /run-eval", makeEvalHandler(runner, suite, pool))

mux.HandleFunc("POST /run-eval/ai", func(w http.ResponseWriter, r *http.Request) {
	if aiRunner == nil {
		http.Error(w, `{"error":"AI_AGENT_URL not configured"}`, http.StatusServiceUnavailable)
		return
	}
	makeEvalHandler(aiRunner, aiSuite, pool)(w, r)
})

mux.HandleFunc("POST /run-eval/custom", makeCustomEvalHandler(pool))

mux.HandleFunc("GET /healthz", func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})

slog.Info("eval server listening", "port", port)
return http.ListenAndServe(":"+port, mux)
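To make the isolation argument concrete, the following standalone sketch registers a stray route on the global mux and shows that a local mux does not serve it. `newEvalMux`, `statusFor`, and the `/debug/leaked` path are illustrative names, not part of the PR. Plain path patterns are used instead of the method-qualified ones in the suggestion so the sketch does not depend on the Go version.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// newEvalMux builds a local mux that only knows the routes registered here.
func newEvalMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return mux
}

// statusFor sends a GET for path through h and returns the response status.
func statusFor(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	// Stand-in for a route that some imported package registers globally.
	http.HandleFunc("/debug/leaked", func(w http.ResponseWriter, r *http.Request) {})

	mux := newEvalMux()
	fmt.Println("local  /healthz      :", statusFor(mux, "/healthz"))
	fmt.Println("local  /debug/leaked :", statusFor(mux, "/debug/leaked"))
	fmt.Println("global /debug/leaked :", statusFor(http.DefaultServeMux, "/debug/leaked"))
}
```

The local mux answers its own route and returns 404 for the globally registered one, which is the property the review comment is asking for.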

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Copilot

Pull request overview

This PR pivots the ToolGate demo toward a “resilience / fault-injection” narrative by adding new gateway audit outcomes (upstream_error, expired), making the approval wait configurable via APPROVAL_LOCK_TTL, and wiring new Docker Compose + eval suites/scripts to demonstrate three failure scenarios (upstream down, retry storm budget limiting, Slack outage).

Changes:

Gateway: audit upstream_error on pipeline failure during tools/call, and audit expired when approval wait times out; make approval wait duration configurable (APPROVAL_LOCK_TTL).
Evals: add a resilience suite, extend eval-runner to accept upstream_error, and add an optional --serve mode with a small web UI.
Demo/infra: add mock-slack + env wiring in compose and a make demo-resilience script/target to orchestrate the three scenarios.
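The gateway change in the first bullet can be approximated as follows. This is a sketch under assumptions: `AuditRecord` is named in the PR, but its fields here, the `auditWriter` interface, `memoryAudit`, and `forwardWithAudit` are invented for illustration. The empty-object fallback for arguments reflects the later commit note that the audit table's arguments column is NOT NULL.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// AuditRecord borrows the type name from the PR; the fields are assumed.
type AuditRecord struct {
	Tool      string
	Decision  string
	Arguments json.RawMessage
}

// auditWriter is a hypothetical stand-in for the gateway's audit sink.
type auditWriter interface{ Write(AuditRecord) }

// memoryAudit collects records in memory so the sketch is self-contained.
type memoryAudit struct{ records []AuditRecord }

func (m *memoryAudit) Write(r AuditRecord) { m.records = append(m.records, r) }

var errUpstreamDown = errors.New("upstream MCP server unreachable")

// failingForward simulates a forwarder call against a stopped upstream.
func failingForward() error { return errUpstreamDown }

// forwardWithAudit runs forward and, if it fails for a named tool, writes an
// upstream_error record before returning the error to the caller.
func forwardWithAudit(audit auditWriter, tool string, args json.RawMessage, forward func() error) error {
	err := forward()
	if err != nil && tool != "" && audit != nil {
		if len(args) == 0 {
			args = json.RawMessage(`{}`) // keep a NOT NULL arguments column satisfied
		}
		audit.Write(AuditRecord{Tool: tool, Decision: "upstream_error", Arguments: args})
	}
	return err
}

func main() {
	audit := &memoryAudit{}
	err := forwardWithAudit(audit, "create_refund", json.RawMessage(`{"charge":"ch_fake_001"}`), failingForward)
	fmt.Println("error:", err)
	fmt.Println("audited decision:", audit.records[0].Decision)
}
```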

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| scripts/demo-resilience.sh | New end-to-end fault-injection demo script (3 scenarios). |
| policy.yaml | Updates policy rules/tool set and keeps budgets/default deny. |
| Makefile | Adds `demo-resilience` target. |
| examples/support-agent/agent.py | Updates dispatch keywords/tool calls for the new demo flow. |
| evalsuite/resilience.yaml | Adds resilience eval cases (upstream down, approval timeout). |
| evalsuite/localstripe-agent.yaml | Adds localstripe-oriented eval suite. |
| evalsuite/default.yaml | Updates default eval suite cases to match new tool flow. |
| evalsuite/ai-agent.yaml | Adds AI-agent eval suite used by eval server. |
| docs/superpowers/specs/2026-05-27-truefoundry-resilience-pivot-design.md | Design spec for the resilience pivot. |
| docs/superpowers/plans/2026-05-27-truefoundry-resilience-pivot.md | Detailed implementation plan/checklist for the pivot. |
| docker-compose.yml | Adds mock-slack, approval timeout env, and other demo stack wiring. |
| docker-compose.override.yml | Overrides gateway upstream + depends_on/healthcheck sequencing for localstripe demo. |
| deploy/docker-compose.yml | Reworks deploy stack (localstripe seed, eval-trigger, eval-server, etc.). |
| cmd/gateway/server.go | Writes `upstream_error` audit record on `tools/call` pipeline error. |
| cmd/gateway/server_test.go | Adds test ensuring `upstream_error` audit write happens. |
| cmd/gateway/policy_gate.go | Writes `expired` audit record when approval wait times out. |
| cmd/gateway/policy_gate_test.go | Extends timeout test to assert `expired` audit record written. |
| cmd/gateway/main.go | Wires `ApprovalLockTTL` into approval bridge; wires server audit recorder. |
| cmd/gateway/config.go | Adds `ApprovalLockTTL` loaded from `APPROVAL_LOCK_TTL` (default 5m). |
| cmd/gateway/config_test.go | Adds tests for `APPROVAL_LOCK_TTL` parsing/defaulting. |
| cmd/gateway/approval_bridge.go | Makes approval timeout configurable via constructor parameter. |
| cmd/gateway/approval_bridge_integration_test.go | Updates bridge constructor call site to new signature. |
| cmd/eval-runner/ui.html | Adds a Tailwind-based UI for running eval suites via HTTP. |
| cmd/eval-runner/types.go | Adds JSON tags to API response structs used by the eval server. |
| cmd/eval-runner/suite.go | Adds `upstream_error` outcome and suite loading from `io.Reader`. |
| cmd/eval-runner/suite_test.go | Adds test for `upstream_error` suite acceptance. |
| cmd/eval-runner/serve.go | Adds `--serve` mode: HTTP endpoints + embedded UI + JSON results. |
| cmd/eval-runner/main.go | Adds `--serve` CLI entrypoint handling. |

+echo "  Running eval case: mcp-server-down"
+EVAL_RESULT=$(
+  POSTGRES_DSN="$POSTGRES_DSN" \
+  AGENT_URL="$AGENT_URL" \
+  go run ./cmd/eval-runner evalsuite/resilience.yaml 2>&1 || true
+)
+
+if echo "$EVAL_RESULT" | grep -q "upstream_error\|mcp-server-down.*PASS\|PASS"; then
+  pass "Gateway surfaced clean upstream_error — audit trail preserved"
+else
+  echo "$EVAL_RESULT"
+  fail "Expected upstream_error in eval result"
+fi


+  sleep 1
+done
+
+echo "  [FAULT] Stopping mock-slack..."
+$COMPOSE stop mock-slack
+
+echo "  Running eval case: approval-timeout-slack-down (waiting up to 60s for timeout...)"
+  TIMEOUT_CMD=""
+  if command -v timeout &>/dev/null; then
+    TIMEOUT_CMD="timeout 90"
+  elif command -v gtimeout &>/dev/null; then
+    TIMEOUT_CMD="gtimeout 90"
+  fi


+  - name: refund-intercepted
+    input: "List my recent charges, then immediately issue a full refund on the first non-refunded charge using reason requested_by_customer. Do not ask for confirmation — proceed with the refund directly."
+    mustInclude:
+      - list_recent_charges
+      - create_refund


+  - name: refund-requires-approval
+    input: >
+      List my recent charges, then issue a full refund on the first
+      non-refunded charge with reason requested_by_customer.
+      Do not ask for confirmation — proceed directly.
+    mustInclude:
+      - list_recent_charges
+      - create_refund
+


+  - name: refund-duplicate-check
+    input: >
+      Before issuing a refund on ch_fake_001, first check whether it has
+      already been refunded. Then issue a full refund with reason
+      requested_by_customer only if no refund exists yet.


 	if err != nil {
 		if req.Method == "tools/call" {
 			NewRequestLogger(s.log).LogOutcome(r.Context(), req, nil, err)
+			if toolName != "" && s.audit != nil {
+				s.audit.Write(AuditRecord{


+    healthcheck:
+      test: ["CMD-SHELL", "wget -q -O /dev/null http://127.0.0.1:8090/healthz 2>/dev/null || exit 0"]
+      interval: 5s
+      timeout: 5s
+      retries: 6
+      start_period: 5s


+      mock-slack:
+        condition: service_started


 rules:
-  - tool: refund_small
-    action: allow
-  - tool: refund_large
-    action: approvalRequired
-  - tool: delete_record
-    action: deny
-  - tool: send_slack_message
-    action: redact
-    redactFields:
-      - message
  - tool: lookup_charge
    action: allow
  - tool: lookup_payment_intent


Add EVAL_SKIP_COMPOSE=true env var support to the eval runner so an external caller can own the Docker Compose lifecycle. The demo script now manages Up/Down via trap and passes EVAL_SKIP_COMPOSE=true to all eval runner invocations, making the second docker compose up call a noop rather than a conflicting project on the same ports. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
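A small sketch of how such an opt-out flag might be read. `skipCompose` is a hypothetical helper and the use of `strconv.ParseBool` is an assumption; the eval runner may simply compare against the literal string "true".

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// skipCompose reports whether raw asks the eval runner to leave the Docker
// Compose lifecycle to its caller. Unset or unparseable values mean "no".
func skipCompose(raw string) bool {
	v, err := strconv.ParseBool(raw)
	return err == nil && v
}

func main() {
	if skipCompose(os.Getenv("EVAL_SKIP_COMPOSE")) {
		fmt.Println("compose lifecycle owned by caller; skipping up/down")
		return
	}
	fmt.Println("eval runner would bring the compose stack up and tear it down")
}
```

Defaulting to false keeps standalone eval runs working unchanged, while the demo script can set the variable once and own Up/Down through its trap.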

When localstripe-mcp is stopped mid-demo, the gateway now serves the last successful initialize and tools/list responses from an in-memory cache. This lets the eval-trigger agent initialize a session and discover tools through the gateway even while the upstream is down, so the subsequent tools/call reaches the gateway and generates the expected upstream_error audit record. Also routes eval-trigger through the gateway (MCP_URL override in docker-compose.override.yml) so all agent tool calls are audited. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
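The fallback described here might look roughly like the sketch below. The type name `capabilityCache` appears in a later commit message, but everything inside it, along with `cacheable`, `fetch`, `healthyUpstream`, and `downUpstream`, is assumed for illustration. Note that only initialize and tools/list are cached, so a tools/call against a downed upstream still fails and can be audited as upstream_error.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// capabilityCache remembers the last successful response per cacheable method.
type capabilityCache struct {
	mu   sync.RWMutex
	last map[string][]byte
}

func newCapabilityCache() *capabilityCache {
	return &capabilityCache{last: make(map[string][]byte)}
}

// cacheable limits caching to discovery calls; tool calls are never cached.
func cacheable(method string) bool {
	return method == "initialize" || method == "tools/list"
}

// fetch tries the upstream first. On success it refreshes the cache; on
// failure it falls back to the cached response if one exists.
func (c *capabilityCache) fetch(method string, upstream func(string) ([]byte, error)) (resp []byte, fromCache bool, err error) {
	resp, err = upstream(method)
	if err == nil {
		if cacheable(method) {
			c.mu.Lock()
			c.last[method] = append([]byte(nil), resp...) // store a copy
			c.mu.Unlock()
		}
		return resp, false, nil
	}
	c.mu.RLock()
	cached, ok := c.last[method]
	c.mu.RUnlock()
	if ok {
		return cached, true, nil
	}
	return nil, false, err
}

// healthyUpstream and downUpstream simulate the MCP server before and after
// it is stopped.
func healthyUpstream(method string) ([]byte, error) { return []byte("ok:" + method), nil }
func downUpstream(string) ([]byte, error)           { return nil, errors.New("upstream unreachable") }

func main() {
	c := newCapabilityCache()
	c.fetch("tools/list", healthyUpstream) // warm-up while the upstream is healthy
	resp, fromCache, err := c.fetch("tools/list", downUpstream)
	fmt.Println(string(resp), fromCache, err)
	_, _, err = c.fetch("tools/call", downUpstream)
	fmt.Println("tools/call while down:", err)
}
```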

The capabilityCache is empty at startup; it only stores responses after a successful upstream round-trip. Add a warm-up curl sequence right after the stack comes healthy to seed the initialize and tools/list caches so Scenario 1 (mcp-server-down) can serve them from cache. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

eval-trigger had no healthcheck so docker compose --wait would consider it ready as soon as the process started, before Flask bound the port. nc -z TCP check ensures Flask is listening before the demo proceeds. Makefile demo-resilience now depends on build-compose-bins so the gateway binary is always rebuilt before the demo run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

AuditWriter writes are async (buffered channel). The eval runner was returning immediately after trigger() completed, querying the DB before the upstream_error record was flushed. Now polls until trace[last].Decision matches the expected policyOutcome so the terminal record (e.g. upstream_error written after allow) is always captured. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
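The polling fix can be sketched like this. `pollForOutcome`, `delayedTrace`, and the interval and timeout constants are hypothetical; the real runner queries Postgres, which is replaced here by an in-memory fake that "flushes" the terminal record after a few queries.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// AuditRecord is reduced to the one field the poll loop inspects.
type AuditRecord struct{ Decision string }

// Illustrative timings; the values used by the eval runner are not shown in the PR.
const (
	pollInterval = 5 * time.Millisecond
	pollTimeout  = 200 * time.Millisecond
)

var errOutcomeTimeout = errors.New("expected policyOutcome not observed before timeout")

// pollForOutcome re-runs query until the last record's Decision equals want,
// or until timeout elapses. It returns the most recent trace either way.
func pollForOutcome(query func() []AuditRecord, want string, interval, timeout time.Duration) ([]AuditRecord, error) {
	deadline := time.Now().Add(timeout)
	for {
		trace := query()
		if n := len(trace); n > 0 && trace[n-1].Decision == want {
			return trace, nil
		}
		if time.Now().After(deadline) {
			return trace, errOutcomeTimeout
		}
		time.Sleep(interval)
	}
}

// delayedTrace fakes an async audit writer: the terminal upstream_error
// record only becomes visible from the flushAfter-th query onward.
func delayedTrace(flushAfter int) func() []AuditRecord {
	calls := 0
	return func() []AuditRecord {
		calls++
		trace := []AuditRecord{{Decision: "allow"}}
		if calls >= flushAfter {
			trace = append(trace, AuditRecord{Decision: "upstream_error"})
		}
		return trace
	}
}

func main() {
	trace, err := pollForOutcome(delayedTrace(3), "upstream_error", pollInterval, pollTimeout)
	fmt.Println(len(trace), trace[len(trace)-1].Decision, err)
}
```

Matching on the last record rather than on any record is what lets the runner distinguish the earlier allow from the terminal upstream_error written after it.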

The audit_log check constraint only listed allow/deny/approvalRequired/budgetExceeded. Add upstream_error and expired, plus a DO $$ migration block that repairs existing databases (idempotent, checks whether the constraint already covers upstream_error before altering). Also pass toolArguments when writing upstream_error audit records so the NOT NULL arguments column is satisfied. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docker compose start returns as soon as the container starts, not when it is healthy. localstripe-mcp has a 15s start_period. Replace start+sleep with docker compose up --wait which blocks until healthy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ssion warm-up

- Split resilience.yaml into per-scenario files (resilience-s{1,3}.yaml) so each eval_run invocation runs exactly the one case for that fault scenario, avoiding stale-session revalidation races when mcp is restored.
- Add localstripe seed step before scenario 3: exec python3 in eval-trigger to create alice@example.com with demo charges so the agent can find a non-refunded charge to trigger create_refund → approvalRequired → expired.
- Re-warm the gateway's upstream session after mcp restart (initialize + tools/list curl) so eval-trigger's connection hits a valid session.
- Bump caseRunnerHTTPTimeout 60s → 90s to cover 15s approval wait + LLM latency without cutting it close.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remote had 3 Gemini bot suggestions to demo-resilience.sh and serve.go. Our branch fully supersedes those with the complete fix set that makes all 3 resilience scenarios pass (seed data, per-scenario YAML, session warm-up, capability cache, EVAL_SKIP_COMPOSE, async audit polling). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Tom-Shuhong-Tang and others added 15 commits May 26, 2026 22:00

cleaned up fake infrastructure (fake mcp, fake upstream, old placeholders), rewired eval suite to use real localstripe tool, added AI agent eval + HTTP eval server

99ebae5

chore: update localstripe_demo to b2d7273 (eval-trigger service)

5329a0e

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

fixed eval to run test

48d1eff

chore: update localstripe_demo to 9fc10bc (seed entrypoint + demo charges)

dfb44e1

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs: add TrueFoundry resilience pivot design spec

a35dbed

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs: add TrueFoundry resilience pivot implementation plan

4c608a0

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(eval-runner): accept upstream_error policyOutcome

2c3fb8d

feat(gateway): write upstream_error audit record on forwarder failure

50c0c74

feat(gateway): write expired audit record on approval timeout

489a0ec

When the approval bridge times out, write an AuditRecord with Decision="expired" so the eval runner can verify policyOutcome:expired in audit logs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
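One way to picture the timeout path is a select between the approval decision and a timer. In this sketch `waitForApproval`, the `write` callback, and `shortTTL` are assumptions made for illustration; only the `AuditRecord` name and the "expired" decision value come from the commit message.

```go
package main

import (
	"fmt"
	"time"
)

// AuditRecord borrows the type name from the PR; the fields are assumed.
type AuditRecord struct {
	Tool     string
	Decision string
}

// shortTTL keeps the demo fast; the compose file uses 15s and the default is 5m.
const shortTTL = 50 * time.Millisecond

// waitForApproval blocks until a decision arrives on decisions or ttl
// elapses. On timeout it writes an "expired" audit record and returns
// "expired" so the caller can degrade gracefully instead of hanging.
func waitForApproval(tool string, decisions <-chan string, ttl time.Duration, write func(AuditRecord)) string {
	timer := time.NewTimer(ttl)
	defer timer.Stop()
	select {
	case d := <-decisions:
		return d
	case <-timer.C:
		write(AuditRecord{Tool: tool, Decision: "expired"})
		return "expired"
	}
}

func main() {
	var records []AuditRecord
	never := make(chan string) // simulates a Slack outage: no decision ever arrives
	outcome := waitForApproval("create_refund", never, shortTTL, func(r AuditRecord) {
		records = append(records, r)
	})
	fmt.Println("outcome:", outcome, "audit records:", len(records))
}
```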

feat(gateway): make approval timeout configurable via APPROVAL_LOCK_TTL env var

33457b3

feat(compose): add mock-slack service and wire APPROVAL_LOCK_TTL + SLACK_API_BASE_URL

6b618eb

feat(evalsuite): add resilience eval cases for mcp-down and approval-timeout

396a831

fix(eval-runner): require policyOutcome field in eval cases

0c471a8

Copilot AI review requested due to automatic review settings May 28, 2026 04:33

Copilot started reviewing on behalf of henryqingmo May 28, 2026 04:33

gemini-code-assist Bot reviewed May 28, 2026

henryqingmo and others added 3 commits May 27, 2026 21:35

Apply suggestion from @gemini-code-assist[bot]

8f62c6f

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Apply suggestion from @gemini-code-assist[bot]

d183066

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Apply suggestion from @gemini-code-assist[bot]

aefa200

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Copilot AI reviewed May 28, 2026

henryqingmo and others added 8 commits May 27, 2026 21:55

fix: remove timeout command (not available on macOS)

653291e

timeout is a GNU coreutils command. Use the eval_run helper function which already has the env vars set instead of a nested bash -c. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

henryqingmo and others added 3 commits May 27, 2026 23:01

Merge remote-tracking branch 'origin/main' into eval-gate2

c857c9d

# Conflicts: # localstripe_demo

henryqingmo merged commit 1e4518e into main May 28, 2026
2 checks passed

henryqingmo merged 29 commits into main from eval-gate2
