feat: TrueFoundry resilience pivot — three fault-injection demo scenarios#8
…ders), rewired eval suite to use real localstripe tool, added AI agent eval + HTTP eval server
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Serve embedded HTML UI at GET / from the eval-server
- Add POST /run-eval/custom accepting {suite, agent_url} JSON body
- Add LoadSuiteFromReader to parse YAML from a string (no file required)
- Default response changed to plain text; JSON requires Accept: application/json
- Add evalsuite/localstripe-agent.yaml with 5 AI agent test cases
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rges) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the approval bridge times out, write an AuditRecord with Decision="expired" so the eval runner can verify policyOutcome:expired in audit logs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ission
Implements Task 7: creates scripts/demo-resilience.sh demonstrating three resilience scenarios:
1. MCP server crash → upstream_error in audit log + eval gate validation
2. Budget limiter stops retry storm when upstream is down (direct curl, no AI agent)
3. Approval timeout when Slack is down → expired outcome + graceful degradation
Adds demo-resilience target to Makefile for convenient execution.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
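The second scenario relies on a budget limiter cutting off a retry storm. One way such a limiter might work is a per-tool call counter, sketched below. The PR does not show the gateway's actual budget logic, so `callBudget`, `newCallBudget`, and `decide` are hypothetical, and the assumption that the budget counts calls (rather than failures) is mine; the limit of 5 and the `budgetExceeded` outcome are taken from the PR description.

```go
package main

import (
	"fmt"
	"sync"
)

// callBudget is a hypothetical per-tool call counter.
type callBudget struct {
	mu    sync.Mutex
	limit int
	used  map[string]int
}

func newCallBudget(limit int) *callBudget {
	return &callBudget{limit: limit, used: make(map[string]int)}
}

// decide returns "allow" while the tool still has budget and
// "budgetExceeded" once the limit has been reached.
func (b *callBudget) decide(tool string) string {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.used[tool] >= b.limit {
		return "budgetExceeded"
	}
	b.used[tool]++
	return "allow"
}

func main() {
	b := newCallBudget(5)
	// A client retrying against a downed upstream is refused from the sixth call on.
	for attempt := 1; attempt <= 7; attempt++ {
		fmt.Println(attempt, b.decide("create_refund"))
	}
}
```

Because the check happens before the call is forwarded, retries beyond the budget never reach the failing upstream.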
Code Review
This pull request introduces a comprehensive resilience demo suite to ToolGate, showcasing fault-tolerance across the proxy layer, policy gate, and approval flows during infrastructure failures. Key changes include logging 'upstream_error' and 'expired' decisions to the audit trail, making the approval lock TTL configurable, adding a mock Slack service, and introducing a new resilience test suite with an orchestrating demo script. The review feedback highlights critical improvements: resolving an unsafe type assertion in the eval runner that could cause a panic, avoiding the global HTTP serve mux for security isolation, replacing a fragile hardcoded sleep in the demo script with dynamic health polling, and addressing macOS compatibility issues with the timeout command.
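The unsafe type assertion mentioned in the review is not reproduced on this page, so the following only illustrates the general fix: Go's two-value ("comma ok") assertion returns false instead of panicking. The function name `decisionFrom` and the `decision` map key are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// decisionFrom extracts a string "decision" value from a loosely typed row.
// A one-value assertion, row["decision"].(string), would panic on a missing
// key or a non-string value; the two-value form reports failure instead.
func decisionFrom(row map[string]any) (string, error) {
	v, ok := row["decision"].(string)
	if !ok {
		return "", errors.New("decision missing or not a string")
	}
	return v, nil
}

func main() {
	good, err := decisionFrom(map[string]any{"decision": "allow"})
	fmt.Println(good, err)

	_, err = decisionFrom(map[string]any{"decision": 42})
	fmt.Println(err)
}
```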
http.HandleFunc("GET /", func(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	_, _ = w.Write(uiHTML)
})

http.HandleFunc("POST /run-eval", makeEvalHandler(runner, suite, pool))

http.HandleFunc("POST /run-eval/ai", func(w http.ResponseWriter, r *http.Request) {
	if aiRunner == nil {
		http.Error(w, `{"error":"AI_AGENT_URL not configured"}`, http.StatusServiceUnavailable)
		return
	}
	makeEvalHandler(aiRunner, aiSuite, pool)(w, r)
})

http.HandleFunc("POST /run-eval/custom", makeCustomEvalHandler(pool))

http.HandleFunc("GET /healthz", func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})

slog.Info("eval server listening", "port", port)
return http.ListenAndServe(":"+port, nil)
Using http.HandleFunc registers handlers on the global http.DefaultServeMux, which is a security risk as any package in the dependency tree can register routes on it. Additionally, passing nil to http.ListenAndServe uses this global mux.
Consider using a local http.NewServeMux to isolate your routes.
mux := http.NewServeMux()
mux.HandleFunc("GET /", func(w http.ResponseWriter, r *http.Request) {
	w.Header().Set("Content-Type", "text/html; charset=utf-8")
	_, _ = w.Write(uiHTML)
})
mux.HandleFunc("POST /run-eval", makeEvalHandler(runner, suite, pool))
mux.HandleFunc("POST /run-eval/ai", func(w http.ResponseWriter, r *http.Request) {
	if aiRunner == nil {
		http.Error(w, `{"error":"AI_AGENT_URL not configured"}`, http.StatusServiceUnavailable)
		return
	}
	makeEvalHandler(aiRunner, aiSuite, pool)(w, r)
})
mux.HandleFunc("POST /run-eval/custom", makeCustomEvalHandler(pool))
mux.HandleFunc("GET /healthz", func(w http.ResponseWriter, r *http.Request) {
	w.WriteHeader(http.StatusOK)
})
slog.Info("eval server listening", "port", port)
return http.ListenAndServe(":"+port, mux)

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Pull request overview
This PR pivots the ToolGate demo toward a “resilience / fault-injection” narrative by adding new gateway audit outcomes (upstream_error, expired), making the approval wait configurable via APPROVAL_LOCK_TTL, and wiring new Docker Compose + eval suites/scripts to demonstrate three failure scenarios (upstream down, retry storm budget limiting, Slack outage).
Changes:
- Gateway: audit upstream_error on pipeline failure during tools/call, and audit expired when approval wait times out; make approval wait duration configurable (APPROVAL_LOCK_TTL).
- Evals: add a resilience suite, extend eval-runner to accept upstream_error, and add an optional --serve mode with a small web UI.
- Demo/infra: add mock-slack + env wiring in compose and a make demo-resilience script/target to orchestrate the three scenarios.
Reviewed changes
Copilot reviewed 29 out of 29 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| scripts/demo-resilience.sh | New end-to-end fault-injection demo script (3 scenarios). |
| policy.yaml | Updates policy rules/tool set and keeps budgets/default deny. |
| Makefile | Adds demo-resilience target. |
| examples/support-agent/agent.py | Updates dispatch keywords/tool calls for the new demo flow. |
| evalsuite/resilience.yaml | Adds resilience eval cases (upstream down, approval timeout). |
| evalsuite/localstripe-agent.yaml | Adds localstripe-oriented eval suite. |
| evalsuite/default.yaml | Updates default eval suite cases to match new tool flow. |
| evalsuite/ai-agent.yaml | Adds AI-agent eval suite used by eval server. |
| docs/superpowers/specs/2026-05-27-truefoundry-resilience-pivot-design.md | Design spec for the resilience pivot. |
| docs/superpowers/plans/2026-05-27-truefoundry-resilience-pivot.md | Detailed implementation plan/checklist for the pivot. |
| docker-compose.yml | Adds mock-slack, approval timeout env, and other demo stack wiring. |
| docker-compose.override.yml | Overrides gateway upstream + depends_on/healthcheck sequencing for localstripe demo. |
| deploy/docker-compose.yml | Reworks deploy stack (localstripe seed, eval-trigger, eval-server, etc.). |
| cmd/gateway/server.go | Writes upstream_error audit record on tools/call pipeline error. |
| cmd/gateway/server_test.go | Adds test ensuring upstream_error audit write happens. |
| cmd/gateway/policy_gate.go | Writes expired audit record when approval wait times out. |
| cmd/gateway/policy_gate_test.go | Extends timeout test to assert expired audit record written. |
| cmd/gateway/main.go | Wires ApprovalLockTTL into approval bridge; wires server audit recorder. |
| cmd/gateway/config.go | Adds ApprovalLockTTL loaded from APPROVAL_LOCK_TTL (default 5m). |
| cmd/gateway/config_test.go | Adds tests for APPROVAL_LOCK_TTL parsing/defaulting. |
| cmd/gateway/approval_bridge.go | Makes approval timeout configurable via constructor parameter. |
| cmd/gateway/approval_bridge_integration_test.go | Updates bridge constructor call site to new signature. |
| cmd/eval-runner/ui.html | Adds a Tailwind-based UI for running eval suites via HTTP. |
| cmd/eval-runner/types.go | Adds JSON tags to API response structs used by the eval server. |
| cmd/eval-runner/suite.go | Adds upstream_error outcome and suite loading from io.Reader. |
| cmd/eval-runner/suite_test.go | Adds test for upstream_error suite acceptance. |
| cmd/eval-runner/serve.go | Adds --serve mode: HTTP endpoints + embedded UI + JSON results. |
| cmd/eval-runner/main.go | Adds --serve CLI entrypoint handling. |
echo " Running eval case: mcp-server-down"
EVAL_RESULT=$(
  POSTGRES_DSN="$POSTGRES_DSN" \
  AGENT_URL="$AGENT_URL" \
  go run ./cmd/eval-runner evalsuite/resilience.yaml 2>&1 || true
)

if echo "$EVAL_RESULT" | grep -q "upstream_error\|mcp-server-down.*PASS\|PASS"; then
  pass "Gateway surfaced clean upstream_error — audit trail preserved"
else
  echo "$EVAL_RESULT"
  fail "Expected upstream_error in eval result"
fi

  sleep 1
done

echo " [FAULT] Stopping mock-slack..."
$COMPOSE stop mock-slack

echo " Running eval case: approval-timeout-slack-down (waiting up to 60s for timeout...)"
TIMEOUT_CMD=""
if command -v timeout &>/dev/null; then
  TIMEOUT_CMD="timeout 90"
elif command -v gtimeout &>/dev/null; then
  TIMEOUT_CMD="gtimeout 90"
fi
- name: refund-intercepted
  input: "List my recent charges, then immediately issue a full refund on the first non-refunded charge using reason requested_by_customer. Do not ask for confirmation — proceed with the refund directly."
  mustInclude:
    - list_recent_charges
    - create_refund

- name: refund-requires-approval
  input: >
    List my recent charges, then issue a full refund on the first
    non-refunded charge with reason requested_by_customer.
    Do not ask for confirmation — proceed directly.
  mustInclude:
    - list_recent_charges
    - create_refund

- name: refund-duplicate-check
  input: >
    Before issuing a refund on ch_fake_001, first check whether it has
    already been refunded. Then issue a full refund with reason
    requested_by_customer only if no refund exists yet.
if err != nil {
	if req.Method == "tools/call" {
		NewRequestLogger(s.log).LogOutcome(r.Context(), req, nil, err)
		if toolName != "" && s.audit != nil {
			s.audit.Write(AuditRecord{
healthcheck:
  test: ["CMD-SHELL", "wget -q -O /dev/null http://127.0.0.1:8090/healthz 2>/dev/null || exit 0"]
  interval: 5s
  timeout: 5s
  retries: 6
  start_period: 5s

mock-slack:
  condition: service_started
rules:
  - tool: refund_small
    action: allow
  - tool: refund_large
    action: approvalRequired
  - tool: delete_record
    action: deny
  - tool: send_slack_message
    action: redact
    redactFields:
      - message
  - tool: lookup_charge
    action: allow
  - tool: lookup_payment_intent
Add EVAL_SKIP_COMPOSE=true env var support to the eval runner so an external caller can own the Docker Compose lifecycle. The demo script now manages Up/Down via trap and passes EVAL_SKIP_COMPOSE=true to all eval runner invocations, making the second docker compose up call a noop rather than a conflicting project on the same ports. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
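A small sketch of how the eval runner might gate its Compose lifecycle on this variable. Only the name `EVAL_SKIP_COMPOSE` and the value `true` come from the commit message; `skipCompose` and the decision to treat unparsable values as "do not skip" are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// skipCompose reports whether an external caller owns the Docker Compose
// lifecycle. Unset or unparsable values are treated as false, so the runner
// keeps managing the stack itself unless explicitly told otherwise.
func skipCompose(getenv func(string) string) bool {
	v, err := strconv.ParseBool(getenv("EVAL_SKIP_COMPOSE"))
	return err == nil && v
}

func main() {
	if skipCompose(os.Getenv) {
		fmt.Println("compose lifecycle owned by caller; skipping up/down")
		return
	}
	fmt.Println("runner will start and stop the compose stack itself")
}
```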
When localstripe-mcp is stopped mid-demo, the gateway now serves the last successful initialize and tools/list responses from an in-memory cache. This lets the eval-trigger agent initialize a session and discover tools through the gateway even while the upstream is down, so the subsequent tools/call reaches the gateway and generates the expected upstream_error audit record. Also routes eval-trigger through the gateway (MCP_URL override in docker-compose.override.yml) so all agent tool calls are audited. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The capabilityCache is empty at startup; it only stores responses after a successful upstream round-trip. Add a warm-up curl sequence right after the stack comes healthy to seed the initialize and tools/list caches so Scenario 1 (mcp-server-down) can serve them from cache. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
eval-trigger had no healthcheck so docker compose --wait would consider it ready as soon as the process started, before Flask bound the port. nc -z TCP check ensures Flask is listening before the demo proceeds. Makefile demo-resilience now depends on build-compose-bins so the gateway binary is always rebuilt before the demo run. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AuditWriter writes are async (buffered channel). The eval runner was returning immediately after trigger() completed, querying the DB before the upstream_error record was flushed. Now polls until trace[last].Decision matches the expected policyOutcome so the terminal record (e.g. upstream_error written after allow) is always captured. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The audit_log check constraint only listed allow/deny/approvalRequired/ budgetExceeded. Add upstream_error and expired, plus a DO $$ migration block that repairs existing databases (idempotent, checks whether the constraint already covers upstream_error before altering). Also pass toolArguments when writing upstream_error audit records so the NOT NULL arguments column is satisfied. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
timeout is a GNU coreutils command. Use the eval_run helper function which already has the env vars set instead of a nested bash -c. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docker compose start returns as soon as the container starts, not when it is healthy. localstripe-mcp has a 15s start_period. Replace start+sleep with docker compose up --wait which blocks until healthy. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ssion warm-up
- Split resilience.yaml into per-scenario files (resilience-s{1,3}.yaml) so
each eval_run invocation runs exactly the one case for that fault scenario,
avoiding stale-session revalidation races when mcp is restored.
- Add localstripe seed step before scenario 3: exec python3 in eval-trigger
to create alice@example.com with demo charges so the agent can find a
non-refunded charge to trigger create_refund → approvalRequired → expired.
- Re-warm the gateway's upstream session after mcp restart (initialize +
tools/list curl) so eval-trigger's connection hits a valid session.
- Bump caseRunnerHTTPTimeout 60s → 90s to cover 15s approval wait + LLM
latency without cutting it close.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remote had 3 Gemini bot suggestions to demo-resilience.sh and serve.go. Our branch fully supersedes those with the complete fix set that makes all 3 resilience scenarios pass (seed data, per-scenario YAML, session warm-up, capability cache, EVAL_SKIP_COMPOSE, async audit polling). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
# Conflicts: # localstripe_demo
Summary
- … upstream_error to the audit log when the upstream MCP server is unreachable; eval runner accepts policyOutcome: upstream_error
- … demo-resilience.sh; policy gate's existing budgetExceeded fires after 5 failed retries against a downed upstream
- … expired audit record on approval timeout; APPROVAL_LOCK_TTL env var makes the timeout configurable (set to 15s in compose for demo); mock-slack added to compose stack so Slack outage can be simulated by stopping the service

Changes
- cmd/gateway/server.go: upstream_error audit write on forwarder failure
- cmd/gateway/policy_gate.go: expired audit write on approval timeout
- cmd/gateway/config.go + approval_bridge.go: APPROVAL_LOCK_TTL configurable (default 5m)
- docker-compose.yml + docker-compose.override.yml: mock-slack service, APPROVAL_LOCK_TTL: 15s, SLACK_API_BASE_URL
- evalsuite/resilience.yaml: two new eval cases
- scripts/demo-resilience.sh + Makefile: make demo-resilience target
- cmd/eval-runner/suite.go: upstream_error and required policyOutcome field fixes

Test Plan
- go test ./cmd/gateway/ ./cmd/eval-runner/ -short passes
- docker compose config --quiet parses cleanly
- bash -n scripts/demo-resilience.sh syntax OK
- make demo-resilience (requires Docker + ANTHROPIC_API_KEY)

🤖 Generated with Claude Code