
feat: TrueFoundry resilience pivot — three fault-injection demo scenarios #8

Merged
henryqingmo merged 29 commits into main from eval-gate2 on May 28, 2026
Conversation

@henryqingmo
Contributor

Summary

  • Scenario 1 — MCP server crash: gateway now writes upstream_error to the audit log when the upstream MCP server is unreachable; eval runner accepts policyOutcome: upstream_error
  • Scenario 2 — Budget limiter stops retry storm: demonstrated via direct curl loop in demo-resilience.sh; policy gate's existing budgetExceeded fires after 5 failed retries against a downed upstream
  • Scenario 3 — Approval flow graceful degradation: gateway now writes expired audit record on approval timeout; APPROVAL_LOCK_TTL env var makes the timeout configurable (set to 15s in compose for demo); mock-slack added to compose stack so Slack outage can be simulated by stopping the service
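
Scenario 2 relies on the policy gate refusing further calls once a budget is spent. The sketch below illustrates that idea only; `callBudget`, `newCallBudget`, and `decide` are hypothetical names, and it assumes every attempt against a tool counts toward the limit. The real gate's accounting is not shown in this PR.

```go
package main

import (
	"fmt"
	"sync"
)

// callBudget is a hypothetical per-tool attempt counter. It is an
// illustration, not the gateway's actual budget implementation.
type callBudget struct {
	mu    sync.Mutex
	limit int
	used  map[string]int
}

func newCallBudget(limit int) *callBudget {
	return &callBudget{limit: limit, used: make(map[string]int)}
}

// decide returns "allow" while the tool is under budget and
// "budgetExceeded" once the limit has been spent.
func (b *callBudget) decide(tool string) string {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.used[tool] >= b.limit {
		return "budgetExceeded"
	}
	b.used[tool]++
	return "allow"
}

func main() {
	budget := newCallBudget(5)
	// A retry loop against a downed upstream keeps calling the same tool.
	for attempt := 1; attempt <= 7; attempt++ {
		fmt.Printf("attempt %d: %s\n", attempt, budget.decide("create_refund"))
	}
}
```

With a limit of 5, the first five attempts are allowed and every later attempt is rejected before it reaches the upstream, which is what caps the retry storm.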

Changes

  • cmd/gateway/server.go: upstream_error audit write on forwarder failure
  • cmd/gateway/policy_gate.go: expired audit write on approval timeout
  • cmd/gateway/config.go + approval_bridge.go: APPROVAL_LOCK_TTL configurable (default 5m)
  • docker-compose.yml + docker-compose.override.yml: mock-slack service, APPROVAL_LOCK_TTL: 15s, SLACK_API_BASE_URL
  • evalsuite/resilience.yaml: two new eval cases
  • scripts/demo-resilience.sh + Makefile: make demo-resilience target
  • cmd/eval-runner/suite.go: upstream_error and required policyOutcome field fixes

Test Plan

  • go test ./cmd/gateway/ ./cmd/eval-runner/ -short passes
  • docker compose config --quiet parses cleanly
  • bash -n scripts/demo-resilience.sh syntax OK
  • Full end-to-end: make demo-resilience (requires Docker + ANTHROPIC_API_KEY)

🤖 Generated with Claude Code

Tom-Shuhong-Tang and others added 15 commits May 26, 2026 22:00
…ders), rewired eval suite to use real localstripe tool, added AI agent eval + HTTP eval server
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Serve embedded HTML UI at GET / from the eval-server
- Add POST /run-eval/custom accepting {suite, agent_url} JSON body
- Add LoadSuiteFromReader to parse YAML from a string (no file required)
- Default response changed to plain text; JSON requires Accept: application/json
- Add evalsuite/localstripe-agent.yaml with 5 AI agent test cases

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rges)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When the approval bridge times out, write an AuditRecord with Decision="expired"
so the eval runner can verify policyOutcome:expired in audit logs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ission

Implements Task 7: creates scripts/demo-resilience.sh demonstrating three resilience scenarios:
1. MCP server crash → upstream_error in audit log + eval gate validation
2. Budget limiter stops retry storm when upstream is down (direct curl, no AI agent)
3. Approval timeout when Slack is down → expired outcome + graceful degradation

Adds demo-resilience target to Makefile for convenient execution.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 28, 2026 04:33
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces a comprehensive resilience demo suite to ToolGate, showcasing fault-tolerance across the proxy layer, policy gate, and approval flows during infrastructure failures. Key changes include logging 'upstream_error' and 'expired' decisions to the audit trail, making the approval lock TTL configurable, adding a mock Slack service, and introducing a new resilience test suite with an orchestrating demo script. The review feedback highlights critical improvements: resolving an unsafe type assertion in the eval runner that could cause a panic, avoiding the global HTTP serve mux for security isolation, replacing a fragile hardcoded sleep in the demo script with dynamic health polling, and addressing macOS compatibility issues with the timeout command.

Comment thread cmd/eval-runner/serve.go Outdated
Comment thread cmd/eval-runner/serve.go
Comment on lines +69 to +91
	http.HandleFunc("GET /", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		_, _ = w.Write(uiHTML)
	})

	http.HandleFunc("POST /run-eval", makeEvalHandler(runner, suite, pool))

	http.HandleFunc("POST /run-eval/ai", func(w http.ResponseWriter, r *http.Request) {
		if aiRunner == nil {
			http.Error(w, `{"error":"AI_AGENT_URL not configured"}`, http.StatusServiceUnavailable)
			return
		}
		makeEvalHandler(aiRunner, aiSuite, pool)(w, r)
	})

	http.HandleFunc("POST /run-eval/custom", makeCustomEvalHandler(pool))

	http.HandleFunc("GET /healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	slog.Info("eval server listening", "port", port)
	return http.ListenAndServe(":"+port, nil)
Contributor

medium

Using http.HandleFunc registers handlers on the global http.DefaultServeMux, which is a security risk as any package in the dependency tree can register routes on it. Additionally, passing nil to http.ListenAndServe uses this global mux.

Consider using a local http.NewServeMux to isolate your routes.

	mux := http.NewServeMux()
	mux.HandleFunc("GET /", func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/html; charset=utf-8")
		_, _ = w.Write(uiHTML)
	})

	mux.HandleFunc("POST /run-eval", makeEvalHandler(runner, suite, pool))

	mux.HandleFunc("POST /run-eval/ai", func(w http.ResponseWriter, r *http.Request) {
		if aiRunner == nil {
			http.Error(w, `{"error":"AI_AGENT_URL not configured"}`, http.StatusServiceUnavailable)
			return
		}
		makeEvalHandler(aiRunner, aiSuite, pool)(w, r)
	})

	mux.HandleFunc("POST /run-eval/custom", makeCustomEvalHandler(pool))

	mux.HandleFunc("GET /healthz", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	slog.Info("eval server listening", "port", port)
	return http.ListenAndServe(":"+port, mux)

Comment thread scripts/demo-resilience.sh Outdated
Comment thread scripts/demo-resilience.sh Outdated
henryqingmo and others added 3 commits May 27, 2026 21:35
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Copilot AI left a comment


Pull request overview

This PR pivots the ToolGate demo toward a “resilience / fault-injection” narrative by adding new gateway audit outcomes (upstream_error, expired), making the approval wait configurable via APPROVAL_LOCK_TTL, and wiring new Docker Compose + eval suites/scripts to demonstrate three failure scenarios (upstream down, retry storm budget limiting, Slack outage).

Changes:

  • Gateway: audit upstream_error on pipeline failure during tools/call, and audit expired when approval wait times out; make approval wait duration configurable (APPROVAL_LOCK_TTL).
  • Evals: add a resilience suite, extend eval-runner to accept upstream_error, and add an optional --serve mode with a small web UI.
  • Demo/infra: add mock-slack + env wiring in compose and a make demo-resilience script/target to orchestrate the three scenarios.

Reviewed changes

Copilot reviewed 29 out of 29 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| scripts/demo-resilience.sh | New end-to-end fault-injection demo script (3 scenarios). |
| policy.yaml | Updates policy rules/tool set and keeps budgets/default deny. |
| Makefile | Adds demo-resilience target. |
| examples/support-agent/agent.py | Updates dispatch keywords/tool calls for the new demo flow. |
| evalsuite/resilience.yaml | Adds resilience eval cases (upstream down, approval timeout). |
| evalsuite/localstripe-agent.yaml | Adds localstripe-oriented eval suite. |
| evalsuite/default.yaml | Updates default eval suite cases to match new tool flow. |
| evalsuite/ai-agent.yaml | Adds AI-agent eval suite used by eval server. |
| docs/superpowers/specs/2026-05-27-truefoundry-resilience-pivot-design.md | Design spec for the resilience pivot. |
| docs/superpowers/plans/2026-05-27-truefoundry-resilience-pivot.md | Detailed implementation plan/checklist for the pivot. |
| docker-compose.yml | Adds mock-slack, approval timeout env, and other demo stack wiring. |
| docker-compose.override.yml | Overrides gateway upstream + depends_on/healthcheck sequencing for localstripe demo. |
| deploy/docker-compose.yml | Reworks deploy stack (localstripe seed, eval-trigger, eval-server, etc.). |
| cmd/gateway/server.go | Writes upstream_error audit record on tools/call pipeline error. |
| cmd/gateway/server_test.go | Adds test ensuring upstream_error audit write happens. |
| cmd/gateway/policy_gate.go | Writes expired audit record when approval wait times out. |
| cmd/gateway/policy_gate_test.go | Extends timeout test to assert expired audit record written. |
| cmd/gateway/main.go | Wires ApprovalLockTTL into approval bridge; wires server audit recorder. |
| cmd/gateway/config.go | Adds ApprovalLockTTL loaded from APPROVAL_LOCK_TTL (default 5m). |
| cmd/gateway/config_test.go | Adds tests for APPROVAL_LOCK_TTL parsing/defaulting. |
| cmd/gateway/approval_bridge.go | Makes approval timeout configurable via constructor parameter. |
| cmd/gateway/approval_bridge_integration_test.go | Updates bridge constructor call site to new signature. |
| cmd/eval-runner/ui.html | Adds a Tailwind-based UI for running eval suites via HTTP. |
| cmd/eval-runner/types.go | Adds JSON tags to API response structs used by the eval server. |
| cmd/eval-runner/suite.go | Adds upstream_error outcome and suite loading from io.Reader. |
| cmd/eval-runner/suite_test.go | Adds test for upstream_error suite acceptance. |
| cmd/eval-runner/serve.go | Adds --serve mode: HTTP endpoints + embedded UI + JSON results. |
| cmd/eval-runner/main.go | Adds --serve CLI entrypoint handling. |


Comment on lines +22 to +34
echo " Running eval case: mcp-server-down"
EVAL_RESULT=$(
  POSTGRES_DSN="$POSTGRES_DSN" \
  AGENT_URL="$AGENT_URL" \
  go run ./cmd/eval-runner evalsuite/resilience.yaml 2>&1 || true
)

if echo "$EVAL_RESULT" | grep -q "upstream_error\|mcp-server-down.*PASS\|PASS"; then
  pass "Gateway surfaced clean upstream_error — audit trail preserved"
else
  echo "$EVAL_RESULT"
  fail "Expected upstream_error in eval result"
fi
Comment thread scripts/demo-resilience.sh Outdated
Comment on lines +83 to +95
  sleep 1
done

echo " [FAULT] Stopping mock-slack..."
$COMPOSE stop mock-slack

echo " Running eval case: approval-timeout-slack-down (waiting up to 60s for timeout...)"
TIMEOUT_CMD=""
if command -v timeout &>/dev/null; then
  TIMEOUT_CMD="timeout 90"
elif command -v gtimeout &>/dev/null; then
  TIMEOUT_CMD="gtimeout 90"
fi
Comment thread evalsuite/ai-agent.yaml
Comment on lines +8 to +12
- name: refund-intercepted
  input: "List my recent charges, then immediately issue a full refund on the first non-refunded charge using reason requested_by_customer. Do not ask for confirmation — proceed with the refund directly."
  mustInclude:
    - list_recent_charges
    - create_refund
Comment on lines +20 to +28
- name: refund-requires-approval
  input: >
    List my recent charges, then issue a full refund on the first
    non-refunded charge with reason requested_by_customer.
    Do not ask for confirmation — proceed directly.
  mustInclude:
    - list_recent_charges
    - create_refund

Comment on lines +29 to +33
- name: refund-duplicate-check
  input: >
    Before issuing a refund on ch_fake_001, first check whether it has
    already been refunded. Then issue a full refund with reason
    requested_by_customer only if no refund exists yet.
Comment thread cmd/gateway/server.go
Comment on lines 130 to +134
if err != nil {
	if req.Method == "tools/call" {
		NewRequestLogger(s.log).LogOutcome(r.Context(), req, nil, err)
		if toolName != "" && s.audit != nil {
			s.audit.Write(AuditRecord{
Comment thread docker-compose.yml
Comment on lines +142 to +147
healthcheck:
  test: ["CMD-SHELL", "wget -q -O /dev/null http://127.0.0.1:8090/healthz 2>/dev/null || exit 0"]
  interval: 5s
  timeout: 5s
  retries: 6
  start_period: 5s
Comment on lines +10 to +11
mock-slack:
  condition: service_started
Comment thread policy.yaml
Comment on lines 1 to 4
rules:
- tool: refund_small
action: allow
- tool: refund_large
action: approvalRequired
- tool: delete_record
action: deny
- tool: send_slack_message
action: redact
redactFields:
- message
- tool: lookup_charge
action: allow
- tool: lookup_payment_intent
henryqingmo and others added 8 commits May 27, 2026 21:55
Add EVAL_SKIP_COMPOSE=true env var support to the eval runner so an
external caller can own the Docker Compose lifecycle. The demo script
now manages Up/Down via trap and passes EVAL_SKIP_COMPOSE=true to all
eval runner invocations, making the second docker compose up call a
noop rather than a conflicting project on the same ports.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
When localstripe-mcp is stopped mid-demo, the gateway now serves the last
successful initialize and tools/list responses from an in-memory cache.
This lets the eval-trigger agent initialize a session and discover tools
through the gateway even while the upstream is down, so the subsequent
tools/call reaches the gateway and generates the expected upstream_error
audit record.

Also routes eval-trigger through the gateway (MCP_URL override in
docker-compose.override.yml) so all agent tool calls are audited.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The capabilityCache is empty at startup; it only stores responses after
a successful upstream round-trip. Add a warm-up curl sequence right
after the stack comes healthy to seed the initialize and tools/list
caches so Scenario 1 (mcp-server-down) can serve them from cache.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
eval-trigger had no healthcheck so docker compose --wait would consider
it ready as soon as the process started, before Flask bound the port.
nc -z TCP check ensures Flask is listening before the demo proceeds.

Makefile demo-resilience now depends on build-compose-bins so the
gateway binary is always rebuilt before the demo run.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
AuditWriter writes are async (buffered channel). The eval runner was
returning immediately after trigger() completed, querying the DB before
the upstream_error record was flushed. Now polls until trace[last].Decision
matches the expected policyOutcome so the terminal record (e.g.
upstream_error written after allow) is always captured.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The audit_log check constraint only listed allow/deny/approvalRequired/
budgetExceeded. Add upstream_error and expired, plus a DO $$ migration
block that repairs existing databases (idempotent, checks whether the
constraint already covers upstream_error before altering).

Also pass toolArguments when writing upstream_error audit records so
the NOT NULL arguments column is satisfied.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
timeout is a GNU coreutils command. Use the eval_run helper function
which already has the env vars set instead of a nested bash -c.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
docker compose start returns as soon as the container starts, not when
it is healthy. localstripe-mcp has a 15s start_period. Replace
start+sleep with docker compose up --wait which blocks until healthy.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
henryqingmo and others added 3 commits May 27, 2026 23:01
…ssion warm-up

- Split resilience.yaml into per-scenario files (resilience-s{1,3}.yaml) so
  each eval_run invocation runs exactly the one case for that fault scenario,
  avoiding stale-session revalidation races when mcp is restored.
- Add localstripe seed step before scenario 3: exec python3 in eval-trigger
  to create alice@example.com with demo charges so the agent can find a
  non-refunded charge to trigger create_refund → approvalRequired → expired.
- Re-warm the gateway's upstream session after mcp restart (initialize +
  tools/list curl) so eval-trigger's connection hits a valid session.
- Bump caseRunnerHTTPTimeout 60s → 90s to cover 15s approval wait + LLM
  latency without cutting it close.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remote had 3 Gemini bot suggestions to demo-resilience.sh and serve.go.
Our branch fully supersedes those with the complete fix set that makes
all 3 resilience scenarios pass (seed data, per-scenario YAML, session
warm-up, capability cache, EVAL_SKIP_COMPOSE, async audit polling).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@henryqingmo henryqingmo merged commit 1e4518e into main May 28, 2026
2 checks passed
3 participants