feat: eval runner operator UI with resilience demo by henryqingmo · Pull Request #9 · K8Harness/ToolGate

henryqingmo · 2026-05-28T08:22:17Z

Summary

- Adds a browser-based operator UI (/cmd/eval-runner --serve) for running the three ToolGate resilience scenarios with real-time SSE result streaming
- Fixes the stack health probe, which was hitting /healthz (404) instead of /mcp (200) on the gateway and so always showed Gateway as down
- Adds warmGatewayCapCache() on eval-server startup to prime the gateway's tools/list cache while all services are healthy, which the MCP Crash scenario needs in order to pass when the upstream is later stopped
- Adds README.md with per-scenario setup instructions, warmup commands, and capability cache explanation
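
As a rough illustration of the cache warmup described above, the sketch below posts a single JSON-RPC tools/list request to a gateway URL. It is not the PR's warmGatewayCapCache(); the function name warmToolsList, the request body, and the stand-in gateway built with httptest are all assumptions, and a real MCP endpoint may require an initialize handshake or extra headers before it accepts tools/list.

```go
package main

import (
	"bytes"
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// warmToolsList (hypothetical name) posts one JSON-RPC tools/list request so
// the gateway can cache the response while the upstream is still healthy.
func warmToolsList(ctx context.Context, client *http.Client, gatewayURL string) error {
	body := []byte(`{"jsonrpc":"2.0","id":1,"method":"tools/list"}`)
	req, err := http.NewRequestWithContext(ctx, http.MethodPost, gatewayURL, bytes.NewReader(body))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/json")
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("tools/list warmup: unexpected status %d", resp.StatusCode)
	}
	return nil
}

// demoWarm runs the warmup against a stand-in gateway and reports which
// JSON-RPC method the stand-in received.
func demoWarm() (method string, err error) {
	gateway := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var rpc struct {
			Method string `json:"method"`
		}
		_ = json.NewDecoder(r.Body).Decode(&rpc)
		method = rpc.Method
		w.WriteHeader(http.StatusOK)
	}))
	defer gateway.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	err = warmToolsList(ctx, &http.Client{}, gateway.URL+"/mcp")
	return method, err
}

func main() {
	method, err := demoWarm()
	fmt.Println(method, err)
}
```

Returning an error instead of exiting lets the caller decide whether a failed warmup should block startup or only be logged.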

Scenarios covered

| Scenario | Fault injected | Expected audit trail |
| --- | --- | --- |
| MCP Crash | `localstripe-mcp` stopped | `list_recent_charges → allow → upstream_error` |
| Retry Storm | MCP still down | 5× `allow → budgetExceeded` |
| Approval Timeout | `mock-slack` stopped | `list_recent_charges → allow`, `create_refund → approvalRequired → expired` |
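
One way to read the expected audit trails is as simple checks over the rows returned by the audit query. The sketch below is an assumption about how helpers such as hasDecision and lastDecision (both referenced in the review diff) might look; the TraceRow fields and the sample rows are made up for illustration and are not taken from the PR.

```go
package main

import "fmt"

// TraceRow is a hypothetical shape for one audit row; the real type in the
// PR may carry different fields.
type TraceRow struct {
	Tool     string
	Decision string
}

// hasDecision reports whether any row in the trace carries the given decision.
func hasDecision(trace []TraceRow, decision string) bool {
	for _, row := range trace {
		if row.Decision == decision {
			return true
		}
	}
	return false
}

// lastDecision returns the decision of the most recent row, or "" when the
// trace is empty.
func lastDecision(trace []TraceRow) string {
	if len(trace) == 0 {
		return ""
	}
	return trace[len(trace)-1].Decision
}

func main() {
	// Illustrative rows only, not real audit output.
	trace := []TraceRow{
		{Tool: "list_recent_charges", Decision: "allow"},
		{Tool: "list_recent_charges", Decision: "budgetExceeded"},
	}
	fmt.Println(hasDecision(trace, "budgetExceeded"), lastDecision(trace)) // prints "true budgetExceeded"
}
```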

Test plan

make build-compose-bins && docker compose up -d --wait
POSTGRES_DSN=... AGENT_URL=http://127.0.0.1:18086 go run ./cmd/eval-runner --serve evalsuite/resilience.yaml
Open http://localhost:8099, run each scenario per README setup steps, confirm all three PASS
make demo-resilience passes 3/3 headlessly

🤖 Generated with Claude Code

- Stack health was probing /healthz (404) instead of /mcp (200) on the gateway, causing it to always show as down
- Add warmGatewayCapCache() called on eval-server startup to prime the gateway's tools/list cache while all services are healthy; fixes MCP Crash scenario failing when cache is cold
- Write README with per-scenario setup instructions, warmup commands, and capability cache explanation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

gemini-code-assist

Code Review

This pull request enhances the ToolGate evaluation runner by introducing a streaming-enabled "Resilience Operator UI" and several new backend endpoints. Key additions include a redesigned frontend dashboard with preset scenarios, a dedicated Retry Storm executor, stack health probes, and SSE-based streaming for evaluation results. The review feedback highlights critical resource management improvements, including reusing a single HTTP client in the stack health handler to prevent socket exhaustion, replacing time.After with a reusable time.Ticker in the polling loop to avoid memory leaks, and checking for context cancellation during test suite streaming to terminate early if a client disconnects.

gemini-code-assist · 2026-05-28T08:23:42Z

+func makeStackHealthHandler(deps stackHealthDeps) http.HandlerFunc {
+	return func(w http.ResponseWriter, r *http.Request) {
+		w.Header().Set("Content-Type", "application/json")
+		_ = json.NewEncoder(w).Encode(stackHealthResponse{
+			Services: []stackHealthService{
+				probeHTTPService(deps.httpClient, "Gateway", "http://localhost:18080/mcp"),
+				probeTCPService("MCP", "127.0.0.1:18421"),
+				probeHTTPService(deps.httpClient, "Slack", "http://localhost:18090/healthz"),
+				probePostgresService(deps.pool),
+			},
+		})
+	}
+}


The deps.httpClient field is left uninitialized (nil) when the handler is registered in serve.go, so probeHTTPService instantiates a new http.Client on every probe call. Because the health endpoint is polled frequently, repeatedly creating clients can lead to socket exhaustion (sockets lingering in TIME_WAIT) and resource leaks, particularly if each client gets its own transport and connection pool.

Initializing deps.httpClient once inside makeStackHealthHandler if it is nil ensures that a single client is reused across all requests.

func makeStackHealthHandler(deps stackHealthDeps) http.HandlerFunc {
	if deps.httpClient == nil {
		deps.httpClient = &http.Client{Timeout: stackHealthProbeTimeout}
	}
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(stackHealthResponse{
			Services: []stackHealthService{
				probeHTTPService(deps.httpClient, "Gateway", "http://localhost:18080/mcp"),
				probeTCPService("MCP", "127.0.0.1:18421"),
				probeHTTPService(deps.httpClient, "Slack", "http://localhost:18090/healthz"),
				probePostgresService(deps.pool),
			},
		})
	}
}
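
For context, a probe that takes a shared client might look roughly like the sketch below. The body of probeHTTPService is not shown in this review, so this version, the fields of stackHealthService, and the rule that only 2xx counts as up are assumptions. The stand-in gateway serves /mcp with 200 and leaves every other path to the mux's default 404, mirroring the /healthz versus /mcp behavior described in the PR summary.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"time"
)

// stackHealthService is a hypothetical result shape for one probed service.
type stackHealthService struct {
	Name string
	Up   bool
}

// probeHTTPService issues a GET with the caller's client and treats any 2xx
// status as "up". Reusing one client lets its transport reuse connections.
func probeHTTPService(client *http.Client, name, url string) stackHealthService {
	resp, err := client.Get(url)
	if err != nil {
		return stackHealthService{Name: name}
	}
	defer resp.Body.Close()
	// Drain the body so the underlying connection can be reused.
	_, _ = io.Copy(io.Discard, resp.Body)
	up := resp.StatusCode >= 200 && resp.StatusCode < 300
	return stackHealthService{Name: name, Up: up}
}

// demoProbe probes /mcp and /healthz on a stand-in gateway with one client.
func demoProbe() (mcpUp, healthzUp bool) {
	mux := http.NewServeMux()
	mux.HandleFunc("/mcp", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	gateway := httptest.NewServer(mux)
	defer gateway.Close()

	client := &http.Client{Timeout: 2 * time.Second}
	mcpUp = probeHTTPService(client, "Gateway", gateway.URL+"/mcp").Up
	healthzUp = probeHTTPService(client, "Gateway", gateway.URL+"/healthz").Up
	return mcpUp, healthzUp
}

func main() {
	mcpUp, healthzUp := demoProbe()
	fmt.Println(mcpUp, healthzUp) // prints "true false"
}
```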

gemini-code-assist · 2026-05-28T08:23:42Z

+	deadline := time.Now().Add(e.pollTimeout)
+	var trace []TraceRow
+	for {
+		trace, err = e.queryTrace(ctx, sessionID)
+		if err != nil {
+			result.Failures = []CheckFailure{{
+				Check:    "run",
+				Expected: "audit query succeeds",
+				Observed: err.Error(),
+			}}
+			return result
+		}
+		result.Trace = trace
+		if hasDecision(trace, "budgetExceeded") {
+			result.Passed = true
+			return result
+		}
+		if time.Now().After(deadline) {
+			result.Failures = []CheckFailure{{
+				Check:    "policyOutcome",
+				Expected: "budgetExceeded",
+				Observed: lastDecision(trace),
+			}}
+			return result
+		}
+
+		select {
+		case <-ctx.Done():
+			result.Failures = []CheckFailure{{
+				Check:    "run",
+				Expected: "context remains active",
+				Observed: ctx.Err().Error(),
+			}}
+			return result
+		case <-time.After(e.pollInterval):
+		}
+	}


Using time.After inside a loop can leak memory and timer resources: before Go 1.23, the underlying timer is not garbage collected until it fires, even if the loop terminates early or the context is canceled.

Instead, instantiate a single time.Ticker outside the loop and reuse it. This is more efficient and avoids potential leaks.

	deadline := time.Now().Add(e.pollTimeout)
	var trace []TraceRow
	ticker := time.NewTicker(e.pollInterval)
	defer ticker.Stop()
	for {
		trace, err = e.queryTrace(ctx, sessionID)
		if err != nil {
			result.Failures = []CheckFailure{{
				Check:    "run",
				Expected: "audit query succeeds",
				Observed: err.Error(),
			}}
			return result
		}
		result.Trace = trace
		if hasDecision(trace, "budgetExceeded") {
			result.Passed = true
			return result
		}
		if time.Now().After(deadline) {
			result.Failures = []CheckFailure{{
				Check:    "policyOutcome",
				Expected: "budgetExceeded",
				Observed: lastDecision(trace),
			}}
			return result
		}

		select {
		case <-ctx.Done():
			result.Failures = []CheckFailure{{
				Check:    "run",
				Expected: "context remains active",
				Observed: ctx.Err().Error(),
			}}
			return result
		case <-ticker.C:
		}
	}
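
The same ticker pattern can be factored into a small standalone helper, sketched below. The names pollUntil, errPollTimeout, and demoPoll are hypothetical and do not appear in the PR. One behavioral difference from time.After is worth keeping in mind: a ticker fires at a fixed cadence measured from its creation, so if a check takes longer than the interval, the next tick may already be waiting and the loop proceeds without a further pause.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errPollTimeout is returned when the deadline passes before check succeeds.
var errPollTimeout = errors.New("poll timed out")

// pollUntil calls check immediately and then once per tick until it reports
// done, returns an error, the timeout elapses, or ctx is canceled.
func pollUntil(ctx context.Context, interval, timeout time.Duration, check func() (bool, error)) error {
	deadline := time.Now().Add(timeout)
	ticker := time.NewTicker(interval)
	defer ticker.Stop() // release the ticker on every return path
	for {
		done, err := check()
		if err != nil {
			return err
		}
		if done {
			return nil
		}
		if time.Now().After(deadline) {
			return errPollTimeout
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
		}
	}
}

// demoPoll polls a condition that becomes true on the third attempt.
func demoPoll() (attempts int, err error) {
	err = pollUntil(context.Background(), 5*time.Millisecond, time.Second, func() (bool, error) {
		attempts++
		return attempts == 3, nil
	})
	return attempts, err
}

func main() {
	attempts, err := demoPoll()
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```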

gemini-code-assist · 2026-05-28T08:23:42Z

+func streamEvalSuite(ctx context.Context, w http.ResponseWriter, runner caseExecutor, cases []EvalCase) {
+	results := make([]CaseResult, 0, len(cases))
+	total := len(cases)
+	for index, testCase := range cases {
+		if err := writeSSE(w, "case_start", caseStartEvent{Name: testCase.Name, Index: index, Total: total}); err != nil {
+			return
+		}
+		result := runEvalCase(ctx, runner, testCase)
+		results = append(results, result)
+		if err := writeSSE(w, "case_result", caseResultEvent{Index: index, Total: total, Result: result}); err != nil {
+			return
+		}
+	}
+	_ = writeSSE(w, "summary", summarizeResults(results))
+}


The streamEvalSuite function iterates over all test cases without checking if the request context has been canceled. If a client disconnects mid-run, the server will continue executing subsequent test cases (which can be slow or resource-intensive) unnecessarily.

Adding a check for ctx.Err() != nil at the start of each iteration allows the loop to terminate early when the client disconnects.

func streamEvalSuite(ctx context.Context, w http.ResponseWriter, runner caseExecutor, cases []EvalCase) {
	results := make([]CaseResult, 0, len(cases))
	total := len(cases)
	for index, testCase := range cases {
		if ctx.Err() != nil {
			return
		}
		if err := writeSSE(w, "case_start", caseStartEvent{Name: testCase.Name, Index: index, Total: total}); err != nil {
			return
		}
		result := runEvalCase(ctx, runner, testCase)
		results = append(results, result)
		if err := writeSSE(w, "case_result", caseResultEvent{Index: index, Total: total, Result: result}); err != nil {
			return
		}
	}
	_ = writeSSE(w, "summary", summarizeResults(results))
}
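
The review does not show writeSSE itself. A minimal sketch of what such a helper could look like follows, assuming it JSON-encodes the payload into a single data line and flushes after each event; the demoEvent type and its fields are hypothetical. Because the event is flushed immediately, a write to a disconnected client has a chance to surface as an error, which complements the ctx.Err() check at the top of the loop.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// writeSSE writes one Server-Sent Events frame ("event:" line, "data:" line,
// blank line) and flushes it if the ResponseWriter supports flushing.
func writeSSE(w http.ResponseWriter, event string, payload any) error {
	data, err := json.Marshal(payload) // compact JSON, so it fits on one data line
	if err != nil {
		return err
	}
	if _, err := fmt.Fprintf(w, "event: %s\ndata: %s\n\n", event, data); err != nil {
		return err
	}
	if flusher, ok := w.(http.Flusher); ok {
		flusher.Flush()
	}
	return nil
}

// demoEvent is a hypothetical payload used only for this illustration.
type demoEvent struct {
	Name  string `json:"name"`
	Index int    `json:"index"`
}

// demoSSE writes one frame to an in-memory recorder and returns the raw body.
func demoSSE() string {
	rec := httptest.NewRecorder()
	_ = writeSSE(rec, "case_start", demoEvent{Name: "MCP Crash", Index: 0})
	return rec.Body.String()
}

func main() {
	fmt.Print(demoSSE())
}
```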

henryqingmo and others added 3 commits May 28, 2026 00:11

docs: spec eval runner operator UI

bf868f0

feat: eval runner operator UI with streaming scenario runner

b17d015

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Copilot AI review requested due to automatic review settings May 28, 2026 08:22

Copilot started reviewing on behalf of henryqingmo May 28, 2026 08:22

gemini-code-assist Bot reviewed May 28, 2026

henryqingmo merged commit 8a20294 into main May 28, 2026
2 of 3 checks passed

henryqingmo review requested due to automatic review settings May 28, 2026 08:45

henryqingmo merged 3 commits into main from eval-gate2