feat: eval runner operator UI with resilience demo #9

Merged
henryqingmo merged 3 commits into main from eval-gate2 on May 28, 2026

@henryqingmo
Contributor

Summary

  • Adds a browser-based operator UI (./cmd/eval-runner --serve) for running the three ToolGate resilience scenarios, with results streamed in real time over SSE
  • Fixes the stack health probe, which was requesting /healthz (404) instead of /mcp (200) on the gateway and therefore always reported Gateway as down
  • Adds warmGatewayCapCache(), called on eval-server startup, to prime the gateway's tools/list cache while all services are healthy. This is required for the MCP Crash scenario to pass once the upstream is stopped
  • Adds README.md with per-scenario setup instructions, warmup commands, and capability cache explanation

Scenarios covered

| Scenario | Fault injected | Expected audit trail |
| --- | --- | --- |
| MCP Crash | localstripe-mcp stopped | list_recent_charges → allow → upstream_error |
| Retry Storm | MCP still down | allow → budgetExceeded |
| Approval Timeout | mock-slack stopped | list_recent_charges → allow, create_refund → approvalRequired → expired |
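
One way to read the "Expected audit trail" column is as an ordered list of decisions that must appear in a session's trace. The sketch below shows that check under simplifying assumptions: traceRow and containsInOrder are hypothetical names, and the real TraceRow type in this PR may carry different fields.

```go
package main

import "fmt"

// traceRow is a simplified, hypothetical audit row: one tool call and the
// decision recorded for it.
type traceRow struct {
	Tool     string
	Decision string
}

// containsInOrder reports whether the decisions in want appear in trace in
// the same relative order (other rows may be interleaved).
func containsInOrder(trace []traceRow, want []string) bool {
	i := 0
	for _, row := range trace {
		if i < len(want) && row.Decision == want[i] {
			i++
		}
	}
	return i == len(want)
}

func main() {
	// Trace shaped like the Approval Timeout row of the table.
	trace := []traceRow{
		{Tool: "list_recent_charges", Decision: "allow"},
		{Tool: "create_refund", Decision: "approvalRequired"},
		{Tool: "create_refund", Decision: "expired"},
	}
	fmt.Println(containsInOrder(trace, []string{"allow", "approvalRequired", "expired"})) // true
	fmt.Println(containsInOrder(trace, []string{"allow", "budgetExceeded"}))              // false
}
```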

Test plan

  • make build-compose-bins && docker compose up -d --wait
  • POSTGRES_DSN=... AGENT_URL=http://127.0.0.1:18086 go run ./cmd/eval-runner --serve evalsuite/resilience.yaml
  • Open http://localhost:8099, run each scenario per README setup steps, confirm all three PASS
  • make demo-resilience passes 3/3 headlessly

🤖 Generated with Claude Code

henryqingmo and others added 3 commits May 28, 2026 00:11
- Stack health was probing /healthz (404) instead of /mcp (200) on
  the gateway, causing it to always show as down
- Add warmGatewayCapCache() called on eval-server startup to prime
  the gateway's tools/list cache while all services are healthy;
  fixes MCP Crash scenario failing when cache is cold
- Write README with per-scenario setup instructions, warmup commands,
  and capability cache explanation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 28, 2026 08:22
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request enhances the ToolGate evaluation runner by introducing a streaming-enabled "Resilience Operator UI" and several new backend endpoints. Key additions include a redesigned frontend dashboard with preset scenarios, a dedicated Retry Storm executor, stack health probes, and SSE-based streaming for evaluation results. The review feedback highlights critical resource management improvements, including reusing a single HTTP client in the stack health handler to prevent socket exhaustion, replacing time.After with a reusable time.Ticker in the polling loop to avoid memory leaks, and checking for context cancellation during test suite streaming to terminate early if a client disconnects.

Comment on lines +30 to +42
func makeStackHealthHandler(deps stackHealthDeps) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(stackHealthResponse{
			Services: []stackHealthService{
				probeHTTPService(deps.httpClient, "Gateway", "http://localhost:18080/mcp"),
				probeTCPService("MCP", "127.0.0.1:18421"),
				probeHTTPService(deps.httpClient, "Slack", "http://localhost:18090/healthz"),
				probePostgresService(deps.pool),
			},
		})
	}
}

high

The deps.httpClient field is left uninitialized (nil) when the handler is registered in serve.go. This causes probeHTTPService to instantiate a new http.Client on every single probe call. Since the health endpoint is polled frequently, creating new clients repeatedly can lead to socket exhaustion (TIME_WAIT sockets) and resource leaks.

Initializing deps.httpClient once inside makeStackHealthHandler if it is nil ensures that a single client is reused across all requests.

func makeStackHealthHandler(deps stackHealthDeps) http.HandlerFunc {
	if deps.httpClient == nil {
		deps.httpClient = &http.Client{Timeout: stackHealthProbeTimeout}
	}
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(stackHealthResponse{
			Services: []stackHealthService{
				probeHTTPService(deps.httpClient, "Gateway", "http://localhost:18080/mcp"),
				probeTCPService("MCP", "127.0.0.1:18421"),
				probeHTTPService(deps.httpClient, "Slack", "http://localhost:18090/healthz"),
				probePostgresService(deps.pool),
			},
		})
	}
}

Comment on lines +297 to +333
	deadline := time.Now().Add(e.pollTimeout)
	var trace []TraceRow
	for {
		trace, err = e.queryTrace(ctx, sessionID)
		if err != nil {
			result.Failures = []CheckFailure{{
				Check:    "run",
				Expected: "audit query succeeds",
				Observed: err.Error(),
			}}
			return result
		}
		result.Trace = trace
		if hasDecision(trace, "budgetExceeded") {
			result.Passed = true
			return result
		}
		if time.Now().After(deadline) {
			result.Failures = []CheckFailure{{
				Check:    "policyOutcome",
				Expected: "budgetExceeded",
				Observed: lastDecision(trace),
			}}
			return result
		}

		select {
		case <-ctx.Done():
			result.Failures = []CheckFailure{{
				Check:    "run",
				Expected: "context remains active",
				Observed: ctx.Err().Error(),
			}}
			return result
		case <-time.After(e.pollInterval):
		}
	}

medium

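Calling time.After inside a loop allocates a new timer on every iteration. On Go versions before 1.23, each of those timers stays alive until it fires, even if the loop terminates early or the context is canceled, so memory can be held longer than necessary. Go 1.23 and later can garbage-collect unreferenced timers, but the per-iteration allocation remains.

Instead, create a single time.Ticker outside the loop and reuse it. This is more efficient and avoids the issue on every Go version.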

	deadline := time.Now().Add(e.pollTimeout)
	var trace []TraceRow
	ticker := time.NewTicker(e.pollInterval)
	defer ticker.Stop()
	for {
		trace, err = e.queryTrace(ctx, sessionID)
		if err != nil {
			result.Failures = []CheckFailure{{
				Check:    "run",
				Expected: "audit query succeeds",
				Observed: err.Error(),
			}}
			return result
		}
		result.Trace = trace
		if hasDecision(trace, "budgetExceeded") {
			result.Passed = true
			return result
		}
		if time.Now().After(deadline) {
			result.Failures = []CheckFailure{{
				Check:    "policyOutcome",
				Expected: "budgetExceeded",
				Observed: lastDecision(trace),
			}}
			return result
		}

		select {
		case <-ctx.Done():
			result.Failures = []CheckFailure{{
				Check:    "run",
				Expected: "context remains active",
				Observed: ctx.Err().Error(),
			}}
			return result
		case <-ticker.C:
		}
	}

Comment thread cmd/eval-runner/stream.go
Comment on lines +42 to +56
func streamEvalSuite(ctx context.Context, w http.ResponseWriter, runner caseExecutor, cases []EvalCase) {
	results := make([]CaseResult, 0, len(cases))
	total := len(cases)
	for index, testCase := range cases {
		if err := writeSSE(w, "case_start", caseStartEvent{Name: testCase.Name, Index: index, Total: total}); err != nil {
			return
		}
		result := runEvalCase(ctx, runner, testCase)
		results = append(results, result)
		if err := writeSSE(w, "case_result", caseResultEvent{Index: index, Total: total, Result: result}); err != nil {
			return
		}
	}
	_ = writeSSE(w, "summary", summarizeResults(results))
}

medium

The streamEvalSuite function iterates over all test cases without checking if the request context has been canceled. If a client disconnects mid-run, the server will continue executing subsequent test cases (which can be slow or resource-intensive) unnecessarily.

Adding a check for ctx.Err() != nil at the start of each iteration allows the loop to terminate early when the client disconnects.

func streamEvalSuite(ctx context.Context, w http.ResponseWriter, runner caseExecutor, cases []EvalCase) {
	results := make([]CaseResult, 0, len(cases))
	total := len(cases)
	for index, testCase := range cases {
		if ctx.Err() != nil {
			return
		}
		if err := writeSSE(w, "case_start", caseStartEvent{Name: testCase.Name, Index: index, Total: total}); err != nil {
			return
		}
		result := runEvalCase(ctx, runner, testCase)
		results = append(results, result)
		if err := writeSSE(w, "case_result", caseResultEvent{Index: index, Total: total, Result: result}); err != nil {
			return
		}
	}
	_ = writeSSE(w, "summary", summarizeResults(results))
}

@henryqingmo henryqingmo merged commit 8a20294 into main May 28, 2026
2 of 3 checks passed
@henryqingmo henryqingmo review requested due to automatic review settings May 28, 2026 08:45