feat: eval runner operator UI with resilience demo #9
- Stack health was probing /healthz (404) instead of /mcp (200) on the gateway, causing it to always show as down
- Add warmGatewayCapCache() called on eval-server startup to prime the gateway's tools/list cache while all services are healthy; fixes MCP Crash scenario failing when cache is cold
- Write README with per-scenario setup instructions, warmup commands, and capability cache explanation

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Code Review
This pull request enhances the ToolGate evaluation runner by introducing a streaming-enabled "Resilience Operator UI" and several new backend endpoints. Key additions include a redesigned frontend dashboard with preset scenarios, a dedicated Retry Storm executor, stack health probes, and SSE-based streaming for evaluation results. The review feedback highlights critical resource management improvements, including reusing a single HTTP client in the stack health handler to prevent socket exhaustion, replacing time.After with a reusable time.Ticker in the polling loop to avoid memory leaks, and checking for context cancellation during test suite streaming to terminate early if a client disconnects.
func makeStackHealthHandler(deps stackHealthDeps) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(stackHealthResponse{
			Services: []stackHealthService{
				probeHTTPService(deps.httpClient, "Gateway", "http://localhost:18080/mcp"),
				probeTCPService("MCP", "127.0.0.1:18421"),
				probeHTTPService(deps.httpClient, "Slack", "http://localhost:18090/healthz"),
				probePostgresService(deps.pool),
			},
		})
	}
}
The deps.httpClient field is left uninitialized (nil) when the handler is registered in serve.go. This causes probeHTTPService to instantiate a new http.Client on every single probe call. Since the health endpoint is polled frequently, that is wasteful, and if each new client also gets its own Transport, connections are never reused, which can lead to socket exhaustion (sockets piling up in TIME_WAIT) and resource leaks.
Initializing deps.httpClient once inside makeStackHealthHandler if it is nil ensures that a single client is reused across all requests.
func makeStackHealthHandler(deps stackHealthDeps) http.HandlerFunc {
	if deps.httpClient == nil {
		deps.httpClient = &http.Client{Timeout: stackHealthProbeTimeout}
	}
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		_ = json.NewEncoder(w).Encode(stackHealthResponse{
			Services: []stackHealthService{
				probeHTTPService(deps.httpClient, "Gateway", "http://localhost:18080/mcp"),
				probeTCPService("MCP", "127.0.0.1:18421"),
				probeHTTPService(deps.httpClient, "Slack", "http://localhost:18090/healthz"),
				probePostgresService(deps.pool),
			},
		})
	}
}

deadline := time.Now().Add(e.pollTimeout)
var trace []TraceRow
for {
	trace, err = e.queryTrace(ctx, sessionID)
	if err != nil {
		result.Failures = []CheckFailure{{
			Check:    "run",
			Expected: "audit query succeeds",
			Observed: err.Error(),
		}}
		return result
	}
	result.Trace = trace
	if hasDecision(trace, "budgetExceeded") {
		result.Passed = true
		return result
	}
	if time.Now().After(deadline) {
		result.Failures = []CheckFailure{{
			Check:    "policyOutcome",
			Expected: "budgetExceeded",
			Observed: lastDecision(trace),
		}}
		return result
	}

	select {
	case <-ctx.Done():
		result.Failures = []CheckFailure{{
			Check:    "run",
			Expected: "context remains active",
			Observed: ctx.Err().Error(),
		}}
		return result
	case <-time.After(e.pollInterval):
	}
}
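Calling time.After inside a loop allocates a new timer on every iteration. Before Go 1.23, each of those timers stayed alive until it fired, even if the loop had already returned or the context was canceled, which could look like a memory/resource leak. Since Go 1.23 the garbage collector can reclaim unreferenced timers, so the remaining cost is the extra allocation per iteration.
Instead, instantiate a single time.Ticker outside the loop, stop it with defer, and reuse it. This avoids the per-iteration allocation on every Go version.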
deadline := time.Now().Add(e.pollTimeout)
var trace []TraceRow
ticker := time.NewTicker(e.pollInterval)
defer ticker.Stop()
for {
	trace, err = e.queryTrace(ctx, sessionID)
	if err != nil {
		result.Failures = []CheckFailure{{
			Check:    "run",
			Expected: "audit query succeeds",
			Observed: err.Error(),
		}}
		return result
	}
	result.Trace = trace
	if hasDecision(trace, "budgetExceeded") {
		result.Passed = true
		return result
	}
	if time.Now().After(deadline) {
		result.Failures = []CheckFailure{{
			Check:    "policyOutcome",
			Expected: "budgetExceeded",
			Observed: lastDecision(trace),
		}}
		return result
	}
	select {
	case <-ctx.Done():
		result.Failures = []CheckFailure{{
			Check:    "run",
			Expected: "context remains active",
			Observed: ctx.Err().Error(),
		}}
		return result
	case <-ticker.C:
	}
}

func streamEvalSuite(ctx context.Context, w http.ResponseWriter, runner caseExecutor, cases []EvalCase) {
	results := make([]CaseResult, 0, len(cases))
	total := len(cases)
	for index, testCase := range cases {
		if err := writeSSE(w, "case_start", caseStartEvent{Name: testCase.Name, Index: index, Total: total}); err != nil {
			return
		}
		result := runEvalCase(ctx, runner, testCase)
		results = append(results, result)
		if err := writeSSE(w, "case_result", caseResultEvent{Index: index, Total: total, Result: result}); err != nil {
			return
		}
	}
	_ = writeSSE(w, "summary", summarizeResults(results))
}
The streamEvalSuite function iterates over all test cases without checking if the request context has been canceled. If a client disconnects mid-run, the server will continue executing subsequent test cases (which can be slow or resource-intensive) unnecessarily.
Adding a check for ctx.Err() != nil at the start of each iteration allows the loop to terminate early when the client disconnects.
func streamEvalSuite(ctx context.Context, w http.ResponseWriter, runner caseExecutor, cases []EvalCase) {
	results := make([]CaseResult, 0, len(cases))
	total := len(cases)
	for index, testCase := range cases {
		if ctx.Err() != nil {
			return
		}
		if err := writeSSE(w, "case_start", caseStartEvent{Name: testCase.Name, Index: index, Total: total}); err != nil {
			return
		}
		result := runEvalCase(ctx, runner, testCase)
		results = append(results, result)
		if err := writeSSE(w, "case_result", caseResultEvent{Index: index, Total: total, Result: result}); err != nil {
			return
		}
	}
	_ = writeSSE(w, "summary", summarizeResults(results))
}
Summary
- Eval runner operator UI (./cmd/eval-runner --serve) for running the three ToolGate resilience scenarios with real-time SSE result streaming
- Stack health was probing /healthz (404) instead of /mcp (200) on the gateway, which caused Gateway to always show as down
- Add warmGatewayCapCache() on eval-server startup to prime the gateway's tools/list cache while all services are healthy; required for MCP Crash scenario to pass when the upstream is later stopped
- Write README.md with per-scenario setup instructions, warmup commands, and capability cache explanation

Scenarios covered
- localstripe-mcp stopped: list_recent_charges → allow → upstream_error
- allow → budgetExceeded
- mock-slack stopped: list_recent_charges → allow, create_refund → approvalRequired → expired

Test plan
- make build-compose-bins && docker compose up -d --wait
- POSTGRES_DSN=... AGENT_URL=http://127.0.0.1:18086 go run ./cmd/eval-runner --serve evalsuite/resilience.yaml
- make demo-resilience passes 3/3 headlessly

🤖 Generated with Claude Code