fix(upstream): require N consecutive health-check timeouts before marking Error by Dumbris · Pull Request #470 · smart-mcp-proxy/mcpproxy-go

Dumbris · 2026-05-15T07:40:04Z

Summary

The complement to #469. Slow remote upstreams (notably hf.co/mcp under load) routinely miss a single 5-second health-check window without actually being down. The previous code flipped the server to Error on the very first miss, which caused two visible bugs:

1. Tools list goes empty after every toggle. The frontend's syncAfterToolToggle() re-fetches immediately after a tool enable/disable. If a health check timed out in that window, StateView returned no tools and the UI showed "No tools available" until the next reconnect (~30-60s later).
2. Scary "Server Error" alert. Combined with the MCPX_UNKNOWN_UNCLASSIFIED code (see #469), a single miss painted a red banner that paid no real-world dividend: by the next 30s tick the server was Ready again.

Approach

Add a small consecutiveHealthFailures counter to the managed Client.

- Transient errors (deadline exceeded, timeout, context canceled) require healthCheckFailureThreshold = 3 consecutive misses (~90s) before flipping Error.
- Hard errors (connection refused, no such host, network unreachable, connection reset, broken pipe) bypass the counter and trigger Error on the first miss. Those are real outages and waiting helps no one.
- A successful health check resets the counter to zero.
- A fresh Connect() success resets the counter so reconnect cycles don't carry stale debt.
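The four rules above can be sketched as a small state machine. This is a minimal illustration, not the PR's code: the `client` struct, the `outcome` type, and the `recordHealthCheck` / `recordConnectSuccess` names are hypothetical, and only `consecutiveHealthFailures` and `healthCheckFailureThreshold` come from the description.

```go
package main

import "fmt"

// healthCheckFailureThreshold is the value named in the PR description.
const healthCheckFailureThreshold = 3

// outcome is a hypothetical classification of one health-check result.
type outcome int

const (
	outcomeOK        outcome = iota // the check succeeded
	outcomeTransient                // deadline exceeded, timeout, context canceled
	outcomeHard                     // connection refused, no such host, and similar
)

// client is a hypothetical, stripped-down stand-in for the managed Client.
type client struct {
	state                     string // "Ready" or "Error"
	consecutiveHealthFailures int
}

// recordHealthCheck applies one health-check result to the client.
func (c *client) recordHealthCheck(o outcome) {
	switch o {
	case outcomeOK:
		// A successful check wipes the slate.
		c.consecutiveHealthFailures = 0
	case outcomeHard:
		// Hard errors bypass the counter entirely.
		c.state = "Error"
	case outcomeTransient:
		// Transient errors only flip the state at the threshold.
		c.consecutiveHealthFailures++
		if c.consecutiveHealthFailures >= healthCheckFailureThreshold {
			c.state = "Error"
		}
	}
}

// recordConnectSuccess models a fresh Connect() success.
func (c *client) recordConnectSuccess() {
	c.state = "Ready"
	c.consecutiveHealthFailures = 0
}

func main() {
	c := &client{state: "Ready"}
	c.recordHealthCheck(outcomeTransient)
	c.recordHealthCheck(outcomeTransient)
	fmt.Println(c.state, c.consecutiveHealthFailures) // Ready 2
	c.recordHealthCheck(outcomeTransient)
	fmt.Println(c.state, c.consecutiveHealthFailures) // Error 3
	c.recordConnectSuccess()
	fmt.Println(c.state, c.consecutiveHealthFailures) // Ready 0
}
```

The sketch leaves the counter untouched on a hard error, which is one reading of "bypass the counter"; the real implementation may differ.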

The isTransientHealthCheckError helper is a strict subset of the existing isConnectionError predicate — it doesn't change which errors are considered connection failures, only whether they get the multi-failure tolerance.
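The subset relationship can be illustrated as follows. This sketch assumes both predicates match on substrings of the error message; the PR description does not say how the real helpers classify errors, so the marker lists, the `containsAny` helper, and the sample errors are all hypothetical. Only the two predicate names and the error categories come from the description.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Marker lists follow the categories in the PR description. Matching on
// message text is an assumption of this sketch.
var transientMarkers = []string{"deadline exceeded", "timeout", "context canceled"}

// "network is unreachable" is the usual OS wording for the
// "network unreachable" category.
var hardMarkers = []string{
	"connection refused", "no such host", "network is unreachable",
	"connection reset", "broken pipe",
}

// containsAny reports whether err's message contains any marker.
func containsAny(err error, markers []string) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	for _, m := range markers {
		if strings.Contains(msg, m) {
			return true
		}
	}
	return false
}

// isTransientHealthCheckError selects errors that get multi-failure tolerance.
func isTransientHealthCheckError(err error) bool {
	return containsAny(err, transientMarkers)
}

// isConnectionError is assumed here to cover transient and hard errors,
// so every transient error is also a connection error.
func isConnectionError(err error) bool {
	return containsAny(err, transientMarkers) || containsAny(err, hardMarkers)
}

// Hypothetical sample errors for the demonstration below.
var (
	errSampleTimeout = errors.New("context deadline exceeded")
	errSampleRefused = errors.New("dial tcp 127.0.0.1:9: connect: connection refused")
	errSampleOther   = errors.New("unexpected status 404")
)

func main() {
	for _, err := range []error{errSampleTimeout, errSampleRefused, errSampleOther} {
		fmt.Printf("%q connection=%v transient=%v\n",
			err.Error(), isConnectionError(err), isTransientHealthCheckError(err))
	}
}
```

Under these assumptions a timeout is both a connection error and transient, a refused connection is a connection error but not transient, and an unrelated error is neither.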

Test plan

go test ./internal/upstream/managed/ -race: green, including the new cases (four behaviors plus the classifier matrix):
- TestHealthCheck_TransientTimeoutToleratedBelowThreshold — counter ticks but state stays Ready until threshold
- TestHealthCheck_HardErrorTriggersImmediateError — connection-refused → Error on first miss
- TestHealthCheck_SuccessResetsCounter — recovery wipes the slate
- TestHealthCheck_ResetOnConnect — reconnect starts fresh
- TestIsTransientHealthCheckError — matrix of error → category
go build ./... clean

What this does NOT do

It doesn't bump the per-call timeout (still 5s) or change the tick (still 30s). For an outage that surfaces only as timeouts, the server is now marked Error after ~90s instead of ~30s, an acceptable trade for eliminating false flap alerts on slow remote upstreams.
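The ~30s and ~90s figures follow from multiplying the tick by the number of consecutive misses required. A rough sketch of that arithmetic, where `healthCheckInterval` and `timeToError` are hypothetical names and the 5s per-call timeout is ignored for simplicity:

```go
package main

import (
	"fmt"
	"time"
)

const (
	healthCheckInterval         = 30 * time.Second // tick, per the PR description
	healthCheckFailureThreshold = 3                // consecutive transient misses
)

// timeToError approximates how long a timeout-only outage runs before the
// server is marked Error: one tick per required consecutive miss.
func timeToError(interval time.Duration, threshold int) time.Duration {
	return time.Duration(threshold) * interval
}

func main() {
	fmt.Println(timeToError(healthCheckInterval, 1))                           // 30s
	fmt.Println(timeToError(healthCheckInterval, healthCheckFailureThreshold)) // 1m30s
}
```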

🤖 Generated with Claude Code

…king Error

Slow remote upstreams (notably hf.co/mcp under load) routinely miss a single 5-second health-check window without actually being down. The previous code flipped the server to Error on the very first miss, which caused two visible bugs:

1. The Web UI's tools list went empty for ~30-60s every time the user toggled a tool, because the post-toggle re-fetch hit the State=Error window where StateView returns no tools.
2. Combined with the unclassified-error code (PR #469), the user saw a red "Server Error / MCPX_UNKNOWN_UNCLASSIFIED" alert that paid no real-world dividend: by the next 30s tick the server was Ready again.

Add a small consecutive-failure counter to the managed Client. Transient errors (deadline exceeded, timeout, context canceled) require healthCheckFailureThreshold=3 misses (~90s) before flipping Error. Hard errors (connection refused, no such host, network unreachable, etc.) bypass the counter and trigger Error on the first miss; those are real outages and waiting helps no one. A successful health check or a fresh Connect() resets the counter to zero.

Tests cover all four behaviors: tolerated transient, immediate hard, success-resets, and connect-resets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cloudflare-workers-and-pages · 2026-05-15T07:41:25Z

Deploying mcpproxy-docs with Cloudflare Pages

Latest commit:	`e57adae`
Status:	✅ Deploy successful!
Preview URL:	https://2db2bb92.mcpproxy-docs.pages.dev
Branch Preview URL:	https://fix-health-check-flap.mcpproxy-docs.pages.dev

Dumbris wants to merge 1 commit into main from fix/health-check-flap
