fix(upstream): require N consecutive health-check timeouts before marking Error#470
Open
Dumbris wants to merge 1 commit into
Open
fix(upstream): require N consecutive health-check timeouts before marking Error#470Dumbris wants to merge 1 commit into
Dumbris wants to merge 1 commit into
Conversation
…king Error Slow remote upstreams (notably hf.co/mcp under load) routinely miss a single 5-second health-check window without actually being down. The previous code flipped the server to Error on the very first miss, which caused two visible bugs: 1. The Web UI's tools list went empty for ~30-60s every time the user toggled a tool, because the post-toggle re-fetch hit the State=Error window where StateView returns no tools. 2. Combined with the unclassified-error code (PR #469), the user saw a red "Server Error / MCPX_UNKNOWN_UNCLASSIFIED" alert that paid no real-world dividend — by the next 30s tick the server was Ready again. Add a small consecutive-failure counter to the managed Client. Transient errors (deadline exceeded, timeout, context canceled) require healthCheckFailureThreshold=3 misses (~90s) before flipping Error. Hard errors (connection refused, no such host, network unreachable, etc.) bypass the counter and trigger Error on the first miss — those are real outages and waiting helps no one. A successful health check or a fresh Connect() resets the counter to zero. Tests cover all four behaviors: tolerated transient, immediate hard, success-resets, and connect-resets. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Deploying mcpproxy-docs with
|
| Latest commit: |
e57adae
|
| Status: | ✅ Deploy successful! |
| Preview URL: | https://2db2bb92.mcpproxy-docs.pages.dev |
| Branch Preview URL: | https://fix-health-check-flap.mcpproxy-docs.pages.dev |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The complement to #469. Slow remote upstreams (notably
hf.co/mcpunder load) routinely miss a single 5-second health-check window without actually being down. The previous code flipped the server toErroron the very first miss, which caused two visible bugs:syncAfterToolToggle()re-fetches immediately after a tool enable/disable. If a health check timed out in that window, StateView returned no tools and the UI showed "No tools available" until the next reconnect (~30-60s later).MCPX_UNKNOWN_UNCLASSIFIEDpaint a red banner that paid no real-world dividend — by the next 30s tick the server was Ready again.Approach
Add a small
consecutiveHealthFailurescounter to the managed Client.healthCheckFailureThreshold = 3consecutive misses (~90 s) before flipping Error.Connect()success resets the counter so reconnect cycles don't carry stale debt.The
isTransientHealthCheckErrorhelper is a strict subset of the existingisConnectionErrorpredicate — it doesn't change which errors are considered connection failures, only whether they get the multi-failure tolerance.Test plan
go test ./internal/upstream/managed/ -race— green, including the four new cases:TestHealthCheck_TransientTimeoutToleratedBelowThreshold— counter ticks but state stays Ready until thresholdTestHealthCheck_HardErrorTriggersImmediateError— connection-refused → Error on first missTestHealthCheck_SuccessResetsCounter— recovery wipes the slateTestHealthCheck_ResetOnConnect— reconnect starts freshTestIsTransientHealthCheckError— matrix of error → categorygo build ./...cleanWhat this does NOT do
It doesn't bump the per-call timeout (still 5s) or change the tick (still 30s). The minimum perceived recovery time for a real outage is now ~90s instead of ~30s — an acceptable trade for eliminating false flap alerts on slow remote upstreams.
🤖 Generated with Claude Code