fix(upstream): require N consecutive health-check timeouts before marking Error #470

Open: Dumbris wants to merge 1 commit into main from fix/health-check-flap

@Dumbris commented May 15, 2026

Summary

The complement to #469. Slow remote upstreams (notably hf.co/mcp under load) routinely miss a single 5-second health-check window without actually being down. The previous code flipped the server to Error on the very first miss, which caused two visible bugs:

  1. Tools list went empty after every toggle. The frontend's syncAfterToolToggle() re-fetches immediately after a tool enable/disable. If a health check timed out in that window, StateView returned no tools and the UI showed "No tools available" until the next reconnect (~30-60s later).
  2. Scary "Server Error" alert. Combined with the MCPX_UNKNOWN_UNCLASSIFIED code addressed in #469 (fix(diagnostics): classify HTTP timeouts and string-wrapped 5xx as known codes), a single timeout painted a red banner that paid no real-world dividend: by the next 30s tick the server was Ready again.

Approach

Add a small consecutiveHealthFailures counter to the managed Client.

  • Transient errors (deadline exceeded, timeout, context canceled) require healthCheckFailureThreshold = 3 consecutive misses (~90 s) before flipping Error.
  • Hard errors (connection refused, no such host, network unreachable, connection reset, broken pipe) bypass the counter and trigger Error on the first miss — those are real outages and waiting helps no one.
  • A successful health check resets the counter to zero.
  • A fresh Connect() success resets the counter so reconnect cycles don't carry stale debt.
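
The four rules above can be sketched as a small state update. This is a minimal illustration, not the code in this PR: only `consecutiveHealthFailures` and `healthCheckFailureThreshold` come from the description, while `State`, `recordHealthCheck`, `onConnectSuccess`, `isTransient`, and the sentinel errors are hypothetical names chosen for the example.

```go
package main

import (
	"errors"
	"fmt"
)

// State is a simplified stand-in for the managed client's connection state.
type State int

const (
	StateReady State = iota
	StateError
)

// healthCheckFailureThreshold uses the value named in the PR description.
const healthCheckFailureThreshold = 3

// Client is a hypothetical, trimmed-down managed client.
type Client struct {
	state                     State
	consecutiveHealthFailures int
}

// Illustrative sentinel errors for the sketch only.
var (
	errTimeout = errors.New("context deadline exceeded")
	errRefused = errors.New("connection refused")
)

// isTransient is a placeholder classifier (assumption: the real helper
// inspects more error kinds than this).
func isTransient(err error) bool { return errors.Is(err, errTimeout) }

// recordHealthCheck applies the result of one health check.
func (c *Client) recordHealthCheck(err error) {
	switch {
	case err == nil:
		// Success wipes the accumulated misses.
		c.consecutiveHealthFailures = 0
	case !isTransient(err):
		// Hard errors bypass the counter and flip state immediately.
		c.state = StateError
	default:
		// Transient errors only flip state at the threshold.
		c.consecutiveHealthFailures++
		if c.consecutiveHealthFailures >= healthCheckFailureThreshold {
			c.state = StateError
		}
	}
}

// onConnectSuccess models a fresh Connect(): no stale debt is carried over.
func (c *Client) onConnectSuccess() {
	c.consecutiveHealthFailures = 0
	c.state = StateReady
}

func main() {
	c := &Client{}
	results := []error{errTimeout, errTimeout, nil, errTimeout, errTimeout, errTimeout}
	for i, err := range results {
		c.recordHealthCheck(err)
		fmt.Printf("check %d: failures=%d inError=%v\n",
			i+1, c.consecutiveHealthFailures, c.state == StateError)
	}
}
```

In this sketch a success resets the counter but does not by itself move an Error state back to Ready; whether the real client does so is not stated in the description.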

The isTransientHealthCheckError helper is a strict subset of the existing isConnectionError predicate — it doesn't change which errors are considered connection failures, only whether they get the multi-failure tolerance.
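
A rough sketch of how such a subset relationship could look, assuming both predicates match on error text. The function names come from the description above; the marker lists, the `containsAny` helper, the sample errors, and the matching strategy itself are assumptions made for illustration and may differ from the real code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"strings"
)

// Marker lists are taken from the error kinds named in the PR description;
// matching on substrings is an assumption of this sketch.
var (
	transientMarkers = []string{"deadline exceeded", "timeout", "context canceled"}
	hardMarkers      = []string{"connection refused", "no such host",
		"network unreachable", "connection reset", "broken pipe"}
)

// Illustrative inputs, not real log lines.
var (
	sampleTimeout = fmt.Errorf("health check: %w", context.DeadlineExceeded)
	sampleRefused = errors.New("dial tcp 127.0.0.1:9: connect: connection refused")
	sampleOther   = errors.New("401 unauthorized")
)

// containsAny reports whether msg contains at least one marker.
func containsAny(msg string, markers []string) bool {
	for _, m := range markers {
		if strings.Contains(msg, m) {
			return true
		}
	}
	return false
}

// isTransientHealthCheckError reports whether err deserves the
// multi-failure tolerance. Hard markers win if both kinds appear.
func isTransientHealthCheckError(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	if containsAny(msg, hardMarkers) {
		return false
	}
	if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, context.Canceled) {
		return true
	}
	return containsAny(msg, transientMarkers)
}

// isConnectionError is a stand-in for the existing predicate: every
// transient error is also a connection error, but not vice versa.
func isConnectionError(err error) bool {
	if err == nil {
		return false
	}
	msg := strings.ToLower(err.Error())
	return containsAny(msg, hardMarkers) || isTransientHealthCheckError(err)
}

func main() {
	for _, err := range []error{sampleTimeout, sampleRefused, sampleOther} {
		fmt.Printf("%q: connection=%v transient=%v\n",
			err.Error(), isConnectionError(err), isTransientHealthCheckError(err))
	}
}
```

Checking hard markers first is a design choice of the sketch: a message that mentions both a timeout and a refused connection is treated as a real outage rather than tolerated.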

Test plan

  • go test ./internal/upstream/managed/ -race is green, including the five new test functions (four behavior cases plus one helper matrix):
    • TestHealthCheck_TransientTimeoutToleratedBelowThreshold — counter ticks but state stays Ready until threshold
    • TestHealthCheck_HardErrorTriggersImmediateError — connection-refused → Error on first miss
    • TestHealthCheck_SuccessResetsCounter — recovery wipes the slate
    • TestHealthCheck_ResetOnConnect — reconnect starts fresh
    • TestIsTransientHealthCheckError — matrix of error → category
  • go build ./... clean

What this does NOT do

It doesn't bump the per-call timeout (still 5s) or change the tick (still 30s). A real outage that shows up only as timeouts is now flagged after ~90s instead of ~30s; hard errors are still flagged on the first miss. That is an acceptable trade for eliminating false flap alerts on slow remote upstreams.
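
For a sense of where the ~90s figure comes from, here is a back-of-the-envelope calculation. It assumes checks start on a fixed 30s tick and that each failing check takes the full 5s timeout to report; the real scheduler may behave differently, and the function and constant names other than `healthCheckFailureThreshold` are made up for the example.

```go
package main

import (
	"fmt"
	"time"
)

// Values stated in the PR description.
const (
	healthCheckTick             = 30 * time.Second
	healthCheckTimeout          = 5 * time.Second
	healthCheckFailureThreshold = 3
)

// bestCaseDetection: the outage begins just before a check starts, so the
// first miss is immediate and threshold-1 further ticks are needed.
func bestCaseDetection(tick, timeout time.Duration, threshold int) time.Duration {
	return time.Duration(threshold-1)*tick + timeout
}

// worstCaseDetection: the outage begins just after a check started, so a
// full tick passes before the first miss is even attempted.
func worstCaseDetection(tick, timeout time.Duration, threshold int) time.Duration {
	return time.Duration(threshold)*tick + timeout
}

func main() {
	best := bestCaseDetection(healthCheckTick, healthCheckTimeout, healthCheckFailureThreshold)
	worst := worstCaseDetection(healthCheckTick, healthCheckTimeout, healthCheckFailureThreshold)
	fmt.Println("timeout-only outage flagged between", best, "and", worst)
}
```

Under these assumptions the window is 65s to 95s, which is consistent with the ~90s quoted above.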

🤖 Generated with Claude Code

…king Error

Slow remote upstreams (notably hf.co/mcp under load) routinely miss a
single 5-second health-check window without actually being down. The
previous code flipped the server to Error on the very first miss, which
caused two visible bugs:

1. The Web UI's tools list went empty for ~30-60s every time the user
   toggled a tool, because the post-toggle re-fetch hit the State=Error
   window where StateView returns no tools.
2. Combined with the unclassified-error code (PR #469), the user saw a
   red "Server Error / MCPX_UNKNOWN_UNCLASSIFIED" alert that paid no
   real-world dividend — by the next 30s tick the server was Ready again.

Add a small consecutive-failure counter to the managed Client. Transient
errors (deadline exceeded, timeout, context canceled) require
healthCheckFailureThreshold=3 misses (~90s) before flipping Error. Hard
errors (connection refused, no such host, network unreachable, etc.)
bypass the counter and trigger Error on the first miss — those are real
outages and waiting helps no one. A successful health check or a fresh
Connect() resets the counter to zero.

Tests cover all four behaviors: tolerated transient, immediate hard,
success-resets, and connect-resets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cloudflare-workers-and-pages

Deploying mcpproxy-docs with Cloudflare Pages

Latest commit: e57adae
Status: ✅  Deploy successful!
Preview URL: https://2db2bb92.mcpproxy-docs.pages.dev
Branch Preview URL: https://fix-health-check-flap.mcpproxy-docs.pages.dev
