fix(api): debounce DB health gate so transient ping timeouts don't 503 every /api/* #70
Merged
…3 every /api/* request

DBHealthMiddleware was flipping healthy=false on a single ping failure and serving 503 to every /api/* and /v1/* request until the next 5s poll recovered. Under SQLite (MaxOpen=1) with concurrent ingest + API load, the 5s poller's 2s ping timeout routinely lost the connection-pool race to an in-flight write, producing user-visible 503 windows even though the DB itself was perfectly healthy.

Reproduction with the chaos simulator running:
- 200 concurrent /api/metrics/dashboard → 503 × 200 (100% fatal)
- 5 page-loads × 9 concurrent calls → 503 × 45 (100% fatal)

Fix: require failureThreshold (default 3) consecutive failed pings before flipping healthy=false. A single success resets the streak counter and restores healthy immediately, so recovery stays fast. Three failures × 5s interval = ~15s of real outage before the gate trips, which comfortably swallows transient pool contention but still surfaces a genuine DB outage well within the 30s readiness SLO.

Post-fix on the same load:
- 200 concurrent /api/metrics/dashboard → 200 × 100, 429 × 100
- 50 concurrent /api/traces → 200 × 43, 429 × 7
- 5 page-loads × 9 concurrent → 200 × 36, 429 × 9

The 429s are the configured API_RATE_LIMIT_RPS=100 policy under synthetic burst; they are not visible to real browser sessions arriving at human speed.

Tests in internal/api/dbhealth_test.go cover:
- single failure does not trip the gate (debounce)
- exactly threshold consecutive failures flips healthy=false
- a single success resets the streak counter
- SetFailureThreshold normalises non-positive input to 1 (legacy any-failure-trips behaviour for callers that opt out)

The existing TestDBHealth_TogglesOnPingFailure still passes: its 10ms-interval / 500ms-deadline gives ~50 ticks, far above the threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>



Summary
`DBHealthMiddleware` was flipping `healthy=false` on a single failed ping and serving 503 to every `/api/*` and `/v1/*` request until the next 5s poll recovered. Under SQLite (`MaxOpen=1`) with concurrent ingest + API load, the 5s poller's 2s ping timeout routinely lost the connection-pool race to an in-flight write, producing user-visible 503 windows even though the DB was perfectly healthy.
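Reproduction (with the chaos simulator running on this branch's parent commit)
- 200 concurrent `/api/metrics/dashboard` → 503 × 200 (100% fatal)
- 5 page-loads × 9 concurrent calls → 503 × 45 (100% fatal)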
Fix
Require `failureThreshold` (default 3) consecutive failed pings before flipping `healthy=false`. A single success resets the streak counter and restores `healthy=true` immediately — recovery stays fast.
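The debounce rule can be sketched roughly as follows. This is an illustration under assumptions, not the code merged in this PR: `DebouncedGate`, `NewDebouncedGate`, `Record`, `Healthy`, and `errPingTimeout` are hypothetical names, and only `failureThreshold` and `SetFailureThreshold` are taken from the description.

```go
package main

import (
	"errors"
	"sync"
)

// errPingTimeout is a hypothetical sentinel standing in for a failed ping.
var errPingTimeout = errors.New("ping timeout")

// DebouncedGate is a hypothetical stand-in for the health state behind
// DBHealthMiddleware. It flips to unhealthy only after failureThreshold
// consecutive failed pings.
type DebouncedGate struct {
	mu               sync.Mutex
	healthy          bool
	consecutiveFails int
	failureThreshold int
}

// NewDebouncedGate starts healthy with the given threshold.
func NewDebouncedGate(threshold int) *DebouncedGate {
	g := &DebouncedGate{healthy: true}
	g.SetFailureThreshold(threshold)
	return g
}

// SetFailureThreshold normalises non-positive input to 1, which gives the
// legacy behaviour where any single failure trips the gate.
func (g *DebouncedGate) SetFailureThreshold(n int) {
	if n < 1 {
		n = 1
	}
	g.mu.Lock()
	g.failureThreshold = n
	g.mu.Unlock()
}

// Record feeds one ping result into the gate. A nil error resets the streak
// and restores healthy immediately; an error extends the streak and trips
// the gate once the streak reaches the threshold.
func (g *DebouncedGate) Record(pingErr error) {
	g.mu.Lock()
	defer g.mu.Unlock()
	if pingErr == nil {
		g.consecutiveFails = 0
		g.healthy = true
		return
	}
	g.consecutiveFails++
	if g.consecutiveFails >= g.failureThreshold {
		g.healthy = false
	}
}

// Healthy reports whether requests should be served.
func (g *DebouncedGate) Healthy() bool {
	g.mu.Lock()
	defer g.mu.Unlock()
	return g.healthy
}
```

With a threshold of 3, two calls to `Record(errPingTimeout)` leave `Healthy()` true, the third flips it to false, and a single `Record(nil)` restores it. The asymmetry is deliberate: tripping is slow so that noise is absorbed, while recovery is immediate.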
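`3 × 5s` poll interval = ~15s of real outage before the gate trips, which:
- comfortably swallows transient pool contention
- still surfaces a genuine DB outage well within the 30s readiness SLO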
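Post-fix on the same load
- 200 concurrent `/api/metrics/dashboard` → 200 × 100, 429 × 100
- 50 concurrent `/api/traces` → 200 × 43, 429 × 7
- 5 page-loads × 9 concurrent → 200 × 36, 429 × 9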
The 429s are the configured `API_RATE_LIMIT_RPS=100` policy doing its job under synthetic burst — not visible to real browser sessions arriving at human speed.
Files
The pre-existing `TestDBHealth_TogglesOnPingFailure` still passes — its 10ms interval / 500ms deadline gives ~50 ticks, well above the threshold.
Test plan
🤖 Generated with Claude Code