fix(api): debounce DB health gate so transient ping timeouts don't 503 every /api/*#70

Merged: aksOps merged 5 commits into `main` from `fix/dbhealth-consecutive-failure-hysteresis` on Apr 29, 2026
@aksOps aksOps commented Apr 29, 2026

Summary

`DBHealthMiddleware` was flipping `healthy=false` on a single failed ping and serving 503 to every `/api/` and `/v1/` request until the next 5s poll recovered. Under SQLite (`MaxOpen=1`) with concurrent ingest + API load, the 5s poller's 2s ping timeout routinely lost the connection-pool race to an in-flight write — producing user-visible 503 windows even though the DB was perfectly healthy.
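
To make the failure mode concrete, here is a minimal sketch of a gate of this kind. It is not the code from `internal/api/dbhealth.go`: `gateMiddleware`, `probe`, and the `/healthz` path used in the usage note are hypothetical names chosen for illustration. Only the gated `/api/` and `/v1/` prefixes and the 503 response come from the description above.

```go
package main

import (
	"net/http"
	"net/http/httptest"
	"strings"
	"sync/atomic"
)

// gateMiddleware is a hypothetical stand-in for DBHealthMiddleware: while the
// shared healthy flag is false, every /api/ and /v1/ request gets a 503.
func gateMiddleware(healthy *atomic.Bool, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		gated := strings.HasPrefix(r.URL.Path, "/api/") ||
			strings.HasPrefix(r.URL.Path, "/v1/")
		if gated && !healthy.Load() {
			http.Error(w, "database unavailable", http.StatusServiceUnavailable)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// probe sends one GET through the gate and returns the resulting status code.
func probe(healthy bool, path string) int {
	var flag atomic.Bool
	flag.Store(healthy)
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, path, nil)
	gateMiddleware(&flag, ok).ServeHTTP(rec, req)
	return rec.Code
}
```

With the flag down, `probe(false, "/api/traces")` yields 503 while an ungated path such as `probe(false, "/healthz")` still reaches the handler. Because one flag guards every gated route, a single spurious flip is enough to fail a whole burst.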

Reproduction (with the chaos simulator running on this branch's parent commit)

| Burst | Result |
| --- | --- |
| 200 concurrent `/api/metrics/dashboard` | `503 × 200` (100% fatal) |
| 50 concurrent `/api/traces` | mostly 429s/503s |
| 5 page-loads × 9 concurrent calls | `503 × 45` (100% fatal) |

Fix

Require `failureThreshold` (default 3) consecutive failed pings before flipping `healthy=false`. A single success resets the streak counter and restores `healthy=true` immediately — recovery stays fast.

`3 × 5s` poll interval = ~15s of real outage before the gate trips, which:

  • Swallows transient pool contention (the actual cause of the 503s)
  • Still surfaces a genuine DB outage well within the readiness SLO
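
The debounce rule described above can be sketched as follows. This is an illustration under assumptions, not the merged implementation: `gate`, `newGate`, `observe`, and `errPing` are hypothetical names, and only the `consecutiveFails` / `failureThreshold` fields, the default of 3, and the "non-positive normalises to 1" rule are taken from this PR's description.

```go
package main

import (
	"errors"
	"sync/atomic"
)

// errPing is a placeholder error standing in for a ping timeout.
var errPing = errors.New("ping timeout")

// gate is a hypothetical, simplified stand-in for the PR's DBHealth type.
type gate struct {
	healthy          atomic.Bool
	consecutiveFails atomic.Int32
	failureThreshold int32
}

// newGate starts healthy. A non-positive threshold is normalised to 1,
// which reproduces the legacy "any failure trips" behaviour.
func newGate(threshold int32) *gate {
	if threshold <= 0 {
		threshold = 1
	}
	g := &gate{failureThreshold: threshold}
	g.healthy.Store(true)
	return g
}

// observe records the outcome of one ping.
func (g *gate) observe(pingErr error) {
	if pingErr == nil {
		// A single success clears the streak and restores health at once.
		g.consecutiveFails.Store(0)
		g.healthy.Store(true)
		return
	}
	// Only a full streak of failures flips the gate.
	if g.consecutiveFails.Add(1) >= g.failureThreshold {
		g.healthy.Store(false)
	}
}
```

With `newGate(3)`, two failed pings leave the gate healthy, the third trips it, and the next successful ping restores it. The asymmetry is deliberate: tripping is slow, recovery is immediate.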

Post-fix on the same load

| Burst | Result |
| --- | --- |
| 200 concurrent `/api/metrics/dashboard` | `200 × 100, 429 × 100` |
| 50 concurrent `/api/traces` | `200 × 43, 429 × 7` |
| 5 page-loads × 9 concurrent | `200 × 36, 429 × 9` |

The 429s are the configured `API_RATE_LIMIT_RPS=100` policy doing its job under synthetic burst — not visible to real browser sessions arriving at human speed.

Files

  • `internal/api/dbhealth.go` — adds `consecutiveFails atomic.Int32` + `failureThreshold int32` to `DBHealth`. `ping()` increments the counter on failure and only calls `markHealthy(false)` when it crosses the threshold; success resets to 0. `SetFailureThreshold(n)` exposed for ops; `n <= 0` normalises to 1 (legacy any-failure-trips behaviour).
  • `internal/api/dbhealth_test.go` — four new tests:
    • single failure does not trip the gate
    • exactly threshold consecutive failures flips `healthy=false`
    • a single success resets the streak counter
    • `SetFailureThreshold` normalises non-positive input to 1

The pre-existing `TestDBHealth_TogglesOnPingFailure` still passes — its 10ms interval / 500ms deadline gives ~50 ticks, well above the threshold.

Test plan

  • `go test ./internal/api/ -run TestDBHealth -race` — 5 passed (1 existing + 4 new)
  • `go build ./...` — clean
  • Live: simulator running, no 503 windows under burst load
  • CI green before merge

🤖 Generated with Claude Code

aksOps and others added 5 commits April 28, 2026 15:44
…3 every /api/* request

DBHealthMiddleware was flipping healthy=false on a single ping failure and
serving 503 to every /api/* and /v1/* request until the next 5s poll
recovered. Under SQLite (MaxOpen=1) with concurrent ingest + API load,
the 5s poller's 2s ping timeout routinely lost the connection-pool race
to an in-flight write — producing user-visible 503 windows even though
the DB itself was perfectly healthy.

Reproduction with the chaos simulator running:
- 200 concurrent /api/metrics/dashboard → 503 × 200 (100% fatal)
- 5 page-loads × 9 concurrent calls   → 503 × 45 (100% fatal)

Fix: require failureThreshold (default 3) consecutive failed pings before
flipping healthy=false. A single success resets the streak counter and
restores healthy immediately — recovery stays fast. Three failures × 5s
interval = ~15s of real outage before the gate trips, which comfortably
swallows transient pool contention but still surfaces a genuine DB
outage well within the 30s readiness SLO.

Post-fix on the same load:
- 200 concurrent /api/metrics/dashboard → 200 × 100, 429 × 100
- 50 concurrent /api/traces             → 200 × 43,  429 × 7
- 5 page-loads × 9 concurrent           → 200 × 36,  429 × 9

The 429s are the configured API_RATE_LIMIT_RPS=100 policy under synthetic
burst — not visible to real browser sessions arriving at human speed.

Tests in internal/api/dbhealth_test.go cover:
- single failure does not trip the gate (debounce)
- exactly threshold consecutive failures flips healthy=false
- a single success resets the streak counter
- SetFailureThreshold normalises non-positive input to 1 (legacy
  any-failure-trips behaviour for callers that opt out)

The existing TestDBHealth_TogglesOnPingFailure still passes — its
10ms-interval / 500ms-deadline gives ~50 ticks, far above the threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@aksOps aksOps merged commit f3c6131 into main Apr 29, 2026
17 checks passed
@aksOps aksOps deleted the fix/dbhealth-consecutive-failure-hysteresis branch April 29, 2026 13:43