fix(api): debounce DB health gate so transient ping timeouts don't 503 every /api/* by aksOps · Pull Request #70 · RandomCodeSpace/otelcontext

aksOps · 2026-04-29T13:36:38Z

Summary

`DBHealthMiddleware` was flipping `healthy=false` on a single failed ping and serving 503 to every `/api/*` and `/v1/*` request until the next 5s poll recovered. Under SQLite (`MaxOpen=1`) with concurrent ingest + API load, the 5s poller's 2s ping timeout routinely lost the connection-pool race to an in-flight write, producing user-visible 503 windows even though the DB was perfectly healthy.

Reproduction (with the chaos simulator running on this branch's parent commit)

| Burst | Result |
| --- | --- |
| 200 concurrent `/api/metrics/dashboard` | `503 × 200` (100% fatal) |
| 50 concurrent `/api/traces` | mostly 429s/503s |
| 5 page-loads × 9 concurrent calls | `503 × 45` (100% fatal) |

Fix

Require `failureThreshold` (default 3) consecutive failed pings before flipping `healthy=false`. A single success resets the streak counter and restores `healthy=true` immediately — recovery stays fast.

Three consecutive failures at the 5s poll interval means ~15s of real outage before the gate trips, which:

- Swallows transient pool contention (the actual cause of the 503s)
- Still surfaces a genuine DB outage well within the 30s readiness SLO

Post-fix on the same load

| Burst | Result |
| --- | --- |
| 200 concurrent `/api/metrics/dashboard` | `200 × 100, 429 × 100` |
| 50 concurrent `/api/traces` | `200 × 43, 429 × 7` |
| 5 page-loads × 9 concurrent | `200 × 36, 429 × 9` |

The 429s are the configured `API_RATE_LIMIT_RPS=100` policy doing its job under synthetic burst — not visible to real browser sessions arriving at human speed.

Files

`internal/api/dbhealth.go` — adds `consecutiveFails atomic.Int32` + `failureThreshold int32` to `DBHealth`. `ping()` increments the counter on failure and only calls `markHealthy(false)` when it crosses the threshold; success resets to 0. `SetFailureThreshold(n)` exposed for ops; `n <= 0` normalises to 1 (legacy any-failure-trips behaviour).
`internal/api/dbhealth_test.go` — four new tests:
- single failure does not trip the gate
- exactly threshold consecutive failures flips `healthy=false`
- a single success resets the streak counter
- `SetFailureThreshold` normalises non-positive input to 1
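The four cases above can be modelled with a small pure function that replays a sequence of ping outcomes. This is a sketch for illustration only: `healthyAfter` is a hypothetical helper, not part of `dbhealth_test.go`, and it models the rule without the polling loop or any concurrency.

```go
package main

import "fmt"

// healthyAfter replays ping outcomes (true = success) through the debounce
// rule and reports the final gate state. Hypothetical helper for illustration.
func healthyAfter(threshold int, pings []bool) bool {
	if threshold <= 0 {
		threshold = 1 // non-positive input falls back to any-failure-trips
	}
	healthy, streak := true, 0
	for _, ok := range pings {
		if ok {
			streak, healthy = 0, true
			continue
		}
		streak++
		if streak >= threshold {
			healthy = false
		}
	}
	return healthy
}

func main() {
	cases := []struct {
		name      string
		threshold int
		pings     []bool
		want      bool
	}{
		{"single failure does not trip", 3, []bool{false}, true},
		{"threshold failures trip", 3, []bool{false, false, false}, false},
		{"success resets the streak", 3, []bool{false, false, true, false, false}, true},
		{"non-positive threshold acts as 1", 0, []bool{false}, false},
	}
	for _, c := range cases {
		got := healthyAfter(c.threshold, c.pings)
		fmt.Printf("%-34s got=%t want=%t\n", c.name, got, c.want)
	}
}
```

The third case is the one that separates a consecutive-failure counter from a total-failure counter: four failures occur overall, but never three in a row, so the gate stays healthy.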

The pre-existing `TestDBHealth_TogglesOnPingFailure` still passes — its 10ms interval / 500ms deadline gives ~50 ticks, well above the threshold.

Test plan

`go test ./internal/api/ -run TestDBHealth -race` — 5 passed (1 existing + 4 new)
`go build ./...` — clean
Live: simulator running, no 503 windows under burst load
CI green before merge

🤖 Generated with Claude Code

…3 every /api/* request

DBHealthMiddleware was flipping healthy=false on a single ping failure and serving 503 to every /api/* and /v1/* request until the next 5s poll recovered. Under SQLite (MaxOpen=1) with concurrent ingest + API load, the 5s poller's 2s ping timeout routinely lost the connection-pool race to an in-flight write, producing user-visible 503 windows even though the DB itself was perfectly healthy.

Reproduction with the chaos simulator running:
- 200 concurrent /api/metrics/dashboard → 503 × 200 (100% fatal)
- 5 page-loads × 9 concurrent calls → 503 × 45 (100% fatal)

Fix: require failureThreshold (default 3) consecutive failed pings before flipping healthy=false. A single success resets the streak counter and restores healthy immediately, so recovery stays fast.

Three failures × 5s interval = ~15s of real outage before the gate trips, which comfortably swallows transient pool contention but still surfaces a genuine DB outage well within the 30s readiness SLO.

Post-fix on the same load:
- 200 concurrent /api/metrics/dashboard → 200 × 100, 429 × 100
- 50 concurrent /api/traces → 200 × 43, 429 × 7
- 5 page-loads × 9 concurrent → 200 × 36, 429 × 9

The 429s are the configured API_RATE_LIMIT_RPS=100 policy under synthetic burst, not visible to real browser sessions arriving at human speed.

Tests in internal/api/dbhealth_test.go cover:
- single failure does not trip the gate (debounce)
- exactly threshold consecutive failures flips healthy=false
- a single success resets the streak counter
- SetFailureThreshold normalises non-positive input to 1 (legacy any-failure-trips behaviour for callers that opt out)

The existing TestDBHealth_TogglesOnPingFailure still passes: its 10ms-interval / 500ms-deadline gives ~50 ticks, far above the threshold.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

sonarqubecloud · 2026-04-29T13:39:58Z

Quality Gate passed

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

aksOps and others added 5 commits April 28, 2026 15:44

checkpoint: pre-yolo 2026-04-28T15:44:23

21c4256

checkpoint: pre-yolo 2026-04-28T23:33:41

c25c15c

checkpoint: pre-yolo 2026-04-28T23:37:28

5254f02

checkpoint: pre-yolo 2026-04-29T13:30:37

1c6595f

aksOps merged commit f3c6131 into main Apr 29, 2026
17 checks passed

aksOps deleted the fix/dbhealth-consecutive-failure-hysteresis branch April 29, 2026 13:43

aksOps merged 5 commits into main from fix/dbhealth-consecutive-failure-hysteresis
