Reset monitor confidence to 1 on failure for fast outage detection #28

Merged: dolph merged 1 commit into main from claude/fix-issue-16-monitor-cadence on May 16, 2026

@dolph (Owner) commented May 16, 2026

Summary

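Destination.Monitor adjusted polling cadence via a confidence counter (1–10 minutes) that grew on success and decremented by 1 on failure. After a long healthy run, confidence saturated at 10, so the first failed check could arrive up to 10 minutes after an outage began, and another 9+8+...+1 = 45 minutes of failed checks passed before polling tightened to 1 minute. That is a lag of up to ~55 minutes in a tool whose purpose is fast outage detection: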

t=0    Check ok, confidence=10, sleep 10m
t=10m  destination already down (since t=8m), undetected
       Check fail, confidence=9, sleep 9m
t=19m  Check fail, confidence=8, sleep 8m
...
t=55m  Check fail, confidence=1, sleep 1m   (finally tight polling)

That is upside-down: detection is slowest exactly when destinations have been healthy — i.e. the steady state of a well-run service.
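
The timeline above can be reproduced with a small simulation. This is a minimal sketch, not the project's code: `simulateOldPolicy`, the `check` type, and the constants are hypothetical names, and it assumes the old failure branch clamped confidence at 1.

```go
package main

import "fmt"

// Bounds of the confidence counter, in minutes, as stated in the summary.
const (
	minConfidence = 1
	maxConfidence = 10
)

// check records one failed check in the simulated timeline (hypothetical type).
type check struct {
	atMinute   int // minutes since the last successful check
	confidence int // confidence after processing the failure
}

// simulateOldPolicy replays the pre-fix behavior for a destination that
// stays down: sleep for `confidence` minutes, fail, decrement by one
// (assumed to clamp at minConfidence). It stops at the first check that
// runs after a 1-minute sleep.
func simulateOldPolicy() []check {
	var timeline []check
	elapsed, confidence := 0, maxConfidence
	for {
		slept := confidence
		elapsed += slept
		if confidence > minConfidence {
			confidence--
		}
		timeline = append(timeline, check{atMinute: elapsed, confidence: confidence})
		if slept == minConfidence {
			return timeline
		}
	}
}

func main() {
	for _, c := range simulateOldPolicy() {
		fmt.Printf("t=%dm  Check fail, confidence=%d, sleep %dm\n",
			c.atMinute, c.confidence, c.confidence)
	}
}
```

Under these assumptions, confidence first reaches 1 at the t=54m check, and the t=55m check is the first one to run after a 1-minute sleep, which matches the ~55 minute figure.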

The fix snaps confidence back to 1 immediately on any failure, so polling becomes tight as soon as a destination first appears unreachable.

} else {
    confidence = 1
}
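
For context, the update rule around that branch might look roughly like the following. This is a sketch under assumptions: `nextConfidence` is a hypothetical helper, and the success branch (grow by 1, capped at 10) is a guess at how confidence "grew on success"; only the failure branch mirrors the change in this PR.

```go
package main

import "fmt"

const (
	minConfidence = 1  // tightest polling interval, in minutes
	maxConfidence = 10 // loosest polling interval, in minutes
)

// nextConfidence returns the confidence to use after one check.
// The success branch is an assumption; the failure branch is the fix:
// snap straight back to the tightest interval.
func nextConfidence(confidence int, ok bool) int {
	if ok {
		if confidence < maxConfidence {
			confidence++
		}
	} else {
		confidence = minConfidence
	}
	return confidence
}

func main() {
	confidence := maxConfidence
	// A healthy check, two failures, then recovery.
	for _, ok := range []bool{true, false, false, true, true} {
		confidence = nextConfidence(confidence, ok)
		fmt.Printf("ok=%-5t -> confidence=%d (sleep %dm)\n", ok, confidence, confidence)
	}
}
```

With this rule the first failure after a saturated healthy run drops the sleep from 10 minutes to 1, and the interval only widens again as successes accumulate.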

Scope is intentionally kept to the one-line fix. The state-change counter, tighter healthy-polling cap, and jitter suggestions from #16 are valuable but orthogonal and can be follow-ups.

Test Plan

  • go vet ./... passes
  • go build ./... passes
  • go test -race ./... — only the pre-existing TestRouteToLoopback{1,2,3} failures in router_test.go remain; these reproduce on origin/main without this change and are caused by the test environment routing loopback traffic via eth0 (192.0.2.x) instead of lo. Not introduced by this PR.
  • Manual smoke: run connectivity against a destination, observe that the first failed check after a long healthy run triggers a 1-minute poll instead of decrementing from 10.

No unit test added — Monitor() is not currently unit-tested because Check() performs real network I/O and is not seam-injected. Adding a test would require a larger refactor beyond the scope of this one-line fix.
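
As a rough illustration of what such a seam could look like in a follow-up, the sketch below puts the check behind an interface so a scripted fake can stand in for real network I/O. Everything here is hypothetical (`Checker`, `scriptedChecker`, `runChecks`, `errDown`), and the success rule (+1, capped at 10) is an assumption; it is not how Monitor() is structured today.

```go
package main

import (
	"errors"
	"fmt"
)

// errDown is a stand-in error for an unreachable destination.
var errDown = errors.New("destination unreachable")

// Checker is a hypothetical seam: anything that can report reachability.
type Checker interface {
	Check() error
}

// scriptedChecker is a test double that replays pre-programmed results.
type scriptedChecker struct {
	results []error
	calls   int
}

func (s *scriptedChecker) Check() error {
	err := s.results[s.calls%len(s.results)]
	s.calls++
	return err
}

// runChecks performs n checks against c and returns the sleep, in minutes,
// chosen after each one. It never actually sleeps, so tests run instantly.
func runChecks(c Checker, start, n int) []int {
	confidence := start
	sleeps := make([]int, 0, n)
	for i := 0; i < n; i++ {
		if err := c.Check(); err == nil {
			if confidence < 10 {
				confidence++ // assumed growth rule
			}
		} else {
			confidence = 1 // the behavior introduced by this PR
		}
		sleeps = append(sleeps, confidence)
	}
	return sleeps
}

func main() {
	fake := &scriptedChecker{results: []error{nil, errDown, nil}}
	fmt.Println(runChecks(fake, 10, 3)) // prints [10 1 2]
}
```

Separating the cadence decision from the sleep and the network call is one way to make the reset-on-failure behavior assertable without touching a real destination.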

Fixes #16

https://claude.ai/code/session_01WjHPSobuzrRkjwUgjAJWMk


Generated by Claude Code

Destination.Monitor adjusted polling cadence via a confidence counter
(1-10 minutes) that grew on success and decremented on failure. After
a long healthy run confidence saturated at 10, so the first failed
check was delayed up to 10 minutes, and then 9+8+7+...+1 = 45 more
minutes of accumulating failures were needed to reach 1-minute
polling. Total detection lag from a healthy steady state: up to ~55
minutes -- exactly backwards for a tool meant to improve SRE
responsiveness.

Snap confidence back to 1 immediately on any failure so polling
becomes tight as soon as a destination first looks unreachable.

Fixes #16

https://claude.ai/code/session_01WjHPSobuzrRkjwUgjAJWMk
@dolph merged commit c2d60e9 into main on May 16, 2026
2 checks passed
@dolph deleted the claude/fix-issue-16-monitor-cadence branch on May 16, 2026 13:00

Development

Successfully merging this pull request may close these issues.

monitor cadence delays outage detection by tens of minutes after a healthy period
