Reset monitor confidence to 1 on failure for fast outage detection by dolph · Pull Request #28 · dolph/connectivity

dolph · 2026-05-16T12:55:46Z

Summary

Destination.Monitor adjusted polling cadence via a confidence counter (1–10 minutes) that grew on success and decremented by 1 on failure. After a long healthy run, confidence saturated at 10, leading to an outage-detection lag of up to ~55 minutes in a tool whose purpose is fast outage detection:

t=0    Check ok, confidence=10, sleep 10m
t=10m  destination already down (since t=8m), undetected
       Check fail, confidence=9, sleep 9m
t=19m  Check fail, confidence=8, sleep 8m
...
t=54m  Check fail, confidence=1, sleep 1m   (finally tight polling; next check at t=55m)

That is upside-down: detection is slowest exactly when destinations have been healthy — i.e. the steady state of a well-run service.

The fix snaps confidence back to 1 immediately on any failure, so polling becomes tight as soon as a destination first appears unreachable.

} else {
    confidence = 1
}

Scope is intentionally kept to the one-line fix. The state-change counter, tighter healthy-polling cap, and jitter suggestions from #16 are valuable but orthogonal and can be follow-ups.

Test Plan

go vet ./... passes
go build ./... passes
go test -race ./... — only the pre-existing TestRouteToLoopback{1,2,3} failures in router_test.go remain; these reproduce on origin/main without this change and are caused by the test environment routing loopback traffic via eth0 (192.0.2.x) instead of lo. Not introduced by this PR.
Manual smoke: run connectivity against a destination, observe that the first failed check after a long healthy run triggers a 1-minute poll instead of decrementing from 10.

No unit test added — Monitor() is not currently unit-tested because Check() performs real network I/O and is not seam-injected. Adding a test would require a larger refactor beyond the scope of this one-line fix.

Fixes #16

https://claude.ai/code/session_01WjHPSobuzrRkjwUgjAJWMk

Generated by Claude Code

Destination.Monitor adjusted polling cadence via a confidence counter (1-10 minutes) that grew on success and decremented on failure. After a long healthy run confidence saturated at 10, so the first failed check was delayed up to 10 minutes, and then 9+8+7+...+1 = 45 more minutes of accumulating failures were needed to reach 1-minute polling. Total detection lag from a healthy steady state: up to ~55 minutes -- exactly backwards for a tool meant to improve SRE responsiveness.

Snap confidence back to 1 immediately on any failure so polling becomes tight as soon as a destination first looks unreachable.

Fixes #16

https://claude.ai/code/session_01WjHPSobuzrRkjwUgjAJWMk

dolph merged commit c2d60e9 into main May 16, 2026
2 checks passed

dolph deleted the claude/fix-issue-16-monitor-cadence branch May 16, 2026 13:00
