Reset monitor confidence to 1 on failure for fast outage detection #28
Merged
Destination.Monitor adjusted polling cadence via a confidence counter (1-10 minutes) that grew on success and decremented on failure. After a long healthy run confidence saturated at 10, so the first failed check was delayed up to 10 minutes, and then 9+8+7+...+1 = 45 more minutes of accumulating failures were needed to reach 1-minute polling. Total detection lag from a healthy steady state: up to ~55 minutes -- exactly backwards for a tool meant to improve SRE responsiveness. Snap confidence back to 1 immediately on any failure so polling becomes tight as soon as a destination first looks unreachable. Fixes #16 https://claude.ai/code/session_01WjHPSobuzrRkjwUgjAJWMk
Summary
`Destination.Monitor` adjusted polling cadence via a `confidence` counter (1–10 minutes) that grew on success and decremented by 1 on failure. After a long healthy run, `confidence` saturated at 10, leading to an outage-detection lag of up to ~55 minutes in a tool whose purpose is fast outage detection: the first failed check could be delayed by up to 10 minutes, and a further 9+8+7+...+1 = 45 minutes of accumulating failures were needed to reach 1-minute polling. That is upside-down: detection is slowest exactly when destinations have been healthy, i.e. the steady state of a well-run service.
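The ~55-minute figure can be checked with a small model of the old behaviour. This is a sketch under stated assumptions, not the repository's code: `worstCaseLagMinutes` is a hypothetical helper, and it assumes the next check is scheduled `confidence` minutes after the previous one and that every check during the outage fails.

```go
package main

import "fmt"

// worstCaseLagMinutes is a hypothetical model, not code from this repository.
// It assumes the old rule: each failed check decrements confidence by 1, and
// the next check runs `confidence` minutes later. It returns the worst-case
// number of minutes from the start of an outage until the wait sequence has
// counted down through the 1-minute interval.
func worstCaseLagMinutes(start int) int {
	// The outage may begin right after a successful check, so up to
	// `start` minutes pass before the first failed check is even made.
	lag := start
	// Each further failure steps the interval down by one minute.
	for c := start - 1; c >= 1; c-- {
		lag += c
	}
	return lag
}

func main() {
	// From a saturated confidence of 10: 10 + (9+8+...+1) minutes.
	fmt.Println(worstCaseLagMinutes(10))
}
```

Under the same assumptions, resetting `confidence` to 1 on the first failure leaves only the initial wait of up to 10 minutes before polling is at its tightest.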
The fix snaps `confidence` back to `1` immediately on any failure, so polling becomes tight as soon as a destination first appears unreachable.

Scope is intentionally kept to the one-line fix. The state-change counter, tighter healthy-polling cap, and jitter suggestions from #16 are valuable but orthogonal and can be follow-ups.
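The reset-on-failure rule described above might look roughly like the following. This is a minimal sketch, not the actual diff: `nextConfidence`, `minConfidence`, and `maxConfidence` are hypothetical names, and growth by 1 per successful check is an assumption (the description only says the counter "grew on success").

```go
package main

import "fmt"

// Hypothetical bounds, taken from the 1-10 minute range in the description.
const (
	minConfidence = 1
	maxConfidence = 10
)

// nextConfidence is a hypothetical helper illustrating the new rule.
// On failure it snaps straight back to the minimum instead of decrementing.
// On success it grows by 1 (assumed step size) up to the cap.
func nextConfidence(confidence int, ok bool) int {
	if !ok {
		return minConfidence
	}
	if confidence < maxConfidence {
		return confidence + 1
	}
	return maxConfidence
}

func main() {
	c := maxConfidence           // long healthy run, counter saturated
	c = nextConfidence(c, false) // first failed check
	fmt.Println(c)               // next poll interval in minutes
}
```

Keeping the rule in a pure function like this is one possible design choice: it can be exercised without network I/O, independently of how `Check()` is wired.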
Test Plan
- `go vet ./...` passes
- `go build ./...` passes
- `go test -race ./...`: only the pre-existing `TestRouteToLoopback{1,2,3}` failures in `router_test.go` remain; these reproduce on `origin/main` without this change and are caused by the test environment routing loopback traffic via `eth0` (192.0.2.x) instead of `lo`. Not introduced by this PR.
- Run `connectivity` against a destination and observe that the first failed check after a long healthy run triggers a 1-minute poll instead of decrementing from 10.

No unit test added: `Monitor()` is not currently unit-tested because `Check()` performs real network I/O and is not seam-injected. Adding a test would require a larger refactor beyond the scope of this one-line fix.

Fixes #16
https://claude.ai/code/session_01WjHPSobuzrRkjwUgjAJWMk
Generated by Claude Code