fix: re-gossip dead/suspect accusations on stale alive and unknown node #345
emam07 wants to merge 2 commits
During mass cluster restarts a node can get permanently stuck as dead
in peers' views because the refutation mechanism never triggers. Two
bugs prevent the accusation from reaching the restarted node:
1. aliveNode() silently dropped stale alive messages (inc <= current)
even when the local state was dead/suspect. It now re-queues the
dead/suspect message so the restarted node can receive and refute it.
2. deadNode() and suspectNode() silently dropped messages about nodes
not yet in the local map, so freshly joined nodes acted as a gossip
black hole during a mass restart. They now forward the message so it
can propagate through nodes with incomplete cluster views.
Adds [INFO] logs when nodes are marked suspect/dead for observability.
Four new tests cover both bug scenarios directly.
Fixes hashicorp#311
The CLA check is passing now, so you can proceed with reviewing the PR and any further steps.
The two failing tests (TestShuffleNodes and TestMemberList_ProbeNode_Awareness_MissedNack) are pre-existing flaky tests unrelated to this PR. This branch only modifies state.go, and neither failing test exercises that code path.
Could you re-run the failed job? Happy to investigate further if it reproduces consistently.
@emam07 just a heads up that this is on my review queue. I'll try to get to it soon.
Problem
During mass cluster restarts, nodes can get permanently stuck as dead
in peers' views. The refutation mechanism never triggers because the
accusation (dead/suspect message) never reaches the restarted node.
Root cause: two silent drops in state.go.
Bug 1 (aliveNode, line ~1076): When a node receives alive(node2, inc=1)
but already holds dead(node2, inc=100), it returns silently. It does not
re-gossip the dead accusation back toward the restarted node, so the
node never calls refute().
Bug 2 (deadNode / suspectNode): When a dead or suspect message arrives
at a node that has never heard of the target (common in freshly joined
nodes during a mass restart), it is silently dropped instead of
forwarded. This creates a gossip black hole that prevents the accusation
from propagating through nodes with incomplete views.
Both bugs together mean the restarted node broadcasts alive(inc=1)
indefinitely, but no node ever sends back the dead(inc=100) accusation
it needs to refute. The node stays dead in affected peers' views until
a TCP push/pull sync happens to repair it.
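The two drops, and what changes once they are removed, can be illustrated with a toy model. This is a minimal sketch, not memberlist's actual code: the Peer, Message, HandleAlive, HandleAccusation and Refute names are hypothetical, and the regossip/forward flags exist only to contrast the old and new behaviour side by side.

```go
package main

import "fmt"

// Status is a simplified stand-in for a node's state in a peer's view.
type Status int

const (
	Alive Status = iota
	Suspect
	Dead
)

// Message is a hypothetical gossip message: a claim about a node at an incarnation.
type Message struct {
	Kind        Status
	Node        string
	Incarnation uint32
}

// Peer is a toy cluster member: its view of other nodes plus an outgoing gossip queue.
type Peer struct {
	view  map[string]Message
	queue []Message
}

// HandleAlive applies an alive message. With regossip=false a stale alive is
// dropped (old behaviour); with regossip=true the stored dead/suspect
// accusation is re-queued so the accused node can eventually see it (the fix).
func (p *Peer) HandleAlive(m Message, regossip bool) {
	cur, known := p.view[m.Node]
	if known && m.Incarnation <= cur.Incarnation {
		if regossip && cur.Kind != Alive {
			p.queue = append(p.queue, cur)
		}
		return
	}
	p.view[m.Node] = m
	p.queue = append(p.queue, m)
}

// HandleAccusation applies a dead/suspect message. With forward=false an
// accusation about an unknown node is dropped (old behaviour); with
// forward=true it is queued so it keeps propagating (the fix).
func (p *Peer) HandleAccusation(m Message, forward bool) {
	cur, known := p.view[m.Node]
	if !known {
		if forward {
			p.queue = append(p.queue, m)
		}
		return
	}
	if m.Incarnation < cur.Incarnation {
		return
	}
	p.view[m.Node] = m
	p.queue = append(p.queue, m)
}

// Refute models the accused node answering with a higher incarnation (simplified).
func Refute(self string, accusation Message) Message {
	return Message{Kind: Alive, Node: self, Incarnation: accusation.Incarnation + 1}
}

func main() {
	dead := Message{Kind: Dead, Node: "node2", Incarnation: 100}
	stale := Message{Kind: Alive, Node: "node2", Incarnation: 1}

	// Bug 1: a peer that holds dead(node2, inc=100) receives alive(node2, inc=1).
	peer := &Peer{view: map[string]Message{"node2": dead}}
	peer.HandleAlive(stale, false)
	fmt.Println("old behaviour, queued:", len(peer.queue))
	peer.HandleAlive(stale, true)
	fmt.Println("fixed behaviour, queued:", len(peer.queue))

	// Bug 2: a freshly joined peer with an empty view receives the accusation.
	fresh := &Peer{view: map[string]Message{}}
	fresh.HandleAccusation(dead, false)
	fmt.Println("old behaviour, forwarded:", len(fresh.queue))
	fresh.HandleAccusation(dead, true)
	fmt.Println("fixed behaviour, forwarded:", len(fresh.queue))

	// Once the accusation reaches node2, its refutation is no longer stale.
	peer.HandleAlive(Refute("node2", peer.queue[0]), true)
	fmt.Println("node2 alive in peer's view:", peer.view["node2"].Kind == Alive)
}
```

In this model the refutation carries incarnation 101, which is why the peer accepts it where it rejected alive(inc=1).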
Reported in #311.
Fix
- aliveNode(): when a stale alive message is received for a dead or
  suspect node, re-queue the accusation for gossip so the restarted
  node can receive it and refute.
- deadNode() / suspectNode(): forward dead/suspect messages for
  unknown nodes rather than dropping them. The TransmitLimitedQueue
  already bounds retransmissions to RetransmitMult × log(N+1).
- [INFO] log lines when nodes are marked suspect or dead, for
  operator visibility.
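To give a feel for how tight the RetransmitMult × log(N+1) bound is, here is a minimal sketch of the arithmetic. It assumes the logarithm is base 10 and rounded up; that rounding is an assumption about the library's internal helper, not something stated in this PR, and the retransmitLimit name is used here purely for illustration.

```go
package main

import (
	"fmt"
	"math"
)

// retransmitLimit sketches the bound as retransmitMult * ceil(log10(n+1)).
// The base-10 log and the ceiling are assumptions made for this example.
func retransmitLimit(retransmitMult, n int) int {
	nodeScale := math.Ceil(math.Log10(float64(n + 1)))
	return retransmitMult * int(nodeScale)
}

func main() {
	// With a multiplier of 4, the cap grows only logarithmically in cluster size.
	for _, n := range []int{0, 4, 49, 499} {
		fmt.Printf("nodes=%d limit=%d\n", n, retransmitLimit(4, n))
	}
}
```

Under these assumptions a forwarded accusation is retransmitted at most 12 times in a 499-node cluster, so forwarding messages about unknown nodes adds bounded gossip traffic rather than an unbounded echo.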
Tests
Four new tests in state_test.go:
- TestMemberList_AliveNode_ReGossipsDeadAccusation
- TestMemberList_AliveNode_ReGossipsSuspectAccusation
- TestMemberList_DeadNode_UnknownNode_ForwardsMessage
- TestMemberList_SuspectNode_UnknownNode_ForwardsMessage
All existing tests pass.