fix: re-gossip dead/suspect accusations on stale alive and unknown node #345

Open
emam07 wants to merge 2 commits into hashicorp:master from emam07:fix/incarnation-mass-restart

emam07 commented May 31, 2026

Problem

During mass cluster restarts, nodes can get permanently stuck as dead
in peers' views. The refutation mechanism never triggers because the
accusation (dead/suspect message) never reaches the restarted node.

Root cause — two silent drops in state.go:

Bug 1 (aliveNode, line ~1076): When a node receives
alive(node2, inc=1) but already holds dead(node2, inc=100), it
returns silently. It does not re-gossip the dead accusation back toward
the restarted node, so the node never calls refute().

Bug 2 (deadNode / suspectNode): When a dead or suspect message
arrives at a node that has never heard of the target (common in
freshly-joined nodes during a mass restart), it is silently dropped
instead of forwarded. This creates a gossip black hole that prevents
the accusation from propagating through nodes with incomplete views.

Both bugs together mean the restarted node broadcasts alive(inc=1)
indefinitely but no node ever sends back the dead(inc=100) accusation
it needs to refute. The node stays dead in affected peers' views
indefinitely, until a TCP push/pull sync happens to correct it.
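
The two silent drops can be illustrated with a toy model. This is only a sketch of the behavior described above, not the code in state.go: the `cluster` type, its `queue` field, and the `aliveNodeOld` / `deadNodeOld` names are hypothetical stand-ins, and the real handlers do considerably more.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for memberlist's internal node state.
type status int

const (
	stateAlive status = iota
	stateSuspect
	stateDead
)

type nodeState struct {
	incarnation uint32
	state       status
}

// cluster is a toy view held by one peer: known nodes plus the
// messages this peer would gossip next.
type cluster struct {
	nodes map[string]*nodeState
	queue []string
}

// aliveNodeOld models the pre-fix handling of an alive message.
func (c *cluster) aliveNodeOld(name string, inc uint32) {
	n, ok := c.nodes[name]
	if !ok {
		c.nodes[name] = &nodeState{incarnation: inc, state: stateAlive}
		c.queue = append(c.queue, fmt.Sprintf("alive(%s, inc=%d)", name, inc))
		return
	}
	if inc <= n.incarnation {
		return // Bug 1: stale alive is dropped; the accusation is not re-gossiped.
	}
	n.incarnation, n.state = inc, stateAlive
	c.queue = append(c.queue, fmt.Sprintf("alive(%s, inc=%d)", name, inc))
}

// deadNodeOld models the pre-fix handling of a dead message.
func (c *cluster) deadNodeOld(name string, inc uint32) {
	n, ok := c.nodes[name]
	if !ok {
		return // Bug 2: unknown target, so the message is dropped, not forwarded.
	}
	if inc < n.incarnation || n.state == stateDead {
		return
	}
	n.incarnation, n.state = inc, stateDead
	c.queue = append(c.queue, fmt.Sprintf("dead(%s, inc=%d)", name, inc))
}

func main() {
	// Peer A already holds dead(node2, inc=100); node2 restarts at inc=1.
	a := &cluster{nodes: map[string]*nodeState{
		"node2": &nodeState{incarnation: 100, state: stateDead},
	}}
	a.aliveNodeOld("node2", 1)
	fmt.Println("peer A queued messages:", len(a.queue)) // 0

	// Peer B has just joined and has never heard of node2.
	b := &cluster{nodes: map[string]*nodeState{}}
	b.deadNodeOld("node2", 100)
	fmt.Println("peer B queued messages:", len(b.queue)) // 0
}
```

In both cases the peer's queue stays empty, so nothing carrying dead(node2, inc=100) travels back toward node2.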

Reported in #311.

Fix

  • aliveNode(): when a stale alive message is received for a dead or
    suspect node, re-queue the accusation for gossip so the restarted
    node can receive it and refute.
  • deadNode() / suspectNode(): forward dead/suspect messages for
    unknown nodes rather than dropping them. The TransmitLimitedQueue
    already bounds retransmissions of each message to
    RetransmitMult × ⌈log10(N+1)⌉.
  • Added [INFO] log lines when nodes are marked suspect or dead for
    operator visibility.
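
The intended behavior after the fix can be sketched with a toy model. Everything here is illustrative rather than the actual diff: the `peer` type, `enqueue`, and the simplified `aliveNode` / `deadNode` bodies are hypothetical, forwarding without recording local state for the unknown node is an assumption, and `retransmitLimit` reflects my understanding of how the queue's bound is computed (base-10 logarithm, rounded up). `suspectNode` would follow the same pattern as `deadNode` and is omitted for brevity.

```go
package main

import (
	"fmt"
	"math"
)

// Hypothetical, simplified stand-ins for memberlist's internal node state.
type status int

const (
	stateAlive status = iota
	stateSuspect
	stateDead
)

type nodeState struct {
	incarnation uint32
	state       status
}

// peer is a toy view held by one member: known nodes plus queued gossip.
type peer struct {
	nodes map[string]*nodeState
	queue []string
}

func (p *peer) enqueue(kind, name string, inc uint32) {
	p.queue = append(p.queue, fmt.Sprintf("%s(%s, inc=%d)", kind, name, inc))
}

// aliveNode re-queues the stored accusation when a stale alive arrives
// for a node currently held as dead or suspect.
func (p *peer) aliveNode(name string, inc uint32) {
	n, ok := p.nodes[name]
	if !ok {
		p.nodes[name] = &nodeState{incarnation: inc, state: stateAlive}
		p.enqueue("alive", name, inc)
		return
	}
	if inc <= n.incarnation {
		switch n.state {
		case stateDead:
			p.enqueue("dead", name, n.incarnation)
		case stateSuspect:
			p.enqueue("suspect", name, n.incarnation)
		}
		return
	}
	n.incarnation, n.state = inc, stateAlive
	p.enqueue("alive", name, inc)
}

// deadNode forwards accusations about unknown nodes instead of dropping them.
func (p *peer) deadNode(name string, inc uint32) {
	n, ok := p.nodes[name]
	if !ok {
		p.enqueue("dead", name, inc) // assumption: forward only, no local state
		return
	}
	if inc < n.incarnation || n.state == stateDead {
		return
	}
	n.incarnation, n.state = inc, stateDead
	p.enqueue("dead", name, inc)
}

// retransmitLimit sketches the per-message bound: mult * ceil(log10(n+1)).
func retransmitLimit(mult, n int) int {
	return mult * int(math.Ceil(math.Log10(float64(n+1))))
}

func main() {
	a := &peer{nodes: map[string]*nodeState{
		"node2": &nodeState{incarnation: 100, state: stateDead},
	}}
	a.aliveNode("node2", 1)
	fmt.Println(a.queue) // [dead(node2, inc=100)]

	b := &peer{nodes: map[string]*nodeState{}}
	b.deadNode("node2", 100)
	fmt.Println(b.queue) // [dead(node2, inc=100)]

	fmt.Println(retransmitLimit(4, 50)) // 8
}
```

With a multiplier of 4 and 50 nodes, each forwarded accusation would be retransmitted at most 8 times under this bound, so forwarding for unknown nodes adds a bounded amount of gossip rather than an unbounded loop.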

Tests

Four new tests in state_test.go:

| Test | What it proves |
| --- | --- |
| TestMemberList_AliveNode_ReGossipsDeadAccusation | Stale alive re-queues dead accusation |
| TestMemberList_AliveNode_ReGossipsSuspectAccusation | Stale alive re-queues suspect accusation |
| TestMemberList_DeadNode_UnknownNode_ForwardsMessage | Dead msg forwarded for unknown node |
| TestMemberList_SuspectNode_UnknownNode_ForwardsMessage | Suspect msg forwarded for unknown node |

All existing tests pass.

  During mass cluster restarts a node can get permanently stuck as dead
  in peers' views because the refutation mechanism never triggers. Two
  bugs prevent the accusation from reaching the restarted node:

  1. aliveNode() silently dropped stale alive messages (inc <= current)
     even when the local state was dead/suspect. It now re-queues the
     dead/suspect message so the restarted node can receive and refute it.

  2. deadNode() and suspectNode() silently dropped messages about nodes
     not yet in the local map. Freshly-joined nodes during a mass restart
     act as a gossip black hole. They now forward the message so it can
     propagate through nodes with incomplete cluster views.

  Adds [INFO] logs when nodes are marked suspect/dead for observability.
  Four new tests cover both bug scenarios directly.

  Fixes hashicorp#311
emam07 requested a review from a team as a code owner May 31, 2026 19:48
hashicorp-cla-app Bot commented May 31, 2026

CLA assistant check
All committers have signed the CLA.

@ritikrajdev

The CLA check is passing here, so you can proceed with reviewing the PR and the remaining steps.

emam07 commented Jun 5, 2026

Author

The two failing tests (TestShuffleNodes and TestMemberList_ProbeNode_Awareness_MissedNack) are pre-existing flaky tests unrelated to this PR.

This branch only modifies state.go — neither failing test exercises that code path:

  • TestShuffleNodes (util_test.go) tests shuffleNodes() in util.go, which uses rand.Shuffle. With 8 elements there is a ~1/40320 probability the shuffled
    order matches the original — a known statistical flake.
  • TestMemberList_ProbeNode_Awareness_MissedNack is a timing-sensitive test that already uses iretry.Run() and is known to be flaky on loaded CI runners.

Could you re-run the failed job? Happy to investigate further if it reproduces consistently.

tgross self-requested a review June 5, 2026 13:10
tgross self-assigned this Jun 5, 2026
tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Jun 5, 2026
tgross commented Jun 5, 2026

Member

@emam07 just a heads up that this is on my review queue. I'll try to get to it soon.
