Problem
When the passive connection's broker sends a Lame Duck Mode (LDM) signal, we force the passive to reconnect to a healthy broker before the active connection needs to steal its socket.
However, ApPassiveServerPool.nextServer() skips only the active server; it does not skip the current passive server (the one that just sent LDM). During the broker's drain window the LDM broker continues to accept new connections, so nextServer() may return it again, causing a redundant reconnect cycle. In the worst case, the active connection steals a socket that is already dying.
Concrete failure sequence (3 nodes: B1=active, B3=passive+LDM)
- connectSucceeded(B3) → pool randomizes → entryList = [B3, B1, B2]
- ApPassiveServerPool.nextServer() → B3 ≠ active(B1) → returns B3 (the dying broker)
- Passive reconnects to the same draining broker → receives LDM again
- If the active connection steals the passive's socket during this window, it inherits a dying connection
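The sequence above can be reproduced with a small stand-alone model. This is only a sketch: `ActiveOnlySkipPool` is a hypothetical stand-in for ApPassiveServerPool, plain strings replace NatsUri, and the selection is simplified to "first entry that is not the active server" without any rotation of the list.

```java
import java.util.ArrayList;
import java.util.List;

final class LdmFailureSketch {

    // Hypothetical model of a pool that skips ONLY the active server.
    static final class ActiveOnlySkipPool {
        private final List<String> entryList;
        private final String activeServer;

        ActiveOnlySkipPool(List<String> entryList, String activeServer) {
            this.entryList = new ArrayList<>(entryList);
            this.activeServer = activeServer;
        }

        // Returns the first entry that is not the active server, or null if none exists.
        String nextServer() {
            for (String candidate : entryList) {
                if (!candidate.equals(activeServer)) {
                    return candidate;
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        // Order after the shuffle in the failure sequence: [B3, B1, B2], B1 is active.
        ActiveOnlySkipPool pool = new ActiveOnlySkipPool(List.of("B3", "B1", "B2"), "B1");

        // B3 just sent LDM, but nothing in the pool knows that.
        System.out.println("passive reconnects to: " + pool.nextServer());
        // prints "passive reconnects to: B3"
    }
}
```

Because the only exclusion is the active server, the draining broker B3 is the first eligible entry and is handed straight back to the passive connection.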
Proposed API addition
// In ApPassiveServerPool
/**
 * Marks the given URI as a lame-duck server.
 * nextServer() and peekNextServer() will skip this URI (in addition to the active server)
 * until clearLameDuckServer() is called.
 */
public void setLameDuckServer(NatsUri uri) { ... }

/**
 * Clears any previously-set lame-duck server so it becomes eligible for selection again.
 */
public void clearLameDuckServer() { ... }
Update nextServer() and peekNextServer() to skip both activeServerRef AND lameDuckServerRef.
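A minimal sketch of how the skip rule could behave, to make the request concrete. It is not the actual ApPassiveServerPool code: the `Pool` class is hypothetical, strings stand in for NatsUri, and the two reference fields are assumed to be AtomicReference instances. peekNextServer() would apply the same filter.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

final class LameDuckSkipSketch {

    // Hypothetical pool that skips both the active server and a marked lame-duck server.
    static final class Pool {
        private final List<String> entryList;
        private final AtomicReference<String> activeServerRef = new AtomicReference<>();
        private final AtomicReference<String> lameDuckServerRef = new AtomicReference<>();

        Pool(List<String> entryList, String activeServer) {
            this.entryList = List.copyOf(entryList);
            this.activeServerRef.set(activeServer);
        }

        // Marks a server as lame duck; it stays excluded until cleared.
        void setLameDuckServer(String uri) {
            lameDuckServerRef.set(uri);
        }

        // Makes the previously marked server eligible again.
        void clearLameDuckServer() {
            lameDuckServerRef.set(null);
        }

        // Returns the first entry that is neither active nor lame duck, or null if none.
        String nextServer() {
            String active = activeServerRef.get();
            String lameDuck = lameDuckServerRef.get();
            for (String candidate : entryList) {
                if (!candidate.equals(active) && !candidate.equals(lameDuck)) {
                    return candidate;
                }
            }
            return null;
        }
    }

    public static void main(String[] args) {
        Pool pool = new Pool(List.of("B3", "B1", "B2"), "B1");

        pool.setLameDuckServer("B3");
        System.out.println(pool.nextServer()); // prints "B2"

        pool.clearLameDuckServer();
        System.out.println(pool.nextServer()); // prints "B3"
    }
}
```

With the same [B3, B1, B2] ordering as the failure sequence, the result no longer depends on where B3 sits in the list; it is excluded until clearLameDuckServer() is called.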
Current workaround
We demote the LDM server in pool ordering before triggering the passive reconnect, which reduces the chance of it being selected again. However, this is a positional heuristic — it relies on ordering behaviour rather than an explicit skip rule. If the pool ordering changes (e.g. due to a shuffle on the next connectSucceeded()), the protection may not hold. A first-class setLameDuckServer() API that guarantees the server is skipped until explicitly cleared is the correct and reliable fix.
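To illustrate why the demotion is only a heuristic, here is a small sketch. The helpers `demote` and `firstNonActive` are hypothetical and simplified, not our actual workaround code, and the later shuffle is simulated with a fixed order so the outcome is deterministic.

```java
import java.util.ArrayList;
import java.util.List;

final class DemotionWorkaroundSketch {

    // Moves the given server to the end of the list, if present.
    static void demote(List<String> entryList, String uri) {
        if (entryList.remove(uri)) {
            entryList.add(uri);
        }
    }

    // Returns the first entry that is not the active server, or null if none exists.
    static String firstNonActive(List<String> entryList, String active) {
        for (String candidate : entryList) {
            if (!candidate.equals(active)) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> entryList = new ArrayList<>(List.of("B3", "B1", "B2"));

        // Workaround: push the LDM broker to the back before reconnecting.
        demote(entryList, "B3"); // entryList is now [B1, B2, B3]
        System.out.println(firstNonActive(entryList, "B1")); // prints "B2"

        // Simulated reshuffle that happens to put B3 first again.
        entryList = new ArrayList<>(List.of("B3", "B2", "B1"));
        System.out.println(firstNonActive(entryList, "B1")); // prints "B3"
    }
}
```

The protection lasts only as long as the ordering does, which is the gap an explicit skip rule would close.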