Proposal 1 — ApPassiveServerPool: Add setLameDuckServer() API #3

@scottf

Problem

When the passive connection's broker sends a Lame Duck Mode (LDM) signal, we force the passive to reconnect to a healthy broker before the active connection needs to steal its socket.

However, ApPassiveServerPool.nextServer() skips only the active server. It does not skip the current passive server, i.e. the one that just sent LDM. During the broker's drain window the LDM broker continues to accept new connections, so nextServer() may return it again and cause a redundant reconnect cycle. In the worst case, the active connection steals a socket that is already dying.

Concrete failure sequence (3 nodes: B1=active, B3=passive+LDM)

  1. connectSucceeded(B3) → pool randomizes → entryList = [B3, B1, B2]
  2. ApPassiveServerPool.nextServer() → B3 ≠ active(B1) → returns B3 (the dying broker)
  3. Passive reconnects to the same draining broker → receives LDM again
  4. If active steals passive's socket during this window, it inherits a dying connection
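The sequence above can be reproduced with a small stand-in for the selection rule. This is only a sketch: ActiveOnlySkipPool is a hypothetical class, server URIs are plain strings instead of NatsUri, and the rotate-to-back behaviour of nextServer() is an assumption about how the pool cycles through entries.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the current rule: only the active server is skipped.
class ActiveOnlySkipPool {
    private final List<String> entryList;
    private final String activeServer;

    ActiveOnlySkipPool(List<String> entryList, String activeServer) {
        this.entryList = new ArrayList<>(entryList);
        this.activeServer = activeServer;
    }

    // Returns the first entry that is not the active server and moves it
    // to the back of the list (assumed rotation behaviour).
    String nextServer() {
        for (int i = 0; i < entryList.size(); i++) {
            String candidate = entryList.get(i);
            if (!candidate.equals(activeServer)) {
                entryList.remove(i);
                entryList.add(candidate);
                return candidate;
            }
        }
        return null; // nothing eligible
    }
}
```

With entryList = [B3, B1, B2] and B1 active, the first call to nextServer() hands back B3, the broker that just announced LDM, because nothing in the rule knows about it.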

Proposed API addition

// In ApPassiveServerPool
/**
 * Marks the given URI as a lame-duck server.
 * nextServer() and peekNextServer() will skip this URI (in addition to the active server)
 * until clearLameDuckServer() is called.
 */
public void setLameDuckServer(NatsUri uri) { ... }

/**
 * Clears any previously-set lame-duck server so it becomes eligible for selection again.
 */
public void clearLameDuckServer() { ... }

Update nextServer() and peekNextServer() to skip any server that matches either activeServerRef or lameDuckServerRef.
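One way the skip rule might look is sketched below. It is not the actual ApPassiveServerPool code: LameDuckAwarePool is a hypothetical class, URIs are plain strings instead of NatsUri, the entry list and its rotation are assumptions, and returning null when nothing is eligible is an illustrative choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical sketch: selection skips both the active and the lame-duck server.
class LameDuckAwarePool {
    private final List<String> entryList = new ArrayList<>();
    private final AtomicReference<String> activeServerRef = new AtomicReference<>();
    private final AtomicReference<String> lameDuckServerRef = new AtomicReference<>();

    LameDuckAwarePool(List<String> servers, String active) {
        entryList.addAll(servers);
        activeServerRef.set(active);
    }

    void setLameDuckServer(String uri) { lameDuckServerRef.set(uri); }

    void clearLameDuckServer() { lameDuckServerRef.set(null); }

    // A server is eligible only if it is neither active nor marked lame-duck.
    private boolean eligible(String uri) {
        return !uri.equals(activeServerRef.get())
            && !uri.equals(lameDuckServerRef.get());
    }

    // Looks at the next eligible server without changing the ordering.
    synchronized String peekNextServer() {
        for (String candidate : entryList) {
            if (eligible(candidate)) return candidate;
        }
        return null;
    }

    // Returns the next eligible server and rotates it to the back of the list.
    synchronized String nextServer() {
        for (int i = 0; i < entryList.size(); i++) {
            String candidate = entryList.get(i);
            if (eligible(candidate)) {
                entryList.remove(i);
                entryList.add(candidate);
                return candidate;
            }
        }
        return null;
    }
}
```

In the three-node example, calling setLameDuckServer("B3") before the passive reconnect leaves B2 as the only eligible entry regardless of list order. When clearLameDuckServer() should be called (for instance once the passive has connected elsewhere) is left open here.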

Current workaround

We demote the LDM server in pool ordering before triggering the passive reconnect, which reduces the chance of it being selected again. However, this is a positional heuristic: it relies on ordering behaviour rather than an explicit skip rule. If the pool ordering changes (e.g. due to a shuffle on the next connectSucceeded()), the protection may not hold. A first-class setLameDuckServer() API that guarantees the server is skipped until explicitly cleared is the correct and reliable fix.
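To illustrate why the demotion is fragile, here is a rough sketch of the idea. DemoteWorkaround, demote() and firstNonActive() are hypothetical helpers written for this example, not the code we actually run, and URIs are plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical illustration of the positional workaround.
class DemoteWorkaround {
    // Moves the given server to the end of the list if it is present.
    static void demote(List<String> entryList, String uri) {
        if (entryList.remove(uri)) {
            entryList.add(uri);
        }
    }

    // Mirrors a rule that skips only the active server.
    static String firstNonActive(List<String> entryList, String active) {
        for (String candidate : entryList) {
            if (!candidate.equals(active)) return candidate;
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> entries = new ArrayList<>(Arrays.asList("B3", "B1", "B2"));
        demote(entries, "B3");                 // [B1, B2, B3]
        System.out.println(firstNonActive(entries, "B1"));
        Collections.swap(entries, 0, 2);       // stands in for a reshuffle: [B3, B2, B1]
        System.out.println(firstNonActive(entries, "B1"));
    }
}
```

After demote() the selection lands on B2, but once the ordering changes so that B3 is ahead of B2 again, the same rule picks B3. The swap is only a deterministic stand-in for whatever reordering the pool performs.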
