fix(discovery): do not advance last_notified for an unmatched trusted writer (SPDP silent-isolation backport) #4

Merged
amitsingh21 merged 1 commit into rapyuta/2.1.4-foxy-sedp-fix from rapyuta/2.1.4-foxy-spdp-rediscovery-fix
May 27, 2026

@amitsingh21

Summary

Backports the upstream Fast-DDS v2.6.7 (ROS 2 Humble) guard into this Foxy 2.1.4 line: in StatelessReader::change_received, do not advance last_notified for a trusted (SPDP/SEDP framework) writer that is not currently matched.

One-line change in effect:

if (!thereIsUpperRecordOf(...)) {
    // Default: advance last_notified, as before.
    bool update_notified = true;
    // Trusted (SPDP/SEDP) writer: advance only while it is matched.
    if (m_trustedWriterEntityId == change->writerGUID.entityId)
        update_notified = (writer is in matched_writers_);
    if (received_change(...)) { ...; if (update_notified) update_last_notified(...); ... }
}

Root cause — production "silent isolation"

Observed on the fleet (SCL bot156 bot↔edge; bot23↔elevator response leg): after a WiFi outage longer than the lease duration, a participant is silently isolated. Beacons still flow, but the participant is never re-discovered and its endpoints never re-pair; the only recovery is a process restart. gdb confirmed the signature: StatelessReader::processDataMsg keeps firing while the count for PDPListener::onNewCacheChangeAdded stays at 0 (SPDP samples are dropped before they reach the listener).

Mechanism:

  • SPDP DATA(p) is best-effort and re-announced with a frozen sequence number (PDP history is KEEP_LAST(1) keyed by participant).
  • 2.1.4 advances last_notified unconditionally. An SPDP/SEDP sample accepted during the unmatched window (right after a lease-driven remove_remote_participant) writes a history_record keyed by the raw writer GUID.
  • That record is normally migrated to the persistence GUID by add_persistence_guid during assignRemoteEndpoints. But PDPListener::onNewCacheChangeAdded releases the reader mutex around createParticipantProxyData()/assignRemoteEndpoints(), and the migrate can be skipped (participant-proxy pool exhaustion → createParticipantProxyData returns nullptr; or a concurrent lease-timer removal). The raw-GUID record is then orphaned.
  • Every later frozen-sequence re-announce now satisfies thereIsUpperRecordOf() and is dropped before PDPListener → no re-discovery → silent isolation.
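The gating step above can be reproduced in a toy model. This is a minimal sketch, not the Fast-DDS code: `ToyReader`, `replay_frozen_announcements`, and the string GUID are hypothetical names chosen for illustration, and the model assumes the raw-GUID record is never migrated (the orphaned case described above).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

// Toy model only: names echo the PR text, not the real Fast-DDS classes.
struct ToyReader
{
    // Highest notified sequence number per writer GUID (stands in for history_record).
    std::map<std::string, std::uint64_t> history_record;
    int delivered_to_listener = 0;
    int dropped = 0;

    // Stands in for thereIsUpperRecordOf(): true if a record >= seq already exists.
    bool there_is_upper_record_of(const std::string& guid, std::uint64_t seq) const
    {
        auto it = history_record.find(guid);
        return it != history_record.end() && it->second >= seq;
    }

    // Unpatched behaviour as described above: last_notified always advances.
    void change_received(const std::string& guid, std::uint64_t seq)
    {
        if (there_is_upper_record_of(guid, seq))
        {
            ++dropped;                  // sample never reaches the listener
            return;
        }
        history_record[guid] = seq;     // raw-GUID record written unconditionally
        ++delivered_to_listener;
    }
};

// Replays SPDP re-announces that all carry the same (frozen) sequence number.
// Assumption: the record is never migrated or cleared, i.e. it is orphaned.
inline ToyReader replay_frozen_announcements(int announcements)
{
    assert(announcements >= 0);
    ToyReader reader;
    for (int i = 0; i < announcements; ++i)
    {
        reader.change_received("raw-writer-guid", 1);
    }
    return reader;
}
```

In this model only the first announcement reaches the listener; every later one with the same sequence number is counted as dropped, which is the shape of the gdb signature reported above.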

Fix

Skip update_last_notified for an unmatched trusted writer. With no raw-GUID write there is nothing to strand, so every ordering between the receive thread and the lease-timer thread resolves cleanly, and re-announces from an unmatched participant are reprocessed until it re-discovers. Matched writers still dedup normally → no steady-state cost (important on the high-fan-in side, e.g. an elevator/edge tracking ~200 peers).
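A minimal sketch of the guard's effect, again as a toy model rather than the actual StatelessReader code. `GuardedToyReader` and `count_delivered` are hypothetical names, the `trusted` flag stands in for the m_trustedWriterEntityId comparison, and the model deliberately leaves out the listener and the match transition.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <string>

// Toy model only: illustrates the guard, not the real Fast-DDS data structures.
struct GuardedToyReader
{
    std::map<std::string, std::uint64_t> last_notified;
    std::set<std::string> matched_writers;
    int delivered = 0;
    int dropped = 0;

    // `trusted` stands in for the m_trustedWriterEntityId comparison.
    bool change_received(const std::string& guid, std::uint64_t seq, bool trusted)
    {
        auto it = last_notified.find(guid);
        if (it != last_notified.end() && it->second >= seq)
        {
            ++dropped;                  // deduplicated before the listener
            return false;
        }
        bool update_notified = true;
        if (trusted)
        {
            // Guard: only a currently matched trusted writer advances last_notified.
            update_notified = matched_writers.count(guid) > 0;
        }
        ++delivered;
        if (update_notified)
        {
            last_notified[guid] = seq;
        }
        return true;
    }
};

// Counts how many frozen-sequence announcements from one trusted writer are delivered.
inline int count_delivered(bool writer_is_matched, int announcements)
{
    assert(announcements >= 0);
    GuardedToyReader reader;
    const std::string guid = "raw-writer-guid";
    if (writer_is_matched)
    {
        reader.matched_writers.insert(guid);
    }
    for (int i = 0; i < announcements; ++i)
    {
        reader.change_received(guid, 1, true);
    }
    return reader.delivered;
}
```

With the writer unmatched, every announcement is delivered because no record is ever written for it; with the writer matched, only the first is delivered and the rest are deduplicated, so the steady-state path is unchanged.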

Validation (lab reproducer)

  • Mechanism: under forced proxy-pool pressure, an over-limit participant strands on the unpatched lib — SPDP ACCEPT=1, DROP≈60 (gated forever). With the guard: ACCEPT on every announcement, DROP=0 (never gated, retries).
  • Data recovery: when a proxy slot frees, the guarded (never-gated) participant re-discovers and data resumes; the unpatched (gated) participant stays dead.

Scope / risk

  • Single file (StatelessReader.cpp), behavior change limited to trusted framework writers (SPDP/SEDP); user-data readers unaffected.
  • Exact backport of code shipping since v2.6.7 across Humble/Iron/Jazzy/Rolling → low risk.
  • Stacks on the SEDP NACK/GAP deadlock backport already on rapyuta/2.1.4-foxy-sedp-fix (different, writer-side flavor) — together they cover both observed post-blip failure modes.

Production validation to follow via canary (patched multiarch SECURITY=ON lib on a wedge-prone bot).

… writer

Backport of the upstream guard introduced in Fast-DDS v2.6.7 (ROS 2 Humble) to
this Foxy 2.1.4 line.

Root cause (production "silent isolation"; SCL bot156 and bot23<->elevator wedge):
SPDP DATA(p) is best-effort and re-announced with a frozen sequence number (the
PDP reader history is KEEP_LAST(1) keyed by participant). In 2.1.4
StatelessReader::change_received advances last_notified unconditionally, so an
SPDP/SEDP sample accepted during the unmatched window (just after a lease-driven
remove_remote_participant) writes a history_record keyed by the raw writer GUID.
If that record is not migrated to the persistence GUID -- PDPListener releases the
reader mutex around createParticipantProxyData()/assignRemoteEndpoints(), and the
migrate can be skipped (e.g. participant-proxy pool exhaustion, or a concurrent
lease-timer removal) -- the raw-GUID record is orphaned. Every subsequent
frozen-sequence re-announce then satisfies thereIsUpperRecordOf() and is dropped
before reaching PDPListener, so the participant is never re-discovered, its
endpoints never re-pair, and it stays silently isolated until process restart.
Confirmed in production via gdb (StatelessReader::processDataMsg fires while
PDPListener::onNewCacheChangeAdded count stays at 0).

Fix: for a trusted (framework) writer that is not currently in matched_writers_,
skip update_last_notified. With no raw-GUID write there is nothing to strand;
every ordering between the receive thread and the lease-timer thread then resolves
cleanly, and re-announces from an unmatched participant are reprocessed until it is
re-discovered. Matched writers still dedup normally, so there is no steady-state cost.

Validated with a lab reproducer: a participant whose proxy slot is unavailable
strands and is permanently gated on the unpatched library (SPDP ACCEPT=1, DROP~60),
whereas with the guard it is never gated (ACCEPT on every announcement) and
re-discovers the moment a slot frees, restoring data; the unpatched peer stays dead.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@amitsingh21 amitsingh21 merged commit 5c7bd4a into rapyuta/2.1.4-foxy-sedp-fix May 27, 2026
4 checks passed
@amitsingh21 amitsingh21 deleted the rapyuta/2.1.4-foxy-spdp-rediscovery-fix branch May 27, 2026 23:58
