fix(discovery): do not advance last_notified for an unmatched trusted writer (SPDP silent-isolation backport)#4
Merged
amitsingh21 merged 1 commit intoMay 27, 2026
Conversation
… writer Backport of the upstream guard introduced in Fast-DDS v2.6.7 (ROS 2 Humble) to this Foxy 2.1.4 line. Root cause (production "silent isolation"; SCL bot156 and bot23<->elevator wedge): SPDP DATA(p) is best-effort and re-announced with a frozen sequence number (the PDP reader history is KEEP_LAST(1) keyed by participant). In 2.1.4 StatelessReader::change_received advances last_notified unconditionally, so an SPDP/SEDP sample accepted during the unmatched window (just after a lease-driven remove_remote_participant) writes a history_record keyed by the raw writer GUID. If that record is not migrated to the persistence GUID -- PDPListener releases the reader mutex around createParticipantProxyData()/assignRemoteEndpoints(), and the migrate can be skipped (e.g. participant-proxy pool exhaustion, or a concurrent lease-timer removal) -- the raw-GUID record is orphaned. Every subsequent frozen-sequence re-announce then satisfies thereIsUpperRecordOf() and is dropped before reaching PDPListener, so the participant is never re-discovered, its endpoints never re-pair, and it stays silently isolated until process restart. Confirmed in production via gdb (StatelessReader::processDataMsg fires while PDPListener::onNewCacheChangeAdded count stays at 0). Fix: for a trusted (framework) writer that is not currently in matched_writers_, skip update_last_notified. With no raw-GUID write there is nothing to strand; every ordering between the receive thread and the lease-timer thread then resolves cleanly, and re-announces from an unmatched participant are reprocessed until it is re-discovered. Matched writers still dedup normally, so there is no steady-state cost. Validated with a lab reproducer: a participant whose proxy slot is unavailable strands and is permanently gated on the unpatched library (SPDP ACCEPT=1, DROP~60), whereas with the guard it is never gated (ACCEPT on every announcement) and re-discovers the moment a slot frees, restoring data; the unpatched peer stays dead. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Backports the upstream Fast-DDS v2.6.7 (ROS 2 Humble) guard into this Foxy
2.1.4line: inStatelessReader::change_received, do not advancelast_notifiedfor a trusted (SPDP/SEDP framework) writer that is not currently matched.One-line change in effect:
Root cause — production "silent isolation"
Observed on the fleet (SCL
bot156bot↔edge;bot23↔elevator response leg): after a WiFi blip > lease, a participant is silently isolated — beacons flow but it is never re-discovered, endpoints never re-pair, recoverable only by restarting the process. gdb confirmed the signature:StatelessReader::processDataMsgkeeps firing whilePDPListener::onNewCacheChangeAddedstays at 0 (SPDP dropped before the listener).Mechanism:
DATA(p)is best-effort and re-announced with a frozen sequence number (PDP history isKEEP_LAST(1)keyed by participant).last_notifiedunconditionally. An SPDP/SEDP sample accepted during the unmatched window (right after a lease-drivenremove_remote_participant) writes ahistory_recordkeyed by the raw writer GUID.add_persistence_guidduringassignRemoteEndpoints. ButPDPListener::onNewCacheChangeAddedreleases the reader mutex aroundcreateParticipantProxyData()/assignRemoteEndpoints(), and the migrate can be skipped (participant-proxy pool exhaustion →createParticipantProxyDatareturnsnullptr; or a concurrent lease-timer removal). The raw-GUID record is then orphaned.thereIsUpperRecordOf()and is dropped before PDPListener → no re-discovery → silent isolation.Fix
Skip
update_last_notifiedfor an unmatched trusted writer. With no raw-GUID write there is nothing to strand, so every ordering between the receive thread and the lease-timer thread resolves cleanly, and re-announces from an unmatched participant are reprocessed until it re-discovers. Matched writers still dedup normally → no steady-state cost (important on the high-fan-in side, e.g. an elevator/edge tracking ~200 peers).Validation (lab reproducer)
ACCEPT=1, DROP≈60(gated forever). With the guard:ACCEPTon every announcement,DROP=0(never gated, retries).Scope / risk
StatelessReader.cpp), behavior change limited to trusted framework writers (SPDP/SEDP); user-data readers unaffected.rapyuta/2.1.4-foxy-sedp-fix(different, writer-side flavor) — together they cover both observed post-blip failure modes.Production validation to follow via canary (patched multiarch
SECURITY=ONlib on a wedge-prone bot).