feat(gateway): expose entity context on /faults/stream SSE events #400
Merged
Pull request overview
Adds an optional x-medkit SOVD payload-extension object (entity_type, entity_id) to events streamed from GET /api/v1/faults/stream, letting consumers jump directly to /{entity_type}/{entity_id}/bulk-data/rosbags/{fault_code} instead of HEAD-probing entities. Resolution is snapshotted at fault arrival time so a discovery refresh cannot retroactively change the reported entity, and the lookup mirrors the node_to_app mapping plus runtime-fallback heuristics already used by gateway_node / ros2_runtime_introspection.
Changes:
- Introduce `QueuedEvent` (id + event + resolved `EntityContext`) replacing the `pair<id, event>` in the SSE buffer, and resolve entity context before the queue lock in `on_fault_event`.
- Implement `SSEFaultHandler::resolve_entity_context`: manifest/hybrid via `cache.resolve_node_to_app` (with and without leading slash), runtime fallback to the FQN's last segment, and the collision-disambiguated `<ns_prefix>_<name>` form (only when the App exists in the cache).
- Document the new extension in the package README, top-level REST docs, and changelog; add 7 unit tests covering snapshot semantics, manifest/runtime/collision/no-leading-slash variants, and the omit cases.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/ros2_medkit_gateway/include/ros2_medkit_gateway/http/handlers/sse_fault_handler.hpp | Adds EntityContext/QueuedEvent types, new resolve_entity_context() declaration, and switches the event deque element type. |
| src/ros2_medkit_gateway/src/http/handlers/sse_fault_handler.cpp | Snapshots entity context at enqueue, emits x-medkit in format_sse_event, implements manifest + runtime fallback resolution with has_app validation. |
| src/ros2_medkit_gateway/test/test_sse_fault_handler.cpp | Adds 7 GTests covering match/no-match, runtime fallback, manifest mapping (with/without leading slash), collision-disambiguated app, snapshot-at-enqueue, and empty reporting_sources. |
| src/ros2_medkit_gateway/README.md | Documents the new x-medkit payload extension and updated example payload for the SSE stream endpoint. |
| src/ros2_medkit_gateway/CHANGELOG.rst | Adds the Unreleased feature entry referencing #380. |
| docs/api/rest.rst | Updates the global SOVD extensions section with the new field on /faults/stream. |
Each `GET /api/v1/faults/stream` event payload now carries an optional
`x-medkit` SOVD payload-extension object with `entity_type` and
`entity_id` fields when the gateway can resolve the fault's first
reporting source back to a SOVD entity. Consumers can hit
`/{entity_type}/{entity_id}/bulk-data/rosbags/{fault_code}` directly
instead of enumerating apps + components and HEAD-probing each one.
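A minimal consumer-side sketch of the direct lookup described above, assuming the consumer has already parsed the event JSON. The `EntityContext` struct and the `rosbag_path` helper here are hypothetical names for illustration, not part of the gateway API; only the path template comes from the commit message.

```cpp
#include <optional>
#include <string>

// Hypothetical consumer-side mirror of the `x-medkit` object.
struct EntityContext {
  std::string entity_type;  // e.g. "apps"
  std::string entity_id;    // e.g. "motor_controller"
};

// Builds /{entity_type}/{entity_id}/bulk-data/rosbags/{fault_code}.
// Returns std::nullopt when the event carried no `x-medkit` object, in
// which case the consumer still has to fall back to discovery.
std::optional<std::string> rosbag_path(const std::optional<EntityContext>& ctx,
                                       const std::string& fault_code) {
  if (!ctx) {
    return std::nullopt;
  }
  return "/" + ctx->entity_type + "/" + ctx->entity_id +
         "/bulk-data/rosbags/" + fault_code;
}
```

Returning an empty optional instead of a guessed path keeps the fallback decision explicit on the consumer side.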
Nested under `x-medkit` per the SOVD payload-extension convention
(matches `x-medkit.aggregation_level`, `x-medkit.phase`, etc.). Flat
`x-medkit-*` names are reserved for endpoint paths (`/x-medkit-graph`)
and error codes, not payload fields.
Resolution chain (mirrors gateway_node's node_resolver lambda for the
manifest path):
- Manifest / hybrid: cache's node-to-app linking index, both with and
without the leading slash on the FQN.
- Runtime fallback: FQN's last segment, validated against
`cache.has_app()` so we never point consumers at a 404.
- Runtime collision fallback: `<ns_prefix>_<name>` form (slashes in
the namespace replaced with `_`), matching the disambiguation rule
in ros2_runtime_introspection.cpp for nodes that share a name
across namespaces.
- No match: `x-medkit` object omitted entirely; an `RCLCPP_DEBUG`
trace records the unresolved source so operators can diagnose
consumers stuck on the discovery fallback path. Backward-compatible
for existing SOVD consumers.
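The chain above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `sse_fault_handler.cpp`: `FakeCache` is a hypothetical stand-in for the gateway's entity cache, the free function takes the cache as a parameter, the namespace prefix is assumed to drop its leading slash before slashes become `_`, and `has_app` validation is applied only to the two runtime forms.

```cpp
#include <algorithm>
#include <initializer_list>
#include <optional>
#include <string>
#include <unordered_map>
#include <unordered_set>

struct EntityContext {
  std::string entity_type;
  std::string entity_id;
};

// Hypothetical stand-in for the gateway's entity cache.
struct FakeCache {
  std::unordered_map<std::string, std::string> node_to_app;
  std::unordered_set<std::string> apps;

  std::optional<std::string> resolve_node_to_app(const std::string& fqn) const {
    auto it = node_to_app.find(fqn);
    if (it == node_to_app.end()) return std::nullopt;
    return it->second;
  }
  bool has_app(const std::string& id) const { return apps.count(id) > 0; }
};

std::optional<EntityContext> resolve_entity_context(const FakeCache& cache,
                                                    const std::string& source) {
  if (source.empty()) return std::nullopt;

  // 1. Manifest / hybrid: try the FQN with and without the leading slash.
  const std::string with_slash = source.front() == '/' ? source : "/" + source;
  const std::string without_slash = with_slash.substr(1);
  for (const auto& key : {with_slash, without_slash}) {
    if (auto app = cache.resolve_node_to_app(key)) {
      return EntityContext{"apps", *app};
    }
  }

  // 2. Runtime fallback: last FQN segment, only if that app exists.
  const auto last_slash = with_slash.rfind('/');
  const std::string name = with_slash.substr(last_slash + 1);
  if (name.empty()) return std::nullopt;
  if (cache.has_app(name)) return EntityContext{"apps", name};

  // 3. Collision fallback: <ns_prefix>_<name>, slashes replaced with '_'.
  std::string ns = last_slash > 0 ? with_slash.substr(1, last_slash - 1) : std::string{};
  if (!ns.empty()) {
    std::replace(ns.begin(), ns.end(), '/', '_');
    const std::string candidate = ns + "_" + name;
    if (cache.has_app(candidate)) return EntityContext{"apps", candidate};
  }

  // 4. No match: caller omits the `x-medkit` object.
  return std::nullopt;
}
```

Because the manifest lookup runs first, a manifest mapping wins over a collision-matching runtime app for the same FQN.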
Entity resolution is snapshotted in `on_fault_event` (before the
queue lock is acquired) and stored as `std::optional<EntityContext>`
on the buffered event. A discovery refresh between enqueue and
stream-out cannot retroactively flip the entity reported to
consumers, and the format path stays lock-free with respect to the
entity cache.
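To illustrate the snapshot-at-enqueue behavior, here is a small sketch. The `EventBuffer` class, its `Resolver` callback, and the `pop` method are hypothetical simplifications (the real handler works on fault event messages and an SSE stream); only `QueuedEvent`, `EntityContext`, `on_fault_event`, and the resolve-before-lock ordering are taken from the description above.

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <string>
#include <utility>

struct EntityContext {
  std::string entity_type;
  std::string entity_id;
};

struct QueuedEvent {
  std::uint64_t id;
  std::string event;                    // stand-in for the fault event payload
  std::optional<EntityContext> entity;  // snapshot taken at enqueue time
};

class EventBuffer {
 public:
  using Resolver = std::function<std::optional<EntityContext>(const std::string&)>;

  explicit EventBuffer(Resolver resolver) : resolver_(std::move(resolver)) {}

  void on_fault_event(const std::string& event, const std::string& first_source) {
    // Resolve before taking the queue lock so a slow cache lookup
    // never blocks readers of the buffer.
    auto entity = resolver_(first_source);
    std::lock_guard<std::mutex> lock(mutex_);
    queue_.push_back(QueuedEvent{next_id_++, event, std::move(entity)});
  }

  // Stream-out side: uses only the stored snapshot, never the resolver.
  std::optional<QueuedEvent> pop() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (queue_.empty()) return std::nullopt;
    QueuedEvent front = std::move(queue_.front());
    queue_.pop_front();
    return front;
  }

 private:
  Resolver resolver_;
  std::mutex mutex_;
  std::deque<QueuedEvent> queue_;
  std::uint64_t next_id_ = 1;
};
```

If the resolver's backing data changes between `on_fault_event` and `pop`, the popped event still carries the entity that was resolved at enqueue time.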
mfaferek93
reviewed
May 26, 2026
Address feedback on #400:
- Comment that reporting_sources.front() picks the lexicographically-first reporter (set ordering), not a defined owner - any co-reporter's rosbag is still fetchable, but consumers should not assume ownership.
- Comment that entity_type is hardcoded "apps" because apps are the leaf reporters; components own faults transitively. Manifest-only components without bound nodes have no match and fall back to discovery by design.
- Add StreamPrefersManifestMatchOverRuntimeFallback test that pins the ordering contract: when both a manifest node_to_app mapping and a collision-matching runtime app exist for the same FQN, manifest wins.
- Set per-test fault_manager.namespace so each TEST_F gets a unique events topic (/test<N>/fault_manager/events). Eliminates the shared /fault_manager/events that previously forced serial execution.
The four multi-gateway aggregation tests bypassed create_gateway_node() to build their own launch_ros.actions.Node instances and inherited the launch default sigterm_timeout=5/sigkill_timeout=5. Under TSan the gateway teardown sequence (mdns stop, REST server stop, transport shutdown, plugin shutdown) routinely exceeds 5s, so launch escalates SIGINT to SIGTERM and then to SIGKILL, leaving the process with exit -9 which is not in ALLOWED_EXIT_CODES. test_leaf_collision_aggregation tripped on this today; the other three are latent. Apply the same 30s/15s windows as create_gateway_node() so all four test families survive sanitizer slowdowns.
TSan flagged a Write/Read race on the rclcpp::Event control block between the per-context graph listener thread (calling NodeGraph::notify_graph_change) and our automatic member destruction in ~GatewayNode. Reproduced today in test_gateway_node where many GatewayNode instances are created and destroyed in sequence while the per-process GraphListener keeps running.
Two-pronged fix:
- Explicitly cancel and reset graph_check_timer_, backstop_timer_ and graph_event_ at the end of the ~GatewayNode body, before any other member runs its destructor. This drains the timer (its [this] lambda reads graph_event_) and drops our shared_ptr while we still control the ordering, instead of relying on declaration-order destruction.
- Add a TSan suppression for the residual library-side window inside rclcpp::node_interfaces::NodeGraph::notify_graph_change, matching the existing approach for other rclcpp-internal races (signal handler, logging, Context shutdown). We do not own that code path; the fix above closes our half of the race.
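The ordering argument in the first fix can be illustrated without rclcpp. In this sketch `Tracked` and `Node` are hypothetical stand-ins (not the gateway's types), and the timer `cancel()` step is left out; it shows only that resetting members in the destructor body releases them before the automatic, reverse-declaration-order destruction of the remaining members begins.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Appends its name to a shared log when destroyed.
struct Tracked {
  Tracked(std::string n, std::vector<std::string>& l) : name(std::move(n)), log(l) {}
  ~Tracked() { log.push_back(name); }
  std::string name;
  std::vector<std::string>& log;
};

class Node {
 public:
  explicit Node(std::vector<std::string>& log)
      : graph_event_(std::make_shared<Tracked>("graph_event", log)),
        other_(std::make_unique<Tracked>("other", log)),
        timer_(std::make_shared<Tracked>("timer", log)) {}

  ~Node() {
    // Explicit teardown: the timer (whose callback would read graph_event_)
    // goes first, then the event. Both are gone before `other_` is destroyed.
    timer_.reset();
    graph_event_.reset();
  }

 private:
  // Without the explicit resets, members are destroyed in reverse
  // declaration order: timer_, other_, graph_event_.
  std::shared_ptr<Tracked> graph_event_;
  std::unique_ptr<Tracked> other_;
  std::shared_ptr<Tracked> timer_;
};
```

With the resets in place the release order no longer depends on where the members happen to be declared in the class.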
bburda added a commit that referenced this pull request on May 26, 2026. Its message repeats the "Address feedback on #400" commit message above.
Pull Request
Summary
`GET /api/v1/faults/stream` event payloads now carry an optional `x-medkit` SOVD payload-extension object with `entity_type` and `entity_id` fields when the gateway can resolve the fault's first `reporting_sources` entry back to a SOVD entity. Consumers can then hit `/{entity_type}/{entity_id}/bulk-data/rosbags/{fault_code}` directly, replacing the O(N) HEAD-probe workaround that external integrations need today.
Nested under `x-medkit` per the SOVD payload-extension convention (matches existing `x-medkit.aggregation_level`, `x-medkit.phase`, etc.). Flat `x-medkit-*` names are reserved for endpoint paths (`/x-medkit-graph`) and error codes, not payload fields.
Resolution mirrors the existing `TriggerFaultSubscriber`/`LogManager` pattern. No match: `x-medkit` object omitted entirely (backward-compatible for existing SOVD consumers).
Resolution is snapshotted in `on_fault_event` and stored alongside the buffered event (new `QueuedEvent` struct), so a discovery refresh between enqueue and stream-out cannot retroactively flip the entity reported to consumers, and the format path stays lock-free with respect to the entity cache.
Example payload:
{
  "event_type": "fault_confirmed",
  "fault": {"fault_code": "MOTOR_OVERHEAT", "...": "..."},
  "timestamp": 1735830000.123,
  "x-medkit": {"entity_type": "apps", "entity_id": "motor_controller"}
}
Testing
Tests in `test_sse_fault_handler.cpp` covering: manifest-app match, runtime-fallback match, no-match (entity absent from cache), and empty `reporting_sources`.