Iterate full activity vector in ApiActivityInfoExchange to fix HIP-graph kernel loss#859

Open
magaonka-amd wants to merge 1 commit into
ROCm:mainfrom
magaonka-amd:fix/rocm-profiler-front-only-loss

Conversation

@magaonka-amd

@magaonka-amd magaonka-amd commented May 13, 2026

Summary

Kamil from our team noticed missing kernels in the trace while running maxtext.
In xla/backends/profiler/gpu/rocm_collector.cc, the second half of ApiActivityInfoExchange read only activity_iter.second.front() from the vector keyed by correlation_id.
This is a problem when command buffer is ON, because kernel launches can then also happen via hipGraphLaunch. A single graph launch produces multiple kernel-dispatch records that all share one correlation_id, so the current XLA code drops all but the first.

Symptom (what users see)

  • xplane device:GPU:N planes report far fewer kernels than rocprofiler-sdk actually delivered. At MaxText llama2_7b scale: 1.27M instead of 3.78M (66% undercount).
  • Back-to-back runs of the same workload report different kernel counts and different sets of unique kernel names because the surviving .front() element depends on async HSA queue completion ordering.

Cause

std::vector<RocmTracerEvent>::front() reads the first element; the remaining 1..N-1 elements are silently discarded by the loop body.
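
The difference between the old and new loop can be sketched as below. This is a minimal, self-contained illustration and not the actual rocm_collector.cc code: `Event`, `ActivityMap`, `CollectFrontOnly`, and `CollectAll` are hypothetical names standing in for RocmTracerEvent, activity_ops_events_map_, and the two versions of the loop body.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for RocmTracerEvent; only the fields needed here.
struct Event {
  uint32_t correlation_id;
  std::string name;
};

// Hypothetical stand-in for activity_ops_events_map_:
// correlation_id -> all activity records delivered for that id.
using ActivityMap = std::unordered_map<uint32_t, std::vector<Event>>;

// Old shape: only the first record per correlation_id is kept.
std::vector<Event> CollectFrontOnly(const ActivityMap& activities) {
  std::vector<Event> out;
  for (const auto& entry : activities) {
    if (entry.second.empty()) continue;
    out.push_back(entry.second.front());  // records 1..N-1 are never read
  }
  return out;
}

// New shape: every record in the vector is kept.
std::vector<Event> CollectAll(const ActivityMap& activities) {
  std::vector<Event> out;
  for (const auto& entry : activities) {
    for (const auto& event : entry.second) {
      out.push_back(event);
    }
  }
  return out;
}
```

With one correlation_id holding three kernel records and another holding one, `CollectFrontOnly` returns two events while `CollectAll` returns four. Which record sits at the front depends on insertion order, which is consistent with the run-to-run variation described under Symptom.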

Test plan

Benchmark

Comparison with and without this XLA patch applied. Workload: MaxText llama2_7b, steps=100, profiler=xplane, XLA_FLAGS=--xla_gpu_rocm_max_trace_events=1000000000:

| Metric | Before fix | After fix |
| --- | --- | --- |
| device_event_count (xplane) | 1,270,346 | 3,776,488 |
| Unique kernel kinds | 117 | 146 (+29 kinds) |
| Back-to-back run-to-run Δ | varies | 0 (bit-exact) |
| host_plane_event_count | 4,745,666 | 4,744,882 (approx. same) |
| Per-GPU events | GPU0=164k, GPU1-7=158k each | GPU0=480k, GPU1-7=471k each |

Regression test

Added RocmCollectorTest.MultipleActivitiesPerCorrelationIdAllExported in rocm_collector_test.cc. Inserts one api_event + three activity events all sharing one correlation_id (the hipGraphLaunch shape), runs Flush() + Export(), and asserts all three kernel records appear in the xplane device plane.

@magaonka-amd magaonka-amd force-pushed the fix/rocm-profiler-front-only-loss branch from c489258 to 205b429 Compare May 13, 2026 18:14
@magaonka-amd magaonka-amd added the claude-review Request a Claude AI code review for this PR label May 13, 2026
Comment on lines 691 to 695
"Could not find the counterpart HIP API.",
activity_event.correlation_id);
PrintRocmTracerEvent(activity_event, ". Dropped!");
continue;
}

issue (minor): When api_event == nullptr, the warning is logged per activity event inside the inner loop. In the hipGraph scenario (many kernel dispatches sharing one correlation_id), a single missing API event will produce N identical warning lines.

Consider moving the api_event == nullptr check above the inner loop (log once per correlation_id and continue to skip the whole vector), or use LOG_FIRST_N to cap the noise.
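
The first option (check once per correlation_id, then skip the whole vector) might look roughly like the sketch below. It is an assumption-laden illustration, not the collector's real code: `Activity`, `ApiEvent`, `ExchangeStats`, and `MatchActivities` are hypothetical names, and a warning counter stands in for the actual log call and PrintRocmTracerEvent.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Hypothetical minimal event types for illustration only.
struct Activity { uint32_t correlation_id; };
struct ApiEvent { uint32_t correlation_id; };

// Counters standing in for the real side effects (logging, aggregation).
struct ExchangeStats {
  std::size_t matched = 0;   // activity events paired with an API event
  std::size_t dropped = 0;   // activity events skipped for lack of one
  std::size_t warnings = 0;  // warning lines that would be logged
};

ExchangeStats MatchActivities(
    const std::unordered_map<uint32_t, std::vector<Activity>>& activities,
    const std::unordered_map<uint32_t, ApiEvent>& api_events) {
  ExchangeStats stats;
  for (const auto& entry : activities) {
    // The lookup depends only on the correlation_id, so do it once per vector.
    auto it = api_events.find(entry.first);
    const ApiEvent* api_event = (it == api_events.end()) ? nullptr : &it->second;
    if (api_event == nullptr) {
      ++stats.warnings;  // one warning per correlation_id, not per activity
      stats.dropped += entry.second.size();
      continue;  // skip the whole vector
    }
    for (const auto& activity : entry.second) {
      (void)activity;  // real code would merge API info into the activity
      ++stats.matched;
    }
  }
  return stats;
}
```

For a correlation_id with N activity records and no API event, this shape produces one warning and N dropped events, instead of N warnings.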

Author


Done, good catch.


Resolved ✓ — addressed in this revision. The api_event == nullptr check is now above the inner loop (log once per correlation_id, then continue).


@magaonka-amd, this is a really good catch.
Regarding "std::vector::front() reads the first element; the remaining 1..N-1 elements are silently discarded by the loop body": as I recall, when I wrote this, my observation was that the hipGraphLaunch API bundled all the events into a single one.

Comment on lines +669 to 675
// Make sure for all activity events we have API callback events.
//
// `activity_iter.second` is a vector keyed by correlation_id; a single
// hipGraphLaunch can produce many kernel-dispatch records sharing one
// correlation_id. Iterate the whole vector; the api_event lookup is
// invariant across it and hoisted out of the inner loop.
for (auto& activity_iter : activity_ops_events_map_) {

praise: Good comment explaining the motivation and the loop-invariant hoisting. The structure of the refactored code is clean — api_event lookup once per correlation_id, then iterate activity events.

Testing note: The existing rocm_collector_test.cc only covers one activity event per correlation_id. Before merging, it would be valuable to add a test that inserts multiple activity events with the same correlation_id and verifies all are present in the aggregated output. This is the core scenario the PR fixes and would provide regression coverage.


Resolved ✓ — addressed in this revision. The MultipleActivitiesPerCorrelationIdAllExported test now covers the multi-activity-per-correlation_id scenario with three kernel records sharing one correlation_id.


if (api_event == auxiliary_api_events_map_.end()) {
for (auto& activity_event : activity_iter.second) {
if (api_event == nullptr) {

nit: Now that this inner loop iterates all activity events per correlation_id (not just .front()), the aggregated_events.reserve(api_events_map_.size()) on line 626 underestimates the final size. Under the hipGraphLaunch workload this PR targets (3.78M device events per the description), this will cause frequent re-allocations.

Consider a tighter reserve, e.g.:

Suggested change
if (api_event == nullptr) {
for (auto& activity_event : activity_iter.second) {

(No change to this line — the suggestion is for line 626, which is outside the diff. Consider adding before the outer loop:)

size_t activity_count = 0;
for (const auto& [_, v] : activity_ops_events_map_) {
  activity_count += v.size();
}
aggregated_events.reserve(api_events_map_.size() + activity_count);


Resolved ✓ — addressed in this revision. The reserve now computes total_activities across all activity vectors and reserves api_events_map_.size() + total_activities.

@claude

claude Bot commented May 13, 2026

Review Summary

The fix is correct and well-structured: iterating the full activity vector per correlation_id (instead of only .front()) and hoisting the API event lookup out of the inner loop.

Three inline comments posted:

  • Reserve underestimate (line 688): aggregated_events.reserve() on line 626 doesn't account for the now-expanded iteration; will cause re-allocations under hipGraph workloads.
  • Log flooding (lines 691-695): Missing-API-event warning fires per activity event inside the inner loop; could produce many identical lines for one correlation_id.
  • Test coverage (lines 669-675): Existing tests only cover one activity event per correlation_id. A multi-activity-per-correlation_id test would provide regression coverage for this fix.

🤖 Generated with Claude Code

@github-actions github-actions Bot removed the claude-review Request a Claude AI code review for this PR label May 13, 2026
@magaonka-amd magaonka-amd force-pushed the fix/rocm-profiler-front-only-loss branch 2 times, most recently from ae1ce97 to 3a024cf Compare May 14, 2026 17:13
@magaonka-amd magaonka-amd added the claude-review Request a Claude AI code review for this PR label May 14, 2026
@claude

claude Bot commented May 14, 2026

Re-review Summary

All 3 previous review findings have been addressed in this revision:

  • Per-activity warning logging: api_event == nullptr check moved above the inner loop — logs once per correlation_id instead of per activity. ✅
  • Test coverage: MultipleActivitiesPerCorrelationIdAllExported test added, directly validating the multi-activity-per-correlation_id regression scenario. ✅
  • Reserve underestimate: Reserve now pre-computes total_activities across all activity vectors and sizes appropriately. ✅

No new issues found. The fix is clean and well-scoped — LGTM.

🤖 Generated with Claude Code

@github-actions github-actions Bot removed the claude-review Request a Claude AI code review for this PR label May 14, 2026
@magaonka-amd magaonka-amd changed the title [WIP] Iterate full activity vector in ApiActivityInfoExchange to fix HIP-graph kernel loss Iterate full activity vector in ApiActivityInfoExchange to fix HIP-graph kernel loss May 14, 2026
@magaonka-amd
Author

@cj401-amd, if the changes look okay to you, let me know and I'll open a PR upstream for this.

@magaonka-amd magaonka-amd force-pushed the fix/rocm-profiler-front-only-loss branch 3 times, most recently from f5e0bcb to ac960b5 Compare May 20, 2026 16:22
…aph kernel loss

The second half of ApiActivityInfoExchange in xla/backends/profiler/gpu/rocm_collector.cc keyed activities by correlation_id and read only activity_iter.second.front().
This is a problem when hipGraphLaunch is involved, because a single graph launch can contain multiple kernels that all share the same correlation ID, and only one of them is recorded in the xplane because the code reads only front().

Without this change, whenever command buffer is ON the XLA profiler drops events, and they will be missing from the final trace.

- Iterate the entire activity vector instead of only .front().
- Hoist the API-event lookup out of the inner loop (invariant across the vector since all activities share the correlation_id).

Added RocmCollectorTest.MultipleActivitiesPerCorrelationIdAllExported in rocm_collector_test.cc. Inserts one api_event + three activity events all sharing one correlation_id (the hipGraphLaunch shape), runs Flush() + Export(), and asserts all three kernel records appear in the xplane device plane.
@magaonka-amd magaonka-amd force-pushed the fix/rocm-profiler-front-only-loss branch from ac960b5 to 22d1781 Compare May 20, 2026 16:26

2 participants