Iterate full activity vector in ApiActivityInfoExchange to fix HIP-graph kernel loss#859
Iterate full activity vector in ApiActivityInfoExchange to fix HIP-graph kernel loss#859magaonka-amd wants to merge 1 commit into
Conversation
c489258 to
205b429
Compare
| "Could not find the counterpart HIP API.", | ||
| activity_event.correlation_id); | ||
| PrintRocmTracerEvent(activity_event, ". Dropped!"); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
issue (minor): When api_event == nullptr, the warning is logged per activity event inside the inner loop. In the hipGraph scenario (many kernel dispatches sharing one correlation_id), a single missing API event will produce N identical warning lines.
Consider moving the api_event == nullptr check above the inner loop (log once per correlation_id and continue to skip the whole vector), or use LOG_FIRST_N to cap the noise.
There was a problem hiding this comment.
Resolved ✓ — addressed in this revision. The api_event == nullptr check is now above the inner loop (log once per correlation_id, then continue).
There was a problem hiding this comment.
@magaonka-amd , this is a really good catch.
I recall when I was doing this "std::vector::front() reads the first element; the remaining 1..N-1 elements are silently discarded by the loop body", my observation was that hipGraphLaunch api bundles all the events into a single one.
| // Make sure for all activity events we have API callback events. | ||
| // | ||
| // `activity_iter.second` is a vector keyed by correlation_id; a single | ||
| // hipGraphLaunch can produce many kernel-dispatch records sharing one | ||
| // correlation_id. Iterate the whole vector; the api_event lookup is | ||
| // invariant across it and hoisted out of the inner loop. | ||
| for (auto& activity_iter : activity_ops_events_map_) { |
There was a problem hiding this comment.
praise: Good comment explaining the motivation and the loop-invariant hoisting. The structure of the refactored code is clean — api_event lookup once per correlation_id, then iterate activity events.
Testing note: The existing rocm_collector_test.cc only covers one activity event per correlation_id. Before merging, it would be valuable to add a test that inserts multiple activity events with the same correlation_id and verifies all are present in the aggregated output. This is the core scenario the PR fixes and would provide regression coverage.
There was a problem hiding this comment.
Resolved ✓ — addressed in this revision. The MultipleActivitiesPerCorrelationIdAllExported test now covers the multi-activity-per-correlation_id scenario with three kernel records sharing one correlation_id.
|
|
||
| if (api_event == auxiliary_api_events_map_.end()) { | ||
| for (auto& activity_event : activity_iter.second) { | ||
| if (api_event == nullptr) { |
There was a problem hiding this comment.
nit: Now that this inner loop iterates all activity events per correlation_id (not just .front()), the aggregated_events.reserve(api_events_map_.size()) on line 626 underestimates the final size. Under the hipGraphLaunch workload this PR targets (3.78M device events per the description), this will cause frequent re-allocations.
Consider a tighter reserve, e.g.:
| if (api_event == nullptr) { | |
| for (auto& activity_event : activity_iter.second) { |
(No change to this line — the suggestion is for line 626, which is outside the diff. Consider adding before the outer loop:)
size_t activity_count = 0;
for (const auto& [_, v] : activity_ops_events_map_) {
activity_count += v.size();
}
aggregated_events.reserve(api_events_map_.size() + activity_count);There was a problem hiding this comment.
Resolved ✓ — addressed in this revision. The reserve now computes total_activities across all activity vectors and reserves api_events_map_.size() + total_activities.
Review SummaryThe fix is correct and well-structured: iterating the full activity vector per correlation_id (instead of only Three inline comments posted:
🤖 Generated with Claude Code |
ae1ce97 to
3a024cf
Compare
Re-review SummaryAll 3 previous review findings have been addressed in this revision:
No new issues found. The fix is clean and well-scoped — LGTM. 🤖 Generated with Claude Code |
|
@cj401-amd if changes look okay to you let me know. I'll open PR upstream for this. |
f5e0bcb to
ac960b5
Compare
…aph kernel loss The second half of ApiActivityInfoExchange in xla/backends/profiler/gpu/rocm_collector.cc keyed activities by correlation_id and read only activity_iter.second.front(). This is problem with when hipGraphLaunch is involved because single graph launch can have multiple kernels all with same correlation ID. and all that gets recorded only once in xplane because it only reads front() Without this change whenever command buffer is ON xla profiler drops events and events in final trace will be missing. - Iterate the entire activity vector instead of only .front(). - Hoist the API-event lookup out of the inner loop (invariant across the vector since all activities share the correlation_id). Added RocmCollectorTest.MultipleActivitiesPerCorrelationIdAllExported in rocm_collector_test.cc. Inserts one api_event + three activity events all sharing one correlation_id (the hipGraphLaunch shape), runs Flush() + Export(), and asserts all three kernel records appear in the xplane device plane.
ac960b5 to
22d1781
Compare
Summary
Kamil from our team noticed missing kernels in the trace while running maxtext.
In
xla/backends/profiler/gpu/rocm_collector.cc, the second half ofApiActivityInfoExchangeread onlyactivity_iter.second.front()from the vector keyed by correlation_id.This is a problem when command buffer is ON and kernel launch can also happen via
hipGraphLauncha single graph launch produces multiple kernel-dispatch records all sharing one correlation_id, so the current XLA code drops all but the first.Symptom (what users see)
device:GPU:Nplanes report far fewer kernels than rocprofiler-sdk actually delivered. At MaxText llama2_7b scale: 1.27M instead of 3.78M (66% undercount)..front()element depends on async HSA queue completion ordering.Cause
std::vector<RocmTracerEvent>::front()reads the first element; the remaining 1..N-1 elements are silently discarded by the loop body.Test plan
Benchmark
with this XLA patch applied. MaxText llama2_7b,
steps=100,profiler=xplane,XLA_FLAGS=--xla_gpu_rocm_max_trace_events=1000000000:Regression test
Added
RocmCollectorTest.MultipleActivitiesPerCorrelationIdAllExportedinrocm_collector_test.cc. Inserts one api_event + three activity events all sharing one correlation_id (the hipGraphLaunch shape), runsFlush()+Export(), and asserts all three kernel records appear in the xplane device plane.