Skip to content

Implement ROCm delay kernel#882

Open
Eetusjo wants to merge 1 commit into
mainfrom
ci_rocm_delay_kernel
Open

Implement ROCm delay kernel#882
Eetusjo wants to merge 1 commit into
mainfrom
ci_rocm_delay_kernel

Conversation

@Eetusjo
Copy link
Copy Markdown

@Eetusjo Eetusjo commented May 28, 2026

No description provided.

@Eetusjo Eetusjo added the claude-review Request a Claude AI code review for this PR label May 28, 2026
Comment thread xla/stream_executor/rocm/delay_kernel.h Outdated
Comment thread xla/stream_executor/rocm/rocm_executor.cc Outdated
Comment thread xla/stream_executor/rocm/delay_kernel_rocm.cu.cc Outdated
Comment thread xla/stream_executor/rocm/delay_kernel_rocm.cu.cc Outdated
Comment thread xla/stream_executor/rocm/rocm_timer_test.cc
Comment thread xla/stream_executor/rocm/rocm_executor.cc
@claude
Copy link
Copy Markdown

claude Bot commented May 28, 2026

Review Summary

Solid implementation that mirrors the CUDA delay kernel for ROCm. The architecture gating (MI300+) and HIP serialization env-var checks are well thought out. The RocmTimer lifecycle management (destructor cleanup, timeout detection) looks correct.

Key findings (see inline comments):

  • perf: delay_kernel_supported is recomputed on every timer creation — should be cached during init like the CUDA side
  • correctness: delay kernel test runs unconditionally regardless of GPU arch — may need a skip guard for pre-MI300 hardware
  • minor: missing #include <cstdlib>, copyright year mismatch, include ordering, and clock rate comment accuracy

@github-actions github-actions Bot removed the claude-review Request a Claude AI code review for this PR label May 28, 2026
@Eetusjo Eetusjo force-pushed the ci_rocm_delay_kernel branch from 46f30de to a6995a8 Compare May 29, 2026 08:06
@Eetusjo Eetusjo added the claude-review Request a Claude AI code review for this PR label May 29, 2026
#include "rocm/include/hip/hip_runtime.h"
#include "rocm/include/hip/hip_version.h"
#include "rocm/rocm_config.h"
#include <unistd.h>
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: #include <unistd.h> was moved here from the C system headers section and is now sandwiched between rocm/rocm_config.h and the xla/ headers. Per Google style, C system headers should appear in the system headers block (after the corresponding header and C++ standard library includes). It was previously in the right place (before <algorithm>); the fix for <cstdlib> seems to have displaced it.

Suggested change
#include <unistd.h>
#include "xla/stream_executor/activate_context.h"

(Move <unistd.h> back up to the C system headers section near line 18, alongside the other <c...> includes.)

Comment on lines +42 to +44
if (target_not_reached) {
*semaphore = GpuSemaphoreState::kTimedOut;
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The CUDA version has a helpful comment here explaining why kTimedOut is written back on timeout:

// We are exiting due to the timeout. Signal this back to the host so that
// we can emit a warning, as it probably indicates suboptimal usage.

Adding a similar comment would improve readability and parity.

Comment on lines +85 to +86
LOG_FIRST_N(WARNING, 5)
<< "Delay kernel timed out, measured time has sub-optimal accuracy.";
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The CUDA version logs at ERROR level with actionable guidance ("There may be a missing warmup execution, please investigate in Nsight Systems."). Consider upgrading the severity and adding ROCm-equivalent guidance, e.g.:

Suggested change
LOG_FIRST_N(WARNING, 5)
<< "Delay kernel timed out, measured time has sub-optimal accuracy.";
LOG_FIRST_N(ERROR, 1) << "Delay kernel timed out, measured time has "
"sub-optimal accuracy. There may be a missing "
"warmup execution, please investigate in rocprof.";

@claude
Copy link
Copy Markdown

claude Bot commented May 29, 2026

Re-review Summary

Good progress — 4 of 6 previous findings have been addressed (copyright year, capability caching, timeout comment, test skip guard, <cstdlib> include). The include-order nit in delay_kernel_rocm.cu.cc remains from last round.

New findings (3 inline comments):

  • rocm_executor.cc: <unistd.h> displaced from system headers block
  • delay_kernel_rocm.cu.cc: Missing explanatory comment on timeout path (parity with CUDA)
  • rocm_timer.cc: Timeout warning could use ERROR severity and actionable guidance like the CUDA version

All new findings are minor nits/parity suggestions — no correctness issues found.

@github-actions github-actions Bot removed the claude-review Request a Claude AI code review for this PR label May 29, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant