
Reduce iota_test execution time: shared pjrt client + automatic shard number #851

Closed
mfrancepillois wants to merge 2 commits into rocm-jaxlib-v0.9.2 from ci_maxime_iota_shared_pjrt_client_rocm


@mfrancepillois mfrancepillois commented May 8, 2026

Motivation

The iota_test was very slow on AMD targets (compared to NVIDIA) for two main reasons:

  • The number of shards was fixed at 50, which led to resource contention and unnecessary context switches when running on a node that does not have that many GPUs available. => We therefore set the shard number automatically, based on the number of available GPUs. On an H100 node, this reduces the execution time of iota from about 700 seconds to ~200 seconds.
  • The pjrt client was destroyed and recreated for each of the 4500 tests that make up the iota_test. On ROCm this step, which calls hipCtxDestroy() + hipCtxCreate(), was ~40× slower than on CUDA (see table below). => We therefore modified the code to create only one pjrt client per process and share it among the tests within that process.
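As an illustration of the first point, here is a minimal sketch of deriving a shard count from the GPU count. It is not the PR's implementation: `pick_shard_count` and its `max_shards` parameter are hypothetical names, and the rule used (one shard per GPU, at least 1, at most the previous fixed value of 50) is an assumption. The GPU count is passed in rather than detected, since the PR text does not say how it is obtained.

```python
def pick_shard_count(num_gpus, max_shards=50):
    """Return a shard count bounded by the GPUs actually present.

    Assumption: one shard per available GPU, never fewer than 1 and never
    more than the previous fixed value of 50. The PR's real rule may differ.
    """
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    return max(1, min(num_gpus, max_shards))


# Show the resulting shard count for a few hypothetical node sizes.
for gpus in (0, 4, 8, 64):
    print(f"{gpus} GPUs -> {pick_shard_count(gpus)} shards")
```

The lower bound keeps the test runnable on a node where no GPU is detected, and the upper bound keeps the old value as a ceiling on very large nodes.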
| Phase | H100 | AMD MI300X |
|---|---|---|
| Previous client teardown + new client init (pre-BFC log) | ~35 ms total | ~963 ms total |
| BFC allocator re-setup (8 GPUs) | ~0.3 ms | ~0.1 ms |
| Per-test GPU lifecycle cost | 35 ms | 1009 ms |
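The shared-client change can be pictured with a small sketch of the pattern. This is not the PR's code, which changes the test infrastructure itself. `FakeClient`, `get_shared_client`, and `run_test` are hypothetical stand-ins for a pjrt client, its accessor, and a test body. The sketch shows only the idea: build the expensive object once per process and hand the same instance to every test in that process.

```python
import functools


class FakeClient:
    """Hypothetical stand-in for an expensive-to-create pjrt client."""

    instances_created = 0  # counts constructions, for illustration only

    def __init__(self):
        FakeClient.instances_created += 1


@functools.lru_cache(maxsize=None)
def get_shared_client():
    """Create the client on first use, then keep returning the same object."""
    return FakeClient()


def run_test(test_id):
    """Hypothetical test body: borrows the shared client instead of owning one."""
    client = get_shared_client()
    return id(client)


# Every test in this process sees the same client object.
client_ids = {run_test(i) for i in range(4500)}
print(len(client_ids), FakeClient.instances_created)
```

Using the per-test figures in the table, 4500 tests at about 1009 ms each add up to roughly 4540 seconds of lifecycle overhead summed over all shards on MI300X, against about 158 seconds at 35 ms each on H100. With a shared client, that cost is paid once per process rather than once per test.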

This PR re-enables iota_test in AMD CI.

@mfrancepillois mfrancepillois added the claude-review Request a Claude AI code review for this PR label May 8, 2026
@mfrancepillois mfrancepillois force-pushed the ci_maxime_iota_shared_pjrt_client_rocm branch from 7218753 to e602e57 Compare May 8, 2026 16:51
@mfrancepillois mfrancepillois marked this pull request as ready for review May 8, 2026 17:00
@mfrancepillois mfrancepillois marked this pull request as draft May 9, 2026 07:41
@mfrancepillois mfrancepillois removed the claude-review Request a Claude AI code review for this PR label May 9, 2026
@mfrancepillois mfrancepillois marked this pull request as ready for review May 9, 2026 15:43
@mfrancepillois mfrancepillois added the claude-review Request a Claude AI code review for this PR label May 9, 2026
Collaborator

@i-chaochen i-chaochen left a comment

Shouldn't this PR target the main branch instead of 0.9.2?

@mfrancepillois
Author

Shouldn't this PR target the main branch instead of 0.9.2?

Another PR has been opened against the main rocm branch: #852
This PR can therefore be closed.

Labels

claude-review Request a Claude AI code review for this PR

2 participants