Implement cache for pjrt client streams #861
Review Summary
This PR adds a process-level cache for
Key findings (details in inline comments):
🤖 Generated with Claude Code
Thanks for this work!
Just to make sure I understand correctly: from your table, does this mean only ROCm suffers this high cost when the pjrt client is destroyed and recreated a large number of times? It seems this irrational pjrt lifecycle doesn't hurt NV at all. If that's the case, I'm not sure it's enough to convince upstream to make this change. Or could we just have these changes on ROCm only?
| Phase | H100 | AMD MI300X |
|---|---|---|
| Previous client teardown + new client init (pre-BFC log) | ~35ms total | ~963ms total |
| BFC allocator re-setup (8 GPUs) | ~0.3ms | ~0.1ms |
| Per-test GPU lifecycle cost | 35ms | 1009ms |

      const LocalDeviceId local_device_id_;
      const LocalChipId local_hardware_id_;
    - const std::unique_ptr<LocalDeviceState> local_device_state_;
    + std::unique_ptr<LocalDeviceState> local_device_state_;

I guess we can only know whether this is ok by upstream review.
That’s true. The significant cost of creating and destroying streams only applies to the HIP side (I assume that stream creation and destruction have already been optimised within the CUDA driver). So, this optimisation really only benefits us...
Then I guess the best option is to have this on ROCm only?
Motivation
The iota_test was very slow on AMD targets (compared to NVIDIA) because the pjrt client was destroyed and recreated for each of the 4500 tests that make up the iota_test. This task in ROCm is ~40× slower than with CUDA (see table below). The main cause of slowdowns when creating and destroying a pjrt client lies in the creation and destruction of streams.
This PR therefore implements a cache to avoid creating and destroying the stream for each client/test, and to allow existing streams to be reused (after they have been cleared) for the next test in the process.
Note that the stream and its associated metadata are cleared before being placed in the cache to ensure the stream can be reused safely. If a previous test failed due to a hardware failure that prevents the stream from being cleared safely, the stream is destroyed and recreated for the next client/test.
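The reuse policy described above can be sketched roughly as follows. This is only an illustration of the idea, not the code in this PR: `FakeStream`, `StreamCache`, `Acquire`, `Release` and the `healthy` flag are hypothetical names standing in for the real stream type and cache, and the "hardware failure" is simulated by a boolean.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical stand-in for a device stream (not the real XLA/StreamExecutor type).
struct FakeStream {
  int id = 0;                    // creation order, used only to observe reuse
  bool healthy = true;           // false simulates a hardware failure
  std::size_t pending_work = 0;  // stands in for per-stream metadata

  // Tries to reset the stream; returns false if it cannot be cleared safely.
  bool Clear() {
    if (!healthy) return false;
    pending_work = 0;
    return true;
  }
};

// Hypothetical process-level cache: reuses cleared streams across clients.
class StreamCache {
 public:
  // Returns a cached stream if one is available, otherwise creates a new one.
  std::unique_ptr<FakeStream> Acquire() {
    std::lock_guard<std::mutex> lock(mu_);
    if (!free_.empty()) {
      std::unique_ptr<FakeStream> stream = std::move(free_.back());
      free_.pop_back();
      return stream;
    }
    auto stream = std::make_unique<FakeStream>();
    stream->id = ++created_;
    return stream;
  }

  // Clears the stream and caches it. A stream that cannot be cleared is
  // dropped here, so the next Acquire() creates a fresh one.
  void Release(std::unique_ptr<FakeStream> stream) {
    assert(stream != nullptr);
    if (!stream->Clear()) return;  // unique_ptr destroys the stream
    std::lock_guard<std::mutex> lock(mu_);
    free_.push_back(std::move(stream));
  }

  // Number of streams created so far (cache misses).
  int created() const {
    std::lock_guard<std::mutex> lock(mu_);
    return created_;
  }

 private:
  mutable std::mutex mu_;
  std::vector<std::unique_ptr<FakeStream>> free_;
  int created_ = 0;
};
```

In this sketch a client teardown would call `Release` instead of destroying its stream, and the next client would call `Acquire`. The expensive create/destroy path is only taken on a cache miss or after a stream failed to clear.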
On MI350: