xla/pjrt: honor XLA_PYTHON_CLIENT_ALLOCATOR=platform#856
Conversation
| case GpuAllocatorConfig::Kind::kPlatform: { | ||
| LOG(INFO) << "Using platform allocator."; | ||
| if (allocator_config.collective_memory_size != 0) { | ||
| LOG(WARNING) | ||
| << "collective_memory_size is non-zero, but allocator kind is set " | ||
| "to \"platform\". Collective memory will not be allocated."; | ||
| } | ||
| // Returning null will cause the client to use the default backend | ||
| // allocator. | ||
| return nullptr; | ||
| std::vector<se::StreamExecutor*> executors; | ||
| executors.reserve(addressable_devices.size()); | ||
| for (const auto& [ordinal, device] : addressable_devices) { | ||
| executors.push_back(device->executor()); | ||
| } | ||
| return std::make_unique<se::StreamExecutorAddressAllocator>(platform, | ||
| executors); | ||
| } | ||
|
|
||
| case GpuAllocatorConfig::Kind::kVmm: { | ||
| #if GOOGLE_CUDA |
There was a problem hiding this comment.
nit: Like the kVmm case, kPlatform now returns early, which means it skips the collective memory allocator and host allocator setup at lines 1570–1591. The existing warning about collective_memory_size is good, but consider adding a brief comment or LOG(WARNING) noting that kPlatform does not support alternate memory spaces (collective, host, temp buffer) through MultiDeviceAdapter. This would help users who hit failures when using collective operations or host memory allocations with kPlatform.
There was a problem hiding this comment.
LOG(WARNING) on every kplatform... hmm not needed I think
| TEST(StreamExecutorGpuClientTest, PlatformAllocatorIsSynchronousPassthrough) { | ||
| GpuClientOptions options; | ||
| options.allocator_config.kind = GpuAllocatorConfig::Kind::kPlatform; | ||
| options.allowed_devices = {0}; | ||
|
|
||
| TF_ASSERT_OK_AND_ASSIGN(auto client, GetStreamExecutorGpuClient(options)); | ||
|
|
||
| auto* pjrt_se_client = | ||
| tensorflow::down_cast<PjRtStreamExecutorClient*>(client.get()); | ||
| EXPECT_NE(dynamic_cast<se::StreamExecutorAddressAllocator*>( | ||
| pjrt_se_client->allocator()), | ||
| nullptr); | ||
| EXPECT_EQ(dynamic_cast<se::MultiDeviceAdapter*>(pjrt_se_client->allocator()), | ||
| nullptr); | ||
| } | ||
|
|
||
| #if GOOGLE_CUDA | ||
| TEST(StreamExecutorGpuClientTest, VmmAllocatorCanBeSet) { | ||
| GpuClientOptions options; |
There was a problem hiding this comment.
nit: The test is well-structured and mirrors the existing VmmAllocatorCanBeSet pattern nicely. Two optional improvements:
-
Interaction with command-buffer override: There's a pre-existing override at
se_gpu_pjrt_client.cc:1493–1498that silently converts any non-kVmmkind tokVmmwhenxla_gpu_command_buffer_update_modeis notALWAYS_UPDATE. This test passes because the default isALWAYS_UPDATE, but could break if that default ever changes. Consider explicitly settingxla_gpu_command_buffer_update_modein the test's debug options, or adding a comment noting the dependency. -
E2E coverage: Unlike
VmmAllocatorE2ETestwhich runs an HLO program, this test only checks the allocator type. An additional E2E test that allocates/deallocates a buffer through the platform allocator would provide stronger confidence that the passthrough semantics work end-to-end.
There was a problem hiding this comment.
added comment.
i dont think hlo test is needed.
Review SummaryOverall: Looks good. This is a well-motivated bug fix. The old Two minor suggestions posted inline:
No blocking issues found. 🤖 Generated with Claude Code |
05c8d1b to
4d8398a
Compare
i-chaochen
left a comment
There was a problem hiding this comment.
IIUC NV will also have this same issue, isn't? or they don't because they are using VMM?
| executors.push_back(device->executor()); | ||
| } | ||
| return std::make_unique<se::StreamExecutorAddressAllocator>(platform, | ||
| executors); |
There was a problem hiding this comment.
openxla#7963 (comment) have you checked their previous fix?
There was a problem hiding this comment.
let me check this and get back to you thank you.
There was a problem hiding this comment.
IUC NV will also have this same issue, isn't --> Yes NV also has this problem
openxla#7963 (comment) -- here concern is about constructing
MultiDeviceAdapter object which we dont do here. I think we are okay in this.
There was a problem hiding this comment.
Thanks for clarification. I assume you found out this from jax upstream pytest? If it's also failed NV side why it doesn't appear on their NV CI?
There was a problem hiding this comment.
yes in pytest run I came across this issue.
In pytest rocm started seeing flaky autotuner problems , after debugging we realized BFC allocator was getting called and when BFCAllocator does asynchronous deallocation there is a race condition.
all these problems got exposed by openxla@426087bc1d recent commit from G.
so my argument here is we have two problems:
- BFC getting called when it shouldn't have because pytest run from JAX explicitly asks for platform allocator : so this PR is for that
- race condition in when BFC allocator is used : this is something we need to debug ( ruturaj is on it , I'm also looking bit into it ).
now coming to how NV survives this: even when platform allocator not used NV doesnt suffer from race issues that we are having. why exactly it doesn't race is not clear to me yet. I tried playing with CUDA reproducer on our H100 machine. my results were not really conclusive.
so in short answer:
does issue 1 exist in CUDA? yes.
does it lead to UT fails in CUDA ? No
does issue 1 lead to UT fails in ROCm ? Yes.
There was a problem hiding this comment.
AFAIK platform allocator also depends on BFC to do memory/fragmentation management. The only diff is the initial stage of BFC will pre-allocate most of memory while platform allocator doesn't. So I still think we need to firstly figure out the root cause of race condition in BFC allocator (if we think the issue is from there)
| for (const auto& [ordinal, device] : addressable_devices) { | ||
| executors.push_back(device->executor()); | ||
| } | ||
| return std::make_unique<se::StreamExecutorAddressAllocator>(platform, |
There was a problem hiding this comment.
Why is this early return safe, and why is GetGpuHostAllocator not required?
There was a problem hiding this comment.
Here I want to provide similar option as kvmm , if user explicitly sets platform allocator is needed , we intentionally want to bypass BFC wrapped alternative allocator.
IIUC GetGpuHostAllocator provides memory pool and when user asks for kplatform he wants to opt out from the pooled memory.
There was a problem hiding this comment.
Makes sense. Thanks for explaining.
|
@i-chaochen is it okay to open this PR upstream? |
|
Thanks for the explain. Yes, please |
|
openxla#42627 opened PR upstream. thanks everyone for the feedback. |
Updates LLVM usage to match [a225aafbd1a4](llvm/llvm-project@a225aafbd1a4) PiperOrigin-RevId: 922174432
…to use a fully-parameterised SpmdPartitioningTest. PiperOrigin-RevId: 922192214
Make GpuAllocatorConfig::Kind::kPlatform actually deliver a synchronous passthrough allocator instead of returning nullptr. Previously the env var XLA_PYTHON_CLIENT_ALLOCATOR=platform was silently overridden by the BFC allocator that Backend::Backend() builds unconditionally for CUDA/ROCm platforms. Adds a regression test PlatformAllocatorIsSynchronousPassthrough that asserts the resulting client's allocator is a StreamExecutorAddressAllocator and not a MultiDeviceAdapter (which would indicate the silent BFC fallback).
…iption_test DWYU flagged //xla/runtime:process_id as unused by se_gpu_topology_description_test. Confirmed: the test source contains no reference to any process_id / ProcessId symbol and does not include "xla/runtime/process_id.h", so the dep is dead.
a8a2432 to
8536283
Compare
clang-tidy-cuda's misc-include-cleaner flagged xla/stream_executor/device_address.h as not used directly inside PlatformAllocatorIsSynchronousPassthrough, and dwyu independently flagged that both device_address and stream had no matching BUILD dep on //xla/pjrt/gpu:se_gpu_pjrt_client_test. These two includes were carried over from the original commit when the monolithic se_gpu_pjrt_client_test.cc still owned the multi-GPU tests that needed them. After the file split (openxla/xla 591da9d) my new single-GPU test only uses StreamExecutorAddressAllocator and MultiDeviceAdapter; nothing from stream.h or device_address.h. Pruning the two includes silences both clang-tidy and dwyu without changing any logic. Verified locally with --config=rocm on gfx950: bazel build //xla/pjrt/gpu:se_gpu_pjrt_client_test -> Build completed successfully No functional change.
📝 Summary of Changes
GpuAllocatorConfig::Kind::kPlatforminGetStreamExecutorGpuDeviceAllocator()now returns an explicit
StreamExecutorAddressAllocator(synchronouspassthrough) instead of
nullptr. This restores the user-visible meaning ofXLA_PYTHON_CLIENT_ALLOCATOR=platform, which was being silently overridden bythe BFC allocator that
Backend::Backend()builds unconditionally forCUDA/ROCm platforms.
🎯 Justification
Setting
XLA_PYTHON_CLIENT_ALLOCATOR=platformis supposed to deliver asynchronous passthrough to
cudaMalloc/hipMalloc. After commit426087bc1d(PR#41761, "Port XLA Backend to use
BFC allocator"),
Backend::Backend()unconditionally builds aMultiDeviceAdapter-over-tsl::BFCAllocatorfor CUDA/ROCm platforms andaccepts no
GpuAllocatorConfig. Combined with PJRT's pre-existingpjrt_stream_executor_client.cc:311-315fallback(
if (owned_allocator_ == nullptr) allocator_ = client_->backend().memory_allocator();),users who set
kPlatformget the BFC anyway. The env var is silently ignored.Steps to reproduce (pre-fix)
Pre-fix output:
The order is the giveaway: the BFC named
XLA_backend_<n>_bfc(constructedin
xla/service/backend.cc:164) is the one growing to serve userallocations, even though PJRT logs
"Using platform allocator."for the sameworkload. The env var was a no-op on this code path.
Post-fix output (same script, same env): the
Extending allocation by ... for XLA_backend_<n>_bfc.line under load is gone. Only the synchronouspassthrough is exercised, which is what
kPlatformadvertises.🚀 Kind of Contribution
🐛 Bug Fix
🧪 Unit Tests
Adds
PlatformAllocatorIsSynchronousPassthroughinxla/pjrt/gpu/se_gpu_pjrt_client_test.cc, mirroring the existingVmmAllocatorCanBeSetpattern. Asserts:allocator()IS aStreamExecutorAddressAllocator.allocator()is NOT aMultiDeviceAdapter— thenegative assertion catches a future regression where someone re-introduces
the silent fallback to Backend's BFC.