Add support for NCCL in communication_object #195

Draft
msimberg wants to merge 138 commits into ghex-org:master from msimberg:nccl-2

msimberg commented Dec 22, 2025

Collaborator

Nothing to see here yet.

This requires ghex-org/oomph#55 and #190.

Updates communication_object to use the start_group/end_group functionality from oomph/NCCL and to take is_stream_aware into account.

Also does a minor refactoring of packer and communication object helper functions so that the different stages are a bit easier to follow:

  • optional: sync before packing
  • pack
  • send
  • recv
  • unpack
  • optional: sync after unpacking
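
The stage order listed above can be sketched as a toy model. This is only an illustration: `exchange`, its two flags, and the in-memory "transport" are hypothetical names invented here, not functions from GHEX or oomph, and the sketch says nothing about how `is_stream_aware` decides whether the optional syncs run.

```python
def exchange(fields, sync_before_pack=False, sync_after_unpack=False):
    """Toy model of the exchange stages; returns (unpacked fields, stage log)."""
    log = []

    # Optional stage: synchronize before packing.
    if sync_before_pack:
        log.append("sync")

    # Pack: flatten all fields into one contiguous buffer.
    buffer = [value for f in fields for value in f]
    log.append("pack")

    # Send / recv: the "transport" here is just a copy in memory.
    in_flight = list(buffer)
    log.append("send")
    received = in_flight
    log.append("recv")

    # Unpack: cut the buffer back into pieces of the original lengths.
    unpacked, pos = [], 0
    for f in fields:
        unpacked.append(received[pos:pos + len(f)])
        pos += len(f)
    log.append("unpack")

    # Optional stage: synchronize after unpacking.
    if sync_after_unpack:
        log.append("sync")

    return unpacked, log


out, stages = exchange([[1, 2], [3]], sync_before_pack=True)
```

Keeping each stage as a separate, named step is what makes the optional syncs easy to place at either end of the sequence.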

This replaces #185.

Comment thread cmake/ghex_external_dependencies.cmake Outdated
Comment thread include/ghex/packer.hpp Outdated
Comment thread include/ghex/packer.hpp Outdated
Comment thread include/ghex/packer.hpp Outdated
Comment thread include/ghex/packer.hpp Outdated
Comment thread test/unstructured/test_user_concepts.cpp Outdated
Comment thread include/ghex/communication_object.hpp Outdated
msimberg added 28 commits June 25, 2026 18:17
# Conflicts:
#	ext/oomph
#	test/bindings/python/test_structured_pattern.py
- Update oomph submodule to include self-send/recv fix
- Update test helpers to match new error message
- Self-send/recv now works within NCCL groups
- Add NCCL_PXN_DISABLE=1 to test environment to avoid PXN warnings with host buffers
- Convert TODO comment in cubed_sphere test to explanatory comment
- This is a known NCCL limitation when using CPU memory
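
A minimal sketch of passing `NCCL_PXN_DISABLE=1` to a test process. The commit does not say where the variable is set (CMake, CI configuration, or a launcher script), so the helper `run_with_pxn_disabled` is a hypothetical stand-in that shows only the general mechanism.

```python
import os
import subprocess
import sys


def run_with_pxn_disabled(cmd):
    """Run cmd with NCCL_PXN_DISABLE=1 added to a copy of the environment."""
    env = dict(os.environ, NCCL_PXN_DISABLE="1")
    return subprocess.run(cmd, env=env, capture_output=True, text=True, check=True)


# Stand-in for a real test binary: a child process that prints the variable.
child = [sys.executable, "-c", "import os; print(os.environ['NCCL_PXN_DISABLE'])"]
result = run_with_pxn_disabled(child)
```

Copying the environment instead of mutating `os.environ` keeps the setting local to the launched test.
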
The communication_object_ipr was not using start_group()/end_group()
around send/recv operations. This caused self-send/recv to fail with
NCCL backend since self-send/recv requires an active group.

This fix wraps post_recvs() and pack() (which calls send()) in a
group, matching the behavior of the regular communication_object.
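
The grouping requirement can be illustrated with a toy communicator. `ToyCommunicator` and the `group` helper are hypothetical and only mimic the behaviour described in this commit message (self-send/recv fails outside a group); they are not the oomph or GHEX API.

```python
from contextlib import contextmanager


class ToyCommunicator:
    """Toy model: operations posted inside a group are completed at end_group()."""

    def __init__(self, rank):
        self.rank = rank
        self.group_active = False
        self.pending = []    # operations queued inside the current group
        self.completed = []  # operations that have been issued

    def start_group(self):
        self.group_active = True

    def end_group(self):
        self.completed.extend(self.pending)
        self.pending.clear()
        self.group_active = False

    def _post(self, kind, peer):
        if peer == self.rank and not self.group_active:
            raise RuntimeError("self-send/recv requires an active group")
        target = self.pending if self.group_active else self.completed
        target.append((kind, peer))

    def recv(self, peer):
        self._post("recv", peer)

    def send(self, peer):
        self._post("send", peer)


@contextmanager
def group(comm):
    """Guarantee that end_group() runs even if posting an operation raises."""
    comm.start_group()
    try:
        yield comm
    finally:
        comm.end_group()


comm = ToyCommunicator(rank=0)
with group(comm):
    comm.recv(0)  # corresponds to posting the receives
    comm.send(0)  # corresponds to pack(), which sends

# The same self-send outside a group fails in this model.
try:
    comm.send(0)
    failed_outside_group = False
except RuntimeError:
    failed_outside_group = True
```
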
- Update spack nccl.yaml to use oomph commit 3d7f658 (self-send/recv fix)
- Increase parallel Python test timeout to 1200s (was 600s)
- Fixes test_parallel_4_nccl failures:
  * simple_regular_domain and local_rma: error message mismatch
  * py_unstructured_domain_descriptor_parallel: timeout
Remove GTEST_SKIP statements for NCCL backend in structured tests:
- test_regular_domain.cpp: 6 tests enabled
- test_cubed_sphere_exchange.cpp: 2 tests enabled

These tests now work with NCCL thanks to the self-send/recv fix
(commit 3d7f658 in oomph). The tests already have exception handlers
for NCCL-specific errors (thread_safe, self-send/recv outside groups).

Threaded variants will still fail with NCCL (thread_safe not supported)
but the exception handlers will catch and report these appropriately.
The structured tests reveal a pre-existing bug in the NCCL backend's
handling of periodic boundaries in the decomposed dimension (x).

Symptoms:
- All 6 regular_domain tests fail with incorrect x-coordinates in halo regions
- y and z coordinates are correct
- Error is systematic across all type combinations and architectures

Root cause:
- NCCL backend or packer has a bug in how it handles halo exchanges
  for the decomposed dimension with periodic boundaries
- This bug was hidden because the tests were previously skipped

Action:
- Re-add GTEST_SKIP statements for all structured tests
- Document this as a known issue requiring further investigation
- Self-send/recv fix (commit 3d7f658) was correct and necessary,
  but revealed this deeper issue

This unblocks CI while the structured grid bug is investigated separately.
- Enable exchange_host_host and exchange_host_host_vector tests
- Add detailed debug output to check_values showing rank, domain_id,
  local/global indices, found vs expected values, halos, and domain bounds
- Skip split/threaded variants for NCCL (known limitations)
- This is for CI investigation only
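
A sketch of the kind of check and debug message described above, assuming each cell stores its own global x-coordinate and the x dimension is periodic. The function name, its parameters, and the message layout are illustrative only and are not the actual test helper.

```python
def check_values(rank, domain_id, first, last, halo, found, global_size):
    """Compare a 1D strip (halo + interior + halo) against periodic coordinates.

    Local index 0 corresponds to global index first - halo; values outside
    [0, global_size) wrap around because the dimension is periodic.
    """
    errors = []
    for local, value in enumerate(found):
        global_x = first - halo + local
        expected = global_x % global_size
        if value != expected:
            errors.append(
                f"rank={rank} domain={domain_id} local={local} global={global_x} "
                f"found={value} expected={expected} halo={halo} "
                f"bounds=[{first}, {last}]"
            )
    return errors


# Domain owning global x in [0, 3] of an 8-cell periodic axis, halo width 1.
ok = check_values(0, 0, 0, 3, 1, [7, 0, 1, 2, 3, 4], 8)
bad = check_values(0, 0, 0, 3, 1, [-1, 0, 1, 2, 3, 4], 8)
```

In this example the left halo cell must hold 7, the wrapped neighbour of cell 0, so an unfilled halo shows up as a single reported mismatch.
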
- Enable regular_domain tests (host_host, host_host_vector, device_device,
  host_device) with run() and run_split() variants
- Enable cubed_sphere tests (cubed_sphere, cubed_sphere_vector)
- Threaded variants remain disabled for NCCL (documented limitation)
- Remove debug output from check_values
NCCL doesn't support tags, so messages are matched by order. When two
communicators share the same NCCL comm and send concurrently to the same
rank, messages can get mixed up. This is a fundamental NCCL limitation.

- Skip run_split() variants for NCCL (same as threaded variants)
- Keep run() enabled (single communicator, no interference)
- Cubed sphere tests remain enabled (they use single communicator)
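
The mix-up can be reproduced with a toy model of order-based matching. `FifoChannel` and `TaggedChannel` are hypothetical classes written for this illustration; they model only the matching rule and do not call NCCL.

```python
from collections import deque


class FifoChannel:
    """Tagless channel to one rank: receives match sends purely by posting order."""

    def __init__(self):
        self.queue = deque()

    def send(self, payload):
        self.queue.append(payload)

    def recv(self):
        return self.queue.popleft()


class TaggedChannel:
    """Channel with tags: each receive matches only sends with the same tag."""

    def __init__(self):
        self.queues = {}

    def send(self, tag, payload):
        self.queues.setdefault(tag, deque()).append(payload)

    def recv(self, tag):
        return self.queues[tag].popleft()


# Two logical communicators, A and B, share one tagless channel.
# The sender happens to post B's message first; the receiver posts A's receive first.
fifo = FifoChannel()
fifo.send("payload for B")
fifo.send("payload for A")
received_by_a = fifo.recv()
received_by_b = fifo.recv()

# With tags, the same interleaving is harmless.
tagged = TaggedChannel()
tagged.send("B", "payload for B")
tagged.send("A", "payload for A")
tagged_a = tagged.recv("A")
tagged_b = tagged.recv("B")
```

With a single communicator there is only one producer per rank pair, so send and receive order agree and the tagless channel is sufficient.
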
NCCL doesn't support tags, so messages are matched by FIFO order per rank
pair. When exchanging fields across architectures (CPU and GPU), the
communication_object processes all CPU fields first, then all GPU fields.
This ordering mismatch causes messages to be received incorrectly.

- Skip exchange_host_device and exchange_host_device_vector for NCCL
- Same-architecture exchanges (host_host, device_device) work correctly
- Document this as a known NCCL limitation
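
A small sketch of why an ordering disagreement breaks order-based matching. The field names and the two orderings are assumptions made for illustration; the commit message does not specify which side of the exchange uses which order.

```python
def fifo_deliver(send_order, recv_order):
    """Pair the i-th posted receive with the i-th posted send (no tags)."""
    return dict(zip(recv_order, send_order))


def cpu_first(fields):
    """Order fields by architecture: all CPU fields, then all GPU fields."""
    cpu = [name for name, arch in fields if arch == "cpu"]
    gpu = [name for name, arch in fields if arch == "gpu"]
    return cpu + gpu


# Hypothetical mixed-architecture exchange, listed in declaration order.
fields = [("temperature", "gpu"), ("pressure", "cpu"), ("velocity", "gpu")]
declared = [name for name, _ in fields]

# One side uses CPU-first order, the other declaration order: data is misdelivered.
mixed = fifo_deliver(cpu_first(fields), declared)

# Both sides use the same order: every receive gets its own field.
consistent = fifo_deliver(cpu_first(fields), cpu_first(fields))
```

When all fields live on one architecture, grouping by architecture leaves the order unchanged, which is consistent with same-architecture exchanges working.
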
The _nix_build/ directory contains build artifacts that should not be
tracked in git. Remove it from tracking and add to .gitignore.

This fixes the clang-format CI failures (7,450 violations were all in
third-party and generated files under _nix_build/).
Previous commit (ecbc76d) only updated .gitignore but didn't remove
the files from the index. This commit properly removes all 139 files
under _nix_build/ from git tracking.
The UCX backend was hanging in py_unstructured_domain_descriptor_parallel
when run after structured Python tests. Root cause: Python test fixtures
were not explicitly cleaning up MPI communicators and context objects,
leaving UCX/MPI state that affected subsequent tests.

Fix:
- Use pytest yield pattern in context, mpi_cart_comm, and cart_context fixtures
- Explicitly delete context objects and call gc.collect() to force cleanup
- Call Free() on MPI Cartesian communicators to release UCX resources

This ensures proper cleanup between tests and prevents UCX state accumulation
that was causing deadlocks in subsequent parallel tests.

Tested: Full test suite passes 5 consecutive runs without hangs.
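
The yield-based cleanup can be sketched as follows. `FakeCartComm` and `FakeContext` are stand-ins so the example runs without MPI; in the real tests the generator would be decorated with `@pytest.fixture` and the communicator would be an MPI Cartesian communicator whose `Free()` releases its resources. The generator is driven by hand here to make the teardown step visible.

```python
import gc


class FakeCartComm:
    """Stand-in for an MPI Cartesian communicator."""
    freed = 0

    def Free(self):
        FakeCartComm.freed += 1


class FakeContext:
    """Stand-in for a context object that holds on to the communicator."""

    def __init__(self, comm):
        self.comm = comm


def cart_context():
    # Setup: everything before the yield runs before the test body.
    comm = FakeCartComm()
    ctx = FakeContext(comm)
    yield ctx
    # Teardown: everything after the yield runs once the test has finished.
    del ctx        # drop the fixture's reference to the context
    gc.collect()   # force collection so the context is destroyed now
    comm.Free()    # release the communicator last


# Drive the generator the way a test runner would.
fixture = cart_context()
ctx = next(fixture)                      # setup
used_comm_is_cart = isinstance(ctx.comm, FakeCartComm)  # "test body"
del ctx
next(fixture, None)                      # teardown
```

Freeing the communicator only after the context is gone avoids destroying a resource that a live object still depends on.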
- Remove all debug print statements and file logging from test_unstructured_domain_descriptor.py
- Add comments explaining why explicit cleanup is necessary in Python test fixtures
- Explicit cleanup prevents UCX/MPI state accumulation that can cause subsequent tests to hang


2 participants