Draft
138 commits
99fe0a0
Try 1d block for pack/unpack
msimberg Oct 24, 2025
527d590
Add dumb nccl implementation
msimberg Oct 15, 2025
78879bb
Add back cuda event class
msimberg Oct 24, 2025
f314a1c
Add TODO for nccl in cmake
msimberg Oct 24, 2025
ab0dfd0
Clean up nccl parts
msimberg Oct 24, 2025
4b5833f
Small fix to stream syncing with nccl
msimberg Oct 24, 2025
ee1b851
Update test to disable cpu exchange with nccl
msimberg Oct 24, 2025
186965d
Merge branch 'async-mpi-2' into nccl-2
msimberg Nov 26, 2025
2744fec
Add FindNCCL.cmake
msimberg Nov 26, 2025
618e7eb
Update communication object for nccl integration
msimberg Dec 3, 2025
b7f573d
Update oomph
msimberg Dec 3, 2025
aede6ab
Update oomph
msimberg Dec 18, 2025
38cb29b
Update oomph
msimberg Dec 22, 2025
bacc1a4
Remove debug print
msimberg Dec 22, 2025
fe958f2
Minor cleanup
msimberg Dec 22, 2025
0fc74be
Merge commit 'f7273c2a232a0bb37cf869c9ee33c688387cf41b' into nccl-2
msimberg Dec 22, 2025
49b9bf7
Format files
msimberg Dec 22, 2025
34fc8d0
Update oomph
msimberg Dec 22, 2025
a51de83
Merge remote-tracking branch 'philip-paul-mueller/phimuell__async-mpi…
msimberg Dec 22, 2025
0614d9b
Update oomph
msimberg Dec 22, 2025
dd3257b
Merge remote-tracking branch 'origin/master' into nccl-2
msimberg Dec 22, 2025
83a1c7e
Remove NCCL macros
msimberg Dec 22, 2025
f112e4f
Sync with master
msimberg Dec 22, 2025
15e7c05
Update oomph
msimberg Jan 6, 2026
0d91ac9
Merge remote-tracking branch 'philip-paul-mueller/phimuell__async-mpi…
msimberg Jan 6, 2026
b9f4d55
Make cmake configuration error out if GPU support isn't enabled when …
msimberg Jan 6, 2026
26f4f3e
Remove unused functions in packer
msimberg Jan 6, 2026
baf0dc6
Remove dummy callback todos
msimberg Jan 6, 2026
3f7709f
Small updates
msimberg Jan 6, 2026
d1b41ea
Small fixes to async mode
msimberg Jan 6, 2026
173abd7
Clean up communication object exchange implementations
msimberg Jan 6, 2026
d8c8631
Refactor packing/unpacking etc.
msimberg Jan 7, 2026
9e695c3
More pack/unpack cleanup
msimberg Jan 7, 2026
6367449
Format files
msimberg Jan 7, 2026
606e4d0
Un-disable a test with NCCL
msimberg Jan 7, 2026
ac9f1c1
Minor cleanup
msimberg Jan 7, 2026
4a491fd
Update oomph
msimberg Jan 7, 2026
ffbb066
Update oomph
msimberg Jan 8, 2026
5dee034
Minor formatting, unused variable warnings etc.
msimberg Jan 8, 2026
da0f77c
Fix compilation with hip
msimberg Jan 8, 2026
289a761
Formatting
msimberg Jan 8, 2026
fdcd4ac
Formatting
msimberg Jan 8, 2026
4aaaadf
Merge remote-tracking branch 'philip-paul-mueller/phimuell__async-mpi…
msimberg Jan 8, 2026
993393d
Remove wrong assertion
msimberg Jan 8, 2026
7459c63
Update some tests for NCCL
msimberg Jan 8, 2026
5a65526
Disable more tests with NCCL
msimberg Jan 8, 2026
e6cec86
Merge tag 'v0.5.0' into nccl-2
msimberg Mar 25, 2026
c28a432
Add comment about NCCL PXN to cubed_sphere tests
msimberg Jan 8, 2026
21bb1c0
Fix compilation
msimberg Mar 25, 2026
f5bbf49
Update oomph
msimberg Mar 25, 2026
451c51d
Remove unnecessary find_package(NCCL)
msimberg Mar 25, 2026
2719c94
Add back comment
msimberg Mar 25, 2026
5f6e151
Rethrow exceptions with bare throw
msimberg Mar 25, 2026
793c766
Remove outdated comment
msimberg Mar 25, 2026
ec275a8
Use high priority CUDA streams
msimberg Mar 28, 2026
705fa70
Add CSCS CI configuration
msimberg May 4, 2026
430a0c4
Merge branch 'cscs-ci' into nccl-2
msimberg May 4, 2026
cee6d3f
Use oomph-nccl spack-packages branch
msimberg May 4, 2026
4a7c324
Use stable spack-packages commit in CI
msimberg May 4, 2026
9ce266f
Set cuda_arch in spack envs
msimberg May 4, 2026
d61dc94
Use bundled gtest in CI
msimberg May 4, 2026
65db90b
Explicitly init submodules
msimberg May 4, 2026
4509f11
Disable libfabric tests
msimberg May 4, 2026
83c8fb3
Merge branch 'cscs-ci' into nccl-2
msimberg May 4, 2026
e7d5d74
Add nccl backend to CSCS CI
msimberg May 4, 2026
51146c9
Add nccl backend spack env
msimberg May 4, 2026
11bd546
Use oomph nccl development branch
msimberg May 4, 2026
7279d68
Fix package attributes
msimberg May 4, 2026
8246b0d
Run all 2, 4, 6 ranks tests in CI
msimberg May 4, 2026
e86e0f2
Clean up extends in cscs ci config
msimberg May 6, 2026
91ce13c
Do more robust waiting for symlinked Testing directory to be created
msimberg May 6, 2026
c735ee2
Use -L for checking if Testing exists
msimberg May 6, 2026
ce4f3f4
Don't specify number of slurm nodes to allow 6-rank tests to run on t…
msimberg May 7, 2026
35c4b7d
Use node-local temp ctest Testing directory
msimberg May 7, 2026
9ff914b
Remove unnecessary podman login
msimberg May 7, 2026
dca2e06
Add +python to cscs ci spack spec
msimberg May 7, 2026
591a169
Run tests in spack build-env
msimberg May 7, 2026
4627171
Merge branch 'cscs-ci' into nccl-2
msimberg May 7, 2026
696ac24
Update ci config after merge
msimberg May 7, 2026
bb3947f
Fix nccl config
msimberg May 7, 2026
6c72655
Merge remote-tracking branch 'origin/master' into nccl-2
msimberg Jun 1, 2026
8c80cf1
Fix stale comment in stub cuda_event operator bool
msimberg Jun 1, 2026
097612b
Add +python variant to NCCL spack spec
msimberg Jun 1, 2026
4011e49
Apply clang-format
msimberg Jun 1, 2026
3f01e5f
Apply clang-format
msimberg Jun 1, 2026
4809c28
Update spack-packages commit
msimberg Jun 1, 2026
303628c
Add more missing hip functions
msimberg Jun 1, 2026
bb09a33
Update python tests for nccl
msimberg Jun 1, 2026
9cbd86a
Explicitly set cuda arch
msimberg Jun 1, 2026
0490732
Add NCCL_DEBUG=WARN to ci
msimberg Jun 1, 2026
80bcc14
Debug prints
msimberg Jun 1, 2026
fe0cbec
Disable thread-safe tests with NCCL backend
msimberg Jun 2, 2026
2e642b0
Use latest nccl-context branch
msimberg Jun 2, 2026
71d6009
Cleanup and small fixes
msimberg Jun 2, 2026
1e757cd
Update oomph commit
msimberg Jun 2, 2026
9eb63ec
Update oomph commit in ci
msimberg Jun 2, 2026
9bc836d
Formatting
msimberg Jun 3, 2026
401478b
Bump oomph
msimberg Jun 3, 2026
7248d1b
Debug logs for nccl
msimberg Jun 3, 2026
c7f8c97
Skip cubed_sphere test for nccl because of self-exchange
msimberg Jun 3, 2026
51df2d9
Increase test timeout
msimberg Jun 3, 2026
2a47a6a
Format files again
msimberg Jun 3, 2026
9ad517a
Increase slurm time limit
msimberg Jun 3, 2026
9a5c2a0
Fix NCCL_DEBUG
msimberg Jun 3, 2026
4a94652
Verbose test output
msimberg Jun 3, 2026
c831232
Try older pytorch base image
msimberg Jun 3, 2026
855a86f
Actually verbose ctest
msimberg Jun 3, 2026
fb546a9
Remove TODO
msimberg Jun 3, 2026
9a4a635
Bump oomph
msimberg Jun 3, 2026
64b2741
Skip structured tests with nccl
msimberg Jun 3, 2026
3b47e58
Bump oomph to latest nccl-context (35811d3)
msimberg Jun 25, 2026
b7b0d33
Merge remote-tracking branch 'origin/master' into nccl-2
msimberg Jun 25, 2026
1a38761
Increase test timeout to 600s and add --output-on-failure
msimberg Jun 25, 2026
967d2ac
Bump oomph and update self-send/recv error message
msimberg Jun 25, 2026
350ba6e
Add NCCL_PXN_DISABLE=1 to CI and document PXN limitation
msimberg Jun 25, 2026
52fa8ea
Add group support to communication_object_ipr
msimberg Jun 25, 2026
0134d3a
Fix NCCL CI: update oomph commit and increase Python test timeout
msimberg Jun 25, 2026
2bd0367
Fix oomph version name in spack config for NCCL backend
msimberg Jun 25, 2026
0526c9c
Enable structured tests for NCCL backend
msimberg Jun 25, 2026
7afd427
Re-disable structured tests for NCCL backend
msimberg Jun 26, 2026
a93fc58
wip: enable structured tests with NCCL backend for investigation
msimberg Jun 26, 2026
f07cbad
Enable all structured tests for NCCL backend
msimberg Jun 26, 2026
6e2670c
Skip run_split() for NCCL backend in structured tests
msimberg Jun 26, 2026
d2e1036
Skip mixed-architecture exchanges for NCCL backend
msimberg Jun 26, 2026
ecbc76d
Remove accidentally committed _nix_build/ directory
msimberg Jun 26, 2026
8147380
Actually remove _nix_build/ from git tracking
msimberg Jun 26, 2026
4a69b3e
Add debug logging to Python unstructured domain descriptor test
msimberg Jun 26, 2026
537e56a
Skip UCX backend tests with thread_safe=false to work around indeterm…
msimberg Jun 26, 2026
edb622c
Skip all UCX backend parallel Python tests to work around indetermini…
msimberg Jun 26, 2026
fde32ea
Add file-based debug logging to understand UCX hang
msimberg Jun 26, 2026
132ec63
Remove UCX skip workaround to confirm issue still exists
msimberg Jun 26, 2026
5052086
Fix UCX hang by using old code path for UCX/MPI backends, new path fo…
msimberg Jun 26, 2026
a7cb0b3
Fix compilation: add this-> prefix to pack_and_send() calls
msimberg Jun 26, 2026
a05433a
Add back pack_and_send() methods for UCX/MPI backends
msimberg Jun 26, 2026
8bd5589
Restore old pack() functions that take (Map, Requests, Communicator) …
msimberg Jun 26, 2026
bdd4ede
Fix UCX hang by adding explicit cleanup to Python test fixtures
msimberg Jun 27, 2026
f937af8
Remove debug logging and add explanatory comments for UCX hang fix
msimberg Jun 27, 2026
b435d46
Update README to include NCCL as available transport backend
msimberg Jun 27, 2026
1 change: 1 addition & 0 deletions .cscs-ci/container/build.Containerfile
@@ -16,6 +16,7 @@ RUN spack -e ci build-env ghex -- \
-DGHEX_USE_BUNDLED_GTEST=ON \
-DGHEX_USE_BUNDLED_OOMPH=OFF \
-DGHEX_USE_BUNDLED_GRIDTOOLS=OFF \
-DCMAKE_CUDA_ARCHITECTURES=90 \
-DGHEX_USE_GPU=ON \
-DGHEX_GPU_TYPE=NVIDIA \
-DGHEX_BUILD_PYTHON_BINDINGS=ON \
54 changes: 49 additions & 5 deletions .cscs-ci/default.yaml
@@ -2,9 +2,9 @@ include:
- remote: 'https://gitlab.com/cscs-ci/recipes/-/raw/master/templates/v2/.ci-ext.yml'

variables:
- BASE_IMAGE: jfrog.svc.cscs.ch/docker-group-csstaff/alps-images/ngc-pytorch:26.01-py3-alps4-dev
+ BASE_IMAGE: jfrog.svc.cscs.ch/docker-group-csstaff/alps-images/ngc-pytorch:26.01-py3-alps3
SPACK_SHA: v1.1.1
- SPACK_PACKAGES_SHA: a010c65289743f900bdbbfb840e4d1876c24e93f # develop on 2025-05-08
+ SPACK_PACKAGES_SHA: 8ea120fe82c02737dddef32451edf88929f266ff # https://github.com/msimberg/spack-packages/tree/oomph-nccl

Review comment (collaborator, PR author): TODO: use commit on develop.

FF_TIMESTAMPS: true

.build_deps_template:
@@ -25,6 +25,12 @@ variables:
reports:
dotenv: base-${BACKEND}.env

build_deps_nccl:
variables:
BACKEND: nccl
extends:
- .build_deps_template

build_deps_mpi:
extends: .build_deps_template
variables:
@@ -53,6 +59,14 @@ build_deps_libfabric:
reports:
dotenv: build-${BACKEND}.env

build_nccl:
extends: .build_template
variables:
BACKEND: nccl
needs:
- job: build_deps_nccl
artifacts: true

build_mpi:
extends: .build_template
variables:
@@ -81,7 +95,7 @@ build_libfabric:
extends: .container-runner-clariden-gh200
variables:
SLURM_GPUS_PER_TASK: 1
- SLURM_TIMELIMIT: '5:00'
+ SLURM_TIMELIMIT: '15:00'
SLURM_PARTITION: normal
SLURM_MPI_TYPE: pmix
SLURM_NETWORK: disable_rdzv_get
@@ -90,13 +104,15 @@ build_libfabric:
PMIX_MCA_psec: native
PMIX_MCA_gds: "^shmem2"
USE_MPI: NO
NCCL_DEBUG: trace
NCCL_PXN_DISABLE: 1

.test_serial_template:
extends: .test_template_base
variables:
SLURM_NTASKS: 1
script:
- - spack -e ci build-env ghex -- ctest --test-dir /ghex/build -L "serial" --output-on-failure --timeout 60 --parallel 8
+ - spack -e ci build-env ghex -- ctest --test-dir /ghex/build -L "serial" --verbose --output-on-failure --timeout 600 --parallel 8

.test_parallel_template:
extends: .test_template_base
@@ -105,12 +121,18 @@ build_libfabric:
# writing inside the container.
- if [[ "${SLURM_LOCALID}" == 0 ]]; then rm -rf /ghex/build/Testing; mkdir /tmp/Testing; ln -s /tmp/Testing /ghex/build/Testing; fi
- until [[ -L /ghex/build/Testing ]]; do sleep 1; done
- - spack -e ci build-env ghex -- ctest --test-dir /ghex/build -L "parallel-ranks-${SLURM_NTASKS}" --output-on-failure --timeout 60
+ - spack -e ci build-env ghex -- ctest --test-dir /ghex/build -L "parallel-ranks-${SLURM_NTASKS}" --verbose --output-on-failure --timeout 600

.test_parallel_job:
extends: .test_parallel_template
image: $BUILD_IMAGE

.test_parallel_nccl:
extends: .test_parallel_job
needs:
- job: build_nccl
artifacts: true

.test_parallel_mpi:
extends: .test_parallel_job
needs:
@@ -129,6 +151,28 @@ build_libfabric:
# - job: build_libfabric
# artifacts: true

test_serial_nccl:
extends: .test_serial_template
needs:
- job: build_nccl
artifacts: true
image: $BUILD_IMAGE

test_parallel_2_nccl:
extends: .test_parallel_nccl
variables:
SLURM_NTASKS: 2

test_parallel_4_nccl:
extends: .test_parallel_nccl
variables:
SLURM_NTASKS: 4

test_parallel_6_nccl:
extends: .test_parallel_nccl
variables:
SLURM_NTASKS: 6

test_serial_mpi:
extends: .test_serial_template
needs:
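The `.test_parallel_template` script in this file has one process per node (the one with `SLURM_LOCALID` equal to 0) replace `/ghex/build/Testing` with a symlink into node-local `/tmp`, while every process polls with `-L` before starting `ctest`. The following is a minimal POSIX shell sketch of that pattern, not the CI script itself: `link_testing_dir` and its three arguments are hypothetical names introduced for illustration, and `mkdir -p` is used here where the CI script uses plain `mkdir`.

```shell
#!/bin/sh
set -eu

# Hypothetical helper (not part of the PR): make <build_dir>/Testing a symlink
# into a scratch directory. Only the process with local id 0 creates the link;
# every process then waits until the link is visible.
link_testing_dir() {
    build_dir="$1"
    local_id="$2"
    scratch="$3"
    if [ "$local_id" = 0 ]; then
        rm -rf "$build_dir/Testing"
        mkdir -p "$scratch/Testing"
        ln -s "$scratch/Testing" "$build_dir/Testing"
    fi
    # -L is true for the link itself, even when its target does not exist,
    # whereas -e follows the link and is false for a dangling one.
    until [ -L "$build_dir/Testing" ]; do sleep 1; done
}
```

Called as `link_testing_dir /ghex/build "$SLURM_LOCALID" /tmp`, this corresponds to the two script lines in the hunk. Testing the link itself rather than its target is presumably what the commit "Use -L for checking if Testing exists" is about, although the PR does not spell out the reason.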
5 changes: 5 additions & 0 deletions .cscs-ci/spack/libfabric.yaml
@@ -4,3 +4,8 @@ spack:
view: false
concretizer:
unify: true
packages:
oomph:
require: '@git.365fab9af9538694668fc8516750bfb0e96b6be2=main'
package_attributes:
git: "https://github.com/msimberg/oomph.git"
5 changes: 5 additions & 0 deletions .cscs-ci/spack/mpi.yaml
@@ -4,3 +4,8 @@ spack:
view: false
concretizer:
unify: true
packages:
oomph:
require: '@git.365fab9af9538694668fc8516750bfb0e96b6be2=main'
package_attributes:
git: "https://github.com/msimberg/oomph.git"
11 changes: 11 additions & 0 deletions .cscs-ci/spack/nccl.yaml
@@ -0,0 +1,11 @@
spack:
specs:
- ghex@master backend=nccl +cuda cuda_arch=90a +python
view: false
concretizer:
unify: true
packages:
oomph:
require: '@git.3d7f65888e10ba1689257184005b12e53c907738=main'
package_attributes:
git: "https://github.com/msimberg/oomph.git"
5 changes: 5 additions & 0 deletions .cscs-ci/spack/ucx.yaml
@@ -4,3 +4,8 @@ spack:
view: false
concretizer:
unify: true
packages:
oomph:
require: '@git.365fab9af9538694668fc8516750bfb0e96b6be2=main'
package_attributes:
git: "https://github.com/msimberg/oomph.git"
1 change: 1 addition & 0 deletions .gitignore
@@ -48,3 +48,4 @@ doc_src/_build/
__pycache__
_build
.venv*/
_nix_build/
2 changes: 2 additions & 0 deletions CMakeLists.txt
@@ -172,6 +172,8 @@ if(GHEX_USE_BUNDLED_OOMPH)
set_target_properties(oomph_libfabric PROPERTIES INSTALL_RPATH "${rpath_origin}")
elseif (GHEX_TRANSPORT_BACKEND STREQUAL "UCX")
set_target_properties(oomph_ucx PROPERTIES INSTALL_RPATH "${rpath_origin}")
elseif (GHEX_TRANSPORT_BACKEND STREQUAL "NCCL")
set_target_properties(oomph_nccl PROPERTIES INSTALL_RPATH "${rpath_origin}")
else()
set_target_properties(oomph_mpi PROPERTIES INSTALL_RPATH "${rpath_origin}")
endif()
4 changes: 2 additions & 2 deletions README.md
@@ -58,7 +58,7 @@ make test
| `GHEX_PYTHON_LIB_PATH=` | `<path>` | `${CMAKE_INSTALL_PREFIX}/<python-site-packages-path>` | Installation directory for GHEX's Python package
| `GHEX_WITH_TESTING=` | `{ON, OFF}` | `OFF` | Build unit tests
| `GHEX_USE_XPMEM=` | `{ON, OFF}` | `OFF` | Use Xpmem
- | `GHEX_TRANSPORT_BACKEND=` | `{MPI, UCX, LIBFABRIC}` | `MPI` | Choose transport backend
+ | `GHEX_TRANSPORT_BACKEND=` | `{MPI, UCX, LIBFABRIC, NCCL}` | `MPI` | Choose transport backend

### Pip Install

@@ -75,7 +75,7 @@ python -m pip install ghex
| `GHEX_USE_GPU=` | `{ON, OFF}` | `OFF` | Enable GPU
| `GHEX_GPU_TYPE=` | `{AUTO, NVIDIA, AMD}` | `AUTO` | Choose GPU type
| `GHEX_GPU_ARCH=` | list of archs | `"60;70;75;80"`/ `"gfx900;gfx906"` | GPU architecture
- | `GHEX_TRANSPORT_BACKEND=` | `{MPI, UCX, LIBFABRIC}` | `MPI` | Choose transport backend
+ | `GHEX_TRANSPORT_BACKEND=` | `{MPI, UCX, LIBFABRIC, NCCL}` | `MPI` | Choose transport backend


## Contributing
14 changes: 12 additions & 2 deletions cmake/ghex_external_dependencies.cmake
@@ -43,8 +43,8 @@ endif()
# ---------------------------------------------------------------------
# oomph setup
# ---------------------------------------------------------------------
- set(GHEX_TRANSPORT_BACKEND "MPI" CACHE STRING "Choose the backend type: MPI | UCX | LIBFABRIC")
- set_property(CACHE GHEX_TRANSPORT_BACKEND PROPERTY STRINGS "MPI" "UCX" "LIBFABRIC")
+ set(GHEX_TRANSPORT_BACKEND "MPI" CACHE STRING "Choose the backend type: MPI | UCX | LIBFABRIC | NCCL")
+ set_property(CACHE GHEX_TRANSPORT_BACKEND PROPERTY STRINGS "MPI" "UCX" "LIBFABRIC" "NCCL")
cmake_dependent_option(GHEX_USE_BUNDLED_OOMPH "Use bundled oomph." ON "GHEX_USE_BUNDLED_LIBS" OFF)
if(GHEX_USE_BUNDLED_OOMPH)
set(OOMPH_GIT_SUBMODULE OFF CACHE BOOL "")
@@ -53,6 +53,11 @@ if(GHEX_USE_BUNDLED_OOMPH)
set(OOMPH_WITH_LIBFABRIC ON CACHE BOOL "Build with LIBFABRIC backend")
elseif(GHEX_TRANSPORT_BACKEND STREQUAL "UCX")
set(OOMPH_WITH_UCX ON CACHE BOOL "Build with UCX backend")
elseif(GHEX_TRANSPORT_BACKEND STREQUAL "NCCL")
set(OOMPH_WITH_NCCL ON CACHE BOOL "Build with NCCL backend")
if(NOT GHEX_USE_GPU)
message(FATAL_ERROR "GHEX_TRANSPORT_BACKEND=NCCL requires GHEX_USE_GPU=ON but GHEX_USE_GPU=OFF")
endif()
endif()
if(GHEX_USE_GPU)
set(HWMALLOC_ENABLE_DEVICE ON CACHE BOOL "True if GPU support shall be enabled")
@@ -70,6 +75,9 @@ if(GHEX_USE_BUNDLED_OOMPH)
if(TARGET oomph_ucx)
add_library(oomph::oomph_ucx ALIAS oomph_ucx)
endif()
if(TARGET oomph_nccl)
add_library(oomph::oomph_nccl ALIAS oomph_nccl)
endif()
if(TARGET oomph_libfabric)
add_library(oomph::oomph_libfabric ALIAS oomph_libfabric)
endif()
@@ -82,6 +90,8 @@ function(ghex_link_to_oomph target)
target_link_libraries(${target} PRIVATE oomph::oomph_libfabric)
elseif (GHEX_TRANSPORT_BACKEND STREQUAL "UCX")
target_link_libraries(${target} PRIVATE oomph::oomph_ucx)
elseif (GHEX_TRANSPORT_BACKEND STREQUAL "NCCL")
target_link_libraries(${target} PRIVATE oomph::oomph_nccl)
else()
target_link_libraries(${target} PRIVATE oomph::oomph_mpi)
endif()
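The guard added in this file makes configuration fail with a `FATAL_ERROR` when `GHEX_TRANSPORT_BACKEND=NCCL` is requested without `GHEX_USE_GPU=ON`; as the hunk shows, it sits inside the `if(GHEX_USE_BUNDLED_OOMPH)` branch. Below is a rough shell sketch of the same rule as a pre-flight check before invoking `cmake`. The function name `ghex_configure_args` is hypothetical and not part of GHEX; the two `-D` options are the ones listed in the README tables.

```shell
#!/bin/sh
set -eu

# Hypothetical pre-flight helper (not part of GHEX): print the cmake options
# for a backend/GPU combination, rejecting NCCL without GPU support in the
# same way the CMake guard does.
ghex_configure_args() {
    backend="$1"
    use_gpu="$2"
    if [ "$backend" = NCCL ] && [ "$use_gpu" != ON ]; then
        echo "GHEX_TRANSPORT_BACKEND=NCCL requires GHEX_USE_GPU=ON" >&2
        return 1
    fi
    echo "-DGHEX_TRANSPORT_BACKEND=$backend -DGHEX_USE_GPU=$use_gpu"
}
```

One possible use is `args=$(ghex_configure_args NCCL ON) && cmake -S . -B build $args`, assuming the command runs from the GHEX source directory. Catching the mismatch in a wrapper only saves a configure round trip; the CMake check remains the authoritative one.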
2 changes: 1 addition & 1 deletion ext/oomph