Add support for NCCL in communication_object
#195
Draft
msimberg
wants to merge
138
commits into
ghex-org:master
from
msimberg:nccl-2
99fe0a0
Try 1d block for pack/unpack
msimberg 527d590
Add dumb nccl implementation
msimberg 78879bb
Add back cuda event class
msimberg f314a1c
Add TODO for nccl in cmake
msimberg ab0dfd0
Clean up nccl parts
msimberg 4b5833f
Small fix to stream syncing with nccl
msimberg ee1b851
Update test to disable cpu exchange with nccl
msimberg 186965d
Merge branch 'async-mpi-2' into nccl-2
msimberg 2744fec
Add FindNCCL.cmake
msimberg 618e7eb
Update communication object for nccl integration
msimberg b7f573d
Update oomph
msimberg aede6ab
Update oomph
msimberg 38cb29b
Update oomph
msimberg bacc1a4
Remove debug print
msimberg fe958f2
Minor cleanup
msimberg 0fc74be
Merge commit 'f7273c2a232a0bb37cf869c9ee33c688387cf41b' into nccl-2
msimberg 49b9bf7
Format files
msimberg 34fc8d0
Update oomph
msimberg a51de83
Merge remote-tracking branch 'philip-paul-mueller/phimuell__async-mpi…
msimberg 0614d9b
Update oomph
msimberg dd3257b
Merge remote-tracking branch 'origin/master' into nccl-2
msimberg 83a1c7e
Remove NCCL macros
msimberg f112e4f
Sync with master
msimberg 15e7c05
Update oomph
msimberg 0d91ac9
Merge remote-tracking branch 'philip-paul-mueller/phimuell__async-mpi…
msimberg b9f4d55
Make cmake configuration error out if GPU support isn't enabled when …
msimberg 26f4f3e
Remove unused functions in packer
msimberg baf0dc6
Remove dummy callback todos
msimberg 3f7709f
Small updates
msimberg d1b41ea
Small fixes to async mode
msimberg 173abd7
Clean up communication object exchange implementations
msimberg d8c8631
Refactor packing/unpacking etc.
msimberg 9e695c3
More pack/unpack cleanup
msimberg 6367449
Format files
msimberg 606e4d0
Un-disable a test with NCCL
msimberg ac9f1c1
Minor cleanup
msimberg 4a491fd
Update oomph
msimberg ffbb066
Update oomph
msimberg 5dee034
Minor formatting, unused variable warnings etc.
msimberg da0f77c
Fix compilation with hip
msimberg 289a761
Formatting
msimberg fdcd4ac
Formatting
msimberg 4aaaadf
Merge remote-tracking branch 'philip-paul-mueller/phimuell__async-mpi…
msimberg 993393d
Remove wrong assertion
msimberg 7459c63
Update some tests for NCCL
msimberg 5a65526
Disable more tests with NCCL
msimberg e6cec86
Merge tag 'v0.5.0' into nccl-2
msimberg c28a432
Add comment about NCCL PXN to cubed_sphere tests
msimberg 21bb1c0
Fix compilation
msimberg f5bbf49
Update oomph
msimberg 451c51d
Remove unnecessary find_package(NCCL)
msimberg 2719c94
Add back comment
msimberg 5f6e151
Rethrow exceptions with bare throw
msimberg 793c766
Remove outdated comment
msimberg ec275a8
Use high priority CUDA streams
msimberg 705fa70
Add CSCS CI configuration
msimberg 430a0c4
Merge branch 'cscs-ci' into nccl-2
msimberg cee6d3f
Use oomph-nccl spack-packages branch
msimberg 4a7c324
Use stable spack-packages commit in CI
msimberg 9ce266f
Set cuda_arch in spack envs
msimberg d61dc94
Use bundled gtest in CI
msimberg 65db90b
Explicitly init submodules
msimberg 4509f11
Disable libfabric tests
msimberg 83c8fb3
Merge branch 'cscs-ci' into nccl-2
msimberg e7d5d74
Add nccl backend to CSCS CI
msimberg 51146c9
Add nccl backend spack env
msimberg 11bd546
Use oomph nccl development branch
msimberg 7279d68
Fix package attributes
msimberg 8246b0d
Run all 2, 4, 6 ranks tests in CI
msimberg e86e0f2
Clean up extends in cscs ci config
msimberg 91ce13c
Do more robust waiting for symlinked Testing directory to be created
msimberg c735ee2
Use -L for checking if Testing exists
msimberg ce4f3f4
Don't specify number of slurm nodes to allow 6-rank tests to run on t…
msimberg 35c4b7d
Use node-local temp ctest Testing directory
msimberg 9ff914b
Remove unnecessary podman login
msimberg dca2e06
Add +python to cscs ci spack spec
msimberg 591a169
Run tests in spack build-env
msimberg 4627171
Merge branch 'cscs-ci' into nccl-2
msimberg 696ac24
Update ci config after merge
msimberg bb3947f
Fix nccl config
msimberg 6c72655
Merge remote-tracking branch 'origin/master' into nccl-2
msimberg 8c80cf1
Fix stale comment in stub cuda_event operator bool
msimberg 097612b
Add +python variant to NCCL spack spec
msimberg 4011e49
Apply clang-format
msimberg 3f01e5f
Apply clang-format
msimberg 4809c28
Update spack-packages commit
msimberg 303628c
Add more missing hip functions
msimberg bb09a33
Update python tests for nccl
msimberg 9cbd86a
Explicitly set cuda arch
msimberg 0490732
Add NCCL_DEBUG=WARN to ci
msimberg 80bcc14
Debug prints
msimberg fe0cbec
Disable thread-safe tests with NCCL backend
msimberg 2e642b0
Use latest nccl-context branch
msimberg 71d6009
Cleanup and small fixes
msimberg 1e757cd
Update oomph commit
msimberg 9eb63ec
Update oomph commit in ci
msimberg 9bc836d
Formatting
msimberg 401478b
Bump oomph
msimberg 7248d1b
Debug logs for nccl
msimberg c7f8c97
Skip cubed_sphere test for nccl because of self-exchange
msimberg 51df2d9
Increase test timeout
msimberg 2a47a6a
Format files again
msimberg 9ad517a
Increase slurm time limit
msimberg 9a5c2a0
Fix NCCL_DEBUG
msimberg 4a94652
Verbose test output
msimberg c831232
Try older pytorch base image
msimberg 855a86f
Actually verbose ctest
msimberg fb546a9
Remove TODO
msimberg 9a4a635
Bump oomph
msimberg 64b2741
Skip structured tests with nccl
msimberg 3b47e58
Bump oomph to latest nccl-context (35811d3)
msimberg b7b0d33
Merge remote-tracking branch 'origin/master' into nccl-2
msimberg 1a38761
Increase test timeout to 600s and add --output-on-failure
msimberg 967d2ac
Bump oomph and update self-send/recv error message
msimberg 350ba6e
Add NCCL_PXN_DISABLE=1 to CI and document PXN limitation
msimberg 52fa8ea
Add group support to communication_object_ipr
msimberg 0134d3a
Fix NCCL CI: update oomph commit and increase Python test timeout
msimberg 2bd0367
Fix oomph version name in spack config for NCCL backend
msimberg 0526c9c
Enable structured tests for NCCL backend
msimberg 7afd427
Re-disable structured tests for NCCL backend
msimberg a93fc58
wip: enable structured tests with NCCL backend for investigation
msimberg f07cbad
Enable all structured tests for NCCL backend
msimberg 6e2670c
Skip run_split() for NCCL backend in structured tests
msimberg d2e1036
Skip mixed-architecture exchanges for NCCL backend
msimberg ecbc76d
Remove accidentally committed _nix_build/ directory
msimberg 8147380
Actually remove _nix_build/ from git tracking
msimberg 4a69b3e
Add debug logging to Python unstructured domain descriptor test
msimberg 537e56a
Skip UCX backend tests with thread_safe=false to work around indeterm…
msimberg edb622c
Skip all UCX backend parallel Python tests to work around indetermini…
msimberg fde32ea
Add file-based debug logging to understand UCX hang
msimberg 132ec63
Remove UCX skip workaround to confirm issue still exists
msimberg 5052086
Fix UCX hang by using old code path for UCX/MPI backends, new path fo…
msimberg a7cb0b3
Fix compilation: add this-> prefix to pack_and_send() calls
msimberg a05433a
Add back pack_and_send() methods for UCX/MPI backends
msimberg 8bd5589
Restore old pack() functions that take (Map, Requests, Communicator) …
msimberg bdd4ede
Fix UCX hang by adding explicit cleanup to Python test fixtures
msimberg f937af8
Remove debug logging and add explanatory comments for UCX hang fix
msimberg b435d46
Update README to include NCCL as available transport backend
msimberg
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,11 @@ | ||
| spack: | ||
| specs: | ||
| - ghex@master backend=nccl +cuda cuda_arch=90a +python | ||
| view: false | ||
| concretizer: | ||
| unify: true | ||
| packages: | ||
| oomph: | ||
| require: '@git.3d7f65888e10ba1689257184005b12e53c907738=main' | ||
| package_attributes: | ||
| git: "https://github.com/msimberg/oomph.git" |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -48,3 +48,4 @@ doc_src/_build/ | ||
| __pycache__ | ||
| _build | ||
| .venv*/ | ||
| _nix_build/ | ||
Submodule gridtools
updated
5 files
| +12 −12 | README.md | |
| +1 −1 | docs_src/manuals/getting_started/code/CMakeLists.txt | |
| +0 −76 | include/gridtools/common/demangle.hpp | |
| +2 −2 | include/gridtools/stencil/dump.hpp | |
| +1 −1 | version.txt |
Submodule oomph
updated
45 files
TODO: use commit on develop.