Scheduled Halo Exchange #980
cscs-ci run default
cscs-ci run extra
cscs-ci run default
cscs-ci run dace
cscs-ci run extra
**NOTE:** This commit still follows the old nomenclature, where `None` means default stream. Most likely this will change such that `None` means "not using `schedule_*()` functions" and another singleton is used for the default stream.
cscs-ci run default
- There are now two protocols that describe how to extract the underlying address. They are probably at the wrong location.
- `stream=None` no longer means "default stream" but is now equivalent to "do not use scheduled version".
- To indicate the default stream, the singleton `DefaultStream` is used. The `cupy.cuda.Stream.null` singleton was not used because it would require that `cupy` is present.
- However, using the default stream is still the default behaviour.
cscs-ci run default
cscs-ci run dace
cscs-ci run extra
cscs-ci run default
cscs-ci run dace
cscs-ci run extra
There is a failure in CI; see this test PR: #982
cscs-ci run default
cscs-ci run dace
cscs-ci run default
cscs-ci run distributed
cscs-ci run default
cscs-ci run distributed
class Block:
Suggested change:
- class Block:
+ class BlockType:
to avoid accidentally passing `Block` instead of `BLOCK`. Or alternatively call it `_Block` and use `type[BLOCK]` as annotation? Not sure which option is best, but currently it's just too tempting to pass `Block`...
Actually, we need to make this a proper singleton, otherwise we might have problems if someone calls `Block()`.
Maybe this? @egparedes
    class BlockType:
        _instance = None

        def __new__(cls):
            if cls._instance is None:
                cls._instance = super().__new__(cls)
            return cls._instance

    BLOCK = BlockType()
According to SO this should be the correct way, although the SO answer is way more fancy, but do we need that?
I have implemented it, but improvements are appreciated.
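For reference, a minimal sketch of how the singleton proposed above behaves when someone does call the class again. The `__repr__` and the helper `describe()` are hypothetical additions for illustration only and are not taken from the PR:

```python
from __future__ import annotations

from typing import Final


class BlockType:
    """Sentinel type; every construction yields the same instance."""

    _instance: BlockType | None = None

    def __new__(cls) -> BlockType:
        # Create the instance on first use, then always hand out that one.
        if cls._instance is None:
            cls._instance = super().__new__(cls)
        return cls._instance

    def __repr__(self) -> str:  # hypothetical convenience, not from the PR
        return "BLOCK"


BLOCK: Final = BlockType()


def describe(stream: object) -> str:
    """Hypothetical helper: identity checks stay valid even after BlockType()."""
    return "blocking" if stream is BLOCK else "stream-ordered"
```

With this, `describe(BlockType())` and `describe(BLOCK)` give the same answer, so accidentally constructing a second "instance" is harmless. Passing the class itself (`describe(BlockType)`) is still not caught at runtime, which is what the rename is meant to discourage.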
Co-authored-by: Mikael Simberg <mikael.simberg@iki.fi>
cscs-ci run default
cscs-ci run distributed
cscs-ci run default
cscs-ci run distributed
Mandatory Tests Please make sure you run these tests via comment before you merge!
Optional Tests To run benchmarks you can use:
To run tests and benchmarks with the DaCe backend you can use:
To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:
For more detailed information please look at CI in the EXCLAIM universe.
Request was from an early state. We'll address further cleanup in future PRs.
* main: (29 commits)
  Scheduled Halo Exchange (#980)
  Add missing metrics fields to `test_parallel_grid_manager.py` test (#1114)
  Muphys: Lowering with single precision (#1101)
  Add single-rank lsq pseudoinv factory test (#1099)
  Cleanup Diffusion config (#1060)
  Fortran bindings: fix numpy allocation and cleanups (#1112)
  fix: fix gt4py metrics extractor in the StencilTest benchmarking (#1111)
  py2fgen: don't recompile if unchanged (#1110)
  CI for standalone_driver (#1070)
  Update mpi4py and pymetis groups to make them optional (#1100)
  Bump mshick/add-pr-comment from 2 to 3 (#1109)
  Use inout fields for full_muphys as well (#1108)
  Update GPU configuration for graupel (#1104)
  Move the mask of _q_t_update outside in graupel (#1093)
  Update gt4py to v1.1.7 (#1105)
  cleanup for ugly if condition of single node default in lsq coeffs (#1103)
  Domain decomposition and halo construction (#540)
  Muphys: Add flag to wait for graupel completion (#1095)
  Give each gt4py program a return type hint (#1087)
  Turn data download off for distributed CI (#1092)
  ...
* main:
  Scheduled Halo Exchange (#980)
  Add missing metrics fields to `test_parallel_grid_manager.py` test (#1114)
  Muphys: Lowering with single precision (#1101)
  Add single-rank lsq pseudoinv factory test (#1099)
  Cleanup Diffusion config (#1060)
  Fortran bindings: fix numpy allocation and cleanups (#1112)
  fix: fix gt4py metrics extractor in the StencilTest benchmarking (#1111)
  py2fgen: don't recompile if unchanged (#1110)
  CI for standalone_driver (#1070)
  Update mpi4py and pymetis groups to make them optional (#1100)
  Bump mshick/add-pr-comment from 2 to 3 (#1109)
  Use inout fields for full_muphys as well (#1108)
  Update GPU configuration for graupel (#1104)
  Move the mask of _q_t_update outside in graupel (#1093)
  Update gt4py to v1.1.7 (#1105)
  cleanup for ugly if condition of single node default in lsq coeffs (#1103)
[PR#980](#980) introduced streams into the halo exchanges. For this it also introduced `DEFAULT_STREAM`, which models the default stream and implements the [CUDA Stream Protocol](https://nvidia.github.io/cuda-python/cuda-core/latest/interoperability.html#cuda-stream-protocol). However, the original implementation identified itself as protocol version `1` instead of version `0`. Because of a related bug in [GHEX](ghex-org/GHEX#202), this error was hidden. This PR fixes the Python implementation and also updates GHEX.
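To make the version issue concrete, here is a minimal sketch of a default-stream object exposing `__cuda_stream__()`. It assumes, based on the linked protocol documentation, that the method returns a `(version, handle)` tuple, and it assumes that handle `0` (the CUDA null stream) is what "default stream" means here. The class name `DefaultStreamType` is hypothetical and not necessarily what ICON4Py uses:

```python
from __future__ import annotations


class DefaultStreamType:
    """Hypothetical stand-in for the object behind DEFAULT_STREAM."""

    def __cuda_stream__(self) -> tuple[int, int]:
        # First entry: protocol version, which must be 0 (the bug reported 1).
        # Second entry: stream handle as an integer address; 0 is assumed
        # here to denote the CUDA null (default) stream.
        return (0, 0)


DEFAULT_STREAM = DefaultStreamType()

version, handle = DEFAULT_STREAM.__cuda_stream__()
```

A consumer that checks the version strictly would reject a tuple starting with `1`, which is why the mistake only surfaced once the consuming side was fixed as well.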
This PR introduces the scheduled exchange feature from GHEX into ICON4Py.
This exchange allows the exchange function to be called before all work has been completed, i.e. the exchange will wait until the previous work is done. A similar feature is the "scheduled wait", which allows initiating the receive without the need to wait for its completion.

In addition, this PR also renames the functions related to halo exchange:

- `exchange()` was renamed to `start()`.
- `wait()` was renamed to `finish()` (which might now return before the transfer has fully concluded).
- `exchange_and_wait()` was renamed to `exchange()`.

All of these functions now accept an argument called `stream`, which defaults to `DEFAULT_STREAM`. It indicates how synchronization with the stream should be performed.

In the case of `start()` it means that the actual exchange should not start until all work previously submitted to `stream` has finished. For `finish()` it means that further work submitted to `stream` should not start until the exchange has ended. For `finish()` it is also possible to specify `BLOCK`, which means that `finish()` waits until the transfer has fully finished.

The orchestrator was not updated, but the changes were made in such a way that it continues to work in diffusion, although using the original, blocking behaviour.

Note:
The CI fails for `cscs/extra`, but it also does this for current `main`; see this test PR: #982
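The renamed functions and the `stream` argument can be pictured with a toy model. This is only a sketch of the ordering semantics described in this PR description: `ToyHaloExchange`, its signatures, and the two sentinel classes are hypothetical stand-ins, no data is moved, and the real ICON4Py/GHEX API may differ:

```python
from __future__ import annotations


class _Sentinel:
    """Hypothetical named sentinel used for DEFAULT_STREAM and BLOCK."""

    def __init__(self, name: str) -> None:
        self._name = name

    def __repr__(self) -> str:
        return self._name


DEFAULT_STREAM = _Sentinel("DEFAULT_STREAM")  # stand-in, not the real object
BLOCK = _Sentinel("BLOCK")                    # stand-in, not the real object


class ToyHaloExchange:
    """Records the ordering constraints only; performs no communication."""

    def __init__(self) -> None:
        self.log: list[str] = []

    def start(self, *fields: str, stream: object = DEFAULT_STREAM) -> None:
        # The exchange may begin only after work already queued on `stream`.
        self.log.append(f"start{fields}: after prior work on {stream!r}")

    def finish(self, stream: object = DEFAULT_STREAM) -> None:
        if stream is BLOCK:
            # Original blocking behaviour: return only once the transfer is done.
            self.log.append("finish: host blocks until transfer completes")
        else:
            # Scheduled wait: may return early; later work on `stream` waits.
            self.log.append(f"finish: later work on {stream!r} waits")

    def exchange(self, *fields: str, stream: object = DEFAULT_STREAM) -> None:
        # Formerly exchange_and_wait(): start and finish in one call.
        self.start(*fields, stream=stream)
        self.finish(stream=stream)


halo = ToyHaloExchange()
halo.start("vn", "w")        # formerly exchange()
halo.finish()                # formerly wait(); may return before completion
halo.finish(stream=BLOCK)    # explicit blocking wait
```

The field names `"vn"` and `"w"` are placeholders. The point of the sketch is that only `finish(stream=BLOCK)` guarantees the transfer has completed when the call returns; the other calls merely order the exchange relative to work on the stream.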