Optimize compute_zdiff_gradp#1272
Draft
msimberg wants to merge 12 commits into
Draft
Conversation
…mpute_zdiff_gradp Replaces O(nedges * nlev^2) nested jk/jk1 search loops with O(nedges * nlev * log(nlev)) np.searchsorted on reversed z_ifc slices. Eliminates per-iteration boolean array allocation and array_ns.where calls. Phase 1: batch searchsorted for all z_me[jk] per edge per side. Phase 2: single searchsorted for z_aux2, applied via boolean mask. Largest grid (exclaim_ch_r04b09_dsl): 2.95s -> 0.93s (3.2x)
…zdiff_gradp Removes the (nedges, 2, nlev) vertidx_gradp allocation and the final vertoffset_gradp subtraction step. Computes vertoffset_gradp offsets directly in the per-edge loop. Initializes vertoffset_gradp as zeros instead of broadcasting jk_field.
- Pre-compute z_ifc_asc = z_ifc[:, ::-1] to avoid per-edge reversal - Extract e2c_0, e2c_1 column arrays to avoid per-edge indexing - Unroll for side in range(2) into explicit side 0 and side 1 - Hoist jk_field_slice and z_me_slice out of side processing
The zdiff_gradp computation is now fast enough (0.92s vs 2.95s baseline on largest grid) that the factory test no longer needs the cpu_only guard. The vwind_impl_wgt speed concern is also resolved.
Replaces 62K per-edge searchsorted calls with 4 batched 2D calls using the row-offset trick. Detects CuPy vs NumPy at runtime. All data stays on-device (no per-iteration D2H transfers). Largest grid (exclaim_ch_r04b09_dsl): 2.95s -> 0.74s (4.0x)
Replaces _get_xp / _cp / _np with the array_ns returned by data_alloc.array_namespace(), which already provides backend- agnostic searchsorted, clip, where, take_along_axis, etc.
- Remove unnecessary z_me_m masking (invalid jk values are masked out by valid_jk after the search) - Remove .copy() from z_ifc_asc (fancy indexing creates new arrays) - Compute fill_high from z_ifc directly (fewer elements than z_ifc_e0) - Remove zdiff_gradp exchange calls from function, delegate to factory via do_exchange=True
The z_me halo exchange inside compute_zdiff_gradp was redundant: z_mc already has valid halo values from the factory dependency chain (Z_MC provider exchanges its halo). Since z_me is computed locally from z_mc[e2c], it inherits correct halo values without an additional exchange. - Remove exchange parameter from compute_zdiff_gradp signature - Remove functools.partial wrapping in factory registration - Remove dims and decomposition imports (no longer needed) - Update unit test call site
Contributor
Author
|
cscs-ci run distributed |
|
Mandatory Tests Please make sure you run these tests via comment before you merge!
Optional Tests To run benchmarks you can use:
To run tests and benchmarks with the DaCe backend you can use:
To run test levels ignored by the default test suite (mostly simple datatest for static fields computations) you can use:
For more detailed information please look at CI in the EXCLAIM universe. |
Contributor
Author
|
cscs-ci run distributed |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
This is a POC optimized mostly by an LLM, not meant for merging as-is.
compute_zdiff_gradpis very slow on GPUs at least in the distributed metrics tests. On CPU the ZDIFF_GRADP/VERTOFFSET_GRADP test runs in a few seconds, on GPU ~300 seconds. These two fields take by far the longest to on the distributed tests (not counting standalone driver).