CCZ4: Optimize performance on GPUs #172
Conversation
|
This PR modifies the following files which are ignored by .lint-ignore: Please consider removing the corresponding patterns from .lint-ignore so that these files can be linted. |
Cpp-Linter Report
|
| double &out_lapse, Tensor<1, double> &out_shift, | ||
| double &out_Theta, | ||
| Tensor<1, double> &out_Z3) const; | ||
| void interpolate_tp_vars( |
There was a problem hiding this comment.
I don't think we have ported these files yet, so should we include them here?
| { | ||
| matter_term_Gamma += -16.0 * M_PI * m_G_Newton * vars.lapse() * | ||
| h_UU[i][j] * emtensor.j[j]; | ||
| h_UU(i, j) * emtensor.j(j); |
There was a problem hiding this comment.
Seeing this I wonder if we should change j to J in the emtensor... it isn't then very standard notation but I don't really like this j being confused with indexing.
There was a problem hiding this comment.
Maybe we could just use a different index variable here? I think it's always written as emtensor.j(idx) so it's not that ambiguous.
|
|
||
| this->rhs_equation(rhs_cell_data, vars, d1, d2, advec); | ||
| // calculate the vaccuum solution | ||
| this->compute_chi_and_h_ij(ix, iy, iz, rhs_state, state); |
There was a problem hiding this comment.
I guess it depends how we do the scalar example, but maybe we want to split these out too and just add the emtensor_rhs as another kernel?
There was a problem hiding this comment.
I think I want to have the () operator as a kind of wrapper with everything in it and then split out the equations in the scalar field example. Because there is no scalar field example right now, there isn't an example for users to see how to use the fissoned kernels so I want to leave this in there for now.
There was a problem hiding this comment.
I agree that we should split out the equations in the scalar field example when it is ported but I'm also fine with keeping this wrapper for the moment.
I don't think we can have a separate kernel for the emtensor_rhs as it's not independent from the other kernels. Note that the matter terms only come into the RHS for Aij, Theta, Gamma and the gauge so we only need to split it up into two parts (one for Aij, Theta and Gamma and one for the gauge). We should modify the interface here in this PR.
|
Lots of small comments and suggestions for naming but overall it looks great! I like the new Tensor arrays, and all the now consistent () brackets. I'm even happy to lose the D1Vars things. |
1e4dd2c to
5e8f13e
Compare
|
This PR modifies the following files which are ignored by .lint-ignore: Please consider removing the corresponding patterns from .lint-ignore so that these files can be linted. |
5e8f13e to
b3d15fe
Compare
|
This PR modifies the following files which are ignored by .lint-ignore: Please consider removing the corresponding patterns from .lint-ignore so that these files can be linted. |
b3d15fe to
619044a
Compare
|
This PR modifies the following files which are ignored by .lint-ignore: Please consider removing the corresponding patterns from .lint-ignore so that these files can be linted. |
|
As discussed during the telecon on Thursday April 16, the precision requirement of the unit tests could probably be raised to |
619044a to
811b545
Compare
|
This PR modifies the following files which are ignored by .lint-ignore: Please consider removing the corresponding patterns from .lint-ignore so that these files can be linted. |
Note that the functions have been overloaded such that the old way of using the Tensor object is still available.
This will replace instances of the Tensor class with the appropriate ArrayND class where N = 1,2 or 3 in the BinaryBH example, Matter classes and unit tests. It does not replace Tensors in classes that haven't been ported yet.
Previously the CCZ4 RHS was calculated by launching a single ParallelFor region. This commit will split that region into a separate ParallelFor for each term in the CCZ4 equation.
[skip-ci] Specifically, this will join together the calculation of Gamma, Aij, Theta, K into a single ParallelFor instance and chi and h_ij in another. This is to avoid recomputing the Christoffel symbols and Ricci Z which seems to be taking the longest time.
For example, d^2 h_ij/dydx is the same as d^2 h_ij/dxdy so we only keep one copy. The indexing is done in the same way as for the symmetric tensors so only the unique components are referenced.
Also test the effect of symmeterizing the second order derivatives of other variables.
One uses AMReX ND Arrays for both the derivatives and Christoffel symbols. The other uses Tensor objects to store both. Switch between either to test the speedup/slowdown.
This removes the pre-calculation of chris_LLU which was slowing down the computation of ricci_hat.
This will convert the remaining Tensor objects to amrex::Array1D, amrex::Array2D or amrex::Array3D objects thus making the code internally consistent.
This will bring up to date CCZ4RHS, CCZ4Geometry and FourthOrderDerivatives for the new macros, and Array4 dimension indexes introduced in b0c7edf.
This should remove the remaining Tensor objects used in the calculation of the RHS.
Convert direct references to cell_data values to use CCZ4Vars wrapper instead.
In previous commits, both the Tensor class and AMReX arrays were available, now only AMReX arrays are used.
This commit will propagate the changes from the BinaryBH/CCZ4 equations to the Matter classes. Note that applying the gauge and the KO dissipation are now split into two different functions because these are split in the Matter RHS (gauge first then dissipation).
Instead of writing e.g. amrex::Array1D<amrex::Real, 0, 3> we can now write TensorArray::Rank1 and so on for the other ranks. Also this will remove all remaining traces of the Tensor class, including the tests and the derivatives.
There was a problem hiding this comment.
Thanks for this PR, Juliana and sorry for the large number of comments. I think most of them should be straightforward to fix.
One comment I want to highlight is that I think the tensor arguments of amrex::ArrayND are the inclusive bounds (including the upper bound) so maybe it should be e.g. amrex::Array1D<amrex::Real, 0, AMREX_SPACEDIM - 1> instead of AMREX_SPACEDIM in the third argument. I added this comment specifically in Tensor.hpp but these types are used elsewhere not via TensorArray aliases.
| amrex::AnyCTO( | ||
| amrex::TypeList<amrex::CompileTimeOptions<USE_BSSN, USE_CCZ4>>{}, | ||
| {simParams().formulation}, | ||
| [&](auto cto_func) { amrex::ParallelFor(a_rhs, cto_func); }, |
There was a problem hiding this comment.
Maybe it would be worth adding a brief comment explaining the point of using AnyCTO?
| enum formulations : int | ||
| { | ||
| USE_CCZ4 = 0, | ||
| USE_BSSN = 1 | ||
| }; |
There was a problem hiding this comment.
Is it necessary to define this here? Can you use CCZ4RHS::USE_CCZ4 and CCZ4RHS::USE_BSSN below? If that doesn't work, maybe just move this enum to CCZ4RHS.hpp and remove the enum defined within the CCZ4RHS class definition.
| enum covariantZ4 : int | ||
| { | ||
| YES, | ||
| NO | ||
| }; |
There was a problem hiding this comment.
Similarly, it would be better if this could be moved to the CCZ4RHS class.
| {use_bssn, use_covariantZ4}, | ||
| [&](auto cto_func) { amrex::ParallelFor(a_rhs, cto_func); }, | ||
| [=] AMREX_GPU_DEVICE(int box_no, int ix, int iy, int iz, | ||
| auto formulation, auto control) |
There was a problem hiding this comment.
Let's make these names consistent
| auto formulation, auto control) | |
| auto formulation, auto covariant_Z4) |
|
|
||
| amrex::ParmParse pp; | ||
|
|
||
| int use_bssn{0}; |
There was a problem hiding this comment.
Can you rename this to formulation or something similar rather than one of the formulation options?
There was a problem hiding this comment.
Here's somewhere you can add split matter RHS kernels!
There was a problem hiding this comment.
This will be a pain as add_emtensor_rhs is a protected member of CCZ4RHSWithMatter. I would prefer to use the () operator and then split the kernels in an example. I think the same thing should be done with the CCZ4RHSTest by the way for consistency.
There was a problem hiding this comment.
Could you also update the Mathematica script so that it generates in this format?
| amrex::Real chi = 0.0; | ||
| Tensor<1, amrex::Real> Gamma; | ||
| Tensor<2, amrex::Real> h; | ||
| TensorArray::Rank1 Gamma{}; |
There was a problem hiding this comment.
(Whole file comment) Can we make sure we're testing everything in CCZ4Geometry? I haven't cross-checked but I don't think this test has been changed much since all the changes there.
| current_ccz4_rhs.compute_chi_and_h_ij( | ||
| ix, iy, iz, current_out_array, in_c_array); | ||
| current_ccz4_rhs.compute_A_ij_and_Theta_and_Gamma< | ||
| formulation, use_covariantZ4>(ix, iy, iz, current_out_array, | ||
| in_c_array); | ||
| current_ccz4_rhs.apply_gauge(ix, iy, iz, current_out_array, | ||
| in_c_array); | ||
| current_ccz4_rhs.apply_dissipation( | ||
| ix, iy, iz, current_out_array, in_c_array); |
There was a problem hiding this comment.
Can you split them out into separate kernel invocations as you've done in the BinaryBH example?
There was a problem hiding this comment.
Can we add in tests for the other functions in FourthOrderDerivatives e.g. diff1_vector, diff1_tensor, etc?
|
FYI Since I am a bit swamped at the moment I have asked @SamuelBrady to do a more detailed review from the user perspective, and I'll just have another quick look once everything is final. |
SamuelBrady
left a comment
There was a problem hiding this comment.
Looks good and seems clear overall, just left a few new comments.
The main things that I think could cause confusion are the naming of Rank1Sym and Rank2Sym (which are rank 2 and rank 4 tensors), and knowing when to use symmetric vs regular tensors (which only makes sense once you know how they're stored). I think we could also have a clearer system for when we pass CellData vs Vars to other objects.
| #define VAR_IDX0(i, j) VAR_IDX(0, (i), (j)) | ||
|
|
||
| constexpr operator arr_t &() { return arr; } | ||
| #define SPACETIME_DIM GR_SPACEDIM + 1 |
There was a problem hiding this comment.
| #define SPACETIME_DIM GR_SPACEDIM + 1 | |
| #define SPACETIMEDIM GR_SPACEDIM + 1 |
There was a problem hiding this comment.
I think SPACETIMEDIM is a bit hard to read because of the all caps...I think it has to be SPACETIME_DIM or SPACE_TIME_DIM.
| constexpr operator const arr_t &() const { return arr; } | ||
| }; | ||
| // Number of unique indices after accounting for symmetry | ||
| #define UNIQUE_IDX 6 |
There was a problem hiding this comment.
Also important to define in terms of AMREX_SPACEDIM if it's going to be used in loops, as suggested in other comments
|
|
||
| // Apply gauage (no derivatives needed here!) | ||
| AMREX_GPU_DEVICE AMREX_FORCE_INLINE void | ||
| apply_gauge(int ix, int iy, int iz, |
There was a problem hiding this comment.
I'm not sure about the name apply_gauge, since the gauge itself is also being updated. Maybe something like include_gauge would be clearer
There was a problem hiding this comment.
Is this the rhs for the gauge? If so I would call it calculate_gauge_rhs() or something like that. I see the comment that no derivatives are needed, but that may not always be the case, so we may need to generalise it (but this can be done in another PR as commented elsewhere).
| d2(icomp, 0) = diff2(var_ptr, strides[0]); | ||
| d2(icomp, 3) = diff2(var_ptr, strides[1]); | ||
| d2(icomp, 5) = diff2(var_ptr, strides[2]); | ||
|
|
||
| d2(icomp, 1) = mixed_diff2(var_ptr, strides[0], strides[1]); | ||
| d2(icomp, 2) = mixed_diff2(var_ptr, strides[0], strides[2]); | ||
| d2(icomp, 4) = mixed_diff2(var_ptr, strides[1], strides[2]); |
There was a problem hiding this comment.
Could be clearer to use d2(icomp, VAR_IDX0(...)) here (and in other similar blocks), assuming it will be resolved at compile time
This will restore the original naming convention for the functions in FourthOrderDerivatives, e.g. diff1_scalar, diff2_tensor It also removes some of the test functions that I had previously written to test the performance. It will also replace some magic numbers: 3 -> AMREX_SPACEDIM
This will remove the branching in the RHS that depends on which formulation (CCZ4 vs BSSN) and whether or not a covariant Z4 is selected at runtime. It uses AMReX's CTO for any function (not just ParallelFor). The () operator of CCZ4RHSWithMatter is now also templated over the formulation number and the usage of covariant Z4.
The CCZ4RHS unit test fails at the level of ~1e-12 but this is dependent on the compiler e.g. ok for Intel oneAPI 2025.3.1 but fails for GNU and LLVM. Also, I think the error tolerance has been set arbitarily for the other tests also e.g. CCZ4GeometryTest is set to 1e-14 but BSSNMatterTest is 1e-10, even though the latter is also a test of the RHS.
This introduces a new namespace SpaceTimeTensor in which the dimensions are 0 - GR_SPACEDIM+1. I've also cleaned up some of the naming so the number of unique entries in a 3 x 3 symmetric array is now called UNIQUE_IDX.
I was allocating too many elements by 1 because the array limits were inclusive, so 0 -> AMREX_SPACEDIM is now 0 -> AMREX_SPACEDIM-1
This will consolidate changes calls to CCZ4Geometry functions to only use the CCZ4Vars wrapper (instead of CellData) and remove unused overloaded functions. As a result, usage of CCZ4D1Vars in Weyl4 and Constraints were removed.
This swaps the remaining references to cell_data in favour of the CCZ4Vars wrapper.
This will allow passing the wrappers from the ScalarField class, ScalarFieldD1Vars, ScalarFieldD2Vars, ScalarFieldAdvecVars to enable (future) support for multiple scalar fields. To discourage users from using the CCZ4 derivative wrappers, the ScalarField class no longer inherits from the CCZ4 wrappers and these wrappers have been deleted. This will also update the way tensors are referenced in the CCZ4Geometry Unit Test, the BSSNMatter Unit Test and Wey4WithMatter Unit Test.
811b545 to
0d364a0
Compare
|
This PR modifies the following files which are ignored by .lint-ignore: Please consider removing the corresponding patterns from .lint-ignore so that these files can be linted. |
I've changed the function signatures to allow for passing the necessary components to calculate the derivatives inside the Matter RHS functions.
|
This PR modifies the following files which are ignored by .lint-ignore: Please consider removing the corresponding patterns from .lint-ignore so that these files can be linted. |
This PR builds on top of the recent changes due to refactoring the storing and loading of the data container
vars.Major changes:
tensorclass has been removed in favour of AMReXArraysbecause the latter are stored as 1D objects (even though the user interface supports Fortran multidimensional indexing) which is more performant on GPUs. Also each derivative was being stored as a tensor object of tensors so this switch should reduce the memory footprint as well. This closes Refactor Tensor class #132.Still to do:
amrex::Arrays to all parts of the code: I have writtenTensorArray::Rank1foramrex::Array1D<amrex::Real, 0, 3>but this needs to be done consistently.ifstatements for switching between BSSN and CCZ4 as requested (closes Remove branching with BSSN versus CCZ4 #145)