[newchem-cpp] transcribe `lookup_cool_rates0d` 1/2 by mabruzzo · Pull Request #378 · grackle-project/grackle

mabruzzo · 2025-08-20T01:50:25Z

This PR was originally proposed as brittonsmith#31

This PR must be reviewed after #375

This was the step 1 of 2 for transcribing lookup_cool_rates0d.

the first 3 commits tweaked the fortran code. Prior to this PR some constant values were passed directly as arguments (in these cases, we used values of 0 or 1). Essentially, we introduced a few local variables to temporarily store these values and then passed the variables as arguments (this overcomes a limitation of the transcription tool)
then we performed the initial transcription
After doing the initial transcription, and reducing some most work was effort was devoted to refactoring all of the local variable so we could replace the calls to FORTRAN_NAME(cool1d_multi_g), FORTRAN_NAME(lookup_cool_rates1d_g), FORTRAN_NAME(rate_timestep_g) with the C++ wrappers (that use the reduced arg lists)
- for example, historically the routine stored made a copy of every species mass density (and the internal energy) and stored the copies within local variables. Since all of our C++ wrappers expect this information to be specified within an instance of grackle_field_data, this needed to change:
  - several commits were storing the copies within the pointers specified by an instance of grackle_field_data (distinct from the instance passed to Grackle's API).
  - The local variables previously dedicated to holding the local copies were converted to references to the local copies. This let us avoid breaking the 1200+ lines of logic at the end of the function. Said logic was clearly copied from step_rate_g and then heavily modified. That logic is more directly addressed in the next PR
- I also replaced >100 local variables with instances of the ColRecRxnRateCollection, PhotoRxnRateCollection, and GrainSpeciesCollection types since the C++ use these types to pass around large bundles of quantities (there was a bunch of search-and-replace done to avoid breaking logic at the end of the function)
- I replaced a couple of local variables with an instance of IndexRange
I also removed a bunch of unused local variables

The fact that the cool1d_multi_g, lookup_cool_rates1d_g, and rate_timestep_g routines are only called from C++ source files AND the fact they are only called through the C++ wrappers (i.e. grackle::impl::fortran_wrapper::cool1d_multi_g, grackle::impl::fortran_wrapper::lookup_cool_rates1d_g, and grackle::impl::rate_timestep_g) is an important milestone. It means that they can be replaced with transcribed function.

The resulting code in this PR is still very messy. But, I choose to make this PR at this point because it was a point when @ChristopherBignamini could start working from this branch in order to transcribe cool1d_multi_g in parallel with my efforts. In fact, it was originally my intention to break this particular PR into smaller (slightly more logical) pieces, but I prioritized "getting done" in order to stop being a bottleneck for @ChristopherBignamini

Another PR has been posted to further cleanup lookup_cool_rates0d.

…transcription

… for transcription

…now use the standard GrainSpeciesCollection type.

…teCollection type.

This includes all collisional and recombination rates the behave "normally". We now use the standard ColRecRxnRateCollection type. This involved a lot of search-and-replace

…N_NAME with wrappers

…with calls to C++ wrappers

…ld be addressed

…' into gen2024transcribe/lookup_cool_rates0d

mabruzzo · 2025-08-27T13:24:09Z

Ugh, ...

The tests (freefall metal chemistry) are failing when I build this locally on macOS. I have confirmed that the bug is introduced in one of the first 5 commits of this PR... (it is NOT an issue in #375).

Unfortunately, the fact that we merged in both

the fix to related to the iter argument that lookup_cool_rates_0d passes to cool_1d_multi_g
H2 self shielding length scale factor fix #325 (the fix to lookup_cool_rates_1d, which is called within lookup_cool_rates_0d)

AFTER I wrote the original PR is making this EXTREMELY annoying to debug.

I'm starting to think that the only way to solve this is to redo the transcription of lookup_cool_rates_1d to track down the underlying issue... (and then once I have a working C++ version, I can come back here and fix things).

mabruzzo · 2025-09-01T20:20:11Z

@brittonsmith you should move ahead with your review of this PR, and ignore the failing tests on macOS. I'll expand more on this down below.

Looking into the failing tests on macOS has been a giant PITA. The bottom line is there is some kind of undefined behavior. I strongly suspect we're using an uninitialized variable, since that has been a common issue that we've dealt with multiple times with Gen's branch.

Evidence that the code has undefined behavior:

The first clue is the fact that when we transcribe Fortran -> C++ we are getting consistent results linux, but not on macOS¹. This suggests that there is some undefined behavior in the transcribed C++ code.
The next clue comes from a bunch of digging that I had done. I started to redo the transcription by working directly off of an up-to-date version of newchem-cpp.
- As I did this, the error reappeared in re-transcribed version of the C++ code. I confirmed that this isn't error I introduced when resolving merge conflicts...
- So then I wondered, did we introduce the issue during transcription. I had the brainstorm testing this by calling the transcribed C++ version of ``lookup_cool_rates1dand then immediately calling the Fortran versions oflookup_cool_rates1d`.
  - The plan was to initially comment out the entirety of the C++ version. Then, commit-by-comment I would uncomment parts of the C++ version and comment-out corresponding regions of the Fortran version.
  - Because of how monolithic the logic is, this is somewhat non-trivial, but I got things working...
  - Here is a branch of the code where I did this. The test currently passes in this version of the code.
  - The tests start failing if we comment out this line.
    - The above link shows a line in the fortran version of routine that is responsible for recording the value of
      $$\frac{\partial}{\partial t}\rho_{{\rm H}_2^+}$$
      in the 1D dspdot array. As the code is currently written, the linked line of Fortran code is overwriting the value previously stored by the C++ version of the function (the C++ line that does this can be found here).²
    - Initially, this suggested to me that one of the rate coefficients used to compute
      $$\frac{\partial}{\partial t}\rho_{{\rm H}_2^+}$$
      in the C++ version may not be properly initialized or something. But then, when I started printing out and comparing the exact values (out to 16 significant digits) computed by C++ & Fortran, I was finding that the values were identical
    - While I was doing this, I realized that the undefined behavior is present in the original Fortran code. If we go back to use the value of
      $$\frac{\partial}{\partial t}\rho_{{\rm H}_2^+}$$
      computed by Fortran, and we uncomment this Fortran line (to write the computed derivative value to stdout) the tests will start failing
    - For posterity, you use the following command to run the failing test
      $ pytest -s --answer-dir=./gold-standards-313/nccv4 -k freefall-metal_dust_chemistry_variants-2

To summarize: the undefined behavior triggering the failing test is present in the Fortran logic.³ Since the tests are all passing in CI, I think we should move ahead with this PR, and maybe bump the gold-standard after this PR is merged (I'm fairly confident that the C++ transcription "preserves logic"). We obviously need to track down this undefined behavior, but we have a far better chance of doing that in C++ than in Fortran (there are far fewer variables...). With that said, the "right answer" is probably to try running valgrind...⁴

This could be multiple things. First, it could be minor differences between gcc & clang. It could also be differences in implementation-defined behavior (presumably linked to malloc) in the standard C library that macOS uses vs the glibc. This issue also seems like the sort of thing that could change between compiler versions and between libc versions... ↩
In other words, commenting out the linked Fortran line means Grackle will use the C++ version of the logic. ↩
Of course, it's plausible that transcription may introduce additional undefined behavior, but that seems somewhat unlikely. ↩
It's been ~8 months since I last ran valgrind on the gen2024 branch (way before we introduced the tests that are now failing)... It's going to take me some effort to figure how to run it again since this is all in a python test (I need to use a manually compiled python version and I need to filter out a bunch of spurious python-related warnings) ↩

mabruzzo · 2025-09-02T19:18:36Z

@brittonsmith, I ran src/python/tests/test_models.py::test_model[freefall-metal_dust_chemistry_variants-2-0] under valgrind (with memcheck) today. I was using my debug build of cpython & a debug build of grackle (the latest version of the newchem-cpp branch¹). It takes 2 hours 10 minutes to run this single test under valgrind. And... there valgrind doesn't identify any errors.

It's worth noting that I was running valgrind on linux even though I first encountered the issues on macOS. That's because valgrind isn't compatible with ARM-based Macs.

So, I think this means we should move forward and create gold-standard-nccv5 after this PR is merged. This is for a few reasons:

I feel like I have exhausted our ability to easily debug this problem. And there is 0 chance that we will be able to catch the issue just from reading the code (honestly, we'll have a much better chance after the code has been transcribed and we will no longer have to keep track of 100s of variables).
Importantly, it is only an issue on macOS...
This is a major bottleneck to a lot of other PRs
Finally, as I recall, we never really validated the src/python/tests/test_models.py::test_model[freefall-metal_dust_chemistry_variants-*] tests against "known answers" (they primarily exist to catch changes). Plus these tests iteratively call grackle many, many times (so a minor discrepancies will probably grow). Consequently, it's not obvious to me that the failing test means that we are actually "breaking Grackle"

For posterity, the commit hash is cbb406b ↩

brittonsmith

I've looked at the differences for the failing tests using the --model-comparison-dir option. Generally speaking, the differences are very small. The freefall model we uses has a variable timestep that accounts for the effective equation of state. Significant differences will result in a different number of timesteps making it impossible to measure a relative difference (we're not interpolating). Therefore, it is very encouraging that this only happens in a few cases and even those I cannot see a difference in any of the plots with my eye.

In the rest of the cases, the differences are quite small. The majority of them (say 8/10 or so) have differences for species densities of the order ~1e-13 with an occasional jump to ~1e-9 for a couple species. In one case, the differences reach ~1e-3, but even then not high enough to result in different timestepping. In this case, the maximum relative difference in the temperature was 2e-3. I didn't see anything systematic. I'm happy to merge this and move on.

mabruzzo added 18 commits February 12, 2025 09:10

lookup_cool_rates0d: tweak (1/3) to aid transcription

90defda

lookup_cool_rates0d: tweak (2/3) to aid transcription

07993a7

lookup_cool_rates0d: tweak (3/3) to aid transcription

ccc44c3

lookup_cool_rates0d: initial transcription attempt

58edfe0

removing lookup_rates_cool0d.F from the CMakeLists.txt file

ffbff34

lookup_cool_rates0d: some ground work for cleanup

f045458

looup_cool_rates0d: redefine local vars as refs to pack.fields entries

058d94a

lookup_cool_rates0d: removed some local variables defined purely for …

ba77aeb

…transcription

lookup_cool_rates0d: removed remaining local variables defined purely…

cd094b3

… for transcription

time_deriv_0d: remove a lot of unused temporary variables

ce962f5

lookup_cool_rates0d aggregate all occurences of kd<dust-species>. We …

9972c5e

…now use the standard GrainSpeciesCollection type.

lookup_cool_rates0d: aggregate kddshield into the standard PhotoRxnRa…

9743904

…teCollection type.

lookup_cool_rates0d aggregate all "normal" rxn rates

6fedd90

This includes all collisional and recombination rates the behave "normally". We now use the standard ColRecRxnRateCollection type. This involved a lot of search-and-replace

lookup_cool_rates0d: some additional cleanup before we replace FORTRA…

878e0e7

…N_NAME with wrappers

lookup_cool_rates0d: use fortran_wrapper::cool1d_multi_g

9181ef7

lookup_cool_rates0d: replace remaing invocations of fortran routines …

8882678

…with calls to C++ wrappers

lookup_cool_rates0d: record a todo list of remaining things that shou…

c398ab7

…ld be addressed

Merge branch 'gen2024transcribe/step_rate_newton_raphson-cleanup-1of2…

1e224e6

…' into gen2024transcribe/lookup_cool_rates0d

mabruzzo added this to New Chemistry and C++ Transcription Aug 20, 2025

mabruzzo added the refactor internal reorganization or code simplification with no behavior changes label Aug 20, 2025

mabruzzo mentioned this pull request Aug 20, 2025

transcribe lookup_cool_rates0d 1/2 brittonsmith/grackle#31

Closed

mabruzzo added 2 commits August 19, 2025 21:57

address compiler warning (rm unused vars)

1e93a10

apply clang-format

50dbb00

mabruzzo mentioned this pull request Aug 20, 2025

[newchem-cpp] lookup_cool_rates0d 2/2 #379

Merged

mabruzzo moved this to Awaiting Review in New Chemistry and C++ Transcription Aug 20, 2025

mabruzzo changed the base branch from newchem-cpp to main August 27, 2025 14:56

mabruzzo changed the base branch from main to newchem-cpp August 27, 2025 14:56

mabruzzo mentioned this pull request Sep 2, 2025

[newchem-cpp] Organize collisional reaction rates into an indexable table #411

Merged

brittonsmith approved these changes Sep 11, 2025

View reviewed changes

brittonsmith merged commit 486fc7f into grackle-project:newchem-cpp Sep 11, 2025
5 checks passed

github-project-automation Bot moved this from Awaiting Review to Done in New Chemistry and C++ Transcription Sep 11, 2025

mabruzzo deleted the gen2024transcribe/lookup_cool_rates0d branch September 11, 2025 14:15

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

[newchem-cpp] transcribe `lookup_cool_rates0d` 1/2#378

[newchem-cpp] transcribe `lookup_cool_rates0d` 1/2#378
brittonsmith merged 20 commits into
grackle-project:newchem-cppfrom
mabruzzo:gen2024transcribe/lookup_cool_rates0d

mabruzzo commented Aug 20, 2025

Uh oh!

mabruzzo commented Aug 27, 2025

Uh oh!

mabruzzo commented Sep 1, 2025

Uh oh!

mabruzzo commented Sep 2, 2025

Uh oh!

brittonsmith left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

mabruzzo commented Aug 20, 2025

Uh oh!

mabruzzo commented Aug 27, 2025

Uh oh!

mabruzzo commented Sep 1, 2025

Footnotes

Uh oh!

mabruzzo commented Sep 2, 2025

Footnotes

Uh oh!

brittonsmith left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants