Skip to content

[newchem-cpp] transcribe lookup_cool_rates0d 1/2#378

Merged
brittonsmith merged 20 commits into
grackle-project:newchem-cppfrom
mabruzzo:gen2024transcribe/lookup_cool_rates0d
Sep 11, 2025
Merged

[newchem-cpp] transcribe lookup_cool_rates0d 1/2#378
brittonsmith merged 20 commits into
grackle-project:newchem-cppfrom
mabruzzo:gen2024transcribe/lookup_cool_rates0d

Conversation

@mabruzzo

Copy link
Copy Markdown
Collaborator

This PR was originally proposed as brittonsmith#31


This PR must be reviewed after #375


This was the step 1 of 2 for transcribing lookup_cool_rates0d.

  • the first 3 commits tweaked the fortran code. Prior to this PR some constant values were passed directly as arguments (in these cases, we used values of 0 or 1). Essentially, we introduced a few local variables to temporarily store these values and then passed the variables as arguments (this overcomes a limitation of the transcription tool)
  • then we performed the initial transcription
  • After doing the initial transcription, and reducing some most work was effort was devoted to refactoring all of the local variable so we could replace the calls to FORTRAN_NAME(cool1d_multi_g), FORTRAN_NAME(lookup_cool_rates1d_g), FORTRAN_NAME(rate_timestep_g) with the C++ wrappers (that use the reduced arg lists)
    • for example, historically the routine stored made a copy of every species mass density (and the internal energy) and stored the copies within local variables. Since all of our C++ wrappers expect this information to be specified within an instance of grackle_field_data, this needed to change:
      • several commits were storing the copies within the pointers specified by an instance of grackle_field_data (distinct from the instance passed to Grackle's API).
      • The local variables previously dedicated to holding the local copies were converted to references to the local copies. This let us avoid breaking the 1200+ lines of logic at the end of the function. Said logic was clearly copied from step_rate_g and then heavily modified. That logic is more directly addressed in the next PR
    • I also replaced >100 local variables with instances of the ColRecRxnRateCollection, PhotoRxnRateCollection, and GrainSpeciesCollection types since the C++ use these types to pass around large bundles of quantities (there was a bunch of search-and-replace done to avoid breaking logic at the end of the function)
    • I replaced a couple of local variables with an instance of IndexRange
  • I also removed a bunch of unused local variables

The fact that the cool1d_multi_g, lookup_cool_rates1d_g, and rate_timestep_g routines are only called from C++ source files AND the fact they are only called through the C++ wrappers (i.e. grackle::impl::fortran_wrapper::cool1d_multi_g, grackle::impl::fortran_wrapper::lookup_cool_rates1d_g, and grackle::impl::rate_timestep_g) is an important milestone. It means that they can be replaced with transcribed function.


The resulting code in this PR is still very messy. But, I choose to make this PR at this point because it was a point when @ChristopherBignamini could start working from this branch in order to transcribe cool1d_multi_g in parallel with my efforts. In fact, it was originally my intention to break this particular PR into smaller (slightly more logical) pieces, but I prioritized "getting done" in order to stop being a bottleneck for @ChristopherBignamini

Another PR has been posted to further cleanup lookup_cool_rates0d.

…now use the standard GrainSpeciesCollection type.
This includes all collisional and recombination rates the behave
"normally". We now use the standard ColRecRxnRateCollection type.

This involved a lot of search-and-replace
…' into gen2024transcribe/lookup_cool_rates0d
@mabruzzo mabruzzo added the refactor internal reorganization or code simplification with no behavior changes label Aug 20, 2025
@mabruzzo

Copy link
Copy Markdown
Collaborator Author

Ugh, ...

The tests (freefall metal chemistry) are failing when I build this locally on macOS. I have confirmed that the bug is introduced in one of the first 5 commits of this PR... (it is NOT an issue in #375).

Unfortunately, the fact that we merged in both

AFTER I wrote the original PR is making this EXTREMELY annoying to debug.

I'm starting to think that the only way to solve this is to redo the transcription of lookup_cool_rates_1d to track down the underlying issue... (and then once I have a working C++ version, I can come back here and fix things).

@mabruzzo mabruzzo changed the base branch from newchem-cpp to main August 27, 2025 14:56
@mabruzzo mabruzzo changed the base branch from main to newchem-cpp August 27, 2025 14:56
@mabruzzo

mabruzzo commented Sep 1, 2025

Copy link
Copy Markdown
Collaborator Author

@brittonsmith you should move ahead with your review of this PR, and ignore the failing tests on macOS. I'll expand more on this down below.


Looking into the failing tests on macOS has been a giant PITA. The bottom line is there is some kind of undefined behavior. I strongly suspect we're using an uninitialized variable, since that has been a common issue that we've dealt with multiple times with Gen's branch.

Evidence that the code has undefined behavior:

  1. The first clue is the fact that when we transcribe Fortran -> C++ we are getting consistent results linux, but not on macOS1. This suggests that there is some undefined behavior in the transcribed C++ code.
  2. The next clue comes from a bunch of digging that I had done. I started to redo the transcription by working directly off of an up-to-date version of newchem-cpp.
    • As I did this, the error reappeared in re-transcribed version of the C++ code. I confirmed that this isn't error I introduced when resolving merge conflicts...
    • So then I wondered, did we introduce the issue during transcription. I had the brainstorm testing this by calling the transcribed C++ version of ``lookup_cool_rates1dand then immediately calling the Fortran versions oflookup_cool_rates1d`.
      • The plan was to initially comment out the entirety of the C++ version. Then, commit-by-comment I would uncomment parts of the C++ version and comment-out corresponding regions of the Fortran version.
      • Because of how monolithic the logic is, this is somewhat non-trivial, but I got things working...
      • Here is a branch of the code where I did this. The test currently passes in this version of the code.
      • The tests start failing if we comment out this line.
        • The above link shows a line in the fortran version of routine that is responsible for recording the value of
          $$\frac{\partial}{\partial t}\rho_{{\rm H}_2^+}$$
          in the 1D dspdot array. As the code is currently written, the linked line of Fortran code is overwriting the value previously stored by the C++ version of the function (the C++ line that does this can be found here).2
        • Initially, this suggested to me that one of the rate coefficients used to compute
          $$\frac{\partial}{\partial t}\rho_{{\rm H}_2^+}$$
          in the C++ version may not be properly initialized or something. But then, when I started printing out and comparing the exact values (out to 16 significant digits) computed by C++ & Fortran, I was finding that the values were identical
        • While I was doing this, I realized that the undefined behavior is present in the original Fortran code. If we go back to use the value of
          $$\frac{\partial}{\partial t}\rho_{{\rm H}_2^+}$$
          computed by Fortran, and we uncomment this Fortran line (to write the computed derivative value to stdout) the tests will start failing
        • For posterity, you use the following command to run the failing test
          $ pytest -s --answer-dir=./gold-standards-313/nccv4 -k freefall-metal_dust_chemistry_variants-2

To summarize: the undefined behavior triggering the failing test is present in the Fortran logic.3 Since the tests are all passing in CI, I think we should move ahead with this PR, and maybe bump the gold-standard after this PR is merged (I'm fairly confident that the C++ transcription "preserves logic"). We obviously need to track down this undefined behavior, but we have a far better chance of doing that in C++ than in Fortran (there are far fewer variables...). With that said, the "right answer" is probably to try running valgrind...4

Footnotes

  1. This could be multiple things. First, it could be minor differences between gcc & clang. It could also be differences in implementation-defined behavior (presumably linked to malloc) in the standard C library that macOS uses vs the glibc. This issue also seems like the sort of thing that could change between compiler versions and between libc versions...

  2. In other words, commenting out the linked Fortran line means Grackle will use the C++ version of the logic.

  3. Of course, it's plausible that transcription may introduce additional undefined behavior, but that seems somewhat unlikely.

  4. It's been ~8 months since I last ran valgrind on the gen2024 branch (way before we introduced the tests that are now failing)... It's going to take me some effort to figure how to run it again since this is all in a python test (I need to use a manually compiled python version and I need to filter out a bunch of spurious python-related warnings)

@mabruzzo

mabruzzo commented Sep 2, 2025

Copy link
Copy Markdown
Collaborator Author

@brittonsmith, I ran src/python/tests/test_models.py::test_model[freefall-metal_dust_chemistry_variants-2-0] under valgrind (with memcheck) today. I was using my debug build of cpython & a debug build of grackle (the latest version of the newchem-cpp branch1). It takes 2 hours 10 minutes to run this single test under valgrind. And... there valgrind doesn't identify any errors.

It's worth noting that I was running valgrind on linux even though I first encountered the issues on macOS. That's because valgrind isn't compatible with ARM-based Macs.


So, I think this means we should move forward and create gold-standard-nccv5 after this PR is merged. This is for a few reasons:

  1. I feel like I have exhausted our ability to easily debug this problem. And there is 0 chance that we will be able to catch the issue just from reading the code (honestly, we'll have a much better chance after the code has been transcribed and we will no longer have to keep track of 100s of variables).
  2. Importantly, it is only an issue on macOS...
  3. This is a major bottleneck to a lot of other PRs
  4. Finally, as I recall, we never really validated the src/python/tests/test_models.py::test_model[freefall-metal_dust_chemistry_variants-*] tests against "known answers" (they primarily exist to catch changes). Plus these tests iteratively call grackle many, many times (so a minor discrepancies will probably grow). Consequently, it's not obvious to me that the failing test means that we are actually "breaking Grackle"

Footnotes

  1. For posterity, the commit hash is cbb406b

@brittonsmith brittonsmith left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've looked at the differences for the failing tests using the --model-comparison-dir option. Generally speaking, the differences are very small. The freefall model we uses has a variable timestep that accounts for the effective equation of state. Significant differences will result in a different number of timesteps making it impossible to measure a relative difference (we're not interpolating). Therefore, it is very encouraging that this only happens in a few cases and even those I cannot see a difference in any of the plots with my eye.

In the rest of the cases, the differences are quite small. The majority of them (say 8/10 or so) have differences for species densities of the order ~1e-13 with an occasional jump to ~1e-9 for a couple species. In one case, the differences reach ~1e-3, but even then not high enough to result in different timestepping. In this case, the maximum relative difference in the temperature was 2e-3. I didn't see anything systematic. I'm happy to merge this and move on.

@brittonsmith brittonsmith merged commit 486fc7f into grackle-project:newchem-cpp Sep 11, 2025
5 checks passed
@github-project-automation github-project-automation Bot moved this from Awaiting Review to Done in New Chemistry and C++ Transcription Sep 11, 2025
@mabruzzo mabruzzo deleted the gen2024transcribe/lookup_cool_rates0d branch September 11, 2025 14:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

refactor internal reorganization or code simplification with no behavior changes

Projects

Development

Successfully merging this pull request may close these issues.

2 participants