Skip to content

[newchem-cpp] Improve implementation of FrozenStringIdxBiMap#484

Merged
brittonsmith merged 10 commits into
grackle-project:newchem-cppfrom
mabruzzo:ncc/better-FrozenKeyIdxBiMap
Apr 3, 2026
Merged

[newchem-cpp] Improve implementation of FrozenStringIdxBiMap#484
brittonsmith merged 10 commits into
grackle-project:newchem-cppfrom
mabruzzo:ncc/better-FrozenKeyIdxBiMap

Conversation

@mabruzzo

@mabruzzo mabruzzo commented Jan 6, 2026

Copy link
Copy Markdown
Collaborator

This should be reviewed after #521 has been merged


Background Context

Recall that PR #451 introduced the FrozenStringIdxBiMap. It is a datatype that implements a bidirectional map (aka a bidirectional dictionary), that can be used to map between a unique set of n strings (keys) and a unique set of indexes (with values of 0 through n-1) and vice-versa. In PR #451 I intentionally adopted an overly-simplistic implementation for the datatype (i.e. it just did linear searches over the full array) to make PR easier to review.

Description

This PR totally changes the underlying implementation to improve efficiency. These changes have NO IMPACT on the interface. Under the hood, FrozenStringIdxBiMap is now implemented in terms of a Hash Table.

On a historical note, this PR effectively supersedes #270 (which originally proposed the same implementation)

Tradeoffs:

Benefit

This is clear-cut: average time complexity to search for a key goes from Θ(n) (linear) to Θ(1) (constant)

Cost

The storage allocated from the heap is larger. Suppose s is the average number of bytes needed to track a string allocation and n keys in a bimap:

  • the old approach only allocated n*(8 + s) bytes (we just allocate n pointers)
  • the new approach requires n*(2+16*k + s) bytes
    • k * n * 16 bytes for each row of the hash table and 2*n bytes (aka n std::uint16_t values) for rapid reverse lookup
    • k is the ratio between capacity & number of keys (a larger k reduces collisions). Currently, it's 2 (but we can reduce it!)

Let's throw out some concrete numbers. Suppose we have 100 items (frankly, that seems a pretty high) and the average length is ~15 bytes

Old Approach New Approach
reference string literals (i.e. s=0) 800 bytes 3400 bytes
copy each string (i.e. s=15) 2300 bytes 4900 bytes

The fact that we are talking about less than 5 killobytes makes me think this cost isn't terribly significant (and, if it's a concern, we could try reducing k from 2 to 1.5)

  • for 100 items (which seems a little high) where we just reference memory in string literals (s=0) we are talking about 800 bytes vs 3400 bytes
  • for 100 items, where we take ownership of the strings and strings have an average length of ~15 bytes we are talking about 2300 bytes vs 4900 bytes

In terms of absolute numbers, 5kB clearly isn't a concern (a linux page size is 4 kB). And if we ever refactor enough and use this in all the places I think it could be useful (e.g. using it to provide map-access to chemistry parameters or grackle fields, and for one or two cases to help with setup), we are talking about maybe creating 5 instances while initializing grackle (and most would promptly get de-allocated)

Other thoughts

Since this just slots into the interface defined by #451, there isn't a ton of urgency to this PR

@brittonsmith

Copy link
Copy Markdown
Contributor

@mabruzzo, could you fix the merger conflicts? I wasn't sure about the one in tests/unit/CMakeLists.txt.

@mabruzzo

mabruzzo commented Apr 2, 2026

Copy link
Copy Markdown
Collaborator Author

@mabruzzo, could you fix the merger conflicts? I wasn't sure about the one in tests/unit/CMakeLists.txt.

Fixed! But you should merge #521, first.

@brittonsmith brittonsmith changed the base branch from newchem-cpp to main April 2, 2026 13:42
@brittonsmith brittonsmith changed the base branch from main to newchem-cpp April 2, 2026 13:42
@brittonsmith brittonsmith merged commit ec29d3a into grackle-project:newchem-cpp Apr 3, 2026
5 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Development

Successfully merging this pull request may close these issues.

2 participants