Skip to content

PickleStorage speed up#48

Open
AbhinovKoutharapu wants to merge 6 commits into
nasa:developfrom
AbhinovKoutharapu:pickle_speed_up
Open

PickleStorage speed up#48
AbhinovKoutharapu wants to merge 6 commits into
nasa:developfrom
AbhinovKoutharapu:pickle_speed_up

Conversation

@AbhinovKoutharapu

Copy link
Copy Markdown

Cached the phi_sequence and mut_ratio_sequence for faster load times.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR aims to speed up PickleStorage by caching phi_sequence and mut_ratio_sequence in memory, avoiding repeated full-file deserialization when those sequences are accessed.

Changes:

  • Added in-memory caches for phi_sequence and mut_ratio_sequence in PickleStorage.
  • Replaced the restart/length scan mechanism with _scan_new_records() and updated how record offsets/length are tracked.
  • Updated phi_sequence / mut_ratio_sequence properties to return cached values.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread smcpy/utils/storage.py Outdated
Comment on lines +212 to +218
def save_step(self, step):
file = self._open_file(self._mode)
self._mode = "ab"
pickle.dump(step, file, pickle.HIGHEST_PROTOCOL)
self._close(file)
self._phi_sequence.append(step.attrs["phi"])
self._mut_ratio_sequence.append(step.attrs["mutation_ratio"])
Comment thread smcpy/utils/storage.py Outdated
while True:
start = f.tell()
try:
obj = pickle.load(f)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AbhinovKoutharapu current implementation makes sense to me, but I think we're still going to run into some issues with having to load pickle objects, especially for large numbers of particles. The pickle.load time increases with size of particles. When we use this in PySIPS, those objects can be quite large.

I think my preferred approach would be:

  • On write:
  • In save_step, save the metadata in a separate pickle.dump so that the file structure is interleaved: <metadata 0>, <particles 0>, <metadata 1>, <particles 1>, etc.:
pickle.dump(meta_data_dict, file, pickle.HIGHEST_PROTOCOL)
pickle.dump(step, file, pickle.HIGHEST_PROTOCOL)
  • On read:
  • Use genops() to build the byte offset list without having to load any objects (fast)
  • Create different read behavior depending on whether the function requests the particles objects or whether it just wants metadata (e.g., mutation_ratio_sequence)
  • For the interleaved file structure, this means metadata would be the even indices [0, 2, 4, ...] and particle objects would be the odd indices [1, 3, 5, ...]. To be a bit more explicit, consider two separate lists for each (e.g., meta_byte_offsets and step_byte_offsets).

I think the end result is the ability to load the really small metadata objects lightning fast even for really large particle, large step simulations.

Comment thread smcpy/utils/storage.py Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants