Update core utils: paths config, SMILES encoder, data processing and …#31

Open
T4ras123 wants to merge 1 commit into main from pr/core-utils

Conversation

@T4ras123

Contributor

This pull request introduces two main sets of changes: (1) improvements to the robustness of path resolution in the configuration code, and (2) significant updates and expansions to the configuration files, especially paths.yaml, to support new data locations, datasets, and model checkpoints. Additionally, there are minor cleanups and docstring removals in the smiles_encoder_decoder.py module.

Configuration and Path Handling Improvements

  • Improved error handling in path resolution functions in paths.py to gracefully skip over unreadable or inaccessible directories, preventing crashes when encountering permission issues on shared or mounted filesystems. [1] [2] [3]
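As a rough illustration of the skip-on-error behaviour described above, here is a minimal sketch. It is not the code from paths.py: the names `resolve_first` and `_is_readable_dir` and the `probe` parameter are hypothetical, and the only assumption is that candidate directories are tried in order and any `OSError` (which includes `PermissionError`) means "move on to the next one".

```python
import os
from pathlib import Path
from typing import Callable, Iterable, Optional


def _is_readable_dir(path: Path) -> bool:
    """Return True if `path` is a directory the current user can list."""
    return path.is_dir() and os.access(path, os.R_OK | os.X_OK)


def resolve_first(
    candidates: Iterable[str],
    probe: Callable[[Path], bool] = _is_readable_dir,
) -> Optional[Path]:
    """Return the first usable candidate, skipping any that raise OSError.

    Hypothetical helper: `probe` is injectable so the behaviour can be
    exercised without a real permission-restricted mount.
    """
    for raw in candidates:
        path = Path(raw).expanduser()
        try:
            if probe(path):
                return path
        except OSError:
            # PermissionError, stale network mounts and similar failures are
            # treated as "not available" rather than crashing the caller.
            continue
    return None
```

Returning `None` instead of raising leaves the decision about a missing root to the caller, which is one possible design; the actual functions in paths.py may behave differently.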

Configuration Expansion and Dataset Support

  • Substantial expansion of paths.yaml:
    • Added new prioritized paths for various data and checkpoint roots, especially under /mnt/weka/vtarasov and /data/vtarasov, to support additional storage backends and user-specific mounts.
    • Added new dataset entries for "revisited" and "big_data" datasets, including multiple binned and grouped variants for train/valid/test splits, and new tokenizer paths. [1] [2]
    • Introduced new model checkpoint entries for various training scenarios, including revisited and big data finetuned models, with corresponding training step mappings.
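To make the idea of prioritized roots concrete, the sketch below shows one way a parsed paths.yaml could be consulted. The dictionary layout, the key names (`data_roots`, `datasets`), the relative paths, and the `dataset_path` helper are all hypothetical and only illustrate the lookup order; only the two root directories are taken from the description above.

```python
from pathlib import Path
from typing import Callable, Mapping, Optional

# Hypothetical shape of paths.yaml after parsing; key names and relative
# paths are illustrative, not the real file contents.
CONFIG = {
    "data_roots": ["/mnt/weka/vtarasov", "/data/vtarasov"],  # highest priority first
    "datasets": {
        "revisited": {
            "train": "revisited/train",
            "valid": "revisited/valid",
            "test": "revisited/test",
        },
    },
}


def dataset_path(
    config: Mapping,
    dataset: str,
    split: str,
    exists: Callable[[Path], bool] = Path.exists,
) -> Optional[Path]:
    """Return the dataset split under the first root where it exists."""
    relative = config["datasets"][dataset][split]
    for root in config["data_roots"]:
        candidate = Path(root) / relative
        if exists(candidate):
            return candidate
    return None
```

The `exists` argument is injectable so the lookup order can be checked without the real mounts, for example `dataset_path(CONFIG, "revisited", "valid", exists=lambda p: p.as_posix().startswith("/data/"))`.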

Codebase Cleanup

  • Removed or condensed redundant docstrings and comments from several functions in smiles_encoder_decoder.py, making the code more concise without altering functionality. [1] [2] [3] [4] [5] [6] [7] [8] [9] [10]
  • Minor import and typing improvements in smiles_encoder_decoder.py to support new features and improve code quality.

These changes together enhance the flexibility and robustness of the codebase for handling diverse storage environments and new datasets, while also simplifying and cleaning up the code.

…eval utils
