Update shared path, serialization, and evaluation helpers #30

Open
T4ras123 wants to merge 1 commit into main from paths-eval-utils-runner
T4ras123 (Contributor) commented Apr 6, 2026

This pull request improves the SMILES encoder/decoder module with more robust path handling, richer coordinate binning, and minor bug fixes. The main additions are a flexible binning configuration class, functions for encoding and decoding molecules with these bin configurations, and better error handling for file system paths.

Path handling robustness:

  • Improved error handling in _resolve_path_value and _base_candidate_values so that unreadable or inaccessible file system paths are skipped. This prevents crashes from OSError when traversing shared or restricted mounts.
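A minimal sketch of the skip-on-OSError pattern described above. The function name `first_readable_path` and its `probe` parameter are hypothetical and are not taken from the PR; the actual `_resolve_path_value` and `_base_candidate_values` helpers may be structured differently.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional


def first_readable_path(
    candidates: Iterable[str],
    probe: Callable[[Path], bool] = Path.exists,
) -> Optional[Path]:
    """Return the first candidate accepted by `probe`, skipping unreadable paths."""
    for raw in candidates:
        path = Path(raw).expanduser()
        try:
            if probe(path):
                return path
        except OSError:
            # Restricted or stale mounts can raise (PermissionError is an
            # OSError subclass); skip this candidate and keep searching.
            continue
    return None
```

Returning `None` when nothing matches leaves the choice of fallback or error message to the caller.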

Coordinate binning enhancements:

  • Added a new BinConfig dataclass that encapsulates the binning configuration, with methods for saving and loading configurations and for calculating digit widths. This makes binning strategies for molecular coordinate encoding reusable.
  • Implemented fit_uniform_bins and fit_quantile_bins, which generate bin edges with a uniform or a quantile-based strategy, so bins can be fixed in advance or fitted to the data.
  • Added encode_cartesian_with_config and decode_cartesian_with_config, which serialize and deserialize 3D molecular coordinates using the new binning configuration, so encoding and decoding can be customized per configuration.
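To illustrate the idea, here is a standard-library sketch of a bin configuration with uniform and quantile fitting. The names `BinConfigSketch`, `fit_uniform_bins_sketch`, and `fit_quantile_bins_sketch`, the single `edges` field, and the JSON persistence format are all assumptions for illustration; the PR's BinConfig may store different fields and use different signatures.

```python
import json
import statistics
from dataclasses import asdict, dataclass
from typing import Iterable, List


@dataclass
class BinConfigSketch:
    """Hypothetical stand-in for BinConfig; one edge list shared by all axes."""

    edges: List[float]  # increasing bin edges, len(edges) == n_bins + 1

    @property
    def n_bins(self) -> int:
        return len(self.edges) - 1

    @property
    def digit_width(self) -> int:
        # Digits needed to print the largest bin index with zero padding.
        return len(str(self.n_bins - 1))

    def save(self, path: str) -> None:
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(asdict(self), fh)

    @classmethod
    def load(cls, path: str) -> "BinConfigSketch":
        with open(path, encoding="utf-8") as fh:
            return cls(**json.load(fh))


def fit_uniform_bins_sketch(lo: float, hi: float, n_bins: int) -> BinConfigSketch:
    """Equal-width bins spanning [lo, hi]."""
    return BinConfigSketch([lo + (hi - lo) * i / n_bins for i in range(n_bins + 1)])


def fit_quantile_bins_sketch(values: Iterable[float], n_bins: int) -> BinConfigSketch:
    """Bins holding roughly equal numbers of observed values (needs >= 2 values)."""
    data = sorted(float(v) for v in values)
    cuts = statistics.quantiles(data, n=n_bins, method="inclusive")
    # A set drops duplicate edges produced by heavily repeated values.
    return BinConfigSketch(sorted({data[0], *cuts, data[-1]}))
```

Uniform bins depend only on the chosen range, while quantile bins become narrower where coordinates are dense, which is the usual reason to offer both.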

Bug fixes and minor improvements:

  • Changed the default coordinate binning range in both encode_cartesian_binned and encode_cartesian_binned_v2 from (-13.0, 13.0) to (-11.0, 11.0) for all axes, likely to better fit the data distribution.
  • Fixed a minor output formatting bug in encode_cartesian_binned_v2 by ensuring each atom entry is terminated with a semicolon, improving consistency for downstream parsing.
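The difference between a separator and a terminator matters to parsers: with a terminator, every entry ends the same way, including the last one. The sketch below uses hypothetical helper names and placeholder entry strings; the real entry layout produced by encode_cartesian_binned_v2 is not shown in this description.

```python
from typing import List


def join_terminated(entries: List[str]) -> str:
    """Emit every entry followed by ';', including the final one."""
    return "".join(f"{entry};" for entry in entries)


def split_terminated(text: str) -> List[str]:
    """Parse ';'-terminated entries and reject a truncated final entry."""
    if text and not text.endswith(";"):
        raise ValueError("last atom entry is not terminated with ';'")
    # The text after the final ';' is always empty, so drop it.
    return text.split(";")[:-1] if text else []
```

A downstream parser can then treat a missing final semicolon as a sign of truncated output.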

Summary of most important changes:

1. Path handling improvements

  • Added OSError exception handling in _resolve_path_value and _base_candidate_values to skip unreadable paths and continue searching for valid candidates.

2. Coordinate binning functionality

  • Introduced the BinConfig dataclass for managing binning configuration, including persistence methods.
  • Added fit_uniform_bins and fit_quantile_bins for flexible bin edge generation.
  • Implemented encode_cartesian_with_config and decode_cartesian_with_config for encoding and decoding molecules using bin configs.
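A round trip through binned coordinates can be sketched as follows. The entry layout (`symbol idx idx idx;`), the function names, the clamping of out-of-range values, and decoding to bin centers are all assumptions made for illustration and are not confirmed by the PR.

```python
from bisect import bisect_right
from typing import List, Sequence, Tuple

Atom = Tuple[str, float, float, float]


def to_bin(value: float, edges: Sequence[float]) -> int:
    """Index of the bin containing `value`; out-of-range values are clamped."""
    idx = bisect_right(edges, value) - 1
    return min(max(idx, 0), len(edges) - 2)


def bin_center(idx: int, edges: Sequence[float]) -> float:
    return 0.5 * (edges[idx] + edges[idx + 1])


def encode_atoms(atoms: Sequence[Atom], edges: Sequence[float]) -> str:
    """Serialize atoms as zero-padded bin indices, one ';'-terminated entry each."""
    width = len(str(len(edges) - 2))  # digits of the largest bin index
    parts = []
    for symbol, *xyz in atoms:
        indices = " ".join(f"{to_bin(c, edges):0{width}d}" for c in xyz)
        parts.append(f"{symbol} {indices};")
    return "".join(parts)


def decode_atoms(text: str, edges: Sequence[float]) -> List[Atom]:
    """Invert encode_atoms, mapping each bin index back to its bin center."""
    atoms = []
    for entry in text.split(";"):
        if not entry:
            continue  # skip the empty piece after the final ';'
        symbol, ix, iy, iz = entry.split()
        x, y, z = (bin_center(int(i), edges) for i in (ix, iy, iz))
        atoms.append((symbol, x, y, z))
    return atoms
```

Under these assumptions, the reconstruction error for an in-range coordinate is at most half the width of its bin, so finer or data-fitted edges trade sequence length for accuracy.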

3. Bug fixes

  • Updated the default coordinate binning ranges from (-13.0, 13.0) to (-11.0, 11.0), likely for a better fit to the data.
  • Fixed output formatting in encode_cartesian_binned_v2 so that each atom entry is terminated with a semicolon.

4. Code cleanup

  • Removed excessive docstrings and improved code readability across multiple functions.

Bundle the path fallback, serialization support, evaluation robustness, and training checkpoint path fixes that need to land together so the shared utilities stay consistent on top of main.

Made-with: Cursor