Revisited data processing scripts #29

Open: T4ras123 wants to merge 2 commits into main from revisited-data-processing-scripts

@T4ras123 T4ras123 commented Apr 6, 2026

Contributor

This pull request updates the data preprocessing scripts for 3D molecular data. It adds support for new embedding types, lets binning be configured through external files, and adds an option to use isomeric SMILES for canonicalization. It also refactors the code for modularity and maintainability by moving utility functions into a shared module and centralizing embedding logic.

Major feature additions:

Refactoring and code quality improvements:

  • Centralized embedding function selection and bin config loading into get_embedding_func_and_config, removing hardcoded embedding registries from the scripts. [1] [2] [3] [4]
  • Moved utility functions such as copy_single_conformer_mol, extract_conf_meta, and save_grouped_pickle to the shared utils module for better modularity. [1] [2] [3]
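
To illustrate the centralization described above, here is a minimal sketch of what a single lookup point for embedding functions and bin configuration might look like. Only the name `get_embedding_func_and_config` comes from this PR; its signature, the JSON file format, the `bin_width` key, and the `embed_raw` / `embed_binned` functions are assumptions made for the example, not the actual implementation.

```python
import json
from pathlib import Path
from typing import Any, Callable, Dict, List, Optional, Tuple

Coord = Tuple[float, float, float]
EmbeddingFunc = Callable[[List[Coord], Dict[str, Any]], list]


def embed_raw(coords: List[Coord], config: Dict[str, Any]) -> list:
    """Hypothetical embedding: return the coordinates unchanged."""
    return [tuple(c) for c in coords]


def embed_binned(coords: List[Coord], config: Dict[str, Any]) -> list:
    """Hypothetical embedding: map each value to an integer bin index."""
    width = config["bin_width"]  # assumed key in the bin config file
    return [tuple(int(v // width) for v in c) for c in coords]


# One registry instead of a hardcoded copy in every script.
_EMBEDDINGS: Dict[str, EmbeddingFunc] = {
    "raw": embed_raw,
    "binned": embed_binned,
}


def get_embedding_func_and_config(
    name: str, bin_config_path: Optional[str] = None
) -> Tuple[EmbeddingFunc, Dict[str, Any]]:
    """Return the embedding function for `name` plus its bin config.

    The config is read from a JSON file when a path is given
    (assumed format), otherwise an empty dict is returned.
    """
    try:
        func = _EMBEDDINGS[name]
    except KeyError:
        known = ", ".join(sorted(_EMBEDDINGS))
        raise ValueError(f"unknown embedding {name!r}; expected one of: {known}") from None

    config: Dict[str, Any] = {}
    if bin_config_path is not None:
        config = json.loads(Path(bin_config_path).read_text())
    return func, config
```

With this shape, each script asks for an embedding by name and passes the returned config through, so adding an embedding type means touching only the registry.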

Canonical SMILES handling:

  • The canonical SMILES used for output can now be either isomeric or non-isomeric, controlled by a new flag, which gives downstream tasks the choice of keeping or dropping stereochemistry. [1] [2] [3] [4]
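
As a rough illustration of the isomeric switch, the sketch below wires a command-line flag to RDKit's `Chem.MolToSmiles`, whose `isomericSmiles` argument controls whether stereochemistry is written out. The flag name `--isomeric-smiles` and the helper names `build_parser` and `canonical_smiles` are hypothetical, chosen for this example; the flag name used in this PR is not shown here.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a hypothetical flag for isomeric output."""
    parser = argparse.ArgumentParser(description="Preprocess 3D molecular data")
    parser.add_argument(
        "--isomeric-smiles",  # hypothetical name, not taken from the PR
        action="store_true",
        help="keep stereochemistry in the canonical SMILES",
    )
    return parser


def canonical_smiles(smiles: str, isomeric: bool) -> str:
    """Return a canonical SMILES string, with or without stereo information."""
    # Imported lazily so that argument parsing works without RDKit installed.
    from rdkit import Chem

    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    return Chem.MolToSmiles(mol, canonical=True, isomericSmiles=isomeric)
```

Typical use would be `args = build_parser().parse_args()` followed by `canonical_smiles(smiles, args.isomeric_smiles)`. Leaving the flag off drops stereo markers such as `@` and `/`, so stereoisomers collapse to the same output string.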

Other improvements:

  • Simplified and unified coordinate range parsing by introducing parse_coordinate_ranges, replacing inline ast.literal_eval usage. [1] [2]
  • Updated multiprocessing argument passing and embedding function signatures to support the new options and maintain backward compatibility. [1] [2] [3]
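
A helper like `parse_coordinate_ranges` could look roughly like the sketch below. It assumes the ranges arrive as a Python-literal string such as `"[(-10, 10), (0, 5.5)]"` and that the helper still relies on `ast.literal_eval` internally while adding validation; the input format, return type, and error messages are assumptions, not the code in this PR.

```python
import ast
from typing import List, Tuple


def parse_coordinate_ranges(text: str) -> List[Tuple[float, float]]:
    """Parse a literal like "[(-10, 10), (0, 5.5)]" into (low, high) float pairs."""
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"could not parse coordinate ranges: {text!r}") from exc

    if not isinstance(value, (list, tuple)):
        raise ValueError("coordinate ranges must be a list of (low, high) pairs")

    ranges: List[Tuple[float, float]] = []
    for item in value:
        if not isinstance(item, (list, tuple)) or len(item) != 2:
            raise ValueError(f"expected a (low, high) pair, got {item!r}")
        try:
            low, high = float(item[0]), float(item[1])
        except (TypeError, ValueError) as exc:
            raise ValueError(f"range bounds must be numeric, got {item!r}") from exc
        if low >= high:
            raise ValueError(f"low must be smaller than high, got {item!r}")
        ranges.append((low, high))
    return ranges
```

Keeping the parsing in one function means malformed input fails with the same error in every script, instead of each call site handling `ast.literal_eval` exceptions on its own.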

These changes make the preprocessing pipeline more flexible, configurable, and maintainable, supporting new embedding strategies and improving reproducibility.

T4ras123 added 2 commits April 6, 2026 12:23
Support the revisited preprocessing workflow with grouped/counting utilities and shared serialization helpers so the data pipeline can be reviewed independently from training and evaluation changes.

Made-with: Cursor
Keep the PR focused on the preprocessing implementation and shared helper code, and drop auxiliary scripts and token counting changes from the review.

Made-with: Cursor