Revisited data processing scripts by T4ras123 · Pull Request #29 · YerevaNN/3DMolGen

T4ras123 · 2026-04-06T08:34:24Z

This pull request introduces several improvements and new features to the data preprocessing scripts for 3D molecular data. The main changes add support for new embedding types, allow configuration of binning via external files, and provide the option to use isomeric SMILES for canonicalization. The codebase is also refactored for better modularity and maintainability by moving utility functions and centralizing embedding logic.

Major feature additions:

Added support for new embedding types: "uniform_binned" and "quantile_binned", including command-line options in both data_preprocessing.py and preprocess_geom_grouped.py. [1] [2]
Added --bin_config_path argument to allow specifying a JSON file for binning configuration, used by the new embedding types. (src/molgen3D/data_processing/data_preprocessing.py R455-R473, Fd44241bL245)
Added --isomeric/--use_isomeric_smiles flag to optionally use isomeric SMILES as the canonical identifier for output samples. (src/molgen3D/data_processing/data_preprocessing.py R455-R473, Fd44241bL245)
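
For reviewers who want a quick picture of the new command-line surface, the following is a minimal sketch of how the options listed above could be wired with argparse. Only the embedding type names and the --bin_config_path, --isomeric and --use_isomeric_smiles flags come from this description; the --embedding_type flag name, the defaults, the build_parser and load_bin_config helpers, and the JSON layout are assumptions made for illustration and may differ from the actual scripts.

```python
import argparse
import json
from pathlib import Path
from typing import Optional


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser; not the one in data_preprocessing.py."""
    parser = argparse.ArgumentParser(description="Preprocessing options (sketch)")
    parser.add_argument(
        "--embedding_type",  # assumed flag name, not confirmed by the PR text
        choices=["uniform_binned", "quantile_binned"],
        default="uniform_binned",
        help="Which binned embedding to use.",
    )
    parser.add_argument(
        "--bin_config_path",
        type=Path,
        default=None,
        help="JSON file with the binning configuration.",
    )
    # Both spellings map to one boolean destination.
    parser.add_argument(
        "--isomeric",
        "--use_isomeric_smiles",
        dest="use_isomeric_smiles",
        action="store_true",
        help="Use isomeric SMILES as the canonical identifier.",
    )
    return parser


def load_bin_config(path: Optional[Path]) -> Optional[dict]:
    """Return the parsed JSON config, or None when no path was given."""
    if path is None:
        return None
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)
```

With this sketch, `build_parser().parse_args(["--embedding_type", "quantile_binned", "--isomeric"])` yields a namespace whose `use_isomeric_smiles` is `True` and whose `bin_config_path` is `None`.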

Refactoring and code quality improvements:

Centralized embedding function selection and bin config loading into get_embedding_func_and_config, removing hardcoded embedding registries from the scripts. [1] [2] [3] [4]
Moved utility functions such as copy_single_conformer_mol, extract_conf_meta, and save_grouped_pickle to the shared utils module for better modularity. [1] [2] [3]
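
To illustrate what centralizing the embedding selection might look like, here is a rough sketch of a registry-backed selector. The name get_embedding_func_and_config comes from this PR, but its signature, the registry, the two toy embedding functions, and the JSON keys ("low", "high", "num_bins", "edges") are all assumptions for the example and are not taken from the actual code.

```python
import bisect
import json
from typing import Callable, Dict, Optional, Tuple


def embed_uniform_binned(value: float, config: dict) -> int:
    """Map a value to one of num_bins equal-width bins over [low, high]."""
    low, high, num_bins = config["low"], config["high"], config["num_bins"]
    clipped = min(max(value, low), high)  # out-of-range values go to the edge bins
    index = int((clipped - low) / (high - low) * num_bins)
    return min(index, num_bins - 1)  # value == high belongs to the last bin


def embed_quantile_binned(value: float, config: dict) -> int:
    """Map a value to a bin using precomputed, sorted quantile edges."""
    return bisect.bisect_right(config["edges"], value)


# Hypothetical registry; a single lookup replaces per-script hardcoded tables.
_EMBEDDING_REGISTRY: Dict[str, Callable[[float, dict], int]] = {
    "uniform_binned": embed_uniform_binned,
    "quantile_binned": embed_quantile_binned,
}


def get_embedding_func_and_config(
    embedding_type: str, bin_config_path: Optional[str] = None
) -> Tuple[Callable[[float, dict], int], dict]:
    """Assumed signature: return the embedding function and its bin config."""
    if embedding_type not in _EMBEDDING_REGISTRY:
        raise ValueError(f"Unknown embedding type: {embedding_type!r}")
    if bin_config_path is None:
        raise ValueError(f"{embedding_type!r} requires a bin config file")
    with open(bin_config_path, "r", encoding="utf-8") as handle:
        config = json.load(handle)
    return _EMBEDDING_REGISTRY[embedding_type], config
```

Under these assumptions, both scripts would call the selector once and pass the returned function and config to their workers, so adding an embedding type means touching only the registry.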

Canonical SMILES handling:

The canonical SMILES used for output can now be either isomeric or non-isomeric, controlled by the new --isomeric/--use_isomeric_smiles flag, improving flexibility for downstream tasks. [1] [2] [3] [4]

Other improvements:

Simplified and unified coordinate range parsing by introducing parse_coordinate_ranges, replacing inline ast.literal_eval usage. [1] [2]
Updated multiprocessing argument passing and embedding function signatures to support the new options and maintain backward compatibility. [1] [2] [3]
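
As a sketch of why a dedicated parser is preferable to inline ast.literal_eval calls, the helper below validates the parsed value in one place. The name parse_coordinate_ranges comes from this PR; the accepted input format (a dict literal mapping axis names to two-element bounds), the return type, and the error behavior are assumptions for illustration, and the real helper may differ.

```python
import ast
from typing import Dict, Tuple


def parse_coordinate_ranges(raw: str) -> Dict[str, Tuple[float, float]]:
    """Parse a string such as "{'x': (-10, 10)}" into validated float ranges.

    Assumed format: a Python dict literal of axis name -> (low, high).
    """
    try:
        parsed = ast.literal_eval(raw)  # evaluates literals only, never arbitrary code
    except (ValueError, SyntaxError) as exc:
        raise ValueError(f"Could not parse coordinate ranges: {raw!r}") from exc

    if not isinstance(parsed, dict):
        raise ValueError("Coordinate ranges must be a dict literal")

    ranges: Dict[str, Tuple[float, float]] = {}
    for axis, bounds in parsed.items():
        if not isinstance(bounds, (list, tuple)) or len(bounds) != 2:
            raise ValueError(f"Range for {axis!r} must have exactly two bounds")
        low, high = float(bounds[0]), float(bounds[1])
        if low >= high:
            raise ValueError(f"Range for {axis!r} must satisfy low < high")
        ranges[str(axis)] = (low, high)
    return ranges
```

Callers then get one consistent error type for malformed input instead of handling SyntaxError and ValueError at every call site.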

These changes make the preprocessing pipeline more flexible, configurable, and maintainable, supporting new embedding strategies and improving reproducibility.

T4ras123 added 2 commits April 6, 2026 12:23

Add revisited data processing scripts and helpers

6e85905

Support the revisited preprocessing workflow with grouped/counting utilities and shared serialization helpers so the data pipeline can be reviewed independently from training and evaluation changes. Made-with: Cursor

Trim PR to preprocessing modules and shared helpers

a42045b

Keep the PR focused on the preprocessing implementation and shared helper code, and drop auxiliary scripts and token counting changes from the review. Made-with: Cursor

T4ras123 wants to merge 2 commits into main from revisited-data-processing-scripts
