Add big-data preprocessing pipeline by T4ras123 · Pull Request #32 · YerevaNN/3DMolGen

T4ras123 · 2026-05-20T10:19:44Z

This pull request updates the data_preprocessing.py pipeline to support new embedding types, improved configuration, and enhanced SMILES handling. The main changes include refactoring embedding logic to support custom binning strategies, adding options for isomeric SMILES, and improving configuration management for embeddings.

Embedding logic and configuration improvements:

Refactored embedding function selection: replaced hardcoded embedding registry with a new get_embedding_func_and_config utility, enabling dynamic loading of embedding functions and binning configurations, and supporting new embedding types such as uniform_binned and quantile_binned (preprocess, CLI argument parsing) [1] [2].
Added support for loading BinConfig from a JSON file via a new --bin_config_path argument, required for the new embedding types [1] [2] [3].
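
The difference between the two new embedding types can be illustrated with a small, self-contained sketch. The `BinConfig` fields, the JSON layout, and the helpers `load_bin_config`, `bin_edges`, and `bin_index` below are all assumptions made for illustration; the actual schema and the signature of get_embedding_func_and_config are not shown in this description.

```python
import bisect
import json
from dataclasses import dataclass


@dataclass
class BinConfig:
    """Hypothetical stand-in; the real BinConfig fields are not shown in the PR."""
    strategy: str   # "uniform_binned" or "quantile_binned"
    num_bins: int
    low: float
    high: float


def load_bin_config(path):
    """Read a JSON file whose keys match the (assumed) BinConfig fields."""
    with open(path) as fh:
        return BinConfig(**json.load(fh))


def bin_edges(cfg, samples=()):
    """Return num_bins + 1 edges for the configured strategy."""
    if cfg.strategy == "uniform_binned":
        # Equal-width bins across [low, high].
        step = (cfg.high - cfg.low) / cfg.num_bins
        return [cfg.low + i * step for i in range(cfg.num_bins + 1)]
    if cfg.strategy == "quantile_binned":
        # Edges taken from the sorted samples so each bin holds a similar count.
        ordered = sorted(samples)
        last = len(ordered) - 1
        return [ordered[round(i * last / cfg.num_bins)]
                for i in range(cfg.num_bins + 1)]
    raise ValueError(f"unknown strategy: {cfg.strategy}")


def bin_index(value, edges):
    """Map a coordinate to its bin, clamping values outside the edge range."""
    idx = bisect.bisect_right(edges, value) - 1
    return min(max(idx, 0), len(edges) - 2)
```

Uniform bins need only a range, while quantile bins need representative samples, which is one plausible reason the new embedding types require a configuration file.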

SMILES handling enhancements:

Introduced a --isomeric/--use_isomeric_smiles flag to optionally use isomeric SMILES as the canonical SMILES in output, improving downstream compatibility [1] [2] [3] [4] [5].
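
As a rough sketch of how such a flag could be wired up: the parser below assumes that --isomeric and --use_isomeric_smiles are two spellings of one boolean option that defaults to off, which the description does not confirm. The helper names `build_parser` and `canonical_smiles` are hypothetical; `Chem.MolToSmiles` with its `isomericSmiles` argument is the standard RDKit call.

```python
import argparse


def build_parser():
    """Parser with one boolean option reachable under two names (assumed)."""
    parser = argparse.ArgumentParser(description="SMILES flag sketch")
    parser.add_argument(
        "--isomeric", "--use_isomeric_smiles",
        dest="use_isomeric_smiles",
        action="store_true",
        help="Write isomeric SMILES (stereo and isotope labels kept).",
    )
    return parser


def canonical_smiles(mol, use_isomeric_smiles):
    """Serialize an RDKit Mol, keeping stereochemistry only when requested."""
    # Imported lazily so the parser itself works without RDKit installed.
    from rdkit import Chem
    return Chem.MolToSmiles(mol, canonical=True,
                            isomericSmiles=use_isomeric_smiles)
```

With the flag off, stereocentres and isotope labels are dropped from the string, so two stereoisomers collapse to the same SMILES key; with it on, they stay distinct.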

Code quality and maintainability:

Centralized and simplified coordinate range parsing by replacing inline parsing with the get_coordinate_ranges_for_embedding utility, which now incorporates bin configuration if provided.
Consolidated embedding logic in encode_mol_with_embedding, reducing duplicated code and making it easier to add new embedding strategies [1] [2].

Miscellaneous:

Improved logging style for consistency and clarity.
Updated help text and argument descriptions for clarity, especially regarding legacy and new embedding options.

These changes make the preprocessing pipeline more flexible, extensible, and user-friendly, especially for users requiring advanced embedding strategies and precise SMILES control.

New preprocess_big_data.py handles {smiles: [Mol]} grouped pickles with shard-based processing for bounded RAM usage and SLURM array support. data_preprocessing_revisited.py adds support for the GEOM-revisited split format. data_preprocessing.py is cleaned up: the unused _geom_root argument and the dead --sort_by CLI flag are dropped, and the loguru format string is fixed.
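
One way shard-based processing can bound memory and map onto a SLURM job array is sketched below. This is not the code in preprocess_big_data.py: the `*.pkl` shard layout, the round-robin assignment, and the helper names `shards_for_task`, `process_shard`, and `run` are assumptions. `SLURM_ARRAY_TASK_ID` and `SLURM_ARRAY_TASK_COUNT` are environment variables that SLURM sets for array jobs; the sketch assumes the array is submitted with IDs starting at 0.

```python
import os
import pickle
from pathlib import Path


def shards_for_task(shard_paths, task_id, task_count):
    """Round-robin split: task k takes shards k, k + count, k + 2 * count, ..."""
    return [p for i, p in enumerate(sorted(shard_paths))
            if i % task_count == task_id]


def process_shard(path, handle_group):
    """Load one {smiles: [mol, ...]} pickle, process it, and return the mol count."""
    with open(path, "rb") as fh:
        groups = pickle.load(fh)  # only this shard is resident in memory
    count = 0
    for smiles, mols in groups.items():
        handle_group(smiles, mols)
        count += len(mols)
    return count


def run(shard_dir, handle_group):
    """Process the shards that belong to the current array task."""
    # Outside SLURM, fall back to a single task that handles every shard.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    task_count = int(os.environ.get("SLURM_ARRAY_TASK_COUNT", "1"))
    paths = list(Path(shard_dir).glob("*.pkl"))
    mine = shards_for_task(paths, task_id, task_count)
    return sum(process_shard(p, handle_group) for p in mine)
```

Because each task loads one shard at a time, peak memory scales with the largest shard, not with the whole dataset, and adding array tasks shortens wall-clock time without changing the per-task footprint.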

T4ras123 wants to merge 1 commit into main from pr/bigdata-preprocessing
