Add big-data preprocessing pipeline #32
Open
T4ras123 wants to merge 1 commit into
New preprocess_big_data.py handles {smiles: [Mol]} grouped pickles with
shard-based processing for bounded RAM usage and SLURM array support.
data_preprocessing_revisited.py adds support for the GEOM-revisited split
format. data_preprocessing.py cleaned up: dropped the unused _geom_root arg
and the dead --sort_by CLI flag, and fixed the loguru format string.
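
For reviewers, the shard-based idea can be sketched as follows. This is a minimal illustration and not the code in preprocess_big_data.py: the helper names (shard_keys, shard_index_from_env, process_shard) and the round-robin assignment are assumptions, and plain strings stand in for RDKit Mol objects. The only external fact relied on is that SLURM exports SLURM_ARRAY_TASK_ID to each array task.

```python
import os


def shard_keys(keys, num_shards, shard_index):
    """Return the sorted keys assigned to one shard (round-robin assignment)."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    # Sorting first makes the assignment deterministic across array tasks.
    return sorted(keys)[shard_index::num_shards]


def shard_index_from_env(default=0):
    """Read the shard index from SLURM_ARRAY_TASK_ID, falling back to a default."""
    return int(os.environ.get("SLURM_ARRAY_TASK_ID", default))


def process_shard(grouped, num_shards, shard_index, encode):
    """Lazily yield (smiles, encoded_conformer) pairs for one shard.

    `grouped` mimics the {smiles: [Mol]} layout; `encode` is a hypothetical
    per-conformer function. Yielding one record at a time keeps only a single
    encoded conformer in memory, in addition to the input itself.
    """
    for smiles in shard_keys(grouped, num_shards, shard_index):
        for mol in grouped[smiles]:
            yield smiles, encode(mol)
```

Each SLURM array task would then call process_shard with shard_index_from_env() and write its own output file, so no task needs to hold the results of another shard.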
This pull request updates the data_preprocessing.py pipeline to support new embedding types, improved configuration, and enhanced SMILES handling. The main changes include refactoring embedding logic to support custom binning strategies, adding options for isomeric SMILES, and improving configuration management for embeddings.

Embedding logic and configuration improvements:
- get_embedding_func_and_config utility, enabling dynamic loading of embedding functions and binning configurations, and supporting new embedding types such as uniform_binned and quantile_binned (preprocess, CLI argument parsing).
- BinConfig from a JSON file via a new --bin_config_path argument, required for the new embedding types.

SMILES handling enhancements:
- --isomeric/--use_isomeric_smiles flag to optionally use isomeric SMILES as the canonical SMILES in output, improving downstream compatibility.

Code quality and maintainability:
- get_coordinate_ranges_for_embedding utility, which now incorporates bin configuration if provided.
- encode_mol_with_embedding, reducing duplicated code and making it easier to add new embedding strategies.

These changes make the preprocessing pipeline more flexible, extensible, and user-friendly, especially for users requiring advanced embedding strategies and precise SMILES control.
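
To make the difference between the uniform_binned and quantile_binned strategies concrete, here is a small sketch. It is not the PR's implementation: the BinConfig fields, the JSON layout with an "edges" key, and the helper names are all hypothetical, chosen only to show how a bin configuration loaded from a JSON file could drive coordinate binning.

```python
import bisect
import json
from dataclasses import dataclass


@dataclass
class BinConfig:
    """Hypothetical schema: the real BinConfig fields are not shown in this PR text."""
    edges: list  # ascending bin edges, length = number of bins + 1


def load_bin_config(path):
    """Load a BinConfig from a JSON file shaped like {"edges": [...]} (assumed layout)."""
    with open(path) as fh:
        return BinConfig(edges=json.load(fh)["edges"])


def uniform_edges(lo, hi, num_bins):
    """Equal-width bins over [lo, hi]."""
    step = (hi - lo) / num_bins
    return [lo + i * step for i in range(num_bins)] + [hi]


def quantile_edges(values, num_bins):
    """Nearest-rank quantile edges, so each bin holds roughly the same number of samples."""
    ordered = sorted(values)
    last = len(ordered) - 1
    return [ordered[round(i * last / num_bins)] for i in range(num_bins + 1)]


def bin_index(value, edges):
    """Map a coordinate to its bin, clamping out-of-range values to the end bins."""
    idx = bisect.bisect_right(edges, value) - 1
    return min(max(idx, 0), len(edges) - 2)
```

With skewed coordinate values, uniform edges leave most bins nearly empty, while quantile edges follow the data, which is one plausible reason for requiring a precomputed bin configuration for these embedding types.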