Add big-data preprocessing pipeline #32

Open
T4ras123 wants to merge 1 commit into main from pr/bigdata-preprocessing

@T4ras123

Contributor

This pull request updates the data_preprocessing.py pipeline to support new embedding types, improved configuration, and enhanced SMILES handling. The main changes include refactoring embedding logic to support custom binning strategies, adding options for isomeric SMILES, and improving configuration management for embeddings.

Embedding logic and configuration improvements:

  • Refactored embedding function selection: replaced hardcoded embedding registry with a new get_embedding_func_and_config utility, enabling dynamic loading of embedding functions and binning configurations, and supporting new embedding types such as uniform_binned and quantile_binned (preprocess, CLI argument parsing) [1] [2].
  • Added support for loading BinConfig from a JSON file via a new --bin_config_path argument, required for the new embedding types [1] [2] [3].
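
The two binning strategies and the JSON-backed configuration described above can be sketched roughly as follows. This is a minimal illustration, not the PR's implementation: the `BinConfig` fields, the `{"edges": [...]}` JSON layout, and the helper names `uniform_edges`, `quantile_edges`, `bin_index`, and `load_bin_config_for` are all assumptions made for the example.

```python
import bisect
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BinConfig:
    """Hypothetical stand-in for the PR's BinConfig: sorted interior bin edges."""
    edges: List[float]

    @classmethod
    def from_json(cls, path: str) -> "BinConfig":
        # Assumed file layout: {"edges": [0.25, 0.5, 0.75]}
        with open(path) as fh:
            return cls(edges=sorted(json.load(fh)["edges"]))


def uniform_edges(lo: float, hi: float, n_bins: int) -> List[float]:
    """Interior edges that split [lo, hi] into n_bins equal-width bins."""
    step = (hi - lo) / n_bins
    return [lo + step * i for i in range(1, n_bins)]


def quantile_edges(values: List[float], n_bins: int) -> List[float]:
    """Interior edges chosen so each bin holds roughly the same number of values."""
    ordered = sorted(values)
    return [ordered[(len(ordered) * i) // n_bins] for i in range(1, n_bins)]


def bin_index(value: float, config: BinConfig) -> int:
    """Map a value to a bin index; a value equal to an edge goes to the upper bin."""
    return bisect.bisect_right(config.edges, value)


# Embedding types that, per the description, need a bin configuration.
BINNED_TYPES = {"uniform_binned", "quantile_binned"}


def load_bin_config_for(embedding_type: str,
                        bin_config_path: Optional[str]) -> Optional[BinConfig]:
    """Load a BinConfig only for binned embedding types; fail early if it is missing."""
    if embedding_type in BINNED_TYPES:
        if bin_config_path is None:
            raise ValueError(f"{embedding_type} requires --bin_config_path")
        return BinConfig.from_json(bin_config_path)
    return None
```

Uniform edges depend only on the coordinate range, while quantile edges depend on the data distribution, which is one plausible reason for shipping them in a precomputed JSON file rather than recomputing them per run.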

SMILES handling enhancements:

  • Introduced a --isomeric/--use_isomeric_smiles flag to optionally use isomeric SMILES as the canonical SMILES in output, improving downstream compatibility [1] [2] [3] [4] [5].
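
For reviewers, a rough sketch of how such a flag could be wired with `argparse`. It assumes the two spellings are aliases for one boolean option and that the option defaults to off; neither detail is confirmed by the description, and `build_parser` is a name invented for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative subset of the CLI; not the pipeline's actual parser."""
    parser = argparse.ArgumentParser(description="Preprocessing flags (illustrative subset)")
    # Assumption: both spellings set the same destination and default to False.
    parser.add_argument(
        "--isomeric",
        "--use_isomeric_smiles",
        dest="use_isomeric_smiles",
        action="store_true",
        help="Write isomeric SMILES as the canonical SMILES in the output.",
    )
    # Path to a JSON bin configuration; default of None is an assumption.
    parser.add_argument("--bin_config_path", default=None)
    return parser


args = build_parser().parse_args(["--isomeric"])
print(args.use_isomeric_smiles)
```

If the pipeline uses RDKit (which the `Mol` pickles suggest but the description does not state), the parsed value would typically be forwarded to the `isomericSmiles` argument of `Chem.MolToSmiles`.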

Code quality and maintainability:

  • Centralized and simplified coordinate range parsing by replacing inline parsing with the get_coordinate_ranges_for_embedding utility, which now incorporates bin configuration if provided.
  • Consolidated embedding logic in encode_mol_with_embedding, reducing duplicated code and making it easier to add new embedding strategies [1] [2].

Miscellaneous:

  • Improved logging style for consistency and clarity.
  • Updated help text and argument descriptions for clarity, especially regarding legacy and new embedding options.

These changes make the preprocessing pipeline more flexible, extensible, and user-friendly, especially for users requiring advanced embedding strategies and precise SMILES control.

New preprocess_big_data.py handles {smiles: [Mol]} grouped pickles with
shard-based processing for bounded RAM usage and SLURM array support.
data_preprocessing_revisited.py adds support for the GEOM-revisited split
format. data_preprocessing.py cleaned up: dropped unused _geom_root arg,
dead --sort_by CLI flag, and fixed loguru format string.
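
A minimal sketch of the shard assignment idea, assuming each SLURM array task handles the SMILES keys whose hash falls into its shard. The function names and the hashing scheme are assumptions for illustration, not taken from preprocess_big_data.py; `SLURM_ARRAY_TASK_ID` is the environment variable SLURM sets for array jobs.

```python
import os
import zlib
from typing import Dict, Iterator, List, Tuple


def shard_of(smiles: str, num_shards: int) -> int:
    """Stable shard assignment; crc32 is deterministic across processes, unlike hash()."""
    return zlib.crc32(smiles.encode("utf-8")) % num_shards


def iter_shard(grouped: Dict[str, List[object]],
               shard_id: int,
               num_shards: int) -> Iterator[Tuple[str, List[object]]]:
    """Yield only the {smiles: [Mol]} entries that belong to the given shard."""
    for smiles, conformers in grouped.items():
        if shard_of(smiles, num_shards) == shard_id:
            yield smiles, conformers


def current_shard_id(default: int = 0) -> int:
    """Read the shard index from the SLURM array task id, falling back to a default."""
    return int(os.environ.get("SLURM_ARRAY_TASK_ID", default))
```

This sketch shows only the assignment logic. Memory stays bounded only if each task also reads a shard-sized portion of the input, for example after an initial pass that splits the grouped pickle into per-shard files.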
