Uniform quantile data processing #24

Open: T4ras123 wants to merge 3 commits into main from uniform_quantile_data_processing


@T4ras123 (Contributor)

This pull request introduces new scripts for computing coordinate ranges and fitting bin configurations, and integrates support for bin configuration objects in the data preprocessing pipeline. The main goal is to provide more robust and flexible handling of coordinate binning, including quantile-based and uniform binning, and to facilitate downstream usage of these bin configurations for encoding molecular coordinates.

New scripts for data analysis and bin fitting:

  • Added scripts/compute_range_R.py for computing recommended coordinate ranges based on conformer radius proxies and quantile statistics from the training set. This script helps determine appropriate value ranges for coordinate binning and reports overflow statistics for train/val/test splits.
  • Added scripts/fit_bins.py for fitting uniform and quantile bin configurations to molecular coordinate data. The script pools coordinates, computes distribution summaries, fits bins, saves bin configs as JSON, and reports overflow statistics for each split.
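
As a rough illustration of the range computation described above, here is a minimal sketch. It is not the code in scripts/compute_range_R.py. The function names, the choice of radius proxy (largest atom distance from the conformer centroid), and the default quantile are all assumptions made for the example.

```python
import numpy as np


def radius_proxy(coords: np.ndarray) -> float:
    """Assumed proxy: largest distance of any atom from the conformer centroid."""
    centered = coords - coords.mean(axis=0)
    return float(np.linalg.norm(centered, axis=1).max())


def recommend_range(train_conformers, q: float = 0.999) -> float:
    """Pick R as the q-quantile of per-conformer radius proxies (training set only)."""
    radii = np.array([radius_proxy(c) for c in train_conformers])
    return float(np.quantile(radii, q))


def overflow_count(conformers, R: float) -> int:
    """Count conformers with at least one centered coordinate outside [-R, R]."""
    count = 0
    for coords in conformers:
        centered = coords - coords.mean(axis=0)
        if np.abs(centered).max() > R:
            count += 1
    return count


# Hypothetical usage: each conformer is an (n_atoms, 3) array.
small = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
large = np.array([[0.0, 0.0, 0.0], [0.0, 6.0, 0.0]])
R = recommend_range([small, large], q=0.5)
print(R, overflow_count([small, large], R))
```

Fitting R on the training split alone and then counting overflow on each split separately keeps the validation and test statistics from influencing the chosen range.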

Bin configuration integration and preprocessing improvements:

  • Added quantile and uniform bin configuration JSON files (quantile_bins.json, uniform_bins.json) for use with coordinate encoding routines. These files specify bin edges, ranges, and metadata for downstream encoding.
  • Modified read_mol and _read_mol_impl in data_preprocessing.py to accept an optional bin_config argument and support the new encode_cartesian_with_config embedding function, enabling flexible binning based on external configuration.
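
To make the idea of an external bin configuration concrete, the sketch below fits uniform and quantile bin edges, stores them in a JSON-serializable dict, and uses that dict to map coordinates to bin indices and back. The dict layout and every function name here (including encode_with_config) are hypothetical. They are not the schema of quantile_bins.json or uniform_bins.json, nor the signature of encode_cartesian_with_config.

```python
import json

import numpy as np


def fit_uniform_edges(R: float, n_bins: int) -> np.ndarray:
    """Equal-width bin edges covering [-R, R]."""
    return np.linspace(-R, R, n_bins + 1)


def fit_quantile_edges(values: np.ndarray, n_bins: int) -> np.ndarray:
    """Edges at equally spaced quantiles, so bins hold roughly equal counts."""
    return np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1))


def make_config(method: str, edges: np.ndarray) -> dict:
    """Hypothetical config layout: method name, bin count, and edge list."""
    return {
        "method": method,
        "n_bins": len(edges) - 1,
        "edges": [float(e) for e in edges],
    }


def encode_with_config(coords: np.ndarray, config: dict) -> np.ndarray:
    """Map each coordinate to a bin index; out-of-range values are clipped."""
    edges = np.asarray(config["edges"])
    idx = np.searchsorted(edges, coords, side="right") - 1
    return np.clip(idx, 0, config["n_bins"] - 1)


def decode_with_config(tokens: np.ndarray, config: dict) -> np.ndarray:
    """Map bin indices back to bin centers (lossy inverse of encoding)."""
    edges = np.asarray(config["edges"])
    centers = 0.5 * (edges[:-1] + edges[1:])
    return centers[tokens]


# Hypothetical usage: round-trip a config through JSON, then encode and decode.
config = make_config("uniform", fit_uniform_edges(R=2.0, n_bins=4))
restored = json.loads(json.dumps(config))
tokens = encode_with_config(np.array([-5.0, -0.5, 0.0, 1.5, 9.0]), restored)
print(tokens, decode_with_config(tokens, restored))
```

Clipping is only one way to treat overflow. Another option is to reserve dedicated overflow bins. Quantile edges can also repeat when the data contains many tied values, so a real implementation would likely need to deduplicate them.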

Configuration updates:

  • Updated paths.yaml to add an additional checkpoint path for qwen_yerevann_root and to update the model root for qw600_pre_binned_paired.

root added 3 commits February 27, 2026 01:41
- Introduced  to compute the binning range R from training data, including radius proxy calculations and overflow counts.
- Added  to fit bin configurations (uniform and quantile) from training data, producing JSON files for later use in encoding and decoding.
- Implemented  class for unified configuration management of binning methods, enhancing the encoding/decoding process.
- Updated data preprocessing to support new binning methods and configurations, ensuring compatibility with existing workflows.
- Removed extensive docstrings from , , and  to streamline the code and enhance readability.
- Kept function signatures intact while eliminating unnecessary comments, focusing on maintaining functionality and clarity.
- Introduced new paths for revisited cartesian and quantile binned datasets in paths.yaml.
- Added Qwen3 tokenizer configuration files, including metadata, tokenizer settings, and a chat template for the binned_258 variant.
- Updated count_tokens.py to remove an unnecessary parameter in the tokenizer loading process, enhancing clarity.