Infrastructure and updates related to addressing reviewers' concerns #1
Open
jcoludar wants to merge 5 commits into
Three new data preparation scripts for the revision sprint:
- GO semantic similarity (Wang method) with BMA protein-pair scoring
- Randomly initialized pLM baseline (--random_init flag on embedding_generation.py)
- PDB experimental TM-score pipeline (SIFTS mapping + TMalign)
- Parquet merge utility to feed new columns into training splits

Code improvements to existing files:
- Vectorized distance_computation.py batch processing (~2-3x speedup)
- Fixed drop_nulls().len() -> null_count() (2 instances)
- Removed hardcoded choices whitelist from run_experiments.py and train.py
- Documented magic hyperparameters in run_experiments.py
- Added goatools>=1.3 dependency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
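For readers unfamiliar with BMA (best-match average) scoring, here is a minimal sketch of one common definition: average the best match for each term of protein A with the best match for each term of protein B. The function name and the input layout (a precomputed term-by-term similarity matrix) are assumptions made for illustration, not taken from the PR, and the script itself may use a different BMA variant.

```python
def best_match_average(sim: list[list[float]]) -> float:
    """Best-match average over a term-term similarity matrix.

    sim[i][j] is the semantic similarity (e.g. from the Wang method)
    between GO term i of protein A and GO term j of protein B.
    Hypothetical helper; the layout is an assumption for illustration.
    """
    if not sim or not sim[0]:
        raise ValueError("both proteins need at least one annotated term")
    # Best partner in B for every term of A (row maxima).
    row_best = [max(row) for row in sim]
    # Best partner in A for every term of B (column maxima).
    col_best = [max(col) for col in zip(*sim)]
    # Average the two directional means so the score is symmetric.
    return 0.5 * (sum(row_best) / len(row_best) + sum(col_best) / len(col_best))
```

A variant that divides the summed row and column maxima by the total number of terms also appears in the literature; the two agree when both proteins have the same number of terms.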
…A, AUROC)

Four new scripts completing the non-computational infrastructure:
- EC-number hierarchy distance between protein pairs (4-level ordinal metric)
- BRENDA/HFSP validation against curated enzyme functional classes
- Retrieval metrics: recall-at-first-false-positive and AUROC
- Classification evaluation at SCOP/ECOD hierarchy levels

All 16 unit tests passing (8 for EC distance, 8 for retrieval metrics).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
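As a rough illustration of two of the metrics named above, the sketch below shows one plausible reading of a 4-level ordinal EC distance (4 minus the number of shared leading levels) and of recall at first false positive (the fraction of all positives ranked above the first negative). Both function names, the handling of the `-` placeholder, and the tie-breaking rule are assumptions; the scripts in the PR may define these differently.

```python
def ec_distance(ec_a: str, ec_b: str) -> int:
    """Ordinal distance between two EC numbers: 0 (identical) to 4 (different class).

    Assumed definition: 4 minus the count of matching leading levels.
    A '-' placeholder is treated as unknown and ends the match.
    """
    levels_a, levels_b = ec_a.split("."), ec_b.split(".")
    if len(levels_a) != 4 or len(levels_b) != 4:
        raise ValueError("EC numbers must have exactly four levels")
    shared = 0
    for a, b in zip(levels_a, levels_b):
        if a != b or a == "-":
            break
        shared += 1
    return 4 - shared


def recall_at_first_fp(scores: list[float], labels: list[bool]) -> float:
    """Fraction of all positives retrieved before the first false positive.

    Hits are ranked by descending score; ties keep their input order
    (an assumption, since the PR does not state a tie-breaking rule).
    """
    total_pos = sum(1 for is_pos in labels if is_pos)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    found = 0
    for _, is_pos in ranked:
        if not is_pos:
            break
        found += 1
    return found / total_pos
```

For example, `ec_distance("1.1.1.1", "1.1.1.2")` differs only at the fourth level and returns 1 under this definition.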
…ation

- scripts/run_ivan_pipeline.sh: 7-step pipeline runner with --step/--force/--dry-run
  Chains all new scripts in correct order with idempotent skip-if-exists checks
- scripts/download_reference_data.sh: fetches GO ontology, SIFTS, SCOP, ECOD, EC annotations, and CAFA instructions into data/reference/
- src/visualization/create_retrieval_plots.py: publication figures for AUROC bars, recall-at-first-FP bars, heatmap, and summary scatter

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
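The runner in this commit is a shell script. The Python sketch below only illustrates the skip-if-exists idea behind it: a step is skipped when its output already exists, unless it is forced, and a dry run reports what would happen without executing anything. The function name, its parameters, and the returned status strings are hypothetical.

```python
import subprocess
from pathlib import Path


def run_step(command: list[str], output: Path,
             force: bool = False, dry_run: bool = False) -> str:
    """Run one pipeline step idempotently (hypothetical helper).

    Returns "skipped", "would run", or "ran" so callers can log the decision.
    """
    # Existing output means the step already completed; skip unless forced.
    if output.exists() and not force:
        return "skipped"
    # Dry run: report the decision without touching anything.
    if dry_run:
        return "would run"
    # check=True aborts the pipeline on the first failing step.
    subprocess.run(command, check=True)
    return "ran"
```

Checking for the output before the dry-run test keeps a dry run honest: it reports exactly the steps a real run would execute.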
ECOD density distributions, per-residue probes, binding-site alignment recovery, overtraining investigation, organism landscapes, MGnify fetcher, Sup Fig 4 delta plot, graphical abstract. Includes dependency map and blockers list for non-codeable items.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix shebang ordering in ec_hierarchy_distance.py, ecod_homology_pairs.py
- Remove unused goatools dependency (self-contained OBO parser used instead)
- Fix classification_eval.py import to relative (works outside pytest)
- Document mmCIF-to-PDB limitation in pdb_tmscore.py
- Add paginated EC download to download_reference_data.sh (was first page only)
- Fix _bootstrap_stat() parallel/sequential control flow in metrics.py
- Add bootstrap metrics regression tests (5 tests)
- Document optional tbparse dependency in overtraining_analysis.py
- Fix train.py early_stopping_patience help text (said 5, actual default 3)
- Fix PEP 585 type annotations for codebase consistency
- Fix test_plot.py to skip gracefully when unknown_unknowns not installed
- Add new scripts: ecod_homology_pairs, organism_landscape, overtraining_analysis
- Update docs/todo.md with new scripts and applied fixes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
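The commit message does not show the _bootstrap_stat() fix itself. As a hedged illustration of the kind of parallel/sequential control flow such a helper needs, the sketch below runs exactly one of the two paths and seeds each replicate individually so both paths return identical results. All names and parameters are hypothetical and are not taken from metrics.py.

```python
import random
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def _one_replicate(task):
    """Resample with replacement using a per-replicate seed, then apply stat."""
    values, stat, seed = task
    rng = random.Random(seed)
    sample = [rng.choice(values) for _ in values]
    return stat(sample)


def bootstrap_stat(values, stat=mean, n_boot=200, seed=0, n_jobs=1):
    """Return n_boot bootstrap replicates of stat(values) (hypothetical helper)."""
    if not values:
        raise ValueError("values must not be empty")
    # One seed per replicate makes the result independent of execution order.
    tasks = [(values, stat, seed + i) for i in range(n_boot)]
    if n_jobs > 1:
        # Parallel path: returns here, so the sequential path never also runs.
        with ThreadPoolExecutor(max_workers=n_jobs) as pool:
            return list(pool.map(_one_replicate, tasks))
    # Sequential path.
    return [_one_replicate(task) for task in tasks]
```

A regression test in this style can simply assert that `bootstrap_stat(x, n_jobs=1)` equals `bootstrap_stat(x, n_jobs=4)`.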
Summary
--random_init flag
What changed in existing files
All modifications are marked with # --- Ivan infrastructure (2026-03-19) --- comments for easy identification.
- distance_computation.py: vectorized inner loop, no behavior change
- run_experiments.py: removed choices whitelists to accept new parameters, documented magic numbers
- train.py: fixed help text typo (early_stopping_patience default)
- embedding_generation.py: added --random_init flag (existing behavior unchanged)
Tests
21 tests pass (EC hierarchy, retrieval metrics, bootstrap control flow). Pre-existing test failures (uniref fixture, missing unified_embedder script) are unchanged.
Docs
- docs/specs/2026-03-19-ivan-infrastructure-design.md
- docs/todo.md with all new scripts and applied fixes