Infrastructure and updates related to addressing reviewers' concerns by jcoludar · Pull Request #1 · tsenoner/plm_choice

jcoludar · 2026-03-20T15:39:10Z

Summary

New data preparation pipelines: GO semantic similarity (Wang method), EC hierarchy distances, PDB experimental TM-scores, BRENDA/HFSP validation, ECOD homology pair distributions, organism landscape bias analysis
New evaluation modules: retrieval metrics (recall@1FP, AUROC), SCOP/ECOD classification evaluation, overtraining analysis
Random-initialized pLM baseline via --random_init flag
Shell scripts for reference data download and full pipeline orchestration
Vectorized distance computation (~2-3x speedup), bootstrap control flow fix in metrics.py
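
For readers unfamiliar with the two retrieval metrics named above, here is a minimal sketch of how they can be computed for one query. It is not the code from this PR. The function names, the assumption that a higher score means a closer match, and the strict tie handling in `recall_at_first_fp` are illustrative choices.

```python
import numpy as np


def recall_at_first_fp(scores, labels):
    """Fraction of positives ranked strictly above the best-scoring negative.

    Assumes higher score = more similar (hypothetical convention).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos = int(labels.sum())
    if n_pos == 0:
        return float("nan")  # undefined without any positives
    if n_pos == labels.size:
        return 1.0  # no negatives, so no false positive is ever reached
    top_negative = scores[~labels].max()
    return float((scores[labels] > top_negative).sum() / n_pos)


def auroc(scores, labels):
    """AUROC as the Mann-Whitney probability that a positive outranks a negative.

    Ties between a positive and a negative count as 0.5.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = scores[labels], scores[~labels]
    if pos.size == 0 or neg.size == 0:
        return float("nan")  # undefined when one class is missing
    diff = pos[:, None] - neg[None, :]  # all positive-negative score differences
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (pos.size * neg.size))


scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [1, 1, 0, 1, 0]
print(recall_at_first_fp(scores, labels))
print(auroc(scores, labels))
```

The pairwise AUROC form uses memory proportional to positives times negatives, so for large candidate sets a rank-based implementation would be the more practical choice.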

What changed in existing files

All modifications are marked with # --- Ivan infrastructure (2026-03-19) --- comments for easy identification.

distance_computation.py — vectorized inner loop, no behavior change
run_experiments.py — removed choices whitelists to accept new parameters, documented magic numbers
train.py — fixed help text typo (early_stopping_patience default)
embedding_generation.py — added --random_init flag (existing behavior unchanged)
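
To illustrate what "vectorized inner loop, no behavior change" can look like, the sketch below replaces a per-pair Python loop with a single NumPy call and checks that both give the same result. Euclidean distance and both function names are assumptions made for illustration. The metric and code in distance_computation.py may differ.

```python
import numpy as np


def distances_loop(a, b):
    """Per-pair Euclidean distance, one Python iteration per row."""
    return np.array([np.sqrt(((x - y) ** 2).sum()) for x, y in zip(a, b)])


def distances_vectorized(a, b):
    """Same result computed for the whole batch in one NumPy call."""
    return np.linalg.norm(a - b, axis=1)


rng = np.random.default_rng(0)
emb_a = rng.normal(size=(1000, 64))  # hypothetical batch of embeddings
emb_b = rng.normal(size=(1000, 64))

# A vectorization that claims "no behavior change" should pass this check.
assert np.allclose(distances_loop(emb_a, emb_b), distances_vectorized(emb_a, emb_b))
```

An equivalence check of this kind is a cheap regression test to keep next to any such refactor.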

Tests

21 tests pass (EC hierarchy, retrieval metrics, bootstrap control flow). Pre-existing test failures (uniref fixture, missing unified_embedder script) are unchanged.

Docs

Design spec: docs/specs/2026-03-19-ivan-infrastructure-design.md
Updated docs/todo.md with all new scripts and applied fixes

Three new data preparation scripts for the revision sprint:
- GO semantic similarity (Wang method) with BMA protein-pair scoring
- Randomly initialized pLM baseline (--random_init flag on embedding_generation.py)
- PDB experimental TM-score pipeline (SIFTS mapping + TMalign)
- Parquet merge utility to feed new columns into training splits

Code improvements to existing files:
- Vectorized distance_computation.py batch processing (~2-3x speedup)
- Fixed drop_nulls().len() -> null_count() (2 instances)
- Removed hardcoded choices whitelist from run_experiments.py and train.py
- Documented magic hyperparameters in run_experiments.py
- Added goatools>=1.3 dependency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
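
As background for the "BMA protein-pair scoring" item, the sketch below shows one common formulation of best-match average: given a matrix of term-to-term similarities between the GO terms of two proteins, sum each term's best match in the other protein and divide by the total number of terms. Whether the PR uses this variant or the one that averages the two directional means is not stated here, so treat the formula and the function name as assumptions. The similarity values are made up for the example and are not Wang scores from real annotations.

```python
import numpy as np


def best_match_average(sim):
    """Combine a term-by-term similarity matrix into one protein-pair score.

    sim[i, j] is the similarity between term i of protein A and term j of
    protein B. Variant assumed here: (sum of row maxima + sum of column
    maxima) / (number of rows + number of columns).
    """
    sim = np.asarray(sim, dtype=float)
    if sim.ndim != 2 or sim.size == 0:
        return float("nan")  # a protein without annotated terms has no score
    n_a, n_b = sim.shape
    best_for_a = sim.max(axis=1)  # best partner in B for each term of A
    best_for_b = sim.max(axis=0)  # best partner in A for each term of B
    return float((best_for_a.sum() + best_for_b.sum()) / (n_a + n_b))


# Illustrative values only: protein A has 3 terms, protein B has 2.
sim = [[1.0, 0.2],
       [0.4, 0.6],
       [0.1, 0.3]]
print(best_match_average(sim))
```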

…A, AUROC)

Four new scripts completing the non-computational infrastructure:
- EC-number hierarchy distance between protein pairs (4-level ordinal metric)
- BRENDA/HFSP validation against curated enzyme functional classes
- Retrieval metrics: recall-at-first-false-positive and AUROC
- Classification evaluation at SCOP/ECOD hierarchy levels

All 16 unit tests passing (8 for EC distance, 8 for retrieval metrics).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
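
A 4-level ordinal EC distance can be sketched as "four minus the number of leading EC levels two annotations share". This is an illustrative reading of the commit message, not the code in ec_hierarchy_distance.py. The function name, the 0 to 4 scale, and treating "-" as unspecified (never matching) are all assumptions.

```python
def ec_distance(ec_a: str, ec_b: str) -> int:
    """Ordinal distance between two EC numbers.

    0 = identical at all four levels, 4 = different already at the first level.
    """
    levels_a = ec_a.split(".")[:4]
    levels_b = ec_b.split(".")[:4]
    shared = 0
    for x, y in zip(levels_a, levels_b):
        # Stop at the first mismatch; "-" means the level is unspecified,
        # so it is not counted as agreement (assumption).
        if x != y or x == "-":
            break
        shared += 1
    return 4 - shared


print(ec_distance("1.1.1.1", "1.1.1.1"))
print(ec_distance("1.1.1.1", "1.1.1.2"))
print(ec_distance("1.1.1.1", "2.7.1.1"))
```

Proteins with several EC annotations need an extra rule, for example taking the minimum distance over all annotation pairs. That choice is left open here.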

…ation

- scripts/run_ivan_pipeline.sh: 7-step pipeline runner with --step/--force/--dry-run. Chains all new scripts in correct order with idempotent skip-if-exists checks
- scripts/download_reference_data.sh: fetches GO ontology, SIFTS, SCOP, ECOD, EC annotations, and CAFA instructions into data/reference/
- src/visualization/create_retrieval_plots.py: publication figures for AUROC bars, recall-at-first-FP bars, heatmap, and summary scatter

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

ECOD density distributions, per-residue probes, binding-site alignment recovery, overtraining investigation, organism landscapes, MGnify fetcher, Sup Fig 4 delta plot, graphical abstract. Includes dependency map and blockers list for non-codeable items.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

- Fix shebang ordering in ec_hierarchy_distance.py, ecod_homology_pairs.py
- Remove unused goatools dependency (self-contained OBO parser used instead)
- Fix classification_eval.py import to relative (works outside pytest)
- Document mmCIF-to-PDB limitation in pdb_tmscore.py
- Add paginated EC download to download_reference_data.sh (was first page only)
- Fix _bootstrap_stat() parallel/sequential control flow in metrics.py
- Add bootstrap metrics regression tests (5 tests)
- Document optional tbparse dependency in overtraining_analysis.py
- Fix train.py early_stopping_patience help text (said 5, actual default 3)
- Fix PEP 585 type annotations for codebase consistency
- Fix test_plot.py to skip gracefully when unknown_unknowns not installed
- Add new scripts: ecod_homology_pairs, organism_landscape, overtraining_analysis
- Update docs/todo.md with new scripts and applied fixes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jcoludar and others added 5 commits March 19, 2026 18:41

jcoludar wants to merge 5 commits into tsenoner:main from jcoludar:feat/ivan-infrastructure
