Infrastructure and updates related to addressing reviewers' concerns #1

Open
jcoludar wants to merge 5 commits into tsenoner:main from jcoludar:feat/ivan-infrastructure

@jcoludar

Collaborator

Summary

  • New data preparation pipelines: GO semantic similarity (Wang method), EC hierarchy distances, PDB experimental TM-scores, BRENDA/HFSP validation, ECOD homology pair distributions, organism landscape bias analysis
  • New evaluation modules: retrieval metrics (recall@1FP, AUROC), SCOP/ECOD classification evaluation, overtraining analysis
  • Random-initialized pLM baseline via --random_init flag
  • Shell scripts for reference data download and full pipeline orchestration
  • Vectorized distance computation (~2-3x speedup), bootstrap control flow fix in metrics.py
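
The two retrieval metrics named above can be illustrated with a minimal, standard-library sketch. This is not the code from this PR: the function names, the input layout (parallel lists of scores and binary labels), and the tie handling are assumptions made for illustration only.

```python
def recall_at_first_fp(scores, labels):
    """Fraction of all positives ranked above the best-scoring negative.

    Hits are ranked by descending score. Ties between a positive and a
    negative are resolved by input order here, which is an arbitrary choice.
    """
    n_pos = sum(1 for y in labels if y)
    if n_pos == 0:
        return float("nan")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    found = 0
    for _, is_positive in ranked:
        if not is_positive:
            break  # first false positive reached
        found += 1
    return found / n_pos


def auroc(scores, labels):
    """AUROC as the probability that a positive outscores a negative.

    Uses the pairwise (Mann-Whitney) formulation; ties count as 0.5.
    Quadratic in the number of hits, so only suitable for small inputs.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        return float("nan")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy query: five hits, three of which are true homologs (label 1).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5]
toy_labels = [1, 1, 0, 1, 0]
toy_recall = recall_at_first_fp(toy_scores, toy_labels)
toy_auroc = auroc(toy_scores, toy_labels)
```

In this toy ranking, two of the three positives appear before the first negative, and five of the six positive/negative pairs are ordered correctly.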

What changed in existing files

All modifications are marked with # --- Ivan infrastructure (2026-03-19) --- comments for easy identification.

  • distance_computation.py — vectorized inner loop, no behavior change
  • run_experiments.py — removed choices whitelists to accept new parameters, documented magic numbers
  • train.py — fixed help text typo (early_stopping_patience default)
  • embedding_generation.py — added --random_init flag (existing behavior unchanged)
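
As a hedged illustration of how an opt-in flag such as --random_init can leave existing behavior unchanged, the sketch below uses argparse with a `store_true` action, so the flag defaults to `False` when omitted. The `--model_name` option, its default value, and the `describe_weights` helper are hypothetical and are not taken from embedding_generation.py.

```python
import argparse


def build_parser():
    """Build a toy CLI; only --random_init mirrors the flag named in this PR."""
    parser = argparse.ArgumentParser(description="Embedding generation (sketch)")
    # Hypothetical option, present only so the example has a model to refer to.
    parser.add_argument("--model_name", default="example-plm")
    # store_true means the flag is False unless passed explicitly, so runs
    # that never mention it keep using pretrained weights.
    parser.add_argument(
        "--random_init",
        action="store_true",
        help="use randomly initialized weights instead of pretrained ones",
    )
    return parser


def describe_weights(args):
    """Return which weights a run would use (hypothetical helper)."""
    return "random" if args.random_init else "pretrained"


default_args = build_parser().parse_args([])
baseline_args = build_parser().parse_args(["--random_init"])
```

With Hugging Face Transformers, a random baseline is commonly obtained by building the model from its config rather than loading a checkpoint; whether this PR does so is not stated here.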

Tests

21 tests pass (EC hierarchy, retrieval metrics, bootstrap control flow). Pre-existing test failures (uniref fixture, missing unified_embedder script) are unchanged.

Docs

  • Design spec: docs/specs/2026-03-19-ivan-infrastructure-design.md
  • Updated docs/todo.md with all new scripts and applied fixes

jcoludar and others added 5 commits March 19, 2026 18:41
Three new data preparation scripts for the revision sprint:
- GO semantic similarity (Wang method) with BMA protein-pair scoring
- Randomly initialized pLM baseline (--random_init flag on embedding_generation.py)
- PDB experimental TM-score pipeline (SIFTS mapping + TMalign)
- Parquet merge utility to feed new columns into training splits

Code improvements to existing files:
- Vectorized distance_computation.py batch processing (~2-3x speedup)
- Fixed drop_nulls().len() -> null_count() (2 instances)
- Removed hardcoded choices whitelist from run_experiments.py and train.py
- Documented magic hyperparameters in run_experiments.py
- Added goatools>=1.3 dependency
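
The GO semantic similarity item above combines a term-level measure (Wang) with a protein-level aggregation (BMA, best-match average). A minimal sketch on a toy ontology follows. The toy DAG, the function names, and the pooled form of BMA are assumptions for illustration; the edge weight 0.8 for is_a follows the value commonly used with the Wang method, and the script in this PR may differ in all of these details.

```python
# Toy ontology: child -> list of (parent, edge weight). All edges are
# treated as is_a with weight 0.8; this DAG is made up for the example.
TOY_DAG = {
    "root": [],
    "A": [("root", 0.8)],
    "B": [("root", 0.8)],
    "A1": [("A", 0.8)],
    "A2": [("A", 0.8)],
}


def s_values(term, dag):
    """S-value of every ancestor of `term` (itself included, with value 1).

    Each ancestor gets the maximum product of edge weights over all paths
    leading up from `term`.
    """
    s = {term: 1.0}
    stack = [term]
    while stack:
        node = stack.pop()
        for parent, weight in dag[node]:
            candidate = s[node] * weight
            if candidate > s.get(parent, 0.0):
                s[parent] = candidate
                stack.append(parent)  # revisit: a better path was found
    return s


def wang_sim(term_a, term_b, dag):
    """Wang similarity: shared S-values over total S-values."""
    sa, sb = s_values(term_a, dag), s_values(term_b, dag)
    shared = sa.keys() & sb.keys()
    numerator = sum(sa[t] + sb[t] for t in shared)
    return numerator / (sum(sa.values()) + sum(sb.values()))


def bma(terms_a, terms_b, dag):
    """Best-match average over two non-empty term sets (pooled variant)."""
    best_a = [max(wang_sim(a, b, dag) for b in terms_b) for a in terms_a]
    best_b = [max(wang_sim(a, b, dag) for a in terms_a) for b in terms_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))


sibling_sim = wang_sim("A1", "A2", TOY_DAG)  # share parent A and root
distant_sim = wang_sim("A1", "B", TOY_DAG)   # share only root
pair_score = bma(["A1"], ["A2", "B"], TOY_DAG)
```

Sibling terms score higher than terms that meet only at the root, and BMA rewards each annotation for its best counterpart in the other protein.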

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…A, AUROC)

Four new scripts completing the non-computational infrastructure:
- EC-number hierarchy distance between protein pairs (4-level ordinal metric)
- BRENDA/HFSP validation against curated enzyme functional classes
- Retrieval metrics: recall-at-first-false-positive and AUROC
- Classification evaluation at SCOP/ECOD hierarchy levels

All 16 unit tests passing (8 for EC distance, 8 for retrieval metrics).
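
One plausible reading of a "4-level ordinal metric" over EC numbers is sketched below, assuming the distance is 4 minus the number of leading levels two EC numbers share, and that an unassigned level ("-") never counts as a match. Both conventions are assumptions; the definition used in ec_hierarchy_distance.py is not given here.

```python
def ec_distance(ec_a, ec_b):
    """Ordinal distance in 0..4 between two four-level EC numbers.

    0 means identical at all four levels, 4 means the EC classes differ.
    A "-" placeholder stops the comparison (assumed convention).
    """
    levels_a, levels_b = ec_a.split("."), ec_b.split(".")
    if len(levels_a) != 4 or len(levels_b) != 4:
        raise ValueError("expected EC numbers with exactly four levels")
    shared = 0
    for a, b in zip(levels_a, levels_b):
        if a != b or a == "-":
            break
        shared += 1
    return 4 - shared


same_enzyme = ec_distance("1.1.1.1", "1.1.1.1")
same_subsubclass = ec_distance("1.1.1.1", "1.1.1.2")
different_class = ec_distance("1.1.1.1", "2.1.1.1")
partial_annotation = ec_distance("1.1.-.-", "1.1.-.-")
```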

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation

- scripts/run_ivan_pipeline.sh: 7-step pipeline runner with --step/--force/--dry-run
  Chains all new scripts in correct order with idempotent skip-if-exists checks
- scripts/download_reference_data.sh: fetches GO ontology, SIFTS, SCOP, ECOD,
  EC annotations, and CAFA instructions into data/reference/
- src/visualization/create_retrieval_plots.py: publication figures for
  AUROC bars, recall-at-first-FP bars, heatmap, and summary scatter
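
The skip-if-exists behavior described for the pipeline runner can be sketched as follows. The real runner is a shell script; this Python version only illustrates the control logic, and the step names, the (name, output path) layout, and both function names are hypothetical.

```python
from pathlib import Path


def plan_steps(steps, only_step=None, force=False):
    """Return the names of the steps that still need to run.

    `steps` is an ordered list of (name, output_path). A step is skipped
    when its output already exists, unless `force` is set. `only_step`
    selects a single step by its 1-based position.
    """
    planned = []
    for index, (name, output) in enumerate(steps, start=1):
        if only_step is not None and index != only_step:
            continue
        if Path(output).exists() and not force:
            continue  # idempotent: output already present
        planned.append(name)
    return planned


def run_pipeline(steps, actions, only_step=None, force=False, dry_run=False):
    """Run planned steps via `actions[name]()`; return the names executed."""
    executed = []
    for name in plan_steps(steps, only_step=only_step, force=force):
        if dry_run:
            print(f"[dry-run] would run step: {name}")
            continue
        actions[name]()
        executed.append(name)
    return executed
```

Deciding what to run separately from running it keeps the dry-run path trivial: it reuses the same plan and only skips the execution.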

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ECOD density distributions, per-residue probes, binding-site alignment
recovery, overtraining investigation, organism landscapes, MGnify fetcher,
Sup Fig 4 delta plot, graphical abstract. Includes dependency map and
blockers list for non-codeable items.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Fix shebang ordering in ec_hierarchy_distance.py, ecod_homology_pairs.py
- Remove unused goatools dependency (self-contained OBO parser used instead)
- Fix classification_eval.py import to relative (works outside pytest)
- Document mmCIF-to-PDB limitation in pdb_tmscore.py
- Add paginated EC download to download_reference_data.sh (was first page only)
- Fix _bootstrap_stat() parallel/sequential control flow in metrics.py
- Add bootstrap metrics regression tests (5 tests)
- Document optional tbparse dependency in overtraining_analysis.py
- Fix train.py early_stopping_patience help text (said 5, actual default 3)
- Fix PEP 585 type annotations for codebase consistency
- Fix test_plot.py to skip gracefully when unknown_unknowns not installed
- Add new scripts: ecod_homology_pairs, organism_landscape, overtraining_analysis
- Update docs/todo.md with new scripts and applied fixes
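
The commit does not describe what was wrong with the parallel/sequential control flow, so the sketch below only shows one way to structure such a function so that exactly one branch runs and both branches give the same result. The function names, the percentile interval, the per-replicate seeding scheme, and the use of a thread pool are all assumptions; none of this is taken from metrics.py.

```python
import random
import statistics
from concurrent.futures import ThreadPoolExecutor


def _one_replicate(values, stat, seed):
    """Resample `values` with replacement and apply `stat`."""
    rng = random.Random(seed)
    sample = [rng.choice(values) for _ in values]
    return stat(sample)


def bootstrap_stat(values, stat=statistics.mean, n_boot=200, seed=0, n_jobs=1):
    """Return (point estimate, lower, upper) for a 95% percentile interval.

    Each replicate has its own seed, so the parallel and sequential
    branches produce identical replicates in identical order.
    """
    seeds = [seed + i for i in range(n_boot)]
    if n_jobs > 1:
        # Parallel branch: pool.map preserves the order of `seeds`.
        with ThreadPoolExecutor(max_workers=n_jobs) as pool:
            replicates = list(
                pool.map(lambda s: _one_replicate(values, stat, s), seeds)
            )
    else:
        # Sequential branch: runs only when the parallel branch did not.
        replicates = [_one_replicate(values, stat, s) for s in seeds]
    replicates.sort()
    lower = replicates[int(0.025 * n_boot)]
    upper = replicates[min(n_boot - 1, int(0.975 * n_boot))]
    return stat(values), lower, upper


toy_values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
sequential_result = bootstrap_stat(toy_values, n_jobs=1)
parallel_result = bootstrap_stat(toy_values, n_jobs=4)
```

A regression test for this kind of fix can simply assert that both settings of `n_jobs` return the same tuple for a fixed seed.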

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>