Demote proximity subsampling to opt-in and fix OOM errors #1210
I remember WA DOH was keen on this functionality back in the day. If they still depend on it, that's a reason to keep it. A kinder change for users would be to replace the original proximity subsampling with the new All of this said as someone who is not signing up to do this work, though! 😅 Maybe Claude is happy to do this? |
|
Thanks @huddlej. There was another issue that I should have mentioned: my suspicion is that as implemented here (or as implemented in However, I can plan to demonstrate this rather than just assert it. |
Yes, I don't think this'll work unless you have a ton of memory, but it might work if we did some clade-based subsampling before the proximity calculations:
|
|
That's fair! I wasn't thinking about the full GISAID/open database as the contextual set, but something smaller like the example in our tutorial. This functionality has been out there long enough that it's possible people are using it this way. (I just pinged WA DOH folks to get a sense.) |
|
Via Claude To test, I ran the pre-removal workflow against the full open pool on a laptop (Nextstrain Docker runtime, 7.7 GiB RAM, single core per job), Washington focal, What proximity adds. A dry-run to the same subsampling endpoint shows a metadata-only build runs 3 jobs ( Time added by proximity (focal = 296 recent WA,
Default Calibration: |
|
^ So I was wrong that it wouldn't run. It's slow and will crash if you're not careful to keep the focal set small, but it runs. Thanks for flagging the tutorial, John; however, I disagree with it as written: https://docs.nextstrain.org/projects/ncov/en/latest/tutorial/genomic-surveillance.html#break-down-the-command The only reason someone should be using priorities is pseudo-contact tracing: you have an outbreak and you want to fish out all the sequences in the database that are most closely related to it. The setup in the tutorial is neither here nor there: it does proximity sampling on a small subset of a few thousand North America sequences. The tutorial should change no matter what. If we want to keep priorities as an option then it's a reimplementation with |
|
Oh... I understand your comment better now, James. I see that we'd need ~30 GB of memory per 1M SARS-CoV-2 genomes. So |
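For anyone checking the ~30 GB figure, here is a back-of-the-envelope sketch. It assumes the whole context pool is held in RAM at one byte per nucleotide (my assumption about what drives the estimate, not something confirmed in this thread) and uses the 29,903 nt SARS-CoV-2 reference length. The function name is made up for illustration.

```python
# Rough memory estimate for holding an aligned genome pool in RAM.
GENOME_LENGTH = 29_903  # SARS-CoV-2 reference genome length in nucleotides
BYTES_PER_BASE = 1      # assumption: one byte per aligned base


def pool_memory_gb(n_genomes: int) -> float:
    """Return the approximate size in GB (10^9 bytes) of n_genomes aligned genomes."""
    return n_genomes * GENOME_LENGTH * BYTES_PER_BASE / 1e9


print(f"{pool_memory_gb(1_000_000):.1f} GB")  # prints "29.9 GB"
```

Any per-sequence overhead (Python objects, IDs, metadata) would come on top of this, so it is a floor rather than a prediction.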
|
I vote to drop it entirely from ncov - if there's demand from users then we can explore adding a |
PR #1210 removed proximity subsampling entirely. Demonstrating it showed the original chunked implementation still runs over the full open pool and serves a real niche: pseudo-contact-tracing, where a small outbreak/clade focal set is used to fish its closest relatives out of the database. Its only failure mode is OOM when the focal set is large. The proposed `augur proximity` replacement loads the entire context pool into RAM (a memory regression), so we keep the original logic instead.

Restore the proximity machinery (scripts/get_distance_to_focal_set.py, scripts/priorities.py, the get_priorities/get_priority_argument helpers, the subsample rule's --priority wiring, the proximity_score/priority_score rules, and the priorities: crowding_penalty default) and re-document it in the config reference, but leave it out of the default region/country/division/location schemes, the tutorial, and CI.

Fix the OOM by computing proximity_score's --chunk-size from the focal-set size at run time, so the dense chunk x n_focal distance matrices stay ~2.8 GB regardless of focal size (measured peak ~4 GB; mem_mb raised to 6000). priority_score gets a pool-scaled mem_mb.

Verified on the open 100k pool: default schemes pull no proximity rules; an opt-in proximity build runs end to end; a small focal keeps chunk_size 10000 (480 MB) while a 10k-focal build auto-drops to chunk_size 5832 and peaks at 4.1 GB with no OOM.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Genetic-proximity prioritization (`priorities: type: proximity`, and the sibling `type: file`) is removed from the default region/country/division/location subsampling schemes, the genomic-surveillance tutorial, the config-file guide, and CI; these now subsample purely by group_by + seq_per_group/max_sequences. The orphan scripts/add_priorities_to_meta.py is dropped.

The proximity machinery (scripts/get_distance_to_focal_set.py, scripts/priorities.py, the proximity_score/priority_score rules and their config wiring) remains available as a documented opt-in for the niche pseudo-contact-tracing use case: a small outbreak or clade focal set, used to fish its closest relatives out of the database.

proximity_score now computes its --chunk-size from the focal-set size at run time so the dense chunk x n_focal distance matrices stay ~2.8 GB regardless of focal size (measured peak ~4 GB; mem_mb 6000), avoiding the OOM a large focal set previously caused. priority_score gets a pool-scaled mem_mb.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
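To make the chunk-size idea concrete, here is a minimal sketch of sizing a chunk so that a dense chunk x n_focal matrix stays under a fixed memory budget. This is not the code from this PR: the function name, the `bytes_per_cell` parameter, and the value 48 used in the usage loop are all hypothetical. Only the ~2.8 GB budget and the 10000 upper bound are taken from the commit message, so the output is not expected to reproduce the reported chunk sizes exactly.

```python
def chunk_size_for_focal(
    n_focal: int,
    bytes_per_cell: int,
    budget_bytes: int = 2_800_000_000,  # ~2.8 GB budget quoted in the commit message
    max_chunk_size: int = 10_000,       # upper bound quoted in the commit message
) -> int:
    """Largest chunk size whose dense chunk x n_focal matrix fits in budget_bytes.

    bytes_per_cell is a hypothetical per-(context, focal) pair cost; it would
    have to be measured for the actual distance script.
    """
    if n_focal < 1 or bytes_per_cell < 1:
        raise ValueError("n_focal and bytes_per_cell must be positive")
    # Number of context rows that fit in the budget for this focal-set size.
    rows_that_fit = budget_bytes // (n_focal * bytes_per_cell)
    # Never exceed the cap, and always process at least one row per chunk.
    return max(1, min(max_chunk_size, rows_that_fit))


# Illustrative usage: 48 bytes per cell is a placeholder, not a measured value.
for n_focal in (300, 10_000):
    size = chunk_size_for_focal(n_focal, bytes_per_cell=48)
    print(n_focal, size, f"{size * n_focal * 48 / 1e9:.2f} GB")
```

The point of the design is that chunk size shrinks in inverse proportion to focal size once the cap no longer binds, so the per-chunk matrix footprint stays roughly constant and a large focal set costs run time rather than memory.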
|
Via Claude

Based on the discussion above, I've revised the scope of this PR (now a single commit rebased on current master). It is no longer "remove proximity entirely"; it demotes proximity to an opt-in option and fixes the OOM.

Removed (proximity is no longer used by default):
Kept (available as a documented opt-in):
Memory fix (the OOM seen in the benchmark above): Verified on the open 100k pool: default schemes pull no proximity rules; an opt-in proximity build runs end-to-end; a small focal keeps |
Third in the pandemic-era cruft cleanup series (after #1208, #1209). This is the headline behavior change.
Motivation
Production open/GISAID builds moved to population-weighted subsampling and no longer use proximity-based priorities. Only the CI smoke-test still exercised proximity.
The idea behind proximity was that, if you have a focal region of interest, you can include genetically similar sequences from outside that region. This was maybe important in 2020, when introductions and the like were of interest, but this is no longer the case at all. We want to know about evolution and which lineages are where.
I can't think of a good reason to want proximity-based subsampling in this day and age. I don't think it answers a question, and as implemented it's heavily taxing computationally.
Also, this has been shifted out to `augur proximity` in nextstrain/augur#1962. I don't think we'll want it back for SARS-CoV-2, but if we do there's a better affordance now.
What's removed
- `proximity_score` and `priority_score` (`main_workflow.smk`), the `get_priorities`/`get_priority_argument` helpers, and the `--priority` wiring in the `subsample` rule.
- `get_distance_to_focal_set.py`, `priorities.py`, and the orphan `add_priorities_to_meta.py`.
- `priorities: {crowding_penalty}` config block in `defaults/parameters.yaml`.
- `priorities: {type: proximity}` blocks from the default `region`, `region_grouped_by_country`, `country`, `division`, and `location` schemes. These schemes still work: they subsample via `group_by` + `seq_per_group`/`max_sequences`, just without genetic-proximity prioritization of contextual samples.
- `nextstrain-ci` scheme, so CI no longer exercises proximity.
- `priorities: type: file` option (user-supplied priority TSV), since it shared the same `--priority` plumbing.
Tutorial
`docs/src/tutorial/genomic-surveillance.rst` is left mostly as-is. It drops the `priorities:` YAML blocks from its example and updates the prose to refer to focal and background samples separately from genetically proximal ones.
Docs updated
Removed the `priorities`/`crowding_penalty` sections from the config reference and the proximity examples + explanatory paragraph from the config guide.
Verification
`snakemake --profile nextstrain_profiles/nextstrain-ci -n` drops cleanly from 37 → 33 jobs (exactly the proximity + priority steps), no errors.
Test plan
Version bump
Dropping this affordance will require bumping the repo version to v18.
🤖 Generated with Claude Code
This full removal has been revised to a demotion. The option still exists; it is just no longer promoted in tutorials and elsewhere the way it was previously. See below for more details.