Skip to content

Demote proximity subsampling to opt-in and fix OOM errors#1210

Open
trvrb wants to merge 1 commit into
masterfrom
cleanup-proximity
Open

Demote proximity subsampling to opt-in and fix OOM errors#1210
trvrb wants to merge 1 commit into
masterfrom
cleanup-proximity

Conversation

@trvrb

@trvrb trvrb commented Jun 27, 2026

Copy link
Copy Markdown
Member

Third in the pandemic-era cruft cleanup series (after #1208, #1209). This is the headline behavior change.

Motivation

Production open/GISAID builds moved to population-weighted subsampling and no longer use proximity-based priorities. Only the CI smoke-test still exercised proximity.

The idea behind proximity was if you have a focal region of interest you can include genetically similar sequences from outside this focal region. This was maybe important in 2020 when introductions, etc... were of interest, but this is now longer at all the case. We want to know about evolution and which lineages are where.

I can't think of a good reason to want proximity based subsampling in this day and age. So I don't think it answers a question and as implemented it's heavily taxing computationally.

Also this has been shifted out to augur proximity in nextstrain/augur#1962. I don't think we'll want it back for SARS-CoV-2, but if we do there's a better affordance now.

What's removed

  • Rules proximity_score and priority_score (main_workflow.smk), the get_priorities / get_priority_argument helpers, and the --priority wiring in the subsample rule.
  • Scripts get_distance_to_focal_set.py, priorities.py, and the orphan add_priorities_to_meta.py.
  • The priorities: {crowding_penalty} config block in defaults/parameters.yaml.
  • The priorities: {type: proximity} blocks from the default region, region_grouped_by_country, country, division, and location schemes. These schemes still work — they subsample via group_by + seq_per_group/max_sequences, just without genetic-proximity prioritization of contextual samples.
  • The same block from the nextstrain-ci scheme, so CI no longer exercises proximity.
  • This also removes the sibling priorities: type: file option (user-supplied priority TSV), since it shared the same --priority plumbing.

Tutorial

docs/src/tutorial/genomic-surveillance.rst is left mostly as-is. It drops priorities: YAML blocks from its example. And it updates the prose to reference focal and background separate from genetically proximal

Docs updated

Removed the priorities / crowding_penalty sections from the config reference and the proximity examples + explanatory paragraph from the config guide.

Verification

  • No remaining real references to proximity/priorities/crowding across code, profiles, and docs (only generic-English "take priority" in an unrelated script and historical changelog entries remain).
  • snakemake --profile nextstrain_profiles/nextstrain-ci -n drops cleanly from 37 → 33 jobs (exactly the proximity + priority steps), no errors.

Test plan

  • CI green.

Version bump

Dropping this affordance will require bumping repo version to v18.

🤖 Generated with Claude Code


This full removal has been revised to a demotion. The option still exists, just not promoted in tutorials, etc... the way it was previously. See below for more details.

@huddlej

huddlej commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

I can't think of a good reason to want proximity based subsampling in this day and age. So I don't think it answers a question and as implemented it's heavily taxing computationally.

Also this has been shifted out to augur proximity in nextstrain/augur#1962. I don't think we'll want it back for SARS-CoV-2, but if we do there's a better affordance now.

I remember WA DOH was keen on this functionality back in the day. If they still depend on it, that's a reason to keep it.

A kinder change for users would be to replace the original proximity subsampling with the new augur proximity implementation. This change would still delete most of the older logic, but it would (hopefully) keep most of the functionality. Since this repo was the origin of the idea behind that augur subcommand, the act of replacing the old logic with the new should confirm that the new interface does what we really want.

All of this said as someone who is not signing up to do this work, though! 😅 Maybe Claude is happy to do this?

@trvrb

trvrb commented Jun 29, 2026

Copy link
Copy Markdown
Member Author

Thanks @huddlej. There was another issue that I should have mentioned: my suspicion is that

  priorities:
        type: "proximity"

as implemented here (or as implemented in augur proximity) will fail to run for ncov just because the database to compare to is too big. I don't think the command will actually run. Ie it's not going to be helpful for anyone to ship an unrunnable feature with augur proximity swap.

However, I can plan to demonstrate this rather than just assert it.

@jameshadfield

Copy link
Copy Markdown
Member

as implemented in augur proximity) will fail to run for ncov just because the database to compare to is too big.

Yes, I don't think this'll work unless you have a ton of memory, but it might work if we did some clade based subsampling before the proximity calculations:

Due to the many millions of sequences available, this tool will not be able to search against all sequences (as they won’t all fit into memory). We suggest you downsample those using temporal filters, nextstrain clades or pango lineages to get a smaller contextual set before using this tool.

@huddlej

huddlej commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

That's fair! I wasn't thinking about the full GISAID/open database as the contextual set, but something smaller like the example in our tutorial. This functionality has been out there long enough, it's possible that people are using it this way. (I just pinged WA DOH folks to get a sense.)

@trvrb

trvrb commented Jun 30, 2026

Copy link
Copy Markdown
Member Author

Via Claude

To test, I ran the pre-removal workflow against the full open pool on a laptop (Nextstrain Docker runtime, 7.7 GiB RAM, single core per job), Washington focal, division-style build. Pool = aligned.fasta.zst = 9,428,719 sequences. proximity_score streams the pool in 10k-sequence chunks, building a dense chunk × focal distance matrix per chunk (time ∝ pool × focal, memory ∝ chunk × focal). Numbers are from Snakemake benchmark: files.

What proximity adds. A dry-run to the same subsampling endpoint shows a metadata-only build runs 3 jobs (subsample ×2, combine_samples); the proximity build runs 7 — adding index_sequences, extract_subsampled_sequences, proximity_score, priority_score. The full-pool sequence index is not built by metadata-only subsampling.

Time added by proximity (focal = 296 recent WA, --min-date 2025-01-01):

added step wall peak RSS
index_sequences (full pool) 1 h 17 m 56 s 2.6 GB
extract_subsampled_sequences 3 m 01 s 4.2 GB
proximity_score 17 m 07 s 470 MB
priority_score 2 m 03 s 5.2 GB
total added 1 h 40 m 10 s (peak 5.2 GB)

Default division scheme focal (all WA = 16,207 seqs): proximity_score is OOM-killed — non-zero exit, empty log (SIGKILL). Per-chunk dense arrays (10000 × 16207 × 8 B ≈ 1.3 GB each, ~6 at once ≈ 8 GB) exceed the 7.7 GiB ceiling. The same script completes at focal 296/309, so the failure is memory driven by focal size.

Calibration: proximity_score on the open 100k subsample = 309 focal × 90,857 pool = 11.97 s / 474 MB (scales ~linearly to the full-pool figures).

@trvrb

trvrb commented Jun 30, 2026

Copy link
Copy Markdown
Member Author

^ So I was wrong that it wouldn't run. It's slow and will crash out if you're not careful about keeping focal size small, but it runs.

Thanks for flagging the tutorial John, however I disagree with it as written: https://docs.nextstrain.org/projects/ncov/en/latest/tutorial/genomic-surveillance.html#break-down-the-command

The only reason someone should be doing priorities is pseudo-contact tracing. You have an outbreak and you want to fish out all the sequences in the database that are most closely related to this outbreak. The set up in the tutorial is neither here nor there and does proximity sampling on a small subset of a few thousand North America sequences. The tutorial should change no matter what.

If we want to keep priorities as an option then it's a reimplementation with augur proximity and demoting proximity in tutorials, etc... The logic to keep is if someone really does want to do the pseudo-contact tracing of an outbreak (or a clade).

@trvrb

trvrb commented Jun 30, 2026

Copy link
Copy Markdown
Member Author

Oh... I understand your comment better now James. I see that we'd need ~30Gb of memory per 1M SARS-CoV-2 genomes. So augur proximity would not work for the above use case. I'm leaning now towards not messing with things, but just at least not pushing proximity in tutorials etc... (again, today's use case is very different than the 2020/2021 use case).

@jameshadfield

Copy link
Copy Markdown
Member

I vote to drop it entirely from ncov - if there's demand from users then we can explore adding a augur proximity tutorial using some sensible filtering to get the contextual set down to a manageable size. We added a lot of functionality to this repo as part of outbreak response and I think it's too constraining to have to support all of that functionality forevermore.

trvrb added a commit that referenced this pull request Jun 30, 2026
PR #1210 removed proximity subsampling entirely. Demonstrating it showed the
original chunked implementation still runs over the full open pool and serves a
real niche -- pseudo-contact-tracing, where a small outbreak/clade focal set is
used to fish its closest relatives out of the database. Its only failure mode is
OOM when the focal set is large. The proposed `augur proximity` replacement
loads the entire context pool into RAM (a memory regression), so we keep the
original logic instead.

Restore the proximity machinery (scripts/get_distance_to_focal_set.py,
scripts/priorities.py, the get_priorities/get_priority_argument helpers, the
subsample rule's --priority wiring, the proximity_score/priority_score rules,
and the priorities: crowding_penalty default) and re-document it in the config
reference, but leave it out of the default region/country/division/location
schemes, the tutorial, and CI.

Fix the OOM by computing proximity_score's --chunk-size from the focal-set size
at run time, so the dense chunk x n_focal distance matrices stay ~2.8 GB
regardless of focal size (measured peak ~4 GB; mem_mb raised to 6000).
priority_score gets a pool-scaled mem_mb.

Verified on the open 100k pool: default schemes pull no proximity rules; an
opt-in proximity build runs end to end; a small focal keeps chunk_size 10000
(480 MB) while a 10k-focal build auto-drops to chunk_size 5832 and peaks at
4.1 GB with no OOM.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@trvrb trvrb force-pushed the cleanup-proximity branch from a810c3c to c7b1cb5 Compare June 30, 2026 02:53
@trvrb trvrb changed the title Remove proximity subsampling Demote proximity subsampling to opt-in and fix OOM errors Jun 30, 2026
Genetic-proximity prioritization (`priorities: type: proximity`, and the sibling
`type: file`) is removed from the default region/country/division/location
subsampling schemes, the genomic-surveillance tutorial, the config-file guide,
and CI; these now subsample purely by group_by + seq_per_group/max_sequences.
The orphan scripts/add_priorities_to_meta.py is dropped.

The proximity machinery (scripts/get_distance_to_focal_set.py,
scripts/priorities.py, the proximity_score/priority_score rules and their config
wiring) remains available as a documented opt-in for the niche
pseudo-contact-tracing use case: a small outbreak or clade focal set, used to
fish its closest relatives out of the database.

proximity_score now computes its --chunk-size from the focal-set size at run time
so the dense chunk x n_focal distance matrices stay ~2.8 GB regardless of focal
size (measured peak ~4 GB; mem_mb 6000), avoiding the OOM a large focal set
previously caused. priority_score gets a pool-scaled mem_mb.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@trvrb trvrb force-pushed the cleanup-proximity branch from c7b1cb5 to 4d8de01 Compare June 30, 2026 02:58
@trvrb

trvrb commented Jun 30, 2026

Copy link
Copy Markdown
Member Author

Via Claude

Based on the discussion above, I've revised the scope of this PR (now a single commit rebased on current master). It is no longer "remove proximity entirely" — it demotes proximity to an opt-in option and fixes the OOM.

Removed (proximity is no longer used by default):

  • The priorities: {type: proximity} blocks from the default region, country, division, and location schemes — these now subsample purely by group_by + seq_per_group/max_sequences.
  • Proximity from the genomic-surveillance tutorial, the config-file guide, and the nextstrain-ci smoke-test.
  • The orphan scripts/add_priorities_to_meta.py.

Kept (available as a documented opt-in):

  • scripts/get_distance_to_focal_set.py, scripts/priorities.py, the proximity_score/priority_score rules, the --priority subsampling wiring, and the priorities: {crowding_penalty} default.
  • The priorities config-reference docs (both type: proximity and type: file).
  • This serves the niche it's actually good for — pseudo-contact-tracing: a small outbreak/clade focal set, used to fish its closest relatives out of the database.

Memory fix (the OOM seen in the benchmark above): proximity_score now computes --chunk-size from the focal-set size at run time, so the dense chunk × n_focal distance matrices stay ~2.8 GB regardless of focal size (measured peak ~4 GB; mem_mb 6000). The large default-scheme focal (16,207 seqs → ~8 GB) was what got OOM-killed. priority_score gets a pool-scaled mem_mb.

Verified on the open 100k pool: default schemes pull no proximity rules; an opt-in proximity build runs end-to-end; a small focal keeps chunk_size 10000 (480 MB), while a 10k-focal build auto-drops to chunk_size 5832 and peaks at 4.1 GB with no OOM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants