Demote proximity subsampling to opt-in and fix OOM errors by trvrb · Pull Request #1210 · nextstrain/ncov

trvrb · 2026-06-27T03:42:03Z

Third in the pandemic-era cruft cleanup series (after #1208, #1209). This is the headline behavior change.

Motivation

Production open/GISAID builds moved to population-weighted subsampling and no longer use proximity-based priorities. Only the CI smoke-test still exercised proximity.

The idea behind proximity was if you have a focal region of interest you can include genetically similar sequences from outside this focal region. This was maybe important in 2020 when introductions, etc... were of interest, but this is now longer at all the case. We want to know about evolution and which lineages are where.

I can't think of a good reason to want proximity based subsampling in this day and age. So I don't think it answers a question and as implemented it's heavily taxing computationally.

Also this has been shifted out to augur proximity in nextstrain/augur#1962. I don't think we'll want it back for SARS-CoV-2, but if we do there's a better affordance now.

What's removed

Rules proximity_score and priority_score (main_workflow.smk), the get_priorities / get_priority_argument helpers, and the --priority wiring in the subsample rule.
Scripts get_distance_to_focal_set.py, priorities.py, and the orphan add_priorities_to_meta.py.
The priorities: {crowding_penalty} config block in defaults/parameters.yaml.
The priorities: {type: proximity} blocks from the default region, region_grouped_by_country, country, division, and location schemes. These schemes still work — they subsample via group_by + seq_per_group/max_sequences, just without genetic-proximity prioritization of contextual samples.
The same block from the nextstrain-ci scheme, so CI no longer exercises proximity.
This also removes the sibling priorities: type: file option (user-supplied priority TSV), since it shared the same --priority plumbing.

Tutorial

docs/src/tutorial/genomic-surveillance.rst is left mostly as-is. It drops priorities: YAML blocks from its example. And it updates the prose to reference focal and background separate from genetically proximal

Docs updated

Removed the priorities / crowding_penalty sections from the config reference and the proximity examples + explanatory paragraph from the config guide.

Verification

No remaining real references to proximity/priorities/crowding across code, profiles, and docs (only generic-English "take priority" in an unrelated script and historical changelog entries remain).
snakemake --profile nextstrain_profiles/nextstrain-ci -n drops cleanly from 37 → 33 jobs (exactly the proximity + priority steps), no errors.

Test plan

CI green.

Version bump

Dropping this affordance will require bumping repo version to v18.

🤖 Generated with Claude Code

This full removal has been revised to a demotion. The option still exists, just not promoted in tutorials, etc... the way it was previously. See below for more details.

huddlej · 2026-06-29T19:26:01Z

I can't think of a good reason to want proximity based subsampling in this day and age. So I don't think it answers a question and as implemented it's heavily taxing computationally.

Also this has been shifted out to augur proximity in nextstrain/augur#1962. I don't think we'll want it back for SARS-CoV-2, but if we do there's a better affordance now.

I remember WA DOH was keen on this functionality back in the day. If they still depend on it, that's a reason to keep it.

A kinder change for users would be to replace the original proximity subsampling with the new augur proximity implementation. This change would still delete most of the older logic, but it would (hopefully) keep most of the functionality. Since this repo was the origin of the idea behind that augur subcommand, the act of replacing the old logic with the new should confirm that the new interface does what we really want.

All of this said as someone who is not signing up to do this work, though! 😅 Maybe Claude is happy to do this?

trvrb · 2026-06-29T20:04:44Z

Thanks @huddlej. There was another issue that I should have mentioned: my suspicion is that

  priorities:
        type: "proximity"

as implemented here (or as implemented in augur proximity) will fail to run for ncov just because the database to compare to is too big. I don't think the command will actually run. Ie it's not going to be helpful for anyone to ship an unrunnable feature with augur proximity swap.

However, I can plan to demonstrate this rather than just assert it.

jameshadfield · 2026-06-29T21:59:14Z

as implemented in augur proximity) will fail to run for ncov just because the database to compare to is too big.

Yes, I don't think this'll work unless you have a ton of memory, but it might work if we did some clade based subsampling before the proximity calculations:

Due to the many millions of sequences available, this tool will not be able to search against all sequences (as they won’t all fit into memory). We suggest you downsample those using temporal filters, nextstrain clades or pango lineages to get a smaller contextual set before using this tool.

huddlej · 2026-06-29T22:25:16Z

That's fair! I wasn't thinking about the full GISAID/open database as the contextual set, but something smaller like the example in our tutorial. This functionality has been out there long enough, it's possible that people are using it this way. (I just pinged WA DOH folks to get a sense.)

trvrb · 2026-06-30T00:32:53Z

Via Claude

To test, I ran the pre-removal workflow against the full open pool on a laptop (Nextstrain Docker runtime, 7.7 GiB RAM, single core per job), Washington focal, division-style build. Pool = aligned.fasta.zst = 9,428,719 sequences. proximity_score streams the pool in 10k-sequence chunks, building a dense chunk × focal distance matrix per chunk (time ∝ pool × focal, memory ∝ chunk × focal). Numbers are from Snakemake benchmark: files.

What proximity adds. A dry-run to the same subsampling endpoint shows a metadata-only build runs 3 jobs (subsample ×2, combine_samples); the proximity build runs 7 — adding index_sequences, extract_subsampled_sequences, proximity_score, priority_score. The full-pool sequence index is not built by metadata-only subsampling.

Time added by proximity (focal = 296 recent WA, --min-date 2025-01-01):

added step	wall	peak RSS
`index_sequences` (full pool)	1 h 17 m 56 s	2.6 GB
`extract_subsampled_sequences`	3 m 01 s	4.2 GB
`proximity_score`	17 m 07 s	470 MB
`priority_score`	2 m 03 s	5.2 GB
total added	1 h 40 m 10 s	(peak 5.2 GB)

Default division scheme focal (all WA = 16,207 seqs): proximity_score is OOM-killed — non-zero exit, empty log (SIGKILL). Per-chunk dense arrays (10000 × 16207 × 8 B ≈ 1.3 GB each, ~6 at once ≈ 8 GB) exceed the 7.7 GiB ceiling. The same script completes at focal 296/309, so the failure is memory driven by focal size.

Calibration: proximity_score on the open 100k subsample = 309 focal × 90,857 pool = 11.97 s / 474 MB (scales ~linearly to the full-pool figures).

trvrb · 2026-06-30T00:41:33Z

^ So I was wrong that it wouldn't run. It's slow and will crash out if you're not careful about keeping focal size small, but it runs.

Thanks for flagging the tutorial John, however I disagree with it as written: https://docs.nextstrain.org/projects/ncov/en/latest/tutorial/genomic-surveillance.html#break-down-the-command

The only reason someone should be doing priorities is pseudo-contact tracing. You have an outbreak and you want to fish out all the sequences in the database that are most closely related to this outbreak. The set up in the tutorial is neither here nor there and does proximity sampling on a small subset of a few thousand North America sequences. The tutorial should change no matter what.

If we want to keep priorities as an option then it's a reimplementation with augur proximity and demoting proximity in tutorials, etc... The logic to keep is if someone really does want to do the pseudo-contact tracing of an outbreak (or a clade).

trvrb · 2026-06-30T01:54:00Z

Oh... I understand your comment better now James. I see that we'd need ~30Gb of memory per 1M SARS-CoV-2 genomes. So augur proximity would not work for the above use case. I'm leaning now towards not messing with things, but just at least not pushing proximity in tutorials etc... (again, today's use case is very different than the 2020/2021 use case).

jameshadfield · 2026-06-30T02:37:32Z

I vote to drop it entirely from ncov - if there's demand from users then we can explore adding a augur proximity tutorial using some sensible filtering to get the contextual set down to a manageable size. We added a lot of functionality to this repo as part of outbreak response and I think it's too constraining to have to support all of that functionality forevermore.

PR #1210 removed proximity subsampling entirely. Demonstrating it showed the original chunked implementation still runs over the full open pool and serves a real niche -- pseudo-contact-tracing, where a small outbreak/clade focal set is used to fish its closest relatives out of the database. Its only failure mode is OOM when the focal set is large. The proposed `augur proximity` replacement loads the entire context pool into RAM (a memory regression), so we keep the original logic instead. Restore the proximity machinery (scripts/get_distance_to_focal_set.py, scripts/priorities.py, the get_priorities/get_priority_argument helpers, the subsample rule's --priority wiring, the proximity_score/priority_score rules, and the priorities: crowding_penalty default) and re-document it in the config reference, but leave it out of the default region/country/division/location schemes, the tutorial, and CI. Fix the OOM by computing proximity_score's --chunk-size from the focal-set size at run time, so the dense chunk x n_focal distance matrices stay ~2.8 GB regardless of focal size (measured peak ~4 GB; mem_mb raised to 6000). priority_score gets a pool-scaled mem_mb. Verified on the open 100k pool: default schemes pull no proximity rules; an opt-in proximity build runs end to end; a small focal keeps chunk_size 10000 (480 MB) while a 10k-focal build auto-drops to chunk_size 5832 and peaks at 4.1 GB with no OOM. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Genetic-proximity prioritization (`priorities: type: proximity`, and the sibling `type: file`) is removed from the default region/country/division/location subsampling schemes, the genomic-surveillance tutorial, the config-file guide, and CI; these now subsample purely by group_by + seq_per_group/max_sequences. The orphan scripts/add_priorities_to_meta.py is dropped. The proximity machinery (scripts/get_distance_to_focal_set.py, scripts/priorities.py, the proximity_score/priority_score rules and their config wiring) remains available as a documented opt-in for the niche pseudo-contact-tracing use case: a small outbreak or clade focal set, used to fish its closest relatives out of the database. proximity_score now computes its --chunk-size from the focal-set size at run time so the dense chunk x n_focal distance matrices stay ~2.8 GB regardless of focal size (measured peak ~4 GB; mem_mb 6000), avoiding the OOM a large focal set previously caused. priority_score gets a pool-scaled mem_mb. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

trvrb · 2026-06-30T03:00:27Z

Via Claude

Based on the discussion above, I've revised the scope of this PR (now a single commit rebased on current master). It is no longer "remove proximity entirely" — it demotes proximity to an opt-in option and fixes the OOM.

Removed (proximity is no longer used by default):

The priorities: {type: proximity} blocks from the default region, country, division, and location schemes — these now subsample purely by group_by + seq_per_group/max_sequences.
Proximity from the genomic-surveillance tutorial, the config-file guide, and the nextstrain-ci smoke-test.
The orphan scripts/add_priorities_to_meta.py.

Kept (available as a documented opt-in):

scripts/get_distance_to_focal_set.py, scripts/priorities.py, the proximity_score/priority_score rules, the --priority subsampling wiring, and the priorities: {crowding_penalty} default.
The priorities config-reference docs (both type: proximity and type: file).
This serves the niche it's actually good for — pseudo-contact-tracing: a small outbreak/clade focal set, used to fish its closest relatives out of the database.

Memory fix (the OOM seen in the benchmark above): proximity_score now computes --chunk-size from the focal-set size at run time, so the dense chunk × n_focal distance matrices stay ~2.8 GB regardless of focal size (measured peak ~4 GB; mem_mb 6000). The large default-scheme focal (16,207 seqs → ~8 GB) was what got OOM-killed. priority_score gets a pool-scaled mem_mb.

Verified on the open 100k pool: default schemes pull no proximity rules; an opt-in proximity build runs end-to-end; a small focal keeps chunk_size 10000 (480 MB), while a 10k-focal build auto-drops to chunk_size 5832 and peaks at 4.1 GB with no OOM.

This was referenced Jun 27, 2026

Remove the historical 21L (BA.2-rooted) GISAID build variant #1211

Merged

Document the rebuild-100k development subsample #1212

Merged

trvrb force-pushed the cleanup-proximity branch from a810c3c to c7b1cb5 Compare June 30, 2026 02:53

trvrb changed the title ~~Remove proximity subsampling~~ Demote proximity subsampling to opt-in and fix OOM errors Jun 30, 2026

trvrb force-pushed the cleanup-proximity branch from c7b1cb5 to 4d8de01 Compare June 30, 2026 02:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Demote proximity subsampling to opt-in and fix OOM errors#1210

Demote proximity subsampling to opt-in and fix OOM errors#1210
trvrb wants to merge 1 commit into
masterfrom
cleanup-proximity

trvrb commented Jun 27, 2026 •

edited

Loading

Uh oh!

huddlej commented Jun 29, 2026

Uh oh!

trvrb commented Jun 29, 2026 •

edited

Loading

Uh oh!

jameshadfield commented Jun 29, 2026

Uh oh!

huddlej commented Jun 29, 2026

Uh oh!

trvrb commented Jun 30, 2026 •

edited

Loading

Uh oh!

trvrb commented Jun 30, 2026

Uh oh!

trvrb commented Jun 30, 2026

Uh oh!

jameshadfield commented Jun 30, 2026

Uh oh!

trvrb commented Jun 30, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Uh oh!

Conversation

trvrb commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

What's removed

Tutorial

Docs updated

Verification

Test plan

Version bump

Uh oh!

huddlej commented Jun 29, 2026

Uh oh!

trvrb commented Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jameshadfield commented Jun 29, 2026

Uh oh!

huddlej commented Jun 29, 2026

Uh oh!

trvrb commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

trvrb commented Jun 30, 2026

Uh oh!

trvrb commented Jun 30, 2026

Uh oh!

jameshadfield commented Jun 30, 2026

Uh oh!

trvrb commented Jun 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

trvrb commented Jun 27, 2026 •

edited

Loading

trvrb commented Jun 29, 2026 •

edited

Loading

trvrb commented Jun 30, 2026 •

edited

Loading

trvrb commented Jun 30, 2026 •

edited

Loading