v3.0.1 release candidate #28

Open
nrminor wants to merge 13 commits into main from v3.0.1-rc

nrminor commented Jun 20, 2026

Time for a new release candidate, this one for a patch release v3.0.1!

End-to-end testing, at last

This PR carries a stack of commits that, most excitingly, includes a full end-to-end integration testing setup. The setup covers real data downloaded from SRA, local synthetic data, a tiny custom BLAST database and deacon enrichment reference, an expanded suite of test fixtures, and more validation. Everything is small enough to be checked into version control and synced with the remote origin, so the fixtures travel with the source code they test. Though it takes roughly an hour on GitHub Actions runners, this end-to-end setup also runs as a check in CI.

The only part of this pipeline the test setup does not exercise is the LIMS integration subworkflow. This will require its own handling and is a gap that we plan to fill in future versions.

Project development shorthands

We also introduce a project justfile with shorthands for tricky development commands. Developers and agents should use these shorthands so that development feedback loops are consistent between PRs. To run the integration test, simply run `just e2e`.

SRA download support (without the disk expense)

While previous NVD versions claimed nominal support for SRA download, this PR actually provides that functionality. Because the feature was promised previously but never exercised in tests, and turned out to be incomplete once it was, this is technically a bug fix and not a feature addition.

Rather than reaching for SRA-tools, this PR instead taps sracha. SRA-tools infamously requires 5-10x the disk space of the FASTQ it actually downloads, and in the year-of-our-lord 2026, it still does not support compressed FASTQ writing. sracha addresses both problems, which matters for our work's enormous FASTQs: uncompressed FASTQs on disk are never acceptable. It is also faster and more actively maintained.

Samplesheet generator improvements

Most Nextflow pipelines require a user-generated samplesheet CSV of input file paths with some metadata, and NVD is no exception. To help users generate this artifact from a directory of FASTQ files, NVD's command line interface includes `nvd samplesheet generate`. This command was previously unaware of Illumina/CASAVA file naming conventions, which meant it would leave suffixes like "S23_L002" behind in sample IDs. This, predictably in retrospect, breaks ETLs in our LIMS! To address this, v3.0.1 adds an opt-in `--sanitize` flag to the samplesheet generator that recognizes the CASAVA structure and strips it when detected. This feature comes with tests to assert that the stripping works without a high rate of false positives.
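As a rough illustration of the kind of stripping `--sanitize` performs, here is a minimal Python sketch. It is not the NVD implementation: the regular expressions and the `sanitize_sample_id` helper are hypothetical, and they assume the common `_S<n>_L<lane>_R<read>_001` tail that Illumina tools append to FASTQ names.

```python
import re

# Hypothetical patterns for illustration; NVD's real --sanitize logic may differ.
# Assumes the usual Illumina/CASAVA tail: _S<sample#>_L<3-digit lane>_<R|I><1|2>_001
CASAVA_TAIL = re.compile(r"_S\d+_L\d{3}_[RI][12]_001$")
FASTQ_EXT = re.compile(r"\.(fastq|fq)(\.gz)?$")


def sanitize_sample_id(filename: str) -> str:
    """Drop the FASTQ extension, then drop a CASAVA tail only if one is present."""
    stem = FASTQ_EXT.sub("", filename)
    # Anchoring on the end of the stem keeps IDs that merely contain "_S1" intact.
    return CASAVA_TAIL.sub("", stem)


if __name__ == "__main__":
    for name in ("sampleA_S23_L002_R1_001.fastq.gz", "my_S1_sample.fastq.gz"):
        print(name, "->", sanitize_sample_id(name))
```

Anchoring the pattern to the end of the name is one way to keep the false-positive rate low: an ID is only altered when the complete CASAVA tail is present.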

nrminor added 8 commits June 18, 2026 22:08
Replace the implicit BLAST completion signal in CLASSIFY_WITH_BLASTN with explicit per-sample package channels. BLASTN-needed and BLASTN-skipped samples should both continue downstream as terminal sample packages, so later merging does not have to infer completeness from missing channel emissions.

Keep the external merged_results shape compatible for reporting and LIMS integration, while removing the unbounded mix/filter/groupTuple barrier that made per-sample BLAST results wait for the whole BLAST channel to close.
Add a maintainer-only fixture generation script for the small SRA-backed integration test dataset. The script fetches pinned viral RefSeq FASTA records, builds the tiny Deacon and BLAST nucleotide indexes consumed by the pipeline, writes the SRA samplesheet, and records checksums plus source metadata in a manifest.

Commit the first generated fixture set so the pipeline integration test can consume small checked-in references instead of teaching NVD to construct custom BLAST or Deacon indexes at runtime.
Add an explicit slow/network pytest that verifies the checked-in mini viral fixtures can drive a full NVD run from SRA accession download through viral enrichment, assembly, BLAST classification, and final reporting.

The test reads the fixture manifest, validates committed fixture checksums before launching Nextflow, runs the pipeline with isolated temporary results/work directories, and asserts that final BLAST outputs contain the expected viral signal.
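The checksum validation step could look roughly like the sketch below. The manifest layout (a JSON object with a `sha256` mapping of relative path to digest) and the `mismatched_fixtures` helper are assumptions made for illustration, not the format the fixture script actually writes.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large fixtures are not loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def mismatched_fixtures(manifest_path: Path) -> list[str]:
    """Return fixture paths that are missing or whose checksum differs.

    Assumes a hypothetical manifest shape: {"sha256": {"relative/path": "<hex>"}}.
    """
    manifest = json.loads(manifest_path.read_text())
    root = manifest_path.parent
    bad = []
    for rel_path, expected in manifest["sha256"].items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            bad.append(rel_path)
    return bad
```

A test would assert that this list is empty before launching Nextflow, so a corrupted or stale fixture fails in seconds rather than after a long pipeline run.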

Add a dedicated GitHub Actions workflow for the integration test so changes to workflow code or fixture data can prove the tiny end-to-end path still completes.
Using justfile recipes means the same commands are run the same way for development tasks. This applies particularly to complicated pixi- or uv-driven pytest commands, which agents tend to run with inconsistent flags when testing subsets of the suite while verifying feature work or refactors.
Synchronize the package version exposed from py_nvd.__init__ with the release-candidate metadata and update GitHub Actions Pixi setup so CI can read the current lock-file format.

This commit is scoped to CI/release metadata fixes caught by the v3.0.1-rc pull request checks.
nrminor force-pushed the v3.0.1-rc branch 2 times, most recently from 1b578de to ce33eb4 on June 21, 2026 15:24
nrminor added 2 commits June 21, 2026 10:57
Teach `nvd samplesheet generate` an opt-in `--sanitize` mode that recognizes Illumina/CASAVA FASTQ naming suffixes and strips them before writing sample IDs. The default generation path should preserve current behavior so existing users do not see sample ID changes unless they ask for sanitization.

Add focused tests for paired Illumina filenames to lock down both the existing unsanitized behavior and the new sanitized behavior.
nrminor added 3 commits June 23, 2026 14:17
Change taxonomy preparation so an existing prepared taxonomy database is reused even when the source dump files are older than the freshness warning window. Missing required taxdump files still trigger the existing download/build path, and missing taxonomy.sqlite still triggers a rebuild from present dump files.

This avoids mutating shared HPC taxonomy directories merely because nodes.dmp is old, which prevents age-based refresh attempts from surfacing user/group permission mismatches during normal pipeline runs. The existing CLI, Nextflow params, and published v3.0 schema contract are left unchanged.

Keep offline and sync as the existing controls for no-download and require-prepared behavior, while warning when stale-but-complete taxonomy is reused.
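The decision order described above can be sketched as a small Python function. This is an illustration only: the `TaxonomyAction` names, the `plan_taxonomy` helper, the 30-day threshold, and the choice of `names.dmp` as a second required dump are all assumptions, and the `offline` and `sync` controls are not modeled.

```python
import time
from enum import Enum
from pathlib import Path
from typing import Optional


class TaxonomyAction(Enum):
    REUSE = "reuse"
    REUSE_WITH_WARNING = "reuse_with_warning"
    REBUILD_SQLITE = "rebuild_sqlite"
    DOWNLOAD_AND_BUILD = "download_and_build"


# Assumed values for illustration; the real required-file list and window may differ.
REQUIRED_DUMPS = ("nodes.dmp", "names.dmp")
STALE_AFTER_SECONDS = 30 * 86400


def plan_taxonomy(taxdir: Path, now: Optional[float] = None) -> TaxonomyAction:
    """Pick an action without ever refreshing a complete database just for its age."""
    now = time.time() if now is None else now
    dumps = [taxdir / name for name in REQUIRED_DUMPS]
    if not all(p.is_file() for p in dumps):
        return TaxonomyAction.DOWNLOAD_AND_BUILD  # missing dump files
    if not (taxdir / "taxonomy.sqlite").is_file():
        return TaxonomyAction.REBUILD_SQLITE  # dumps present, database absent
    oldest = min(p.stat().st_mtime for p in dumps)
    if now - oldest > STALE_AFTER_SECONDS:
        return TaxonomyAction.REUSE_WITH_WARNING  # stale but complete: warn, do not mutate
    return TaxonomyAction.REUSE
```

The key property is that staleness only ever downgrades to a warning, so a read-only shared directory is never written to during a normal run.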
Broaden viral BLAST lineage detection so annotated rows that identify Viruses outside the older superkingdom spelling are retained. Also make LCA annotation treat zero-byte and header-only merged BLAST tables as valid no-hit sentinels by writing a header-only output with the full LCA schema instead of crashing in Polars.
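The no-hit sentinel handling can be illustrated with a stdlib-only Python sketch. The column names in `LCA_COLUMNS` and both helper functions are hypothetical; the real step uses Polars and its own LCA schema.

```python
import csv
from pathlib import Path

# Hypothetical schema for illustration; the actual LCA output columns may differ.
LCA_COLUMNS = ["sample_id", "qseqid", "lca_taxid", "lca_rank", "lca_name"]


def has_hits(blast_tsv: Path) -> bool:
    """False for zero-byte and header-only tables, True once any data row exists."""
    if blast_tsv.stat().st_size == 0:
        return False
    with blast_tsv.open(newline="") as handle:
        next(handle, None)  # skip the header line
        return any(line.strip() for line in handle)


def write_header_only(out_tsv: Path) -> None:
    """Emit a valid, empty LCA table so downstream steps see the full schema."""
    with out_tsv.open("w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerow(LCA_COLUMNS)
```

Checking for hits before handing the file to a dataframe library is one way to avoid parse errors on empty input while still producing an output file with the expected columns.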
Set the pipeline-wide default Nextflow error strategy to finish so processes without a more specific error strategy no longer immediately terminate submitted or running work when one task fails.

This keeps the existing per-process retry/ignore policies and process-name selectors intact because more specific process directives and selectors continue to override the global process default. The change only affects processes that previously fell through to Nextflow's default terminate behavior.