Replace the implicit BLAST completion signal in CLASSIFY_WITH_BLASTN with explicit per-sample package channels. BLASTN-needed and BLASTN-skipped samples should both continue downstream as terminal sample packages, so later merging does not have to infer completeness from missing channel emissions. Keep the external merged_results shape compatible for reporting and LIMS integration, while removing the unbounded mix/filter/groupTuple barrier that made per-sample BLAST results wait for the whole BLAST channel to close.
Add a maintainer-only fixture generation script for the small SRA-backed integration test dataset. The script fetches pinned viral RefSeq FASTA records, builds the tiny Deacon and BLAST nucleotide indexes consumed by the pipeline, writes the SRA samplesheet, and records checksums plus source metadata in a manifest. Commit the first generated fixture set so the pipeline integration test can consume small checked-in references instead of teaching NVD to construct custom BLAST or Deacon indexes at runtime.
Add an explicit slow/network pytest that verifies the checked-in mini viral fixtures can drive a full NVD run from SRA accession download through viral enrichment, assembly, BLAST classification, and final reporting. The test reads the fixture manifest, validates committed fixture checksums before launching Nextflow, runs the pipeline with isolated temporary results/work directories, and asserts that final BLAST outputs contain the expected viral signal. Add a dedicated GitHub Actions workflow for the integration test so changes to workflow code or fixture data can prove the tiny end-to-end path still completes.
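The checksum validation step described above could look roughly like the following. This is a minimal sketch, not the test's actual code: the JSON manifest layout (a `files` list whose entries carry `path` and `sha256` keys) and the helper names `sha256_of` and `find_checksum_mismatches` are assumptions made for illustration.

```python
import hashlib
import json
from pathlib import Path
from typing import List


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large references never sit fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_checksum_mismatches(manifest_path: Path) -> List[str]:
    """Return manifest-relative paths that are missing or whose hash differs.

    Assumed manifest layout (hypothetical):
    {"files": [{"path": "...", "sha256": "..."}]}
    """
    manifest = json.loads(manifest_path.read_text())
    root = manifest_path.parent
    mismatches = []
    for entry in manifest["files"]:
        target = root / entry["path"]
        if not target.is_file() or sha256_of(target) != entry["sha256"]:
            mismatches.append(entry["path"])
    return mismatches
```

A test can call `find_checksum_mismatches` first and fail fast on a non-empty result, so a drifted fixture is reported before any time is spent launching Nextflow.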
Add justfile recipes so that development tasks are run with the same commands every time. This applies particularly to complicated pixi- or uv-driven pytest commands, which agents tend to run with inconsistent flags when executing subsets of tests to verify feature work or refactors.
Synchronize the package version exposed from py_nvd.__init__ with the release-candidate metadata and update GitHub Actions Pixi setup so CI can read the current lock-file format. This commit is scoped to CI/release metadata fixes caught by the v3.0.1-rc pull request checks.
Teach `nvd samplesheet generate` an opt-in `--sanitize` mode that recognizes Illumina/CASAVA FASTQ naming suffixes and strips them before writing sample IDs. The default generation path should preserve current behavior so existing users do not see sample ID changes unless they ask for sanitization. Add focused tests for paired Illumina filenames to lock down both the existing unsanitized behavior and the new sanitized behavior.
Change taxonomy preparation so an existing prepared taxonomy database is reused even when the source dump files are older than the freshness warning window. Missing required taxdump files still trigger the existing download/build path, and missing taxonomy.sqlite still triggers a rebuild from present dump files. This avoids mutating shared HPC taxonomy directories merely because nodes.dmp is old, which prevents age-based refresh attempts from surfacing user/group permission mismatches during normal pipeline runs. The existing CLI, Nextflow params, and published v3.0 schema contract are left unchanged. Keep offline and sync as the existing controls for no-download and require-prepared behavior, while warning when stale-but-complete taxonomy is reused.
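The decision rule in this commit can be sketched as a small pure function. The sketch is illustrative only: `nodes.dmp` and `taxonomy.sqlite` come from the commit message, while the rest of the required-file list, the 30-day window, and the names `Action` and `plan_taxonomy_prep` are assumptions. The offline and sync controls are deliberately left out.

```python
import time
from enum import Enum
from pathlib import Path
from typing import Optional, Tuple

REQUIRED_DUMPS = ("nodes.dmp", "names.dmp")  # assumed list of required taxdump files
PREPARED_DB = "taxonomy.sqlite"
FRESHNESS_WINDOW_DAYS = 30  # hypothetical warning window


class Action(Enum):
    DOWNLOAD_AND_BUILD = "download_and_build"
    REBUILD_DB = "rebuild_db"
    REUSE = "reuse"


def plan_taxonomy_prep(
    taxonomy_dir: Path, now: Optional[float] = None
) -> Tuple[Action, bool]:
    """Return (action, stale_warning) without touching the directory."""
    now = time.time() if now is None else now
    # Missing dump files still trigger the download/build path.
    if any(not (taxonomy_dir / name).is_file() for name in REQUIRED_DUMPS):
        return Action.DOWNLOAD_AND_BUILD, False
    # Age only produces a warning; it never forces a refresh.
    age_days = (now - (taxonomy_dir / "nodes.dmp").stat().st_mtime) / 86400
    stale = age_days > FRESHNESS_WINDOW_DAYS
    # A missing prepared database is rebuilt from the dumps already present.
    if not (taxonomy_dir / PREPARED_DB).is_file():
        return Action.REBUILD_DB, stale
    return Action.REUSE, stale
```

Because staleness is returned as a flag rather than acted on, a shared read-only taxonomy directory is never written to just because `nodes.dmp` is old.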
Broaden viral BLAST lineage detection so annotated rows that identify Viruses outside the older superkingdom spelling are retained. Also make LCA annotation treat zero-byte and header-only merged BLAST tables as valid no-hit sentinels by writing a header-only output with the full LCA schema instead of crashing in Polars.
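As a rough illustration of the no-hit sentinel handling, the sketch below uses only the standard library rather than Polars. The column names in `LCA_COLUMNS` are hypothetical placeholders, not the pipeline's real LCA schema, and both function names are invented for this example.

```python
import csv
from pathlib import Path
from typing import Sequence

# Hypothetical stand-in for the full LCA schema.
LCA_COLUMNS = ("sample_id", "qseqid", "lca_taxid", "lca_rank")


def is_no_hit_table(path: Path) -> bool:
    """True for a zero-byte file or a file holding at most a header line."""
    if path.stat().st_size == 0:
        return True
    with path.open(newline="") as handle:
        non_empty_lines = sum(1 for line in handle if line.strip())
    return non_empty_lines <= 1


def write_header_only(path: Path, columns: Sequence[str] = LCA_COLUMNS) -> None:
    """Write a tab-separated file containing only the header row."""
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
        writer.writerow(columns)
```

Checking for the sentinel before parsing means the dataframe reader is never handed an input it cannot infer a schema from.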
Set the pipeline-wide default Nextflow error strategy to finish so processes without a more specific error strategy no longer immediately terminate submitted or running work when one task fails. This keeps the existing per-process retry/ignore policies and process-name selectors intact because more specific process directives and selectors continue to override the global process default. The change only affects processes that previously fell through to Nextflow's default terminate behavior.
Time for a new release candidate, this one for a patch release v3.0.1!
End-to-end testing, at last
This PR carries a stack of commits that, most excitingly, includes a full end-to-end integration testing setup. The setup covers real data downloaded from SRA, local synthetic data, a tiny custom BLAST database, and a Deacon enrichment reference, along with an expanded suite of test fixtures and more validation. Everything is small enough to be checked into version control and synced with the remote origin, so the fixtures ship with the source code they test. Although it takes about an hour on GitHub Actions runners, this end-to-end setup is also run as a check in CI.
The only part of this pipeline the test setup does not exercise is the LIMS integration subworkflow. This will require its own handling and is a gap that we plan to fill in future versions.
Project development shorthands
We also introduce a project `justfile` with shorthands for tricky development commands. Developers and agents should use these shorthands so that development feedback loops are consistent between PRs. To run the integration test, simply run `just e2e`.
SRA download support (without the disk expense)
While previous NVD versions claimed nominal support for SRA download, this PR actually provides that functionality. Because this was promised previously but never exercised in tests and thus found to be incomplete, this is technically a bug fix and not a feature addition.
Rather than reaching for SRA-tools, this PR instead taps `sracha`. SRA-tools infamously requires 5-10x the disk space of the actual downloaded FASTQ, and in the year-of-our-lord 2026, it still does not support compressed FASTQ writing. `sracha` addresses both of these needs for our work's enormous FASTQs; uncompressed FASTQs on disk are never acceptable. It is also faster and more actively maintained.
Samplesheet generator improvements
Most Nextflow pipelines require a user-generated samplesheet CSV of input file paths with some metadata, and NVD is no exception. To help users generate this artifact from a directory of FASTQ files, NVD's command line interface includes `nvd samplesheet generate`. This command was previously unaware of Illumina/CASAVA file naming conventions, which meant it would leave suffixes like "S23_L002" behind in sample IDs. This, expectedly in retrospect, breaks ETLs in our LIMS! To address this, v3.0.1 adds an opt-in `--sanitize` flag to the samplesheet generator that recognizes the CASAVA structure and strips it when detected. This feature comes with tests to assert that the stripping works without a high rate of false positives.
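To make the stripping concrete, here is a minimal sketch of CASAVA-aware sanitization. It is not NVD's implementation: the regular expressions and the function name `strip_casava_suffix` are assumptions based on the common Illumina pattern `<sample>_S<n>_L<lane>_R<read>_001.fastq.gz`.

```python
import re

# Assumed CASAVA tail: _S<number>, optional _L<3-digit lane>,
# _R1/_R2 (or _I1/_I2 for index reads), optional _<3-digit chunk>.
CASAVA_SUFFIX = re.compile(r"_S\d+(?:_L\d{3})?_[RI][12](?:_\d{3})?$")
FASTQ_EXTENSION = re.compile(r"\.(?:fastq|fq)(?:\.gz)?$")


def strip_casava_suffix(filename: str) -> str:
    """Drop the FASTQ extension and, if present, the CASAVA suffix."""
    stem = FASTQ_EXTENSION.sub("", filename)
    # Anchoring at the end of the stem keeps names that merely
    # contain "_S" somewhere in the middle untouched.
    return CASAVA_SUFFIX.sub("", stem)


for name in (
    "sampleA_S23_L002_R1_001.fastq.gz",
    "sampleA_S23_L002_R2_001.fastq.gz",
    "my_Sample_1.fastq.gz",
):
    print(name, "->", strip_casava_suffix(name))
```

Both mates of the paired example collapse to the same ID, `sampleA`, while `my_Sample_1` is left alone because it does not match the full suffix pattern.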