Replace the implicit BLAST completion signal in CLASSIFY_WITH_BLASTN with explicit per-sample package channels. BLASTN-needed and BLASTN-skipped samples should both continue downstream as terminal sample packages, so later merging does not have to infer completeness from missing channel emissions. Keep the external merged_results shape compatible for reporting and LIMS integration, while removing the unbounded mix/filter/groupTuple barrier that made per-sample BLAST results wait for the whole BLAST channel to close.
Add a maintainer-only fixture generation script for the small SRA-backed integration test dataset. The script fetches pinned viral RefSeq FASTA records, builds the tiny Deacon and BLAST nucleotide indexes consumed by the pipeline, writes the SRA samplesheet, and records checksums plus source metadata in a manifest. Commit the first generated fixture set so the pipeline integration test can consume small checked-in references instead of teaching NVD to construct custom BLAST or Deacon indexes at runtime.
Add an explicit slow/network pytest that verifies the checked-in mini viral fixtures can drive a full NVD run from SRA accession download through viral enrichment, assembly, BLAST classification, and final reporting. The test reads the fixture manifest, validates committed fixture checksums before launching Nextflow, runs the pipeline with isolated temporary results/work directories, and asserts that final BLAST outputs contain the expected viral signal. Add a dedicated GitHub Actions workflow for the integration test so changes to workflow code or fixture data can prove the tiny end-to-end path still completes.
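The checksum gate described above can be sketched roughly as follows. This is a minimal illustration, not the test's actual code: the manifest layout (a `files` mapping of relative path to SHA-256 hex digest) and the function names `sha256_of` and `find_checksum_mismatches` are assumptions made for the example.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large fixtures never need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_checksum_mismatches(manifest_path: Path) -> list[str]:
    """Return relative paths that are missing or whose hash differs."""
    # Hypothetical manifest layout: {"files": {"relative/path": "<sha256 hex>"}}
    manifest = json.loads(manifest_path.read_text())
    root = manifest_path.parent
    return [
        rel
        for rel, expected in manifest["files"].items()
        if not (root / rel).is_file() or sha256_of(root / rel) != expected
    ]
```

A test following this shape would fail fast on a non-empty result before spending time on the network-bound Nextflow run.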
Using justfile recipes means the same commands are run consistently across development tasks. This applies particularly to complicated pixi- or uv-driven pytest commands, which agents tend to run with inconsistent flags when executing subsets of tests to verify feature work or refactors.
Synchronize the package version exposed from py_nvd.__init__ with the release-candidate metadata and update GitHub Actions Pixi setup so CI can read the current lock-file format. This commit is scoped to CI/release metadata fixes caught by the v3.0.1-rc pull request checks.
Teach `nvd samplesheet generate` an opt-in `--sanitize` mode that recognizes Illumina/CASAVA FASTQ naming suffixes and strips them before writing sample IDs. The default generation path should preserve current behavior so existing users do not see sample ID changes unless they ask for sanitization. Add focused tests for paired Illumina filenames to lock down both the existing unsanitized behavior and the new sanitized behavior.
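As a rough illustration of the opt-in behavior, the sketch below strips a CASAVA-style suffix only when asked. The regular expressions, the function name `sample_id_from_filename`, and the choice to drop the FASTQ extension in the unsanitized path are assumptions for the example; the real default path may derive IDs differently.

```python
import re

# CASAVA-style tail such as "_S1_L001_R1_001" (the lane part is optional here).
CASAVA_SUFFIX = re.compile(r"_S\d+(?:_L\d{3})?_[RI][12]_\d{3}$")
FASTQ_EXT = re.compile(r"\.(?:fastq|fq)(?:\.gz)?$")


def sample_id_from_filename(filename: str, sanitize: bool = False) -> str:
    """Derive a sample ID, stripping the CASAVA suffix only on request."""
    stem = FASTQ_EXT.sub("", filename)
    if sanitize:
        # Anchored at the end, so read-like tokens inside the ID survive.
        stem = CASAVA_SUFFIX.sub("", stem)
    return stem
```

Because the pattern is anchored to the end of the stem, a name like `R1_pool_S3_L002_R2_001.fastq.gz` keeps its leading `R1_pool` when sanitized.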
4a5a4f9 to 36561df
Change taxonomy preparation so an existing prepared taxonomy database is reused even when the source dump files are older than the freshness warning window. Missing required taxdump files still trigger the existing download/build path, and missing taxonomy.sqlite still triggers a rebuild from present dump files. This avoids mutating shared HPC taxonomy directories merely because nodes.dmp is old, which prevents age-based refresh attempts from surfacing user/group permission mismatches during normal pipeline runs. The existing CLI, Nextflow params, and published v3.0 schema contract are left unchanged. Keep offline and sync as the existing controls for no-download and require-prepared behavior, while warning when stale-but-complete taxonomy is reused.
Broaden viral BLAST lineage detection so annotated rows that identify Viruses outside the older superkingdom spelling are retained. Also make LCA annotation treat zero-byte and header-only merged BLAST tables as valid no-hit sentinels by writing a header-only output with the full LCA schema instead of crashing in Polars.
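A minimal sketch of the no-hit sentinel handling, using only the standard library rather than Polars. The column names in `LCA_COLUMNS` and both function names are placeholders; the real LCA schema is not shown in this commit message.

```python
import csv
from pathlib import Path

# Placeholder schema; the real LCA output has its own column set.
LCA_COLUMNS = ["sample", "qseqid", "lca_taxid", "lca_rank"]


def is_no_hit_sentinel(path: Path) -> bool:
    """True for a zero-byte file or one with a header row and no data rows."""
    if path.stat().st_size == 0:
        return True
    with path.open(newline="") as handle:
        rows = csv.reader(handle, delimiter="\t")
        next(rows, None)  # skip the header row
        # Blank trailing lines parse as empty lists and do not count as data.
        return not any(rows)


def write_header_only(path: Path) -> None:
    """Write a valid empty table that still carries the full schema."""
    with path.open("w", newline="") as handle:
        csv.writer(handle, delimiter="\t").writerow(LCA_COLUMNS)
```

Checking for the sentinel before handing the table to a dataframe library avoids schema-inference errors on empty input.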
Set the pipeline-wide default Nextflow error strategy to finish so processes without a more specific error strategy no longer immediately terminate submitted or running work when one task fails. This keeps the existing per-process retry/ignore policies and process-name selectors intact because more specific process directives and selectors continue to override the global process default. The change only affects processes that previously fell through to Nextflow's default terminate behavior.
Add a standalone read-input resolver for NVD samplesheets. The resolver turns exact FASTQ paths, paired or single FASTQ glob declarations, and SRA accessions into a canonical JSONL manifest that downstream Nextflow wiring can consume later. The resolver keeps the tricky ingress rules localized: exact files take precedence over globs, globs take precedence over SRA accessions, lower-precedence sources produce loud warnings, exact path columns reject glob metacharacters, and all local paths are validated before the pipeline fans out. Add focused tests for platform normalization, duplicate sample rejection, symlink-preserving absolute paths, single-end glob sorting, CASAVA-style paired lane validation, compression consistency, JSONL output shape, and stderr warning behavior. This commit deliberately stops before Deacon streaming or GATHER_READS integration so the canonical declaration contract can be reviewed on its own.
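The precedence rules can be illustrated with a small sketch. The `Row` dataclass, the `sra_accession` field name, and the function `choose_source` are hypothetical; only the ordering (exact files, then globs, then SRA), the loud warnings, and the glob-metacharacter rejection come from the description above.

```python
import sys
from dataclasses import dataclass

GLOB_CHARS = set("*?[")


@dataclass(frozen=True)
class Row:
    sample_id: str
    fastq1: str = ""
    fastq1_glob: str = ""
    sra_accession: str = ""  # hypothetical column name


def choose_source(row: Row) -> tuple[str, str]:
    """Pick the highest-precedence declared source and warn about the rest."""
    if row.fastq1 and GLOB_CHARS & set(row.fastq1):
        raise ValueError(f"{row.sample_id}: exact path contains glob characters")
    candidates = [
        ("exact", row.fastq1),
        ("glob", row.fastq1_glob),
        ("sra", row.sra_accession),
    ]
    declared = [(kind, value) for kind, value in candidates if value]
    if not declared:
        raise ValueError(f"{row.sample_id}: no read source declared")
    chosen = declared[0]
    for ignored_kind, _ in declared[1:]:
        print(
            f"WARNING: {row.sample_id}: ignoring {ignored_kind} source "
            f"because a {chosen[0]} source is declared",
            file=sys.stderr,
        )
    return chosen
```

Keeping this decision in one function is what lets the warnings and rejections be tested without touching Nextflow.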
Add stream_fastqs_to_deacon.py as the isolated mechanism for feeding resolved FASTQ bundles into Deacon without materializing full concatenated FASTQs. The helper creates named pipes, streams one or more ordered FASTQ inputs into those pipes, runs Deacon against the pipe paths, and reports failures from both Deacon and the stream producers. Keep the lifecycle in Python rather than embedding a long Bash script in Nextflow. The helper validates executable availability, input presence, paired-list length, compression consistency, and explicitly rejects zstd until a supervised zstd producer is added. Add tests with both fake consumers and real Deacon. The real-Deacon tests build a tiny index and compare FIFO-streamed output byte-for-byte against a tiny direct-control run, while the fake-consumer tests pin failure modes that are difficult to trigger reliably with Deacon itself: producer errors, truncated gzip streams, Deacon exiting before all FIFOs open, Deacon failing after reading from a FIFO, FIFO cleanup, and the guarantee that the consumer sees FIFO inputs rather than regular concatenated files. This follows the resolved read declaration commit and prepares for a later GATHER_READS/PREPROCESS_READS integration commit.
Tighten the read-input resolver contract before wiring it into Nextflow. FASTQ paths and glob patterns must now be absolute so resolution does not depend on the current working directory of a Nextflow task. This keeps the samplesheet interface explicit for cluster and automation use cases, where stable absolute symlink paths are preferable to context-sensitive relative paths. The resolver still preserves symlink paths rather than rewriting them to storage targets. Also remove zstd FASTQ suffixes from the resolver's supported input set until the Deacon streaming helper grows supervised zstd producer support. This keeps the resolver and streaming helper contracts aligned.
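A small sketch of the absolute-path rule with symlink preservation. The function name `check_declared_path` is an assumption; the key point is that the path is validated but never passed through `os.path.realpath`, which would rewrite it to the storage target.

```python
import os


def check_declared_path(raw: str) -> str:
    """Reject relative paths and return the declared path unchanged."""
    if not os.path.isabs(raw):
        raise ValueError(f"FASTQ path must be absolute: {raw!r}")
    # Deliberately no os.path.realpath() here: resolving symlinks would
    # replace the user-facing path with its storage target.
    return raw
```

With a symlink at `/project/reads/a.fastq.gz`, this returns that same string, whereas `os.path.realpath` would return wherever the link points.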
Wire the resolved read-input manifest into the Nextflow pipeline. This commit should make GATHER_READS resolve the samplesheet once, route SRA and local read declarations into a single explicit bundle channel shape, and let Deacon filtering choose direct input for single-file bundles or stream_fastqs_to_deacon.py for multi-file bundles. Keep the integration focused on production wiring rather than adding the full lane-glob pipeline integration test. That test coverage will follow in a separate commit once the channel contract and process wiring are in place. The intended internal contract is tuple(meta, r1_files, r2_files), where meta carries sample id, platform, read mode, and source while FASTQ paths remain top-level path collections for Nextflow staging and hashing.
Replace the paired-gzip temporary bundle workaround with a single FIFO-based transport path. Gzip and zstd inputs now keep extension-bearing FIFO names and stream raw compressed bytes so Deacon owns extension-based decoding; xz inputs continue to decode to plain FASTQ FIFOs. Fix the FIFO writer root cause by using O_NONBLOCK only to avoid hanging while connecting, then switching the file descriptor back to blocking mode before writes. This preserves no-reader failure behavior without corrupting paired FASTQ streams under backpressure. Update helper tests to use in-process Deacon runners instead of generating Python scripts from raw strings. Add assertions for gzip/zstd FIFO naming, paired read ordinality, swapped mate detection, and the larger real-Deacon paired bundle regression. Broaden the resolver suffix allowlist for .fastq.zst/.fq.zst and adjust the integration test so local FASTQ rows assert input resolution and lane ordering rather than unsupported final BLAST rows.
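The connect-then-block pattern described above can be sketched as follows for POSIX systems. This is an illustration of the general technique, not the helper's actual code; the function name `open_fifo_writer`, the polling interval, and the timeout default are assumptions.

```python
import errno
import os
import time


def open_fifo_writer(path: str, timeout: float = 5.0) -> int:
    """Open a FIFO for writing without hanging forever if no reader appears."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # With O_NONBLOCK, open() fails with ENXIO instead of blocking
            # while no process has the FIFO open for reading.
            fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
            break
        except OSError as exc:
            if exc.errno != errno.ENXIO:
                raise
            if time.monotonic() >= deadline:
                raise TimeoutError(f"no reader opened {path}") from exc
            time.sleep(0.01)
    # Restore blocking mode before any write, so a full pipe makes the
    # writer wait for the reader instead of failing with EAGAIN.
    os.set_blocking(fd, True)
    return fd
```

Leaving the descriptor non-blocking would let writes fail or be cut short whenever the consumer falls behind, which matches the kind of stream corruption under backpressure that the commit describes.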
Use the declared sequencing platform, not the read source, to choose the minimap2 preset for mapping reads back to contigs. SRA accessions can yield short-read FASTQs, including physically single FASTQs after filtering/repair, but that does not imply ONT data. Keep this change intentionally narrow: Illumina uses the short-read preset, ONT uses map-ont, and interleaved layout handling remains unchanged.
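The selection rule amounts to a lookup keyed on the declared platform. In this sketch the platform labels and the function name `minimap2_preset` are assumptions; `sr` and `map-ont` are minimap2's short-read and Nanopore presets.

```python
def minimap2_preset(platform: str) -> str:
    """Choose the minimap2 -x preset from the declared platform only."""
    presets = {"illumina": "sr", "ont": "map-ont"}
    key = platform.strip().lower()
    if key not in presets:
        raise ValueError(f"unsupported platform: {platform!r}")
    return presets[key]
```

Note that nothing about the read source or the number of FASTQ files enters the decision, so an SRA-derived single FASTQ declared as Illumina still maps with the short-read preset.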
Add a --group-lanes mode to nvd samplesheet generate for Illumina/CASAVA FASTQ directories. The mode emits one samplesheet row per CASAVA sample prefix and writes lane patterns into fastq1_glob and fastq2_glob, leaving the exact-path fastq1 and fastq2 columns empty. This keeps grouping distinct from physical concatenation and composes with --sanitize without requiring it. The generator now fails early for grouped-lane inputs that cannot safely become downstream glob declarations: non-CASAVA filenames, incomplete R1/R2 lane pairs, mixed FASTQ extensions within a grouped sample, and duplicate emitted sample IDs caused by sanitization. These checks catch directory problems during samplesheet generation rather than during a later NVD run. Samplesheet output keeps the older five-column shape unless glob columns are actually used. When grouped lanes are generated, the writer expands to include fastq1_glob and fastq2_glob so downstream read resolution receives the lower-level glob declarations it already understands. Tests cover grouped CASAVA lane generation, preservation of read-like tokens in sample IDs, fail-fast malformed grouped inputs, conditional five-column versus seven-column CSV output, and the existing read resolver behavior.
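A rough sketch of grouped-lane generation and its fail-early checks. The regular expression, the emitted glob shape, and the function name `group_lanes` are assumptions for illustration, and bare filenames are used for brevity where the real declarations would be absolute paths.

```python
import re
from collections import defaultdict

CASAVA = re.compile(
    r"^(?P<prefix>.+_S\d+)_L(?P<lane>\d{3})_R(?P<read>[12])_(?P<chunk>\d{3})"
    r"(?P<ext>\.(?:fastq|fq)(?:\.gz)?)$"
)


def group_lanes(filenames: list[str]) -> dict[str, tuple[str, str]]:
    """Map each CASAVA sample prefix to (fastq1_glob, fastq2_glob)."""
    seen = defaultdict(lambda: {"1": set(), "2": set(), "ext": set()})
    for name in filenames:
        match = CASAVA.match(name)
        if match is None:
            raise ValueError(f"not a CASAVA filename: {name}")
        entry = seen[match["prefix"]]
        entry[match["read"]].add((match["lane"], match["chunk"]))
        entry["ext"].add(match["ext"])
    globs = {}
    for prefix, entry in seen.items():
        if entry["1"] != entry["2"]:
            raise ValueError(f"incomplete R1/R2 lane pairs for {prefix}")
        if len(entry["ext"]) != 1:
            raise ValueError(f"mixed FASTQ extensions for {prefix}")
        ext = next(iter(entry["ext"]))
        # One row per sample: lanes stay separate files matched by pattern.
        globs[prefix] = (f"{prefix}_L*_R1_*{ext}", f"{prefix}_L*_R2_*{ext}")
    return globs
```

The output is a pair of patterns per sample rather than a concatenated file, which is what keeps grouping distinct from physical concatenation.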
Add an advanced samplesheet guide note for nvd samplesheet generate --group-lanes. The docs keep the README-level samplesheet shape simple while explaining in the CLI guide that grouped Illumina/CASAVA lane inputs are represented with lower-level fastq1_glob and fastq2_glob columns only when that mode is requested. Add assets/grouped_lane_samplesheet.csv as a concrete expanded-form example and whitelist it in .gitignore so it is tracked alongside the existing samplesheet assets. The example validates with the current nvd samplesheet validator.
Move read-input resolution into package code and use it as the single source of truth for samplesheet validation. The Nextflow-facing resolver script now visibly orchestrates reusable library calls instead of carrying its own copy of the logic. The shared resolver keeps validation filesystem-sensitive: exact FASTQ paths must exist, glob patterns must be absolute, glob matches must be visible, sample IDs must be unique, and paired glob declarations must describe the same read set. Path handling preserves user-facing absolute symlink namespaces rather than canonicalizing through realpaths. Centralize FASTQ filename primitives for suffix support, compression classification, CASAVA lane parsing, and safe paired-glob keys. Paired globs now support CASAVA names plus simple terminal read markers without falling back to sorted-order pairing. Validation continues to accept xz and zst inputs because Deacon supports them. Extract reusable samplesheet validation into package modules while keeping CLI rendering in the CLI layer. Both nvd samplesheet validate and nvd validate samplesheet now render the same validation result without command modules importing each other. Tighten the resolver API by removing the unused cwd parameter and marking resolver helpers private. Add warning-only SRA accession shape checks for manually authored samplesheets, matching the existing generate-from-SRA warning model. Document grouped-lane globs as live declarations and clarify the symlink-preserving absolute-path policy. Add an importability tripwire for the read resolver script and expand tests around grouped lanes, duplicate IDs, platform aliases, compression handling, terminal read markers, SRA warnings, and filesystem-sensitive validation.
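Pairing by key rather than by sorted position can be sketched like this. The regular expression, the `_R#` masking convention, and the names `pair_key` and `pair_files` are assumptions; the sketch only illustrates the idea of CASAVA names plus simple terminal read markers.

```python
import re

# A read marker ("_R1", "_R2", "_1", "_2") counts only when it sits directly
# before an optional CASAVA chunk ("_001") and the FASTQ extension.
READ_MARKER = re.compile(
    r"(?:_R|_)(?P<read>[12])"
    r"(?=(?:_\d{3})?\.(?:fastq|fq)(?:\.(?:gz|xz|zst))?$)"
)


def pair_key(filename: str) -> tuple[str, str]:
    """Return (key, read), where key is the name with the read marker masked."""
    match = READ_MARKER.search(filename)
    if match is None:
        raise ValueError(f"no terminal read marker in {filename}")
    key = filename[: match.start()] + "_R#" + filename[match.end():]
    return key, match["read"]


def pair_files(r1_files: list[str], r2_files: list[str]) -> list[tuple[str, str]]:
    """Pair mates by masked key and fail if the two sets differ."""

    def index(files: list[str], expected: str) -> dict[str, str]:
        keyed = {}
        for name in files:
            key, read = pair_key(name)
            if read != expected:
                raise ValueError(f"{name} is not an R{expected} file")
            keyed[key] = name
        return keyed

    r1, r2 = index(r1_files, "1"), index(r2_files, "2")
    if r1.keys() != r2.keys():
        raise ValueError("R1 and R2 globs describe different read sets")
    return [(r1[key], r2[key]) for key in sorted(r1)]
```

Because mates are matched on the masked key, discovery order of the glob results cannot silently mispair lanes.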
Compute the run context from the same read-input resolver used by the pipeline read-gathering path instead of independently parsing and deduplicating sample IDs. This keeps manually authored samplesheets under one set of semantics: duplicate sample IDs, invalid platforms, missing FASTQ files, bad glob declarations, and empty samplesheets now fail before a sample-set ID is emitted. The process still runs in the Nextflow work directory and relies on package importability rather than source-tree path shims. Add focused tests for successful sample ID extraction, duplicate rejection, missing FASTQ rejection, and stable context hashing from resolved samples.
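One way to get a stable sample-set ID from resolved samples is to hash the sorted IDs. This is only a sketch: the hash function, the truncation length, the separator, and the name `sample_set_id` are assumptions, and the real context may hash additional fields.

```python
import hashlib


def sample_set_id(sample_ids: list[str]) -> str:
    """Hash resolved sample IDs into an order-independent identifier."""
    if not sample_ids:
        raise ValueError("empty samplesheet")
    if len(set(sample_ids)) != len(sample_ids):
        raise ValueError("duplicate sample IDs")
    # Sorting makes the ID independent of samplesheet row order.
    payload = "\n".join(sorted(sample_ids)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]
```

Rejecting duplicates and empty input before hashing mirrors the requirement that invalid samplesheets fail before any sample-set ID is emitted.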
Add a golden seam test for grouped-lane FASTQ handling. The test builds small CASAVA-style paired-lane FASTQs, generates a grouped-lane samplesheet through the samplesheet helpers, resolves it through the shared read-input engine, and streams the resolved R1/R2 bundles through the same Deacon streaming script API used by the workflow. The fake Deacon runner reads the FIFO inputs and verifies that mate direction and record ordinality are preserved across concatenated lane bundles. This keeps the test fast and local while still exercising the user-relevant lane-concat path from generated samplesheet to streamed paired reads. Add Hypothesis as a dev dependency and use it to vary lane numbers, chunk numbers, record counts, and discovery order. The property test asserts that grouped-lane generation and resolution produce exactly one paired-glob sample, preserve the expected platform/source metadata, order R1/R2 files by lane/chunk keys, and stream the expected paired read sequence.
Bring GitHub CI up to the same fast-test contract used by local development. The CI workflow now runs the fast pytest command, matching the equivalent local recipe and ensuring non-slow tests under lib/, bin/, and tests/ run for pull requests. Also trigger CI when tests or uv.lock change, since both can affect the fast test suite, and use Python 3.12 for uv-managed CI jobs to match the supported Python range and current development environment. Validated locally with the fast pytest command (darwin, Python 3.12.11, pytest-8.3.5: 341 items collected, 337 passed, 4 deselected, in 6.93s and 6.52s across two runs) and with a workflow validation of .github/workflows/ci.yml (1 valid, 0 invalid).
Introduce a top-level experimental boolean parameter that is disabled by default and can be used by future release-candidate work to gate experimental subworkflows. The parameter is wired through the Nextflow defaults, the NVD params Pydantic model, the nvd run CLI, and the params JSON schema. The schema is versioned as v3.1.0 while preserving the v3.0.0 schema, and the latest schema symlink now points at the v3.1.0 schema. The CLI intentionally exposes only the positive --experimental flag rather than a paired --experimental/--no-experimental option, since the gate should only be enabled explicitly.
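The positive-only flag design can be illustrated with argparse. This is a stand-in for whatever CLI framework `nvd run` actually uses; the parser name and help text are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Expose only --experimental; there is no --no-experimental twin."""
    parser = argparse.ArgumentParser(prog="nvd-run-sketch")
    parser.add_argument(
        "--experimental",
        action="store_true",  # defaults to False unless passed explicitly
        help="enable experimental subworkflows",
    )
    return parser
```

With this shape, the gate is off unless the user opts in, and passing `--no-experimental` is rejected as an unknown argument rather than silently accepted.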
Add the parameter surface for experimental sourmash reference profiling without wiring those params into workflow execution yet. This keeps the public configuration, CLI, model, and schema changes isolated from the later Nextflow resource-resolution and profiling work. The new params cover prebuilt local or URL-provided sourmash reference sketches, local FASTA inputs that can be sketched into a reference database later, optional taxonomy lineages, and sketching controls for k-size and scaled values.
Introduce the v3.1 taxonomy availability controls that were intentionally kept out of the v3.0.1 patch line. This adds an explicit pipeline-facing taxonomy mode for read-only versus prepare-if-missing behavior, plus admin-facing refresh controls for taxonomy preparation. Refactor taxonomy management toward a status -> policy -> plan -> execute shape so tests can exercise the decision logic without depending on downloads or filesystem mutation. Wire the new controls through the CLI, Nextflow params, schema, and annotation processes, while preserving the legacy NVD_TAXONOMY_OFFLINE compatibility path when no explicit mode is provided.
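The status -> policy -> plan -> execute shape can be sketched with plain data types, which is what makes the decision logic testable without downloads. All names here (`Mode`, `Status`, `make_plan`, the step labels) are hypothetical; only the read-only versus prepare-if-missing distinction comes from the description above.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    READ_ONLY = "read-only"
    PREPARE_IF_MISSING = "prepare-if-missing"


@dataclass(frozen=True)
class Status:
    """Observed state, gathered once and then treated as immutable."""
    dumps_present: bool
    database_present: bool


def make_plan(status: Status, mode: Mode) -> list[str]:
    """Turn status plus policy into an ordered list of steps to execute."""
    if status.dumps_present and status.database_present:
        return []  # nothing to do; safe for shared directories
    if mode is Mode.READ_ONLY:
        raise RuntimeError("taxonomy is incomplete and mode is read-only")
    steps = []
    if not status.dumps_present:
        steps.append("download_taxdump")
    steps.append("build_database")
    return steps
```

A separate executor would map each step label to the function that performs it, so tests can assert on the plan alone and never touch the network or filesystem.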
Opening this early to get discussion going for the v3.1.0 release candidate, which is still very much evolving.
Currently, this PR stacks on top of a few other branches with development work happening, including `lane-concat` and `v3.0.1-rc`. It will eventually be rebased on top of a renamed version of `experimental-sourmash` as well.
The targets for this release include:
`--experimental` parameter and command line argument that can turn on experimental features deployed for testing in main. This is meant to make testing new features without maintaining multiple checkouts of NVD easier.
[...] `--experimental` and may or may not remain so when merged into `main`.
This branch currently isn't based on top of `experimental-sourmash` just yet, as work that may involve commit history modifications is still ongoing there. As such, the sourmash updates described in point 3 above aren't in the PR diff yet. But they will be soon!