Add sanitization for sample identifiers during sample sheet generation

NVD samplesheets are our primary means of tracking sample metadata, and the sample identifier is the central unit of information in them. When a samplesheet is auto-generated via `nvd samplesheet generate`, sample identifiers are derived from each FASTQ filename by stripping the `{_R1,_R2}.fastq.gz` suffix. For Illumina data, that does not remove all of the information appended to the user-provided sample identifier when the FASTQ files are named. As a result, NVD samplesheet IDs still contain extra text (e.g. `L001`, `S21`) on top of the user-provided sample identifiers, which can break metadata matching like that done in our LabKey ETLs.

To address this edge case, NVD samplesheet generation should gain a `--sanitize` flag that opts in to stripping the additional Illumina-convention text. This text follows a standard format and should be easy to recognize and strip. The goal is for samplesheets generated from Illumina data to end up with the same sample identifiers that were submitted in the Illumina instrument's sample sheet.
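
As a rough sketch of what the stripping could look like (not the actual NVD implementation): the function name `sanitize_sample_id` and both regexes are hypothetical. The sketch assumes the usual Illumina layout `<SampleID>_S<n>_L<lane>_<R|I><read>_001.fastq.gz`, with the lane, read, and `_001` parts treated as optional.

```python
import re
from pathlib import PurePath

# Hypothetical patterns, assuming Illumina-style names such as
# <SampleID>_S21_L001_R1_001.fastq.gz (lane, read, and _001 are optional here).
FASTQ_EXT = re.compile(r"\.(?:fastq|fq)(?:\.gz)?$")
ILLUMINA_TAIL = re.compile(r"_S\d+(?:_L\d{3})?(?:_[RI][12])?(?:_001)?$")
READ_TAIL = re.compile(r"_R[12]$")


def sanitize_sample_id(fastq_name: str) -> str:
    """Return the user-provided sample ID for a FASTQ filename (illustrative only)."""
    # Drop any directory part, then the FASTQ extension.
    stem = FASTQ_EXT.sub("", PurePath(fastq_name).name)
    # Try the Illumina-convention tail first; it is anchored at the end of the
    # name, so an "_S<n>" in the middle of a sample ID is left alone.
    stripped = ILLUMINA_TAIL.sub("", stem)
    if stripped != stem:
        return stripped
    # Otherwise fall back to removing only the plain _R1/_R2 read suffix.
    return READ_TAIL.sub("", stem)


print(sanitize_sample_id("SampleA_S21_L001_R1_001.fastq.gz"))  # SampleA
print(sanitize_sample_id("SampleA_R2.fastq.gz"))               # SampleA
```

One caveat that argues for keeping this opt-in: a user-provided ID that itself ends in `_S<n>` (e.g. `plate_S2` in `plate_S2_R1.fastq.gz`) is indistinguishable from the Illumina sample-number field and would be truncated by a pattern like this.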

Add sanitization for sample identifiers during sample sheet generation #24
