Add sanitization for sample identifiers during sample sheet generation #24

Description

@nrminor

NVD samplesheets are our primary means of tracking sample metadata, and the sample identifier is the central unit of information in them. When samplesheets are auto-generated via nvd samplesheet generate, sample identifiers are derived from each FASTQ filename by splitting off the {_R1,_R2}.fastq.gz suffixes. For Illumina data, this does not remove everything that gets appended to the user-provided sample identifier when FASTQ files are named. As a result, NVD samplesheet IDs can still contain additional text (e.g. L001, S21, etc.) on top of the user-provided sample identifiers, which may break metadata matching such as the matching done in our LabKey ETLs.

To address this edge case, NVD samplesheet generation should offer a --sanitize flag that opts in to stripping the additional Illumina-convention text. This text follows a standard format and should be easy to recognize and split off. The goal is for samplesheets for Illumina data to end up with the same sample identifiers that were submitted in the Illumina instrument's sample sheet.
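
As a starting point for discussion, here is a minimal sketch of what the stripping step could look like. It assumes the leftover text is a trailing sample-number token (_S followed by digits) and/or a lane token (_L followed by three digits), as in the S21 and L001 examples above. The names sanitize_sample_id and ILLUMINA_TAIL are hypothetical and are not part of the existing NVD code.

```python
import re

# Hypothetical pattern: an optional "_S<digits>" sample-number token followed by
# an optional "_L<3 digits>" lane token, anchored at the end of the identifier.
ILLUMINA_TAIL = re.compile(r"(_S\d+)?(_L\d{3})?$")


def sanitize_sample_id(sample_id: str) -> str:
    """Strip trailing Illumina-convention tokens from a sample identifier.

    If stripping would leave an empty string, the original identifier is
    returned unchanged rather than producing a blank sample ID.
    """
    stripped = ILLUMINA_TAIL.sub("", sample_id)
    return stripped or sample_id


if __name__ == "__main__":
    for raw in ["sampleA_S21_L001", "sampleA_S21", "sampleA"]:
        print(raw, "->", sanitize_sample_id(raw))
```

One reason to keep this behind an opt-in flag: a user-provided identifier that legitimately ends in something like _S3 would also be trimmed by this pattern, so it should only run when the input is known to follow the Illumina naming convention.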
