NVD samplesheets are our primary means of tracking sample metadata, and the sample identifier is the central piece of information in them. When a samplesheet is auto-generated via nvd samplesheet generate, sample identifiers are derived from each FASTQ filename by stripping the {_R1,_R2}.fastq.gz suffix. For Illumina data, that does not remove everything that gets appended to the user-provided sample identifier when the FASTQ files are named. As a result, NVD samplesheet IDs can still contain extra text (e.g. L001, S21) on top of the user-provided sample identifiers, which may break metadata matchups like those in our LabKey ETLs.
To address this edge case, NVD samplesheet generation should come with a --sanitize flag that opts in to stripping the additional Illumina-convention text. This text follows a standard format, so it should be easy to recognize and strip. The goal is that samplesheets for Illumina data end up with the same sample identifiers that were submitted in the Illumina instrument's sample sheet.
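As a starting point for discussion, here is a minimal sketch of what the stripping step could look like. It is not NVD's existing code: the function name sanitize_sample_id and the regular expressions are hypothetical, and the pattern assumes the usual Illumina FASTQ naming of <SampleName>_S<n>_L<lane>_<R1|R2|I1|I2>_001.fastq.gz. It accepts either a full filename or an ID that has already had its read suffix removed.

```python
import re

# Assumption: Illumina-style names look like
#   <SampleName>_S<n>[_L<lane>][_R1|_R2|_I1|_I2][_001]
# The _S<n> block is required so that IDs without it are left untouched.
_ILLUMINA_SUFFIX = re.compile(r"_S\d+(?:_L\d{3})?(?:_[RI][12])?(?:_\d{3})?$")

# Optional FASTQ extension, in case a filename is passed instead of an ID.
_FASTQ_EXT = re.compile(r"\.(?:fastq|fq)(?:\.gz)?$")


def sanitize_sample_id(name: str) -> str:
    """Return the user-provided sample ID with Illumina-appended text removed.

    Hypothetical helper for a --sanitize option; names that do not match
    the assumed Illumina pattern are returned unchanged (minus extension).
    """
    stem = _FASTQ_EXT.sub("", name)
    stripped = _ILLUMINA_SUFFIX.sub("", stem)
    # Never return an empty ID; fall back to the unsanitized stem.
    return stripped or stem


if __name__ == "__main__":
    for example in (
        "Sample-A_S21_L001_R1_001.fastq.gz",
        "Sample-A_S21_L001",
        "plain_sample",
    ):
        print(example, "->", sanitize_sample_id(example))
```

One caveat that supports keeping this opt-in: a user-provided ID that itself ends in something like _S1 would be shortened by this pattern, so sanitizing is only safe when the data is known to follow Illumina naming.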