
Demultiplexing tip for undetermined fastq files #22

Description

@bfremin

We have been getting data back as one giant FASTQ file of undetermined reads (instead of BCL files), with the barcode in the read name. Most tools that demultiplex from FASTQ were very slow, could not be parallelized, and/or failed. This is just a pre-preprocessing tip.

You need two files: a file that lists your barcodes, and a script.

barcodes.txt:
samplenameA GGACTCCT+AGAGGATA
samplenameB TAGGCATG+AGAGGATA
samplenameC CTCTCTAC+AGAGGATA
...all your samples
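
Before filling in barcodes.txt, it can help to see which barcodes actually occur in the undetermined file. The sketch below tallies them from the read names. It assumes the barcode is the last colon-separated field of each header line and that records are strictly four lines long; `tally_barcodes` and the tiny demo file are made up for illustration and are not part of the original tip.

```shell
#!/bin/bash
# Sketch: count how often each barcode appears in the FASTQ read names.
# Assumption: the barcode is the last colon-separated field of the header.

tally_barcodes() {
    # $1: uncompressed FASTQ file with strict 4-line records
    awk 'NR % 4 == 1 { n = split($0, f, ":"); print f[n] }' "$1" \
        | sort | uniq -c | sort -k1,1nr
}

# Hypothetical three-read file, for illustration only
demo_fq=$(mktemp)
cat > "$demo_fq" <<'EOF'
@r1 1:N:0:GGACTCCT+AGAGGATA
ACGT
+
IIII
@r2 1:N:0:TAGGCATG+AGAGGATA
TTTT
+
IIII
@r3 1:N:0:GGACTCCT+AGAGGATA
GGCC
+
IIII
EOF

# Prints one "count barcode" line per barcode, most frequent first
tally_barcodes "$demo_fq"
```

On a real run, point `tally_barcodes` at the undetermined file (or at the first few million lines of it via `head`) instead of the demo file.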

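demultiplex.sh:
#!/bin/bash
# Usage: demultiplex.sh <sample_name> <barcode>
module load sickle/1.33

# Demultiplex: pull each matching line plus the 3 lines after it (one FASTQ record).
# Both mates run in the background; wait blocks until both have finished.
grep -A3 --no-group-separator -i "$2" {giant_UndeterminedFile_1.fq} | gzip > "${1}_1.fq.gz" &
grep -A3 --no-group-separator -i "$2" {giant_UndeterminedFile_2.fq} | gzip > "${1}_2.fq.gz" &
wait

# Remove reads that do not have pairs (trimming will fail if you do not)
sickle pe -f "${1}_1.fq.gz" -r "${1}_2.fq.gz" -t sanger -o "paired_${1}_1.fq" -p "paired_${1}_2.fq" -s "${1}_single.fq"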
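
The grep step above keeps any line that contains the barcode, plus the three lines after it. A stricter variant, sketched here with awk rather than grep (so not the method in the script above), tests only header lines. It assumes strict 4-line records and the barcode as the last colon-separated header field, and it matches exactly and case-sensitively, unlike `grep -i`. `extract_by_barcode` and the demo file are hypothetical.

```shell
#!/bin/bash
# Sketch: keep only the 4-line records whose header ends with the barcode.
# Assumption: the barcode is the last colon-separated field of the header.

extract_by_barcode() {
    # $1: barcode string, $2: uncompressed FASTQ file
    awk -v bc="$1" '
        NR % 4 == 1 { n = split($0, f, ":"); keep = (f[n] == bc) }  # decide per record
        keep                                                        # print lines of kept records
    ' "$2"
}

# Hypothetical three-read file, for illustration only
demo_fq=$(mktemp)
cat > "$demo_fq" <<'EOF'
@r1 1:N:0:GGACTCCT+AGAGGATA
ACGT
+
IIII
@r2 1:N:0:TAGGCATG+AGAGGATA
TTTT
+
IIII
@r3 1:N:0:GGACTCCT+AGAGGATA
GGCC
+
IIII
EOF

# Prints records r1 and r3 only
extract_by_barcode "GGACTCCT+AGAGGATA" "$demo_fq"
```

The function writes to standard output, so it can be piped into gzip the same way as the grep calls in the script.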

Run:
cat barcodes.txt | xargs -L 1 bash -c 'sbatch ..... demultiplex.sh $0 $1'
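
If no scheduler is available, the same per-sample dispatch can be sketched as a plain loop over barcodes.txt. The `run_all` helper and the demo list below are made up for illustration; passing `echo bash` as the command prefix only prints what would run, and replacing it with `bash` would actually call demultiplex.sh for each line, one sample at a time.

```shell
#!/bin/bash
# Sketch: call demultiplex.sh once per "name barcode" line, without sbatch.

run_all() {
    # $1: barcode list ("name barcode" per line)
    # $2: command prefix, e.g. "echo bash" for a dry run or "bash" to execute
    while read -r name barcode; do
        # Skip blank lines and lines with no barcode column
        if [ -z "$name" ] || [ -z "$barcode" ]; then
            continue
        fi
        # $2 is left unquoted on purpose so "echo bash" splits into two words
        $2 demultiplex.sh "$name" "$barcode"
    done < "$1"
}

# Hypothetical two-sample list, for illustration only
demo_list=$(mktemp)
cat > "$demo_list" <<'EOF'
samplenameA GGACTCCT+AGAGGATA
samplenameB TAGGCATG+AGAGGATA
EOF

# Dry run: prints the two commands instead of executing them
run_all "$demo_list" "echo bash"
```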

This will save you a lot of time compared with trying existing tools.
