A long-read, whole-genome structural variant (SV) caller with copy number predictions from coverage and SNP B-allele frequency. Inputs are long-read alignments (BAM), a reference genome (FASTA), a VCF of high-quality SNPs
(e.g. from Clair3 or NanoCaller), and per-chromosome VCF files with SNP population frequencies (e.g. from gnomAD). Class documentation is available at https://wglab.openbioinformatics.org/ContextSV
First, install Anaconda.
Next, create a new environment. This installation has been tested with Python 3.10, Linux 64-bit.
conda create -n contextsv python=3.10
conda activate contextsv
ContextSV and its dependencies can then be installed using the following command:
conda install -c wglab -c conda-forge -c bioconda contextsv
# Or using mamba (faster dependency resolution):
mamba install -c wglab contextsv
After installation, you should have access to the following commands in your terminal:
- contextsv: the main SV caller
- contextsv-cnv-plot: utility to generate CNV plots from ContextSV JSON output
- contextscore: ContextScore utility for post-filtering of low-confidence SV calls
Example usage:
# SV calling example:
contextsv \
--bam sample.bam \
--ref hg38.fa \
--outdir output/ \
--threads 4 \
--snp snps.vcf \
--eth nfe \
--pfb gnomadv4_filepaths.txt \
--assembly-gaps hg38-gaps.bed \
--save-cnv
# --assembly-gaps (assembly gaps file) and --save-cnv (save CNV calls in JSON) are optional.
# SV post-filtering example:
contextscore \
--input input.vcf \
--output scored.vcf \
--sample-coverage 30 \
--buildver hg38 \
--threshold 0.2 \
--annovar /path/to/annovar \
--annovar-db /path/to/humandb
# CNV plotting example:
contextsv-cnv-plot ./output/CNVCalls.json chr3 --formats html,svg --output-dir ./CNV_Plots
Alternatively, ContextSV can be run with Docker. First, install Docker, then pull the latest image from Docker Hub, which contains the latest release and its dependencies.
docker pull genomicslab/contextsv
Example usage:
# SV calling:
docker run --rm genomicslab/contextsv --help
# SV post-filtering:
docker run --rm \
-v /path/to/data:/mnt \
genomicslab/contextsv \
contextscore \
--help
# CNV plotting:
docker run --rm \
-v /path/to/data:/mnt \
genomicslab/contextsv \
contextsv-cnv-plot \
--help
To build ContextSV from source, note that it requires HTSLib as a dependency, which can be installed using Anaconda. Create an environment containing HTSLib:
conda create -n htsenv -c bioconda -c conda-forge htslib
conda activate htsenv
Then follow the instructions below to build ContextSV:
git clone https://github.com/WGLab/ContextSV
cd ContextSV
make
ContextSV can then be run:
./build/contextsv --help
Options:
-b, --bam <bam_file> Long-read BAM file (required)
-r, --ref <ref_file> Reference genome FASTA file (required)
-s, --snp <vcf_file> Long-read SNP VCF file (required)
-o, --outdir <output_dir> Output directory (required)
-t, --threads <thread_count> Number of threads, chromosome-level parallelization (default: 1)
-h, --hmm <hmm_file> HMM parameter file for copy number predictions (included in the repository)
-e, --eth <eth_file> Ethnicity as used in gnomAD (e.g. "asj" for Ashkenazi Jewish, "nfe" for Non-Finnish European, etc.)
-p, --pfb <pfb_file> File containing per-chromosome population allele frequency filepaths as described in this documentation
--assembly-gaps <gaps_file> Assembly gaps file in BED format available from UCSC Genome Browser (https://hgdownload.soe.ucsc.edu/goldenPath/hg38/database/gap.txt.gz for GRCh38)
--save-cnv Save CNV data in JSON for downstream plotting with contextsv-cnv-plot
--debug Debug mode with verbose logging
--version Print version and exit
-h, --help Print usage and exit
SNP population allele frequency information is used for copy number predictions in this tool (see PennCNV for specifics). We recommend downloading this data from the Genome Aggregation Database (gnomAD).
Download links for genome VCF files are located here (last updated April 3, 2024):
- gnomAD v4.0.0 (GRCh38): https://gnomad.broadinstitute.org/downloads#4
- gnomAD v2.1.1 (GRCh37): https://gnomad.broadinstitute.org/downloads#2
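# Download the gnomAD v4.0.0 genome VCF for each chromosome.
# ${HOME} is used instead of ~ because ~ is not expanded inside double quotes.
download_dir="${HOME}/data/gnomad/v4.0.0/"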
chr_list=("1" "2" "3" "4" "5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" "17" "18" "19" "20" "21" "22" "X" "Y")
for chr in "${chr_list[@]}"; do
echo "Downloading chromosome ${chr}..."
wget "https://storage.googleapis.com/gcp-public-data--gnomad/release/4.0/vcf/genomes/gnomad.genomes.v4.0.sites.chr${chr}.vcf.bgz" -P "${download_dir}"
done
Finally, create a text file that specifies each chromosome and its corresponding gnomAD filepath. This file is passed to ContextSV with the --pfb option:
gnomadv4_filepaths.txt
1=~/data/gnomad/v4.0.0/gnomad.genomes.v4.0.sites.chr1.vcf.bgz
2=~/data/gnomad/v4.0.0/gnomad.genomes.v4.0.sites.chr2.vcf.bgz
3=~/data/gnomad/v4.0.0/gnomad.genomes.v4.0.sites.chr3.vcf.bgz
...
X=~/data/gnomad/v4.0.0/gnomad.genomes.v4.0.sites.chrX.vcf.bgz
Y=~/data/gnomad/v4.0.0/gnomad.genomes.v4.0.sites.chrY.vcf.bgz
For release history, please visit here.
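The chromosome-to-filepath file described above (gnomadv4_filepaths.txt) follows a simple `chromosome=filepath` pattern, so it can be generated rather than typed by hand. The following is a minimal sketch, assuming the VCFs were downloaded with the gnomAD v4.0 file names shown above. The `write_pfb_list` function and its two arguments are hypothetical helpers for illustration and are not part of ContextSV.

```shell
#!/usr/bin/env bash
# Sketch: write one "<chromosome>=<filepath>" line per chromosome.
# write_pfb_list is a hypothetical helper, not a ContextSV command.
write_pfb_list() {
    vcf_dir="${1%/}"    # directory holding the gnomAD VCFs (one trailing slash removed)
    out_file="$2"       # mapping file to create
    : > "${out_file}"   # create the file, or empty it if it already exists
    for chr in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 X Y; do
        # File name pattern assumed from the gnomAD v4.0 download URLs.
        printf '%s=%s/gnomad.genomes.v4.0.sites.chr%s.vcf.bgz\n' \
            "${chr}" "${vcf_dir}" "${chr}" >> "${out_file}"
    done
}
```

For example, `write_pfb_list "${HOME}/data/gnomad/v4.0.0" gnomadv4_filepaths.txt` would write 24 lines, one per chromosome, using absolute paths. Absolute paths avoid relying on `~`, which is expanded only by the shell and only when unquoted.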
Please post any issues on the ContextSV issues page. We will respond to your questions quickly. Your comments are critical for improving the tool and will benefit other users.