Our preprint is now available! https://doi.org/10.1101/2024.12.31.630903
For most users, we recommend to install conda or mamba virtual environment to install these programs. For package installation and management, please advise with conda/mamba manuals.
D&D pipeline was tested on Python 3.9.19 and R 4.3.3 using mamba (https://mamba.readthedocs.io/en/latest/). Dependencies and requirements are specified in main/environment.yml.
git clone https://github.com/sangho1130/DnD.git
cd DnD/main/
# create a conda environment
conda env create -f environment.yml
conda activate dndseq
# install rtracklayer, ggplot2, reshape2
Rscript install_packages.R
# install dndseq pipeline
pip install -e .
which dnd-pt1
# test run
# download a mini data from https://doi.org/10.6084/m9.figshare.30853628
# and place it in this location
# then run,
sbatch cmd_dnd.sh
Test mini data can be downloaded from: https://doi.org/10.6084/m9.figshare.30853628
gnomAD germline database (hg38) is required from: https://console.cloud.google.com/storage/browser/gatk-best-practices/somatic-hg38/
(filtered version used in the paper is available at https://doi.org/10.6084/m9.figshare.27956862)
Motif database for MEME can be downloaded from: https://meme-suite.org/meme/doc/download.html
To summarize genome edits introduced by DnD in scATAC-seq data, a bam file produced by CellRanger was split into each cell type based on barcode sequence and cell annotation using sinto. For bulk ATAC-seq data, bam files aligned to the genome using BWA-MEM2 were analyzed. After splitting bam files, the following six steps were performed to analyze DnD-mediated genomic variants.
- First, each bam file was preprocessed to remove uninformative and low quality read alignments. Duplicated reads were marked and simultaneously filtered using “picard MarkDuplicates” with a REMOVE_DUPLICATES=true parameter. Read alignments with high mapping quality Phred score (>= 20), primary alignment, reads aligned to intact chromosomes, and those with properly aligned mates were retained using samtools (version 1.19).
- Next, all single nucleotide variants (SNVs) found in each filtered bam file were collected using “bcftools mpileup” with following parameters, -a FORMAT/AD,FORMAT/DP,INFO/AD --no-BAQ --min-MQ 1 --max-depth 8000. The pileup result subsequently converted into the vcf format reporting SNVs supported by at least two reads supporting variants from minimum three aligned reads. These thresholds can be adjusted by users based on their data.
- Germline mutations were then filtered based on loci and alleles from the gnomAD database and variant allele frequency higher than 10%. If available, custom databases in vcf format can be provided to additionally filter uninformative mutations.
- Preprocessed bam files from step 1 were analyzed using MACS2 (version 2.2.9.1) to call peaks with -f BAM --nomodel parameters. Peaks were then filtered with blacklist region annotation using bedtools (version 2.31.1). Motif analysis was performed using MEME Simple Enrichment Analysis (SEA) with HOmo sapiens COmprehensive MOdel COllection (HOCOMOCO) v11 core motif set to identify binding sites in peaks. Optionally, users can perform motif analysis using HOMER2 or a reference bed file generated by ChIP-seq.
- Peaks harboring motifs of interest or overlapping with ChIP-seq reference tracks were classified as target peaks. Target peaks were resized to 200 bp (up/downstream 100 bp from the motif center or peak summit with ChIP-seq) and overlaid with C-to-T and G-to-A variants identified in step 3. When multiple motif positions were found, a position with the highest motif score was chosen. Peaks without binding motifs of interest are classified as background peaks and were resized to 200 bp by taking +/- 100 bp from the peak summit.
To compare D&D edit counts between target and background regions, edit counts per peak were summarized and signal-to-noise ratio (SNR) were calculated by dividing the number of C-to-T and G-to-A by the number of other variants. First, edits per peak were calculated by dividing SNV counts by the number of peaks in resized target and background peak regions. We assume that the frequencies of non-D&D edit (edits other than C-to-T) are consistent across the target and background regions, so this can be used to normalize the D&D edit counts. In this manner, SNR was then calculated by dividing the D&D edit counts by the mean of non-D&D edit counts.
Footprint analysis for D&D edits was performed by counting the number of D&D edits in each base pair from randomly sampled target and background peaks, +-100 bp from the center of the motifs. The random sampling was repeated for 10 times and mean and standard deviation were used for visualization. For both the cell mixing experiment and primary blood cells, we used 200 randomly selected peaks. The number of subsampled peaks can be set by the user.
D&D signals are collected and evaluated with three-step python scripts. By default, D&D edits will be called with motif analysis using MEME Simple Enrichment Analysis (SEA). Users can instead run HOMER2 with their own build for de novo motif searching or use matching ChIP-seq tracks as reference.
important note Bam files for samples, cell cluster or subclusters should be prepared in the following manner. For single-cell data, bam files can be separated into user-provided clusters using sinto. Please refer to the sinto's manual (https://timoast.github.io/sinto/). For bulk samples, D&D-seq takes genome aligned bam files as input. We noticed that recent picards version requires read group information in bam files and missing it leads to failure of the pipeline. In that case, users can manually add custom read group using the following samtools command.
samtools addreplacerg -r "@RG\tID:dndseq\tSM:dndseq\tLB:MissingLibrary.1\tPU:dndseq\tPL:ILLUMINA" -o <output.bam> <input.bam>
Bam files should be structured like below.
$ tree bams
# should be structured like below
bams
├── dndseq_ctcf_chr1
│ ├── k562_ctcf_chr1.bam
│ └── k562_ctcf_chr1.bam.bai
└── (..XYZ..)
├── (..XYZ..).bam
└── (..XYZ..).bam.bai
Part 1: preprocessing, pileup and filtering variants, and first round peak calling
$ dnd-pt1 --help
usage: [-h] -d DIR [-o OUTPUT] [--thread THREAD] [--start START] [--end END] --fasta FASTA --gnomad GNOMAD [--mapq MAPQ] [--chrom] [--smt-other OTHER] [--se] [--count COUNT] [--alt ALT] [--left LEFT] [--right RIGHT] [--pass-gnomad]
[--custom CUSTOM [CUSTOM ...]] [--vaf VAF] [--snv SNV] [--gsize GSIZE] [--opt OPT] [--blacklist BLACKLIST] [--pass-bklist]
optional arguments:
-h, --help show this help message and exit
-d DIR, --Dir DIR directory path
-o OUTPUT, --Output OUTPUT
[Global] (*optional) output directory path
--thread THREAD [Global] (*optional) number of threads; default is 4
--start START [Global] (*optional) the first directory index
--end END [Global] (*optional) the last directory index
--fasta FASTA [Global] genome fasta (indexed) used in alignment
--gnomad GNOMAD [Global] path to gnomAD vcf file
--mapq MAPQ [Step 1] (*optional) threshold for mapping quality; default is 20
--chrom [Step 1] (*optional) do not filter non-chromosome
--smt-other OTHER [Step 1] (*optional) other filter parameters for samtools in "~"
--se [Step 1] (*optional) input bam is single-end
--count COUNT [Step 2] (*optional) minimum total counts; default is 3
--alt ALT [Step 2] (*optional) minimum ALT counts; default is 1
--left LEFT [Step 2] (*optional) ignore variants in this many bases on the left side of reads
--right RIGHT [Step 2] (*optional) ignore variants in this many bases on the right side of reads
--pass-gnomad [Step 3] (*optional) do not run gnomAD filering
--custom CUSTOM [CUSTOM ...]
[Step 3] (*optional) custom vcf file(s) to filter, multiple files are accepted
--vaf VAF [Step 3] (*optional) filter mutations frequent than this value; e.g. 10 for 10perc; default is 10
--snv SNV [Step 4] (*optoinal) SNV patterns to search; e.g. --snv "C>T,G>A, G>C"; default is "C>T,G>A"
--gsize GSIZE [Step 5] (*optional) effective genome size for macs2 callpeak; default is hs (homo sapiens)
--opt OPT [Step 5] (*optional) other parameters for macs2 callpeak
--blacklist BLACKLIST
[Step 5] (*optional) blacklist file; default is human hg38-blacklist.v2.bed
--pass-bklist [Step 5] (*optional) do not run blacklist filtering
$ dnd-pt1 -d <path_to_bam_directory> -o <path_to_output_directory> --fasta <genome_fasta_file> --gnomad <germline_vcf_file>
Expected outputs are step1-5 directories in <-o>
├── step1_preprocess
│ └── k562_ctcf_chr1
│ ├── k562_ctcf_chr1.bam
│ ├── k562_ctcf_chr1.bai
│ └── step1_picard_deduplication.txt
├── step3_fltvcf
│ └── k562_ctcf_chr1
│ └── k562_ctcf_chr1.flt.vcf
├── step4_snvs
│ └── k562_ctcf_chr1
│ ├── k562_ctcf_chr1.flt.CT_GA.SNVs.vcf
│ ├── k562_ctcf_chr1.flt.SNVs.vcf
│ ├── k562_ctcf_chr1.snvs.bam
│ ├── k562_ctcf_chr1.snvs.bam.bai
│ └── qnames.txt
└── step5_peaks
└── k562_ctcf_chr1
├── peaks_bkflt.narrowPeak
└── summits_bkflt.bed
Part 2: Joint peak calling and motif searching
We found that having a common set of peaks in different samples is often convenient when comparing D&D edit signals. Part2 jointly calls peaks in all samples found in your <step5_peaks> directory and intersect with individual calls. Old peaks are moved to the <celltype_specific> directory in each sample. Users who do not wish to jointly call peaks can skip peak calling by specifying <--pass-peakcall>. Then, the script performs motif analysis using MEME Simple Enrichment Analysis (SEA) or HOMER2. Users can choose whether to use joint peak calls in directory or individual peak call by specifying the <--sample> argument.
$ dnd-pt2 -h
$ dnd-pt2 -h
usage: [-h] -d DIR [-o OUTPUT] [--pass-peakcall] [--sample SAMPLE] --mode [{sea,homer2}] --fasta FASTA [--gsize GSIZE] [--opt OPT] [--blacklist BLACKLIST] [--pass-bklist] [--motif MOTIF] [--homer-ref HM2REF]
optional arguments:
-h, --help show this help message and exit
-d DIR, --Dir DIR directory path
-o OUTPUT, --Output OUTPUT
[Global] (*optional) output directory path
--pass-peakcall [Global] (*optional) pass peak calling and perform motif analysis ONLY; default is DO NOT PASS
--sample SAMPLE [Global, --pass-peakcall] (*optional) motif analysis for which sample: "joint" or specify sample name; default is "joint"
--mode [{sea,homer2}]
[Global] which mode: "sea", "homer2"; default is "sea"
--fasta FASTA [Global, --mode:sea] genome fasta (indexed) used in alignment
--gsize GSIZE [Step 5] (*optional) effective genome size for macs2 callpeak; default is hs
--opt OPT [Step 5] (*optional) other parameters for macs2 callpeak
--blacklist BLACKLIST
[Step 5] (*optional) blacklist file; default is human hg38-blacklist.v2.bed
--pass-bklist [Step 5] (*optional) do not run blacklist filtering
--motif MOTIF [Step 5, --mode:sea] (*optional) motif reference; default is hocomoco v11 core human from the meme package
--homer-ref HM2REF [Step 5, --mode:homer2] (*optional) homer2 reference; default is "grch38_crgatac"
$ dnd-pt2 -d <path_to_pt1_output_directory> --fasta <genome_fasta_file> --mode sea
This step will create a "merged" directory with joint peak calling results in "step5_peaks/", and intersected peaks in each sample's directory. Original MACS2 result files will be stored in "celltype_specific" directory in each sample.
└── step5_peaks
└── k562_ctcf_chr1
│ ├── celltype_specific
│ │ ├── peaks_bkflt.narrowPeak
│ │ └── summits_bkflt.bed
│ ├── peaks_bkflt.narrowPeak
│ └── summits_bkflt.bed
└── merged
├── meme_sea
│ ├── sea.html
│ ├── sea.tsv
│ ├── sequences.tsv
│ └── sites.tsv
├── peaks_bkflt.fasta
├── peaks_bkflt.narrowPeak
├── summits_bkflt.bed
└── unfiltered
├── peaks_bkflt.narrowPeak
└── summits_bkflt.bed
Part 3: Motif annotation and D&D edit evaluation
$ dnd-pt3 -h
usage: --mode only supports homer2 or sea [-h] -d DIR [-o OUTPUT] [--size SIZE] [--rand RAND] --sample SAMPLE [--var VARIANTS] [--genomeOrder GENOMEORDER] --mode [{chip,homer2,sea}] [--indiv-sea] [--chipseq CHIPSEQ] [--homer-ref HM2REF]
[--motif MOTIF [MOTIF ...]]
optional arguments:
-h, --help show this help message and exit
-d DIR, --Dir DIR directory path
-o OUTPUT, --Output OUTPUT
[Global] (*optional) output directory path
--size SIZE [Global] (*optional) test peak width size in bp; default is 200
--rand RAND [Global] (*optional) down-sample peaks to this number; default is 200
--sample SAMPLE [Global] specify <sample name> or <"all"> for all samples in <step5>
--var VARIANTS [Global] (*optional) expected D&D variants; default is "C>T,G>A"
--genomeOrder GENOMEORDER
[Global] (*optional) chromosome ordered in bed; default is </data/GRCh38.p14.genome.chrs.bed>
--mode [{chip,homer2,sea}]
[Global] which mode: "sea", "homer2" or "chip"
--indiv-sea [Global] (*optional, if --mode sea) specify when you analyzed SEA motif search for each sample separately
--chipseq CHIPSEQ [Step 6, --mode:chip] chip-seq reference
--homer-ref HM2REF [Step 6, --mode:homer2] (*optional) homer2 reference; default is "grch38_crgatac"
--motif MOTIF [MOTIF ...]
[Step 6, --mode:homer2 or sea] <path to the homer2 motif file> for "homer2" or <TF name> for "sea", multiple arguments are supported
$ dnd-pt3 -d <path_to_pt1_output_directory> --sample k562_ctcf_chr1 --mode sea --motif CTCF_HUMAN.H11MO.0.A
This will generate "step6_tfpeaks_sea" directory with includes "sample" (in this example, k562_ctcf_chr1).
Each "sample" directory has peaks, D&D edits, and footprint plots for TF binding peaks and background peaks. Users can define window size to collect D&D edits or the number of peaks to sample in footprinting plotting.
step6_tfpeaks_sea
└── dndseq_ctcf_chr1
├── background.countNormalized.txt
├── background.countPerPeak.txt
├── context_di
├── context_tri
├── dnd_countNormalized.pdf
├── dnd_countNormalized_SNR.pdf
├── motif_bkgd.narrowPeak
├── motif_bkgd.narrowPeak.CT_GA.vcf
├── motif_bkgd.narrowPeak.nonDnD.vcf
├── motif_bkgd.narrowPeak.vcf
├── motif_bkgd.narrowPeak.width200
├── motif_bkgd.narrowPeak.width200.CT_GA.vcf
├── motif_bkgd.narrowPeak.width200.nonDnD.vcf
├── motif_bkgd.narrowPeak.width200.vcf
├── motif_bkgd.narrowPeak.width200.vcf.bed
├── motif_bkgd.narrowPeak.width200.vcf.CT_+-5bp.bed
├── motif_bkgd.narrowPeak.width200.vcf.CT_+-5bp.fasta
├── motif_bkgd.narrowPeak.width200.vcf.GA_+-5bp.bed
├── motif_bkgd.narrowPeak.width200.vcf.GA_+-5bp.fasta
├── motif_bkgd.narrowPeak.width200.vcf.GA_+-5bp_revComp.fasta
├── motif_bkgd.unflt.narrowPeak
├── motifCenteredCounts
├── motifCenteredCounts_nonDnD
├── motif_TF.narrowPeak
├── motif_TF.narrowPeak.CT_GA.vcf
├── motif_TF.narrowPeak.motifCentered.width200
├── motif_TF.narrowPeak.motifCentered.width200.CT_GA.vcf
├── motif_TF.narrowPeak.motifCentered.width200.nonDnD.vcf
├── motif_TF.narrowPeak.motifCentered.width200.vcf
├── motif_TF.narrowPeak.motifCentered.width200.vcf.bed
├── motif_TF.narrowPeak.motifCentered.width200.vcf.CT_+-5bp.bed
├── motif_TF.narrowPeak.motifCentered.width200.vcf.CT_+-5bp.fasta
├── motif_TF.narrowPeak.motifCentered.width200.vcf.GA_+-5bp.bed
├── motif_TF.narrowPeak.motifCentered.width200.vcf.GA_+-5bp.fasta
├── motif_TF.narrowPeak.motifCentered.width200.vcf.GA_+-5bp_revComp.fasta
├── motif_TF.narrowPeak.nonDnD.vcf
├── motif_TF.narrowPeak.vcf
├── motif_TF.narrowPeak.width200
├── motif_TF.narrowPeak.width200.CT_GA.vcf
├── motif_TF.narrowPeak.width200.nonDnD.vcf
├── motif_TF.narrowPeak.width200.vcf
├── motif_TF.unflt.narrowPeak
├── resizedPeakCounts
├── resizedPeakCounts_nonDnD
├── target.countNormalized.txt
└── target.countPerPeak.txt
Part 4: Load D&D edits
$ dnd-pt4 -h
usage: [-h] -v VCF [--bulk] [--ref REF] [--alt ALT] [--left LEFT] [--right RIGHT] -p PEAK -b BAM [--flt-bam] -o OUTPUT
optional arguments:
-h, --help show this help message and exit
-v VCF, --Vcf VCF input vcf file
--bulk (*optional) specify if the sample is bulk sample; default is NO
--ref REF (*optional) reference allele, rev.comp. is automatically considered; default is C
--alt ALT (*optional) altered allele, rev.comp. is automatically considered; default is T
--left LEFT (*optional) ignore variants in this many bases on the left side of reads; default is 0
--right RIGHT (*optional) ignore variants in this many bases on the right side of reads; default is 0
-p PEAK, --Peak PEAK input peak file
-b BAM, --Bam BAM input bam file
--flt-bam (*optional) filter <-b/--Bam> using variants; default is NOT FILTERING
-o OUTPUT, --Output OUTPUT
output directory path
$ dnd-pt4 -v <D&D edit in vcf> -p <peaks with motif in bed> -b <bam file> -o <output path>
test_edits/
├── dndedits.fragments.editCount.txt
├── dndedits.fragments.txt
├── dndedits.qnames.txt
├── dndedits.txt
├── fragments.tsv.gz
├── fragments.tsv.gz.tbi
└── mtx
├── barcodes.tsv
├── matrix.mtx
└── peaks.bed
Now, D&D edits can be loaded in R using mtx and fragments file. For example,
require(magrittr)
require(readr)
require(Matrix)
require(tidyr)
require(dplyr)
require(Signac)
require(Seurat)
features <- readr::read_tsv("mtx/peaks.bed", col_names = F) %>% tidyr::unite(feature)
barcodes <- readr::read_tsv("mtx/barcodes.tsv", col_names = F) %>% tidyr::unite(barcode)
counts <- Matrix::readMM("mtx/matrix.mtx") %>% magrittr::set_rownames(features$feature) %>%
magrittr::set_colnames(barcodes$barcode)
chrom_assay <- Signac::CreateChromatinAssay(counts = counts, sep = c("_", "-"),
fragments = "fragments.tsv.gz", min.cells = 0, min.features = 0)
dndObj <- Seurat::CreateSeuratObject(counts = chrom_assay, project = "D&D", assay = "dnd")
head(dndObj@meta.data)
# orig.ident nCount_dnd nFeature_dnd
# AAACGAAAGACCGCAA-1 D&D 7 6
# AAACGAAAGGATGTAT-1 D&D 15 14
# AAACGAAAGTCACGCC-1 D&D 5 5
# AAACGAACAGGTTATC-1 D&D 32 27
# AAACGAATCTACATCT-1 D&D 31 28
# AAACTCGAGTTACCAC-1 D&D 11 9
# nCount_dnd = the number of D&D edits
# nFeature_dnd = the number of edited peaks
Raw data is available at Gene Expression Omnibus (GSE296495)


