No installation required.
- Clone the git repo
- Make the directory available as PATH
Alternatively
- copy the scripts to a directory
- Make the directory available as PATH
Let's say you have the directory /home/user/git-repo/pcliptools
which contains all the files in the following manner
.
├── bin
│ ├── bedGraphToBigWig # borrowed from Kent utilities
│ ├── pcliptools-align
│ ├── pcliptools-annotate
│ ├── pcliptools-cluster
│ ├── pcliptools-compare-bed
│ ├── pcliptools-compute-substitution
│ ├── pcliptools-functions
│ ├── pcliptools-plot-substitution.R
│ └── pcliptools-test-hypothesis.R
├── LICENSE
└── README.md
add the following line to your .bashrc or .zshrc
export PATH="$PATH:/home/user/git-repo/pcliptools/bin"
Typically your .bashrc and .zshrc remain as a hidden file in /home/user
Therefore you can use Vim or nano to edit the file /home/user/.bashrc or /home/user/.zshrc
pcliptools-align performs alignment and pcliptools-cluster calls the peak.
Virtually any version (latest are preferred) of the following tools should work.
pcliptools-align needs the following programs:
- cutadapt (tested with v4.5 and 4.9)
- STAR (tested with v2.7.9a and 2.7.11b)
- samtools (tested with v1.20+)
pcliptools-cluster needs the following programs:
- samtools (tested with v1.20+)
- bedtools (tested with 2.31+)
- R (tested with v4.4.1)
- library dplyr (tested with v1.1.4 )
Additional unix commands require are awk, sed, sort, join etc.
All of these softwares ae supposed to be installed in a standard HPC.
For alignment pcliptools relies on STAR.
It also utilizes cutadapt for adapter trimming
Alignemnt can be done with minimal code
pcliptools-align \
-i INPUT_FASTQ \
-n EXP_ID \
-M 2 \
-g STAR_INDEX \
-f GENOME_FASTA The full help menu is given below:
pcliptools-align
[ -i INPUT_FILE ] # Input FASTQ file
[ -n EXP_NAME ] # Experiment name
[ -m MULTIMAP_N : 10 ] # Max multimapped reads per location
[ -M MISMATCH_NMAX : 4 ] # Max mismatches allowed
[ -w WIN_ANCHOR : 300 ] # Anchor window size for alignment
[ -s SEED_SEARCH : 20 ] # Seed search length
[ -t USE_THREADS : 16 ] # Number of threads
[ -R USE_MEMORY : 32 ] # Max RAM (GB)
[ -l MIN_READLENGTH : 13 ] # Minimum read length to keep
[ -u UMI_LENGTH : 4 ] # UMI length
[ -F ADAPTER5 : GTTCAGAGTTCTACAGTCCGACGATC ] # 5' adapter
[ -T ADAPTER3 : NNTGACTGTGGAATTCTCGGGTGCCAAGG ] # 3' adapter
[ -g STAR_INDEX ] # Path to STAR index
[ -f GENOME_FASTA ] # Reference genome (FASTA)
Hallmark of PAR-CLIP is T-to-C mismatches. User can take the alignment file and
with the commad line tool pcliptools-compute-substitution different mismatches
will be provided as barplot. It will use pcliptools-plot-substitution.R in the
backend to render the plot.
A minimal example would be:
pcliptools-compute-substitution \
-b INPUT_BAM \
-t "Plot Title" \
-d MIN_READ_DEPTH \
-f GENOME_FASTA
pcliptools-cluster calls the cluster with aligned BAM file and supplied genime fasta and annotations.
A minimal example would be:
pcliptools-cluster \
-b aligned_reads.bam \
-n protein_X \
-g Homo_sapiens.GRCh38.112.chr.gtf.gz \ # ENSEMBL GTF
-f reference-genome.fa
Full help menu
pcliptools-cluster
[ -b INPUT_BAM ] # Input BAM file (sorted and indexed)
[ -n INPUT_NAME ] # Output prefix for cluster files
[ -t USE_THREADS : 16 ] # Number of threads to use
[ -m USE_MEMORY : 16 ] # Memory limit in GB
[ -g ENSEMBL_GTF ] # Ensembl-style GTF file for annotation
[ -f GENOME_FASTA ] # Reference genome FASTA file
[ -d MIN_READ_DEPTH : 3 ] # Minimum read depth to form a cluster
[ -w MIN_CLUSTER_WIDTH : 11 ] # Minimum cluster width in base pairs
[ -W MAX_CLUSTER_WIDTH : 50 ] # Maximum cluster width in base pairs
[ -q MINMAPQ : 55 ] # Minimum mapping quality (MAPQ) to include reads
[ -Q MINBASEQ : 20 ] # Minimum base quality (Phred) at cluster positions
Upon successful run of pcliptools-cluster, it generates the following files
.forward.bedg.gz # stores all mismatch information from forward strand
.reverse.bedg.gz # stores all mismatch information from reverse strand
.mismatch-metrics.tsv # mismatch metrics and read depth, contains both strands.
.readgroup-stat.tsv # results of statistical tests performed on `.mismatch-metrics.tsv`
.readgroup-annotation.tsv # intermediate file for annotation of read groups.
.resulting-metrics.tsv # final metrics, contains all statistical results and annotations.
consists of following columns:
First 6 files are for standard bed files.
1 chr
2 start
3 end
4 group_name # an arbitary name assigned to the group
5 read_depth # read depth for the read group
6 strand
7 total_coverage # number of bases covered
8 count_T2C # count of T-to-C
9 count_nonT2C # count of mismatches other than T-to-C
10 loc_T2C # number of unique positions observed T-to-C
11 pois_prob_T2C # Poisson probability for T-to-C
12 pois_prob_nonT2C # Poisson probability for non-T-to-C
13 pois_test # p values for poisson test
14 FDR # False Discovery Rate
15 annotation # genomic region
16 gene_name # genetric names of gene
17 gene_id # ENSEMBL gene id
.confident-filtered-clusters.bed # filtering .resulting-metrics.tsv with FDR <= 0.05
Note on Genome fasta, index and GTF:
For alignment purpose pcliptools uses UCSC reference genome i.e. for hg38 (both fasta and STAR index) however, for annotation purpose it uses ENSEMBL annotation.
Both GENCODE and ENSEMBL provides most comporehensive GTF but, only ENSEMBL provided 3' and 5' orientation of UTRs. That's why to reduce operational time fixing the direction of UTRs, we used ENSEMBL annotation.
Those who are familiar with the formats can foresee the trouble: UCSC uses chr1, chr2 etc naming convention for chromosome but ENSEMBL uses 1, 2, 3 etc.
pcliptools-annotate internaly take care of the differences accross formats. Which means 1, 2, 3 format for chromosome in ENSEMBL are internally converted to chr1, chr2 respectively and the formats are made comparable.
Therefore it is intentional to align and call peak with UCSC reference genomes but ONLY for the annotation ENSEMBL GTF is used (v0.7.3). Although v0.7.4 will start using GENCODE GFF3.
This module takes the intermediate bedgraph files of T-to-C mismatches and create bigwig file of T-to-C mismatches.
This needs additional executable bedGraphToBigWig from Kent utilities
The help menu woule be:
pcliptools-compute-bigwig
[ -f FORWARD_BEDG ] # Forward strand bedGraph.gz file
[ -r REVERSE_BEDG ] # Reverse strand bedGraph.gz file
[ -n EXP_ID ] # Experiment ID or output prefix
[ -g GENOME_SIZE ] # Genome size file (tab-separated: chr \t size)
[ -t TYPE ] # Type of data: T2C or nonT2C
The resulting bigwig file normalize the T-to-C count to CPM value. Some external tools i.e. deepTools can utilize the bigwogs to illustrate informative graphs and plots.
For comparing two-three bed file to find out the overlap among them or how mane genes are common in between pcliptools-compare-bed
is used. This is an extra utility that requires additional dependencies i.e. python, matplotlib, matplotlib-venn, pybedtools
A minimal example would be:
# comparison of corrodinates
pcliptools-compare-bed \
-i bed1.bed bed2.bed bed3.bed \
-l label1 label2 label3 \
-s 50 \
-g hg38 \
-t "Cluster Overlap +/- 50 base" \
-o overlap_venn_slop-50nt.pdf
and
# comparison of genes
pcliptools-compare-bed \
-i bed1.bed bed2.bed bed3.bed \
-l label1 label2 label3 \
-g hg38 \
--gene-name \
--gene-column 8 \
--drop-multi \
-t "Target Gene Overlap " \
-o overlap_of_gene_venn.pdfFull help menu would be
usage: pcliptools-compare-bed [-h] -i INPUT_BEDS [INPUT_BEDS ...]
[-l LABELS [LABELS ...]] [-s SLOP]
[-g GENOME] [--gene-name] [--drop-multi]
[--show-percentage] [--gene-column GENE_COLUMN]
[-c COLORS [COLORS ...]] [-t TITLE]
[-f FONT_SIZE] [-o OUT_NAME] [-v]
Make a venn diagram with provided bed files
optional arguments:
-h, --help show this help message and exit
-i INPUT_BEDS [INPUT_BEDS ...], --input-beds INPUT_BEDS [INPUT_BEDS ...], --beds INPUT_BEDS [INPUT_BEDS ...]
Inut bed files, seperated by space. (default: None)
-l LABELS [LABELS ...], --labels LABELS [LABELS ...]
Labels/Names bed files, seperated by space. (default: None)
-s SLOP, --slop SLOP how mane bases to slop/extend each side. (default : 20)
-g GENOME, --genome GENOME
which genome is used. (default : hg38)
--gene-name make venn just by the gene names instead of coodrdinates (default: False)
--drop-multi drop the multigenic region combined by comma (default: False)
--show-percentage show percentage of overlap (default: False)
--gene-column GENE_COLUMN
column index for gene names in the file. (default : 8))
-c COLORS [COLORS ...], --colors COLORS [COLORS ...]
Name of the colors (default: None)
-t TITLE, --title TITLE
title of the figure (default: None)
-f FONT_SIZE, --font-size FONT_SIZE
font size. (default : 15)
-o OUT_NAME, --out-name OUT_NAME
out figure names. (default : venn.pdf)
-v, --version show program's version number and exit