Skip to content

paulahsan/pcliptools

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

76 Commits
 
 
 
 
 
 

Repository files navigation

pcliptools

Installation

No installation required.

  1. Clone the git repo
  2. Make the directory available as PATH

Alternatively

  1. copy the scripts to a directory
  2. Make the directory available as PATH

Let's say you have the directory /home/user/git-repo/pcliptools which contains all the files in the following manner

.
├── bin
│   ├── bedGraphToBigWig # borrowed from Kent utilities
│   ├── pcliptools-align
│   ├── pcliptools-annotate
│   ├── pcliptools-cluster
│   ├── pcliptools-compare-bed
│   ├── pcliptools-compute-substitution
│   ├── pcliptools-functions
│   ├── pcliptools-plot-substitution.R
│   └── pcliptools-test-hypothesis.R
├── LICENSE
└── README.md

add the following line to your .bashrc or .zshrc export PATH="$PATH:/home/user/git-repo/pcliptools/bin"

Typically your .bashrc and .zshrc remain as a hidden file in /home/user Therefore you can use Vim or nano to edit the file /home/user/.bashrc or /home/user/.zshrc

pcliptools-align performs alignment and pcliptools-cluster calls the peak.

Dependencies for pcliptools

Virtually any version (latest are preferred) of the following tools should work.

pcliptools-align needs the following programs:

  • cutadapt (tested with v4.5 and 4.9)
  • STAR (tested with v2.7.9a and 2.7.11b)
  • samtools (tested with v1.20+)

pcliptools-cluster needs the following programs:

  • samtools (tested with v1.20+)
  • bedtools (tested with 2.31+)
  • R (tested with v4.4.1)
    • library dplyr (tested with v1.1.4 )

Additional unix commands require are awk, sed, sort, join etc.

All of these softwares ae supposed to be installed in a standard HPC.

Alignment wilth pcliptools-align

For alignment pcliptools relies on STAR. It also utilizes cutadapt for adapter trimming Alignemnt can be done with minimal code

pcliptools-align \ 
    -i INPUT_FASTQ \
    -n EXP_ID \
    -M 2 \
    -g STAR_INDEX \
    -f GENOME_FASTA 

The full help menu is given below:

pcliptools-align
    [ -i INPUT_FILE             ]   # Input FASTQ file
    [ -n EXP_NAME               ]   # Experiment name
    [ -m MULTIMAP_N : 10        ]   # Max multimapped reads per location
    [ -M MISMATCH_NMAX : 4      ]   # Max mismatches allowed
    [ -w WIN_ANCHOR : 300       ]   # Anchor window size for alignment
    [ -s SEED_SEARCH : 20       ]   # Seed search length
    [ -t USE_THREADS : 16       ]   # Number of threads
    [ -R USE_MEMORY : 32        ]   # Max RAM (GB)
    [ -l MIN_READLENGTH : 13    ]   # Minimum read length to keep
    [ -u UMI_LENGTH : 4         ]   # UMI length
    [ -F ADAPTER5 : GTTCAGAGTTCTACAGTCCGACGATC ]   # 5' adapter
    [ -T ADAPTER3 : NNTGACTGTGGAATTCTCGGGTGCCAAGG ] # 3' adapter
    [ -g STAR_INDEX             ]   # Path to STAR index
    [ -f GENOME_FASTA           ]   # Reference genome (FASTA)

QC for T to C mismatch

Hallmark of PAR-CLIP is T-to-C mismatches. User can take the alignment file and with the commad line tool pcliptools-compute-substitution different mismatches will be provided as barplot. It will use pcliptools-plot-substitution.R in the backend to render the plot.

A minimal example would be:

pcliptools-compute-substitution \
    -b INPUT_BAM \
    -t "Plot Title" \
    -d MIN_READ_DEPTH \
    -f GENOME_FASTA

Peak calling to identify clusters of RNA-RBP interaction

pcliptools-cluster calls the cluster with aligned BAM file and supplied genime fasta and annotations. A minimal example would be:

pcliptools-cluster \
    -b aligned_reads.bam \
    -n protein_X \
    -g Homo_sapiens.GRCh38.112.chr.gtf.gz \ # ENSEMBL GTF
    -f reference-genome.fa

Full help menu

pcliptools-cluster
    [ -b INPUT_BAM              ]   # Input BAM file (sorted and indexed)
    [ -n INPUT_NAME             ]   # Output prefix for cluster files
    [ -t USE_THREADS : 16       ]   # Number of threads to use
    [ -m USE_MEMORY : 16        ]   # Memory limit in GB
    [ -g ENSEMBL_GTF            ]   # Ensembl-style GTF file for annotation
    [ -f GENOME_FASTA           ]   # Reference genome FASTA file
    [ -d MIN_READ_DEPTH : 3     ]   # Minimum read depth to form a cluster
    [ -w MIN_CLUSTER_WIDTH : 11 ]   # Minimum cluster width in base pairs
    [ -W MAX_CLUSTER_WIDTH : 50 ]   # Maximum cluster width in base pairs
    [ -q MINMAPQ : 55           ]   # Minimum mapping quality (MAPQ) to include reads
    [ -Q MINBASEQ : 20          ]   # Minimum base quality (Phred) at cluster positions

Description of the output files

Upon successful run of pcliptools-cluster, it generates the following files

.forward.bedg.gz # stores all mismatch information from forward strand
.reverse.bedg.gz # stores all mismatch information from reverse strand

.mismatch-metrics.tsv # mismatch metrics and read depth, contains both strands.
.readgroup-stat.tsv # results of statistical tests performed on `.mismatch-metrics.tsv`
.readgroup-annotation.tsv # intermediate file for annotation of read groups.

.resulting-metrics.tsv # final metrics, contains all statistical results and annotations.
consists of following columns:
First 6 files are for standard bed files.
	1 chr
	2 start
	3 end
	4 group_name # an arbitary name assigned to the group
	5 read_depth # read depth for the read group
	6 strand
	7 total_coverage # number of bases covered
	8 count_T2C # count of T-to-C
	9 count_nonT2C # count of mismatches other than T-to-C
	10 loc_T2C # number of unique positions observed T-to-C
	11 pois_prob_T2C # Poisson probability for T-to-C
	12 pois_prob_nonT2C # Poisson probability for non-T-to-C
	13 pois_test # p values for poisson test
	14 FDR # False Discovery Rate
	15 annotation # genomic region
	16 gene_name # genetric names of gene
	17 gene_id # ENSEMBL gene id

.confident-filtered-clusters.bed # filtering .resulting-metrics.tsv with FDR <= 0.05

Note on Genome fasta, index and GTF:

For alignment purpose pcliptools uses UCSC reference genome i.e. for hg38 (both fasta and STAR index) however, for annotation purpose it uses ENSEMBL annotation.

Both GENCODE and ENSEMBL provides most comporehensive GTF but, only ENSEMBL provided 3' and 5' orientation of UTRs. That's why to reduce operational time fixing the direction of UTRs, we used ENSEMBL annotation.

Those who are familiar with the formats can foresee the trouble: UCSC uses chr1, chr2 etc naming convention for chromosome but ENSEMBL uses 1, 2, 3 etc. pcliptools-annotate internaly take care of the differences accross formats. Which means 1, 2, 3 format for chromosome in ENSEMBL are internally converted to chr1, chr2 respectively and the formats are made comparable.

Therefore it is intentional to align and call peak with UCSC reference genomes but ONLY for the annotation ENSEMBL GTF is used (v0.7.3). Although v0.7.4 will start using GENCODE GFF3.

Compute BigWig of T-to-C mismatch

This module takes the intermediate bedgraph files of T-to-C mismatches and create bigwig file of T-to-C mismatches. This needs additional executable bedGraphToBigWig from Kent utilities The help menu woule be:

pcliptools-compute-bigwig
    [ -f FORWARD_BEDG     ]   # Forward strand bedGraph.gz file
    [ -r REVERSE_BEDG     ]   # Reverse strand bedGraph.gz file
    [ -n EXP_ID           ]   # Experiment ID or output prefix
    [ -g GENOME_SIZE      ]   # Genome size file (tab-separated: chr \t size)
    [ -t TYPE             ]   # Type of data: T2C or nonT2C

The resulting bigwig file normalize the T-to-C count to CPM value. Some external tools i.e. deepTools can utilize the bigwogs to illustrate informative graphs and plots.

Compare experiment

For comparing two-three bed file to find out the overlap among them or how mane genes are common in between pcliptools-compare-bed is used. This is an extra utility that requires additional dependencies i.e. python, matplotlib, matplotlib-venn, pybedtools

A minimal example would be:

# comparison of corrodinates
pcliptools-compare-bed \
    -i bed1.bed bed2.bed bed3.bed \
    -l label1 label2 label3 \
	-s 50 \
    -g hg38 \
    -t "Cluster Overlap +/- 50 base" \
    -o overlap_venn_slop-50nt.pdf

and

# comparison of genes
pcliptools-compare-bed \
    -i bed1.bed bed2.bed bed3.bed \
    -l label1 label2 label3 \
    -g hg38 \
	--gene-name \
	--gene-column 8 \
	--drop-multi \
    -t "Target Gene Overlap " \
    -o overlap_of_gene_venn.pdf

Full help menu would be

usage: pcliptools-compare-bed [-h] -i INPUT_BEDS [INPUT_BEDS ...]
                              [-l LABELS [LABELS ...]] [-s SLOP]
                              [-g GENOME] [--gene-name] [--drop-multi]
                              [--show-percentage] [--gene-column GENE_COLUMN]
                              [-c COLORS [COLORS ...]] [-t TITLE]
                              [-f FONT_SIZE] [-o OUT_NAME] [-v]

Make a venn diagram with provided bed files

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_BEDS [INPUT_BEDS ...], --input-beds INPUT_BEDS [INPUT_BEDS ...], --beds INPUT_BEDS [INPUT_BEDS ...]
                        Inut bed files, seperated by space. (default: None)
  -l LABELS [LABELS ...], --labels LABELS [LABELS ...]
                        Labels/Names bed files, seperated by space. (default: None)
  -s SLOP, --slop SLOP  how mane bases to slop/extend each side. (default : 20)
  -g GENOME, --genome GENOME
                        which genome is used. (default : hg38)
  --gene-name           make venn just by the gene names instead of coodrdinates (default: False)
  --drop-multi          drop the multigenic region combined by comma (default: False)
  --show-percentage     show percentage of overlap (default: False)
  --gene-column GENE_COLUMN
                        column index for gene names in the file. (default : 8))
  -c COLORS [COLORS ...], --colors COLORS [COLORS ...]
                        Name of the colors (default: None)
  -t TITLE, --title TITLE
                        title of the figure (default: None)
  -f FONT_SIZE, --font-size FONT_SIZE
                        font size. (default : 15)
  -o OUT_NAME, --out-name OUT_NAME
                        out figure names. (default : venn.pdf)
  -v, --version         show program's version number and exit

About

A new peak caller and pipeline for PAR-CLIP data analysis.

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Contributors