146 changes: 146 additions & 0 deletions post_processing/TPMCalculator/README.md
@@ -0,0 +1,146 @@

# mRNA Quantification using TPMCalculator

<a id="introduction"></a>
## Introduction

This pipeline quantifies mRNA abundance directly from BAM file alignments using **TPMCalculator** and generates visualizations to assess the results. It is designed to process multiple samples and combine results into comprehensive matrices for downstream analysis.

The pipeline is organized into four scripts, each responsible for a different stage of processing and analysis. These scripts require access to a **SLURM** job manager for submitting and running the workflow.

## Table of Contents
* [Introduction](#introduction)
* [Installation](#installation)
* [Environment](#environment)
* [TPMCalculator](#tpmcalc)
* [GFF3 to GTF](#gff3_to_gtf)
* [Pipeline Overview](#pipeline_overview)
* [Recommended Directory Structure](#dir_struc)

<a id="installation"></a>
## Installation
This repository can be cloned locally by running the following `git` command:
```bash
git clone https://github.com/baliga-lab/Global_Search.git
```
Please note that Git is required to run the above command. For instructions on installing Git, please see [the Git Guide](https://github.com/git-guides/install-git).

<a id="environment"></a>
### Environment

#### Conda
This application is built on top of multiple Python packages with specific version requirements. We recommend using `conda` to create an isolated Python environment with all necessary packages. If you do not have it installed, follow the [Miniconda Installation Instructions](https://www.anaconda.com/docs/getting-started/miniconda/install). The list of necessary packages can be found in the [`environment.yml`](./environment.yml) file.

To create the specified `gs-coral-env` Conda environment, run the following command:
```bash
conda env create -f environment.yml
```

Once the Conda environment is created, it can be activated by:
```bash
conda activate gs-coral-env
```
When you are finished working in the environment, deactivate it with:
```bash
conda deactivate
```

**Please note: this environment file does not contain the packages required to run the whole Global_Search pipeline. It includes only the packages needed to run the scripts in the TPMCalculator section.** Installation instructions for the Global_Search pipeline can be found in the main [README](../../README.md).

<a id="tpmcalc"></a>
## TPMCalculator

TPMCalculator requires two inputs:
- A **GTF annotation file** describing gene models.
- **BAM alignment files** for each sample.

It produces four output files per sample containing TPM values and raw read counts for:
- **Genes**
- **Transcripts**
- **Exons**
- **Introns**
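
For orientation, a single-sample run could be assembled as in the sketch below. It is not taken from the pipeline scripts: `-g` (GTF file) and `-b` (BAM file) are the flags listed in TPMCalculator's usage text, while the helper name `build_tpmcalculator_cmd` and both file paths are hypothetical placeholders.

```python
import shlex


def build_tpmcalculator_cmd(gtf_file, bam_file):
    """Return the argument list for one TPMCalculator run (hypothetical helper)."""
    # -g: GTF annotation file, -b: BAM alignment file for a single sample.
    return ["TPMCalculator", "-g", str(gtf_file), "-b", str(bam_file)]


# Placeholder paths, used only for illustration.
cmd = build_tpmcalculator_cmd(
    "reference/organism.gtf",
    "sample_1/Aligned.sortedByCoord.out.bam",
)

# Print the command as it would be typed in a shell or written to a job script.
print(shlex.join(cmd))
```

Building the command as a list keeps paths with spaces intact if it is later passed to `subprocess.run`.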

<a id="gff3_to_gtf"></a>
## GFF3 to GTF

A GFF3 file can be converted to GTF in several ways; this pipeline uses a custom script. The [`gff3_to_gtf.py`](./scripts/gff3_to_gtf.py) script:
- Reads a GFF3 file
- Converts `gene`, `mRNA/transcript`, `exon`, and `CDS` features into GTF format
- Constructs valid `gene_id` and `transcript_id` fields
- Writes a new GTF file

Run the following command to convert a GFF3 file to a GTF file:
```bash
python3 gff3_to_gtf.py <input_gff3> <output_gtf>
```
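
The conversion described above can be illustrated with a minimal sketch. This is not the contents of `gff3_to_gtf.py`. It assumes that parent features appear before their children, that every exon or CDS has a single transcript parent, and that `mRNA` features are written out as `transcript`. The function names are hypothetical.

```python
def parse_gff3_attributes(field):
    """Split a GFF3 attribute column such as 'ID=x;Parent=y' into a dict."""
    attrs = {}
    for item in field.strip().split(";"):
        if "=" in item:
            key, value = item.split("=", 1)
            attrs[key.strip()] = value.strip()
    return attrs


def gff3_to_gtf_lines(gff3_lines):
    """Yield GTF lines for gene, mRNA/transcript, exon, and CDS features."""
    transcript_to_gene = {}  # transcript ID -> gene ID, filled as transcripts are seen
    for line in gff3_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines, directives, and comments
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9:
            continue  # not a well-formed feature line
        attrs = parse_gff3_attributes(cols[8])
        feature = cols[2]
        if feature == "gene":
            gene_id = attrs["ID"]
            gtf_attrs = f'gene_id "{gene_id}";'
        elif feature in ("mRNA", "transcript"):
            gene_id, transcript_id = attrs["Parent"], attrs["ID"]
            transcript_to_gene[transcript_id] = gene_id
            cols[2] = "transcript"  # assumption: normalize mRNA to transcript
            gtf_attrs = f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'
        elif feature in ("exon", "CDS"):
            transcript_id = attrs["Parent"]
            # Fall back to the transcript ID if its gene was never seen.
            gene_id = transcript_to_gene.get(transcript_id, transcript_id)
            gtf_attrs = f'gene_id "{gene_id}"; transcript_id "{transcript_id}";'
        else:
            continue  # other feature types are dropped in this sketch
        yield "\t".join(cols[:8] + [gtf_attrs])


# Tiny made-up annotation, for illustration only.
example = [
    "##gff-version 3",
    "chr1\tsrc\tgene\t1\t100\t.\t+\t.\tID=g1",
    "chr1\tsrc\tmRNA\t1\t100\t.\t+\t.\tID=t1;Parent=g1",
    "chr1\tsrc\texon\t1\t50\t.\t+\t.\tID=e1;Parent=t1",
]
gtf = list(gff3_to_gtf_lines(example))
print("\n".join(gtf))
```

The first eight columns are shared by both formats, so only the ninth (attribute) column needs to be rebuilt.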

<a id="pipeline_overview"></a>
## Pipeline Overview

The scripts should be run in the following order:

0. **gff3_to_gtf.py**
    - See instructions above.

1. **check_bam_sorted.py**
    - **Purpose:** Verify that all BAM files are correctly sorted and log their sort order.
    - **Inputs:**
        - `base_dir`: Path to the folder containing BAM files. BAM files can be nested within sample directories.
    - **Outputs:**
        - TSV file logging the sort order of each BAM file (`bam_sort_status_<sample>.tsv`).

2. **generate_tpmcalculator_results.py**
    - **Purpose:** Generate TPMCalculator results for each sample and create SLURM job scripts to run the analyses.
    - **Inputs:**
        - `base_dir`: Path to the folder containing BAM files. BAM files may be nested.
        - `gtf_file`: Path to the reference GTF annotation file.
        - `tpmcalculator_dir`: Desired output directory for TPMCalculator results.
    - **Outputs:**
        - TPMCalculator output files for each sample (genes, transcripts, exons, introns).
        - SLURM job scripts for each sample (`*_tpmcalculator_job.sh`).

3. **combine_tpmcalculator_results.py**
    - **Purpose:** Combine TPMCalculator results from all samples into unified TPM and reads matrices.
    - **Inputs:**
        - `base_dir`: Folder containing all TPMCalculator outputs for all samples. Output files may be nested.
        - `output_dir`: Desired output directory for combined matrices.
    - **Outputs:**
        - TPM matrix (`STAR_TPMcalculator_<project>_TPMs_Matrix_Merged.csv`).
        - Reads matrix (`STAR_TPMcalculator_<project>_Reads_Matrix_Merged.csv`).

4. **analyze_organism_totals.py**
    - **Purpose:** Summarize total TPMs and reads by organism (e.g., Host, Symbiont species) and generate visualizations to verify results.
    - **Inputs:**
        - `base_dir`: Folder containing combined TPMCalculator results. This is the output directory.
        - `tpm_file`: Path to the merged TPM matrix.
        - `reads_file`: Path to the merged reads matrix.
    - **Outputs:**
        - Summary CSV files of TPM and reads totals by organism (`TPM_totals_by_organism.csv`, `Reads_totals_by_organism.csv`).
        - Barplots and summary figures (`organism_totals_barplots.png/pdf`, `organism_summary_stats.png/pdf`).
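
As an illustration of the sort check in step 1, the sketch below reads the sort order from a BAM header. It is not the code in `check_bam_sorted.py`. It relies on the SAM specification, where the `@HD` header line may carry an `SO:` tag (`unknown`, `unsorted`, `queryname`, or `coordinate`), and on `samtools view -H` to print the header. The function names are hypothetical.

```python
import subprocess


def read_bam_header(bam_path):
    """Return the header text of a BAM file using `samtools view -H`."""
    result = subprocess.run(
        ["samtools", "view", "-H", str(bam_path)],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def sort_order_from_header(header_text):
    """Return the SO value from the @HD line, or 'unknown' if it is absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
            break  # there is at most one @HD line
    return "unknown"  # the SAM specification's default when SO is missing


# Made-up header text, so the example runs without samtools or a BAM file.
example_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000"
order = sort_order_from_header(example_header)
print(order)
```

With samtools on the `PATH`, `sort_order_from_header(read_bam_header(path))` would report the order for a real file.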

<a id="dir_struc"></a>
## Recommended Directory Structure

For clarity and organization, the pipeline outputs are recommended to be structured as follows:
```
├── sample_organism_name/
│   ├── combined_results/
│   │   ├── organism_analysis/
│   │   │   ├── <summary visualization files>
│   │   │   ├── ...
│   │   ├── STAR_TPMCalculator_<sample_organism_name>_Reads_Matrix_Merged.csv
│   │   ├── STAR_TPMCalculator_<sample_organism_name>_TPMs_Matrix_Merged.csv
│   ├── tpmcalculator_output/
│   │   ├── sample_1/
│   │   │   ├── <TPMCalculator output files>
│   │   │   ├── ...
│   │   ├── sample_2/
│   │   │   ├── <TPMCalculator output files>
│   │   │   ├── ...
│   │   ├── sample_.../
│   │   │   ├── <TPMCalculator output files>
│   │   │   ├── ...
│   ├── reference/
│   │   ├── <sample_organism_name>.gff3 or gff
│   │   ├── <sample_organism_name>.gtf
```
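
If it helps, the empty folder skeleton can be created before running the pipeline. The sketch below is one way to do so under stated assumptions: the helper `make_layout` is hypothetical and not part of the repository, and the folder names are copied from the tree above.

```python
import tempfile
from pathlib import Path


def make_layout(root, organism, samples):
    """Create the recommended folder tree under root/organism (hypothetical helper)."""
    base = Path(root) / organism
    (base / "combined_results" / "organism_analysis").mkdir(parents=True, exist_ok=True)
    (base / "reference").mkdir(parents=True, exist_ok=True)
    for sample in samples:
        (base / "tpmcalculator_output" / sample).mkdir(parents=True, exist_ok=True)
    return base


# Demonstrate in a throwaway directory so nothing is written to the project.
with tempfile.TemporaryDirectory() as tmp:
    base = make_layout(tmp, "sample_organism_name", ["sample_1", "sample_2"])
    created = sorted(
        p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_dir()
    )

print(created)
```

Because `exist_ok=True` is set, rerunning the helper on an existing tree leaves it unchanged.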
217 changes: 217 additions & 0 deletions post_processing/TPMCalculator/environment.yml
@@ -0,0 +1,217 @@
name: gs-coral-env
channels:
- bioconda
- conda-forge
- https://repo.anaconda.com/pkgs/main
- https://repo.anaconda.com/pkgs/r
dependencies:
- _libgcc_mutex=0.1=conda_forge
- _openmp_mutex=4.5=2_gnu
- _r-mutex=1.0.1=anacondar_1
- agat=1.5.1=pl5321hdfd78af_0
- bamtools=2.5.3=he132191_0
- binutils_impl_linux-64=2.45=default_hfdba357_104
- bwidget=1.10.1=ha770c72_1
- bzip2=1.0.8=hda65f42_8
- c-ares=1.34.5=hb9d3cd8_0
- ca-certificates=2025.11.12=hbd8a1cb_0
- cairo=1.18.4=h3394656_0
- curl=8.16.0=h4e3cde8_0
- font-ttf-dejavu-sans-mono=2.37=hab24e00_0
- font-ttf-inconsolata=3.000=h77eed37_0
- font-ttf-source-code-pro=2.038=h77eed37_0
- font-ttf-ubuntu=0.83=h77eed37_3
- fontconfig=2.15.0=h7e30c49_1
- fonts-conda-ecosystem=1=0
- fonts-conda-forge=1=hc364b38_1
- freetype=2.14.1=ha770c72_0
- fribidi=1.0.16=hb03c661_0
- gcc_impl_linux-64=15.2.0=h6f0f26c_15
- gffread=0.12.7=h077b44d_6
- gfortran_impl_linux-64=15.2.0=h281d09f_15
- graphite2=1.3.14=hecca717_2
- gsl=2.7=he838d99_0
- gxx_impl_linux-64=15.2.0=hda75c37_15
- harfbuzz=12.2.0=h15599e2_0
- htslib=1.22.1=h566b1c6_0
- icu=75.1=he02047a_0
- jsoncpp=1.9.6=hf42df4d_1
- kernel-headers_linux-64=6.12.0=he073ed8_3
- keyutils=1.6.3=hb9d3cd8_0
- krb5=1.21.3=h659f571_0
- ld_impl_linux-64=2.45=default_hbd61a6d_104
- lerc=4.0.0=h0aef613_1
- libblas=3.11.0=4_h4a7cf45_openblas
- libcblas=3.11.0=4_h0358290_openblas
- libcurl=8.16.0=h4e3cde8_0
- libdb=6.2.32=h9c3ff4c_0
- libdeflate=1.24=h86f0d12_0
- libedit=3.1.20250104=pl5321h7949ede_0
- libev=4.33=hd590300_2
- libexpat=2.7.3=hecca717_0
- libffi=3.5.2=h9ec8514_0
- libfreetype=2.14.1=ha770c72_0
- libfreetype6=2.14.1=h73754d4_0
- libgcc=15.2.0=h767d61c_7
- libgcc-devel_linux-64=15.2.0=hcc6f6b0_115
- libgcc-ng=15.2.0=h69a702a_7
- libgfortran=15.2.0=h69a702a_15
- libgfortran-ng=15.2.0=h69a702a_15
- libgfortran5=15.2.0=h68bc16d_15
- libglib=2.86.2=h32235b2_0
- libgomp=15.2.0=h767d61c_7
- libiconv=1.18=h3b78370_2
- libjpeg-turbo=3.1.2=hb03c661_0
- liblapack=3.11.0=4_h47877c9_openblas
- liblzma=5.8.1=hb9d3cd8_2
- libnghttp2=1.67.0=had1ee68_0
- libopenblas=0.3.30=pthreads_h94d23a6_4
- libpng=1.6.53=h421ea60_0
- libsanitizer=15.2.0=h90f66d4_15
- libssh2=1.11.1=hcf80075_0
- libstdcxx=15.2.0=h8f9b012_7
- libstdcxx-devel_linux-64=15.2.0=hd446a21_115
- libstdcxx-ng=15.2.0=h4852527_7
- libtiff=4.7.1=h8261f1e_0
- libuuid=2.41.2=h5347b49_1
- libwebp-base=1.6.0=hd42ef1d_0
- libxcb=1.17.0=h8a09558_0
- libxcrypt=4.4.36=hd590300_1
- libzlib=1.3.1=hb9d3cd8_2
- make=4.4.1=hb9d3cd8_2
- ncurses=6.5=h2d0b736_3
- openssl=3.6.0=h26f9b46_0
- pango=1.56.4=hadf4263_0
- pcre2=10.46=h1321c63_0
- perl=5.32.1=7_hd590300_perl5
- perl-b-cow=0.007=pl5321hb9d3cd8_1
- perl-b-hooks-endofscope=0.28=pl5321ha770c72_1
- perl-base=2.23=pl5321hd8ed1ab_0
- perl-bioperl-core=1.7.8=pl5321hdfd78af_1
- perl-business-isbn=3.007=pl5321hd8ed1ab_0
- perl-business-isbn-data=20210112.006=pl5321hd8ed1ab_0
- perl-carp=1.50=pl5321hd8ed1ab_0
- perl-class-inspector=1.36=pl5321hdfd78af_0
- perl-class-load=0.25=pl5321ha770c72_2
- perl-class-load-xs=0.10=pl5321hb9d3cd8_0
- perl-class-methodmaker=2.25=pl5321h7b50bb2_1
- perl-clone=0.46=pl5321hb9d3cd8_1
- perl-compress-raw-bzip2=2.214=pl5321hda65f42_0
- perl-compress-raw-zlib=2.214=pl5321h4dac143_0
- perl-constant=1.33=pl5321hd8ed1ab_0
- perl-cpan-meta-check=0.014=pl5321ha770c72_0
- perl-data-dump=1.25=pl5321h7b50bb2_2
- perl-data-optlist=0.114=pl5321ha770c72_1
- perl-db_file=1.858=pl5321hb9d3cd8_0
- perl-devel-globaldestruction=0.14=pl5321ha770c72_0
- perl-devel-overloadinfo=0.007=pl5321ha770c72_1
- perl-devel-stacktrace=2.04=pl5321h296ab09_0
- perl-digest-hmac=1.05=pl5321hdfd78af_0
- perl-digest-md5=2.59=pl5321hb9d3cd8_3
- perl-dist-checkconflicts=0.11=pl5321ha770c72_0
- perl-encode=3.21=pl5321hb9d3cd8_1
- perl-encode-locale=1.05=pl5321hdfd78af_7
- perl-eval-closure=0.14=pl5321ha770c72_0
- perl-exporter=5.74=pl5321hd8ed1ab_0
- perl-exporter-tiny=1.002002=pl5321hd8ed1ab_0
- perl-extutils-makemaker=7.70=pl5321hd8ed1ab_0
- perl-file-listing=6.16=pl5321hdfd78af_0
- perl-file-pushd=1.016=pl5321ha770c72_0
- perl-file-share=0.25=pl5321hdfd78af_3
- perl-file-sharedir=1.118=pl5321hdfd78af_0
- perl-file-sharedir-install=0.14=pl5321hdfd78af_0
- perl-file-spec=3.48_01=pl5321hdfd78af_2
- perl-getopt-long=2.58=pl5321hdfd78af_0
- perl-graph=0.9735=pl5321hdfd78af_0
- perl-heap=0.80=pl5321hdfd78af_1
- perl-html-parser=3.81=pl5321h4ac6f70_1
- perl-html-tagset=3.24=pl5321hdfd78af_0
- perl-http-cookiejar-lwp=0.014=pl5321hdfd78af_0
- perl-http-cookies=6.11=pl5321hdfd78af_0
- perl-http-daemon=6.16=pl5321hdfd78af_0
- perl-http-date=6.06=pl5321hdfd78af_0
- perl-http-message=7.01=pl5321hdfd78af_0
- perl-http-negotiate=6.01=pl5321hdfd78af_4
- perl-inc-latest=0.500=pl5321ha770c72_0
- perl-io-html=1.004=pl5321hdfd78af_0
- perl-io-socket-ssl=2.075=pl5321hd8ed1ab_0
- perl-io-tty=1.20=pl5321hb9d3cd8_3
- perl-ipc-run=20250809.0=pl5321hdfd78af_0
- perl-libwww-perl=6.81=pl5321hdfd78af_0
- perl-list-moreutils=0.430=pl5321hdfd78af_0
- perl-list-moreutils-xs=0.430=pl5321h7b50bb2_5
- perl-lwp-mediatypes=6.04=pl5321hdfd78af_1
- perl-lwp-protocol-https=6.14=pl5321hdfd78af_1
- perl-mime-base64=3.16=pl5321hb9d3cd8_3
- perl-module-build=0.4234=pl5321ha770c72_1
- perl-module-implementation=0.09=pl5321ha770c72_1
- perl-module-load=0.34=pl5321hdfd78af_0
- perl-module-runtime=0.016=pl5321ha770c72_1
- perl-module-runtime-conflicts=0.003=pl5321ha770c72_0
- perl-moose=2.2207=pl5321hb9d3cd8_2
- perl-mozilla-ca=20250602=pl5321hdfd78af_0
- perl-mro-compat=0.15=pl5321ha770c72_0
- perl-namespace-clean=0.27=pl5321h296ab09_1
- perl-net-http=6.24=pl5321hdfd78af_0
- perl-net-ssleay=1.94=pl5321hf672d98_1
- perl-ntlm=1.09=pl5321hdfd78af_5
- perl-package-deprecationmanager=0.18=pl5321ha770c72_2
- perl-package-stash=0.40=pl5321ha770c72_2
- perl-package-stash-xs=0.30=pl5321hb9d3cd8_2
- perl-params-util=1.102=pl5321hb9d3cd8_1
- perl-parent=0.243=pl5321hd8ed1ab_0
- perl-pod-escapes=1.07=pl5321hdfd78af_2
- perl-regexp-common=2017060201=pl5321hd8ed1ab_0
- perl-safe=2.37=pl5321hdfd78af_2
- perl-scalar-list-utils=1.70=pl5321hb03c661_0
- perl-set-object=1.43=pl5321h7b50bb2_0
- perl-socket=2.027=pl5321h5c03b87_6
- perl-sort-naturally=1.03=pl5321hdfd78af_3
- perl-statistics-r=0.34=pl5321r44hdfd78af_7
- perl-storable=3.15=pl5321hb9d3cd8_2
- perl-sub-exporter=0.991=pl5321ha770c72_1
- perl-sub-exporter-progressive=0.001013=pl5321ha770c72_0
- perl-sub-identify=0.14=pl5321hb9d3cd8_2
- perl-sub-install=0.928=pl5321hd8ed1ab_0
- perl-sub-name=0.28=pl5321hb9d3cd8_1
- perl-term-progressbar=2.23=pl5321hdfd78af_0
- perl-test=1.26=pl5321hd8ed1ab_0
- perl-test-cleannamespaces=0.24=pl5321ha770c72_2
- perl-test-fatal=0.016=pl5321ha770c72_0
- perl-test-harness=3.44=pl5321hd8ed1ab_0
- perl-test-leaktrace=0.17=pl5321h7bf1ee8_1
- perl-test-needs=0.002009=pl5321hd8ed1ab_0
- perl-test-requiresinternet=0.05=pl5321hdfd78af_1
- perl-test-warnings=0.031=pl5321ha770c72_0
- perl-text-balanced=2.07=pl5321hdfd78af_0
- perl-text-wrap=2021.0814=pl5321hd8ed1ab_0
- perl-time-local=1.35=pl5321hdfd78af_0
- perl-timedate=2.33=pl5321hdfd78af_2
- perl-try-tiny=0.31=pl5321ha770c72_0
- perl-uri=5.34=pl5321ha770c72_0
- perl-url-encode=0.03=pl5321h9ee0642_1
- perl-variable-magic=0.64=pl5321hb9d3cd8_0
- perl-vars=1.03=pl5321hdfd78af_2
- perl-www-robotrules=6.02=pl5321hdfd78af_4
- perl-yaml=1.30=pl5321hdfd78af_0
- pixman=0.46.4=h54a6638_1
- pthread-stubs=0.4=hb9d3cd8_1002
- r-base=4.4.3=h14df4e6_4
- readline=8.2=h8c095d6_2
- samtools=1.22.1=h96c455f_0
- sed=4.9=h6688a6e_0
- sysroot_linux-64=2.39=hc4b9eeb_3
- tk=8.6.13=noxft_ha0e22de_103
- tktable=2.10=h8d826fa_7
- tpmcalculator=0.0.5=h2bd4fab_3
- tzdata=2025b=h78e105d_0
- xorg-libice=1.1.2=hb9d3cd8_0
- xorg-libsm=1.2.6=he73a12e_0
- xorg-libx11=1.8.12=h4f16b4b_0
- xorg-libxau=1.0.12=hb03c661_1
- xorg-libxdmcp=1.1.5=hb03c661_1
- xorg-libxext=1.3.6=hb9d3cd8_0
- xorg-libxrender=0.9.12=hb9d3cd8_0
- xorg-libxt=1.3.1=hb9d3cd8_0
- zlib=1.3.1=hb9d3cd8_2
- zstd=1.5.7=hb8e6e7a_2