diff --git a/post_processing/TPMCalculator/README.md b/post_processing/TPMCalculator/README.md
new file mode 100644
index 0000000..72fb754
--- /dev/null
+++ b/post_processing/TPMCalculator/README.md
@@ -0,0 +1,146 @@
+
+# mRNA Quantification using TPMCalculator
+
+
+## Introduction
+
+This pipeline quantifies mRNA abundance directly from BAM file alignments using **TPMCalculator** and generates visualizations to assess the results. It is designed to process multiple samples and combine results into comprehensive matrices for downstream analysis.
+
+The pipeline is organized into four scripts, each responsible for a different stage of processing and analysis. These scripts require access to a **SLURM** job manager for submitting and running the workflow.
+
+## Table of Contents
+* [Introduction](#introduction)
+* [Installation](#installation)
+ * [Environment](#environment)
+* [TPMCalculator](#tpmcalc)
+* [GFF3 to GTF](#gff3_to_gtf)
+* [Pipeline Overview](#pipeline_overview)
+* [Recommended Directory Structure](#dir_struc)
+
+
+## Installation
+This repository can be cloned locally by running the following `git` command:
+```bash
+git clone https://github.com/baliga-lab/Global_Search.git
+```
+Please note that Git is required to run the above command. For instructions on downloading Git, please see [the Git Guide](https://github.com/git-guides/install-git).
+
+
+### Environment
+
+#### Conda
+This application is built on top of multiple Python packages with specific version requirements. We recommend using `conda` to create an isolated Python environment with all necessary packages. If you do not have it installed, follow the [Miniconda Instillation Instructions](https://www.anaconda.com/docs/getting-started/miniconda/install). The list of necessary packages can be found at in the [`environment.yml`](./environment.yml) file.
+
+To create the specified `gs-coral-env` Conda environment, run the following command:
+```bash
+conda env create -f environment.yml
+```
+
+Once the Conda environment is created, it can be activated by:
+```bash
+conda activate gs-coral-env
+```
+After coding inside the environment, it can be deactivated with the command:
+```bash
+conda deactivate
+```
+
+**Please note! This environment file does not contain the packages required to run the whole Global_Search pipeline. The packages included are only enough to run the scripts in the TPMCalculator section**. Installation instructions for the Gloabl_Search pipleine can be found in the main [README](../../README.md)
+
+
+## TPMCalculator
+
+TPMCalculator requires two inputs:
+- A **GTF annotation file** describing gene models.
+- **BAM alignment files** for each sample.
+
+It produces four output files per sample containing TPM values and raw read counts for:
+- **Genes**
+- **Transcripts**
+- **Exons**
+- **Introns**
+
+
+## GFF3 to GTF
+
+While there are several ways to convert a GFF file to a GTF file, a custom script was used to facilitate this conversion. The [`gff3_to_gtf.py`](./scripts/gff3_to_gtf.py) script...
+- Reads a GFF3 file
+- Converts `gene`, `mRNA/transcript`, `exon`, and `CDS` features into GTF format
+- Constructs valid `gene_id` and `transcript_id` fields
+- Writes a new GTF file
+
+Run the following command to convert a GFF3 file to a GTF file:
+```bash
+python3 gff3_to_gtf.py
+```
+
+
+## Pipeline Overview
+
+The scripts should be run in the following order:
+
+0. **gff3_to_gtf.py**
+ - See instructions above.
+
+1. **check_bam_sorted.py**
+ - **Purpose:** Verify that all BAM files are correctly sorted and log their sort order.
+ - **Inputs:**
+ - `base_dir`: Path to the folder containing BAM files. BAM files can be nested within sample directories.
+ - **Outputs:**
+ - TSV file logging the sort order of each BAM file (`bam_sort_status_.tsv`).
+
+2. **generate_tpmcalculator_results.py**
+ - **Purpose:** Generate TPMCalculator results for each sample and create SLURM job scripts to run the analyses.
+ - **Inputs:**
+ - `base_dir`: Path to the folder containing BAM files. Bam files may be nested.
+ - `gtf_file`: Path to the reference GTF annotation file.
+ - `tpmcalculator_dir`: Desired output directory for TPMCalculator results.
+ - **Outputs:**
+ - TPMCalculator output files for each sample (genes, transcripts, exons, introns).
+ - SLURM job scripts for each sample (`*_tpmcalculator_job.sh`).
+
+3. **combine_tpmcalculator_results.py**
+ - **Purpose:** Combine TPMCalculator results from all samples into unified TPM and reads matrices.
+ - **Inputs:**
+ - `base_dir`: Folder containing all TPMCalculator outputs for all samples. Output files may be nested.
+ - `output_dir`: Desired output directory for combined matrices.
+ - **Outputs:**
+ - TPM matrix (`STAR_TPMcalculator__TPMs_Matrix_Merged.csv`).
+ - Reads matrix (`STAR_TPMcalculator__Reads_Matrix_Merged.csv`).
+
+4. **analyze_organism_totals.py**
+ - **Purpose:** Summarize total TPMs and reads by organism (e.g., Host, Symbiont species) and generate visualizations to verify results.
+ - **Inputs:**
+ - `base_dir`: Folder containing combined TPMCalculator results. This is the output directory.
+ - `tpm_file`: Path to the merged TPM matrix.
+ - `reads_file`: Path to the merged reads matrix.
+ - **Outputs:**
+ - Summary CSV files of TPM and reads totals by organism (`TPM_totals_by_organism.csv`, `Reads_totals_by_organism.csv`).
+ - Barplots and summary figures (`organism_totals_barplots.png/pdf`, `organism_summary_stats.png/pdf`).
+
+
+## Recommended Directory Structure
+
+For clarity and organization, the pipeline outputs are recommended to be structured as follows:
+```
+├── sample_organism_name/
+│ ├── combined_results/
+│ │ ├── organism_analysis/
+│ │ │ ├──
+│ │ │ ├── ...
+│ │ ├── STAR_TPMCalculator__Reads_Matrix_Merged.csv
+│ │ ├── STAR_TPMCalculator__TPMs_Matrix_Merged.csv
+│ ├── tpmcalculator_output/
+│ │ ├── sample_1/
+│ │ │ ├──
+│ │ │ ├── ...
+│ │ ├── sample_2/
+│ │ │ ├──
+│ │ │ ├── ...
+│ │ ├── sample_.../
+│ │ │ ├──
+│ │ │ ├── ...
+│ ├── reference/
+│ │ ├── .gff3 or gff
+│ │ ├── .gtf
+```
diff --git a/post_processing/TPMCalculator/environment.yml b/post_processing/TPMCalculator/environment.yml
new file mode 100644
index 0000000..b3948ef
--- /dev/null
+++ b/post_processing/TPMCalculator/environment.yml
@@ -0,0 +1,217 @@
+name: gs-coral-env
+channels:
+ - bioconda
+ - conda-forge
+ - https://repo.anaconda.com/pkgs/main
+ - https://repo.anaconda.com/pkgs/r
+dependencies:
+ - _libgcc_mutex=0.1=conda_forge
+ - _openmp_mutex=4.5=2_gnu
+ - _r-mutex=1.0.1=anacondar_1
+ - agat=1.5.1=pl5321hdfd78af_0
+ - bamtools=2.5.3=he132191_0
+ - binutils_impl_linux-64=2.45=default_hfdba357_104
+ - bwidget=1.10.1=ha770c72_1
+ - bzip2=1.0.8=hda65f42_8
+ - c-ares=1.34.5=hb9d3cd8_0
+ - ca-certificates=2025.11.12=hbd8a1cb_0
+ - cairo=1.18.4=h3394656_0
+ - curl=8.16.0=h4e3cde8_0
+ - font-ttf-dejavu-sans-mono=2.37=hab24e00_0
+ - font-ttf-inconsolata=3.000=h77eed37_0
+ - font-ttf-source-code-pro=2.038=h77eed37_0
+ - font-ttf-ubuntu=0.83=h77eed37_3
+ - fontconfig=2.15.0=h7e30c49_1
+ - fonts-conda-ecosystem=1=0
+ - fonts-conda-forge=1=hc364b38_1
+ - freetype=2.14.1=ha770c72_0
+ - fribidi=1.0.16=hb03c661_0
+ - gcc_impl_linux-64=15.2.0=h6f0f26c_15
+ - gffread=0.12.7=h077b44d_6
+ - gfortran_impl_linux-64=15.2.0=h281d09f_15
+ - graphite2=1.3.14=hecca717_2
+ - gsl=2.7=he838d99_0
+ - gxx_impl_linux-64=15.2.0=hda75c37_15
+ - harfbuzz=12.2.0=h15599e2_0
+ - htslib=1.22.1=h566b1c6_0
+ - icu=75.1=he02047a_0
+ - jsoncpp=1.9.6=hf42df4d_1
+ - kernel-headers_linux-64=6.12.0=he073ed8_3
+ - keyutils=1.6.3=hb9d3cd8_0
+ - krb5=1.21.3=h659f571_0
+ - ld_impl_linux-64=2.45=default_hbd61a6d_104
+ - lerc=4.0.0=h0aef613_1
+ - libblas=3.11.0=4_h4a7cf45_openblas
+ - libcblas=3.11.0=4_h0358290_openblas
+ - libcurl=8.16.0=h4e3cde8_0
+ - libdb=6.2.32=h9c3ff4c_0
+ - libdeflate=1.24=h86f0d12_0
+ - libedit=3.1.20250104=pl5321h7949ede_0
+ - libev=4.33=hd590300_2
+ - libexpat=2.7.3=hecca717_0
+ - libffi=3.5.2=h9ec8514_0
+ - libfreetype=2.14.1=ha770c72_0
+ - libfreetype6=2.14.1=h73754d4_0
+ - libgcc=15.2.0=h767d61c_7
+ - libgcc-devel_linux-64=15.2.0=hcc6f6b0_115
+ - libgcc-ng=15.2.0=h69a702a_7
+ - libgfortran=15.2.0=h69a702a_15
+ - libgfortran-ng=15.2.0=h69a702a_15
+ - libgfortran5=15.2.0=h68bc16d_15
+ - libglib=2.86.2=h32235b2_0
+ - libgomp=15.2.0=h767d61c_7
+ - libiconv=1.18=h3b78370_2
+ - libjpeg-turbo=3.1.2=hb03c661_0
+ - liblapack=3.11.0=4_h47877c9_openblas
+ - liblzma=5.8.1=hb9d3cd8_2
+ - libnghttp2=1.67.0=had1ee68_0
+ - libopenblas=0.3.30=pthreads_h94d23a6_4
+ - libpng=1.6.53=h421ea60_0
+ - libsanitizer=15.2.0=h90f66d4_15
+ - libssh2=1.11.1=hcf80075_0
+ - libstdcxx=15.2.0=h8f9b012_7
+ - libstdcxx-devel_linux-64=15.2.0=hd446a21_115
+ - libstdcxx-ng=15.2.0=h4852527_7
+ - libtiff=4.7.1=h8261f1e_0
+ - libuuid=2.41.2=h5347b49_1
+ - libwebp-base=1.6.0=hd42ef1d_0
+ - libxcb=1.17.0=h8a09558_0
+ - libxcrypt=4.4.36=hd590300_1
+ - libzlib=1.3.1=hb9d3cd8_2
+ - make=4.4.1=hb9d3cd8_2
+ - ncurses=6.5=h2d0b736_3
+ - openssl=3.6.0=h26f9b46_0
+ - pango=1.56.4=hadf4263_0
+ - pcre2=10.46=h1321c63_0
+ - perl=5.32.1=7_hd590300_perl5
+ - perl-b-cow=0.007=pl5321hb9d3cd8_1
+ - perl-b-hooks-endofscope=0.28=pl5321ha770c72_1
+ - perl-base=2.23=pl5321hd8ed1ab_0
+ - perl-bioperl-core=1.7.8=pl5321hdfd78af_1
+ - perl-business-isbn=3.007=pl5321hd8ed1ab_0
+ - perl-business-isbn-data=20210112.006=pl5321hd8ed1ab_0
+ - perl-carp=1.50=pl5321hd8ed1ab_0
+ - perl-class-inspector=1.36=pl5321hdfd78af_0
+ - perl-class-load=0.25=pl5321ha770c72_2
+ - perl-class-load-xs=0.10=pl5321hb9d3cd8_0
+ - perl-class-methodmaker=2.25=pl5321h7b50bb2_1
+ - perl-clone=0.46=pl5321hb9d3cd8_1
+ - perl-compress-raw-bzip2=2.214=pl5321hda65f42_0
+ - perl-compress-raw-zlib=2.214=pl5321h4dac143_0
+ - perl-constant=1.33=pl5321hd8ed1ab_0
+ - perl-cpan-meta-check=0.014=pl5321ha770c72_0
+ - perl-data-dump=1.25=pl5321h7b50bb2_2
+ - perl-data-optlist=0.114=pl5321ha770c72_1
+ - perl-db_file=1.858=pl5321hb9d3cd8_0
+ - perl-devel-globaldestruction=0.14=pl5321ha770c72_0
+ - perl-devel-overloadinfo=0.007=pl5321ha770c72_1
+ - perl-devel-stacktrace=2.04=pl5321h296ab09_0
+ - perl-digest-hmac=1.05=pl5321hdfd78af_0
+ - perl-digest-md5=2.59=pl5321hb9d3cd8_3
+ - perl-dist-checkconflicts=0.11=pl5321ha770c72_0
+ - perl-encode=3.21=pl5321hb9d3cd8_1
+ - perl-encode-locale=1.05=pl5321hdfd78af_7
+ - perl-eval-closure=0.14=pl5321ha770c72_0
+ - perl-exporter=5.74=pl5321hd8ed1ab_0
+ - perl-exporter-tiny=1.002002=pl5321hd8ed1ab_0
+ - perl-extutils-makemaker=7.70=pl5321hd8ed1ab_0
+ - perl-file-listing=6.16=pl5321hdfd78af_0
+ - perl-file-pushd=1.016=pl5321ha770c72_0
+ - perl-file-share=0.25=pl5321hdfd78af_3
+ - perl-file-sharedir=1.118=pl5321hdfd78af_0
+ - perl-file-sharedir-install=0.14=pl5321hdfd78af_0
+ - perl-file-spec=3.48_01=pl5321hdfd78af_2
+ - perl-getopt-long=2.58=pl5321hdfd78af_0
+ - perl-graph=0.9735=pl5321hdfd78af_0
+ - perl-heap=0.80=pl5321hdfd78af_1
+ - perl-html-parser=3.81=pl5321h4ac6f70_1
+ - perl-html-tagset=3.24=pl5321hdfd78af_0
+ - perl-http-cookiejar-lwp=0.014=pl5321hdfd78af_0
+ - perl-http-cookies=6.11=pl5321hdfd78af_0
+ - perl-http-daemon=6.16=pl5321hdfd78af_0
+ - perl-http-date=6.06=pl5321hdfd78af_0
+ - perl-http-message=7.01=pl5321hdfd78af_0
+ - perl-http-negotiate=6.01=pl5321hdfd78af_4
+ - perl-inc-latest=0.500=pl5321ha770c72_0
+ - perl-io-html=1.004=pl5321hdfd78af_0
+ - perl-io-socket-ssl=2.075=pl5321hd8ed1ab_0
+ - perl-io-tty=1.20=pl5321hb9d3cd8_3
+ - perl-ipc-run=20250809.0=pl5321hdfd78af_0
+ - perl-libwww-perl=6.81=pl5321hdfd78af_0
+ - perl-list-moreutils=0.430=pl5321hdfd78af_0
+ - perl-list-moreutils-xs=0.430=pl5321h7b50bb2_5
+ - perl-lwp-mediatypes=6.04=pl5321hdfd78af_1
+ - perl-lwp-protocol-https=6.14=pl5321hdfd78af_1
+ - perl-mime-base64=3.16=pl5321hb9d3cd8_3
+ - perl-module-build=0.4234=pl5321ha770c72_1
+ - perl-module-implementation=0.09=pl5321ha770c72_1
+ - perl-module-load=0.34=pl5321hdfd78af_0
+ - perl-module-runtime=0.016=pl5321ha770c72_1
+ - perl-module-runtime-conflicts=0.003=pl5321ha770c72_0
+ - perl-moose=2.2207=pl5321hb9d3cd8_2
+ - perl-mozilla-ca=20250602=pl5321hdfd78af_0
+ - perl-mro-compat=0.15=pl5321ha770c72_0
+ - perl-namespace-clean=0.27=pl5321h296ab09_1
+ - perl-net-http=6.24=pl5321hdfd78af_0
+ - perl-net-ssleay=1.94=pl5321hf672d98_1
+ - perl-ntlm=1.09=pl5321hdfd78af_5
+ - perl-package-deprecationmanager=0.18=pl5321ha770c72_2
+ - perl-package-stash=0.40=pl5321ha770c72_2
+ - perl-package-stash-xs=0.30=pl5321hb9d3cd8_2
+ - perl-params-util=1.102=pl5321hb9d3cd8_1
+ - perl-parent=0.243=pl5321hd8ed1ab_0
+ - perl-pod-escapes=1.07=pl5321hdfd78af_2
+ - perl-regexp-common=2017060201=pl5321hd8ed1ab_0
+ - perl-safe=2.37=pl5321hdfd78af_2
+ - perl-scalar-list-utils=1.70=pl5321hb03c661_0
+ - perl-set-object=1.43=pl5321h7b50bb2_0
+ - perl-socket=2.027=pl5321h5c03b87_6
+ - perl-sort-naturally=1.03=pl5321hdfd78af_3
+ - perl-statistics-r=0.34=pl5321r44hdfd78af_7
+ - perl-storable=3.15=pl5321hb9d3cd8_2
+ - perl-sub-exporter=0.991=pl5321ha770c72_1
+ - perl-sub-exporter-progressive=0.001013=pl5321ha770c72_0
+ - perl-sub-identify=0.14=pl5321hb9d3cd8_2
+ - perl-sub-install=0.928=pl5321hd8ed1ab_0
+ - perl-sub-name=0.28=pl5321hb9d3cd8_1
+ - perl-term-progressbar=2.23=pl5321hdfd78af_0
+ - perl-test=1.26=pl5321hd8ed1ab_0
+ - perl-test-cleannamespaces=0.24=pl5321ha770c72_2
+ - perl-test-fatal=0.016=pl5321ha770c72_0
+ - perl-test-harness=3.44=pl5321hd8ed1ab_0
+ - perl-test-leaktrace=0.17=pl5321h7bf1ee8_1
+ - perl-test-needs=0.002009=pl5321hd8ed1ab_0
+ - perl-test-requiresinternet=0.05=pl5321hdfd78af_1
+ - perl-test-warnings=0.031=pl5321ha770c72_0
+ - perl-text-balanced=2.07=pl5321hdfd78af_0
+ - perl-text-wrap=2021.0814=pl5321hd8ed1ab_0
+ - perl-time-local=1.35=pl5321hdfd78af_0
+ - perl-timedate=2.33=pl5321hdfd78af_2
+ - perl-try-tiny=0.31=pl5321ha770c72_0
+ - perl-uri=5.34=pl5321ha770c72_0
+ - perl-url-encode=0.03=pl5321h9ee0642_1
+ - perl-variable-magic=0.64=pl5321hb9d3cd8_0
+ - perl-vars=1.03=pl5321hdfd78af_2
+ - perl-www-robotrules=6.02=pl5321hdfd78af_4
+ - perl-yaml=1.30=pl5321hdfd78af_0
+ - pixman=0.46.4=h54a6638_1
+ - pthread-stubs=0.4=hb9d3cd8_1002
+ - r-base=4.4.3=h14df4e6_4
+ - readline=8.2=h8c095d6_2
+ - samtools=1.22.1=h96c455f_0
+ - sed=4.9=h6688a6e_0
+ - sysroot_linux-64=2.39=hc4b9eeb_3
+ - tk=8.6.13=noxft_ha0e22de_103
+ - tktable=2.10=h8d826fa_7
+ - tpmcalculator=0.0.5=h2bd4fab_3
+ - tzdata=2025b=h78e105d_0
+ - xorg-libice=1.1.2=hb9d3cd8_0
+ - xorg-libsm=1.2.6=he73a12e_0
+ - xorg-libx11=1.8.12=h4f16b4b_0
+ - xorg-libxau=1.0.12=hb03c661_1
+ - xorg-libxdmcp=1.1.5=hb03c661_1
+ - xorg-libxext=1.3.6=hb9d3cd8_0
+ - xorg-libxrender=0.9.12=hb9d3cd8_0
+ - xorg-libxt=1.3.1=hb9d3cd8_0
+ - zlib=1.3.1=hb9d3cd8_2
+ - zstd=1.5.7=hb8e6e7a_2
diff --git a/post_processing/TPMCalculator/scripts/analyze_organism_totals.py b/post_processing/TPMCalculator/scripts/analyze_organism_totals.py
new file mode 100644
index 0000000..a0a3948
--- /dev/null
+++ b/post_processing/TPMCalculator/scripts/analyze_organism_totals.py
@@ -0,0 +1,340 @@
+#!/usr/bin/env python3
+"""
+analyze_organism_totals.py
+
+This script takes merged TPM and Reads matrices produced by the previous
+TPMCalculator combination step and verifes that expression values were
+assinged correctly to each organism for the multi-organism dataset.
+
+It also creates summary dataframes and barplots to verify processing worked correctly,
+and generates the following outputs:
+ - TPM_totals_by_organism.csv
+ - Reads_totals_by_organism.csv
+ - organism_totals_barplots.png/.pdf
+ - organism_summary_stats.png/.pdf
+
+Functions:
+ - analyze_organism_totals(tpm_file, reads_file, output_dir, organisms)
+ - create_barplots(tpm_totals, reads_totals, output_dir, organisms)
+ - create_summary_stats_plot(tpm_totals, reads_totals, output_dir, organisms)
+ - main()
+
+Usage:
+ python3 analyze_organism_totals.py
+"""
+
+import os
+import pandas as pd
+import matplotlib.pyplot as plt
+import numpy as np
+
+def analyze_organism_totals(tpm_file, reads_file, output_dir, organisms):
+ """
+ Analyze total TPMs and reads by organism prefix.
+
+ Args:
+ - tpm_file (str): Path to TPM matrix CSV file
+ - reads_file (str): Path to reads matrix CSV file
+ - output_dir (str): Directory to save outputs
+ - organisms (dict): Dictionary containing the names and desired colors used
+ for graph labeling
+
+ Returns:
+ - tpm_totals (pandas.DataFrame): DataFrame with TPM totals by organism
+ - reads_totals (pandas.DataFrame): DataFrame with reads totals by organism
+ """
+ print("Loading TPM and reads matrices...")
+
+ # Load matrices
+ tpm_df = pd.read_csv(tpm_file)
+ reads_df = pd.read_csv(reads_file)
+
+ print(f"TPM matrix shape: {tpm_df.shape}")
+ print(f"Reads matrix shape: {reads_df.shape}")
+
+ # Get sample columns (all except Gene_Id)
+ sample_columns = [col for col in tpm_df.columns if col != 'Gene_Id']
+ print(f"Found {len(sample_columns)} samples")
+
+ # Define organism prefixes
+ organism_prefixes = organisms["names"]
+
+ # Initialize result dataframes
+ tpm_totals = pd.DataFrame(index=sample_columns, columns=organism_prefixes)
+ reads_totals = pd.DataFrame(index=sample_columns, columns=organism_prefixes)
+
+ print("Calculating totals by organism...")
+
+ # Calculate totals for each organism and sample
+ for sample in sample_columns:
+ for prefix in organism_prefixes:
+ # Filter genes by prefix
+ gene_mask = tpm_df['Gene_Id'].str.startswith(prefix)
+
+ # Calculate totals
+ tpm_total = tpm_df.loc[gene_mask, sample].sum()
+ reads_total = reads_df.loc[gene_mask, sample].sum()
+
+ tpm_totals.loc[sample, prefix] = tpm_total
+ reads_totals.loc[sample, prefix] = reads_total
+
+ # Convert to numeric
+ tpm_totals = tpm_totals.astype(float)
+ reads_totals = reads_totals.astype(float)
+
+ # Add total columns
+ tpm_totals['Total'] = tpm_totals.sum(axis=1)
+ reads_totals['Total'] = reads_totals.sum(axis=1)
+
+ print("Summary statistics:")
+ print("\nTPM totals:")
+ print(tpm_totals.describe())
+ print("\nReads totals:")
+ print(reads_totals.describe())
+
+ # Save summary dataframes
+ tpm_output = os.path.join(output_dir, "TPM_totals_by_organism.csv")
+ reads_output = os.path.join(output_dir, "Reads_totals_by_organism.csv")
+
+ tpm_totals.to_csv(tpm_output)
+ reads_totals.to_csv(reads_output)
+
+ print(f"\nSaved summary files:")
+ print(f"TPM totals: {tpm_output}")
+ print(f"Reads totals: {reads_output}")
+
+ # Create visualizations
+ create_barplots(tpm_totals, reads_totals, output_dir, organisms)
+
+ return tpm_totals, reads_totals
+
+def create_barplots(tpm_totals, reads_totals, output_dir, organisms):
+ """
+ Create barplots for TPM and reads totals by organism.
+
+ Args:
+ - tpm_totals (pandas.DataFrame): DataFrame with TPM totals by organism
+ - reads_totals (pandas.DataFrame): DataFrame with reads totals by organism
+ - output_dir (str): Directory to save plots
+ - organisms (dict): Dictionary containing the names and desired colors used
+ for graph labeling
+
+ Returns:
+ - N/A
+ """
+ print("Creating visualizations...")
+
+ # Set up the plotting style
+ plt.style.use('default')
+ colors = organisms["colors"]
+
+ # Remove Total column for plotting
+ organism_cols = organisms["names"]
+
+ # Create figure with subplots
+ fig, axes = plt.subplots(2, 2, figsize=(15, 12))
+ fig.suptitle('TPM and Reads Distribution by Organism', fontsize=16, fontweight='bold')
+
+ # Plot 1: TPM totals by organism (stacked bar)
+ ax1 = axes[0, 0]
+ tpm_plot_data = tpm_totals[organism_cols]
+ tpm_plot_data.plot(kind='bar', stacked=True, ax=ax1, width=0.8, color=colors)
+ ax1.set_title('TPM Totals by Organism (Stacked)', fontweight='bold')
+ ax1.set_xlabel('Sample Index')
+ ax1.set_ylabel('TPM')
+ ax1.legend(title='Organism', bbox_to_anchor=(1.05, 1), loc='upper left')
+ ax1.set_xticks([]) # Remove x-axis labels for readability
+
+ # Plot 2: Reads totals by organism (stacked bar)
+ ax2 = axes[0, 1]
+ reads_plot_data = reads_totals[organism_cols]
+ reads_plot_data.plot(kind='bar', stacked=True, ax=ax2, width=0.8, color=colors)
+ ax2.set_title('Reads Totals by Organism (Stacked)', fontweight='bold')
+ ax2.set_xlabel('Sample Index')
+ ax2.set_ylabel('Reads')
+ ax2.legend(title='Organism', bbox_to_anchor=(1.05, 1), loc='upper left')
+ ax2.set_xticks([]) # Remove x-axis labels for readability
+
+ # Plot 3: TPM percentage by organism
+ ax3 = axes[1, 0]
+ tpm_percentages = tpm_plot_data.div(tpm_plot_data.sum(axis=1), axis=0) * 100
+ tpm_percentages.plot(kind='bar', stacked=True, ax=ax3, width=0.8, color=colors)
+ ax3.set_title('TPM Distribution by Organism (%)', fontweight='bold')
+ ax3.set_xlabel('Sample Index')
+ ax3.set_ylabel('Percentage (%)')
+ ax3.legend(title='Organism', bbox_to_anchor=(1.05, 1), loc='upper left')
+ ax3.set_xticks([]) # Remove x-axis labels for readability
+
+ # Plot 4: Reads percentage by organism
+ ax4 = axes[1, 1]
+ reads_percentages = reads_plot_data.div(reads_plot_data.sum(axis=1), axis=0) * 100
+ reads_percentages.plot(kind='bar', stacked=True, ax=ax4, width=0.8, color=colors)
+ ax4.set_title('Reads Distribution by Organism (%)', fontweight='bold')
+ ax4.set_xlabel('Sample Index')
+ ax4.set_ylabel('Percentage (%)')
+ ax4.legend(title='Organism', bbox_to_anchor=(1.05, 1), loc='upper left')
+ ax4.set_xticks([]) # Remove x-axis labels for readability
+
+ # Adjust layout
+ plt.tight_layout()
+
+ # Save plot
+ plot_output = os.path.join(output_dir, "organism_totals_barplots.png")
+ plt.savefig(plot_output, dpi=300, bbox_inches='tight')
+ plt.savefig(plot_output.replace('.png', '.pdf'), bbox_inches='tight')
+
+ print(f"Saved plots: {plot_output}")
+ plt.close()
+
+ # Create summary statistics plot
+ create_summary_stats_plot(tpm_totals, reads_totals, output_dir, organisms)
+
+def create_summary_stats_plot(tpm_totals, reads_totals, output_dir, organisms):
+ """
+ Create summary statistics visualization.
+
+ Args:
+ - tpm_totals (pandas.DataFrame): DataFrame with TPM totals by organism
+ - reads_totals (pandas.DataFrame): DataFrame with reads totals by organism
+ - output_dir (str): Directory to save plots
+ - organisms (dict): Dictionary containing the names and desired colors used
+ for graph labeling
+
+ Returns:
+ - N/A
+ """
+ organism_cols = organisms["names"]
+ color_cols = organisms["colors"]
+
+ fig, axes = plt.subplots(1, 2, figsize=(12, 5))
+ fig.suptitle('Summary Statistics by Organism', fontsize=16, fontweight='bold')
+
+ # TPM summary
+ ax1 = axes[0]
+ tpm_summary = tpm_totals[organism_cols].mean()
+ bars1 = ax1.bar(range(len(organism_cols)), tpm_summary.values,
+ color=color_cols)
+ ax1.set_title('Mean TPM by Organism', fontweight='bold')
+ ax1.set_xlabel('Organism')
+ ax1.set_ylabel('Mean TPM')
+ ax1.set_xticks(range(len(organism_cols)))
+ ax1.set_xticklabels(organism_cols, rotation=45)
+
+ # Add value labels on bars
+ for i, (bar, val) in enumerate(zip(bars1, tpm_summary.values)):
+ ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + val*0.01,
+ f'{val:.0f}', ha='center', va='bottom')
+
+ # Reads summary
+ ax2 = axes[1]
+ reads_summary = reads_totals[organism_cols].mean()
+ bars2 = ax2.bar(range(len(organism_cols)), reads_summary.values,
+ color=color_cols)
+ ax2.set_title('Mean Reads by Organism', fontweight='bold')
+ ax2.set_xlabel('Organism')
+ ax2.set_ylabel('Mean Reads')
+ ax2.set_xticks(range(len(organism_cols)))
+ ax2.set_xticklabels(organism_cols, rotation=45)
+
+ # Add value labels on bars
+ for i, (bar, val) in enumerate(zip(bars2, reads_summary.values)):
+ ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + val*0.01,
+ f'{val:.0f}', ha='center', va='bottom')
+
+ plt.tight_layout()
+
+ # Save plot
+ summary_output = os.path.join(output_dir, "organism_summary_stats.png")
+ plt.savefig(summary_output, dpi=300, bbox_inches='tight')
+ plt.savefig(summary_output.replace('.png', '.pdf'), bbox_inches='tight')
+
+ print(f"Saved summary plot: {summary_output}")
+ plt.close()
+
+def main():
+ """
+ Genenerates analysis and visualization written to the
+ base_dir/organism_analysis/ directory.
+
+ Expected Input Directory Structure
+ ----------------------------------
+ The script expects the following structure:
+
+ base_dir/
+ └── combined_results/
+ │ ├── STAR_TPMcalculator_*_TPMs_Matrix_Merged.csv
+ │ ├── STAR_TPMcalculator_*_Reads_Matrix_Merged.csv
+ └── tpmcalculator_output/
+ └── ...
+
+ Organism Dictionary
+ -------------------
+ organisms contains two fields used throughout the analysis:
+
+ names
+ List of gene ID prefixes used to identify which organism each
+ gene belongs to. For example, a gene ID like "Sym_Cgor_gene123"
+ should be labeled as "Sym_Cgor_".
+
+ colors
+ List of colors corresponding to each organism prefix.
+
+ Required Paths
+ --------------------
+ req_paths contains user-defined paths and output file names:
+
+ combined_results_dir
+ Directory containing the 'combined_results/' folder from the
+ previous script in the pipeline.
+
+ tpm_file
+ File name for the merged TPM matrix.
+
+ reads_file
+ File name for the merged Reads matrix.
+ """
+
+ ####################
+ # Define Organisms
+ ####################
+ organisms = {
+ "names": ['Host_', 'Sym_Cgor_', 'Sym_Dtre_', 'Sym_Smic_'],
+ "colors": ['#1f77b4', '#ff7f0e', '#2ca02c' ,'#d62728']
+ }
+
+ ####################
+ # Define paths
+ ####################
+ req_paths = {
+ "combined_results_dir": "",
+ "tpm_file": "STAR_TPMcalculator_TPMs_Matrix_Merged_rerun.csv",
+ "reads_file": "STAR_TPMcalculator_Reads_Matrix_Merged_rerun.csv"
+ }
+
+ tpm_file = os.path.join(req_paths["combined_results_dir"], req_paths["tpm_file"])
+ reads_file = os.path.join(req_paths["combined_results_dir"], req_paths["reads_file"])
+
+ if os.path.exists(reads_file):
+ print("Using original reads matrix (may have different gene set)")
+ else:
+ reads_file = tpm_file
+ print("No reads matrix found, using TPM matrix for both analyses")
+
+ # Check if input files exist
+ if not os.path.exists(tpm_file):
+ print(f"TPM matrix file not found: {tpm_file}")
+ print("Please run combine_tpmcalculator_results.py first")
+ return
+
+ # Create output directory for analysis results
+ output_dir = os.path.join(req_paths["combined_results_dir"], "organism_analysis")
+ os.makedirs(output_dir, exist_ok=True)
+
+ # Analyze organism totals
+ tpm_totals, reads_totals = analyze_organism_totals(tpm_file, reads_file, output_dir, organisms)
+
+ print("\nAnalysis completed successfully!")
+ print(f"Results saved in: {output_dir}")
+
+if __name__ == "__main__":
+ main()
\ No newline at end of file
diff --git a/post_processing/TPMCalculator/scripts/check_bam_sort.py b/post_processing/TPMCalculator/scripts/check_bam_sort.py
new file mode 100644
index 0000000..fef7e42
--- /dev/null
+++ b/post_processing/TPMCalculator/scripts/check_bam_sort.py
@@ -0,0 +1,138 @@
+#!/usr/bin/env python3
+"""
+check_bam_sort.py
+
+The script scans each sample directory, identifies the STAR-aligned BAM file,
+determines its sort order using `samtools view -H`, and logs the results to
+a TSV file for quick verification of alignment output consistency.
+
+Functions:
+ - get_sort_status(bam_path)
+ - main()
+
+Usage:
+ python3 check_bam_sort.py
+"""
+
+import os
+import glob
+import subprocess
+import csv
+
+def get_sort_status(bam_path):
+ """
+ Returns samtools sort order: "coordinate", "queryname", "unsorted",
+ or "unknown".
+
+ Args:
+ - bam_path (string): Path to bam file.
+
+ Returns:
+ - (string): samtools sort order.
+ """
+ try:
+
+ # Run samtools command
+ header = subprocess.run(
+ ["samtools", "view", "-H", bam_path],
+ stdout=subprocess.PIPE,
+ stderr=subprocess.PIPE,
+ text=True
+ )
+
+ # Return sort order
+ if header.returncode != 0:
+ return "unknown"
+
+ for line in header.stdout.splitlines():
+ if line.startswith("@HD"):
+ if "SO:coordinate" in line:
+ return "coordinate"
+ elif "SO:queryname" in line:
+ return "queryname"
+ elif "SO:unsorted" in line:
+ return "unsorted"
+ else:
+ return "unsorted"
+
+ return "unsorted"
+ except Exception:
+ return "unknown"
+
+
+def main():
+ """
+ Finds bam files within a provided folder and logs the
+ samtools sort order.
+
+ Expected Input Directory Structure
+ ----------------------------------
+ The script expects the following directory structure:
+
+ base_dir/
+ ├── PerSample/
+ │ ├── SAMPLE_1/
+ │ │ └── results_STAR_htseq/
+ │ │ └── *_dedup_NoSingletonCollated.out.bam
+ │ │
+ │ ├── SAMPLE_2/
+ │ │ └── results_STAR_htseq/
+ │ │ └── *_dedup_NoSingletonCollated.out.bam
+ │ │
+ │ └── ...
+ """
+
+ #################
+ # Define paths
+ ##################
+ base_dir = ""
+ per_sample_dir = os.path.join(base_dir, "PerSample")
+
+ log_file_name = "bam_sort_status_" + os.path.basename(base_dir) + ".tsv"
+ log_path = os.path.join(os.getcwd(), log_file_name)
+
+ # Check if PerSample directory exists
+ if not os.path.exists(per_sample_dir):
+ print(f"Error: PerSample directory not found at {per_sample_dir}")
+ return
+
+ # Find all sample directories
+ sample_dirs = [d for d in os.listdir(per_sample_dir)
+ if os.path.isdir(os.path.join(per_sample_dir, d))]
+ print(f"\nFound {len(sample_dirs)} sample directories")
+
+ # Open TSV
+ with open(log_path, "w", newline="") as log_file:
+ writer = csv.writer(log_file, delimiter="\t")
+ writer.writerow(["sample_dir", "bam_file", "sort_status"])
+
+ # Loop over samples
+ for sample_dir in sample_dirs:
+ sample_path = os.path.join(per_sample_dir, sample_dir)
+ results_star_htseq_path = os.path.join(sample_path, "results_STAR_htseq")
+
+ if not os.path.exists(results_star_htseq_path):
+ print(f"Warning: results_STAR_htseq directory not found for {sample_dir}")
+ continue
+
+ # Find BAM file
+ bam_pattern = os.path.join(results_star_htseq_path, "*dedup_NoSingletonCollated.out.bam")
+ bam_files = glob.glob(bam_pattern)
+
+ if not bam_files:
+ print(f"Warning: No BAM file found for {sample_dir}")
+ continue
+ elif len(bam_files) > 1:
+ print(f"Warning: Multiple BAM files found for {sample_dir}, using the first one")
+
+ bam_file = bam_files[0]
+
+ # Check sort status and log result
+ sort_status = get_sort_status(bam_file)
+ writer.writerow([sample_dir, bam_file, sort_status])
+
+ print(f"\nLogged BAM sort status to {log_path}\n")
+
+
+if __name__ == "__main__":
+ main()
\ No newline at end of file
diff --git a/post_processing/TPMCalculator/scripts/combine_tpmcalculator_results.py b/post_processing/TPMCalculator/scripts/combine_tpmcalculator_results.py
new file mode 100644
index 0000000..ded8307
--- /dev/null
+++ b/post_processing/TPMCalculator/scripts/combine_tpmcalculator_results.py
@@ -0,0 +1,258 @@
+#!/usr/bin/env python3
+"""
+combine_tpmcalculator_results.py
+
+Combine TPMCalculator output files from multiple RNA-seq samples into merged matrices.
+This script scans the TPMCalculator output directories and aligns TPM and Reads
+values across all samples by gene. Two merged CSV are generated.
+
+Functions:
+ - extract_sample_name(filepath)
+ - find_tpmcalculator_outputs(base_dir)
+ - combine_tpm_matrices(output_files, output_dir, req_paths)
+ - main()
+
+Usage:
+ python3 combine_tpmcalculator_results.py
+"""
+
+import os
+import pandas as pd
+import glob
+import re
+from pathlib import Path
+
+def extract_sample_name(filepath):
+ """
+ Extract sample name from filepath by removing '_star_10_0.3_0.66_0_dedup' part.
+
+ Args:
+ - filepath (str): Path to the TPMCalculator output file
+
+ Returns:
+ - sample_name (str): Sample name
+ """
+ filename = os.path.basename(filepath)
+ # Remove the file extension and TPMCalculator suffix
+ base_name = filename.replace('_tpmcalculator.txt_genes.out', '')
+ # Remove the STAR alignment suffix
+ sample_name = re.sub(r'_star_10_0\.3_0\.66_0_dedup.*', '', base_name)
+ return sample_name
+
+def find_tpmcalculator_outputs(base_dir):
+ """
+ Find all TPMCalculator output files with various possible patterns.
+
+ Args:
+ - base_dir (str): Base directory containing PerSample folders
+
+ Returns:
+ - list: List of file paths
+ """
+ # per_sample_dir = os.path.join(base_dir, "PerSample")
+ per_sample_dir = os.path.join(base_dir, "tpmcalculator_output")
+
+ # Try multiple patterns for TPMCalculator output files
+ patterns = [
+ os.path.join(per_sample_dir, "*", "*Collated.out_genes"),
+ os.path.join(per_sample_dir, "*", "*_genes.out"),
+ os.path.join(per_sample_dir, "*", "*NoSingletonCollated.out_genes.out"),
+ os.path.join(per_sample_dir, "*", "*dedup*.out_genes.out"),
+ ]
+
+ print(patterns)
+
+ output_files = []
+ for pattern in patterns:
+ files = glob.glob(pattern)
+ if files:
+ output_files = files
+ print(f"Found files with pattern: {pattern}")
+ break
+
+ return sorted(output_files)
+
+def combine_tpm_matrices(output_files, output_dir, req_paths):
+ """
+ Combine TPMCalculator results into TPM and Reads matrices.
+
+ Args:
+ - output_files (list): List of TPMCalculator output file paths
+ - output_dir (str): Directory to save the combined matrices
+ - req_paths (dict): Dictionary containing names of TPM and Reads CSVs
+ """
+ tpm_data = {}
+ reads_data = {}
+ gene_ids = None
+
+ print(f"Processing {len(output_files)} TPMCalculator output files...")
+
+ for i, filepath in enumerate(output_files):
+ try:
+ # Extract sample name
+ sample_name = extract_sample_name(filepath)
+ print(f"Processing {i+1}/{len(output_files)}: {sample_name}")
+
+ # Check file size first
+ file_size = os.path.getsize(filepath)
+ if file_size < 300: # Less than 300 bytes is likely empty (just headers)
+ print(f"Warning: File {filepath} is too small ({file_size} bytes), likely empty")
+ continue
+
+ # Read the TPMCalculator output file
+ df = pd.read_csv(filepath, sep='\t')
+
+ # Identify duplicated Gene_Id
+ dup_mask = df['Gene_Id'].duplicated(keep=False)
+ if dup_mask.any():
+ skipped_file = os.path.join(output_dir, "Skipped_Duplicate_Gene_Ids.csv")
+
+ # Save skipped rows
+ df[dup_mask].assign(sample=sample_name).to_csv(
+ skipped_file,
+ sep=",",
+ index=False,
+ mode="a",
+ header=not os.path.exists(skipped_file)
+ )
+ df = df[~dup_mask]
+
+ # Check if dataframe is empty
+ if df.empty:
+ print(f"Warning: File {filepath} contains no data")
+ continue
+
+ # Check if required columns exist
+ required_cols = ['Gene_Id', 'TPM', 'Reads']
+ missing_cols = [col for col in required_cols if col not in df.columns]
+
+ if missing_cols:
+ print(f"Warning: Missing columns {missing_cols} in {filepath}")
+ print(f"Available columns: {list(df.columns)}")
+ continue
+
+ # Check if there are any genes
+ if len(df) == 0:
+ print(f"Warning: No genes found in {filepath}")
+ continue
+
+ # Set gene IDs from first file
+ if gene_ids is None:
+ gene_ids = df['Gene_Id'].tolist()
+ print(f"Using gene list from {sample_name} with {len(gene_ids)} genes")
+
+ # Check if gene lists match
+ if gene_ids != df['Gene_Id'].tolist():
+ print(f"Warning: Gene list mismatch in {sample_name}")
+ # Try to align with existing gene_ids
+ df_aligned = df.set_index('Gene_Id').reindex(gene_ids)
+ tpm_data[sample_name] = df_aligned['TPM'].fillna(0).tolist()
+ reads_data[sample_name] = df_aligned['Reads'].fillna(0).tolist()
+ else:
+ # Extract TPM and Reads columns
+ tpm_data[sample_name] = df['TPM'].tolist()
+ reads_data[sample_name] = df['Reads'].tolist()
+
+ except Exception as e:
+ print(f"Error processing {filepath}: {e}")
+ continue
+
+ if not tpm_data:
+ print("No valid data found. Exiting.")
+ return
+
+ print(f"Successfully processed {len(tpm_data)} samples")
+ print(f"Found {len(gene_ids)} genes")
+
+ # Create TPM matrix
+ tpm_df = pd.DataFrame(tpm_data)
+ tpm_df.insert(0, 'Gene_Id', gene_ids)
+
+ # Create Reads matrix
+ reads_df = pd.DataFrame(reads_data)
+ reads_df.insert(0, 'Gene_Id', gene_ids)
+
+ # Create output directory if it doesn't exist
+ os.makedirs(output_dir, exist_ok=True)
+
+ # Save matrices
+ tpm_output = os.path.join(output_dir, req_paths["tpm_csv"])
+ reads_output = os.path.join(output_dir, req_paths["reads_csv"])
+
+ tpm_df.to_csv(tpm_output, index=False)
+ reads_df.to_csv(reads_output, index=False)
+
+ print(f"\nMatrices saved:")
+ print(f"TPM matrix: {tpm_output}")
+ print(f"Reads matrix: {reads_output}")
+ print(f"Matrix dimensions: {tpm_df.shape}")
+ print(f"Samples: {list(tpm_data.keys())}")
+
+def main():
+ """
+ This function combines the TPMCalculator results across all samples,
+ by gene, into merged TPM and Reads matrices.
+
+ Expected Input Directory Structure
+ ----------------------------------
+ The script expects the following structure:
+
+ base_dir/
+ └── tpmcalculator_output/
+ ├── SAMPLE_1/
+ │ └── SAMPLE_1...out_genes.out
+ ├── SAMPLE_2/
+ │ └── SAMPLE_2...out_genes.out
+ └── ...
+
+ Required Paths
+ --------------------
+ req_paths contains user-defined paths and output file names:
+
+ base_dir
+ Root directory containing 'tpmcalculator_output/'. This is the output directory
+ from the previous script.
+
+ tpm_csv
+ File name for the merged TPM matrix.
+
+ reads_csv
+ File name for the merged Reads matrix.
+ """
+
+ #################
+ # Define paths
+ ##################
+ req_paths = {
+ "base_dir": "",
+ "tpm_csv": "STAR_TPMcalculator_TPMs_Matrix_Merged_rerun.csv",
+ "reads_csv": "STAR_TPMcalculator_Reads_Matrix_Merged_rerun.csv"
+ }
+ output_dir = req_paths["base_dir"] + "/combined_results"
+
+ # Find TPMCalculator output files
+ print("Searching for TPMCalculator output files...")
+ output_files = find_tpmcalculator_outputs(req_paths["base_dir"])
+
+ if not output_files:
+ print("No TPMCalculator output files found!")
+ print("Expected pattern: tpmcalculator/*/*Collated.out_sorted_genes")
+ return
+
+ print(f"Found {len(output_files)} TPMCalculator output files")
+
+ # Show first few files for verification
+ print("\nFirst few files found:")
+ for f in output_files[:5]:
+ sample_name = extract_sample_name(f)
+ print(f" {sample_name}: {f}")
+
+ if len(output_files) > 5:
+ print(f" ... and {len(output_files) - 5} more")
+
+ # Combine matrices
+ combine_tpm_matrices(output_files, output_dir, req_paths)
+
+
+if __name__ == "__main__":
+ main()
diff --git a/post_processing/TPMCalculator/scripts/generate_tpmcalculator_jobs.py b/post_processing/TPMCalculator/scripts/generate_tpmcalculator_jobs.py
new file mode 100644
index 0000000..96f39e7
--- /dev/null
+++ b/post_processing/TPMCalculator/scripts/generate_tpmcalculator_jobs.py
@@ -0,0 +1,202 @@
+#!/usr/bin/env python3
+"""
+generate_tpmcalculator_jobs.py
+
+This module generates and submits SLURM job scripts for running TPMCalculator on
+multiple RNA-seq samples. It scans a PerSample directory containing sample-specific
+subdirectories, identifies the appropriate BAM file for each sample,
+and generates a SLURM job script that runs TPMCalculator using a
+provided GTF annotation file.
+
+Functions:
+ - create_slurm_job_script(sample_id, bam_file, gtf_file, tmp_output_dir)
+ - main()
+
+Usage:
+ python3 generate_tpmcalculator_slurm_jobs.py
+"""
+
+import os
+import glob
+import subprocess
+
+def create_slurm_job_script(sample_id, bam_file, gtf_file, tmp_output_dir):
+ """
+ Create a SLURM job script for TPMCalculator analysis.
+
+ Args:
+ - sample_id (string): Sample identifier
+ - bam_file (string): Path to the BAM file
+ - gtf_file (string): Path to the GTF file
+ - tmp_output_dir (string): Directory to save TPMCalculator output
+
+ Returns:
+ - job_script_path (string): Path to SLURM job bash script
+ """
+
+ job_script_content = f"""#!/bin/bash
+#SBATCH --job-name=tpmcalc_{sample_id}
+#SBATCH --cpus-per-task=4
+#SBATCH --mem=16G
+#SBATCH --time=08:00:00
+#SBATCH --output={tmp_output_dir}/{sample_id}_tpmcalculator_%j.out
+#SBATCH --error={tmp_output_dir}/{sample_id}_tpmcalculator_%j.err
+
+# TPMCalculator is available via conda installation
+
+# Define variables
+BAM="{bam_file}"
+GTF="{gtf_file}"
+OUTPUT_DIR="{tmp_output_dir}"
+
+# Run TPMCalculator
+echo "Starting TPMCalculator analysis for {sample_id}..."
+mkdir -p $OUTPUT_DIR
+cd $OUTPUT_DIR
+TPMCalculator -g $GTF -b $BAM -o $OUTPUT_DIR/{sample_id}_tpmcalculator.txt
+
+echo "TPMCalculator analysis completed for {sample_id}"
+"""
+
+ # Create the bash file containing the SLURM job
+ job_script_path = os.path.join(tmp_output_dir, f"{sample_id}_tpmcalculator_job.sh")
+
+ with open(job_script_path, 'w') as f:
+ f.write(job_script_content)
+
+ # Make the script executable
+ os.chmod(job_script_path, 0o755)
+
+ return job_script_path
+
+def main():
+ """
+ This function scans a project results directory, identifies sample
+ BAM files produced by the STAR pipeline, generates a SLURM job script
+ for each sample, and submits those jobs to the cluster.
+
+ Expected Input Directory Structure
+ ----------------------------------
+ The script expects the following directory structure:
+
+ base_dir/
+ ├── PerSample/
+ │ ├── SAMPLE_1/
+ │ │ └── results_STAR_htseq/
+ │ │ └── *_dedup_NoSingletonCollated.out.bam
+ │ │
+ │ ├── SAMPLE_2/
+ │ │ └── results_STAR_htseq/
+ │ │ └── *_dedup_NoSingletonCollated.out.bam
+ │ │
+ │ └── ...
+
+ Where:
+ - Each directory inside `PerSample/` represents one RNA-seq sample.
+ - Each sample directory must contain a `results_STAR_htseq/` folder.
+ - The BAM file used for TPMCalculator must match the pattern:
+ `*dedup_NoSingletonCollated.out.bam`.
+
+ Required Paths
+ --------------
+ req_paths is a dictionary containing required input and output paths:
+
+ base_dir
+ Root directory of the RNA-seq project results. This directory
+ must contain the `PerSample/` directory with all sample folders.
+
+ gtf_file
+ Path to the reference GTF annotation file used by TPMCalculator
+ to assign reads to genes and transcripts.
+
+ output_dir
+ Directory where TPMCalculator outputs and generated SLURM job
+ scripts will be written. A subdirectory for each sample will
+ be created automatically if it does not already exist.
+ """
+
+ #################
+ # Define paths
+ ##################
+ req_paths ={
+ "base_dir": "",
+ "gtf_file": "",
+ "output_dir": ""
+ }
+ per_sample_dir = os.path.join(req_paths["base_dir"], "PerSample")
+
+ # Check if GTF file exists
+ if not os.path.exists(req_paths["gtf_file"]):
+ print(f"Error: GTF file not found at {req_paths["gtf_file"]}")
+ return
+
+ # Check if PerSample directory exists
+ if not os.path.exists(per_sample_dir):
+ print(f"Error: PerSample directory not found at {per_sample_dir}")
+ return
+
+ # Find all sample directories
+ sample_dirs = [d for d in os.listdir(per_sample_dir)
+ if os.path.isdir(os.path.join(per_sample_dir, d))]
+
+ print(f"Found {len(sample_dirs)} sample directories")
+
+ job_scripts_created = []
+
+ for sample_dir in sample_dirs:
+ sample_path = os.path.join(per_sample_dir, sample_dir)
+ results_star_htseq_path = os.path.join(sample_path, "results_STAR_htseq")
+
+ # Check if results_STAR_htseq directory exists
+ if not os.path.exists(results_star_htseq_path):
+ print(f"Warning: results_STAR_htseq directory not found for {sample_dir}")
+ continue
+
+ # Find BAM file
+ bam_pattern = os.path.join(results_star_htseq_path, "*dedup_NoSingletonCollated.out.bam")
+ bam_files = glob.glob(bam_pattern)
+
+ if not bam_files:
+ print(f"Warning: No BAM file found for {sample_dir}")
+ continue
+ elif len(bam_files) > 1:
+ print(f"Warning: Multiple BAM files found for {sample_dir}, using the first one")
+
+ bam_file = bam_files[0]
+
+ # Create tpmcalculator directory
+ tpmcalculator_dir = req_paths["output_dir"] + sample_dir
+ os.makedirs(tpmcalculator_dir, exist_ok=True)
+
+ # Extract sample ID from directory name
+ sample_id = sample_dir
+
+ # Create SLURM job script
+ job_script_path = create_slurm_job_script(sample_id, bam_file, req_paths["gtf_file"], tpmcalculator_dir)
+ job_scripts_created.append(job_script_path)
+
+ print(f"Created job script for {sample_id}: {job_script_path}")
+
+ print(f"\nTotal job scripts created: {len(job_scripts_created)}")
+
+ # Submit all jobs automatically
+ print("\nSubmitting jobs to SLURM cluster...")
+ submitted_jobs = []
+
+ for job_script in job_scripts_created:
+ try:
+ result = subprocess.run(['sbatch', job_script], stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True)
+ if result.returncode == 0:
+ job_id = result.stdout.strip().split()[-1]
+ submitted_jobs.append(job_id)
+ print(f"Submitted job {job_id} for {job_script}")
+ else:
+ print(f"Failed to submit {job_script}: {result.stderr.strip()}")
+ except Exception as e:
+ print(f"Error submitting {job_script}: {e}")
+
+ print(f"\nSuccessfully submitted {len(submitted_jobs)} jobs out of {len(job_scripts_created)} total jobs.")
+
+
+if __name__ == "__main__":
+ main()
\ No newline at end of file
diff --git a/post_processing/TPMCalculator/scripts/gff3_to_gtf.py b/post_processing/TPMCalculator/scripts/gff3_to_gtf.py
new file mode 100644
index 0000000..9ac6a09
--- /dev/null
+++ b/post_processing/TPMCalculator/scripts/gff3_to_gtf.py
@@ -0,0 +1,162 @@
+#!/usr/bin/env python3
+"""
+gff3_to_gtf.py
+
+This script parses a GFF3 file and converts supported feature types
+(gene, mRNA/transcript, exon, CDS) into GTF format while maintaining
+the correct gene_id and transcript_id relationships.
+
+Functions:
+ - parse_attributes(attr_string)
+ - format_gtf_attributes(attributes, feature_type, gene_id=None, transcript_id=None)
+ - convert_gff3_to_gtf(input_file, output_file)
+ - main()
+
+Usage:
+ python3 gff3_to_gtf.py
+"""
+
+import sys
+import re
+from pathlib import Path
+
+def parse_attributes(attr_string):
+ """
+ GFF3 attributes are formatted as key-value pairs separated by
+ semicolons. This function converts the attribute string into a
+ Python dictionary for easier access.
+
+ Args:
+ - attr_string (str): Attribute column from a GFF3 file.
+
+ Returns:
+ - attributes (dict): Dictionary mapping attribute keys to values.
+ """
+ attributes = {}
+ if not attr_string or attr_string == '.':
+ return attributes
+
+ for item in attr_string.split(';'):
+ if '=' in item:
+ key, value = item.split('=', 1)
+ attributes[key] = value
+ return attributes
+
+def format_gtf_attributes(attributes, feature_type, gene_id=None, transcript_id=None):
+ """
+ Constructs the GTF attribute string using gene_id and transcript_id
+ along with optional gene_name or transcript_name fields if present.
+
+ Args:
+ - attributes (dict): Parsed GFF3 attributes.
+ - feature_type (str): Feature type (gene, transcript, exon, CDS).
+ - gene_id (str, optional): Gene identifier.
+ - transcript_id (str, optional): Transcript identifier.
+
+ Returns:
+ - str: Properly formatted GTF attribute string.
+ """
+ gtf_attrs = []
+
+ if gene_id:
+ gtf_attrs.append(f'gene_id "{gene_id}"')
+
+ if transcript_id:
+ gtf_attrs.append(f'transcript_id "{transcript_id}"')
+
+ if 'Name' in attributes:
+ if feature_type == 'gene':
+ gtf_attrs.append(f'gene_name "{attributes["Name"]}"')
+ elif feature_type in ['transcript', 'mRNA']:
+ gtf_attrs.append(f'transcript_name "{attributes["Name"]}"')
+
+ return '; '.join(gtf_attrs) + ';'
+
+def convert_gff3_to_gtf(input_file, output_file):
+ """
+ Reads a GFF3 file line by line, parses feature attributes, and writes
+ supported features (gene, transcript/mRNA, exon, CDS) to a GTF file.
+ Transcript-to-gene relationships are tracked to correctly assign
+ gene_id and transcript_id values.
+
+ Args:
+ - input_file (str): Path to the input GFF3 file.
+ - output_file (str): Path to the output GTF file.
+
+ Returns:
+ - N/A
+ """
+ transcript_to_gene = {}
+
+ with open(input_file, 'r') as infile, open(output_file, 'w') as outfile:
+ for line_num, line in enumerate(infile, 1):
+ if line_num % 100000 == 0:
+ print(f"Processed {line_num:,} lines...")
+
+ line = line.strip()
+
+ if line.startswith('#') or not line:
+ continue
+
+ fields = line.split('\t')
+ if len(fields) != 9:
+ continue
+
+ seqname, source, feature, start, end, score, strand, frame, attributes = fields
+ attr_dict = parse_attributes(attributes)
+
+ if feature == 'gene':
+ gene_id = attr_dict.get('ID', '')
+
+ gtf_attributes = format_gtf_attributes(attr_dict, feature, gene_id=gene_id)
+ gtf_line = f"{seqname}\t{source}\tgene\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{gtf_attributes}"
+ outfile.write(gtf_line + '\n')
+
+ elif feature in ['mRNA', 'transcript']:
+ transcript_id = attr_dict.get('ID', '')
+ parent_gene = attr_dict.get('Parent', '')
+
+ transcript_to_gene[transcript_id] = parent_gene
+
+ gtf_attributes = format_gtf_attributes(attr_dict, feature, gene_id=parent_gene, transcript_id=transcript_id)
+ gtf_line = f"{seqname}\t{source}\ttranscript\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{gtf_attributes}"
+ outfile.write(gtf_line + '\n')
+
+ elif feature in ['exon', 'CDS']:
+ parent_id = attr_dict.get('Parent', '')
+
+ # Check if parent is a transcript/mRNA or directly a gene
+ if parent_id in transcript_to_gene:
+ # Parent is a transcript
+ parent_gene = transcript_to_gene[parent_id]
+ transcript_id = parent_id
+ else:
+ # Parent might be a gene (some exons directly reference genes)
+ parent_gene = parent_id
+ transcript_id = parent_id # Use gene ID as transcript ID for these cases
+
+ gtf_attributes = format_gtf_attributes(attr_dict, feature, gene_id=parent_gene, transcript_id=transcript_id)
+ gtf_line = f"{seqname}\t{source}\t{feature}\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{gtf_attributes}"
+ outfile.write(gtf_line + '\n')
+
+def main():
+ if len(sys.argv) != 3:
+ print("Usage: python3 gff3_to_gtf.py ")
+ sys.exit(1)
+
+ input_file = sys.argv[1]
+ output_file = sys.argv[2]
+
+ if not Path(input_file).exists():
+ print(f"Error: Input file '{input_file}' not found.")
+ sys.exit(1)
+
+ try:
+ convert_gff3_to_gtf(input_file, output_file)
+ print(f"Successfully converted {input_file} to {output_file}")
+ except Exception as e:
+ print(f"Error during conversion: {e}")
+ sys.exit(1)
+
+if __name__ == "__main__":
+ main()
\ No newline at end of file
diff --git a/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/README.md b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/README.md
new file mode 100644
index 0000000..f0323e5
--- /dev/null
+++ b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/README.md
@@ -0,0 +1,11 @@
+# GFF3 to GTF Fixes
+
+This directory contains small helper scripts used to make manual fixes to GTF files after converting annotations from **GFF3 → GTF**.
+
+While the main conversion script performs the bulk of the format conversion, some reference annotations required small adjustments to ensure compatibility with downstream tools such as **TPMCalculator**. These fixes were typically related to feature naming or attribute formatting.
+
+Examples of fixes performed:
+- Converting CDS to exon label
+- Fixing GTF chromosome names to match BAM headers
+
+Each script documents the specific transformation it performs. These scripts were run **after the initial GFF3 → GTF conversion step** and before running downstream expression quantification.
\ No newline at end of file
diff --git a/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/cds_to_exon_gtf.sh b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/cds_to_exon_gtf.sh
new file mode 100755
index 0000000..a165fe3
--- /dev/null
+++ b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/cds_to_exon_gtf.sh
@@ -0,0 +1,29 @@
+#!/bin/bash
+# Convert CDS to exon for host so TPMCalculator can count them
+
+GTF="Valid_Plob_Cgor_concat.gtf"
+FIXED_GTF="Valid_Plob_Cgor_concat_fixed.gtf"
+
+awk 'BEGIN{OFS="\t"}
+{
+ # Split line into first 8 fields + rest as attributes
+ attrs = ""
+ if(NF>8) {
+ attrs = $9
+ for(i=10;i<=NF;i++) attrs = attrs " " $i
+ }
+
+ if($3=="CDS") {
+ $3="exon"
+ # Extract transcript_id
+ match(attrs,/transcript_id "[^"]+"/)
+ tid=substr(attrs,RSTART+15,RLENGTH-16) # get only the value inside quotes
+ # Prepend gene_id
+ attrs="gene_id \"" tid "\"; " attrs
+ }
+
+ # Rebuild the line: columns 1-8 + fixed attrs
+ print $1,$2,$3,$4,$5,$6,$7,$8,attrs
+}' "$GTF" > "$FIXED_GTF"
+
+echo "Fixed GTF created: $FIXED_GTF"
\ No newline at end of file
diff --git a/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/fix_gtf_scaffold_names.sh b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/fix_gtf_scaffold_names.sh
new file mode 100755
index 0000000..9e86284
--- /dev/null
+++ b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/fix_gtf_scaffold_names.sh
@@ -0,0 +1,19 @@
+#!/bin/bash
+# Fix GTF chromosome names to match BAM headers for TPMCalculator
+# Original issue: BAM and GTF scaffold names did not match, causing TPMCalculator to run without producing gene output.
+
+INPUT_GTF="Valid_Phar_Cgor_Dtre_Smic_concat.gtf"
+OUTPUT_GTF="Valid_Phar_Cgor_Dtre_Smic_concat_fixed.gtf"
+
+awk 'BEGIN{FS=OFS="\t"}
+{
+ match($1, /JBDLLT010000[0-9]+/)
+ if (RSTART) {
+ id=substr($1, RSTART, RLENGTH)
+ sub(/JBDLLT010000/, "", id)
+ $1="Host_Phar_gnl|WGS:JBDLLT|Phar_scaff_" id "|gb|JBDLLT010000" id
+ }
+ print
+}' "$INPUT_GTF" > "$OUTPUT_GTF"
+
+echo "Fixed GTF written to $OUTPUT_GTF"
\ No newline at end of file
diff --git a/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/underscore_to_pipe.sh b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/underscore_to_pipe.sh
new file mode 100755
index 0000000..3ee6a94
--- /dev/null
+++ b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/underscore_to_pipe.sh
@@ -0,0 +1,26 @@
+#!/bin/bash
+# Convert underscore naming convention to piped for Host
+
+GTF="Valid_Pdae_Cgor_Dtre_Smic_concat.gtf"
+FIXED_GTF="Valid_Pdae_Cgor_Dtre_Smic_concat_fixed.gtf"
+
+# Replace '_size' to '|size'
+awk -F'\t' -v OFS='\t' '
+ /^#/ {print; next}
+ {
+ # fix first column
+ sub(/_size/, "|size", $1)
+
+ # combine everything after column 8 into column 9
+ if (NF > 9) {
+ $9 = "";
+ for(i=9; i<=NF; i++) {
+ $9 = $9 $i
+ if(i "$FIXED_GTF"
+
+echo "Fixed GTF written to $FIXED_GTF"
\ No newline at end of file