From c479a8a51e3a2e11d1ed5de35b8d76d9287893e3 Mon Sep 17 00:00:00 2001 From: Yamina Katariya Date: Thu, 19 Mar 2026 13:56:15 -0700 Subject: [PATCH 1/4] publishing branch --- post_processing/TPMCalculator/README.md | 1 + 1 file changed, 1 insertion(+) create mode 100644 post_processing/TPMCalculator/README.md diff --git a/post_processing/TPMCalculator/README.md b/post_processing/TPMCalculator/README.md new file mode 100644 index 0000000..9c7f9d4 --- /dev/null +++ b/post_processing/TPMCalculator/README.md @@ -0,0 +1 @@ +## Hi \ No newline at end of file From ac1e40abef09be1f9b7e13ed9ebd403ec7120783 Mon Sep 17 00:00:00 2001 From: Yamina Katariya Date: Fri, 20 Mar 2026 10:57:14 -0700 Subject: [PATCH 2/4] pushing all files in the pipeline and associated READMEs for initial commit --- post_processing/TPMCalculator/README.md | 132 ++++++- post_processing/TPMCalculator/environment.yml | 217 +++++++++++ .../scripts/analyze_organism_totals.py | 340 ++++++++++++++++++ .../TPMCalculator/scripts/check_bam_sort.py | 138 +++++++ .../scripts/combine_tpmcalculator_results.py | 258 +++++++++++++ .../scripts/generate_tpmcalculator_jobs.py | 202 +++++++++++ .../TPMCalculator/scripts/gff3_to_gtf.py | 162 +++++++++ .../scripts/gff3_to_gtf_fixes/README.md | 11 + .../gff3_to_gtf_fixes/cds_to_exon_gtf.sh | 29 ++ .../fix_gtf_scaffold_names.sh | 19 + .../gff3_to_gtf_fixes/underscore_to_pipe.sh | 26 ++ 11 files changed, 1533 insertions(+), 1 deletion(-) create mode 100644 post_processing/TPMCalculator/environment.yml create mode 100644 post_processing/TPMCalculator/scripts/analyze_organism_totals.py create mode 100644 post_processing/TPMCalculator/scripts/check_bam_sort.py create mode 100644 post_processing/TPMCalculator/scripts/combine_tpmcalculator_results.py create mode 100644 post_processing/TPMCalculator/scripts/generate_tpmcalculator_jobs.py create mode 100644 post_processing/TPMCalculator/scripts/gff3_to_gtf.py create mode 100644 post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/README.md create mode 100755 post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/cds_to_exon_gtf.sh create mode 100755 post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/fix_gtf_scaffold_names.sh create mode 100755 post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/underscore_to_pipe.sh diff --git a/post_processing/TPMCalculator/README.md b/post_processing/TPMCalculator/README.md index 9c7f9d4..a7ea252 100644 --- a/post_processing/TPMCalculator/README.md +++ b/post_processing/TPMCalculator/README.md @@ -1 +1,131 @@ -## Hi \ No newline at end of file + +# mRNA Quantification using TPMCalculator + +## Introduction + +This pipeline quantifies mRNA abundance directly from BAM file alignments using **TPMCalculator** and generates visualizations to assess the results. It is designed to process multiple samples and combine results into comprehensive matrices for downstream analysis. + +The pipeline is organized into four scripts, each responsible for a different stage of processing and analysis. These scripts require access to a **SLURM** job manager for submitting and running the workflow. + +## Installation +This repository can be cloned locally by running the following `git` command: +```bash +git clone https://github.com/baliga-lab/Global_Search.git +``` +Please note that Git is required to run the above command. For instructions on downloading Git, please see [the Git Guide](https://github.com/git-guides/install-git). + + +### Environment + +#### Conda +This application is built on top of multiple Python packages with specific version requirements. We recommend using `conda` to create an isolated Python environment with all necessary packages. If you do not have it installed, follow the [Miniconda Instillation Instructions](https://www.anaconda.com/docs/getting-started/miniconda/install). The list of necessary packages can be found at in the [`environment.yml`](./environment.yml) file. + +To create the specified `gs-coral-env` Conda environment, run the following command: +```bash +conda env create -f environment.yml +``` + +Once the Conda environment is created, it can be activated by: +```bash +conda activate gs-coral-env +``` +After coding inside the environment, it can be deactivated with the command: +```bash +conda deactivate +``` + +**Please note! This environment file does not contain the packages required to run the whole Global_Search pipeline. The packages included are only enough to run the scripts in the TPMCalculator section**. Installation instructions for the Gloabl_Search pipleine can be found in the main [README](../../README.md) + +## TPMCalculator + +TPMCalculator requires two inputs: +- A **GTF annotation file** describing gene models. +- **BAM alignment files** for each sample. + +It produces four output files per sample containing TPM values and raw read counts for: +- **Genes** +- **Transcripts** +- **Exons** +- **Introns** + +## GFF to GTF + +While there are several ways to convert a GFF file to a GTF file, a custom script was used to facilitate this conversion. The [`gff3_to_gtf.py`](./scripts/gff3_to_gtf.py) script... +- Reads a GFF3 file +- Converts `gene`, `mRNA/transcript`, `exon`, and `CDS` features into GTF format +- Constructs valid `gene_id` and `transcript_id` fields +- Writes a new GTF file + +Run the following command to convert a GFF3 file to a GTF file: +```bash +python3 gff3_to_gtf.py +``` + +## Pipeline Overview + +The scripts should be run in the following order: + +0. **Generate the GTF file (if applicable)** + - See instructions above. + +1. **check_bam_sorted.py** + - **Purpose:** Verify that all BAM files are correctly sorted and log their sort order. + - **Inputs:** + - `base_dir`: Path to the folder containing BAM files. BAM files can be nested within sample directories. + - **Outputs:** + - TSV file logging the sort order of each BAM file (`bam_sort_status_.tsv`). + +2. **generate_tpmcalculator_results.py** + - **Purpose:** Generate TPMCalculator results for each sample and create SLURM job scripts to run the analyses. + - **Inputs:** + - `base_dir`: Path to the folder containing BAM files. Bam files may be nested. + - `gtf_file`: Path to the reference GTF annotation file. + - `tpmcalculator_dir`: Desired output directory for TPMCalculator results. + - **Outputs:** + - TPMCalculator output files for each sample (genes, transcripts, exons, introns). + - SLURM job scripts for each sample (`*_tpmcalculator_job.sh`). + +3. **combine_tpmcalculator_results.py** + - **Purpose:** Combine TPMCalculator results from all samples into unified TPM and reads matrices. + - **Inputs:** + - `base_dir`: Folder containing all TPMCalculator outputs for all samples. Output files may be nested. + - `output_dir`: Desired output directory for combined matrices. + - **Outputs:** + - TPM matrix (`STAR_TPMcalculator__TPMs_Matrix_Merged.csv`). + - Reads matrix (`STAR_TPMcalculator__Reads_Matrix_Merged.csv`). + +4. **analyze_organism_totals.py** + - **Purpose:** Summarize total TPMs and reads by organism (e.g., Host, Symbiont species) and generate visualizations to verify results. + - **Inputs:** + - `base_dir`: Folder containing combined TPMCalculator results. This is the output directory. + - `tpm_file`: Path to the merged TPM matrix. + - `reads_file`: Path to the merged reads matrix. + - **Outputs:** + - Summary CSV files of TPM and reads totals by organism (`TPM_totals_by_organism.csv`, `Reads_totals_by_organism.csv`). + - Barplots and summary figures (`organism_totals_barplots.png/pdf`, `organism_summary_stats.png/pdf`). + +## Recommended Directory Structure + +For clarity and organization, the pipeline outputs are recommended to be structured as follows: +``` +├── sample_organism_name/ +│ ├── combined_results/ +│ │ ├── organism_analysis/ +│ │ │ ├── +│ │ │ ├── ... +│ │ ├── STAR_TPMCalculator__Reads_Matrix_Merged.csv +│ │ ├── STAR_TPMCalculator__TPMs_Matrix_Merged.csv +│ ├── tpmcalculator/ +│ │ ├── sample_1/ +│ │ │ ├── +│ │ │ ├── ... +│ │ ├── sample_2/ +│ │ │ ├── +│ │ │ ├── ... +│ │ ├── sample_.../ +│ │ │ ├── +│ │ │ ├── ... +│ ├── reference/ +│ │ ├── .gff3 or gff +│ │ ├── .gtf +``` diff --git a/post_processing/TPMCalculator/environment.yml b/post_processing/TPMCalculator/environment.yml new file mode 100644 index 0000000..b3948ef --- /dev/null +++ b/post_processing/TPMCalculator/environment.yml @@ -0,0 +1,217 @@ +name: gs-coral-env +channels: + - bioconda + - conda-forge + - https://repo.anaconda.com/pkgs/main + - https://repo.anaconda.com/pkgs/r +dependencies: + - _libgcc_mutex=0.1=conda_forge + - _openmp_mutex=4.5=2_gnu + - _r-mutex=1.0.1=anacondar_1 + - agat=1.5.1=pl5321hdfd78af_0 + - bamtools=2.5.3=he132191_0 + - binutils_impl_linux-64=2.45=default_hfdba357_104 + - bwidget=1.10.1=ha770c72_1 + - bzip2=1.0.8=hda65f42_8 + - c-ares=1.34.5=hb9d3cd8_0 + - ca-certificates=2025.11.12=hbd8a1cb_0 + - cairo=1.18.4=h3394656_0 + - curl=8.16.0=h4e3cde8_0 + - font-ttf-dejavu-sans-mono=2.37=hab24e00_0 + - font-ttf-inconsolata=3.000=h77eed37_0 + - font-ttf-source-code-pro=2.038=h77eed37_0 + - font-ttf-ubuntu=0.83=h77eed37_3 + - fontconfig=2.15.0=h7e30c49_1 + - fonts-conda-ecosystem=1=0 + - fonts-conda-forge=1=hc364b38_1 + - freetype=2.14.1=ha770c72_0 + - fribidi=1.0.16=hb03c661_0 + - gcc_impl_linux-64=15.2.0=h6f0f26c_15 + - gffread=0.12.7=h077b44d_6 + - gfortran_impl_linux-64=15.2.0=h281d09f_15 + - graphite2=1.3.14=hecca717_2 + - gsl=2.7=he838d99_0 + - gxx_impl_linux-64=15.2.0=hda75c37_15 + - harfbuzz=12.2.0=h15599e2_0 + - htslib=1.22.1=h566b1c6_0 + - icu=75.1=he02047a_0 + - jsoncpp=1.9.6=hf42df4d_1 + - kernel-headers_linux-64=6.12.0=he073ed8_3 + - keyutils=1.6.3=hb9d3cd8_0 + - krb5=1.21.3=h659f571_0 + - ld_impl_linux-64=2.45=default_hbd61a6d_104 + - lerc=4.0.0=h0aef613_1 + - libblas=3.11.0=4_h4a7cf45_openblas + - libcblas=3.11.0=4_h0358290_openblas + - libcurl=8.16.0=h4e3cde8_0 + - libdb=6.2.32=h9c3ff4c_0 + - libdeflate=1.24=h86f0d12_0 + - libedit=3.1.20250104=pl5321h7949ede_0 + - libev=4.33=hd590300_2 + - libexpat=2.7.3=hecca717_0 + - libffi=3.5.2=h9ec8514_0 + - libfreetype=2.14.1=ha770c72_0 + - libfreetype6=2.14.1=h73754d4_0 + - libgcc=15.2.0=h767d61c_7 + - libgcc-devel_linux-64=15.2.0=hcc6f6b0_115 + - libgcc-ng=15.2.0=h69a702a_7 + - libgfortran=15.2.0=h69a702a_15 + - libgfortran-ng=15.2.0=h69a702a_15 + - libgfortran5=15.2.0=h68bc16d_15 + - libglib=2.86.2=h32235b2_0 + - libgomp=15.2.0=h767d61c_7 + - libiconv=1.18=h3b78370_2 + - libjpeg-turbo=3.1.2=hb03c661_0 + - liblapack=3.11.0=4_h47877c9_openblas + - liblzma=5.8.1=hb9d3cd8_2 + - libnghttp2=1.67.0=had1ee68_0 + - libopenblas=0.3.30=pthreads_h94d23a6_4 + - libpng=1.6.53=h421ea60_0 + - libsanitizer=15.2.0=h90f66d4_15 + - libssh2=1.11.1=hcf80075_0 + - libstdcxx=15.2.0=h8f9b012_7 + - libstdcxx-devel_linux-64=15.2.0=hd446a21_115 + - libstdcxx-ng=15.2.0=h4852527_7 + - libtiff=4.7.1=h8261f1e_0 + - libuuid=2.41.2=h5347b49_1 + - libwebp-base=1.6.0=hd42ef1d_0 + - libxcb=1.17.0=h8a09558_0 + - libxcrypt=4.4.36=hd590300_1 + - libzlib=1.3.1=hb9d3cd8_2 + - make=4.4.1=hb9d3cd8_2 + - ncurses=6.5=h2d0b736_3 + - openssl=3.6.0=h26f9b46_0 + - pango=1.56.4=hadf4263_0 + - pcre2=10.46=h1321c63_0 + - perl=5.32.1=7_hd590300_perl5 + - perl-b-cow=0.007=pl5321hb9d3cd8_1 + - perl-b-hooks-endofscope=0.28=pl5321ha770c72_1 + - perl-base=2.23=pl5321hd8ed1ab_0 + - perl-bioperl-core=1.7.8=pl5321hdfd78af_1 + - perl-business-isbn=3.007=pl5321hd8ed1ab_0 + - perl-business-isbn-data=20210112.006=pl5321hd8ed1ab_0 + - perl-carp=1.50=pl5321hd8ed1ab_0 + - perl-class-inspector=1.36=pl5321hdfd78af_0 + - perl-class-load=0.25=pl5321ha770c72_2 + - perl-class-load-xs=0.10=pl5321hb9d3cd8_0 + - perl-class-methodmaker=2.25=pl5321h7b50bb2_1 + - perl-clone=0.46=pl5321hb9d3cd8_1 + - perl-compress-raw-bzip2=2.214=pl5321hda65f42_0 + - perl-compress-raw-zlib=2.214=pl5321h4dac143_0 + - perl-constant=1.33=pl5321hd8ed1ab_0 + - perl-cpan-meta-check=0.014=pl5321ha770c72_0 + - perl-data-dump=1.25=pl5321h7b50bb2_2 + - perl-data-optlist=0.114=pl5321ha770c72_1 + - perl-db_file=1.858=pl5321hb9d3cd8_0 + - perl-devel-globaldestruction=0.14=pl5321ha770c72_0 + - perl-devel-overloadinfo=0.007=pl5321ha770c72_1 + - perl-devel-stacktrace=2.04=pl5321h296ab09_0 + - perl-digest-hmac=1.05=pl5321hdfd78af_0 + - perl-digest-md5=2.59=pl5321hb9d3cd8_3 + - perl-dist-checkconflicts=0.11=pl5321ha770c72_0 + - perl-encode=3.21=pl5321hb9d3cd8_1 + - perl-encode-locale=1.05=pl5321hdfd78af_7 + - perl-eval-closure=0.14=pl5321ha770c72_0 + - perl-exporter=5.74=pl5321hd8ed1ab_0 + - perl-exporter-tiny=1.002002=pl5321hd8ed1ab_0 + - perl-extutils-makemaker=7.70=pl5321hd8ed1ab_0 + - perl-file-listing=6.16=pl5321hdfd78af_0 + - perl-file-pushd=1.016=pl5321ha770c72_0 + - perl-file-share=0.25=pl5321hdfd78af_3 + - perl-file-sharedir=1.118=pl5321hdfd78af_0 + - perl-file-sharedir-install=0.14=pl5321hdfd78af_0 + - perl-file-spec=3.48_01=pl5321hdfd78af_2 + - perl-getopt-long=2.58=pl5321hdfd78af_0 + - perl-graph=0.9735=pl5321hdfd78af_0 + - perl-heap=0.80=pl5321hdfd78af_1 + - perl-html-parser=3.81=pl5321h4ac6f70_1 + - perl-html-tagset=3.24=pl5321hdfd78af_0 + - perl-http-cookiejar-lwp=0.014=pl5321hdfd78af_0 + - perl-http-cookies=6.11=pl5321hdfd78af_0 + - perl-http-daemon=6.16=pl5321hdfd78af_0 + - perl-http-date=6.06=pl5321hdfd78af_0 + - perl-http-message=7.01=pl5321hdfd78af_0 + - perl-http-negotiate=6.01=pl5321hdfd78af_4 + - perl-inc-latest=0.500=pl5321ha770c72_0 + - perl-io-html=1.004=pl5321hdfd78af_0 + - perl-io-socket-ssl=2.075=pl5321hd8ed1ab_0 + - perl-io-tty=1.20=pl5321hb9d3cd8_3 + - perl-ipc-run=20250809.0=pl5321hdfd78af_0 + - perl-libwww-perl=6.81=pl5321hdfd78af_0 + - perl-list-moreutils=0.430=pl5321hdfd78af_0 + - perl-list-moreutils-xs=0.430=pl5321h7b50bb2_5 + - perl-lwp-mediatypes=6.04=pl5321hdfd78af_1 + - perl-lwp-protocol-https=6.14=pl5321hdfd78af_1 + - perl-mime-base64=3.16=pl5321hb9d3cd8_3 + - perl-module-build=0.4234=pl5321ha770c72_1 + - perl-module-implementation=0.09=pl5321ha770c72_1 + - perl-module-load=0.34=pl5321hdfd78af_0 + - perl-module-runtime=0.016=pl5321ha770c72_1 + - perl-module-runtime-conflicts=0.003=pl5321ha770c72_0 + - perl-moose=2.2207=pl5321hb9d3cd8_2 + - perl-mozilla-ca=20250602=pl5321hdfd78af_0 + - perl-mro-compat=0.15=pl5321ha770c72_0 + - perl-namespace-clean=0.27=pl5321h296ab09_1 + - perl-net-http=6.24=pl5321hdfd78af_0 + - perl-net-ssleay=1.94=pl5321hf672d98_1 + - perl-ntlm=1.09=pl5321hdfd78af_5 + - perl-package-deprecationmanager=0.18=pl5321ha770c72_2 + - perl-package-stash=0.40=pl5321ha770c72_2 + - perl-package-stash-xs=0.30=pl5321hb9d3cd8_2 + - perl-params-util=1.102=pl5321hb9d3cd8_1 + - perl-parent=0.243=pl5321hd8ed1ab_0 + - perl-pod-escapes=1.07=pl5321hdfd78af_2 + - perl-regexp-common=2017060201=pl5321hd8ed1ab_0 + - perl-safe=2.37=pl5321hdfd78af_2 + - perl-scalar-list-utils=1.70=pl5321hb03c661_0 + - perl-set-object=1.43=pl5321h7b50bb2_0 + - perl-socket=2.027=pl5321h5c03b87_6 + - perl-sort-naturally=1.03=pl5321hdfd78af_3 + - perl-statistics-r=0.34=pl5321r44hdfd78af_7 + - perl-storable=3.15=pl5321hb9d3cd8_2 + - perl-sub-exporter=0.991=pl5321ha770c72_1 + - perl-sub-exporter-progressive=0.001013=pl5321ha770c72_0 + - perl-sub-identify=0.14=pl5321hb9d3cd8_2 + - perl-sub-install=0.928=pl5321hd8ed1ab_0 + - perl-sub-name=0.28=pl5321hb9d3cd8_1 + - perl-term-progressbar=2.23=pl5321hdfd78af_0 + - perl-test=1.26=pl5321hd8ed1ab_0 + - perl-test-cleannamespaces=0.24=pl5321ha770c72_2 + - perl-test-fatal=0.016=pl5321ha770c72_0 + - perl-test-harness=3.44=pl5321hd8ed1ab_0 + - perl-test-leaktrace=0.17=pl5321h7bf1ee8_1 + - perl-test-needs=0.002009=pl5321hd8ed1ab_0 + - perl-test-requiresinternet=0.05=pl5321hdfd78af_1 + - perl-test-warnings=0.031=pl5321ha770c72_0 + - perl-text-balanced=2.07=pl5321hdfd78af_0 + - perl-text-wrap=2021.0814=pl5321hd8ed1ab_0 + - perl-time-local=1.35=pl5321hdfd78af_0 + - perl-timedate=2.33=pl5321hdfd78af_2 + - perl-try-tiny=0.31=pl5321ha770c72_0 + - perl-uri=5.34=pl5321ha770c72_0 + - perl-url-encode=0.03=pl5321h9ee0642_1 + - perl-variable-magic=0.64=pl5321hb9d3cd8_0 + - perl-vars=1.03=pl5321hdfd78af_2 + - perl-www-robotrules=6.02=pl5321hdfd78af_4 + - perl-yaml=1.30=pl5321hdfd78af_0 + - pixman=0.46.4=h54a6638_1 + - pthread-stubs=0.4=hb9d3cd8_1002 + - r-base=4.4.3=h14df4e6_4 + - readline=8.2=h8c095d6_2 + - samtools=1.22.1=h96c455f_0 + - sed=4.9=h6688a6e_0 + - sysroot_linux-64=2.39=hc4b9eeb_3 + - tk=8.6.13=noxft_ha0e22de_103 + - tktable=2.10=h8d826fa_7 + - tpmcalculator=0.0.5=h2bd4fab_3 + - tzdata=2025b=h78e105d_0 + - xorg-libice=1.1.2=hb9d3cd8_0 + - xorg-libsm=1.2.6=he73a12e_0 + - xorg-libx11=1.8.12=h4f16b4b_0 + - xorg-libxau=1.0.12=hb03c661_1 + - xorg-libxdmcp=1.1.5=hb03c661_1 + - xorg-libxext=1.3.6=hb9d3cd8_0 + - xorg-libxrender=0.9.12=hb9d3cd8_0 + - xorg-libxt=1.3.1=hb9d3cd8_0 + - zlib=1.3.1=hb9d3cd8_2 + - zstd=1.5.7=hb8e6e7a_2 diff --git a/post_processing/TPMCalculator/scripts/analyze_organism_totals.py b/post_processing/TPMCalculator/scripts/analyze_organism_totals.py new file mode 100644 index 0000000..a0a3948 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/analyze_organism_totals.py @@ -0,0 +1,340 @@ +#!/usr/bin/env python3 +""" +analyze_organism_totals.py + +This script takes merged TPM and Reads matrices produced by the previous +TPMCalculator combination step and verifes that expression values were +assinged correctly to each organism for the multi-organism dataset. + +It also creates summary dataframes and barplots to verify processing worked correctly, +and generates the following outputs: + - TPM_totals_by_organism.csv + - Reads_totals_by_organism.csv + - organism_totals_barplots.png/.pdf + - organism_summary_stats.png/.pdf + +Functions: + - analyze_organism_totals(tpm_file, reads_file, output_dir, organisms) + - create_barplots(tpm_totals, reads_totals, output_dir, organisms) + - create_summary_stats_plot(tpm_totals, reads_totals, output_dir, organisms) + - main() + +Usage: + python3 analyze_organism_totals.py +""" + +import os +import pandas as pd +import matplotlib.pyplot as plt +import numpy as np + +def analyze_organism_totals(tpm_file, reads_file, output_dir, organisms): + """ + Analyze total TPMs and reads by organism prefix. + + Args: + - tpm_file (str): Path to TPM matrix CSV file + - reads_file (str): Path to reads matrix CSV file + - output_dir (str): Directory to save outputs + - organisms (dict): Dictionary containing the names and desired colors used + for graph labeling + + Returns: + - tpm_totals (pandas.DataFrame): DataFrame with TPM totals by organism + - reads_totals (pandas.DataFrame): DataFrame with reads totals by organism + """ + print("Loading TPM and reads matrices...") + + # Load matrices + tpm_df = pd.read_csv(tpm_file) + reads_df = pd.read_csv(reads_file) + + print(f"TPM matrix shape: {tpm_df.shape}") + print(f"Reads matrix shape: {reads_df.shape}") + + # Get sample columns (all except Gene_Id) + sample_columns = [col for col in tpm_df.columns if col != 'Gene_Id'] + print(f"Found {len(sample_columns)} samples") + + # Define organism prefixes + organism_prefixes = organisms["names"] + + # Initialize result dataframes + tpm_totals = pd.DataFrame(index=sample_columns, columns=organism_prefixes) + reads_totals = pd.DataFrame(index=sample_columns, columns=organism_prefixes) + + print("Calculating totals by organism...") + + # Calculate totals for each organism and sample + for sample in sample_columns: + for prefix in organism_prefixes: + # Filter genes by prefix + gene_mask = tpm_df['Gene_Id'].str.startswith(prefix) + + # Calculate totals + tpm_total = tpm_df.loc[gene_mask, sample].sum() + reads_total = reads_df.loc[gene_mask, sample].sum() + + tpm_totals.loc[sample, prefix] = tpm_total + reads_totals.loc[sample, prefix] = reads_total + + # Convert to numeric + tpm_totals = tpm_totals.astype(float) + reads_totals = reads_totals.astype(float) + + # Add total columns + tpm_totals['Total'] = tpm_totals.sum(axis=1) + reads_totals['Total'] = reads_totals.sum(axis=1) + + print("Summary statistics:") + print("\nTPM totals:") + print(tpm_totals.describe()) + print("\nReads totals:") + print(reads_totals.describe()) + + # Save summary dataframes + tpm_output = os.path.join(output_dir, "TPM_totals_by_organism.csv") + reads_output = os.path.join(output_dir, "Reads_totals_by_organism.csv") + + tpm_totals.to_csv(tpm_output) + reads_totals.to_csv(reads_output) + + print(f"\nSaved summary files:") + print(f"TPM totals: {tpm_output}") + print(f"Reads totals: {reads_output}") + + # Create visualizations + create_barplots(tpm_totals, reads_totals, output_dir, organisms) + + return tpm_totals, reads_totals + +def create_barplots(tpm_totals, reads_totals, output_dir, organisms): + """ + Create barplots for TPM and reads totals by organism. + + Args: + - tpm_totals (pandas.DataFrame): DataFrame with TPM totals by organism + - reads_totals (pandas.DataFrame): DataFrame with reads totals by organism + - output_dir (str): Directory to save plots + - organisms (dict): Dictionary containing the names and desired colors used + for graph labeling + + Returns: + - N/A + """ + print("Creating visualizations...") + + # Set up the plotting style + plt.style.use('default') + colors = organisms["colors"] + + # Remove Total column for plotting + organism_cols = organisms["names"] + + # Create figure with subplots + fig, axes = plt.subplots(2, 2, figsize=(15, 12)) + fig.suptitle('TPM and Reads Distribution by Organism', fontsize=16, fontweight='bold') + + # Plot 1: TPM totals by organism (stacked bar) + ax1 = axes[0, 0] + tpm_plot_data = tpm_totals[organism_cols] + tpm_plot_data.plot(kind='bar', stacked=True, ax=ax1, width=0.8, color=colors) + ax1.set_title('TPM Totals by Organism (Stacked)', fontweight='bold') + ax1.set_xlabel('Sample Index') + ax1.set_ylabel('TPM') + ax1.legend(title='Organism', bbox_to_anchor=(1.05, 1), loc='upper left') + ax1.set_xticks([]) # Remove x-axis labels for readability + + # Plot 2: Reads totals by organism (stacked bar) + ax2 = axes[0, 1] + reads_plot_data = reads_totals[organism_cols] + reads_plot_data.plot(kind='bar', stacked=True, ax=ax2, width=0.8, color=colors) + ax2.set_title('Reads Totals by Organism (Stacked)', fontweight='bold') + ax2.set_xlabel('Sample Index') + ax2.set_ylabel('Reads') + ax2.legend(title='Organism', bbox_to_anchor=(1.05, 1), loc='upper left') + ax2.set_xticks([]) # Remove x-axis labels for readability + + # Plot 3: TPM percentage by organism + ax3 = axes[1, 0] + tpm_percentages = tpm_plot_data.div(tpm_plot_data.sum(axis=1), axis=0) * 100 + tpm_percentages.plot(kind='bar', stacked=True, ax=ax3, width=0.8, color=colors) + ax3.set_title('TPM Distribution by Organism (%)', fontweight='bold') + ax3.set_xlabel('Sample Index') + ax3.set_ylabel('Percentage (%)') + ax3.legend(title='Organism', bbox_to_anchor=(1.05, 1), loc='upper left') + ax3.set_xticks([]) # Remove x-axis labels for readability + + # Plot 4: Reads percentage by organism + ax4 = axes[1, 1] + reads_percentages = reads_plot_data.div(reads_plot_data.sum(axis=1), axis=0) * 100 + reads_percentages.plot(kind='bar', stacked=True, ax=ax4, width=0.8, color=colors) + ax4.set_title('Reads Distribution by Organism (%)', fontweight='bold') + ax4.set_xlabel('Sample Index') + ax4.set_ylabel('Percentage (%)') + ax4.legend(title='Organism', bbox_to_anchor=(1.05, 1), loc='upper left') + ax4.set_xticks([]) # Remove x-axis labels for readability + + # Adjust layout + plt.tight_layout() + + # Save plot + plot_output = os.path.join(output_dir, "organism_totals_barplots.png") + plt.savefig(plot_output, dpi=300, bbox_inches='tight') + plt.savefig(plot_output.replace('.png', '.pdf'), bbox_inches='tight') + + print(f"Saved plots: {plot_output}") + plt.close() + + # Create summary statistics plot + create_summary_stats_plot(tpm_totals, reads_totals, output_dir, organisms) + +def create_summary_stats_plot(tpm_totals, reads_totals, output_dir, organisms): + """ + Create summary statistics visualization. + + Args: + - tpm_totals (pandas.DataFrame): DataFrame with TPM totals by organism + - reads_totals (pandas.DataFrame): DataFrame with reads totals by organism + - output_dir (str): Directory to save plots + - organisms (dict): Dictionary containing the names and desired colors used + for graph labeling + + Returns: + - N/A + """ + organism_cols = organisms["names"] + color_cols = organisms["colors"] + + fig, axes = plt.subplots(1, 2, figsize=(12, 5)) + fig.suptitle('Summary Statistics by Organism', fontsize=16, fontweight='bold') + + # TPM summary + ax1 = axes[0] + tpm_summary = tpm_totals[organism_cols].mean() + bars1 = ax1.bar(range(len(organism_cols)), tpm_summary.values, + color=color_cols) + ax1.set_title('Mean TPM by Organism', fontweight='bold') + ax1.set_xlabel('Organism') + ax1.set_ylabel('Mean TPM') + ax1.set_xticks(range(len(organism_cols))) + ax1.set_xticklabels(organism_cols, rotation=45) + + # Add value labels on bars + for i, (bar, val) in enumerate(zip(bars1, tpm_summary.values)): + ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + val*0.01, + f'{val:.0f}', ha='center', va='bottom') + + # Reads summary + ax2 = axes[1] + reads_summary = reads_totals[organism_cols].mean() + bars2 = ax2.bar(range(len(organism_cols)), reads_summary.values, + color=color_cols) + ax2.set_title('Mean Reads by Organism', fontweight='bold') + ax2.set_xlabel('Organism') + ax2.set_ylabel('Mean Reads') + ax2.set_xticks(range(len(organism_cols))) + ax2.set_xticklabels(organism_cols, rotation=45) + + # Add value labels on bars + for i, (bar, val) in enumerate(zip(bars2, reads_summary.values)): + ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + val*0.01, + f'{val:.0f}', ha='center', va='bottom') + + plt.tight_layout() + + # Save plot + summary_output = os.path.join(output_dir, "organism_summary_stats.png") + plt.savefig(summary_output, dpi=300, bbox_inches='tight') + plt.savefig(summary_output.replace('.png', '.pdf'), bbox_inches='tight') + + print(f"Saved summary plot: {summary_output}") + plt.close() + +def main(): + """ + Genenerates analysis and visualization written to the + base_dir/organism_analysis/ directory. + + Expected Input Directory Structure + ---------------------------------- + The script expects the following structure: + + base_dir/ + └── combined_results/ + │ ├── STAR_TPMcalculator_*_TPMs_Matrix_Merged.csv + │ ├── STAR_TPMcalculator_*_Reads_Matrix_Merged.csv + └── tpmcalculator_output/ + └── ... + + Organism Dictionary + ------------------- + organisms contains two fields used throughout the analysis: + + names + List of gene ID prefixes used to identify which organism each + gene belongs to. For example, a gene ID like "Sym_Cgor_gene123" + should be labeled as "Sym_Cgor_". + + colors + List of colors corresponding to each organism prefix. + + Required Paths + -------------------- + req_paths contains user-defined paths and output file names: + + combined_results_dir + Directory containing the 'combined_results/' folder from the + previous script in the pipeline. + + tpm_file + File name for the merged TPM matrix. + + reads_file + File name for the merged Reads matrix. + """ + + #################### + # Define Organisms + #################### + organisms = { + "names": ['Host_', 'Sym_Cgor_', 'Sym_Dtre_', 'Sym_Smic_'], + "colors": ['#1f77b4', '#ff7f0e', '#2ca02c' ,'#d62728'] + } + + #################### + # Define paths + #################### + req_paths = { + "combined_results_dir": "", + "tpm_file": "STAR_TPMcalculator_TPMs_Matrix_Merged_rerun.csv", + "reads_file": "STAR_TPMcalculator_Reads_Matrix_Merged_rerun.csv" + } + + tpm_file = os.path.join(req_paths["combined_results_dir"], req_paths["tpm_file"]) + reads_file = os.path.join(req_paths["combined_results_dir"], req_paths["reads_file"]) + + if os.path.exists(reads_file): + print("Using original reads matrix (may have different gene set)") + else: + reads_file = tpm_file + print("No reads matrix found, using TPM matrix for both analyses") + + # Check if input files exist + if not os.path.exists(tpm_file): + print(f"TPM matrix file not found: {tpm_file}") + print("Please run combine_tpmcalculator_results.py first") + return + + # Create output directory for analysis results + output_dir = os.path.join(req_paths["combined_results_dir"], "organism_analysis") + os.makedirs(output_dir, exist_ok=True) + + # Analyze organism totals + tpm_totals, reads_totals = analyze_organism_totals(tpm_file, reads_file, output_dir, organisms) + + print("\nAnalysis completed successfully!") + print(f"Results saved in: {output_dir}") + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/post_processing/TPMCalculator/scripts/check_bam_sort.py b/post_processing/TPMCalculator/scripts/check_bam_sort.py new file mode 100644 index 0000000..fef7e42 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/check_bam_sort.py @@ -0,0 +1,138 @@ +#!/usr/bin/env python3 +""" +check_bam_sort.py + +The script scans each sample directory, identifies the STAR-aligned BAM file, +determines its sort order using `samtools view -H`, and logs the results to +a TSV file for quick verification of alignment output consistency. + +Functions: + - get_sort_status(bam_path) + - main() + +Usage: + python3 check_bam_sort.py +""" + +import os +import glob +import subprocess +import csv + +def get_sort_status(bam_path): + """ + Returns samtools sort order: "coordinate", "queryname", "unsorted", + or "unknown". + + Args: + - bam_path (string): Path to bam file. + + Returns: + - (string): samtools sort order. + """ + try: + + # Run samtools command + header = subprocess.run( + ["samtools", "view", "-H", bam_path], + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + text=True + ) + + # Return sort order + if header.returncode != 0: + return "unknown" + + for line in header.stdout.splitlines(): + if line.startswith("@HD"): + if "SO:coordinate" in line: + return "coordinate" + elif "SO:queryname" in line: + return "queryname" + elif "SO:unsorted" in line: + return "unsorted" + else: + return "unsorted" + + return "unsorted" + except Exception: + return "unknown" + + +def main(): + """ + Finds bam files within a provided folder and logs the + samtools sort order. + + Expected Input Directory Structure + ---------------------------------- + The script expects the following directory structure: + + base_dir/ + ├── PerSample/ + │ ├── SAMPLE_1/ + │ │ └── results_STAR_htseq/ + │ │ └── *_dedup_NoSingletonCollated.out.bam + │ │ + │ ├── SAMPLE_2/ + │ │ └── results_STAR_htseq/ + │ │ └── *_dedup_NoSingletonCollated.out.bam + │ │ + │ └── ... + """ + + ################# + # Define paths + ################## + base_dir = "" + per_sample_dir = os.path.join(base_dir, "PerSample") + + log_file_name = "bam_sort_status_" + os.path.basename(base_dir) + ".tsv" + log_path = os.path.join(os.getcwd(), log_file_name) + + # Check if PerSample directory exists + if not os.path.exists(per_sample_dir): + print(f"Error: PerSample directory not found at {per_sample_dir}") + return + + # Find all sample directories + sample_dirs = [d for d in os.listdir(per_sample_dir) + if os.path.isdir(os.path.join(per_sample_dir, d))] + print(f"\nFound {len(sample_dirs)} sample directories") + + # Open TSV + with open(log_path, "w", newline="") as log_file: + writer = csv.writer(log_file, delimiter="\t") + writer.writerow(["sample_dir", "bam_file", "sort_status"]) + + # Loop over samples + for sample_dir in sample_dirs: + sample_path = os.path.join(per_sample_dir, sample_dir) + results_star_htseq_path = os.path.join(sample_path, "results_STAR_htseq") + + if not os.path.exists(results_star_htseq_path): + print(f"Warning: results_STAR_htseq directory not found for {sample_dir}") + continue + + # Find BAM file + bam_pattern = os.path.join(results_star_htseq_path, "*dedup_NoSingletonCollated.out.bam") + bam_files = glob.glob(bam_pattern) + + if not bam_files: + print(f"Warning: No BAM file found for {sample_dir}") + continue + elif len(bam_files) > 1: + print(f"Warning: Multiple BAM files found for {sample_dir}, using the first one") + + bam_file = bam_files[0] + + # Check sort status and log result + sort_status = get_sort_status(bam_file) + writer.writerow([sample_dir, bam_file, sort_status]) + + print(f"\nLogged BAM sort status to {log_path}\n") + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/post_processing/TPMCalculator/scripts/combine_tpmcalculator_results.py b/post_processing/TPMCalculator/scripts/combine_tpmcalculator_results.py new file mode 100644 index 0000000..ded8307 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/combine_tpmcalculator_results.py @@ -0,0 +1,258 @@ +#!/usr/bin/env python3 +""" +combine_tpmcalculator_results.py + +Combine TPMCalculator output files from multiple RNA-seq samples into merged matrices. +This script scans the TPMCalculator output directories and aligns TPM and Reads +values across all samples by gene. Two merged CSV are generated. + +Functions: + - extract_sample_name(filepath) + - find_tpmcalculator_outputs(base_dir) + - combine_tpm_matrices(output_files, output_dir, req_paths) + - main() + +Usage: + python3 combine_tpmcalculator_results.py +""" + +import os +import pandas as pd +import glob +import re +from pathlib import Path + +def extract_sample_name(filepath): + """ + Extract sample name from filepath by removing '_star_10_0.3_0.66_0_dedup' part. + + Args: + - filepath (str): Path to the TPMCalculator output file + + Returns: + - sample_name (str): Sample name + """ + filename = os.path.basename(filepath) + # Remove the file extension and TPMCalculator suffix + base_name = filename.replace('_tpmcalculator.txt_genes.out', '') + # Remove the STAR alignment suffix + sample_name = re.sub(r'_star_10_0\.3_0\.66_0_dedup.*', '', base_name) + return sample_name + +def find_tpmcalculator_outputs(base_dir): + """ + Find all TPMCalculator output files with various possible patterns. + + Args: + - base_dir (str): Base directory containing PerSample folders + + Returns: + - list: List of file paths + """ + # per_sample_dir = os.path.join(base_dir, "PerSample") + per_sample_dir = os.path.join(base_dir, "tpmcalculator_output") + + # Try multiple patterns for TPMCalculator output files + patterns = [ + os.path.join(per_sample_dir, "*", "*Collated.out_genes"), + os.path.join(per_sample_dir, "*", "*_genes.out"), + os.path.join(per_sample_dir, "*", "*NoSingletonCollated.out_genes.out"), + os.path.join(per_sample_dir, "*", "*dedup*.out_genes.out"), + ] + + print(patterns) + + output_files = [] + for pattern in patterns: + files = glob.glob(pattern) + if files: + output_files = files + print(f"Found files with pattern: {pattern}") + break + + return sorted(output_files) + +def combine_tpm_matrices(output_files, output_dir, req_paths): + """ + Combine TPMCalculator results into TPM and Reads matrices. + + Args: + - output_files (list): List of TPMCalculator output file paths + - output_dir (str): Directory to save the combined matrices + - req_paths (dict): Dictionary containing names of TPM and Reads CSVs + """ + tpm_data = {} + reads_data = {} + gene_ids = None + + print(f"Processing {len(output_files)} TPMCalculator output files...") + + for i, filepath in enumerate(output_files): + try: + # Extract sample name + sample_name = extract_sample_name(filepath) + print(f"Processing {i+1}/{len(output_files)}: {sample_name}") + + # Check file size first + file_size = os.path.getsize(filepath) + if file_size < 300: # Less than 300 bytes is likely empty (just headers) + print(f"Warning: File {filepath} is too small ({file_size} bytes), likely empty") + continue + + # Read the TPMCalculator output file + df = pd.read_csv(filepath, sep='\t') + + # Identify duplicated Gene_Id + dup_mask = df['Gene_Id'].duplicated(keep=False) + if dup_mask.any(): + skipped_file = os.path.join(output_dir, "Skipped_Duplicate_Gene_Ids.csv") + + # Save skipped rows + df[dup_mask].assign(sample=sample_name).to_csv( + skipped_file, + sep=",", + index=False, + mode="a", + header=not os.path.exists(skipped_file) + ) + df = df[~dup_mask] + + # Check if dataframe is empty + if df.empty: + print(f"Warning: File {filepath} contains no data") + continue + + # Check if required columns exist + required_cols = ['Gene_Id', 'TPM', 'Reads'] + missing_cols = [col for col in required_cols if col not in df.columns] + + if missing_cols: + print(f"Warning: Missing columns {missing_cols} in {filepath}") + print(f"Available columns: {list(df.columns)}") + continue + + # Check if there are any genes + if len(df) == 0: + print(f"Warning: No genes found in {filepath}") + continue + + # Set gene IDs from first file + if gene_ids is None: + gene_ids = df['Gene_Id'].tolist() + print(f"Using gene list from {sample_name} with {len(gene_ids)} genes") + + # Check if gene lists match + if gene_ids != df['Gene_Id'].tolist(): + print(f"Warning: Gene list mismatch in {sample_name}") + # Try to align with existing gene_ids + df_aligned = df.set_index('Gene_Id').reindex(gene_ids) + tpm_data[sample_name] = df_aligned['TPM'].fillna(0).tolist() + reads_data[sample_name] = df_aligned['Reads'].fillna(0).tolist() + else: + # Extract TPM and Reads columns + tpm_data[sample_name] = df['TPM'].tolist() + reads_data[sample_name] = df['Reads'].tolist() + + except Exception as e: + print(f"Error processing {filepath}: {e}") + continue + + if not tpm_data: + print("No valid data found. Exiting.") + return + + print(f"Successfully processed {len(tpm_data)} samples") + print(f"Found {len(gene_ids)} genes") + + # Create TPM matrix + tpm_df = pd.DataFrame(tpm_data) + tpm_df.insert(0, 'Gene_Id', gene_ids) + + # Create Reads matrix + reads_df = pd.DataFrame(reads_data) + reads_df.insert(0, 'Gene_Id', gene_ids) + + # Create output directory if it doesn't exist + os.makedirs(output_dir, exist_ok=True) + + # Save matrices + tpm_output = os.path.join(output_dir, req_paths["tpm_csv"]) + reads_output = os.path.join(output_dir, req_paths["reads_csv"]) + + tpm_df.to_csv(tpm_output, index=False) + reads_df.to_csv(reads_output, index=False) + + print(f"\nMatrices saved:") + print(f"TPM matrix: {tpm_output}") + print(f"Reads matrix: {reads_output}") + print(f"Matrix dimensions: {tpm_df.shape}") + print(f"Samples: {list(tpm_data.keys())}") + +def main(): + """ + This function combines the TPMCalculator results across all samples, + by gene, into merged TPM and Reads matrices. + + Expected Input Directory Structure + ---------------------------------- + The script expects the following structure: + + base_dir/ + └── tpmcalculator_output/ + ├── SAMPLE_1/ + │ └── SAMPLE_1...out_genes.out + ├── SAMPLE_2/ + │ └── SAMPLE_2...out_genes.out + └── ... + + Required Paths + -------------------- + req_paths contains user-defined paths and output file names: + + base_dir + Root directory containing 'tpmcalculator_output/'. This is the output directory + from the previous script. + + tpm_csv + File name for the merged TPM matrix. + + reads_csv + File name for the merged Reads matrix. + """ + + ################# + # Define paths + ################## + req_paths = { + "base_dir": "", + "tpm_csv": "STAR_TPMcalculator_TPMs_Matrix_Merged_rerun.csv", + "reads_csv": "STAR_TPMcalculator_Reads_Matrix_Merged_rerun.csv" + } + output_dir = req_paths["base_dir"] + "/combined_results" + + # Find TPMCalculator output files + print("Searching for TPMCalculator output files...") + output_files = find_tpmcalculator_outputs(req_paths["base_dir"]) + + if not output_files: + print("No TPMCalculator output files found!") + print("Expected pattern: tpmcalculator/*/*Collated.out_sorted_genes") + return + + print(f"Found {len(output_files)} TPMCalculator output files") + + # Show first few files for verification + print("\nFirst few files found:") + for f in output_files[:5]: + sample_name = extract_sample_name(f) + print(f" {sample_name}: {f}") + + if len(output_files) > 5: + print(f" ... and {len(output_files) - 5} more") + + # Combine matrices + combine_tpm_matrices(output_files, output_dir, req_paths) + + +if __name__ == "__main__": + main() diff --git a/post_processing/TPMCalculator/scripts/generate_tpmcalculator_jobs.py b/post_processing/TPMCalculator/scripts/generate_tpmcalculator_jobs.py new file mode 100644 index 0000000..96f39e7 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/generate_tpmcalculator_jobs.py @@ -0,0 +1,202 @@ +#!/usr/bin/env python3 +""" +generate_tpmcalculator_jobs.py + +This module generates and submits SLURM job scripts for running TPMCalculator on +multiple RNA-seq samples. It scans a PerSample directory containing sample-specific +subdirectories, identifies the appropriate BAM file for each sample, +and generates a SLURM job script that runs TPMCalculator using a +provided GTF annotation file. + +Functions: + - create_slurm_job_script(sample_id, bam_file, gtf_file, tmp_output_dir) + - main() + +Usage: + python3 generate_tpmcalculator_slurm_jobs.py +""" + +import os +import glob +import subprocess + +def create_slurm_job_script(sample_id, bam_file, gtf_file, tmp_output_dir): + """ + Create a SLURM job script for TPMCalculator analysis. + + Args: + - sample_id (string): Sample identifier + - bam_file (string): Path to the BAM file + - gtf_file (string): Path to the GTF file + - tmp_output_dir (string): Directory to save TPMCalculator output + + Returns: + - job_script_path (string): Path to SLURM job bash script + """ + + job_script_content = f"""#!/bin/bash +#SBATCH --job-name=tpmcalc_{sample_id} +#SBATCH --cpus-per-task=4 +#SBATCH --mem=16G +#SBATCH --time=08:00:00 +#SBATCH --output={tmp_output_dir}/{sample_id}_tpmcalculator_%j.out +#SBATCH --error={tmp_output_dir}/{sample_id}_tpmcalculator_%j.err + +# TPMCalculator is available via conda installation + +# Define variables +BAM="{bam_file}" +GTF="{gtf_file}" +OUTPUT_DIR="{tmp_output_dir}" + +# Run TPMCalculator +echo "Starting TPMCalculator analysis for {sample_id}..." +mkdir -p $OUTPUT_DIR +cd $OUTPUT_DIR +TPMCalculator -g $GTF -b $BAM -o $OUTPUT_DIR/{sample_id}_tpmcalculator.txt + +echo "TPMCalculator analysis completed for {sample_id}" +""" + + # Create the bash file containing the SLURM job + job_script_path = os.path.join(tmp_output_dir, f"{sample_id}_tpmcalculator_job.sh") + + with open(job_script_path, 'w') as f: + f.write(job_script_content) + + # Make the script executable + os.chmod(job_script_path, 0o755) + + return job_script_path + +def main(): + """ + This function scans a project results directory, identifies sample + BAM files produced by the STAR pipeline, generates a SLURM job script + for each sample, and submits those jobs to the cluster. + + Expected Input Directory Structure + ---------------------------------- + The script expects the following directory structure: + + base_dir/ + ├── PerSample/ + │ ├── SAMPLE_1/ + │ │ └── results_STAR_htseq/ + │ │ └── *_dedup_NoSingletonCollated.out.bam + │ │ + │ ├── SAMPLE_2/ + │ │ └── results_STAR_htseq/ + │ │ └── *_dedup_NoSingletonCollated.out.bam + │ │ + │ └── ... + + Where: + - Each directory inside `PerSample/` represents one RNA-seq sample. + - Each sample directory must contain a `results_STAR_htseq/` folder. + - The BAM file used for TPMCalculator must match the pattern: + `*dedup_NoSingletonCollated.out.bam`. + + Required Paths + -------------- + req_paths is a dictionary containing required input and output paths: + + base_dir + Root directory of the RNA-seq project results. This directory + must contain the `PerSample/` directory with all sample folders. + + gtf_file + Path to the reference GTF annotation file used by TPMCalculator + to assign reads to genes and transcripts. + + output_dir + Directory where TPMCalculator outputs and generated SLURM job + scripts will be written. A subdirectory for each sample will + be created automatically if it does not already exist. + """ + + ################# + # Define paths + ################## + req_paths ={ + "base_dir": "", + "gtf_file": "", + "output_dir": "" + } + per_sample_dir = os.path.join(req_paths["base_dir"], "PerSample") + + # Check if GTF file exists + if not os.path.exists(req_paths["gtf_file"]): + print(f"Error: GTF file not found at {req_paths["gtf_file"]}") + return + + # Check if PerSample directory exists + if not os.path.exists(per_sample_dir): + print(f"Error: PerSample directory not found at {per_sample_dir}") + return + + # Find all sample directories + sample_dirs = [d for d in os.listdir(per_sample_dir) + if os.path.isdir(os.path.join(per_sample_dir, d))] + + print(f"Found {len(sample_dirs)} sample directories") + + job_scripts_created = [] + + for sample_dir in sample_dirs: + sample_path = os.path.join(per_sample_dir, sample_dir) + results_star_htseq_path = os.path.join(sample_path, "results_STAR_htseq") + + # Check if results_STAR_htseq directory exists + if not os.path.exists(results_star_htseq_path): + print(f"Warning: results_STAR_htseq directory not found for {sample_dir}") + continue + + # Find BAM file + bam_pattern = os.path.join(results_star_htseq_path, "*dedup_NoSingletonCollated.out.bam") + bam_files = glob.glob(bam_pattern) + + if not bam_files: + print(f"Warning: No BAM file found for {sample_dir}") + continue + elif len(bam_files) > 1: + print(f"Warning: Multiple BAM files found for {sample_dir}, using the first one") + + bam_file = bam_files[0] + + # Create tpmcalculator directory + tpmcalculator_dir = req_paths["output_dir"] + sample_dir + os.makedirs(tpmcalculator_dir, exist_ok=True) + + # Extract sample ID from directory name + sample_id = sample_dir + + # Create SLURM job script + job_script_path = create_slurm_job_script(sample_id, bam_file, req_paths["gtf_file"], tpmcalculator_dir) + job_scripts_created.append(job_script_path) + + print(f"Created job script for {sample_id}: {job_script_path}") + + print(f"\nTotal job scripts created: {len(job_scripts_created)}") + + # Submit all jobs automatically + print("\nSubmitting jobs to SLURM cluster...") + submitted_jobs = [] + + for job_script in job_scripts_created: + try: + result = subprocess.run(['sbatch', job_script], stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True) + if result.returncode == 0: + job_id = result.stdout.strip().split()[-1] + submitted_jobs.append(job_id) + print(f"Submitted job {job_id} for {job_script}") + else: + print(f"Failed to submit {job_script}: {result.stderr.strip()}") + except Exception as e: + print(f"Error submitting {job_script}: {e}") + + print(f"\nSuccessfully submitted {len(submitted_jobs)} jobs out of {len(job_scripts_created)} total jobs.") + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/post_processing/TPMCalculator/scripts/gff3_to_gtf.py b/post_processing/TPMCalculator/scripts/gff3_to_gtf.py new file mode 100644 index 0000000..9ac6a09 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/gff3_to_gtf.py @@ -0,0 +1,162 @@ +#!/usr/bin/env python3 +""" +gff3_to_gtf.py + +This script parses a GFF3 file and converts supported feature types +(gene, mRNA/transcript, exon, CDS) into GTF format while maintaining +the correct gene_id and transcript_id relationships. + +Functions: + - parse_attributes(attr_string) + - format_gtf_attributes(attributes, feature_type, gene_id=None, transcript_id=None) + - convert_gff3_to_gtf(input_file, output_file) + - main() + +Usage: + python3 gff3_to_gtf.py +""" + +import sys +import re +from pathlib import Path + +def parse_attributes(attr_string): + """ + GFF3 attributes are formatted as key-value pairs separated by + semicolons. This function converts the attribute string into a + Python dictionary for easier access. + + Args: + - attr_string (str): Attribute column from a GFF3 file. + + Returns: + - attributes (dict): Dictionary mapping attribute keys to values. + """ + attributes = {} + if not attr_string or attr_string == '.': + return attributes + + for item in attr_string.split(';'): + if '=' in item: + key, value = item.split('=', 1) + attributes[key] = value + return attributes + +def format_gtf_attributes(attributes, feature_type, gene_id=None, transcript_id=None): + """ + Constructs the GTF attribute string using gene_id and transcript_id + along with optional gene_name or transcript_name fields if present. + + Args: + - attributes (dict): Parsed GFF3 attributes. + - feature_type (str): Feature type (gene, transcript, exon, CDS). + - gene_id (str, optional): Gene identifier. + - transcript_id (str, optional): Transcript identifier. + + Returns: + - str: Properly formatted GTF attribute string. + """ + gtf_attrs = [] + + if gene_id: + gtf_attrs.append(f'gene_id "{gene_id}"') + + if transcript_id: + gtf_attrs.append(f'transcript_id "{transcript_id}"') + + if 'Name' in attributes: + if feature_type == 'gene': + gtf_attrs.append(f'gene_name "{attributes["Name"]}"') + elif feature_type in ['transcript', 'mRNA']: + gtf_attrs.append(f'transcript_name "{attributes["Name"]}"') + + return '; '.join(gtf_attrs) + ';' + +def convert_gff3_to_gtf(input_file, output_file): + """ + Reads a GFF3 file line by line, parses feature attributes, and writes + supported features (gene, transcript/mRNA, exon, CDS) to a GTF file. + Transcript-to-gene relationships are tracked to correctly assign + gene_id and transcript_id values. + + Args: + - input_file (str): Path to the input GFF3 file. + - output_file (str): Path to the output GTF file. + + Returns: + - N/A + """ + transcript_to_gene = {} + + with open(input_file, 'r') as infile, open(output_file, 'w') as outfile: + for line_num, line in enumerate(infile, 1): + if line_num % 100000 == 0: + print(f"Processed {line_num:,} lines...") + + line = line.strip() + + if line.startswith('#') or not line: + continue + + fields = line.split('\t') + if len(fields) != 9: + continue + + seqname, source, feature, start, end, score, strand, frame, attributes = fields + attr_dict = parse_attributes(attributes) + + if feature == 'gene': + gene_id = attr_dict.get('ID', '') + + gtf_attributes = format_gtf_attributes(attr_dict, feature, gene_id=gene_id) + gtf_line = f"{seqname}\t{source}\tgene\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{gtf_attributes}" + outfile.write(gtf_line + '\n') + + elif feature in ['mRNA', 'transcript']: + transcript_id = attr_dict.get('ID', '') + parent_gene = attr_dict.get('Parent', '') + + transcript_to_gene[transcript_id] = parent_gene + + gtf_attributes = format_gtf_attributes(attr_dict, feature, gene_id=parent_gene, transcript_id=transcript_id) + gtf_line = f"{seqname}\t{source}\ttranscript\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{gtf_attributes}" + outfile.write(gtf_line + '\n') + + elif feature in ['exon', 'CDS']: + parent_id = attr_dict.get('Parent', '') + + # Check if parent is a transcript/mRNA or directly a gene + if parent_id in transcript_to_gene: + # Parent is a transcript + parent_gene = transcript_to_gene[parent_id] + transcript_id = parent_id + else: + # Parent might be a gene (some exons directly reference genes) + parent_gene = parent_id + transcript_id = parent_id # Use gene ID as transcript ID for these cases + + gtf_attributes = format_gtf_attributes(attr_dict, feature, gene_id=parent_gene, transcript_id=transcript_id) + gtf_line = f"{seqname}\t{source}\t{feature}\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{gtf_attributes}" + outfile.write(gtf_line + '\n') + +def main(): + if len(sys.argv) != 3: + print("Usage: python3 gff3_to_gtf.py ") + sys.exit(1) + + input_file = sys.argv[1] + output_file = sys.argv[2] + + if not Path(input_file).exists(): + print(f"Error: Input file '{input_file}' not found.") + sys.exit(1) + + try: + convert_gff3_to_gtf(input_file, output_file) + print(f"Successfully converted {input_file} to {output_file}") + except Exception as e: + print(f"Error during conversion: {e}") + sys.exit(1) + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/README.md b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/README.md new file mode 100644 index 0000000..f0323e5 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/README.md @@ -0,0 +1,11 @@ +# GFF3 to GTF Fixes + +This directory contains small helper scripts used to make manual fixes to GTF files after converting annotations from **GFF3 → GTF**. + +While the main conversion script performs the bulk of the format conversion, some reference annotations required small adjustments to ensure compatibility with downstream tools such as **TPMCalculator**. These fixes were typically related to feature naming or attribute formatting. + +Examples of fixes performed: +- Converting CDS to exon label +- Fixing GTF chromosome names to match BAM headers + +Each script documents the specific transformation it performs. These scripts were run **after the initial GFF3 → GTF conversion step** and before running downstream expression quantification. \ No newline at end of file diff --git a/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/cds_to_exon_gtf.sh b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/cds_to_exon_gtf.sh new file mode 100755 index 0000000..a165fe3 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/cds_to_exon_gtf.sh @@ -0,0 +1,29 @@ +#!/bin/bash +# Convert CDS to exon for host so TPMCalculator can count them + +GTF="Valid_Plob_Cgor_concat.gtf" +FIXED_GTF="Valid_Plob_Cgor_concat_fixed.gtf" + +awk 'BEGIN{OFS="\t"} +{ + # Split line into first 8 fields + rest as attributes + attrs = "" + if(NF>8) { + attrs = $9 + for(i=10;i<=NF;i++) attrs = attrs " " $i + } + + if($3=="CDS") { + $3="exon" + # Extract transcript_id + match(attrs,/transcript_id "[^"]+"/) + tid=substr(attrs,RSTART+15,RLENGTH-16) # get only the value inside quotes + # Prepend gene_id + attrs="gene_id \"" tid "\"; " attrs + } + + # Rebuild the line: columns 1-8 + fixed attrs + print $1,$2,$3,$4,$5,$6,$7,$8,attrs +}' "$GTF" > "$FIXED_GTF" + +echo "Fixed GTF created: $FIXED_GTF" \ No newline at end of file diff --git a/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/fix_gtf_scaffold_names.sh b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/fix_gtf_scaffold_names.sh new file mode 100755 index 0000000..9e86284 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/fix_gtf_scaffold_names.sh @@ -0,0 +1,19 @@ +#!/bin/bash +# Fix GTF chromosome names to match BAM headers for TPMCalculator +# Original issue: BAM and GTF scaffold names did not match, causing TPMCalculator to run without producing gene output. + +INPUT_GTF="Valid_Phar_Cgor_Dtre_Smic_concat.gtf" +OUTPUT_GTF="Valid_Phar_Cgor_Dtre_Smic_concat_fixed.gtf" + +awk 'BEGIN{FS=OFS="\t"} +{ + match($1, /JBDLLT010000[0-9]+/) + if (RSTART) { + id=substr($1, RSTART, RLENGTH) + sub(/JBDLLT010000/, "", id) + $1="Host_Phar_gnl|WGS:JBDLLT|Phar_scaff_" id "|gb|JBDLLT010000" id + } + print +}' "$INPUT_GTF" > "$OUTPUT_GTF" + +echo "Fixed GTF written to $OUTPUT_GTF" \ No newline at end of file diff --git a/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/underscore_to_pipe.sh b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/underscore_to_pipe.sh new file mode 100755 index 0000000..3ee6a94 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/underscore_to_pipe.sh @@ -0,0 +1,26 @@ +#!/bin/bash +# Convert underscore naming convention to piped for Host + +GTF="Valid_Pdae_Cgor_Dtre_Smic_concat.gtf" +FIXED_GTF="Valid_Pdae_Cgor_Dtre_Smic_concat_fixed.gtf" + +# Replace '_size' to '|size' +awk -F'\t' -v OFS='\t' ' + /^#/ {print; next} + { + # fix first column + sub(/_size/, "|size", $1) + + # combine everything after column 8 into column 9 + if (NF > 9) { + $9 = ""; + for(i=9; i<=NF; i++) { + $9 = $9 $i + if(i "$FIXED_GTF" + +echo "Fixed GTF written to $FIXED_GTF" \ No newline at end of file From f80f148add99994c000a24b001433253ce0539cb Mon Sep 17 00:00:00 2001 From: Yamina Katariya Date: Fri, 20 Mar 2026 10:59:58 -0700 Subject: [PATCH 3/4] fixed typo --- post_processing/TPMCalculator/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/post_processing/TPMCalculator/README.md b/post_processing/TPMCalculator/README.md index a7ea252..774d0ce 100644 --- a/post_processing/TPMCalculator/README.md +++ b/post_processing/TPMCalculator/README.md @@ -115,7 +115,7 @@ For clarity and organization, the pipeline outputs are recommended to be structu │ │ │ ├── ... │ │ ├── STAR_TPMCalculator__Reads_Matrix_Merged.csv │ │ ├── STAR_TPMCalculator__TPMs_Matrix_Merged.csv -│ ├── tpmcalculator/ +│ ├── tpmcalculator_output/ │ │ ├── sample_1/ │ │ │ ├── │ │ │ ├── ... From 5a683da6a5398aba9cae4c9f1b2995a63f3efda1 Mon Sep 17 00:00:00 2001 From: Yamina Katariya Date: Fri, 20 Mar 2026 11:27:11 -0700 Subject: [PATCH 4/4] updated README with table of contents --- post_processing/TPMCalculator/README.md | 19 +++++++++++++++++-- 1 file changed, 17 insertions(+), 2 deletions(-) diff --git a/post_processing/TPMCalculator/README.md b/post_processing/TPMCalculator/README.md index 774d0ce..72fb754 100644 --- a/post_processing/TPMCalculator/README.md +++ b/post_processing/TPMCalculator/README.md @@ -1,12 +1,23 @@ # mRNA Quantification using TPMCalculator + ## Introduction This pipeline quantifies mRNA abundance directly from BAM file alignments using **TPMCalculator** and generates visualizations to assess the results. It is designed to process multiple samples and combine results into comprehensive matrices for downstream analysis. The pipeline is organized into four scripts, each responsible for a different stage of processing and analysis. These scripts require access to a **SLURM** job manager for submitting and running the workflow. +## Table of Contents +* [Introduction](#introduction) +* [Installation](#installation) + * [Environment](#environment) +* [TPMCalculator](#tpmcalc) +* [GFF3 to GTF](#gff3_to_gtf) +* [Pipeline Overview](#pipeline_overview) +* [Recommended Directory Structure](#dir_struc) + + ## Installation This repository can be cloned locally by running the following `git` command: ```bash @@ -36,6 +47,7 @@ conda deactivate **Please note! This environment file does not contain the packages required to run the whole Global_Search pipeline. The packages included are only enough to run the scripts in the TPMCalculator section**. Installation instructions for the Gloabl_Search pipleine can be found in the main [README](../../README.md) + ## TPMCalculator TPMCalculator requires two inputs: @@ -48,7 +60,8 @@ It produces four output files per sample containing TPM values and raw read coun - **Exons** - **Introns** -## GFF to GTF + +## GFF3 to GTF While there are several ways to convert a GFF file to a GTF file, a custom script was used to facilitate this conversion. The [`gff3_to_gtf.py`](./scripts/gff3_to_gtf.py) script... - Reads a GFF3 file @@ -61,11 +74,12 @@ Run the following command to convert a GFF3 file to a GTF file: python3 gff3_to_gtf.py ``` + ## Pipeline Overview The scripts should be run in the following order: -0. **Generate the GTF file (if applicable)** +0. **gff3_to_gtf.py** - See instructions above. 1. **check_bam_sorted.py** @@ -104,6 +118,7 @@ The scripts should be run in the following order: - Summary CSV files of TPM and reads totals by organism (`TPM_totals_by_organism.csv`, `Reads_totals_by_organism.csv`). - Barplots and summary figures (`organism_totals_barplots.png/pdf`, `organism_summary_stats.png/pdf`). + ## Recommended Directory Structure For clarity and organization, the pipeline outputs are recommended to be structured as follows: