diff --git a/post_processing/TPMCalculator/README.md b/post_processing/TPMCalculator/README.md new file mode 100644 index 0000000..72fb754 --- /dev/null +++ b/post_processing/TPMCalculator/README.md @@ -0,0 +1,146 @@ + +# mRNA Quantification using TPMCalculator + + +## Introduction + +This pipeline quantifies mRNA abundance directly from BAM file alignments using **TPMCalculator** and generates visualizations to assess the results. It is designed to process multiple samples and combine results into comprehensive matrices for downstream analysis. + +The pipeline is organized into four scripts, each responsible for a different stage of processing and analysis. These scripts require access to a **SLURM** job manager for submitting and running the workflow. + +## Table of Contents +* [Introduction](#introduction) +* [Installation](#installation) + * [Environment](#environment) +* [TPMCalculator](#tpmcalc) +* [GFF3 to GTF](#gff3_to_gtf) +* [Pipeline Overview](#pipeline_overview) +* [Recommended Directory Structure](#dir_struc) + + +## Installation +This repository can be cloned locally by running the following `git` command: +```bash +git clone https://github.com/baliga-lab/Global_Search.git +``` +Please note that Git is required to run the above command. For instructions on downloading Git, please see [the Git Guide](https://github.com/git-guides/install-git). + + +### Environment + +#### Conda +This application is built on top of multiple Python packages with specific version requirements. We recommend using `conda` to create an isolated Python environment with all necessary packages. If you do not have it installed, follow the [Miniconda Instillation Instructions](https://www.anaconda.com/docs/getting-started/miniconda/install). The list of necessary packages can be found at in the [`environment.yml`](./environment.yml) file. + +To create the specified `gs-coral-env` Conda environment, run the following command: +```bash +conda env create -f environment.yml +``` + +Once the Conda environment is created, it can be activated by: +```bash +conda activate gs-coral-env +``` +After coding inside the environment, it can be deactivated with the command: +```bash +conda deactivate +``` + +**Please note! This environment file does not contain the packages required to run the whole Global_Search pipeline. The packages included are only enough to run the scripts in the TPMCalculator section**. Installation instructions for the Gloabl_Search pipleine can be found in the main [README](../../README.md) + + +## TPMCalculator + +TPMCalculator requires two inputs: +- A **GTF annotation file** describing gene models. +- **BAM alignment files** for each sample. + +It produces four output files per sample containing TPM values and raw read counts for: +- **Genes** +- **Transcripts** +- **Exons** +- **Introns** + + +## GFF3 to GTF + +While there are several ways to convert a GFF file to a GTF file, a custom script was used to facilitate this conversion. The [`gff3_to_gtf.py`](./scripts/gff3_to_gtf.py) script... +- Reads a GFF3 file +- Converts `gene`, `mRNA/transcript`, `exon`, and `CDS` features into GTF format +- Constructs valid `gene_id` and `transcript_id` fields +- Writes a new GTF file + +Run the following command to convert a GFF3 file to a GTF file: +```bash +python3 gff3_to_gtf.py +``` + + +## Pipeline Overview + +The scripts should be run in the following order: + +0. **gff3_to_gtf.py** + - See instructions above. + +1. **check_bam_sorted.py** + - **Purpose:** Verify that all BAM files are correctly sorted and log their sort order. + - **Inputs:** + - `base_dir`: Path to the folder containing BAM files. BAM files can be nested within sample directories. + - **Outputs:** + - TSV file logging the sort order of each BAM file (`bam_sort_status_.tsv`). + +2. **generate_tpmcalculator_results.py** + - **Purpose:** Generate TPMCalculator results for each sample and create SLURM job scripts to run the analyses. + - **Inputs:** + - `base_dir`: Path to the folder containing BAM files. Bam files may be nested. + - `gtf_file`: Path to the reference GTF annotation file. + - `tpmcalculator_dir`: Desired output directory for TPMCalculator results. + - **Outputs:** + - TPMCalculator output files for each sample (genes, transcripts, exons, introns). + - SLURM job scripts for each sample (`*_tpmcalculator_job.sh`). + +3. **combine_tpmcalculator_results.py** + - **Purpose:** Combine TPMCalculator results from all samples into unified TPM and reads matrices. + - **Inputs:** + - `base_dir`: Folder containing all TPMCalculator outputs for all samples. Output files may be nested. + - `output_dir`: Desired output directory for combined matrices. + - **Outputs:** + - TPM matrix (`STAR_TPMcalculator__TPMs_Matrix_Merged.csv`). + - Reads matrix (`STAR_TPMcalculator__Reads_Matrix_Merged.csv`). + +4. **analyze_organism_totals.py** + - **Purpose:** Summarize total TPMs and reads by organism (e.g., Host, Symbiont species) and generate visualizations to verify results. + - **Inputs:** + - `base_dir`: Folder containing combined TPMCalculator results. This is the output directory. + - `tpm_file`: Path to the merged TPM matrix. + - `reads_file`: Path to the merged reads matrix. + - **Outputs:** + - Summary CSV files of TPM and reads totals by organism (`TPM_totals_by_organism.csv`, `Reads_totals_by_organism.csv`). + - Barplots and summary figures (`organism_totals_barplots.png/pdf`, `organism_summary_stats.png/pdf`). + + +## Recommended Directory Structure + +For clarity and organization, the pipeline outputs are recommended to be structured as follows: +``` +├── sample_organism_name/ +│ ├── combined_results/ +│ │ ├── organism_analysis/ +│ │ │ ├── +│ │ │ ├── ... +│ │ ├── STAR_TPMCalculator__Reads_Matrix_Merged.csv +│ │ ├── STAR_TPMCalculator__TPMs_Matrix_Merged.csv +│ ├── tpmcalculator_output/ +│ │ ├── sample_1/ +│ │ │ ├── +│ │ │ ├── ... +│ │ ├── sample_2/ +│ │ │ ├── +│ │ │ ├── ... +│ │ ├── sample_.../ +│ │ │ ├── +│ │ │ ├── ... +│ ├── reference/ +│ │ ├── .gff3 or gff +│ │ ├── .gtf +``` diff --git a/post_processing/TPMCalculator/environment.yml b/post_processing/TPMCalculator/environment.yml new file mode 100644 index 0000000..b3948ef --- /dev/null +++ b/post_processing/TPMCalculator/environment.yml @@ -0,0 +1,217 @@ +name: gs-coral-env +channels: + - bioconda + - conda-forge + - https://repo.anaconda.com/pkgs/main + - https://repo.anaconda.com/pkgs/r +dependencies: + - _libgcc_mutex=0.1=conda_forge + - _openmp_mutex=4.5=2_gnu + - _r-mutex=1.0.1=anacondar_1 + - agat=1.5.1=pl5321hdfd78af_0 + - bamtools=2.5.3=he132191_0 + - binutils_impl_linux-64=2.45=default_hfdba357_104 + - bwidget=1.10.1=ha770c72_1 + - bzip2=1.0.8=hda65f42_8 + - c-ares=1.34.5=hb9d3cd8_0 + - ca-certificates=2025.11.12=hbd8a1cb_0 + - cairo=1.18.4=h3394656_0 + - curl=8.16.0=h4e3cde8_0 + - font-ttf-dejavu-sans-mono=2.37=hab24e00_0 + - font-ttf-inconsolata=3.000=h77eed37_0 + - font-ttf-source-code-pro=2.038=h77eed37_0 + - font-ttf-ubuntu=0.83=h77eed37_3 + - fontconfig=2.15.0=h7e30c49_1 + - fonts-conda-ecosystem=1=0 + - fonts-conda-forge=1=hc364b38_1 + - freetype=2.14.1=ha770c72_0 + - fribidi=1.0.16=hb03c661_0 + - gcc_impl_linux-64=15.2.0=h6f0f26c_15 + - gffread=0.12.7=h077b44d_6 + - gfortran_impl_linux-64=15.2.0=h281d09f_15 + - graphite2=1.3.14=hecca717_2 + - gsl=2.7=he838d99_0 + - gxx_impl_linux-64=15.2.0=hda75c37_15 + - harfbuzz=12.2.0=h15599e2_0 + - htslib=1.22.1=h566b1c6_0 + - icu=75.1=he02047a_0 + - jsoncpp=1.9.6=hf42df4d_1 + - kernel-headers_linux-64=6.12.0=he073ed8_3 + - keyutils=1.6.3=hb9d3cd8_0 + - krb5=1.21.3=h659f571_0 + - ld_impl_linux-64=2.45=default_hbd61a6d_104 + - lerc=4.0.0=h0aef613_1 + - libblas=3.11.0=4_h4a7cf45_openblas + - libcblas=3.11.0=4_h0358290_openblas + - libcurl=8.16.0=h4e3cde8_0 + - libdb=6.2.32=h9c3ff4c_0 + - libdeflate=1.24=h86f0d12_0 + - libedit=3.1.20250104=pl5321h7949ede_0 + - libev=4.33=hd590300_2 + - libexpat=2.7.3=hecca717_0 + - libffi=3.5.2=h9ec8514_0 + - libfreetype=2.14.1=ha770c72_0 + - libfreetype6=2.14.1=h73754d4_0 + - libgcc=15.2.0=h767d61c_7 + - libgcc-devel_linux-64=15.2.0=hcc6f6b0_115 + - libgcc-ng=15.2.0=h69a702a_7 + - libgfortran=15.2.0=h69a702a_15 + - libgfortran-ng=15.2.0=h69a702a_15 + - libgfortran5=15.2.0=h68bc16d_15 + - libglib=2.86.2=h32235b2_0 + - libgomp=15.2.0=h767d61c_7 + - libiconv=1.18=h3b78370_2 + - libjpeg-turbo=3.1.2=hb03c661_0 + - liblapack=3.11.0=4_h47877c9_openblas + - liblzma=5.8.1=hb9d3cd8_2 + - libnghttp2=1.67.0=had1ee68_0 + - libopenblas=0.3.30=pthreads_h94d23a6_4 + - libpng=1.6.53=h421ea60_0 + - libsanitizer=15.2.0=h90f66d4_15 + - libssh2=1.11.1=hcf80075_0 + - libstdcxx=15.2.0=h8f9b012_7 + - libstdcxx-devel_linux-64=15.2.0=hd446a21_115 + - libstdcxx-ng=15.2.0=h4852527_7 + - libtiff=4.7.1=h8261f1e_0 + - libuuid=2.41.2=h5347b49_1 + - libwebp-base=1.6.0=hd42ef1d_0 + - libxcb=1.17.0=h8a09558_0 + - libxcrypt=4.4.36=hd590300_1 + - libzlib=1.3.1=hb9d3cd8_2 + - make=4.4.1=hb9d3cd8_2 + - ncurses=6.5=h2d0b736_3 + - openssl=3.6.0=h26f9b46_0 + - pango=1.56.4=hadf4263_0 + - pcre2=10.46=h1321c63_0 + - perl=5.32.1=7_hd590300_perl5 + - perl-b-cow=0.007=pl5321hb9d3cd8_1 + - perl-b-hooks-endofscope=0.28=pl5321ha770c72_1 + - perl-base=2.23=pl5321hd8ed1ab_0 + - perl-bioperl-core=1.7.8=pl5321hdfd78af_1 + - perl-business-isbn=3.007=pl5321hd8ed1ab_0 + - perl-business-isbn-data=20210112.006=pl5321hd8ed1ab_0 + - perl-carp=1.50=pl5321hd8ed1ab_0 + - perl-class-inspector=1.36=pl5321hdfd78af_0 + - perl-class-load=0.25=pl5321ha770c72_2 + - perl-class-load-xs=0.10=pl5321hb9d3cd8_0 + - perl-class-methodmaker=2.25=pl5321h7b50bb2_1 + - perl-clone=0.46=pl5321hb9d3cd8_1 + - perl-compress-raw-bzip2=2.214=pl5321hda65f42_0 + - perl-compress-raw-zlib=2.214=pl5321h4dac143_0 + - perl-constant=1.33=pl5321hd8ed1ab_0 + - perl-cpan-meta-check=0.014=pl5321ha770c72_0 + - perl-data-dump=1.25=pl5321h7b50bb2_2 + - perl-data-optlist=0.114=pl5321ha770c72_1 + - perl-db_file=1.858=pl5321hb9d3cd8_0 + - perl-devel-globaldestruction=0.14=pl5321ha770c72_0 + - perl-devel-overloadinfo=0.007=pl5321ha770c72_1 + - perl-devel-stacktrace=2.04=pl5321h296ab09_0 + - perl-digest-hmac=1.05=pl5321hdfd78af_0 + - perl-digest-md5=2.59=pl5321hb9d3cd8_3 + - perl-dist-checkconflicts=0.11=pl5321ha770c72_0 + - perl-encode=3.21=pl5321hb9d3cd8_1 + - perl-encode-locale=1.05=pl5321hdfd78af_7 + - perl-eval-closure=0.14=pl5321ha770c72_0 + - perl-exporter=5.74=pl5321hd8ed1ab_0 + - perl-exporter-tiny=1.002002=pl5321hd8ed1ab_0 + - perl-extutils-makemaker=7.70=pl5321hd8ed1ab_0 + - perl-file-listing=6.16=pl5321hdfd78af_0 + - perl-file-pushd=1.016=pl5321ha770c72_0 + - perl-file-share=0.25=pl5321hdfd78af_3 + - perl-file-sharedir=1.118=pl5321hdfd78af_0 + - perl-file-sharedir-install=0.14=pl5321hdfd78af_0 + - perl-file-spec=3.48_01=pl5321hdfd78af_2 + - perl-getopt-long=2.58=pl5321hdfd78af_0 + - perl-graph=0.9735=pl5321hdfd78af_0 + - perl-heap=0.80=pl5321hdfd78af_1 + - perl-html-parser=3.81=pl5321h4ac6f70_1 + - perl-html-tagset=3.24=pl5321hdfd78af_0 + - perl-http-cookiejar-lwp=0.014=pl5321hdfd78af_0 + - perl-http-cookies=6.11=pl5321hdfd78af_0 + - perl-http-daemon=6.16=pl5321hdfd78af_0 + - perl-http-date=6.06=pl5321hdfd78af_0 + - perl-http-message=7.01=pl5321hdfd78af_0 + - perl-http-negotiate=6.01=pl5321hdfd78af_4 + - perl-inc-latest=0.500=pl5321ha770c72_0 + - perl-io-html=1.004=pl5321hdfd78af_0 + - perl-io-socket-ssl=2.075=pl5321hd8ed1ab_0 + - perl-io-tty=1.20=pl5321hb9d3cd8_3 + - perl-ipc-run=20250809.0=pl5321hdfd78af_0 + - perl-libwww-perl=6.81=pl5321hdfd78af_0 + - perl-list-moreutils=0.430=pl5321hdfd78af_0 + - perl-list-moreutils-xs=0.430=pl5321h7b50bb2_5 + - perl-lwp-mediatypes=6.04=pl5321hdfd78af_1 + - perl-lwp-protocol-https=6.14=pl5321hdfd78af_1 + - perl-mime-base64=3.16=pl5321hb9d3cd8_3 + - perl-module-build=0.4234=pl5321ha770c72_1 + - perl-module-implementation=0.09=pl5321ha770c72_1 + - perl-module-load=0.34=pl5321hdfd78af_0 + - perl-module-runtime=0.016=pl5321ha770c72_1 + - perl-module-runtime-conflicts=0.003=pl5321ha770c72_0 + - perl-moose=2.2207=pl5321hb9d3cd8_2 + - perl-mozilla-ca=20250602=pl5321hdfd78af_0 + - perl-mro-compat=0.15=pl5321ha770c72_0 + - perl-namespace-clean=0.27=pl5321h296ab09_1 + - perl-net-http=6.24=pl5321hdfd78af_0 + - perl-net-ssleay=1.94=pl5321hf672d98_1 + - perl-ntlm=1.09=pl5321hdfd78af_5 + - perl-package-deprecationmanager=0.18=pl5321ha770c72_2 + - perl-package-stash=0.40=pl5321ha770c72_2 + - perl-package-stash-xs=0.30=pl5321hb9d3cd8_2 + - perl-params-util=1.102=pl5321hb9d3cd8_1 + - perl-parent=0.243=pl5321hd8ed1ab_0 + - perl-pod-escapes=1.07=pl5321hdfd78af_2 + - perl-regexp-common=2017060201=pl5321hd8ed1ab_0 + - perl-safe=2.37=pl5321hdfd78af_2 + - perl-scalar-list-utils=1.70=pl5321hb03c661_0 + - perl-set-object=1.43=pl5321h7b50bb2_0 + - perl-socket=2.027=pl5321h5c03b87_6 + - perl-sort-naturally=1.03=pl5321hdfd78af_3 + - perl-statistics-r=0.34=pl5321r44hdfd78af_7 + - perl-storable=3.15=pl5321hb9d3cd8_2 + - perl-sub-exporter=0.991=pl5321ha770c72_1 + - perl-sub-exporter-progressive=0.001013=pl5321ha770c72_0 + - perl-sub-identify=0.14=pl5321hb9d3cd8_2 + - perl-sub-install=0.928=pl5321hd8ed1ab_0 + - perl-sub-name=0.28=pl5321hb9d3cd8_1 + - perl-term-progressbar=2.23=pl5321hdfd78af_0 + - perl-test=1.26=pl5321hd8ed1ab_0 + - perl-test-cleannamespaces=0.24=pl5321ha770c72_2 + - perl-test-fatal=0.016=pl5321ha770c72_0 + - perl-test-harness=3.44=pl5321hd8ed1ab_0 + - perl-test-leaktrace=0.17=pl5321h7bf1ee8_1 + - perl-test-needs=0.002009=pl5321hd8ed1ab_0 + - perl-test-requiresinternet=0.05=pl5321hdfd78af_1 + - perl-test-warnings=0.031=pl5321ha770c72_0 + - perl-text-balanced=2.07=pl5321hdfd78af_0 + - perl-text-wrap=2021.0814=pl5321hd8ed1ab_0 + - perl-time-local=1.35=pl5321hdfd78af_0 + - perl-timedate=2.33=pl5321hdfd78af_2 + - perl-try-tiny=0.31=pl5321ha770c72_0 + - perl-uri=5.34=pl5321ha770c72_0 + - perl-url-encode=0.03=pl5321h9ee0642_1 + - perl-variable-magic=0.64=pl5321hb9d3cd8_0 + - perl-vars=1.03=pl5321hdfd78af_2 + - perl-www-robotrules=6.02=pl5321hdfd78af_4 + - perl-yaml=1.30=pl5321hdfd78af_0 + - pixman=0.46.4=h54a6638_1 + - pthread-stubs=0.4=hb9d3cd8_1002 + - r-base=4.4.3=h14df4e6_4 + - readline=8.2=h8c095d6_2 + - samtools=1.22.1=h96c455f_0 + - sed=4.9=h6688a6e_0 + - sysroot_linux-64=2.39=hc4b9eeb_3 + - tk=8.6.13=noxft_ha0e22de_103 + - tktable=2.10=h8d826fa_7 + - tpmcalculator=0.0.5=h2bd4fab_3 + - tzdata=2025b=h78e105d_0 + - xorg-libice=1.1.2=hb9d3cd8_0 + - xorg-libsm=1.2.6=he73a12e_0 + - xorg-libx11=1.8.12=h4f16b4b_0 + - xorg-libxau=1.0.12=hb03c661_1 + - xorg-libxdmcp=1.1.5=hb03c661_1 + - xorg-libxext=1.3.6=hb9d3cd8_0 + - xorg-libxrender=0.9.12=hb9d3cd8_0 + - xorg-libxt=1.3.1=hb9d3cd8_0 + - zlib=1.3.1=hb9d3cd8_2 + - zstd=1.5.7=hb8e6e7a_2 diff --git a/post_processing/TPMCalculator/scripts/analyze_organism_totals.py b/post_processing/TPMCalculator/scripts/analyze_organism_totals.py new file mode 100644 index 0000000..a0a3948 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/analyze_organism_totals.py @@ -0,0 +1,340 @@ +#!/usr/bin/env python3 +""" +analyze_organism_totals.py + +This script takes merged TPM and Reads matrices produced by the previous +TPMCalculator combination step and verifes that expression values were +assinged correctly to each organism for the multi-organism dataset. + +It also creates summary dataframes and barplots to verify processing worked correctly, +and generates the following outputs: + - TPM_totals_by_organism.csv + - Reads_totals_by_organism.csv + - organism_totals_barplots.png/.pdf + - organism_summary_stats.png/.pdf + +Functions: + - analyze_organism_totals(tpm_file, reads_file, output_dir, organisms) + - create_barplots(tpm_totals, reads_totals, output_dir, organisms) + - create_summary_stats_plot(tpm_totals, reads_totals, output_dir, organisms) + - main() + +Usage: + python3 analyze_organism_totals.py +""" + +import os +import pandas as pd +import matplotlib.pyplot as plt +import numpy as np + +def analyze_organism_totals(tpm_file, reads_file, output_dir, organisms): + """ + Analyze total TPMs and reads by organism prefix. + + Args: + - tpm_file (str): Path to TPM matrix CSV file + - reads_file (str): Path to reads matrix CSV file + - output_dir (str): Directory to save outputs + - organisms (dict): Dictionary containing the names and desired colors used + for graph labeling + + Returns: + - tpm_totals (pandas.DataFrame): DataFrame with TPM totals by organism + - reads_totals (pandas.DataFrame): DataFrame with reads totals by organism + """ + print("Loading TPM and reads matrices...") + + # Load matrices + tpm_df = pd.read_csv(tpm_file) + reads_df = pd.read_csv(reads_file) + + print(f"TPM matrix shape: {tpm_df.shape}") + print(f"Reads matrix shape: {reads_df.shape}") + + # Get sample columns (all except Gene_Id) + sample_columns = [col for col in tpm_df.columns if col != 'Gene_Id'] + print(f"Found {len(sample_columns)} samples") + + # Define organism prefixes + organism_prefixes = organisms["names"] + + # Initialize result dataframes + tpm_totals = pd.DataFrame(index=sample_columns, columns=organism_prefixes) + reads_totals = pd.DataFrame(index=sample_columns, columns=organism_prefixes) + + print("Calculating totals by organism...") + + # Calculate totals for each organism and sample + for sample in sample_columns: + for prefix in organism_prefixes: + # Filter genes by prefix + gene_mask = tpm_df['Gene_Id'].str.startswith(prefix) + + # Calculate totals + tpm_total = tpm_df.loc[gene_mask, sample].sum() + reads_total = reads_df.loc[gene_mask, sample].sum() + + tpm_totals.loc[sample, prefix] = tpm_total + reads_totals.loc[sample, prefix] = reads_total + + # Convert to numeric + tpm_totals = tpm_totals.astype(float) + reads_totals = reads_totals.astype(float) + + # Add total columns + tpm_totals['Total'] = tpm_totals.sum(axis=1) + reads_totals['Total'] = reads_totals.sum(axis=1) + + print("Summary statistics:") + print("\nTPM totals:") + print(tpm_totals.describe()) + print("\nReads totals:") + print(reads_totals.describe()) + + # Save summary dataframes + tpm_output = os.path.join(output_dir, "TPM_totals_by_organism.csv") + reads_output = os.path.join(output_dir, "Reads_totals_by_organism.csv") + + tpm_totals.to_csv(tpm_output) + reads_totals.to_csv(reads_output) + + print(f"\nSaved summary files:") + print(f"TPM totals: {tpm_output}") + print(f"Reads totals: {reads_output}") + + # Create visualizations + create_barplots(tpm_totals, reads_totals, output_dir, organisms) + + return tpm_totals, reads_totals + +def create_barplots(tpm_totals, reads_totals, output_dir, organisms): + """ + Create barplots for TPM and reads totals by organism. + + Args: + - tpm_totals (pandas.DataFrame): DataFrame with TPM totals by organism + - reads_totals (pandas.DataFrame): DataFrame with reads totals by organism + - output_dir (str): Directory to save plots + - organisms (dict): Dictionary containing the names and desired colors used + for graph labeling + + Returns: + - N/A + """ + print("Creating visualizations...") + + # Set up the plotting style + plt.style.use('default') + colors = organisms["colors"] + + # Remove Total column for plotting + organism_cols = organisms["names"] + + # Create figure with subplots + fig, axes = plt.subplots(2, 2, figsize=(15, 12)) + fig.suptitle('TPM and Reads Distribution by Organism', fontsize=16, fontweight='bold') + + # Plot 1: TPM totals by organism (stacked bar) + ax1 = axes[0, 0] + tpm_plot_data = tpm_totals[organism_cols] + tpm_plot_data.plot(kind='bar', stacked=True, ax=ax1, width=0.8, color=colors) + ax1.set_title('TPM Totals by Organism (Stacked)', fontweight='bold') + ax1.set_xlabel('Sample Index') + ax1.set_ylabel('TPM') + ax1.legend(title='Organism', bbox_to_anchor=(1.05, 1), loc='upper left') + ax1.set_xticks([]) # Remove x-axis labels for readability + + # Plot 2: Reads totals by organism (stacked bar) + ax2 = axes[0, 1] + reads_plot_data = reads_totals[organism_cols] + reads_plot_data.plot(kind='bar', stacked=True, ax=ax2, width=0.8, color=colors) + ax2.set_title('Reads Totals by Organism (Stacked)', fontweight='bold') + ax2.set_xlabel('Sample Index') + ax2.set_ylabel('Reads') + ax2.legend(title='Organism', bbox_to_anchor=(1.05, 1), loc='upper left') + ax2.set_xticks([]) # Remove x-axis labels for readability + + # Plot 3: TPM percentage by organism + ax3 = axes[1, 0] + tpm_percentages = tpm_plot_data.div(tpm_plot_data.sum(axis=1), axis=0) * 100 + tpm_percentages.plot(kind='bar', stacked=True, ax=ax3, width=0.8, color=colors) + ax3.set_title('TPM Distribution by Organism (%)', fontweight='bold') + ax3.set_xlabel('Sample Index') + ax3.set_ylabel('Percentage (%)') + ax3.legend(title='Organism', bbox_to_anchor=(1.05, 1), loc='upper left') + ax3.set_xticks([]) # Remove x-axis labels for readability + + # Plot 4: Reads percentage by organism + ax4 = axes[1, 1] + reads_percentages = reads_plot_data.div(reads_plot_data.sum(axis=1), axis=0) * 100 + reads_percentages.plot(kind='bar', stacked=True, ax=ax4, width=0.8, color=colors) + ax4.set_title('Reads Distribution by Organism (%)', fontweight='bold') + ax4.set_xlabel('Sample Index') + ax4.set_ylabel('Percentage (%)') + ax4.legend(title='Organism', bbox_to_anchor=(1.05, 1), loc='upper left') + ax4.set_xticks([]) # Remove x-axis labels for readability + + # Adjust layout + plt.tight_layout() + + # Save plot + plot_output = os.path.join(output_dir, "organism_totals_barplots.png") + plt.savefig(plot_output, dpi=300, bbox_inches='tight') + plt.savefig(plot_output.replace('.png', '.pdf'), bbox_inches='tight') + + print(f"Saved plots: {plot_output}") + plt.close() + + # Create summary statistics plot + create_summary_stats_plot(tpm_totals, reads_totals, output_dir, organisms) + +def create_summary_stats_plot(tpm_totals, reads_totals, output_dir, organisms): + """ + Create summary statistics visualization. + + Args: + - tpm_totals (pandas.DataFrame): DataFrame with TPM totals by organism + - reads_totals (pandas.DataFrame): DataFrame with reads totals by organism + - output_dir (str): Directory to save plots + - organisms (dict): Dictionary containing the names and desired colors used + for graph labeling + + Returns: + - N/A + """ + organism_cols = organisms["names"] + color_cols = organisms["colors"] + + fig, axes = plt.subplots(1, 2, figsize=(12, 5)) + fig.suptitle('Summary Statistics by Organism', fontsize=16, fontweight='bold') + + # TPM summary + ax1 = axes[0] + tpm_summary = tpm_totals[organism_cols].mean() + bars1 = ax1.bar(range(len(organism_cols)), tpm_summary.values, + color=color_cols) + ax1.set_title('Mean TPM by Organism', fontweight='bold') + ax1.set_xlabel('Organism') + ax1.set_ylabel('Mean TPM') + ax1.set_xticks(range(len(organism_cols))) + ax1.set_xticklabels(organism_cols, rotation=45) + + # Add value labels on bars + for i, (bar, val) in enumerate(zip(bars1, tpm_summary.values)): + ax1.text(bar.get_x() + bar.get_width()/2, bar.get_height() + val*0.01, + f'{val:.0f}', ha='center', va='bottom') + + # Reads summary + ax2 = axes[1] + reads_summary = reads_totals[organism_cols].mean() + bars2 = ax2.bar(range(len(organism_cols)), reads_summary.values, + color=color_cols) + ax2.set_title('Mean Reads by Organism', fontweight='bold') + ax2.set_xlabel('Organism') + ax2.set_ylabel('Mean Reads') + ax2.set_xticks(range(len(organism_cols))) + ax2.set_xticklabels(organism_cols, rotation=45) + + # Add value labels on bars + for i, (bar, val) in enumerate(zip(bars2, reads_summary.values)): + ax2.text(bar.get_x() + bar.get_width()/2, bar.get_height() + val*0.01, + f'{val:.0f}', ha='center', va='bottom') + + plt.tight_layout() + + # Save plot + summary_output = os.path.join(output_dir, "organism_summary_stats.png") + plt.savefig(summary_output, dpi=300, bbox_inches='tight') + plt.savefig(summary_output.replace('.png', '.pdf'), bbox_inches='tight') + + print(f"Saved summary plot: {summary_output}") + plt.close() + +def main(): + """ + Genenerates analysis and visualization written to the + base_dir/organism_analysis/ directory. + + Expected Input Directory Structure + ---------------------------------- + The script expects the following structure: + + base_dir/ + └── combined_results/ + │ ├── STAR_TPMcalculator_*_TPMs_Matrix_Merged.csv + │ ├── STAR_TPMcalculator_*_Reads_Matrix_Merged.csv + └── tpmcalculator_output/ + └── ... + + Organism Dictionary + ------------------- + organisms contains two fields used throughout the analysis: + + names + List of gene ID prefixes used to identify which organism each + gene belongs to. For example, a gene ID like "Sym_Cgor_gene123" + should be labeled as "Sym_Cgor_". + + colors + List of colors corresponding to each organism prefix. + + Required Paths + -------------------- + req_paths contains user-defined paths and output file names: + + combined_results_dir + Directory containing the 'combined_results/' folder from the + previous script in the pipeline. + + tpm_file + File name for the merged TPM matrix. + + reads_file + File name for the merged Reads matrix. + """ + + #################### + # Define Organisms + #################### + organisms = { + "names": ['Host_', 'Sym_Cgor_', 'Sym_Dtre_', 'Sym_Smic_'], + "colors": ['#1f77b4', '#ff7f0e', '#2ca02c' ,'#d62728'] + } + + #################### + # Define paths + #################### + req_paths = { + "combined_results_dir": "", + "tpm_file": "STAR_TPMcalculator_TPMs_Matrix_Merged_rerun.csv", + "reads_file": "STAR_TPMcalculator_Reads_Matrix_Merged_rerun.csv" + } + + tpm_file = os.path.join(req_paths["combined_results_dir"], req_paths["tpm_file"]) + reads_file = os.path.join(req_paths["combined_results_dir"], req_paths["reads_file"]) + + if os.path.exists(reads_file): + print("Using original reads matrix (may have different gene set)") + else: + reads_file = tpm_file + print("No reads matrix found, using TPM matrix for both analyses") + + # Check if input files exist + if not os.path.exists(tpm_file): + print(f"TPM matrix file not found: {tpm_file}") + print("Please run combine_tpmcalculator_results.py first") + return + + # Create output directory for analysis results + output_dir = os.path.join(req_paths["combined_results_dir"], "organism_analysis") + os.makedirs(output_dir, exist_ok=True) + + # Analyze organism totals + tpm_totals, reads_totals = analyze_organism_totals(tpm_file, reads_file, output_dir, organisms) + + print("\nAnalysis completed successfully!") + print(f"Results saved in: {output_dir}") + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/post_processing/TPMCalculator/scripts/check_bam_sort.py b/post_processing/TPMCalculator/scripts/check_bam_sort.py new file mode 100644 index 0000000..fef7e42 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/check_bam_sort.py @@ -0,0 +1,138 @@ +#!/usr/bin/env python3 +""" +check_bam_sort.py + +The script scans each sample directory, identifies the STAR-aligned BAM file, +determines its sort order using `samtools view -H`, and logs the results to +a TSV file for quick verification of alignment output consistency. + +Functions: + - get_sort_status(bam_path) + - main() + +Usage: + python3 check_bam_sort.py +""" + +import os +import glob +import subprocess +import csv + +def get_sort_status(bam_path): + """ + Returns samtools sort order: "coordinate", "queryname", "unsorted", + or "unknown". + + Args: + - bam_path (string): Path to bam file. + + Returns: + - (string): samtools sort order. + """ + try: + + # Run samtools command + header = subprocess.run( + ["samtools", "view", "-H", bam_path], + stdout=subprocess.PIPE, + stderr=subprocess.PIPE, + text=True + ) + + # Return sort order + if header.returncode != 0: + return "unknown" + + for line in header.stdout.splitlines(): + if line.startswith("@HD"): + if "SO:coordinate" in line: + return "coordinate" + elif "SO:queryname" in line: + return "queryname" + elif "SO:unsorted" in line: + return "unsorted" + else: + return "unsorted" + + return "unsorted" + except Exception: + return "unknown" + + +def main(): + """ + Finds bam files within a provided folder and logs the + samtools sort order. + + Expected Input Directory Structure + ---------------------------------- + The script expects the following directory structure: + + base_dir/ + ├── PerSample/ + │ ├── SAMPLE_1/ + │ │ └── results_STAR_htseq/ + │ │ └── *_dedup_NoSingletonCollated.out.bam + │ │ + │ ├── SAMPLE_2/ + │ │ └── results_STAR_htseq/ + │ │ └── *_dedup_NoSingletonCollated.out.bam + │ │ + │ └── ... + """ + + ################# + # Define paths + ################## + base_dir = "" + per_sample_dir = os.path.join(base_dir, "PerSample") + + log_file_name = "bam_sort_status_" + os.path.basename(base_dir) + ".tsv" + log_path = os.path.join(os.getcwd(), log_file_name) + + # Check if PerSample directory exists + if not os.path.exists(per_sample_dir): + print(f"Error: PerSample directory not found at {per_sample_dir}") + return + + # Find all sample directories + sample_dirs = [d for d in os.listdir(per_sample_dir) + if os.path.isdir(os.path.join(per_sample_dir, d))] + print(f"\nFound {len(sample_dirs)} sample directories") + + # Open TSV + with open(log_path, "w", newline="") as log_file: + writer = csv.writer(log_file, delimiter="\t") + writer.writerow(["sample_dir", "bam_file", "sort_status"]) + + # Loop over samples + for sample_dir in sample_dirs: + sample_path = os.path.join(per_sample_dir, sample_dir) + results_star_htseq_path = os.path.join(sample_path, "results_STAR_htseq") + + if not os.path.exists(results_star_htseq_path): + print(f"Warning: results_STAR_htseq directory not found for {sample_dir}") + continue + + # Find BAM file + bam_pattern = os.path.join(results_star_htseq_path, "*dedup_NoSingletonCollated.out.bam") + bam_files = glob.glob(bam_pattern) + + if not bam_files: + print(f"Warning: No BAM file found for {sample_dir}") + continue + elif len(bam_files) > 1: + print(f"Warning: Multiple BAM files found for {sample_dir}, using the first one") + + bam_file = bam_files[0] + + # Check sort status and log result + sort_status = get_sort_status(bam_file) + writer.writerow([sample_dir, bam_file, sort_status]) + + print(f"\nLogged BAM sort status to {log_path}\n") + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/post_processing/TPMCalculator/scripts/combine_tpmcalculator_results.py b/post_processing/TPMCalculator/scripts/combine_tpmcalculator_results.py new file mode 100644 index 0000000..ded8307 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/combine_tpmcalculator_results.py @@ -0,0 +1,258 @@ +#!/usr/bin/env python3 +""" +combine_tpmcalculator_results.py + +Combine TPMCalculator output files from multiple RNA-seq samples into merged matrices. +This script scans the TPMCalculator output directories and aligns TPM and Reads +values across all samples by gene. Two merged CSV are generated. + +Functions: + - extract_sample_name(filepath) + - find_tpmcalculator_outputs(base_dir) + - combine_tpm_matrices(output_files, output_dir, req_paths) + - main() + +Usage: + python3 combine_tpmcalculator_results.py +""" + +import os +import pandas as pd +import glob +import re +from pathlib import Path + +def extract_sample_name(filepath): + """ + Extract sample name from filepath by removing '_star_10_0.3_0.66_0_dedup' part. + + Args: + - filepath (str): Path to the TPMCalculator output file + + Returns: + - sample_name (str): Sample name + """ + filename = os.path.basename(filepath) + # Remove the file extension and TPMCalculator suffix + base_name = filename.replace('_tpmcalculator.txt_genes.out', '') + # Remove the STAR alignment suffix + sample_name = re.sub(r'_star_10_0\.3_0\.66_0_dedup.*', '', base_name) + return sample_name + +def find_tpmcalculator_outputs(base_dir): + """ + Find all TPMCalculator output files with various possible patterns. + + Args: + - base_dir (str): Base directory containing PerSample folders + + Returns: + - list: List of file paths + """ + # per_sample_dir = os.path.join(base_dir, "PerSample") + per_sample_dir = os.path.join(base_dir, "tpmcalculator_output") + + # Try multiple patterns for TPMCalculator output files + patterns = [ + os.path.join(per_sample_dir, "*", "*Collated.out_genes"), + os.path.join(per_sample_dir, "*", "*_genes.out"), + os.path.join(per_sample_dir, "*", "*NoSingletonCollated.out_genes.out"), + os.path.join(per_sample_dir, "*", "*dedup*.out_genes.out"), + ] + + print(patterns) + + output_files = [] + for pattern in patterns: + files = glob.glob(pattern) + if files: + output_files = files + print(f"Found files with pattern: {pattern}") + break + + return sorted(output_files) + +def combine_tpm_matrices(output_files, output_dir, req_paths): + """ + Combine TPMCalculator results into TPM and Reads matrices. + + Args: + - output_files (list): List of TPMCalculator output file paths + - output_dir (str): Directory to save the combined matrices + - req_paths (dict): Dictionary containing names of TPM and Reads CSVs + """ + tpm_data = {} + reads_data = {} + gene_ids = None + + print(f"Processing {len(output_files)} TPMCalculator output files...") + + for i, filepath in enumerate(output_files): + try: + # Extract sample name + sample_name = extract_sample_name(filepath) + print(f"Processing {i+1}/{len(output_files)}: {sample_name}") + + # Check file size first + file_size = os.path.getsize(filepath) + if file_size < 300: # Less than 300 bytes is likely empty (just headers) + print(f"Warning: File {filepath} is too small ({file_size} bytes), likely empty") + continue + + # Read the TPMCalculator output file + df = pd.read_csv(filepath, sep='\t') + + # Identify duplicated Gene_Id + dup_mask = df['Gene_Id'].duplicated(keep=False) + if dup_mask.any(): + skipped_file = os.path.join(output_dir, "Skipped_Duplicate_Gene_Ids.csv") + + # Save skipped rows + df[dup_mask].assign(sample=sample_name).to_csv( + skipped_file, + sep=",", + index=False, + mode="a", + header=not os.path.exists(skipped_file) + ) + df = df[~dup_mask] + + # Check if dataframe is empty + if df.empty: + print(f"Warning: File {filepath} contains no data") + continue + + # Check if required columns exist + required_cols = ['Gene_Id', 'TPM', 'Reads'] + missing_cols = [col for col in required_cols if col not in df.columns] + + if missing_cols: + print(f"Warning: Missing columns {missing_cols} in {filepath}") + print(f"Available columns: {list(df.columns)}") + continue + + # Check if there are any genes + if len(df) == 0: + print(f"Warning: No genes found in {filepath}") + continue + + # Set gene IDs from first file + if gene_ids is None: + gene_ids = df['Gene_Id'].tolist() + print(f"Using gene list from {sample_name} with {len(gene_ids)} genes") + + # Check if gene lists match + if gene_ids != df['Gene_Id'].tolist(): + print(f"Warning: Gene list mismatch in {sample_name}") + # Try to align with existing gene_ids + df_aligned = df.set_index('Gene_Id').reindex(gene_ids) + tpm_data[sample_name] = df_aligned['TPM'].fillna(0).tolist() + reads_data[sample_name] = df_aligned['Reads'].fillna(0).tolist() + else: + # Extract TPM and Reads columns + tpm_data[sample_name] = df['TPM'].tolist() + reads_data[sample_name] = df['Reads'].tolist() + + except Exception as e: + print(f"Error processing {filepath}: {e}") + continue + + if not tpm_data: + print("No valid data found. Exiting.") + return + + print(f"Successfully processed {len(tpm_data)} samples") + print(f"Found {len(gene_ids)} genes") + + # Create TPM matrix + tpm_df = pd.DataFrame(tpm_data) + tpm_df.insert(0, 'Gene_Id', gene_ids) + + # Create Reads matrix + reads_df = pd.DataFrame(reads_data) + reads_df.insert(0, 'Gene_Id', gene_ids) + + # Create output directory if it doesn't exist + os.makedirs(output_dir, exist_ok=True) + + # Save matrices + tpm_output = os.path.join(output_dir, req_paths["tpm_csv"]) + reads_output = os.path.join(output_dir, req_paths["reads_csv"]) + + tpm_df.to_csv(tpm_output, index=False) + reads_df.to_csv(reads_output, index=False) + + print(f"\nMatrices saved:") + print(f"TPM matrix: {tpm_output}") + print(f"Reads matrix: {reads_output}") + print(f"Matrix dimensions: {tpm_df.shape}") + print(f"Samples: {list(tpm_data.keys())}") + +def main(): + """ + This function combines the TPMCalculator results across all samples, + by gene, into merged TPM and Reads matrices. + + Expected Input Directory Structure + ---------------------------------- + The script expects the following structure: + + base_dir/ + └── tpmcalculator_output/ + ├── SAMPLE_1/ + │ └── SAMPLE_1...out_genes.out + ├── SAMPLE_2/ + │ └── SAMPLE_2...out_genes.out + └── ... + + Required Paths + -------------------- + req_paths contains user-defined paths and output file names: + + base_dir + Root directory containing 'tpmcalculator_output/'. This is the output directory + from the previous script. + + tpm_csv + File name for the merged TPM matrix. + + reads_csv + File name for the merged Reads matrix. + """ + + ################# + # Define paths + ################## + req_paths = { + "base_dir": "", + "tpm_csv": "STAR_TPMcalculator_TPMs_Matrix_Merged_rerun.csv", + "reads_csv": "STAR_TPMcalculator_Reads_Matrix_Merged_rerun.csv" + } + output_dir = req_paths["base_dir"] + "/combined_results" + + # Find TPMCalculator output files + print("Searching for TPMCalculator output files...") + output_files = find_tpmcalculator_outputs(req_paths["base_dir"]) + + if not output_files: + print("No TPMCalculator output files found!") + print("Expected pattern: tpmcalculator/*/*Collated.out_sorted_genes") + return + + print(f"Found {len(output_files)} TPMCalculator output files") + + # Show first few files for verification + print("\nFirst few files found:") + for f in output_files[:5]: + sample_name = extract_sample_name(f) + print(f" {sample_name}: {f}") + + if len(output_files) > 5: + print(f" ... and {len(output_files) - 5} more") + + # Combine matrices + combine_tpm_matrices(output_files, output_dir, req_paths) + + +if __name__ == "__main__": + main() diff --git a/post_processing/TPMCalculator/scripts/generate_tpmcalculator_jobs.py b/post_processing/TPMCalculator/scripts/generate_tpmcalculator_jobs.py new file mode 100644 index 0000000..96f39e7 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/generate_tpmcalculator_jobs.py @@ -0,0 +1,202 @@ +#!/usr/bin/env python3 +""" +generate_tpmcalculator_jobs.py + +This module generates and submits SLURM job scripts for running TPMCalculator on +multiple RNA-seq samples. It scans a PerSample directory containing sample-specific +subdirectories, identifies the appropriate BAM file for each sample, +and generates a SLURM job script that runs TPMCalculator using a +provided GTF annotation file. + +Functions: + - create_slurm_job_script(sample_id, bam_file, gtf_file, tmp_output_dir) + - main() + +Usage: + python3 generate_tpmcalculator_slurm_jobs.py +""" + +import os +import glob +import subprocess + +def create_slurm_job_script(sample_id, bam_file, gtf_file, tmp_output_dir): + """ + Create a SLURM job script for TPMCalculator analysis. + + Args: + - sample_id (string): Sample identifier + - bam_file (string): Path to the BAM file + - gtf_file (string): Path to the GTF file + - tmp_output_dir (string): Directory to save TPMCalculator output + + Returns: + - job_script_path (string): Path to SLURM job bash script + """ + + job_script_content = f"""#!/bin/bash +#SBATCH --job-name=tpmcalc_{sample_id} +#SBATCH --cpus-per-task=4 +#SBATCH --mem=16G +#SBATCH --time=08:00:00 +#SBATCH --output={tmp_output_dir}/{sample_id}_tpmcalculator_%j.out +#SBATCH --error={tmp_output_dir}/{sample_id}_tpmcalculator_%j.err + +# TPMCalculator is available via conda installation + +# Define variables +BAM="{bam_file}" +GTF="{gtf_file}" +OUTPUT_DIR="{tmp_output_dir}" + +# Run TPMCalculator +echo "Starting TPMCalculator analysis for {sample_id}..." +mkdir -p $OUTPUT_DIR +cd $OUTPUT_DIR +TPMCalculator -g $GTF -b $BAM -o $OUTPUT_DIR/{sample_id}_tpmcalculator.txt + +echo "TPMCalculator analysis completed for {sample_id}" +""" + + # Create the bash file containing the SLURM job + job_script_path = os.path.join(tmp_output_dir, f"{sample_id}_tpmcalculator_job.sh") + + with open(job_script_path, 'w') as f: + f.write(job_script_content) + + # Make the script executable + os.chmod(job_script_path, 0o755) + + return job_script_path + +def main(): + """ + This function scans a project results directory, identifies sample + BAM files produced by the STAR pipeline, generates a SLURM job script + for each sample, and submits those jobs to the cluster. + + Expected Input Directory Structure + ---------------------------------- + The script expects the following directory structure: + + base_dir/ + ├── PerSample/ + │ ├── SAMPLE_1/ + │ │ └── results_STAR_htseq/ + │ │ └── *_dedup_NoSingletonCollated.out.bam + │ │ + │ ├── SAMPLE_2/ + │ │ └── results_STAR_htseq/ + │ │ └── *_dedup_NoSingletonCollated.out.bam + │ │ + │ └── ... + + Where: + - Each directory inside `PerSample/` represents one RNA-seq sample. + - Each sample directory must contain a `results_STAR_htseq/` folder. + - The BAM file used for TPMCalculator must match the pattern: + `*dedup_NoSingletonCollated.out.bam`. + + Required Paths + -------------- + req_paths is a dictionary containing required input and output paths: + + base_dir + Root directory of the RNA-seq project results. This directory + must contain the `PerSample/` directory with all sample folders. + + gtf_file + Path to the reference GTF annotation file used by TPMCalculator + to assign reads to genes and transcripts. + + output_dir + Directory where TPMCalculator outputs and generated SLURM job + scripts will be written. A subdirectory for each sample will + be created automatically if it does not already exist. + """ + + ################# + # Define paths + ################## + req_paths ={ + "base_dir": "", + "gtf_file": "", + "output_dir": "" + } + per_sample_dir = os.path.join(req_paths["base_dir"], "PerSample") + + # Check if GTF file exists + if not os.path.exists(req_paths["gtf_file"]): + print(f"Error: GTF file not found at {req_paths["gtf_file"]}") + return + + # Check if PerSample directory exists + if not os.path.exists(per_sample_dir): + print(f"Error: PerSample directory not found at {per_sample_dir}") + return + + # Find all sample directories + sample_dirs = [d for d in os.listdir(per_sample_dir) + if os.path.isdir(os.path.join(per_sample_dir, d))] + + print(f"Found {len(sample_dirs)} sample directories") + + job_scripts_created = [] + + for sample_dir in sample_dirs: + sample_path = os.path.join(per_sample_dir, sample_dir) + results_star_htseq_path = os.path.join(sample_path, "results_STAR_htseq") + + # Check if results_STAR_htseq directory exists + if not os.path.exists(results_star_htseq_path): + print(f"Warning: results_STAR_htseq directory not found for {sample_dir}") + continue + + # Find BAM file + bam_pattern = os.path.join(results_star_htseq_path, "*dedup_NoSingletonCollated.out.bam") + bam_files = glob.glob(bam_pattern) + + if not bam_files: + print(f"Warning: No BAM file found for {sample_dir}") + continue + elif len(bam_files) > 1: + print(f"Warning: Multiple BAM files found for {sample_dir}, using the first one") + + bam_file = bam_files[0] + + # Create tpmcalculator directory + tpmcalculator_dir = req_paths["output_dir"] + sample_dir + os.makedirs(tpmcalculator_dir, exist_ok=True) + + # Extract sample ID from directory name + sample_id = sample_dir + + # Create SLURM job script + job_script_path = create_slurm_job_script(sample_id, bam_file, req_paths["gtf_file"], tpmcalculator_dir) + job_scripts_created.append(job_script_path) + + print(f"Created job script for {sample_id}: {job_script_path}") + + print(f"\nTotal job scripts created: {len(job_scripts_created)}") + + # Submit all jobs automatically + print("\nSubmitting jobs to SLURM cluster...") + submitted_jobs = [] + + for job_script in job_scripts_created: + try: + result = subprocess.run(['sbatch', job_script], stdout=subprocess.PIPE, stderr=subprocess.PIPE, universal_newlines=True) + if result.returncode == 0: + job_id = result.stdout.strip().split()[-1] + submitted_jobs.append(job_id) + print(f"Submitted job {job_id} for {job_script}") + else: + print(f"Failed to submit {job_script}: {result.stderr.strip()}") + except Exception as e: + print(f"Error submitting {job_script}: {e}") + + print(f"\nSuccessfully submitted {len(submitted_jobs)} jobs out of {len(job_scripts_created)} total jobs.") + + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/post_processing/TPMCalculator/scripts/gff3_to_gtf.py b/post_processing/TPMCalculator/scripts/gff3_to_gtf.py new file mode 100644 index 0000000..9ac6a09 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/gff3_to_gtf.py @@ -0,0 +1,162 @@ +#!/usr/bin/env python3 +""" +gff3_to_gtf.py + +This script parses a GFF3 file and converts supported feature types +(gene, mRNA/transcript, exon, CDS) into GTF format while maintaining +the correct gene_id and transcript_id relationships. + +Functions: + - parse_attributes(attr_string) + - format_gtf_attributes(attributes, feature_type, gene_id=None, transcript_id=None) + - convert_gff3_to_gtf(input_file, output_file) + - main() + +Usage: + python3 gff3_to_gtf.py +""" + +import sys +import re +from pathlib import Path + +def parse_attributes(attr_string): + """ + GFF3 attributes are formatted as key-value pairs separated by + semicolons. This function converts the attribute string into a + Python dictionary for easier access. + + Args: + - attr_string (str): Attribute column from a GFF3 file. + + Returns: + - attributes (dict): Dictionary mapping attribute keys to values. + """ + attributes = {} + if not attr_string or attr_string == '.': + return attributes + + for item in attr_string.split(';'): + if '=' in item: + key, value = item.split('=', 1) + attributes[key] = value + return attributes + +def format_gtf_attributes(attributes, feature_type, gene_id=None, transcript_id=None): + """ + Constructs the GTF attribute string using gene_id and transcript_id + along with optional gene_name or transcript_name fields if present. + + Args: + - attributes (dict): Parsed GFF3 attributes. + - feature_type (str): Feature type (gene, transcript, exon, CDS). + - gene_id (str, optional): Gene identifier. + - transcript_id (str, optional): Transcript identifier. + + Returns: + - str: Properly formatted GTF attribute string. + """ + gtf_attrs = [] + + if gene_id: + gtf_attrs.append(f'gene_id "{gene_id}"') + + if transcript_id: + gtf_attrs.append(f'transcript_id "{transcript_id}"') + + if 'Name' in attributes: + if feature_type == 'gene': + gtf_attrs.append(f'gene_name "{attributes["Name"]}"') + elif feature_type in ['transcript', 'mRNA']: + gtf_attrs.append(f'transcript_name "{attributes["Name"]}"') + + return '; '.join(gtf_attrs) + ';' + +def convert_gff3_to_gtf(input_file, output_file): + """ + Reads a GFF3 file line by line, parses feature attributes, and writes + supported features (gene, transcript/mRNA, exon, CDS) to a GTF file. + Transcript-to-gene relationships are tracked to correctly assign + gene_id and transcript_id values. + + Args: + - input_file (str): Path to the input GFF3 file. + - output_file (str): Path to the output GTF file. + + Returns: + - N/A + """ + transcript_to_gene = {} + + with open(input_file, 'r') as infile, open(output_file, 'w') as outfile: + for line_num, line in enumerate(infile, 1): + if line_num % 100000 == 0: + print(f"Processed {line_num:,} lines...") + + line = line.strip() + + if line.startswith('#') or not line: + continue + + fields = line.split('\t') + if len(fields) != 9: + continue + + seqname, source, feature, start, end, score, strand, frame, attributes = fields + attr_dict = parse_attributes(attributes) + + if feature == 'gene': + gene_id = attr_dict.get('ID', '') + + gtf_attributes = format_gtf_attributes(attr_dict, feature, gene_id=gene_id) + gtf_line = f"{seqname}\t{source}\tgene\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{gtf_attributes}" + outfile.write(gtf_line + '\n') + + elif feature in ['mRNA', 'transcript']: + transcript_id = attr_dict.get('ID', '') + parent_gene = attr_dict.get('Parent', '') + + transcript_to_gene[transcript_id] = parent_gene + + gtf_attributes = format_gtf_attributes(attr_dict, feature, gene_id=parent_gene, transcript_id=transcript_id) + gtf_line = f"{seqname}\t{source}\ttranscript\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{gtf_attributes}" + outfile.write(gtf_line + '\n') + + elif feature in ['exon', 'CDS']: + parent_id = attr_dict.get('Parent', '') + + # Check if parent is a transcript/mRNA or directly a gene + if parent_id in transcript_to_gene: + # Parent is a transcript + parent_gene = transcript_to_gene[parent_id] + transcript_id = parent_id + else: + # Parent might be a gene (some exons directly reference genes) + parent_gene = parent_id + transcript_id = parent_id # Use gene ID as transcript ID for these cases + + gtf_attributes = format_gtf_attributes(attr_dict, feature, gene_id=parent_gene, transcript_id=transcript_id) + gtf_line = f"{seqname}\t{source}\t{feature}\t{start}\t{end}\t{score}\t{strand}\t{frame}\t{gtf_attributes}" + outfile.write(gtf_line + '\n') + +def main(): + if len(sys.argv) != 3: + print("Usage: python3 gff3_to_gtf.py ") + sys.exit(1) + + input_file = sys.argv[1] + output_file = sys.argv[2] + + if not Path(input_file).exists(): + print(f"Error: Input file '{input_file}' not found.") + sys.exit(1) + + try: + convert_gff3_to_gtf(input_file, output_file) + print(f"Successfully converted {input_file} to {output_file}") + except Exception as e: + print(f"Error during conversion: {e}") + sys.exit(1) + +if __name__ == "__main__": + main() \ No newline at end of file diff --git a/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/README.md b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/README.md new file mode 100644 index 0000000..f0323e5 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/README.md @@ -0,0 +1,11 @@ +# GFF3 to GTF Fixes + +This directory contains small helper scripts used to make manual fixes to GTF files after converting annotations from **GFF3 → GTF**. + +While the main conversion script performs the bulk of the format conversion, some reference annotations required small adjustments to ensure compatibility with downstream tools such as **TPMCalculator**. These fixes were typically related to feature naming or attribute formatting. + +Examples of fixes performed: +- Converting CDS to exon label +- Fixing GTF chromosome names to match BAM headers + +Each script documents the specific transformation it performs. These scripts were run **after the initial GFF3 → GTF conversion step** and before running downstream expression quantification. \ No newline at end of file diff --git a/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/cds_to_exon_gtf.sh b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/cds_to_exon_gtf.sh new file mode 100755 index 0000000..a165fe3 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/cds_to_exon_gtf.sh @@ -0,0 +1,29 @@ +#!/bin/bash +# Convert CDS to exon for host so TPMCalculator can count them + +GTF="Valid_Plob_Cgor_concat.gtf" +FIXED_GTF="Valid_Plob_Cgor_concat_fixed.gtf" + +awk 'BEGIN{OFS="\t"} +{ + # Split line into first 8 fields + rest as attributes + attrs = "" + if(NF>8) { + attrs = $9 + for(i=10;i<=NF;i++) attrs = attrs " " $i + } + + if($3=="CDS") { + $3="exon" + # Extract transcript_id + match(attrs,/transcript_id "[^"]+"/) + tid=substr(attrs,RSTART+15,RLENGTH-16) # get only the value inside quotes + # Prepend gene_id + attrs="gene_id \"" tid "\"; " attrs + } + + # Rebuild the line: columns 1-8 + fixed attrs + print $1,$2,$3,$4,$5,$6,$7,$8,attrs +}' "$GTF" > "$FIXED_GTF" + +echo "Fixed GTF created: $FIXED_GTF" \ No newline at end of file diff --git a/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/fix_gtf_scaffold_names.sh b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/fix_gtf_scaffold_names.sh new file mode 100755 index 0000000..9e86284 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/fix_gtf_scaffold_names.sh @@ -0,0 +1,19 @@ +#!/bin/bash +# Fix GTF chromosome names to match BAM headers for TPMCalculator +# Original issue: BAM and GTF scaffold names did not match, causing TPMCalculator to run without producing gene output. + +INPUT_GTF="Valid_Phar_Cgor_Dtre_Smic_concat.gtf" +OUTPUT_GTF="Valid_Phar_Cgor_Dtre_Smic_concat_fixed.gtf" + +awk 'BEGIN{FS=OFS="\t"} +{ + match($1, /JBDLLT010000[0-9]+/) + if (RSTART) { + id=substr($1, RSTART, RLENGTH) + sub(/JBDLLT010000/, "", id) + $1="Host_Phar_gnl|WGS:JBDLLT|Phar_scaff_" id "|gb|JBDLLT010000" id + } + print +}' "$INPUT_GTF" > "$OUTPUT_GTF" + +echo "Fixed GTF written to $OUTPUT_GTF" \ No newline at end of file diff --git a/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/underscore_to_pipe.sh b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/underscore_to_pipe.sh new file mode 100755 index 0000000..3ee6a94 --- /dev/null +++ b/post_processing/TPMCalculator/scripts/gff3_to_gtf_fixes/underscore_to_pipe.sh @@ -0,0 +1,26 @@ +#!/bin/bash +# Convert underscore naming convention to piped for Host + +GTF="Valid_Pdae_Cgor_Dtre_Smic_concat.gtf" +FIXED_GTF="Valid_Pdae_Cgor_Dtre_Smic_concat_fixed.gtf" + +# Replace '_size' to '|size' +awk -F'\t' -v OFS='\t' ' + /^#/ {print; next} + { + # fix first column + sub(/_size/, "|size", $1) + + # combine everything after column 8 into column 9 + if (NF > 9) { + $9 = ""; + for(i=9; i<=NF; i++) { + $9 = $9 $i + if(i "$FIXED_GTF" + +echo "Fixed GTF written to $FIXED_GTF" \ No newline at end of file