This repository provides a full pipeline to go from raw count data to:
- Differential expression analysis (DESeq2 via PyDESeq2)
- Pathway enrichment (GSEA / ORA)
- Automated Quarto reports

Clone the repository, create a virtual environment, install the dependencies, and initialize the project:

    git clone https://github.com/mbrochut/bulkRNA-seq_template.git
    cd bulkRNA-seq_template

    python3 -m venv env
    # bash
    source env/bin/activate
    # fish
    source env/bin/activate.fish

    pip install -r requirements.txt
    plotly_get_chrome

    python 00_init_repo.py --organism Mouse # TO USE WITH TESTING DATA
    # or
    python 00_init_repo.py --organism Human

This will:
- Create the full project structure
- Download pathway databases for the selected organism into `data/DB/<organism>/`

Project structure:

    data/
      DB/<organism>/
    meta/
    QUARTO/
    results/
      contrasts/
      models/
      pathway/
        concat/
        GSEA/
        ORA/
        GSEA_object/
      QC/

The count matrix must be in wide format:

| gene_id | gene_name | OC1 | OC2 | ... |
|---|---|---|---|---|
| ENSMUSG... | Gnai3 | 6591 | 6228 | ... |

The matrix must include the `gene_id` and `gene_name` columns.

The sample metadata is a table with one row per sample:

| id | treatment | stimulation | Age |
|---|---|---|---|
| OC1 | Control | NO | Old |

`id` must match the column names in the expression matrix.
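
The matching rule above can be checked before running anything. The following is a minimal sketch, not code from the repository: the function name `check_sample_ids` is hypothetical, and it assumes that every column other than `gene_id` and `gene_name` is a sample.

```python
def check_sample_ids(matrix_columns, metadata_ids, annotation_cols=("gene_id", "gene_name")):
    """Compare the sample columns of the count matrix with the metadata ids.

    Returns two sorted lists: samples missing from the metadata, and
    metadata ids missing from the matrix. Hypothetical helper for illustration.
    """
    # Assumption: every non-annotation column is a sample column
    samples = [col for col in matrix_columns if col not in annotation_cols]
    missing_in_metadata = sorted(set(samples) - set(metadata_ids))
    missing_in_matrix = sorted(set(metadata_ids) - set(samples))
    return missing_in_metadata, missing_in_matrix


# Toy inputs: OC2 has no metadata row, OC3 has no count column
columns = ["gene_id", "gene_name", "OC1", "OC2"]
ids = ["OC1", "OC3"]
result = check_sample_ids(columns, ids)
print(result)  # (['OC2'], ['OC3'])
```

Two empty lists mean the matrix and the metadata describe the same samples.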

Run notebooks in order:

The first notebook creates the main object, `adata_filter_genes.h5ad`, from these parameters:

    path_to_your_data = "./data/salmon.merged.gene_counts.tsv"
    path_to_your_metadata = "./meta/meta.xlsx"
    gene_id_col = 'gene_id'

    condition_columns = ['treatment', 'Age']
    # → merged into a single "condition" column: treatment_Age. You can put only one column if needed

    filtering_sum = 10
    # → keeps genes with total counts ≥ 10

Notes:
- Additional filtering uses default PyDESeq2 settings
- The `condition` column is used for all downstream analysis
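
To make the two parameters concrete, here is a small sketch in plain Python of how a `condition` label and the count filter could be computed. It is an illustration under assumptions, not the notebook's code: `build_condition` and `filter_genes` are hypothetical names, the underscore separator is inferred from labels such as `Control_Young`, and the counts are toy values.

```python
def build_condition(sample_meta, condition_columns):
    """Join the selected metadata values with "_" (e.g. treatment_Age)."""
    return "_".join(str(sample_meta[col]) for col in condition_columns)


def filter_genes(counts, filtering_sum=10):
    """Keep genes whose total count across samples is >= filtering_sum.

    counts maps a gene id to its list of per-sample counts.
    """
    return {gene: values for gene, values in counts.items() if sum(values) >= filtering_sum}


# One metadata row, shaped like the metadata table
meta_row = {"id": "OC1", "treatment": "Control", "stimulation": "NO", "Age": "Old"}
condition = build_condition(meta_row, ["treatment", "Age"])
print(condition)  # Control_Old

# Toy counts: geneB sums to 7 and is dropped, geneC sums to exactly 10 and is kept
toy_counts = {"geneA": [6591, 6228], "geneB": [3, 4], "geneC": [5, 5]}
kept = filter_genes(toy_counts, filtering_sum=10)
print(sorted(kept))  # ['geneA', 'geneC']
```

Passing a single column in `condition_columns` simply returns that column's value.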

Data exploration covers:
- Quality control plots
- PCA / distributions
- General dataset exploration

This step is free exploration, with no strict requirements.

Define comparisons of interest:

    paired_contrast = [
        ("Control_Young", "Trained_Young"),
        ("Control_Young", "Control_Old"),
    ]

    design = '~condition'

Outputs:
- Contrast files saved in `results/contrasts/`

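
Each pair is written as (reference, test). PyDESeq2's `DeseqStats` expects a contrast as a three-element list, `[variable, tested level, reference level]`, so the pair order has to be flipped at some point. The sketch below shows one way to do that conversion; `to_deseq_contrasts` is a hypothetical helper, and whether the notebook performs the conversion this way is an assumption.

```python
def to_deseq_contrasts(paired_contrast, factor="condition"):
    """Turn (reference, test) pairs into [factor, test, reference] lists."""
    return [[factor, test, reference] for reference, test in paired_contrast]


paired_contrast = [
    ("Control_Young", "Trained_Young"),
    ("Control_Young", "Control_Old"),
]

contrasts = to_deseq_contrasts(paired_contrast)
for contrast in contrasts:
    print(contrast)
# ['condition', 'Trained_Young', 'Control_Young']
# ['condition', 'Control_Old', 'Control_Young']
```

With this ordering, a positive log2 fold change means higher expression in the test condition than in the reference.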
Runs pathway enrichment for each contrast:
- GSEA
- ORA

Select the organism:

    organism = "Mouse" # or "Human"

Databases are read from `data/DB/<organism>/`.

Outputs are saved in `results/pathway/`.
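
For readers unfamiliar with ORA, the statistic behind it is usually a one-sided hypergeometric test: given a background of genes, how surprising is the overlap between a pathway and the list of selected genes? The sketch below computes that p-value with the standard library only. It illustrates the general technique and is not the implementation used by this repository; `ora_pvalue` and its parameter names are hypothetical.

```python
from math import comb


def ora_pvalue(n_background, n_pathway, n_selected, n_overlap):
    """P(X >= n_overlap) for X ~ Hypergeometric(n_background, n_pathway, n_selected).

    n_background: genes in the background set
    n_pathway:    background genes annotated to the pathway
    n_selected:   genes in the list under test (e.g. differentially expressed)
    n_overlap:    selected genes that belong to the pathway
    """
    largest_overlap = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, k) * comb(n_background - n_pathway, n_selected - k)
        for k in range(n_overlap, largest_overlap + 1)
    )
    return tail / total


# Toy case: 10 background genes, 5 in the pathway, 5 selected, all 5 overlap.
# Only one of the comb(10, 5) = 252 possible selections gives this overlap.
p = ora_pvalue(10, 5, 5, 5)
print(p == 1 / 252)  # True
```

In practice the p-values from all tested pathways are then corrected for multiple testing.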

Rendering the report uses the Quarto CLI. If it is not already installed, see https://quarto.org/docs/download/

Quarto report generation is now fully driven by the config.yaml file.
This step is mandatory and no longer relies on hardcoded parameters.
The section `05_generate_quarto` controls how the report is built:

    05_generate_quarto:
      contrast_folder: "results/contrasts/"
      output_folder: "QUARTO/"
      template_dir: "Quarto_template/"
      project_title: "YOUR TITLE"
      author_name: "AUTHOR NAME"
      split_to_remove: 1 # for now, this parameter is used to isolate the contrast name when generating files
      modules:
        QC:
          name: "Quality Control"
          type: "single"
          file: "QC.qmd"
          include: true
        DE:
          name: "Differential Analysis"
          type: "menu"
          template: "Differential_analysis_template.qmd"
          prefix: ""
          model: "condition"
          include: true

By default, all modules are included. Modules define what appears in the final Quarto report.

Each module has:

| Parameter | Description |
|---|---|
| `name` | Display name in the navbar |
| `type` | `"single"` or `"menu"` |
| `include` | Enable or disable the module |
| `file` | (single) static `.qmd` file |
| `template` | (menu) template used to generate multiple pages |
| `prefix` | Prefix added to generated files |
| `model` | Column used to group contrasts |

A `single` module:
- One static `.qmd` file
- Used for global sections (QC, Venn, summary)

A `menu` module:
- Generates one page per contrast
- Uses a template located in `Quarto_template/`

The script automatically:
- loops over contrasts
- fills the template
- creates one `.qmd` per contrast
- adds them to the Quarto navbar

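
The loop described above can be pictured with a short sketch. This is not the content of `05_generate_quarto_files.py`: the function `generate_pages`, the `$contrast` placeholder, and the contrast name used below are all hypothetical, and the real templates may use a different substitution mechanism.

```python
import tempfile
from pathlib import Path
from string import Template


def generate_pages(template_text, contrasts, output_dir, prefix=""):
    """Write one .qmd per contrast by filling a template; return the file names."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    template = Template(template_text)
    written = []
    for name in contrasts:
        page = output_dir / f"{prefix}{name}.qmd"
        # Replace every $contrast placeholder with the contrast name
        page.write_text(template.substitute(contrast=name), encoding="utf-8")
        written.append(page.name)
    return written


# Hypothetical template and contrast name, written to a temporary folder
template_text = '---\ntitle: "$contrast"\n---\n\nResults for $contrast.\n'
with tempfile.TemporaryDirectory() as tmp:
    pages = generate_pages(template_text, ["Control_Young_vs_Trained_Young"], tmp)
    first_page = (Path(tmp) / pages[0]).read_text(encoding="utf-8")

print(pages)  # ['Control_Young_vs_Trained_Young.qmd']
```

The returned list of file names is what a navbar menu would need to reference.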

Generate the Quarto files and render the report:

    python 05_generate_quarto_files.py
    cd QUARTO
    quarto render

A Snakemake pipeline is provided to automate the full workflow.

To launch the pipeline:

    snakemake --cores N # replace N with the number of cores to use

The pipeline relies on `config.yaml`.
You must define at least:

    general:
      organism: Mouse # CHOOSE YOUR ORGANISM: Human or Mouse

    paths:
      counts: "data/salmon.merged.gene_counts.tsv"
      metadata: "meta/meta.xlsx"

    01_create_anndata_object:
      gene_id_col: "gene_id" # Column name of the gene id (gene_id from the nf-core pipeline)
      condition_columns: # Define which metadata columns are combined to create the `condition` column
        - "treatment"
        - "Age"

    03_create_contrast:
      paired_contrast:
        - ["Control_Young","Trained_Young"] # always: [reference, test] in that order. example: ["control", "treated"]

| Parameter | Description |
|---|---|
| `organism` | Used for pathway databases (GSEA / ORA) |
| `counts` | Gene count matrix |
| `metadata` | Sample metadata |
| `gene_id_col` | Column containing gene identifiers |
| `condition_columns` | Columns combined to build the condition variable |
| `paired_contrast` | List of comparisons `[reference, test]` |

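
A quick check of the minimum configuration can catch a missing key before Snakemake starts. The sketch below works on an already-loaded dictionary (for example the result of PyYAML's `yaml.safe_load`). It is not part of the pipeline: `missing_keys` and `REQUIRED_KEYS` are hypothetical names, and the section nesting is an assumption based on the parameter table.

```python
# (section, key) pairs taken from the parameter table; the nesting is assumed
REQUIRED_KEYS = [
    ("general", "organism"),
    ("paths", "counts"),
    ("paths", "metadata"),
    ("01_create_anndata_object", "gene_id_col"),
    ("01_create_anndata_object", "condition_columns"),
    ("03_create_contrast", "paired_contrast"),
]


def missing_keys(config):
    """Return the required entries absent from config, as 'section.key' strings."""
    missing = []
    for section, key in REQUIRED_KEYS:
        section_values = config.get(section)
        if not isinstance(section_values, dict) or key not in section_values:
            missing.append(f"{section}.{key}")
    return missing


# Example configuration with the metadata path deliberately left out
example_config = {
    "general": {"organism": "Mouse"},
    "paths": {"counts": "data/salmon.merged.gene_counts.tsv"},
    "01_create_anndata_object": {
        "gene_id_col": "gene_id",
        "condition_columns": ["treatment", "Age"],
    },
    "03_create_contrast": {"paired_contrast": [["Control_Young", "Trained_Young"]]},
}

problems = missing_keys(example_config)
print(problems)  # ['paths.metadata']
```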
The pipeline executes the following steps:
- Initialization (folder structure and database download)
- AnnData creation
- Data exploration
- Differential expression (DESeq2)
- Pathway analysis (GSEA / ORA)
- Quarto file generation
- Quarto rendering
The pipeline automatically handles initialization:
- If the project is not initialized, it will create the structure and download databases
- If already initialized, the step is skipped
- If databases already exist, they are reused
If you change the organism, make sure the corresponding databases exist or rerun initialization.
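
One way to check whether the databases for an organism are already in place is to look for a non-empty `data/DB/<organism>/` folder. The sketch below shows that check. It is an assumption about how such a test could work, not the pipeline's actual logic; `databases_present` and the file name `example.gmt` are hypothetical.

```python
import tempfile
from pathlib import Path


def databases_present(project_root, organism):
    """Return True if data/DB/<organism>/ exists and contains at least one entry."""
    db_dir = Path(project_root) / "data" / "DB" / organism
    return db_dir.is_dir() and any(db_dir.iterdir())


# Demonstration in a temporary project folder holding only Mouse databases
with tempfile.TemporaryDirectory() as root:
    mouse_dir = Path(root) / "data" / "DB" / "Mouse"
    mouse_dir.mkdir(parents=True)
    (mouse_dir / "example.gmt").write_text("", encoding="utf-8")  # placeholder file

    mouse_ready = databases_present(root, "Mouse")
    human_ready = databases_present(root, "Human")

print(mouse_ready, human_ready)  # True False
```

A `False` result for the configured organism is the case where initialization should be rerun.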

    snakemake --cores 8
    snakemake quarto_render

- Notebooks can still be run independently without Snakemake
- The YAML file ensures reproducibility and consistency across the pipeline
- The module system allows easy extension (e.g. TF inference or additional analyses)
This pipeline provides:
- Clean DESeq2 workflow
- Automated pathway analysis
- Fully reproducible reporting
👉 From raw counts → publication-ready report in a few steps
The following features are currently under development:
- Heatmaps → Improved visualization of gene expression and pathway activity for specific genes of interest
Contributions, suggestions, and feedback are welcome!