KalaitzakisNikolaos/Biodel

BRCA PAM50 Subtype Classification: Elastic Net vs Conditional VAE

Last updated: October 11, 2025

🌐 Interactive Project Website

→ Open the complete project showcase (View in browser)

Explore:

  • 📊 Comprehensive project overview and methodology
  • 🤖 Model architectures, hyperparameters, and test performance
  • 🔬 External validation on GSE96058 (n=3,409) and METABRIC (n=1,980)
  • 📈 Interactive figures gallery with distributions, confidence plots, entropy analysis
  • 💾 All downloadable artifacts: models, predictions, distribution CSVs

🎯 Project Overview

This project implements and compares interpretable and deep learning approaches for PAM50 breast cancer subtype classification from RNA-Seq expression data, with rigorous external validation across independent cohorts.

Key Achievements

✅ Interpretable baseline: Elastic Net on MSigDB Hallmark pathway features (F1=0.884)
✅ Deep latent model: Conditional Variational Autoencoder with classifier head (F1=0.829)
✅ Pathway-based features: ssGSEA and mean-z normalized enrichment scores
✅ External validation: Tested on GSE96058 (RNA-seq) and METABRIC (Illumina microarray)
✅ Robust alignment: Alias-based feature matching across platforms
✅ Interactive reports: Auto-generated HTML with figures and distribution tables


🏆 Key Results

Test Set Performance (TCGA BRCA, held-out 15% test split of n=1,222)

| Model | Macro F1 | AUROC (OvR) | Balanced Accuracy |
|---|---|---|---|
| Elastic Net (ssGSEA) | 0.884 | 0.985 | 0.898 |
| Best CVAE (β=0.2, cw=1.0) | 0.829 | 0.975 | 0.813 |

Winner: Elastic Net outperforms the best CVAE on all three test metrics (macro F1 0.884 vs 0.829) and remains fully interpretable through its pathway coefficients.
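
To make the reported metrics concrete, here is a minimal pure-Python sketch of macro F1 and balanced accuracy. The toy labels are invented for illustration and are not project data; the project itself presumably relies on scikit-learn, whose `f1_score(average="macro")` and `balanced_accuracy_score` compute the same quantities.

```python
def per_class_counts(y_true, y_pred):
    """Return {label: (tp, fp, fn)} for every label seen in y_true or y_pred."""
    counts = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        counts[label] = (tp, fp, fn)
    return counts


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    f1s = []
    for tp, fp, fn in per_class_counts(y_true, y_pred).values():
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    counts = per_class_counts(y_true, y_pred)
    recalls = []
    for label in sorted(set(y_true)):
        tp, _, fn = counts[label]
        recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)


# Toy labels (illustrative only, not taken from the TCGA test split)
y_true = ["LumA", "LumA", "LumA", "LumB", "Basal", "Basal"]
y_pred = ["LumA", "LumA", "LumB", "LumB", "Basal", "LumA"]

print(round(macro_f1(y_true, y_pred), 3))           # 0.667
print(round(balanced_accuracy(y_true, y_pred), 3))  # 0.722
```

Both metrics weight every subtype equally, which is why they are preferred over plain accuracy when subtype frequencies are imbalanced.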

External Validation

| Cohort | Samples | Platform | Models Evaluated | Status |
|---|---|---|---|---|
| GSE96058 | 3,409 | GEO RNA-seq (SCAN-B, Illumina) | EN (ssGSEA), CVAE (ssGSEA, mean-z) | ✅ Complete |
| METABRIC | 1,980 | Illumina HT-12 v3 | EN (ssGSEA, mean-z), CVAE (ssGSEA, mean-z) | ✅ Complete |

Key findings:

  • 50/50 Hallmark pathways successfully aligned across all cohorts
  • Mean-z normalization is more stable than ssGSEA for CVAE cross-cohort inference
  • Predicted subtype distributions vary by platform and normalization method
  • Full diagnostic outputs available: alignment logs, pathway matrices, confidence metrics

📊 View detailed external validation results →


📁 Repository Structure

Biodel/
├── data/
│   ├── processed_pathways_ssgsea_true/    # TCGA training data (pathway features)
│   └── external/                          # External validation cohorts
│       ├── GSE96058/                      # GEO RNA-seq (SCAN-B)
│       ├── METABRIC/                      # cBioPortal Illumina
│       └── MSigDB/                        # Hallmark gene sets (local GMT)
│
├── outputs/
│   ├── elastic_net_pathways_ssgsea_en_broad/  # Trained Elastic Net model
│   ├── cvae_sweeps/                           # CVAE hyperparameter sweep results
│   │   └── b0.2_cw1.0/                        # Best CVAE checkpoint
│   └── external_eval/                         # External distribution CSVs
│
├── scripts/
│   ├── train_elastic_net.py                   # Elastic Net trainer with grid search
│   ├── sweep_cvae.py                          # CVAE hyperparameter sweep
│   ├── evaluate_external.py                   # Elastic Net external evaluator
│   ├── evaluate_external_cvae.py              # CVAE external evaluator
│   ├── build_external_report.py               # HTML report generator
│   ├── export_external_distributions.py       # Distribution CSV exporter
│   ├── convert_cbio_to_expr.py                # cBioPortal format converter
│   ├── plot_external_results.py               # Advanced plotting utilities
│   ├── calibrate_cvae.py                      # Temperature scaling for probabilities
│   └── plot_vae_embeddings.py                 # UMAP latent space visualization
│
├── reports/
│   ├── index.html                             # Main website homepage
│   ├── overview.html                          # Project overview & methodology
│   ├── models.html                            # Model details & metrics
│   ├── external_validation.html               # External validation results
│   ├── figures.html                           # Figures gallery
│   ├── style.css                              # Website styling
│   ├── external/                              # Interactive cohort reports
│   │   ├── METABRIC/index.html               # METABRIC report with figures/tables
│   │   └── GSE96058/index.html               # GSE96058 report with figures/tables
│   └── summary.md                             # Detailed text summary
│
├── requirements.txt                           # Python dependencies
├── run_all.ps1                                # Full pipeline orchestrator (Windows)
└── README.md                                  # This file

🚀 Quick Start

1. Installation

# Clone repository
git clone <repository-url>
cd Biodel

# Create virtual environment (recommended)
python -m venv venv
.\venv\Scripts\Activate.ps1

# Install dependencies
pip install -r requirements.txt

Core Dependencies:

  • Python 3.8+
  • PyTorch 1.12+ (CUDA optional)
  • scikit-learn, pandas, numpy
  • gseapy (ssGSEA pathway enrichment)
  • umap-learn (latent visualizations)
  • matplotlib, seaborn (plotting)
  • pyarrow (Parquet I/O)

2. View Results (No Training Required)

Option A: Direct open

cd reports
start index.html

Option B: Local server (recommended)

cd reports
python -m http.server 8000
# Open http://localhost:8000 in your browser

Option C: VS Code Live Server

  • Install "Live Server" extension
  • Right-click reports/index.html → "Open with Live Server"

3. Reproduce Training (Optional)

Full pipeline in one command:

.\run_all.ps1

Or run steps individually:

Step 1: Train Elastic Net

python .\scripts\train_elastic_net.py `
  --in-dir .\data\processed_pathways_ssgsea_true `
  --out-dir .\outputs\elastic_net_pathways_ssgsea_en_broad `
  --mode elasticnet `
  --grid_C 0.05 0.1 0.25 0.5 1.0 2.0 `
  --grid_l1 0.05 0.1 0.2 0.3 0.5 0.7 `
  --selection_split val `
  --class_weight balanced `
  --max_iter 5000

Step 2: Train CVAE with sweep

python .\scripts\sweep_cvae.py `
  --pathway-scores .\data\processed_pathways_ssgsea_true\X.parquet `
  --labels .\data\processed_pathways_ssgsea_true\y.csv `
  --splits .\data\processed_pathways_ssgsea_true\splits.csv `
  --label-col pam50_subtype `
  --out-root .\outputs\cvae_sweeps `
  --latent-dim 16 `
  --epochs 80 `
  --batch-size 64 `
  --lr 0.001 `
  --betas 0.1 0.2 0.5 `
  --cls-weights 1.0 2.0 3.0
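
As a rough illustration of what the sweep above covers, the sketch below enumerates the 3 × 3 grid of `--betas` and `--cls-weights`. The `b{beta}_cw{cls_weight}` naming is an assumption inferred from the `outputs/cvae_sweeps/b0.2_cw1.0/` directory; `run_name` is a hypothetical helper, not part of `sweep_cvae.py`.

```python
from itertools import product

betas = [0.1, 0.2, 0.5]
cls_weights = [1.0, 2.0, 3.0]


def run_name(beta, cls_weight):
    # Assumed naming pattern, inferred from outputs/cvae_sweeps/b0.2_cw1.0/
    return f"b{beta}_cw{cls_weight}"


# One run per (beta, classification weight) combination
runs = [run_name(b, cw) for b, cw in product(betas, cls_weights)]

print(len(runs))  # 9
print(runs[:3])   # ['b0.1_cw1.0', 'b0.1_cw2.0', 'b0.1_cw3.0']
```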

Step 3: Generate summary

python .\scripts\generate_summary.py `
  --en-dir .\outputs\elastic_net_pathways_ssgsea_en_broad `
  --cvae-sweep .\outputs\cvae_sweeps\sweep_summary.json `
  --out .\reports\summary.md

🔬 External Validation Workflow

Prerequisites

  • Expression matrix in samples × genes format (CSV or Parquet)
  • Gene symbols (HGNC): case-insensitive, duplicates auto-averaged
  • Optional: labels CSV with sample_id and label column
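
The duplicate handling mentioned above can be pictured with the following sketch. It is an assumption about the behaviour (case-insensitive matching, arithmetic mean of duplicates), written in plain Python rather than taken from the evaluator scripts; `average_duplicate_genes` is a hypothetical name.

```python
from collections import defaultdict


def average_duplicate_genes(expr):
    """Collapse gene symbols that match case-insensitively.

    expr maps sample_id -> {gene_symbol: value}. The result has the same
    shape, with upper-cased symbols and duplicates replaced by their mean.
    """
    collapsed = {}
    for sample_id, genes in expr.items():
        buckets = defaultdict(list)
        for symbol, value in genes.items():
            buckets[symbol.strip().upper()].append(value)
        collapsed[sample_id] = {g: sum(v) / len(v) for g, v in buckets.items()}
    return collapsed


# Toy input: "ESR1" and "esr1" are treated as the same gene
expr = {"S1": {"ESR1": 4.0, "esr1": 6.0, "ERBB2": 2.0}}
result = average_duplicate_genes(expr)
print(result)  # {'S1': {'ESR1': 5.0, 'ERBB2': 2.0}}
```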

Supported Cohorts

  1. GSE96058 - GEO RNA-seq (SCAN-B), 3,409 samples
  2. METABRIC - cBioPortal Illumina HT-12 v3, 1,980 samples

Step-by-Step Guide

1. Prepare expression data

If your data is genes × samples (cBioPortal format), convert first:

python .\scripts\convert_cbio_to_expr.py `
  --input "data\external\METABRIC\brca_metabric\data_mrna_illumina_microarray.txt" `
  --output "data\external\METABRIC\expr.csv"
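
Conceptually, the conversion is a transpose. The sketch below shows the idea on a tiny in-memory table; it assumes the usual cBioPortal layout (tab-separated, a `Hugo_Symbol` column, an `Entrez_Gene_Id` column, then one column per sample) and is not the code of `convert_cbio_to_expr.py`. The sample IDs and values are made up.

```python
import csv
import io


def genes_by_samples_to_samples_by_genes(
    text, id_col="Hugo_Symbol", skip_cols=("Entrez_Gene_Id",)
):
    """Transpose a tab-separated genes x samples table into samples x genes rows."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    id_idx = header.index(id_col)
    sample_idx = [
        i for i, name in enumerate(header) if i != id_idx and name not in skip_cols
    ]
    genes = []
    columns = {header[i]: [] for i in sample_idx}
    for row in reader:
        genes.append(row[id_idx])
        for i in sample_idx:
            columns[header[i]].append(row[i])
    # First output row is the new header; each later row is one sample
    rows = [["sample_id"] + genes]
    rows += [[sample] + values for sample, values in columns.items()]
    return rows


# Illustrative input (values are invented)
demo = (
    "Hugo_Symbol\tEntrez_Gene_Id\tMB-0001\tMB-0002\n"
    "ESR1\t2099\t9.1\t5.2\n"
    "ERBB2\t2064\t7.3\t11.8\n"
)
rows = genes_by_samples_to_samples_by_genes(demo)
for row in rows:
    print(row)
```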

2. Run Elastic Net evaluation

python .\scripts\evaluate_external.py `
  --expr .\data\external\METABRIC\expr.csv `
  --expr-format csv `
  --model-dir .\outputs\elastic_net_pathways_ssgsea_en_broad `
  --out-dir .\data\external\METABRIC\en_ssgsea `
  --method ssgsea `
  --library .\data\external\MSigDB\h.all.v2025.1.Hs.symbols.gmt

3. Run CVAE evaluation

python .\scripts\evaluate_external_cvae.py `
  --expr .\data\external\METABRIC\expr.csv `
  --expr-format csv `
  --checkpoint .\outputs\cvae_sweeps\b0.2_cw1.0\best.pt `
  --out-dir .\data\external\METABRIC\cvae_meanz `
  --method meanz `
  --library .\data\external\MSigDB\h.all.v2025.1.Hs.symbols.gmt

Recommended pathway methods:

  • Elastic Net: --method ssgsea (ranked enrichment, interpretable)
  • CVAE: --method meanz (normalized, more stable across cohorts)

4. Export distributions comparison

python .\scripts\export_external_distributions.py `
  --outputs "data\external\METABRIC\en_ssgsea" `
            "data\external\METABRIC\en_meanz" `
            "data\external\METABRIC\cvae_ssgsea" `
            "data\external\METABRIC\cvae_meanz" `
  --out-csv "outputs\external_eval\METABRIC_distributions.csv"

5. Generate interactive HTML report

python .\scripts\build_external_report.py `
  --cohort METABRIC `
  --runs "data\external\METABRIC\en_ssgsea" `
         "data\external\METABRIC\en_meanz" `
         "data\external\METABRIC\cvae_ssgsea" `
         "data\external\METABRIC\cvae_meanz" `
  --outdir "reports\external\METABRIC"

Evaluation Outputs

Each run produces:

  • predictions.csv - Predicted PAM50 subtype per sample
  • probabilities.csv - 5-class probabilities (Basal, HER2, LumA, LumB, Normal)
  • debug_alignment.json - Feature alignment diagnostics
  • pathways_matrix_raw.csv - Raw pathway scores (for inspection)

CVAE runs additionally save:

  • embeddings.parquet - Latent space (z) embeddings for UMAP
  • logits.parquet - Raw classifier logits

HTML reports include:

  • Distribution panel (bar charts across runs)
  • Confidence KDE plots per run
  • Predictive entropy histograms
  • Top-5 most confident samples per subtype (CSV downloads)
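
The confidence and entropy plots summarize each row of probabilities.csv. A minimal sketch of the two quantities, assuming confidence is the maximum class probability and entropy is the Shannon entropy in nats (the report scripts may use a different log base):

```python
import math


def confidence(probs):
    """Probability assigned to the predicted (most likely) class."""
    return max(probs)


def predictive_entropy(probs):
    """Shannon entropy in nats; 0 for a one-hot prediction, ln(K) for uniform."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Toy 5-class rows in the order Basal, HER2, LumA, LumB, Normal
confident_row = [0.02, 0.01, 0.90, 0.05, 0.02]
uniform_row = [0.2] * 5

print(confidence(confident_row))                  # 0.9
print(round(predictive_entropy(uniform_row), 4))  # 1.6094, i.e. ln(5)
```

Low-confidence, high-entropy samples are the ones most worth inspecting when predicted subtype distributions differ between runs.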

📊 View example reports →


📊 Model Details

Elastic Net Baseline

Architecture:

  • Input: 50 MSigDB Hallmark pathway enrichment scores (ssGSEA)
  • Model: Elastic Net logistic regression (L1 + L2 regularization)
  • Solver: SAGA (the scikit-learn LogisticRegression solver that supports the elastic-net penalty)
  • Class weights: Balanced (inverse frequency)
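
For reference, "balanced" class weights follow the scikit-learn convention `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula on invented label counts, purely to show how rarer subtypes receive larger weights.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights as in scikit-learn's class_weight='balanced'."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Toy label counts (illustrative only)
labels = ["LumA"] * 6 + ["LumB"] * 3 + ["Basal"] * 1
weights = balanced_class_weights(labels)

for subtype, weight in weights.items():
    print(subtype, round(weight, 3))
# LumA 0.556
# LumB 1.111
# Basal 3.333
```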

Hyperparameters:

  • Regularization grid: C ∈ {0.05, 0.1, 0.25, 0.5, 1.0, 2.0}
  • L1 ratio grid: l1_ratio ∈ {0.05, 0.1, 0.2, 0.3, 0.5, 0.7}
  • Selection criterion: Validation set macro F1
  • Max iterations: 5000
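
The grid above amounts to 6 × 6 = 36 candidate models, with the winner chosen by validation macro F1. The sketch below shows only the selection logic; `toy_val_score` is a stand-in scorer invented for illustration. In a real run each configuration would be fitted on the training split (in scikit-learn, `C` and `l1_ratio` are arguments of `LogisticRegression` with `solver="saga"`) and scored on the validation split.

```python
from itertools import product

grid_C = [0.05, 0.1, 0.25, 0.5, 1.0, 2.0]
grid_l1 = [0.05, 0.1, 0.2, 0.3, 0.5, 0.7]

# Every (C, l1_ratio) pair in the grid
configs = list(product(grid_C, grid_l1))


def select_best(candidates, val_score):
    """Return the candidate with the highest validation score (first wins ties)."""
    return max(candidates, key=val_score)


def toy_val_score(config):
    # Placeholder for "fit on train, return macro F1 on validation".
    # It simply peaks at C=0.5, l1_ratio=0.2 so the example has a clear winner.
    C, l1_ratio = config
    return -abs(C - 0.5) - abs(l1_ratio - 0.2)


best = select_best(configs, toy_val_score)
print(len(configs))  # 36
print(best)          # (0.5, 0.2)
```

Selecting on the validation split keeps the test split untouched until the final evaluation.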

Outputs:

  • model.pkl - Trained scikit-learn model
  • coefficients.csv - Pathway importance per subtype (interpretable!)
  • metrics.json - Test F1, AUROC, balanced accuracy
  • confusion_matrix.csv - Per-class performance

Location: outputs/elastic_net_pathways_ssgsea_en_broad/

Conditional VAE (CVAE)

Architecture:

  • Encoder: Pathway features → latent embeddings (z), dim=16
  • Decoder: Reconstruct pathway features from z
  • Classifier head: z → 5-class PAM50 probabilities
  • Loss: Reconstruction + β·KL(q(z|x)||p(z)) + cls_weight·(classification cross-entropy)
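
A worked sketch of the three loss terms for a single sample, in plain Python. It makes several assumptions that the README does not confirm: mean squared error as the reconstruction term, a standard normal prior p(z) with a diagonal Gaussian posterior, and the classification weight multiplying the cross-entropy term. All function names are illustrative.

```python
import math


def kl_diag_gaussian(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)) in closed form."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def mse(x, x_hat):
    """Mean squared reconstruction error (assumed reconstruction term)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def cross_entropy(probs, target_index):
    """Negative log-probability of the true class."""
    return -math.log(probs[target_index])


def cvae_loss(x, x_hat, mu, logvar, probs, target_index, beta=0.2, cls_weight=1.0):
    """Reconstruction + beta * KL + cls_weight * cross-entropy."""
    return (
        mse(x, x_hat)
        + beta * kl_diag_gaussian(mu, logvar)
        + cls_weight * cross_entropy(probs, target_index)
    )


# Perfect reconstruction, a 1-d latent with mu=1 and unit variance (KL = 0.5),
# and a 50/50 classifier (cross-entropy = ln 2)
loss = cvae_loss(
    x=[1.0, 2.0], x_hat=[1.0, 2.0],
    mu=[1.0], logvar=[0.0],
    probs=[0.5, 0.5], target_index=0,
)
print(round(loss, 4))  # 0.7931 = 0 + 0.2 * 0.5 + ln(2)
```

Raising β pulls the latent distribution toward the prior, while raising the classification weight prioritizes subtype separation over reconstruction; the sweep explores that trade-off.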

Hyperparameter sweep:

  • Beta (KL weight): {0.1, 0.2, 0.5}
  • Classification weight: {1.0, 2.0, 3.0}
  • Early stopping: Validation loss patience=10 epochs
  • Best config: β=0.2, cls_weight=1.0 (selected by validation F1)
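
Early stopping with patience=10 can be sketched as follows. This is a generic formulation (stop once the validation loss has not improved for `patience` consecutive epochs), not necessarily the exact rule in `sweep_cvae.py`; the loss values are invented.

```python
def early_stopping_epoch(val_losses, patience=10):
    """Return the 0-based epoch index at which training would stop."""
    best, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            # New best validation loss: remember where it happened
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row
            return epoch
    return len(val_losses) - 1


# Loss improves for three epochs, then plateaus
val_losses = [1.0, 0.9, 0.8] + [0.85] * 20
print(early_stopping_epoch(val_losses))  # 12
```

In this toy run the best epoch is 2 and training halts at epoch 12, so the 80-epoch budget acts as an upper bound rather than a fixed length.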

Training details:

  • Optimizer: Adam (lr=0.001)
  • Batch size: 64
  • Epochs: 80 (with early stopping)
  • Device: CUDA if available, else CPU

Outputs:

  • best.pt - PyTorch checkpoint (state_dict)
  • metrics.json - Test F1, AUROC, balanced accuracy, loss curves
  • embeddings.parquet - Latent space (z) for UMAP
  • calibration/ - Temperature-scaled probabilities (optional)
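
Temperature scaling, as used for the optional calibration output, divides the classifier logits by a scalar T before the softmax. The sketch below shows the effect on one made-up logit vector; how `calibrate_cvae.py` fits T is not shown here and the value T=2.0 is arbitrary.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Toy logits for the 5 PAM50 classes (illustrative only)
logits = [2.0, 1.0, 0.1, -1.0, 0.0]
raw = softmax(logits)
cooled = softmax(logits, temperature=2.0)

print([round(p, 3) for p in raw])
print([round(p, 3) for p in cooled])
```

With T > 1 the probabilities become less peaked while the predicted class stays the same, so calibration changes confidence but not the macro F1 of the hard predictions.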

Location: outputs/cvae_sweeps/b0.2_cw1.0/


🧬 Data & Features

Training Data (TCGA BRCA)

  • Samples: 1,222 (after quality filtering)
  • Features: 50 MSigDB Hallmark pathways
  • Labels: PAM50 subtypes
    • Basal-like
    • HER2-enriched
    • Luminal A
    • Luminal B
    • Normal-like
  • Split: 70% train, 15% validation, 15% test (stratified by subtype)
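
A stratified 70/15/15 split keeps subtype proportions roughly equal across the three partitions. The following stdlib sketch illustrates the idea on invented labels; the project's actual splits live in splits.csv and may have been produced differently (for example with scikit-learn).

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Return {sample_index: 'train' | 'val' | 'test'}, stratified by label."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    assignment = {}
    for indices in by_label.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        for pos, idx in enumerate(indices):
            if pos < n_train:
                assignment[idx] = "train"
            elif pos < n_train + n_val:
                assignment[idx] = "val"
            else:
                assignment[idx] = "test"  # remainder goes to the test split
    return assignment


# Toy cohort: 100 LumA and 40 Basal samples
labels = ["LumA"] * 100 + ["Basal"] * 40
split = stratified_split(labels)
sizes = Counter(split.values())
print(sizes["train"], sizes["val"], sizes["test"])  # 98 21 21
```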

Pathway Enrichment Methods

ssGSEA (Single Sample Gene Set Enrichment Analysis):

  • Ranked enrichment scores per sample
  • Used for Elastic Net (primary) and CVAE (optional)
  • Tool: gseapy.ssgsea()
  • Best for interpretability and within-platform consistency

Mean-z normalization:

  • Gene-level z-scores averaged across pathway members
  • More stable for cross-platform/cross-cohort inference
  • Recommended for CVAE external validation
  • Fallback when ssGSEA shows platform sensitivity
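
The mean-z score can be sketched in a few lines. This version assumes z-scores are computed per gene across the samples of one cohort using the population standard deviation, and that pathway members missing from the data are skipped; the evaluator scripts may differ in those details. The gene and pathway names are placeholders.

```python
from statistics import mean, pstdev


def mean_z_scores(expr, gene_sets):
    """expr: {sample: {gene: value}}; gene_sets: {pathway: [genes]}.

    Returns {sample: {pathway: mean z-score of the pathway's measured genes}}.
    """
    samples = list(expr)
    genes = set.intersection(*(set(expr[s]) for s in samples))
    z = {s: {} for s in samples}
    for g in genes:
        values = [expr[s][g] for s in samples]
        mu, sd = mean(values), pstdev(values)
        for s in samples:
            # Constant genes carry no signal; give them a z-score of 0
            z[s][g] = (expr[s][g] - mu) / sd if sd > 0 else 0.0
    scores = {}
    for s in samples:
        scores[s] = {}
        for pathway, members in gene_sets.items():
            present = [z[s][g] for g in members if g in genes]
            scores[s][pathway] = mean(present) if present else float("nan")
    return scores


# Toy data: gene "C" is in the set but not measured
expr = {"S1": {"A": 1.0, "B": 2.0}, "S2": {"A": 3.0, "B": 2.0}}
gene_sets = {"HALLMARK_DEMO": ["A", "B", "C"]}
scores = mean_z_scores(expr, gene_sets)
print(scores)  # {'S1': {'HALLMARK_DEMO': -0.5}, 'S2': {'HALLMARK_DEMO': 0.5}}
```

Because each gene is standardized within its own cohort, platform-specific offsets and scales largely cancel, which is one plausible reason for the better cross-cohort stability noted above.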

Feature Alignment (External Validation)

  • Pathway names normalized: uppercase, strip whitespace
  • Alias matching handles variants:
    • TNF-alpha ↔ TNFA
    • Peroxisome ↔ Pperoxisome
  • Debug outputs: debug_alignment.json with match counts and variance checks
  • Result: 50/50 Hallmark pathways matched across all cohorts
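
The alignment step can be illustrated with a small sketch. The alias table and the hyphen/space handling are assumptions made for this example (only upper-casing and whitespace stripping are stated above), and `normalize` and `align` are hypothetical names rather than functions from the evaluators.

```python
import re

# Hypothetical alias table: normalized variant -> canonical token
ALIASES = {"TNF_ALPHA": "TNFA", "PPEROXISOME": "PEROXISOME"}


def normalize(name):
    """Upper-case, trim, unify separators, then apply alias replacements."""
    key = re.sub(r"[\s\-]+", "_", name.strip().upper())
    for variant, canonical in ALIASES.items():
        key = key.replace(variant, canonical)
    return key


def align(model_features, cohort_features):
    """Map each model feature to its cohort column; also list unmatched ones."""
    lookup = {normalize(c): c for c in cohort_features}
    matched = {
        f: lookup[normalize(f)] for f in model_features if normalize(f) in lookup
    }
    missing = [f for f in model_features if f not in matched]
    return matched, missing


model_features = [
    "HALLMARK_TNFA_SIGNALING_VIA_NFKB",
    "HALLMARK_PEROXISOME",
    "HALLMARK_HYPOXIA",
]
cohort_features = [" hallmark_tnf-alpha_signaling_via_nfkb ", "HALLMARK_PPEROXISOME"]

matched, missing = align(model_features, cohort_features)
print(len(matched), missing)  # 2 ['HALLMARK_HYPOXIA']
```

A match count below 50 in debug_alignment.json would correspond to a non-empty `missing` list in this sketch.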

🛠️ Scripts & Utilities

Core Training

| Script | Purpose |
|---|---|
| train_elastic_net.py | Elastic Net trainer with grid search |
| sweep_cvae.py | CVAE hyperparameter sweep with early stopping |
| calibrate_cvae.py | Temperature scaling for probability calibration |
| plot_vae_embeddings.py | UMAP visualization of latent space |

External Evaluation

| Script | Purpose |
|---|---|
| evaluate_external.py | Elastic Net external inference |
| evaluate_external_cvae.py | CVAE external inference |
| convert_cbio_to_expr.py | Convert cBioPortal genes×samples → samples×genes |
| export_external_distributions.py | Generate side-by-side count tables (CSV) |
| build_external_report.py | Generate interactive HTML reports |
| plot_external_results.py | Advanced plotting (KDE, entropy, UMAP) |

Orchestration

| Script | Purpose |
|---|---|
| run_all.ps1 | Full pipeline: train, sweep, calibrate, plot, summarize |

📈 Figures & Visualizations

All figures are available in the figures gallery (reports/figures.html).

Available Visualizations

TCGA Test Set:

  • Confusion matrices (Elastic Net, CVAE)
  • UMAP of latent space colored by PAM50 subtype
  • Pathway coefficient heatmap (Elastic Net)

External Validation (per cohort):

  • Distribution panels (bar charts across 4 runs)
  • Confidence KDE plots (per run)
  • Predictive entropy histograms (per run)
  • UMAP entropy maps (when embeddings available)
  • Top-5 most confident samples per subtype (CSV)

Examples:

  • reports/external/METABRIC/METABRIC_distributions_panel.png
  • reports/external/GSE96058/GSE96058_cvae_meanz_confidence_kde.png
  • reports/external/METABRIC/METABRIC_en_ssgsea_entropy_hist.png

🔧 Troubleshooting

Common Issues

1. Parquet I/O errors

# Install pyarrow if missing
pip install pyarrow

2. gseapy network errors (ssGSEA)

  • Solution 1: Use local Hallmark GMT
    --library data\external\MSigDB\h.all.v2025.1.Hs.symbols.gmt
  • Solution 2: Fallback to mean-z
    --method meanz

3. Unicode rendering issues

  • Ensure editor is set to UTF-8 encoding
  • Issue affects reports/summary.md

4. CVAE checkpoint loading fails

  • Check metrics.json for correct latent_dim
  • Evaluator auto-infers from checkpoint if metrics missing
  • Verify PyTorch version compatibility

5. Feature alignment shows low match count

  • Check debug_alignment.json for diagnostics
  • Ensure gene symbols are uppercased
  • Duplicates are automatically averaged

Windows-Specific Notes

  • ✅ Use PowerShell (not CMD)
  • ✅ Paths: Use backslashes \ or forward slashes /
  • ✅ Line continuations: Use backtick ` at end of line
  • ✅ Scripts tested on Windows 10/11

📚 References & Citations

Methods

  • PAM50: Parker et al., 2009 - "Supervised risk predictor of breast cancer based on intrinsic subtypes"
  • Elastic Net: Zou & Hastie, 2005 - "Regularization and variable selection via the elastic net"
  • VAE: Kingma & Welling, 2014 - "Auto-Encoding Variational Bayes"
  • Hallmark: Liberzon et al., 2015 - "The Molecular Signatures Database (MSigDB) hallmark gene set collection"
  • ssGSEA: Barbie et al., 2009 - "Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1"

Data Sources

  • TCGA-BRCA: The Cancer Genome Atlas Program (National Cancer Institute)
  • GSE96058: Gene Expression Omnibus (GEO/NCBI)
  • METABRIC: cBioPortal for Cancer Genomics (MSKCC/Dana-Farber)
  • MSigDB: Molecular Signatures Database (Broad Institute)

📋 System Requirements

Python: 3.8 or higher

Key Dependencies:

torch>=1.12.0
scikit-learn>=1.0.0
pandas>=1.3.0
numpy>=1.21.0
gseapy>=1.0.0
umap-learn>=0.5.0
matplotlib>=3.5.0
seaborn>=0.11.0
pyarrow>=6.0.0

Optional:

  • CUDA toolkit (for GPU acceleration)
  • Jupyter (for notebook exploration)

See requirements.txt for complete dependency list.

Tested on:

  • Windows 10/11 (PowerShell)
  • Python 3.8, 3.9, 3.10
  • CPU and CUDA environments

🤝 Contributing

Contributions are welcome! To contribute:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Commit your changes (git commit -m 'Add amazing feature')
  4. Push to the branch (git push origin feature/amazing-feature)
  5. Open a Pull Request

Guidelines:

  • Add tests for new functionality
  • Update documentation
  • Follow existing code style
  • Include clear commit messages

📄 License

This project is licensed under the MIT License. See LICENSE file for details.


🙏 Acknowledgments

  • TCGA Research Network for BRCA genomic data
  • MSigDB team (Broad Institute) for Hallmark gene sets
  • cBioPortal and GEO for external cohort access
  • PyTorch, scikit-learn, and gseapy communities
  • All contributors and users of this project

📞 Contact & Support


🚀 Project Status

Current Version: 1.0.0
Status: ✅ Complete

  • Training pipeline (Elastic Net + CVAE)
  • External validation (GSE96058, METABRIC)
  • Interactive HTML reports
  • Distribution comparison tables
  • Advanced visualizations
  • Comprehensive documentation

Next Steps (Optional):

  • Add calibration plots when labels available
  • Implement survival analysis
  • Add more external cohorts
  • Create Docker container
  • Add Jupyter notebook tutorials

Built with ❤️ for reproducible cancer research
