BRCA PAM50 Subtype Classification: Elastic Net vs Conditional VAE

Last updated: October 11, 2025

🌐 Interactive Project Website

→ Open the complete project showcase (View in browser)

Explore:

📊 Comprehensive project overview and methodology
🤖 Model architectures, hyperparameters, and test performance
🔬 External validation on GSE96058 (n=3,409) and METABRIC (n=1,980)
📈 Interactive figures gallery with distributions, confidence plots, entropy analysis
💾 All downloadable artifacts: models, predictions, distribution CSVs

🎯 Project Overview

This project implements and compares an interpretable linear model (Elastic Net) and a deep latent-variable model (Conditional VAE) for PAM50 breast cancer subtype classification from RNA-Seq expression data, with external validation on two independent cohorts, GSE96058 and METABRIC.

Key Achievements

✅ Interpretable baseline: Elastic Net on MSigDB Hallmark pathway features (F1=0.884)
✅ Deep latent model: Conditional Variational Autoencoder with classifier head (F1=0.829)
✅ Pathway-based features: ssGSEA and mean-z normalized enrichment scores
✅ External validation: Tested on GSE96058 (microarray) and METABRIC (Illumina)
✅ Robust alignment: Alias-based feature matching across platforms
✅ Interactive reports: Auto-generated HTML with figures and distribution tables

🏆 Key Results

Test Set Performance (TCGA BRCA, held-out 15% test split of n=1,222)

| Model | Macro F1 | AUROC (OvR) | Balanced Accuracy |
|---|---|---|---|
| Elastic Net (ssGSEA) | 0.884 | 0.985 | 0.898 |
| Best CVAE (β=0.2, cw=1.0) | 0.829 | 0.975 | 0.813 |

Winner: Elastic Net beats the best CVAE on all three test metrics (macro F1 0.884 vs 0.829, AUROC 0.985 vs 0.975, balanced accuracy 0.898 vs 0.813) while staying fully interpretable through its pathway coefficients.
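
For readers who want to recompute the three reported metrics from a probabilities table, the sketch below shows one way to do it with scikit-learn. The `summarize` helper and the toy probabilities are illustrative assumptions, not the project's evaluation code; it also assumes the probability columns follow the sorted label order.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, f1_score, roc_auc_score

def summarize(y_true, proba, classes):
    """Return macro F1, one-vs-rest AUROC, and balanced accuracy.

    Hypothetical helper: `proba` columns are assumed to follow `classes`,
    and `classes` is assumed to be in sorted (lexicographic) order.
    """
    y_pred = np.asarray(classes)[proba.argmax(axis=1)]
    return {
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
        "auroc_ovr": roc_auc_score(y_true, proba, multi_class="ovr"),
        "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    }

# Toy data: 20 samples per subtype with noisy scores favouring the true class.
classes = ["Basal", "HER2", "LumA", "LumB", "Normal"]
rng = np.random.default_rng(0)
true_idx = np.repeat(np.arange(5), 20)
y_true = np.asarray(classes)[true_idx]
logits = rng.normal(size=(100, 5))
logits[np.arange(100), true_idx] += 2.0
proba = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

metrics = summarize(y_true, proba, classes)
print(metrics)
```

Macro F1 and balanced accuracy both weight every subtype equally, which matters here because PAM50 classes are imbalanced.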

External Validation

| Cohort | Samples | Platform | Models Evaluated | Status |
|---|---|---|---|---|
| GSE96058 | 3,409 | GEO microarray (GPL570) | EN (ssGSEA), CVAE (ssGSEA, mean-z) | ✅ Complete |
| METABRIC | 1,980 | Illumina HT-12 v3 | EN (ssGSEA, mean-z), CVAE (ssGSEA, mean-z) | ✅ Complete |

Key findings:

All 50 Hallmark pathways (50/50) aligned successfully in every cohort
Mean-z normalization was more stable than ssGSEA for CVAE cross-cohort inference
Predicted subtype distributions vary by platform and normalization method
Full diagnostic outputs available: alignment logs, pathway matrices, confidence metrics

📊 View detailed external validation results →

📁 Repository Structure

Biodel/
├── data/
│   ├── processed_pathways_ssgsea_true/    # TCGA training data (pathway features)
│   └── external/                          # External validation cohorts
│       ├── GSE96058/                      # GEO microarray (GPL570)
│       ├── METABRIC/                      # cBioPortal Illumina
│       └── MSigDB/                        # Hallmark gene sets (local GMT)
│
├── outputs/
│   ├── elastic_net_pathways_ssgsea_en_broad/  # Trained Elastic Net model
│   ├── cvae_sweeps/                           # CVAE hyperparameter sweep results
│   │   └── b0.2_cw1.0/                        # Best CVAE checkpoint
│   └── external_eval/                         # External distribution CSVs
│
├── scripts/
│   ├── train_elastic_net.py                   # Elastic Net trainer with grid search
│   ├── sweep_cvae.py                          # CVAE hyperparameter sweep
│   ├── evaluate_external.py                   # Elastic Net external evaluator
│   ├── evaluate_external_cvae.py              # CVAE external evaluator
│   ├── build_external_report.py               # HTML report generator
│   ├── export_external_distributions.py       # Distribution CSV exporter
│   ├── convert_cbio_to_expr.py                # cBioPortal format converter
│   ├── plot_external_results.py               # Advanced plotting utilities
│   ├── calibrate_cvae.py                      # Temperature scaling for probabilities
│   └── plot_vae_embeddings.py                 # UMAP latent space visualization
│
├── reports/
│   ├── index.html                             # Main website homepage
│   ├── overview.html                          # Project overview & methodology
│   ├── models.html                            # Model details & metrics
│   ├── external_validation.html               # External validation results
│   ├── figures.html                           # Figures gallery
│   ├── style.css                              # Website styling
│   ├── external/                              # Interactive cohort reports
│   │   ├── METABRIC/index.html               # METABRIC report with figures/tables
│   │   └── GSE96058/index.html               # GSE96058 report with figures/tables
│   └── summary.md                             # Detailed text summary
│
├── requirements.txt                           # Python dependencies
├── run_all.ps1                                # Full pipeline orchestrator (Windows)
└── README.md                                  # This file

🚀 Quick Start

1. Installation

# Clone repository
git clone <repository-url>
cd Biodel

# Create virtual environment (recommended)
python -m venv venv
.\venv\Scripts\Activate.ps1

# Install dependencies
pip install -r requirements.txt

Core Dependencies:

Python 3.8+
PyTorch 1.12+ (CUDA optional)
scikit-learn, pandas, numpy
gseapy (ssGSEA pathway enrichment)
umap-learn (latent visualizations)
matplotlib, seaborn (plotting)
pyarrow (Parquet I/O)

2. View Results (No Training Required)

Option A: Direct open

cd reports
start index.html

Option B: Local server (recommended)

cd reports
python -m http.server 8000
# Open http://localhost:8000 in your browser

Option C: VS Code Live Server

Install "Live Server" extension
Right-click reports/index.html → "Open with Live Server"

3. Reproduce Training (Optional)

Full pipeline in one command:

.\run_all.ps1

Or run steps individually:

Step 1: Train Elastic Net

python .\scripts\train_elastic_net.py `
  --in-dir .\data\processed_pathways_ssgsea_true `
  --out-dir .\outputs\elastic_net_pathways_ssgsea_en_broad `
  --mode elasticnet `
  --grid_C 0.05 0.1 0.25 0.5 1.0 2.0 `
  --grid_l1 0.05 0.1 0.2 0.3 0.5 0.7 `
  --selection_split val `
  --class_weight balanced `
  --max_iter 5000

Step 2: Train CVAE with sweep

python .\scripts\sweep_cvae.py `
  --pathway-scores .\data\processed_pathways_ssgsea_true\X.parquet `
  --labels .\data\processed_pathways_ssgsea_true\y.csv `
  --splits .\data\processed_pathways_ssgsea_true\splits.csv `
  --label-col pam50_subtype `
  --out-root .\outputs\cvae_sweeps `
  --latent-dim 16 `
  --epochs 80 `
  --batch-size 64 `
  --lr 0.001 `
  --betas 0.1 0.2 0.5 `
  --cls-weights 1.0 2.0 3.0

Step 3: Generate summary

python .\scripts\generate_summary.py `
  --en-dir .\outputs\elastic_net_pathways_ssgsea_en_broad `
  --cvae-sweep .\outputs\cvae_sweeps\sweep_summary.json `
  --out .\reports\summary.md

🔬 External Validation Workflow

Prerequisites

Expression matrix in samples × genes format (CSV or Parquet)
Gene symbols (HGNC): case-insensitive, duplicates auto-averaged
Optional: labels CSV with sample_id and label column

Supported Cohorts

GSE96058 - GEO microarray (GPL570), 3,409 samples
METABRIC - cBioPortal Illumina HT-12 v3, 1,980 samples

Step-by-Step Guide

1. Prepare expression data

If your data is genes × samples (cBioPortal format), convert first:

python .\scripts\convert_cbio_to_expr.py `
  --input "data\external\METABRIC\brca_metabric\data_mrna_illumina_microarray.txt" `
  --output "data\external\METABRIC\expr.csv"
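
As a rough illustration of what this conversion involves, the sketch below transposes a cBioPortal-style table and averages duplicated symbols with pandas. It is not the code in convert_cbio_to_expr.py; the function name and toy rows are hypothetical, and it assumes the usual cBioPortal `Hugo_Symbol` and `Entrez_Gene_Id` columns.

```python
import io
import pandas as pd

def cbio_to_samples_by_genes(path_or_buffer):
    """Turn a genes x samples cBioPortal table into samples x genes.

    Hypothetical helper: assumes tab-separated input with a Hugo_Symbol
    column and an optional Entrez_Gene_Id column.
    """
    df = pd.read_csv(path_or_buffer, sep="\t")
    df = df.drop(columns=["Entrez_Gene_Id"], errors="ignore")
    df = df.dropna(subset=["Hugo_Symbol"])
    # Match symbols case-insensitively by uppercasing them.
    df = df.assign(Hugo_Symbol=df["Hugo_Symbol"].astype(str).str.strip().str.upper())
    # Average rows that share a symbol, then transpose to samples x genes.
    out = df.groupby("Hugo_Symbol").mean(numeric_only=True).T
    out.index.name = "sample_id"
    out.columns.name = None
    return out

# Toy input with a duplicated symbol in different case.
toy = io.StringIO(
    "Hugo_Symbol\tEntrez_Gene_Id\tMB-0001\tMB-0002\n"
    "ESR1\t2099\t10.0\t6.0\n"
    "esr1\t2099\t12.0\t8.0\n"
    "ERBB2\t2064\t7.0\t13.0\n"
)
expr = cbio_to_samples_by_genes(toy)
print(expr)
```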

2. Run Elastic Net evaluation

python .\scripts\evaluate_external.py `
  --expr .\data\external\METABRIC\expr.csv `
  --expr-format csv `
  --model-dir .\outputs\elastic_net_pathways_ssgsea_en_broad `
  --out-dir .\data\external\METABRIC\en_ssgsea `
  --method ssgsea `
  --library .\data\external\MSigDB\h.all.v2025.1.Hs.symbols.gmt

3. Run CVAE evaluation

python .\scripts\evaluate_external_cvae.py `
  --expr .\data\external\METABRIC\expr.csv `
  --expr-format csv `
  --checkpoint .\outputs\cvae_sweeps\b0.2_cw1.0\best.pt `
  --out-dir .\data\external\METABRIC\cvae_meanz `
  --method meanz `
  --library .\data\external\MSigDB\h.all.v2025.1.Hs.symbols.gmt

Recommended pathway methods:

Elastic Net: --method ssgsea (ranked enrichment, interpretable)
CVAE: --method meanz (normalized, more stable across cohorts)

4. Export distributions comparison

python .\scripts\export_external_distributions.py `
  --outputs "data\external\METABRIC\en_ssgsea" `
            "data\external\METABRIC\en_meanz" `
            "data\external\METABRIC\cvae_ssgsea" `
            "data\external\METABRIC\cvae_meanz" `
  --out-csv "outputs\external_eval\METABRIC_distributions.csv"
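
The side-by-side comparison amounts to counting predicted subtypes per run. A minimal pandas sketch is shown below; the `distribution_table` helper, the run names as dictionary keys, and the toy predictions are assumptions for illustration rather than the exporter's actual implementation.

```python
import pandas as pd

SUBTYPES = ["Basal", "HER2", "LumA", "LumB", "Normal"]

def distribution_table(predictions_by_run):
    """Build a subtype x run table of predicted-label counts.

    Hypothetical helper: `predictions_by_run` maps a run name to a
    sequence of predicted subtype labels.
    """
    counts = {
        run: pd.Series(list(preds)).value_counts().reindex(SUBTYPES, fill_value=0)
        for run, preds in predictions_by_run.items()
    }
    table = pd.DataFrame(counts)
    table.index.name = "subtype"
    return table

# Toy predictions for two runs.
toy = {
    "en_ssgsea": ["LumA", "LumA", "Basal", "HER2"],
    "cvae_meanz": ["LumA", "LumB", "Basal", "Basal"],
}
table = distribution_table(toy)
fractions = table / table.sum(axis=0)  # per-run proportions
print(table)
```

Reindexing against a fixed subtype list keeps subtypes with zero predictions visible, which makes a run that never predicts a class easy to spot.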

5. Generate interactive HTML report

python .\scripts\build_external_report.py `
  --cohort METABRIC `
  --runs "data\external\METABRIC\en_ssgsea" `
         "data\external\METABRIC\en_meanz" `
         "data\external\METABRIC\cvae_ssgsea" `
         "data\external\METABRIC\cvae_meanz" `
  --outdir "reports\external\METABRIC"

Evaluation Outputs

Each run produces:

predictions.csv - Predicted PAM50 subtype per sample
probabilities.csv - 5-class probabilities (Basal, HER2, LumA, LumB, Normal)
debug_alignment.json - Feature alignment diagnostics
pathways_matrix_raw.csv - Raw pathway scores (for inspection)

CVAE runs additionally save:

embeddings.parquet - Latent space (z) embeddings for UMAP
logits.parquet - Raw classifier logits

HTML reports include:

Distribution panel (bar charts across runs)
Confidence KDE plots per run
Predictive entropy histograms
Top-5 most confident samples per subtype (CSV downloads)
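
The confidence and entropy quantities plotted in the reports can be derived directly from a probabilities table. The sketch below assumes confidence means the maximum class probability and entropy is Shannon entropy in nats; both definitions and the helper name are assumptions, since the README does not spell them out.

```python
import numpy as np

def confidence_and_entropy(proba, eps=1e-12):
    """Per-sample max probability and Shannon entropy (nats).

    Hypothetical helper: `proba` is an (n_samples, n_classes) array whose
    rows sum to 1.
    """
    proba = np.asarray(proba, dtype=float)
    confidence = proba.max(axis=1)
    # Clip to avoid log(0) for classes with zero probability.
    entropy = -(proba * np.log(np.clip(proba, eps, 1.0))).sum(axis=1)
    return confidence, entropy

proba = np.array([
    [0.96, 0.01, 0.01, 0.01, 0.01],  # confident prediction
    [0.20, 0.20, 0.20, 0.20, 0.20],  # maximally uncertain prediction
])
confidence, entropy = confidence_and_entropy(proba)
print(confidence, entropy)
```

With five classes the entropy ranges from 0 (all mass on one subtype) to ln(5) (uniform), so the second toy row sits at the upper bound.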

📊 View example reports →

📊 Model Details

Elastic Net Baseline

Architecture:

Input: 50 MSigDB Hallmark pathway enrichment scores (ssGSEA)
Model: Elastic Net logistic regression (L1 + L2 regularization)
Solver: SAGA (the scikit-learn LogisticRegression solver that supports the elastic-net penalty)
Class weights: Balanced (inverse frequency)
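
To make "balanced (inverse frequency)" concrete, the sketch below compares scikit-learn's `compute_class_weight` with the formula n_samples / (n_classes × class_count) on made-up labels. The label counts are toy values, not the TCGA distribution.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels with deliberately uneven class sizes.
y = np.array(["LumA"] * 6 + ["LumB"] * 3 + ["Basal"] * 2 + ["HER2"] * 1)
classes = np.unique(y)  # sorted: Basal, HER2, LumA, LumB

weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)

# The same weights computed by hand: n_samples / (n_classes * count per class).
counts = np.array([(y == c).sum() for c in classes])
manual = len(y) / (len(classes) * counts)

print(dict(zip(classes, weights)))
```

Rare classes receive proportionally larger weights, so each subtype contributes roughly equally to the training loss.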

Hyperparameters:

Regularization grid: C ∈ {0.05, 0.1, 0.25, 0.5, 1.0, 2.0}
L1 ratio grid: l1_ratio ∈ {0.05, 0.1, 0.2, 0.3, 0.5, 0.7}
Selection criterion: Validation set macro F1
Max iterations: 5000
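
The selection procedure described above could look roughly like the following. This is a sketch under assumptions, not the contents of train_elastic_net.py: the `select_elastic_net` function is hypothetical, the data are synthetic, and the demo uses a reduced grid and fewer iterations so it runs quickly.

```python
import warnings
from itertools import product

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

def select_elastic_net(X_train, y_train, X_val, y_val, grid_C, grid_l1, max_iter=5000):
    """Fit one model per (C, l1_ratio) pair and keep the best validation macro F1."""
    best = {"score": -1.0}
    for C, l1_ratio in product(grid_C, grid_l1):
        model = LogisticRegression(
            penalty="elasticnet", solver="saga", C=C, l1_ratio=l1_ratio,
            class_weight="balanced", max_iter=max_iter, random_state=0,
        )
        model.fit(X_train, y_train)
        score = f1_score(y_val, model.predict(X_val), average="macro")
        if score > best["score"]:
            best = {"score": score, "C": C, "l1_ratio": l1_ratio, "model": model}
    return best

# Synthetic stand-in for 50 pathway features and 5 subtypes.
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=10,
    n_classes=5, n_clusters_per_class=1, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

demo_C, demo_l1 = (0.05, 0.5, 2.0), (0.05, 0.5)  # reduced grid for the demo
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # hide convergence/deprecation warnings in the demo
    best = select_elastic_net(X_train, y_train, X_val, y_val, demo_C, demo_l1, max_iter=1000)
print(best["C"], best["l1_ratio"], round(best["score"], 3))
```

Selecting on a separate validation split, rather than the test split, keeps the reported test metrics free of selection bias.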

Outputs:

model.pkl - Trained scikit-learn model
coefficients.csv - Pathway importance per subtype (interpretable!)
metrics.json - Test F1, AUROC, balanced accuracy
confusion_matrix.csv - Per-class performance

Location: outputs/elastic_net_pathways_ssgsea_en_broad/

Conditional VAE (CVAE)

Architecture:

Encoder: Pathway features → latent embeddings (z), dim=16
Decoder: Reconstruct pathway features from z
Classifier head: z → 5-class PAM50 probabilities
Loss: Reconstruction + β·KL(q(z|x)||p(z)) + cls_weight·classification cross-entropy
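
The objective can be written out numerically as in the NumPy sketch below. Several details are assumptions rather than facts from the training script: the reconstruction term is taken to be squared error, the prior p(z) a standard normal, the posterior a diagonal Gaussian, and the cross-entropy is scaled by the swept classification weight.

```python
import numpy as np

def cvae_loss(x, x_hat, mu, logvar, logits, y, beta=0.2, cls_weight=1.0):
    """Reconstruction + beta * KL + cls_weight * cross-entropy (batch means).

    Hypothetical helper with assumed loss terms; see the lead-in.
    """
    # Squared-error reconstruction, summed over features.
    recon = ((x - x_hat) ** 2).sum(axis=1).mean()
    # Closed-form KL between N(mu, exp(logvar)) and N(0, I).
    kl = (-0.5 * (1.0 + logvar - mu ** 2 - np.exp(logvar)).sum(axis=1)).mean()
    # Numerically stable log-softmax, then cross-entropy at the true class.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    ce = -log_probs[np.arange(len(y)), y].mean()
    total = recon + beta * kl + cls_weight * ce
    return total, {"recon": recon, "kl": kl, "ce": ce}

# Degenerate toy batch: perfect reconstruction, posterior equal to the prior,
# and uniform class logits, so only the cross-entropy term is non-zero.
x = np.zeros((4, 50))
mu = np.zeros((4, 16))
logvar = np.zeros((4, 16))
logits = np.zeros((4, 5))
y = np.array([0, 1, 2, 3])
total, parts = cvae_loss(x, x.copy(), mu, logvar, logits, y)
print(total, parts)
```

In this toy batch the total equals ln(5), the cross-entropy of a uniform guess over five subtypes. Raising β pulls the latent posterior towards the prior, while raising the classification weight favours subtype separation over reconstruction.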

Hyperparameter sweep:

Beta (KL weight): {0.1, 0.2, 0.5}
Classification weight: {1.0, 2.0, 3.0}
Early stopping: Validation loss patience=10 epochs
Best config: β=0.2, cls_weight=1.0 (selected by validation F1)

Training details:

Optimizer: Adam (lr=0.001)
Batch size: 64
Epochs: 80 (with early stopping)
Device: CUDA if available, else CPU

Outputs:

best.pt - PyTorch checkpoint (state_dict)
metrics.json - Test F1, AUROC, balanced accuracy, loss curves
embeddings.parquet - Latent space (z) for UMAP
calibration/ - Temperature-scaled probabilities (optional)

Location: outputs/cvae_sweeps/b0.2_cw1.0/
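
Temperature scaling, mentioned above as an optional calibration step, fits a single scalar T that rescales the logits before the softmax. The sketch below shows the general technique with NumPy and SciPy on synthetic, deliberately overconfident logits; it is not the code in calibrate_cvae.py, and the search bounds are arbitrary assumptions.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(logits):
    z = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, y, T):
    """Mean negative log-likelihood of labels y under softmax(logits / T)."""
    p = softmax(logits / T)
    return -np.log(p[np.arange(len(y)), y] + 1e-12).mean()

def fit_temperature(logits, y, bounds=(0.05, 20.0)):
    """Find the temperature that minimizes validation NLL (bounded 1-D search)."""
    res = minimize_scalar(lambda T: nll(logits, y, T), bounds=bounds, method="bounded")
    return float(res.x)

# Synthetic validation logits, inflated 5x to mimic an overconfident classifier.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 5, size=500)
logits_val = rng.normal(size=(500, 5))
logits_val[np.arange(500), y_val] += 1.0
logits_val *= 5.0

T = fit_temperature(logits_val, y_val)
print(T, nll(logits_val, y_val, 1.0), nll(logits_val, y_val, T))
```

Because dividing by T does not change the argmax, calibration of this kind leaves predicted subtypes untouched and only adjusts the reported probabilities. It requires labeled data, so it applies to external cohorts only when labels are available.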

🧬 Data & Features

Training Data (TCGA BRCA)

Samples: 1,222 (after quality filtering)
Features: 50 MSigDB Hallmark pathways
Labels: PAM50 subtypes
- Basal-like
- HER2-enriched
- Luminal A
- Luminal B
- Normal-like
Split: 70% train, 15% validation, 15% test (stratified by subtype)
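
A 70/15/15 stratified split can be produced with two calls to scikit-learn's `train_test_split`, as sketched below. The helper name, the seed, and the toy label counts are assumptions; the project's actual split is stored in splits.csv.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def stratified_70_15_15(sample_ids, labels, seed=0):
    """Split ids into 70% train, 15% validation, 15% test, stratified by label."""
    train_ids, rest_ids, _, rest_y = train_test_split(
        sample_ids, labels, test_size=0.30, stratify=labels, random_state=seed
    )
    # Halve the remaining 30% into validation and test, again stratified.
    val_ids, test_ids = train_test_split(
        rest_ids, test_size=0.50, stratify=rest_y, random_state=seed
    )
    return train_ids, val_ids, test_ids

# Toy labels with uneven class sizes (200 samples in total).
labels = np.repeat(["Basal", "HER2", "LumA", "LumB", "Normal"], [40, 20, 100, 30, 10])
sample_ids = np.arange(len(labels))
train_ids, val_ids, test_ids = stratified_70_15_15(sample_ids, labels)
print(len(train_ids), len(val_ids), len(test_ids))
```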

Pathway Enrichment Methods

ssGSEA (single-sample Gene Set Enrichment Analysis):

Rank-based enrichment score computed independently for each sample
Used for Elastic Net (primary) and CVAE (optional)
Tool: gseapy.ssgsea()
Best for interpretability and within-platform consistency

Mean-z normalization:

Gene-level z-scores averaged across pathway members
More stable for cross-platform/cross-cohort inference
Recommended for CVAE external validation
Fallback when ssGSEA shows platform sensitivity
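
The mean-z idea can be sketched in a few lines of pandas. The version below z-scores each gene across the samples of one cohort and averages the z-scores of each pathway's measured members. The function name, the within-cohort standardization, and the toy gene sets are assumptions for illustration, not the evaluator's implementation.

```python
import numpy as np
import pandas as pd

def mean_z_scores(expr, gene_sets):
    """Pathway score = mean of gene-level z-scores over pathway members.

    Hypothetical helper: `expr` is samples x genes with unique gene symbols;
    `gene_sets` maps a pathway name to a list of gene symbols.
    """
    expr = expr.copy()
    expr.columns = expr.columns.str.upper()
    # Zero-variance genes get NaN z-scores and are skipped by the row mean.
    std = expr.std(axis=0, ddof=0).replace(0, np.nan)
    z = (expr - expr.mean(axis=0)) / std
    scores = {}
    for name, genes in gene_sets.items():
        members = [g for g in genes if g in z.columns]  # ignore unmeasured genes
        if members:
            scores[name] = z[members].mean(axis=1)
        else:
            scores[name] = pd.Series(np.nan, index=z.index)
    return pd.DataFrame(scores)

# Toy expression matrix and toy gene sets (not real Hallmark memberships).
expr = pd.DataFrame(
    {"ESR1": [1, 2, 3], "GATA3": [2, 4, 6], "MKI67": [3, 2, 1]},
    index=["s1", "s2", "s3"],
)
gene_sets = {"TOY_SET_A": ["ESR1", "GATA3", "NOT_MEASURED"], "TOY_SET_B": ["MKI67"]}
scores = mean_z_scores(expr, gene_sets)
print(scores)
```

Because every gene is standardized within the cohort before averaging, platform-specific offsets and scale differences largely cancel, which is one plausible reason this method transfers better across cohorts.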

Feature Alignment (External Validation)

Pathway names normalized: uppercase, strip whitespace
Alias matching handles variants:
- TNF-alpha ↔ TNFA
- Peroxisome ↔ Pperoxisome
Debug outputs: debug_alignment.json with match counts and variance checks
Result: 50/50 Hallmark pathways matched across all cohorts
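
A minimal sketch of this kind of name normalization and alias matching follows. The alias table, the hyphen handling, and the function names are hypothetical; only the TNF-alpha/TNFA pair is taken from the list above.

```python
import re

# Hypothetical alias table; only the TNF-alpha <-> TNFA pair comes from the README.
ALIASES = {"TNF_ALPHA": "TNFA"}

def canonical(name):
    """Uppercase, strip whitespace, unify separators, then apply aliases."""
    key = re.sub(r"[\s\-]+", "_", name.strip().upper())
    for variant, target in ALIASES.items():
        key = key.replace(variant, target)
    return key

def align(model_features, cohort_features):
    """Map each model feature to the matching cohort column, if any."""
    lookup = {canonical(c): c for c in cohort_features}
    matched = {m: lookup[canonical(m)] for m in model_features if canonical(m) in lookup}
    missing = [m for m in model_features if m not in matched]
    return matched, missing

model_features = [
    "HALLMARK_TNFA_SIGNALING_VIA_NFKB",
    "HALLMARK_PEROXISOME",
    "HALLMARK_HYPOXIA",
]
cohort_features = [" hallmark_tnf-alpha_signaling_via_nfkb ", "HALLMARK_PEROXISOME"]
matched, missing = align(model_features, cohort_features)
print(len(matched), "matched;", "missing:", missing)
```

Reporting the unmatched names alongside the match count is what makes a low match count diagnosable.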

🛠️ Scripts & Utilities

Core Training

| Script | Purpose |
|---|---|
| `train_elastic_net.py` | Elastic Net trainer with grid search |
| `sweep_cvae.py` | CVAE hyperparameter sweep with early stopping |
| `calibrate_cvae.py` | Temperature scaling for probability calibration |
| `plot_vae_embeddings.py` | UMAP visualization of latent space |

External Evaluation

| Script | Purpose |
|---|---|
| `evaluate_external.py` | Elastic Net external inference |
| `evaluate_external_cvae.py` | CVAE external inference |
| `convert_cbio_to_expr.py` | Convert cBioPortal genes×samples → samples×genes |
| `export_external_distributions.py` | Generate side-by-side count tables (CSV) |
| `build_external_report.py` | Generate interactive HTML reports |
| `plot_external_results.py` | Advanced plotting (KDE, entropy, UMAP) |

Orchestration

| Script | Purpose |
|---|---|
| `run_all.ps1` | Full pipeline: train, sweep, calibrate, plot, summarize |

📈 Figures & Visualizations

All figures available in:

Interactive website - Main showcase
Figures gallery - Organized by cohort
External reports - Per-cohort HTML pages

Available Visualizations

TCGA Test Set:

Confusion matrices (Elastic Net, CVAE)
UMAP of latent space colored by PAM50 subtype
Pathway coefficient heatmap (Elastic Net)

External Validation (per cohort):

Distribution panels (bar charts across the evaluated runs: 4 for METABRIC, 3 for GSE96058)
Confidence KDE plots (per run)
Predictive entropy histograms (per run)
UMAP entropy maps (when embeddings available)
Top-5 most confident samples per subtype (CSV)

Examples:

reports/external/METABRIC/METABRIC_distributions_panel.png
reports/external/GSE96058/GSE96058_cvae_meanz_confidence_kde.png
reports/external/METABRIC/METABRIC_en_ssgsea_entropy_hist.png

🔧 Troubleshooting

Common Issues

1. Parquet I/O errors

# Install pyarrow if missing
pip install pyarrow

2. gseapy network errors (ssGSEA)

Solution 1: Use local Hallmark GMT

--library data\external\MSigDB\h.all.v2025.1.Hs.symbols.gmt

Solution 2: Fallback to mean-z
```
--method meanz
```

3. Unicode rendering issues

Ensure editor is set to UTF-8 encoding
Issue affects reports/summary.md

4. CVAE checkpoint loading fails

Check metrics.json for correct latent_dim
Evaluator auto-infers from checkpoint if metrics missing
Verify PyTorch version compatibility

5. Feature alignment shows low match count

Check debug_alignment.json for diagnostics
Ensure gene symbols are uppercased
Duplicates are automatically averaged

Windows-Specific Notes

✅ Use PowerShell (not CMD)
✅ Paths: Use backslashes \ or forward slashes /
✅ Line continuations: Use backtick ` at end of line
✅ Scripts tested on Windows 10/11

📚 References & Citations

Methods

PAM50: Parker et al., 2009 - "Supervised risk predictor of breast cancer based on intrinsic subtypes"
Elastic Net: Zou & Hastie, 2005 - "Regularization and variable selection via the elastic net"
VAE: Kingma & Welling, 2014 - "Auto-Encoding Variational Bayes"
Hallmark: Liberzon et al., 2015 - "The Molecular Signatures Database (MSigDB) hallmark gene set collection"
ssGSEA: Barbie et al., 2009 - "Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1"

Data Sources

TCGA-BRCA: The Cancer Genome Atlas Program (National Cancer Institute)
GSE96058: Gene Expression Omnibus (GEO/NCBI)
METABRIC: cBioPortal for Cancer Genomics (MSKCC/Dana-Farber)
MSigDB: Molecular Signatures Database (Broad Institute)

📋 System Requirements

Python: 3.8 or higher

Key Dependencies:

torch>=1.12.0
scikit-learn>=1.0.0
pandas>=1.3.0
numpy>=1.21.0
gseapy>=1.0.0
umap-learn>=0.5.0
matplotlib>=3.5.0
seaborn>=0.11.0
pyarrow>=6.0.0

Optional:

CUDA toolkit (for GPU acceleration)
Jupyter (for notebook exploration)

See requirements.txt for complete dependency list.

Tested on:

Windows 10/11 (PowerShell)
Python 3.8, 3.9, 3.10
CPU and CUDA environments

🤝 Contributing

Contributions are welcome! To contribute:

Fork the repository
Create a feature branch (git checkout -b feature/amazing-feature)
Commit your changes (git commit -m 'Add amazing feature')
Push to the branch (git push origin feature/amazing-feature)
Open a Pull Request

Guidelines:

Add tests for new functionality
Update documentation
Follow existing code style
Include clear commit messages

📄 License

This project is licensed under the MIT License. See LICENSE file for details.

🙏 Acknowledgments

TCGA Research Network for BRCA genomic data
MSigDB team (Broad Institute) for Hallmark gene sets
cBioPortal and GEO for external cohort access
PyTorch, scikit-learn, and gseapy communities
All contributors and users of this project

📞 Contact & Support

Issues: GitHub Issues
Documentation: Project Website
External Validation: Detailed Results
Figures: Gallery

🚀 Project Status

Current Version: 1.0.0
Status: ✅ Complete

Training pipeline (Elastic Net + CVAE)
External validation (GSE96058, METABRIC)
Interactive HTML reports
Distribution comparison tables
Advanced visualizations
Comprehensive documentation

Next Steps (Optional):

Add calibration plots when labels available
Implement survival analysis
Add more external cohorts
Create Docker container
Add Jupyter notebook tutorials

Built with ❤️ for reproducible cancer research
