Last updated: October 11, 2025
Open the complete project showcase (View in browser)
Explore:
- Comprehensive project overview and methodology
- Model architectures, hyperparameters, and test performance
- External validation on GSE96058 (n=3,409) and METABRIC (n=1,980)
- Interactive figures gallery with distributions, confidence plots, entropy analysis
- All downloadable artifacts: models, predictions, distribution CSVs
This project implements and compares interpretable and deep learning approaches for PAM50 breast cancer subtype classification from RNA-Seq expression data, with rigorous external validation across independent cohorts.
- Interpretable baseline: Elastic Net on MSigDB Hallmark pathway features (F1=0.884)
- Deep latent model: Conditional Variational Autoencoder with classifier head (F1=0.829)
- Pathway-based features: ssGSEA and mean-z normalized enrichment scores
- External validation: Tested on GSE96058 (microarray) and METABRIC (Illumina)
- Robust alignment: Alias-based feature matching across platforms
- Interactive reports: Auto-generated HTML with figures and distribution tables
| Model | Macro F1 | AUROC (OvR) | Balanced Accuracy |
|---|---|---|---|
| Elastic Net (ssGSEA) | 0.884 | 0.985 | 0.898 |
| Best CVAE (β=0.2, cw=1.0) | 0.829 | 0.975 | 0.813 |
Winner: Elastic Net shows superior test metrics while maintaining full interpretability through pathway coefficients.
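For readers unfamiliar with the metrics in the table above, here is a minimal, dependency-free sketch of how macro F1 and balanced accuracy are defined. It is not the project's evaluation code (which presumably uses scikit-learn); the function names and the toy labels are illustrative assumptions.

```python
def per_class_counts(y_true, y_pred, labels):
    """Return {label: (tp, fp, fn)} for each class."""
    counts = {}
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        counts[label] = (tp, fp, fn)
    return counts


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 (a class with no tp/fp/fn scores 0)."""
    scores = []
    for tp, fp, fn in per_class_counts(y_true, y_pred, labels).values():
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def balanced_accuracy(y_true, y_pred, labels):
    """Unweighted mean of per-class recall over classes present in y_true."""
    recalls = []
    for tp, _fp, fn in per_class_counts(y_true, y_pred, labels).values():
        if tp + fn:
            recalls.append(tp / (tp + fn))
    return sum(recalls) / len(recalls)


# Toy labels for illustration only (not project data).
y_true = ["LumA", "LumA", "LumA", "Basal", "Basal", "HER2"]
y_pred = ["LumA", "LumA", "Basal", "Basal", "Basal", "HER2"]
labels = ["Basal", "HER2", "LumA"]
print(round(macro_f1(y_true, y_pred, labels), 3))
print(round(balanced_accuracy(y_true, y_pred, labels), 3))
```

Both metrics weight every subtype equally, which is why they are preferred over plain accuracy when subtypes such as Luminal A dominate the cohort.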
| Cohort | Samples | Platform | Models Evaluated | Status |
|---|---|---|---|---|
| GSE96058 | 3,409 | GEO microarray (GPL570) | EN (ssGSEA), CVAE (ssGSEA, mean-z) | Complete |
| METABRIC | 1,980 | Illumina HT-12 v3 | EN (ssGSEA, mean-z), CVAE (ssGSEA, mean-z) | Complete |
Key findings:
- 50/50 Hallmark pathways successfully aligned across all cohorts
- Mean-z normalization more stable than ssGSEA for CVAE cross-cohort inference
- Predicted subtype distributions vary by platform and normalization method
- Full diagnostic outputs available: alignment logs, pathway matrices, confidence metrics
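Because the external cohorts are evaluated by comparing predicted subtype distributions across runs, a side-by-side count table is the basic diagnostic. The sketch below shows one way to build it in plain Python; the function name, row layout, and toy predictions are assumptions, and `export_external_distributions.py` may differ.

```python
from collections import Counter

SUBTYPES = ["Basal", "HER2", "LumA", "LumB", "Normal"]


def distribution_table(runs):
    """runs: {run_name: [predicted subtype per sample]}.

    Returns one row per subtype with a count and percentage per run.
    """
    rows = []
    for subtype in SUBTYPES:
        row = {"subtype": subtype}
        for name, preds in runs.items():
            count = Counter(preds)[subtype]
            row[f"{name}_n"] = count
            row[f"{name}_pct"] = round(100.0 * count / len(preds), 1) if preds else 0.0
        rows.append(row)
    return rows


# Toy predictions for illustration only (not project results).
runs = {
    "en_ssgsea": ["LumA", "LumA", "Basal", "HER2"],
    "cvae_meanz": ["LumA", "LumB", "Basal", "Basal"],
}
table = distribution_table(runs)
for row in table:
    print(row)
```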
View detailed external validation results
Biodel/
├── data/
│   ├── processed_pathways_ssgsea_true/        # TCGA training data (pathway features)
│   └── external/                              # External validation cohorts
│       ├── GSE96058/                          # GEO microarray (GPL570)
│       ├── METABRIC/                          # cBioPortal Illumina
│       └── MSigDB/                            # Hallmark gene sets (local GMT)
│
├── outputs/
│   ├── elastic_net_pathways_ssgsea_en_broad/  # Trained Elastic Net model
│   ├── cvae_sweeps/                           # CVAE hyperparameter sweep results
│   │   └── b0.2_cw1.0/                        # Best CVAE checkpoint
│   └── external_eval/                         # External distribution CSVs
│
├── scripts/
│   ├── train_elastic_net.py                   # Elastic Net trainer with grid search
│   ├── sweep_cvae.py                          # CVAE hyperparameter sweep
│   ├── evaluate_external.py                   # Elastic Net external evaluator
│   ├── evaluate_external_cvae.py              # CVAE external evaluator
│   ├── build_external_report.py               # HTML report generator
│   ├── export_external_distributions.py       # Distribution CSV exporter
│   ├── convert_cbio_to_expr.py                # cBioPortal format converter
│   ├── plot_external_results.py               # Advanced plotting utilities
│   ├── calibrate_cvae.py                      # Temperature scaling for probabilities
│   └── plot_vae_embeddings.py                 # UMAP latent space visualization
│
├── reports/
│   ├── index.html                             # Main website homepage
│   ├── overview.html                          # Project overview & methodology
│   ├── models.html                            # Model details & metrics
│   ├── external_validation.html               # External validation results
│   ├── figures.html                           # Figures gallery
│   ├── style.css                              # Website styling
│   ├── external/                              # Interactive cohort reports
│   │   ├── METABRIC/index.html                # METABRIC report with figures/tables
│   │   └── GSE96058/index.html                # GSE96058 report with figures/tables
│   └── summary.md                             # Detailed text summary
│
├── requirements.txt                           # Python dependencies
├── run_all.ps1                                # Full pipeline orchestrator (Windows)
└── README.md                                  # This file
# Clone repository
git clone <repository-url>
cd Biodel
# Create virtual environment (recommended)
python -m venv venv
.\venv\Scripts\Activate.ps1
# Install dependencies
pip install -r requirements.txt

Core Dependencies:
- Python 3.8+
- PyTorch 1.12+ (CUDA optional)
- scikit-learn, pandas, numpy
- gseapy (ssGSEA pathway enrichment)
- umap-learn (latent visualizations)
- matplotlib, seaborn (plotting)
- pyarrow (Parquet I/O)
Option A: Direct open
cd reports
start index.html

Option B: Local server (recommended)
cd reports
python -m http.server 8000
# Open http://localhost:8000 in your browser

Option C: VS Code Live Server
- Install "Live Server" extension
- Right-click `reports/index.html` → "Open with Live Server"
Full pipeline in one command:
.\run_all.ps1

Or run steps individually:
Step 1: Train Elastic Net
python .\scripts\train_elastic_net.py `
--in-dir .\data\processed_pathways_ssgsea_true `
--out-dir .\outputs\elastic_net_pathways_ssgsea_en_broad `
--mode elasticnet `
--grid_C 0.05 0.1 0.25 0.5 1.0 2.0 `
--grid_l1 0.05 0.1 0.2 0.3 0.5 0.7 `
--selection_split val `
--class_weight balanced `
--max_iter 5000

Step 2: Train CVAE with sweep
python .\scripts\sweep_cvae.py `
--pathway-scores .\data\processed_pathways_ssgsea_true\X.parquet `
--labels .\data\processed_pathways_ssgsea_true\y.csv `
--splits .\data\processed_pathways_ssgsea_true\splits.csv `
--label-col pam50_subtype `
--out-root .\outputs\cvae_sweeps `
--latent-dim 16 `
--epochs 80 `
--batch-size 64 `
--lr 0.001 `
--betas 0.1 0.2 0.5 `
--cls-weights 1.0 2.0 3.0

Step 3: Generate summary
python .\scripts\generate_summary.py `
--en-dir .\outputs\elastic_net_pathways_ssgsea_en_broad `
--cvae-sweep .\outputs\cvae_sweeps\sweep_summary.json `
--out .\reports\summary.md

- Expression matrix in samples × genes format (CSV or Parquet)
- Gene symbols (HGNC): case-insensitive, duplicates auto-averaged
- Optional: labels CSV with `sample_id` and a label column
- GSE96058 - GEO microarray (GPL570), 3,409 samples
- METABRIC - cBioPortal Illumina HT-12 v3, 1,980 samples
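The input conventions above (case-insensitive gene symbols, duplicates averaged) can be illustrated with a small sketch. This is an assumed reading of that behavior for a single sample, written in plain Python; the function name is hypothetical and the evaluators themselves most likely operate on pandas data frames.

```python
from collections import defaultdict


def collapse_duplicate_genes(pairs):
    """pairs: iterable of (gene_symbol, expression) for one sample.

    Symbols are stripped and uppercased so matching is case-insensitive,
    and values sharing a symbol are averaged.
    """
    buckets = defaultdict(list)
    for symbol, value in pairs:
        buckets[symbol.strip().upper()].append(float(value))
    return {symbol: sum(vals) / len(vals) for symbol, vals in buckets.items()}


# Toy values for illustration only.
sample = [("Tp53", 2.0), ("TP53", 4.0), ("esr1 ", 1.0)]
print(collapse_duplicate_genes(sample))
```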
1. Prepare expression data
If your data is genes × samples (cBioPortal format), convert first:
python .\scripts\convert_cbio_to_expr.py `
--input "data\external\METABRIC\brca_metabric\data_mrna_illumina_microarray.txt" `
--output "data\external\METABRIC\expr.csv"

2. Run Elastic Net evaluation
python .\scripts\evaluate_external.py `
--expr .\data\external\METABRIC\expr.csv `
--expr-format csv `
--model-dir .\outputs\elastic_net_pathways_ssgsea_en_broad `
--out-dir .\data\external\METABRIC\en_ssgsea `
--method ssgsea `
--library .\data\external\MSigDB\h.all.v2025.1.Hs.symbols.gmt

3. Run CVAE evaluation
python .\scripts\evaluate_external_cvae.py `
--expr .\data\external\METABRIC\expr.csv `
--expr-format csv `
--checkpoint .\outputs\cvae_sweeps\b0.2_cw1.0\best.pt `
--out-dir .\data\external\METABRIC\cvae_meanz `
--method meanz `
--library .\data\external\MSigDB\h.all.v2025.1.Hs.symbols.gmt

Recommended pathway methods:
- Elastic Net: `--method ssgsea` (ranked enrichment, interpretable)
- CVAE: `--method meanz` (normalized, more stable across cohorts)
4. Export distributions comparison
python .\scripts\export_external_distributions.py `
--outputs "data\external\METABRIC\en_ssgsea" `
"data\external\METABRIC\en_meanz" `
"data\external\METABRIC\cvae_ssgsea" `
"data\external\METABRIC\cvae_meanz" `
--out-csv "outputs\external_eval\METABRIC_distributions.csv"

5. Generate interactive HTML report
python .\scripts\build_external_report.py `
--cohort METABRIC `
--runs "data\external\METABRIC\en_ssgsea" `
"data\external\METABRIC\en_meanz" `
"data\external\METABRIC\cvae_ssgsea" `
"data\external\METABRIC\cvae_meanz" `
--outdir "reports\external\METABRIC"

Each run produces:
- `predictions.csv` - Predicted PAM50 subtype per sample
- `probabilities.csv` - 5-class probabilities (Basal, HER2, LumA, LumB, Normal)
- `debug_alignment.json` - Feature alignment diagnostics
- `pathways_matrix_raw.csv` - Raw pathway scores (for inspection)

CVAE runs additionally save:
- `embeddings.parquet` - Latent space (z) embeddings for UMAP
- `logits.parquet` - Raw classifier logits
HTML reports include:
- Distribution panel (bar charts across runs)
- Confidence KDE plots per run
- Predictive entropy histograms
- Top-5 most confident samples per subtype (CSV downloads)
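The confidence and entropy plots listed above summarize each row of `probabilities.csv`. As a rough sketch, assuming confidence means the maximum class probability and entropy means Shannon entropy of the 5-class vector (the report code may define them differently):

```python
import math


def confidence(probs):
    """Confidence of a prediction: the largest class probability."""
    return max(probs)


def predictive_entropy(probs, normalize=False):
    """Shannon entropy (in nats) of a probability vector.

    With normalize=True the value is divided by log(K), so 0 means a
    one-hot prediction and 1 means a uniform one.
    """
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    if normalize:
        h /= math.log(len(probs))
    return h


# Toy probability rows for illustration only.
sure = [1.0, 0.0, 0.0, 0.0, 0.0]
unsure = [0.2, 0.2, 0.2, 0.2, 0.2]
print(confidence(sure), predictive_entropy(sure))
print(confidence(unsure), predictive_entropy(unsure, normalize=True))
```

Without ground-truth labels on an external cohort, these two quantities are the main signal of how far the model is extrapolating.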
Architecture:
- Input: 50 MSigDB Hallmark pathway enrichment scores (ssGSEA)
- Model: Elastic Net logistic regression (L1 + L2 regularization)
- Solver: SAGA (efficient for L1 penalty)
- Class weights: Balanced (inverse frequency)
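To make "balanced (inverse frequency)" concrete: scikit-learn's `class_weight="balanced"` assigns each class the weight `n_samples / (n_classes * class_count)`. The sketch below reproduces that formula in plain Python; the function name and toy label counts are illustrative only.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Toy label counts for illustration only (not the TCGA distribution).
toy_labels = ["LumA"] * 6 + ["Basal"] * 3 + ["HER2"] * 1
weights = balanced_class_weights(toy_labels)
print(weights)
```

Rare subtypes receive proportionally larger weights, so errors on them cost more during training.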
Hyperparameters:
- Regularization grid: `C` ∈ {0.05, 0.1, 0.25, 0.5, 1.0, 2.0}
- L1 ratio grid: `l1_ratio` ∈ {0.05, 0.1, 0.2, 0.3, 0.5, 0.7}
- Selection criterion: Validation set macro F1
- Max iterations: 5000
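The two grids above give 6 × 6 = 36 candidate models, and the one with the highest validation macro F1 is kept. The following sketch shows only that selection loop; `select_best` and `toy_score` are hypothetical names, and `toy_score` is a stand-in for "fit the model and score it on the validation split".

```python
from itertools import product

GRID_C = [0.05, 0.1, 0.25, 0.5, 1.0, 2.0]
GRID_L1 = [0.05, 0.1, 0.2, 0.3, 0.5, 0.7]


def select_best(score_fn):
    """Score every (C, l1_ratio) pair with score_fn and return the best one."""
    candidates = list(product(GRID_C, GRID_L1))
    scored = [(score_fn(c, l1), c, l1) for c, l1 in candidates]
    best_score, best_c, best_l1 = max(scored, key=lambda item: item[0])
    return {
        "C": best_c,
        "l1_ratio": best_l1,
        "val_macro_f1": best_score,
        "n_candidates": len(candidates),
    }


def toy_score(c, l1_ratio):
    """Stand-in scorer (not a real metric): peaks at C=0.5, l1_ratio=0.2."""
    return 1.0 - abs(c - 0.5) - abs(l1_ratio - 0.2)


best = select_best(toy_score)
print(best)
```

In a real run, `score_fn` would fit something like scikit-learn's `LogisticRegression(penalty="elasticnet", solver="saga", C=..., l1_ratio=..., class_weight="balanced", max_iter=5000)` on the training split and return macro F1 on the validation split; that wiring is an assumption about `train_elastic_net.py`, not a description of it.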
Outputs:
- `model.pkl` - Trained scikit-learn model
- `coefficients.csv` - Pathway importance per subtype (interpretable!)
- `metrics.json` - Test F1, AUROC, balanced accuracy
- `confusion_matrix.csv` - Per-class performance
Location: outputs/elastic_net_pathways_ssgsea_en_broad/
Architecture:
- Encoder: Pathway features → latent embeddings (z), dim=16
- Decoder: Reconstruct pathway features from z
- Classifier head: z → 5-class PAM50 probabilities
- Loss: Reconstruction + β·KL(q(z|x) || p(z)) + classification cross-entropy
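The loss above can be written out for a single sample as a sketch. Several details here are assumptions rather than facts about the project: reconstruction is taken to be mean squared error, the prior p(z) is a standard normal, the encoder outputs a mean and log-variance, and the swept classification weight multiplies the cross-entropy term. The real model is trained in PyTorch; this version uses plain Python so the arithmetic is visible.

```python
import math


def kl_diag_gaussian(mu, logvar):
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), summed over latent dims."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv for m, lv in zip(mu, logvar))


def mse(x, x_hat):
    """Mean squared reconstruction error (assumed reconstruction term)."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)


def cross_entropy(logits, target_index):
    """Softmax cross-entropy for one sample, using the log-sum-exp trick."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_norm - logits[target_index]


def cvae_loss(x, x_hat, mu, logvar, logits, target_index, beta=0.2, cls_weight=1.0):
    """Reconstruction + beta * KL + cls_weight * cross-entropy."""
    return (
        mse(x, x_hat)
        + beta * kl_diag_gaussian(mu, logvar)
        + cls_weight * cross_entropy(logits, target_index)
    )


# Toy check: perfect reconstruction, posterior equal to the prior, uniform
# logits over 5 classes. Only the cross-entropy term, log(5), remains.
print(cvae_loss([1.0, 2.0], [1.0, 2.0], [0.0, 0.0], [0.0, 0.0], [0.0] * 5, 2))
```

A larger β pulls the latent posterior toward the prior at the expense of reconstruction, while a larger classification weight prioritizes subtype separation in z.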
Hyperparameter sweep:
- Beta (KL weight): {0.1, 0.2, 0.5}
- Classification weight: {1.0, 2.0, 3.0}
- Early stopping: validation loss, patience=10 epochs
- Best config: β=0.2, cls_weight=1.0 (selected by validation F1)
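The sweep covers 3 × 3 = 9 configurations. The sketch below enumerates them and picks the best by validation F1. The `b{beta}_cw{cls_weight}` directory naming is inferred from the `b0.2_cw1.0` folder and is an assumption, as are the function names; the scores in the example are placeholders, not project results.

```python
from itertools import product

BETAS = [0.1, 0.2, 0.5]
CLS_WEIGHTS = [1.0, 2.0, 3.0]


def run_name(beta, cls_weight):
    """Directory name for one sweep configuration (assumed convention)."""
    return f"b{beta}_cw{cls_weight}"


def pick_best(val_f1_by_run):
    """Return the run name with the highest validation F1."""
    return max(val_f1_by_run, key=val_f1_by_run.get)


run_names = [run_name(b, cw) for b, cw in product(BETAS, CLS_WEIGHTS)]
print(run_names)

# Placeholder scores for illustration only.
placeholder_scores = {"b0.1_cw1.0": 0.1, "b0.2_cw1.0": 0.3, "b0.5_cw1.0": 0.2}
print(pick_best(placeholder_scores))
```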
Training details:
- Optimizer: Adam (lr=0.001)
- Batch size: 64
- Epochs: 80 (with early stopping)
- Device: CUDA if available, else CPU
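Early stopping with patience 10 means training halts once the validation loss has failed to improve for 10 consecutive epochs, up to the 80-epoch cap. A minimal sketch of that rule follows; the class and helper names are hypothetical, and the actual logic in `sweep_cvae.py` may differ (for example, in how ties or minimum improvements are handled).

```python
class EarlyStopping:
    """Stop when validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=10, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, patience=10, max_epochs=80):
    """Number of epochs executed for a given validation-loss curve."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if stopper.step(loss):
            return epoch
    return min(len(val_losses), max_epochs)


# Toy curve: improves for 3 epochs, then plateaus above the best value.
toy_curve = [1.0, 0.9, 0.8] + [0.85] * 20
print(epochs_run(toy_curve))
```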
Outputs:
- `best.pt` - PyTorch checkpoint (state_dict)
- `metrics.json` - Test F1, AUROC, balanced accuracy, loss curves
- `embeddings.parquet` - Latent space (z) for UMAP
- `calibration/` - Temperature-scaled probabilities (optional)
Location: outputs/cvae_sweeps/b0.2_cw1.0/
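Temperature scaling, used for the optional `calibration/` output, divides the classifier logits by a single scalar T fitted on held-out data so that the softmax probabilities are less over- or under-confident. The sketch below fits T by a simple grid search over validation negative log-likelihood; this is a simplification (gradient-based fitting is more common), and the function names and toy logits are assumptions rather than the contents of `calibrate_cvae.py`.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature, computed stably."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def nll(logit_rows, targets, temperature):
    """Mean negative log-likelihood of the true classes at a temperature."""
    total = 0.0
    for logits, target in zip(logit_rows, targets):
        total -= math.log(softmax(logits, temperature)[target])
    return total / len(targets)


def fit_temperature(logit_rows, targets, candidates=None):
    """Grid-search the temperature that minimizes validation NLL."""
    if candidates is None:
        candidates = [0.5 + 0.05 * i for i in range(91)]  # 0.5 .. 5.0
    return min(candidates, key=lambda t: nll(logit_rows, targets, t))


# Toy validation logits: three confident correct predictions and one
# confident mistake, so the fitted temperature should exceed 1.
val_logits = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
val_targets = [0, 0, 0, 0]
T = fit_temperature(val_logits, val_targets)
print(T, softmax(val_logits[0], T))
```

Scaling by T does not change the predicted class, only how peaked the probabilities are.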
- Samples: 1,222 (after quality filtering)
- Features: 50 MSigDB Hallmark pathways
- Labels: PAM50 subtypes
- Basal-like
- HER2-enriched
- Luminal A
- Luminal B
- Normal-like
- Split: 70% train, 15% validation, 15% test (stratified by subtype)
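A stratified 70/15/15 split keeps each subtype's share roughly constant across train, validation, and test. The sketch below shows one simple way to produce such an assignment with the standard library; the function name, seed, and rounding rule are assumptions, and the project's `splits.csv` may have been generated differently.

```python
import random
from collections import Counter, defaultdict


def stratified_split(sample_ids, labels, fractions=(0.70, 0.15, 0.15), seed=0):
    """Assign samples to train/val/test separately within each label."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sid, label in zip(sample_ids, labels):
        by_label[label].append(sid)

    assignment = {}
    for label in sorted(by_label):
        ids = by_label[label]
        rng.shuffle(ids)
        n_train = round(fractions[0] * len(ids))
        n_val = round(fractions[1] * len(ids))
        for i, sid in enumerate(ids):
            if i < n_train:
                assignment[sid] = "train"
            elif i < n_train + n_val:
                assignment[sid] = "val"
            else:
                assignment[sid] = "test"  # remainder goes to test
    return assignment


# Toy cohort for illustration only: 100 LumA and 20 Basal samples.
toy_ids = [f"A{i}" for i in range(100)] + [f"B{i}" for i in range(20)]
toy_labels = ["LumA"] * 100 + ["Basal"] * 20
assignment = stratified_split(toy_ids, toy_labels)
print(Counter(assignment.values()))
```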
ssGSEA (Single Sample Gene Set Enrichment Analysis):
- Ranked enrichment scores per sample
- Used for Elastic Net (primary) and CVAE (optional)
- Tool: `gseapy.ssgsea()`
- Best for interpretability and within-platform consistency
Mean-z normalization:
- Gene-level z-scores averaged across pathway members
- More stable for cross-platform/cross-cohort inference
- Recommended for CVAE external validation
- Fallback when ssGSEA shows platform sensitivity
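Mean-z scoring, as described above, can be sketched in a few lines: z-score each gene, then average the z-scores of the genes in each pathway. The sketch assumes z-scores are computed per gene across the samples of one cohort and that pathway genes missing from the data are skipped; both are assumptions about the method, and the function names and toy gene set are hypothetical.

```python
from statistics import mean, pstdev


def gene_zscores(expr):
    """expr: {sample_id: {gene: value}} -> z-scores per gene across samples."""
    genes = sorted({g for row in expr.values() for g in row})
    z = {sid: {} for sid in expr}
    for gene in genes:
        values = [row[gene] for row in expr.values() if gene in row]
        mu, sd = mean(values), pstdev(values)
        for sid, row in expr.items():
            if gene in row:
                # Constant genes carry no signal; score them as 0.
                z[sid][gene] = (row[gene] - mu) / sd if sd > 0 else 0.0
    return z


def mean_z_scores(expr, gene_sets):
    """Average gene-level z-scores over each pathway's members found in expr."""
    z = gene_zscores(expr)
    scores = {}
    for sid, row in z.items():
        scores[sid] = {}
        for pathway, members in gene_sets.items():
            present = [row[g] for g in members if g in row]
            scores[sid][pathway] = mean(present) if present else 0.0
    return scores


# Toy data for illustration only; gene "C" is absent and is skipped.
toy_expr = {"s1": {"A": 1.0, "B": 2.0}, "s2": {"A": 3.0, "B": 6.0}}
toy_sets = {"HALLMARK_TOY": ["A", "B", "C"]}
toy_scores = mean_z_scores(toy_expr, toy_sets)
print(toy_scores)
```

Because each gene is standardized within the cohort before averaging, platform-specific offsets in raw expression largely cancel, which is one plausible reason this method transfers better across cohorts than rank-based ssGSEA.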
- Pathway names normalized: uppercase, strip whitespace
- Alias matching handles variants:
  - TNF-alpha → TNFA
  - Peroxisome → Pperoxisome
- Debug outputs: `debug_alignment.json` with match counts and variance checks
- Result: 50/50 Hallmark pathways matched across all cohorts
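The alignment rules above (uppercase, strip whitespace, then apply aliases) might look like the following sketch. The alias table contains only the TNF-alpha example from the list, and the function names and return shape are hypothetical; the real matcher and the contents of `debug_alignment.json` may differ.

```python
# Hypothetical alias table: substring variants mapped to a canonical form.
ALIASES = {"TNF-ALPHA": "TNFA"}


def canonical(name):
    """Uppercase, strip whitespace, and rewrite known alias variants."""
    key = name.strip().upper()
    for variant, target in ALIASES.items():
        key = key.replace(variant, target)
    return key


def align_features(model_features, cohort_columns):
    """Map each model feature to a cohort column with the same canonical name.

    Returns (matched cohort columns in model order, unmatched model features).
    """
    lookup = {canonical(col): col for col in cohort_columns}
    matched, missing = [], []
    for feature in model_features:
        column = lookup.get(canonical(feature))
        if column is None:
            missing.append(feature)
        else:
            matched.append(column)
    return matched, missing


# Toy names for illustration only.
model_features = [
    "HALLMARK_TNFA_SIGNALING_VIA_NFKB",
    "HALLMARK_HYPOXIA",
    "HALLMARK_PEROXISOME",
]
cohort_columns = [" hallmark_hypoxia", "Hallmark_TNF-alpha_signaling_via_NFkB"]
matched, missing = align_features(model_features, cohort_columns)
print(len(matched), "matched;", "missing:", missing)
```

Returning the matched columns in model order matters because the trained models expect pathway features in a fixed column order.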
| Script | Purpose |
|---|---|
| `train_elastic_net.py` | Elastic Net trainer with grid search |
| `sweep_cvae.py` | CVAE hyperparameter sweep with early stopping |
| `calibrate_cvae.py` | Temperature scaling for probability calibration |
| `plot_vae_embeddings.py` | UMAP visualization of latent space |
| Script | Purpose |
|---|---|
| `evaluate_external.py` | Elastic Net external inference |
| `evaluate_external_cvae.py` | CVAE external inference |
| `convert_cbio_to_expr.py` | Convert cBioPortal genes × samples → samples × genes |
| `export_external_distributions.py` | Generate side-by-side count tables (CSV) |
| `build_external_report.py` | Generate interactive HTML reports |
| `plot_external_results.py` | Advanced plotting (KDE, entropy, UMAP) |
| Script | Purpose |
|---|---|
| `run_all.ps1` | Full pipeline: train, sweep, calibrate, plot, summarize |
All figures available in:
- Interactive website - Main showcase
- Figures gallery - Organized by cohort
- External reports - Per-cohort HTML pages
TCGA Test Set:
- Confusion matrices (Elastic Net, CVAE)
- UMAP of latent space colored by PAM50 subtype
- Pathway coefficient heatmap (Elastic Net)
External Validation (per cohort):
- Distribution panels (bar charts across 4 runs)
- Confidence KDE plots (per run)
- Predictive entropy histograms (per run)
- UMAP entropy maps (when embeddings available)
- Top-5 most confident samples per subtype (CSV)
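The "top-5 most confident samples per subtype" tables can be derived directly from per-sample class probabilities. A minimal sketch, assuming each sample is assigned to its highest-probability subtype and ranked by that probability (the function name and toy values are illustrative):

```python
def top_confident(probabilities, k=5):
    """probabilities: {sample_id: {subtype: prob}}.

    Returns {subtype: [(sample_id, prob), ...]} with at most k entries per
    predicted subtype, sorted from most to least confident.
    """
    by_subtype = {}
    for sample_id, probs in probabilities.items():
        subtype = max(probs, key=probs.get)
        by_subtype.setdefault(subtype, []).append((sample_id, probs[subtype]))
    return {
        subtype: sorted(rows, key=lambda row: row[1], reverse=True)[:k]
        for subtype, rows in by_subtype.items()
    }


# Toy probabilities for illustration only.
toy_probs = {
    "s1": {"Basal": 0.9, "LumA": 0.1},
    "s2": {"Basal": 0.6, "LumA": 0.4},
    "s3": {"Basal": 0.2, "LumA": 0.8},
}
print(top_confident(toy_probs, k=1))
```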
Examples:
- `reports/external/METABRIC/METABRIC_distributions_panel.png`
- `reports/external/GSE96058/GSE96058_cvae_meanz_confidence_kde.png`
- `reports/external/METABRIC/METABRIC_en_ssgsea_entropy_hist.png`
1. Parquet I/O errors
# Install pyarrow if missing
pip install pyarrow

2. gseapy network errors (ssGSEA)
- Solution 1: Use the local Hallmark GMT: `--library data\external\MSigDB\h.all.v2025.1.Hs.symbols.gmt`
- Solution 2: Fall back to mean-z: `--method meanz`
3. Unicode rendering issues
- Ensure editor is set to UTF-8 encoding
- Issue affects `reports/summary.md`
4. CVAE checkpoint loading fails
- Check `metrics.json` for the correct `latent_dim`
- Evaluator auto-infers from checkpoint if metrics missing
- Verify PyTorch version compatibility
5. Feature alignment shows low match count
- Check `debug_alignment.json` for diagnostics
- Ensure gene symbols are uppercased
- Duplicates are automatically averaged
- Use PowerShell (not CMD)
- Paths: Use backslashes `\` or forward slashes `/`
- Line continuations: Use a backtick `` ` `` at end of line
- Scripts tested on Windows 10/11
- PAM50: Parker et al., 2009 - "Supervised risk predictor of breast cancer based on intrinsic subtypes"
- Elastic Net: Zou & Hastie, 2005 - "Regularization and variable selection via the elastic net"
- VAE: Kingma & Welling, 2014 - "Auto-Encoding Variational Bayes"
- Hallmark: Liberzon et al., 2015 - "The Molecular Signatures Database (MSigDB) hallmark gene set collection"
- ssGSEA: Barbie et al., 2009 - "Systematic RNA interference reveals that oncogenic KRAS-driven cancers require TBK1"
- TCGA-BRCA: The Cancer Genome Atlas Program (National Cancer Institute)
- GSE96058: Gene Expression Omnibus (GEO/NCBI)
- METABRIC: cBioPortal for Cancer Genomics (MSKCC/Dana-Farber)
- MSigDB: Molecular Signatures Database (Broad Institute)
Python: 3.8 or higher
Key Dependencies:
torch>=1.12.0
scikit-learn>=1.0.0
pandas>=1.3.0
numpy>=1.21.0
gseapy>=1.0.0
umap-learn>=0.5.0
matplotlib>=3.5.0
seaborn>=0.11.0
pyarrow>=6.0.0
Optional:
- CUDA toolkit (for GPU acceleration)
- Jupyter (for notebook exploration)
See requirements.txt for complete dependency list.
Tested on:
- Windows 10/11 (PowerShell)
- Python 3.8, 3.9, 3.10
- CPU and CUDA environments
Contributions are welcome! To contribute:
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
Guidelines:
- Add tests for new functionality
- Update documentation
- Follow existing code style
- Include clear commit messages
This project is licensed under the MIT License. See LICENSE file for details.
- TCGA Research Network for BRCA genomic data
- MSigDB team (Broad Institute) for Hallmark gene sets
- cBioPortal and GEO for external cohort access
- PyTorch, scikit-learn, and gseapy communities
- All contributors and users of this project
- Issues: GitHub Issues
- Documentation: Project Website
- External Validation: Detailed Results
- Figures: Gallery
Current Version: 1.0.0
Status: Complete
- Training pipeline (Elastic Net + CVAE)
- External validation (GSE96058, METABRIC)
- Interactive HTML reports
- Distribution comparison tables
- Advanced visualizations
- Comprehensive documentation
Next Steps (Optional):
- Add calibration plots when labels available
- Implement survival analysis
- Add more external cohorts
- Create Docker container
- Add Jupyter notebook tutorials
Built with ❤️ for reproducible cancer research