Computational pipeline and interactive web dashboard for identifying drug repurposing candidates across multiple cancers, driven by biomarker evidence from BiomarkerKB and cross-referenced against GTEx, Pharos/IDG, GDC/TCGA, LINCS L1000, and STRING.
Presentation: Google Slides
Full project description on Google Docs
The pipeline queries six public databases to build a ranked list of drug repurposing candidates for a given cancer:
| Step | Source | What it provides |
|---|---|---|
| 1 | BiomarkerKB | Disease-associated biomarker genes |
| 2 | GTEx | Tissue expression (TPM) per gene |
| 3 | Pharos / IDG | Target development level (Tclin → Tdark) |
| 4 | GDC / TCGA | Somatic mutation frequency + differential expression vs. normal |
| 5 | STRING | Protein–protein interaction network expansion |
| 6 | LINCS L1000 | Drug perturbagen signatures matching query genes |
Candidates are scored across five weighted components and ranked. Weights can be adjusted interactively in the web dashboard.
Diseases supported (configurable): Hepatocellular Carcinoma, Pancreatic Cancer, Colorectal Cancer, Lung Adenocarcinoma and Breast Cancer.
Requires Python 3.9+.

    python -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt

The LINCS L1000 knowledge graph files are too large for git. Download them from:
Place the following files in the data/ directory:
- LINCS.Compound.nodes.csv
- LINCS.Gene.nodes.csv
- LINCS.edges.csv

The LINCS step can be skipped with --no-lincs if these files are unavailable.
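
A quick pre-flight check can decide whether `--no-lincs` is needed before a run. The sketch below is not part of the repository; the helper names `missing_lincs_files` and `lincs_flags` are hypothetical, and only the three file names and the `data/` location come from the text above.

```python
from pathlib import Path

# File names as listed above; "data" is the directory named in the setup text.
LINCS_FILES = (
    "LINCS.Compound.nodes.csv",
    "LINCS.Gene.nodes.csv",
    "LINCS.edges.csv",
)


def missing_lincs_files(data_dir="data"):
    """Return the names of required LINCS files not found in data_dir."""
    base = Path(data_dir)
    return [name for name in LINCS_FILES if not (base / name).is_file()]


def lincs_flags(data_dir="data"):
    """Return ['--no-lincs'] when any file is missing, else an empty list."""
    return ["--no-lincs"] if missing_lincs_files(data_dir) else []


if __name__ == "__main__":
    missing = missing_lincs_files()
    if missing:
        print("Missing LINCS files:", ", ".join(missing))
    print("Extra pipeline flags:", lincs_flags())
```

The returned list can be appended to the pipeline command line, so a machine without the downloads still completes the other five steps.
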

    python pipeline.py \
      --output-dir output \
      --search-terms "hepatocellular carcinoma" "HCC" \
      --gtex-tissue Liver \
      --gdc-project TCGA-LIHC

Key flags:

| Flag | Default | Description |
|---|---|---|
| --output-dir | output | Directory for all result CSVs and JSON |
| --search-terms | HCC terms | BiomarkerKB condition search terms |
| --gtex-tissue | Liver | GTEx tissue (e.g. Pancreas, Lung, Colon_Transverse) |
| --gdc-project | TCGA-LIHC | TCGA project (e.g. TCGA-PAAD, TCGA-COAD) |
| --no-lincs | none | Skip LINCS L1000 step |
| --no-pharos | none | Skip Pharos/IDG step |
| --no-gdc | none | Skip GDC step |
| --no-string | none | Skip STRING PPI expansion |
| --cache-dir | cache | Local cache for API responses |
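
For readers who want to see how these flags fit together, here is a minimal `argparse` sketch that mirrors the table. It is an illustration, not the actual parser in `pipeline.py`. The function name `build_parser` is hypothetical, and the default for `--search-terms` is assumed to be the two terms from the example command, since the table only says "HCC terms".

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags (a sketch only)."""
    p = argparse.ArgumentParser(prog="pipeline.py")
    p.add_argument("--output-dir", default="output",
                   help="Directory for all result CSVs and JSON")
    # Assumption: the "HCC terms" default is taken to be the two terms
    # from the example command; the real default list may differ.
    p.add_argument("--search-terms", nargs="+",
                   default=["hepatocellular carcinoma", "HCC"],
                   help="BiomarkerKB condition search terms")
    p.add_argument("--gtex-tissue", default="Liver", help="GTEx tissue")
    p.add_argument("--gdc-project", default="TCGA-LIHC", help="TCGA project")
    p.add_argument("--cache-dir", default="cache",
                   help="Local cache for API responses")
    # Each --no-* switch disables one optional step.
    for step in ("lincs", "pharos", "gdc", "string"):
        p.add_argument(f"--no-{step}", action="store_true",
                       help=f"Skip the {step} step")
    return p


if __name__ == "__main__":
    args = build_parser().parse_args(
        ["--gtex-tissue", "Pancreas", "--gdc-project", "TCGA-PAAD", "--no-lincs"]
    )
    print(args.gtex_tissue, args.gdc_project, args.no_lincs)
```

Modeling the skip switches as `store_true` flags keeps every step enabled by default, which matches the table's lack of a default value for them.
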
Each run writes to --output-dir:
| File | Contents |
|---|---|
| biomarkers.csv | BiomarkerKB hits |
| gtex_liver_expression.csv | Per-gene tissue TPM |
| pharos_targets.csv | IDG target annotations |
| gdc_stats.csv | Mutation freq + DE log2FC per gene |
| string_interactions.csv | PPI edges from STRING |
| lincs_perturbagen_hits.csv | Ranked perturbagens with annotations |
| final_scored_candidates_v2.csv | Final scored + ranked candidates |
| summary.json | Summary counts for each step |
After running the pipeline, score and rank candidates with:

    python score.py --output-dir output --top 20

This reads lincs_perturbagen_hits.csv and writes final_scored_candidates_v2.csv with all component scores and a weighted total. Component score caps:

| Component | Cap | Signal |
|---|---|---|
| BiomarkerKB | 5 | Number of biomarker genes hit |
| GTEx expression | 15 | Tissue TPM (log-normalized) |
| IDG/Pharos | 8 | Target development level |
| LINCS | 10 | Perturbagen signature strength |
| GDC/TCGA | 8 | Mutation freq + differential expression |
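
To make the capped, weighted total concrete, the sketch below clips each component to its cap from the table and sums the results with per-component weights. The combination rule, the dictionary keys, and the function names `weighted_total` and `rank` are assumptions for illustration. The scoring module in the repository may combine components differently.

```python
# Caps from the table above. The key names and the combination rule
# (clip each component to its cap, then take a weighted sum) are assumptions.
CAPS = {"biomarkerkb": 5, "gtex": 15, "pharos": 8, "lincs": 10, "gdc": 8}


def weighted_total(components, weights=None):
    """Clip each component score to [0, cap] and return the weighted sum."""
    weights = weights or {}
    total = 0.0
    for name, cap in CAPS.items():
        clipped = max(0.0, min(float(components.get(name, 0.0)), cap))
        total += weights.get(name, 1.0) * clipped
    return total


def rank(candidates, weights=None):
    """Sort (drug, components) pairs by weighted total, highest first."""
    return sorted(candidates, key=lambda c: weighted_total(c[1], weights),
                  reverse=True)


if __name__ == "__main__":
    # Made-up drugs and scores, for illustration only.
    demo = [
        ("drug_a", {"biomarkerkb": 3, "gtex": 12, "pharos": 8, "lincs": 2, "gdc": 1}),
        ("drug_b", {"biomarkerkb": 1, "gtex": 4, "pharos": 4, "lincs": 10, "gdc": 6}),
    ]
    print([name for name, _ in rank(demo)])
    print([name for name, _ in rank(demo, {"lincs": 3.0, "gtex": 0.0})])
```

With unit weights the made-up `drug_a` ranks first (26 vs. 25). Raising the LINCS weight to 3 and zeroing GTEx flips the order, which is the kind of re-ranking that adjustable weights allow without recomputing the component scores.
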
An interactive Django + React dashboard lets you explore results across all five cancers and adjust scoring weights in real time.

    cd web
    pip install -r requirements-web.txt
    python manage.py migrate
    python manage.py runserver

In a second terminal, from the repository root, start the frontend:

    cd web/frontend
    npm install
    npm run dev

Open http://localhost:5173 (or the Vite port shown in the terminal).

- Disease selector — switch between HCC, Pancreatic, Colorectal, Lung, and Breast cancer results
- Summary cards — step-level counts (biomarkers, GTEx genes, targets, candidates)
- Scoring panel — ranked drug candidates with per-component score bars; five weight sliders (0–3×) re-rank in real time without re-running the pipeline
- BiomarkerKB, GTEx, Pharos, GDC, LINCS, STRING sections — paginated tables for each data source
The dashboard reads pre-computed CSV outputs. Run the pipeline at least once per disease before using the dashboard.
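
Since the dashboard needs one pipeline run per disease, a small driver can assemble the five command lines. This is a sketch under stated assumptions: the Pancreas, Lung, and Colon_Transverse tissues and the TCGA-LIHC, TCGA-PAAD, and TCGA-COAD projects appear in the flag table, while the TCGA-LUAD and TCGA-BRCA projects, the Breast_Mammary_Tissue tissue, the non-HCC search terms, and the `output_<key>` directory names are guesses that should be checked against the project configuration.

```python
import shlex
import sys

# (search terms, GTEx tissue, TCGA project) per disease.
# Only the HCC row is taken verbatim from the example command; the other
# search terms, the lung and breast projects, and the breast tissue name
# are assumptions for illustration.
DISEASES = {
    "hcc": (["hepatocellular carcinoma", "HCC"], "Liver", "TCGA-LIHC"),
    "pancreatic": (["pancreatic cancer"], "Pancreas", "TCGA-PAAD"),
    "colorectal": (["colorectal cancer"], "Colon_Transverse", "TCGA-COAD"),
    "lung": (["lung adenocarcinoma"], "Lung", "TCGA-LUAD"),
    "breast": (["breast cancer"], "Breast_Mammary_Tissue", "TCGA-BRCA"),
}


def build_command(key, extra_flags=()):
    """Return the argv list for one disease; output_<key> is a hypothetical name."""
    terms, tissue, project = DISEASES[key]
    return [
        sys.executable, "pipeline.py",
        "--output-dir", f"output_{key}",
        "--search-terms", *terms,
        "--gtex-tissue", tissue,
        "--gdc-project", project,
        *extra_flags,
    ]


if __name__ == "__main__":
    for key in DISEASES:
        # Print instead of executing; pass the list to subprocess.run to run it.
        print(shlex.join(build_command(key)))
```

Building an argv list rather than a shell string avoids quoting problems with multi-word search terms such as "hepatocellular carcinoma".
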

    pipeline.py          # entry point: python pipeline.py [options]
    score.py             # standalone scorer: python score.py [--output-dir ...]
    src/
      pipeline.py        # pipeline orchestration
      scoring.py         # multi-component scoring module
      clients/           # API clients (BiomarkerKB, GTEx, Pharos, GDC, LINCS, STRING)
      models.py          # dataclasses for intermediate results
    web/
      api/               # Django REST views + URL routing
      config/            # Django settings
      frontend/src/      # React components
    data/                # LINCS files (not in git; see setup above)
    output*/             # Pipeline results per disease (not in git)