BioITHackathons/Repurpostory

Drug Repurposing Through Disease Similarity Analysis Using BiomarkerKB

Computational pipeline and interactive web dashboard for identifying drug repurposing candidates across multiple cancers, driven by biomarker evidence from BiomarkerKB and cross-referenced against GTEx, Pharos/IDG, GDC/TCGA, LINCS L1000, and STRING.

Presentation: Google Slides

Full project description on Google Docs


Overview

The pipeline queries six public databases to build a ranked list of drug repurposing candidates for a given cancer:

Step  Source        What it provides
1     BiomarkerKB   Disease-associated biomarker genes
2     GTEx          Tissue expression (TPM) per gene
3     Pharos / IDG  Target development level (Tclin → Tdark)
4     GDC / TCGA    Somatic mutation frequency + differential expression vs. normal
5     STRING        Protein–protein interaction network expansion
6     LINCS L1000   Drug perturbagen signatures matching query genes

Candidates are scored across five weighted components and ranked. Weights can be adjusted interactively in the web dashboard.

Diseases supported (configurable): Hepatocellular Carcinoma, Pancreatic Cancer, Colorectal Cancer, Lung Adenocarcinoma, and Breast Cancer.


Setup

Python environment

Requires Python 3.9+.

python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

LINCS data files (required for LINCS step)

The LINCS L1000 knowledge graph files are too large for git. Download them from:

https://dd-kg-ui.cfde.cloud/downloads

Place the following files in the data/ directory:

  • LINCS.Compound.nodes.csv
  • LINCS.Gene.nodes.csv
  • LINCS.edges.csv

The LINCS step can be skipped with --no-lincs if these files are unavailable.
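
A small pre-flight check can decide whether --no-lincs is needed before launching a run. The sketch below is not part of the repository: the function names are hypothetical, and it assumes the three files sit directly under data/ with exactly the names listed above.

```python
from pathlib import Path

# File names taken from the setup instructions above.
REQUIRED_LINCS_FILES = (
    "LINCS.Compound.nodes.csv",
    "LINCS.Gene.nodes.csv",
    "LINCS.edges.csv",
)


def missing_lincs_files(data_dir="data"):
    """Return the required LINCS file names that are not present in data_dir."""
    base = Path(data_dir)
    return [name for name in REQUIRED_LINCS_FILES if not (base / name).is_file()]


def lincs_flags(data_dir="data"):
    """Return ["--no-lincs"] if any required file is missing, else an empty list."""
    return ["--no-lincs"] if missing_lincs_files(data_dir) else []
```

The list returned by lincs_flags() can be appended to the pipeline arguments, so a machine without the downloads still completes the other steps.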


Running the pipeline

python pipeline.py \
  --output-dir output \
  --search-terms "hepatocellular carcinoma" "HCC" \
  --gtex-tissue Liver \
  --gdc-project TCGA-LIHC

Key flags:

Flag            Default    Description
--output-dir    output     Directory for all result CSVs and JSON
--search-terms  HCC terms  BiomarkerKB condition search terms
--gtex-tissue   Liver      GTEx tissue (e.g. Pancreas, Lung, Colon_Transverse)
--gdc-project   TCGA-LIHC  TCGA project (e.g. TCGA-PAAD, TCGA-COAD)
--no-lincs                 Skip LINCS L1000 step
--no-pharos                Skip Pharos/IDG step
--no-gdc                   Skip GDC step
--no-string                Skip STRING PPI expansion
--cache-dir     cache      Local cache for API responses
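
When scripting runs for several cancers, it can help to assemble the command line from the flags above. The following is a minimal sketch, not code from the repository: build_pipeline_command is a hypothetical helper, and the output directory name and search term in the usage example are illustrative assumptions. Only the flag names, the Pancreas tissue, and the TCGA-PAAD project come from the table.

```python
import shlex
import sys


def build_pipeline_command(output_dir, search_terms, gtex_tissue, gdc_project,
                           skip=(), cache_dir=None):
    """Assemble an argv list for pipeline.py using the documented flags."""
    allowed_skips = {"lincs", "pharos", "gdc", "string"}
    unknown = set(skip) - allowed_skips
    if unknown:
        raise ValueError(f"unknown step(s) to skip: {sorted(unknown)}")
    cmd = [
        sys.executable, "pipeline.py",
        "--output-dir", output_dir,
        "--search-terms", *search_terms,
        "--gtex-tissue", gtex_tissue,
        "--gdc-project", gdc_project,
    ]
    if cache_dir is not None:
        cmd += ["--cache-dir", cache_dir]
    # Each skipped step maps to its --no-<step> flag.
    cmd += [f"--no-{step}" for step in sorted(skip)]
    return cmd


cmd = build_pipeline_command(
    output_dir="output_pancreatic",      # hypothetical directory name
    search_terms=["pancreatic cancer"],  # hypothetical search term
    gtex_tissue="Pancreas",
    gdc_project="TCGA-PAAD",
    skip=["lincs"],
)
print(shlex.join(cmd))  # to execute instead: subprocess.run(cmd, check=True)
```

Building the command as a list rather than a string keeps multi-word search terms intact without manual quoting.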

Output files

Each run writes to --output-dir:

File                            Contents
biomarkers.csv                  BiomarkerKB hits
gtex_liver_expression.csv       Per-gene tissue TPM
pharos_targets.csv              IDG target annotations
gdc_stats.csv                   Mutation freq + DE log2FC per gene
string_interactions.csv         PPI edges from STRING
lincs_perturbagen_hits.csv      Ranked perturbagens with annotations
final_scored_candidates_v2.csv  Final scored + ranked candidates
summary.json                    Summary counts for each step
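
To take a quick look at the ranked output without opening the dashboard, the CSV can be read with the standard library. This is a sketch under two assumptions that the README does not confirm: the file has a header row, and its rows are already sorted best-first. The helper name top_candidates is hypothetical, and no column names are assumed.

```python
import csv
from itertools import islice
from pathlib import Path


def top_candidates(output_dir="output", n=5,
                   filename="final_scored_candidates_v2.csv"):
    """Return the first n data rows of the ranked candidates file as dicts."""
    path = Path(output_dir) / filename
    with path.open(newline="", encoding="utf-8") as handle:
        # DictReader uses the header row as keys, whatever the columns are.
        return list(islice(csv.DictReader(handle), n))
```

Calling top_candidates("output", n=3) after a run returns up to three rows, each keyed by whatever column names the file actually contains.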

Scoring

After running the pipeline, score and rank candidates with:

python score.py --output-dir output --top 20

This reads lincs_perturbagen_hits.csv and writes final_scored_candidates_v2.csv with all component scores and a weighted total. Component score caps:

Component        Cap  Signal
BiomarkerKB      5    Number of biomarker genes hit
GTEx expression  15   Tissue TPM (log-normalized)
IDG/Pharos       8    Target development level
LINCS            10   Perturbagen signature strength
GDC/TCGA         8    Mutation freq + differential expression
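
One way to read the caps and weights together is sketched below. It is an illustration, not the logic in src/scoring.py: it assumes each component is clamped to its cap, multiplied by its weight (default 1×), and summed. The dictionary keys, function names, and the two example candidates are hypothetical.

```python
# Caps from the component table above; the key names are assumptions.
CAPS = {"biomarkerkb": 5, "gtex": 15, "pharos": 8, "lincs": 10, "gdc": 8}


def weighted_total(components, weights=None):
    """Clamp each component to [0, cap], scale by its weight, and sum."""
    weights = weights or {}
    total = 0.0
    for name, cap in CAPS.items():
        capped = min(max(components.get(name, 0.0), 0.0), cap)
        total += weights.get(name, 1.0) * capped
    return total


def rank(candidates, weights=None):
    """Return candidate names sorted by weighted total, highest first."""
    ordered = sorted(candidates.items(),
                     key=lambda item: weighted_total(item[1], weights),
                     reverse=True)
    return [name for name, _ in ordered]


# Hypothetical candidates with made-up component scores.
candidates = {
    "drug_a": {"biomarkerkb": 7, "gtex": 12, "pharos": 8, "lincs": 3, "gdc": 0},
    "drug_b": {"biomarkerkb": 2, "gtex": 4, "pharos": 4, "lincs": 10, "gdc": 6},
}
print(rank(candidates))                  # ['drug_a', 'drug_b']  (28 vs. 26)
print(rank(candidates, {"lincs": 3.0}))  # ['drug_b', 'drug_a']  (46 vs. 34)
```

Under these assumptions the maximum total at default weights is 46, the sum of the caps, and raising a single weight can reorder candidates without recomputing any component.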

Web dashboard

An interactive Django + React dashboard lets you explore results across all five cancers and adjust scoring weights in real time.

Start the backend

cd web
pip install -r requirements-web.txt
python manage.py migrate
python manage.py runserver

Start the frontend

cd web/frontend
npm install
npm run dev

Open http://localhost:5173 (or the Vite port shown in the terminal).

Features

  • Disease selector — switch between HCC, Pancreatic, Colorectal, Lung, and Breast cancer results
  • Summary cards — step-level counts (biomarkers, GTEx genes, targets, candidates)
  • Scoring panel — ranked drug candidates with per-component score bars; five weight sliders (0–3×) re-rank in real time without re-running the pipeline
  • BiomarkerKB, GTEx, Pharos, GDC, LINCS, STRING sections — paginated tables for each data source

The dashboard reads pre-computed CSV outputs. Run the pipeline at least once per disease before using the dashboard.


Repository layout

pipeline.py          # entry point: python pipeline.py [options]
score.py             # standalone scorer: python score.py [--output-dir ...]
src/
  pipeline.py        # pipeline orchestration
  scoring.py         # multi-component scoring module
  clients/           # API clients (BiomarkerKB, GTEx, Pharos, GDC, LINCS, STRING)
  models.py          # dataclasses for intermediate results
web/
  api/               # Django REST views + URL routing
  config/            # Django settings
  frontend/src/      # React components
data/                # LINCS files (not in git — see setup above)
output*/             # Pipeline results per disease (not in git)
