johndef64/KG-TransomicNet

KG-TransomicNet

A Semantic–Quantitative Property-Graph Framework for Dynamic Construction of Knowledge-Based Trans-Omic Networks

KG-TransomicNet couples a PheKnowLator-derived ontology backbone with per-sample TCGA/TARGET multi-omic measurements inside a single ArangoDB property graph. The semantic layer (≈780k entities, ≈11M typed relations from the OBO Foundry and Relation Ontology) and the quantitative layer (12,653 samples across 41 projects, five omic modalities at native precision) are deterministically joined through controlled mapping keys, so that ontology-grounded queries return per-sample numerical vectors without application-side joins.

The framework is application-agnostic and meant to be reused as the upstream substrate for downstream graph machine-learning pipelines (heterogeneous GNNs, KG embeddings, link prediction).
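As a rough illustration of what "joined through controlled mapping keys" means, the following in-memory sketch links toy ontology nodes to toy per-sample values through a shared key. It is not the repository's implementation: the node IDs, the `mapping_key` field, the predicate name, and all values are made up, and the real collections and AQL pattern are those documented in docs/readme_db_structure.md.

```python
# Toy illustration of joining a semantic layer to a quantitative layer
# through a shared mapping key. All identifiers and values are hypothetical.

# Semantic layer: ontology-typed nodes and typed edges.
nodes = {
    "gene/A": {"label": "gene A", "mapping_key": "ENSG_A"},
    "gene/B": {"label": "gene B", "mapping_key": "ENSG_B"},
    "pathway/P": {"label": "pathway P", "mapping_key": None},
}
edges = [
    ("gene/A", "participates_in", "pathway/P"),
    ("gene/B", "participates_in", "pathway/P"),
]

# Quantitative layer: per-sample vectors indexed by the same mapping key.
expression = {
    "sample-1": {"ENSG_A": 2.5, "ENSG_B": 0.0},
    "sample-2": {"ENSG_A": 1.0, "ENSG_B": 4.0},
}


def vectors_for(target, predicate):
    """Return {sample: {mapping_key: value}} for nodes linked to `target`."""
    keys = [
        nodes[src]["mapping_key"]
        for src, pred, dst in edges
        if pred == predicate and dst == target
    ]
    return {
        sample: {k: values[k] for k in keys if k in values}
        for sample, values in expression.items()
    }


pathway_vectors = vectors_for("pathway/P", "participates_in")
```

In the database the same lookup would be expressed as a single graph query, so the caller never performs this join in application code.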

Framework architecture

What is in the database

| Layer | Source / platform | Projects | Samples |
|---|---|---|---|
| Transcriptomics | HTSeq FPKM-UQ | 41 | 11,768 |
| CNV (gene-level) | TCGA CNV (ASCAT3) | 33 | 11,368 |
| miRNA | TCGA miRNA-Seq | 33 | 11,020 |
| Proteomics | RPPA (TCPA) | 32 | 7,754 |
| Methylation | Illumina HM27 | 12 | 2,595 |
| Distinct samples (any layer) | | 41 | 12,653 |
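Because the layers cover different subsets of samples, it can help to see each layer's coverage relative to the 12,653 distinct samples. The short calculation below uses only the counts listed above; it says nothing about how the layers overlap with each other, which would require querying the database.

```python
# Sample counts copied from the layer summary above.
layer_samples = {
    "Transcriptomics": 11_768,
    "CNV": 11_368,
    "miRNA": 11_020,
    "Proteomics": 7_754,
    "Methylation": 2_595,
}
distinct_samples = 12_653

# Fraction of all distinct samples that carry a measurement in each layer.
coverage = {layer: n / distinct_samples for layer, n in layer_samples.items()}

for layer, frac in sorted(coverage.items(), key=lambda kv: -kv[1]):
    print(f"{layer:<16} {frac:6.1%}")
```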

Semantic backbone: 780,753 nodes / 11,082,103 edges from the PheKnowLator instance-based, OWLNETS build.

Schema details, collection fields, and the AQL query pattern are documented in docs/readme_db_structure.md.

Requirements

  • Python ≥ 3.10
  • A running ArangoDB instance (≥ 3.11), local or remote — required by every ingestion and use-case script
  • ~150 GB free disk for the full PanCancer build (semantic backbone + 5 quantitative layers)
pip install -r requirements.txt

Configure the ArangoDB connection (host, credentials, database name) in scripts/arangodb_utils.py before running any script.
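The contents of scripts/arangodb_utils.py are not reproduced here. As a hedged sketch of the three settings involved (host, credentials, database name), the snippet below gathers them from environment variables and opens a connection. The variable names and the `ArangoSettings` class are hypothetical, and the use of the python-arango driver is an assumption; the repository may use a different client.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ArangoSettings:
    """Connection settings; field names are illustrative, not the repo's."""
    host: str
    username: str
    password: str
    database: str


def settings_from_env(env=os.environ):
    # Variable names (ARANGO_HOST, ...) are assumptions made for this sketch.
    return ArangoSettings(
        host=env.get("ARANGO_HOST", "http://localhost:8529"),
        username=env.get("ARANGO_USER", "root"),
        password=env.get("ARANGO_PASSWORD", ""),
        database=env.get("ARANGO_DB", "PKT_main"),
    )


def connect(settings):
    # Assumes the python-arango driver; the repository may use a different one.
    from arango import ArangoClient

    client = ArangoClient(hosts=settings.host)
    return client.db(
        settings.database,
        username=settings.username,
        password=settings.password,
    )
```

Port 8529 is ArangoDB's default, and `PKT_main` is the database name used in the load commands of this README.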

What ships with the repo and what does not

The repository ships with the curated identifier-mapping tables under data/mappings/ (BioMart gene mappings, miRNA↔HGNC, RPPA antibody manifest, methylation probe maps, MONDO cross-references, dbSNP rsID checkpoints). The largest tables are stored as .zip archives and are unzipped transparently on first use; the public probemap files are downloaded automatically from the GDC Xena hub when missing. These tables are not regenerated by the pipeline; the scripts under scripts/mapping/ document how they were produced and are kept for provenance only.

See data/mappings/README.md for the full per-file inventory, provenance details, and the resolution order used by load_mappings().
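The "unzip on first use" behaviour can be pictured with the sketch below. It is not the code behind load_mappings(): the helper name `resolve_mapping`, the `<filename>.zip` naming scheme, and the assumption that each archive contains a member named exactly like the target file are all illustrative. The actual resolution order is the one described in data/mappings/README.md.

```python
import zipfile
from pathlib import Path


def resolve_mapping(directory, filename):
    """Return a path to `filename`, extracting `<filename>.zip` if needed."""
    directory = Path(directory)
    plain = directory / filename
    if plain.exists():
        # Already unpacked (or shipped uncompressed): use it directly.
        return plain

    archive = directory / f"{filename}.zip"
    if archive.exists():
        # First use: unpack next to the archive so later calls hit the fast path.
        with zipfile.ZipFile(archive) as zf:
            zf.extract(filename, path=directory)
        return plain

    raise FileNotFoundError(f"{filename} not found in {directory} (plain or zipped)")
```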

The following inputs are not tracked (see .gitignore) and must be obtained externally before running the build:

  • PheKnowLator instance-based OWLNETS build → Zenodo 10689968 — place the build under data/pkt/builds/v3.0.2/
  • TCGA / TARGET quantitative layers → UCSC Xena and the GDC Data Portal — downloaded by the scripts in step 1 below
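Before starting the build steps, a quick check that both external inputs are in place can save a failed run. The helper below is a hypothetical convenience, not part of the repository; the two paths are the ones named in this README (data/pkt/builds/v3.0.2/ and data/omics/), and the check only tests that each directory exists and is non-empty.

```python
from pathlib import Path

# Paths taken from this README; adjust if your checkout differs.
REQUIRED_DIRS = {
    "PheKnowLator build": Path("data/pkt/builds/v3.0.2"),
    "TCGA/TARGET dumps": Path("data/omics"),
}


def missing_inputs(root="."):
    """Return the labels of required directories that are absent or empty."""
    missing = []
    for label, rel in REQUIRED_DIRS.items():
        path = Path(root) / rel
        if not path.is_dir() or not any(path.iterdir()):
            missing.append(label)
    return missing


if __name__ == "__main__":
    for label in missing_inputs():
        print(f"missing: {label}")
```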

Reproducing the database

Run the pipeline in order. Every script connects to the ArangoDB instance configured in scripts/arangodb_utils.py.

1. Download the omic data

python scripts/download_omics.py                          # TCGA-BRCA test run (default)
python scripts/download_omics.py --cohort tcga            # all TCGA studies
python scripts/download_omics.py --cohort target          # all TARGET studies
python scripts/download_omics.py --cohort all             # TCGA + TARGET
python scripts/download_omics.py --studies TCGA-BRCA TCGA-LUAD   # explicit list
python scripts/download_omics.py --include-probemaps --include-pancan

2. Build the semantic backbone (PKT → property graph)

python scripts/download_pkt.py                                  # fetch PKT v3.0.2 from Zenodo + derive NodeLabels CSV
python scripts/build_property_graph.py                          # PKT N-Triples → nodes/edges JSON (full build)
python scripts/build_property_graph.py --sample 10000           # quick test with a random subsample
python scripts/load_graph_to_arangodb.py --db PKT_main          # load nodes/edges into ArangoDB
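To give a feel for the N-Triples → nodes/edges conversion, here is a deliberately simplified sketch. It is not what build_property_graph.py does internally: it handles only triples whose three terms are IRIs, the record fields (`uri`, `source`, `predicate`, `target`) are invented for the example, and the `sample` argument merely mimics the idea of a random subsample.

```python
import random
import re

# Matches triples whose subject, predicate and object are all IRIs.
# Literals and blank nodes are skipped in this simplified sketch.
TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")


def triples_to_graph(lines, sample=None, seed=0):
    """Convert N-Triples lines into node and edge records."""
    triples = [m.groups() for m in map(TRIPLE.match, lines) if m]
    if sample is not None and sample < len(triples):
        # Fixed seed so that repeated test runs see the same subsample.
        triples = random.Random(seed).sample(triples, sample)

    nodes = {}
    edges = []
    for subj, pred, obj in triples:
        for iri in (subj, obj):
            nodes.setdefault(iri, {"uri": iri})
        edges.append({"source": subj, "predicate": pred, "target": obj})
    return list(nodes.values()), edges


# Example input with placeholder IRIs; the literal-valued line is ignored.
example_lines = [
    "<http://example.org/a> <http://example.org/rel> <http://example.org/b> .",
    "<http://example.org/b> <http://example.org/rel> <http://example.org/c> .",
    '<http://example.org/a> <http://example.org/label> "a literal" .',
]
example_nodes, example_edges = triples_to_graph(example_lines)
```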

3. Build and load the quantitative layers

The build and load scripts accept the same --studies, --cohort, and --layers options as download_omics.py.

# Build JSON collections (default: TCGA-BRCA, all layers)
python scripts/build_omics_collections.py
python scripts/build_omics_collections.py --cohort tcga
python scripts/build_omics_collections.py --studies TCGA-BRCA TCGA-LUAD --layers gene_expression mirna

# Load into ArangoDB
python scripts/load_omics_collections_to_arangodb.py --db PKT_main
python scripts/load_omics_collections_to_arangodb.py --cohort tcga --db PKT_main
python scripts/load_omics_collections_to_arangodb.py --studies TCGA-BRCA --layers protein methylation
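For readers who want to wrap these scripts, the shared options could be modelled roughly as below. This is a sketch inferred from the commands shown in this README, not the scripts' actual parser: the layer name `cnv` is a guess, the default values are assumptions based on the "default: TCGA-BRCA, all layers" comment, and how --cohort and --studies interact when both are given is not specified here.

```python
import argparse

# "gene_expression", "mirna", "protein" and "methylation" appear in the
# commands above; "cnv" is a guess at the fifth layer's name.
LAYERS = ["gene_expression", "cnv", "mirna", "protein", "methylation"]


def build_parser():
    parser = argparse.ArgumentParser(description="Shared study/layer selection")
    parser.add_argument("--studies", nargs="+", default=["TCGA-BRCA"],
                        help="explicit list of study IDs")
    parser.add_argument("--cohort", choices=["tcga", "target", "all"],
                        help="select every study in a cohort")
    parser.add_argument("--layers", nargs="+", choices=LAYERS, default=LAYERS,
                        help="omic layers to process")
    return parser


args = build_parser().parse_args(
    ["--studies", "TCGA-BRCA", "TCGA-LUAD", "--layers", "gene_expression", "mirna"]
)
```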

4. (Optional) Trans-omic networks and database statistics

python scripts/build_kg_transomics.py        # per-sample trans-omic subgraph
python scripts/build_transomic_network.py    # cohort-level trans-omic network
python scripts/analyze_kg_transomics.py     # subgraph descriptive statistics
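Since the steps must run in order, a small driver that stops at the first failure can be convenient. The one below is a hypothetical wrapper, not a script shipped with the repository; it only chains the default (TCGA-BRCA) commands listed above and treats step 4 as opt-in.

```python
import subprocess
import sys

# Commands copied from the steps above (TCGA-BRCA defaults, PKT_main database).
STEPS = [
    ["scripts/download_omics.py"],
    ["scripts/download_pkt.py"],
    ["scripts/build_property_graph.py"],
    ["scripts/load_graph_to_arangodb.py", "--db", "PKT_main"],
    ["scripts/build_omics_collections.py"],
    ["scripts/load_omics_collections_to_arangodb.py", "--db", "PKT_main"],
]
OPTIONAL_STEPS = [
    ["scripts/build_kg_transomics.py"],
    ["scripts/build_transomic_network.py"],
    ["scripts/analyze_kg_transomics.py"],
]


def plan(include_optional=False):
    """Return the full command lines in execution order."""
    steps = STEPS + (OPTIONAL_STEPS if include_optional else [])
    return [[sys.executable, *step] for step in steps]


def run(include_optional=False):
    for command in plan(include_optional):
        print("running:", " ".join(command))
        # check=True raises CalledProcessError, halting at the first failure.
        subprocess.run(command, check=True)
```

Calling `plan()` first gives a dry-run view of what `run()` would execute.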

Reproducing the use cases

The three use cases on TCGA-BRCA exercise the framework at predicate, gene, and phenotype granularity. Each script reads from ArangoDB and writes results and figures to results/ucN/.

| Script | What it does |
|---|---|
| use_case_1.py | Predicate-stratified mRNA–mRNA coherence (co-pathway, molecular interaction, genetic interaction, co-disease) vs random baseline; hop-distance decay. |
| use_case_2.py | Per-gene CNV→mRNA→protein discordance classification; GO, pathway, and KG-predicate enrichment on discordant genes. |
| use_case_3.py | Phenotype-anchored two-hop traversal from HP:0003002 (Breast carcinoma); multi-layer projection of the resulting gene set against per-layer random baselines. |

python use_case_1.py
python use_case_2.py
python use_case_3.py
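The "two-hop traversal" idea behind use case 3 can be sketched on a toy graph with a bounded breadth-first search. This is only an illustration: apart from HP:0003002, every node name is a placeholder, edges are treated as directed, and the predicate filters and edge directions used by use_case_3.py against ArangoDB are not reproduced here.

```python
from collections import deque

# Hypothetical toy graph: node names other than HP:0003002 are placeholders.
adjacency = {
    "HP:0003002": ["disease/D1", "disease/D2"],
    "disease/D1": ["gene/G1", "gene/G2"],
    "disease/D2": ["gene/G2", "gene/G3"],
    "gene/G3": ["pathway/P1"],  # three hops away, must not be reached
}


def within_hops(graph, start, max_hops):
    """Breadth-first search returning {node: hop distance} up to max_hops."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


reached = within_hops(adjacency, "HP:0003002", max_hops=2)
gene_set = sorted(n for n, d in reached.items() if d == 2 and n.startswith("gene/"))
```

The resulting `gene_set` plays the role of the phenotype-anchored gene set that is then projected onto the quantitative layers.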

To regenerate only the figures from cached results (no ArangoDB connection needed), pass --skip-analysis to any use-case script, replacing N with 1, 2, or 3:

python use_case_N.py --skip-analysis

The classification of output files (R = result, C = plot cache, F = figure) and which files are required by --skip-analysis are documented in use_case_readme.md.

Repository layout

.
├── data/
│   ├── mappings/            # curated identifier-mapping tables (shipped)
│   ├── pkt/                 # PKT build target (external, see above)
│   ├── omics/               # TCGA/TARGET dumps (external)
│   └── ...
├── docs/                    # database schema, layer structure, methods
├── scripts/
│   ├── mapping/             # provenance scripts for data/mappings/ (not part of the pipeline)
│   ├── stats/               # database statistics
│   ├── download_omics.py    # fetch TCGA/TARGET layers from UCSC Xena GDC hub
│   ├── download_pkt.py      # fetch PheKnowLator v3.0.2 build from Zenodo + derive NodeLabels CSV
│   ├── omics_utils.py       # read/manipulate locally-downloaded omic matrices
│   ├── pkt_utils.py         # PKT tar/RDF reader helpers
│   ├── build_property_graph.py
│   ├── load_graph_to_arangodb.py
│   ├── build_omics_collections.py
│   ├── load_omics_collections_to_arangodb.py
│   ├── build_kg_transomics.py
│   ├── build_transomic_network.py
│   ├── analyze_kg_transomics.py
│   └── arangodb_utils.py
├── use_case_1.py            # UC1 — predicate-stratified coherence
├── use_case_2.py            # UC2 — multi-layer discordance
├── use_case_3.py            # UC3 — phenotype-anchored subgraph
├── results/                 # per-use-case CSV/JSON results + figures
└── figures/                 # framework architecture figure

Citation

De Filippis, G. M., Rinaldi, A. M. Modeling Omics with Semantics for Dynamic Construction of Knowledge-Based Trans-Omic Networks. (under review).

License

MIT
