A Semantic–Quantitative Property-Graph Framework for Dynamic Construction of Knowledge-Based Trans-Omic Networks
KG-TransomicNet couples a PheKnowLator-derived ontology backbone with per-sample TCGA/TARGET multi-omic measurements inside a single ArangoDB property graph. The semantic layer (≈780k entities, ≈11M typed relations from the OBO Foundry and Relation Ontology) and the quantitative layer (12,653 samples across 41 projects, five omic modalities at native precision) are deterministically joined through controlled mapping keys, so that ontology-grounded queries return per-sample numerical vectors without application-side joins.
The framework is application-agnostic and meant to be reused as the upstream substrate for downstream graph machine-learning pipelines (heterogeneous GNNs, KG embeddings, link prediction).
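As a rough illustration of the deterministic join between the two layers, the sketch below uses plain Python dictionaries in place of ArangoDB collections. The entity identifiers, mapping keys, sample names, and values are all invented for the example and do not reflect the actual schema, which is documented in docs/readme_db_structure.md.

```python
from typing import Dict, List

# Toy stand-ins for the two layers; every identifier and value is illustrative.
# Semantic layer: ontology entity -> controlled mapping key.
semantic_nodes: Dict[str, str] = {
    "entity:A": "GENE_A",
    "entity:B": "GENE_B",
}

# Quantitative layer: mapping key -> per-sample measurements.
expression: Dict[str, Dict[str, float]] = {
    "GENE_A": {"sample_1": 12.5, "sample_2": 9.0},
    "GENE_B": {"sample_1": 3.25, "sample_2": 4.75},
}


def sample_vector(entity_ids: List[str], sample_id: str) -> List[float]:
    """Resolve each entity to its mapping key and read that sample's value."""
    vector = []
    for entity_id in entity_ids:
        key = semantic_nodes[entity_id]             # semantic -> mapping key
        vector.append(expression[key][sample_id])   # mapping key -> measurement
    return vector
```

Because the mapping key is the only link between the layers, the same entity list always yields the same vector for a given sample. In the framework this lookup happens inside the database query, not in application code.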
| Layer | Source / platform | Projects | Samples |
|---|---|---|---|
| Transcriptomics | HTSeq FPKM-UQ | 41 | 11,768 |
| CNV (gene-level) | TCGA CNV (ASCAT3) | 33 | 11,368 |
| miRNA | TCGA miRNA-Seq | 33 | 11,020 |
| Proteomics | RPPA (TCPA) | 32 | 7,754 |
| Methylation | Illumina HM27 | 12 | 2,595 |
| Distinct samples (any layer) | — | 41 | 12,653 |
Semantic backbone: 780,753 nodes / 11,082,103 edges from the PheKnowLator instance-based, OWLNETS build.
Schema details, collection fields, and the AQL query pattern are documented in docs/readme_db_structure.md.
- Python ≥ 3.10
- A running ArangoDB instance (≥ 3.11), local or remote — required by every ingestion and use-case script
- ~150 GB free disk for the full PanCancer build (semantic backbone + 5 quantitative layers)
pip install -r requirements.txt

Configure the ArangoDB connection (host, credentials, database name) in scripts/arangodb_utils.py before running any script.
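The layout of scripts/arangodb_utils.py is not shown here, so the following is only a sketch of one way to hold the three settings mentioned above. The class name, field names, and environment-variable names are assumptions for illustration; the port 8529 is ArangoDB's default, and PKT_main is the database name used in the commands of this README.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class ArangoSettings:
    """Hypothetical container for host, credentials, and database name."""
    host: str
    username: str
    password: str
    database: str


def settings_from_env(env=None) -> ArangoSettings:
    """Read settings from a mapping (os.environ by default) with fallbacks."""
    env = os.environ if env is None else env
    return ArangoSettings(
        host=env.get("ARANGO_HOST", "http://localhost:8529"),  # assumed variable name
        username=env.get("ARANGO_USER", "root"),               # assumed variable name
        password=env.get("ARANGO_PASSWORD", ""),               # assumed variable name
        database=env.get("ARANGO_DB", "PKT_main"),             # assumed variable name
    )
```

Reading credentials from the environment keeps them out of version control, which can be convenient when the instance is remote.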
The repository ships with the curated identifier-mapping tables under data/mappings/ (BioMart gene mappings, miRNA↔HGNC, RPPA antibody manifest, methylation probe maps, MONDO cross-references, dbSNP rsID checkpoints). The heavy ones are shipped compressed as .zip and transparently unzipped on first use; the public probemap files are auto-downloaded from the GDC Xena hub when missing. These tables are not regenerated by the pipeline — the scripts under scripts/mapping/ document how they were produced and are kept for provenance only.
See data/mappings/README.md for the full per-file inventory, provenance details, and the resolution order used by load_mappings().
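The "unzip on first use" behaviour for the compressed tables can be pictured with the standard-library sketch below. It is not the code behind load_mappings(); the helper name and the convention that each archive is named after its table with a .zip suffix are assumptions made for the example.

```python
import zipfile
from pathlib import Path


def ensure_unzipped(table_path: Path) -> Path:
    """Return table_path, extracting it from a sibling .zip on first use."""
    if table_path.exists():
        return table_path  # already extracted by an earlier call
    # Assumed naming convention: "<table file name>.zip" next to the table.
    archive = table_path.with_name(table_path.name + ".zip")
    if not archive.exists():
        raise FileNotFoundError(f"neither {table_path} nor {archive} exists")
    with zipfile.ZipFile(archive) as zf:
        # Extract only the member that carries the table's own file name.
        zf.extract(table_path.name, path=table_path.parent)
    return table_path
```

Later calls find the extracted file and return immediately, so the decompression cost is paid once per table.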
The following inputs are not tracked (see .gitignore) and must be obtained externally before running the build:
- PheKnowLator instance-based OWLNETS build → Zenodo 10689968; place the build under data/pkt/builds/v3.0.2/
- TCGA / TARGET quantitative layers → UCSC Xena and the GDC Data Portal; downloaded by the scripts in step 1 below
Run the pipeline in order. Every script connects to the ArangoDB instance configured in scripts/arangodb_utils.py.
python scripts/download_omics.py # TCGA-BRCA test run (default)
python scripts/download_omics.py --cohort tcga # all TCGA studies
python scripts/download_omics.py --cohort target # all TARGET studies
python scripts/download_omics.py --cohort all # TCGA + TARGET
python scripts/download_omics.py --studies TCGA-BRCA TCGA-LUAD # explicit list
python scripts/download_omics.py --include-probemaps --include-pancan

python scripts/download_pkt.py # fetch PKT v3.0.2 from Zenodo + derive NodeLabels CSV
python scripts/build_property_graph.py # PKT N-Triples → nodes/edges JSON (full build)
python scripts/build_property_graph.py --sample 10000 # quick test with a random subsample
python scripts/load_graph_to_arangodb.py --db PKT_main # load nodes/edges into ArangoDB

The build and load steps share the same --studies / --cohort / --layers CLI as download_omics.py.
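A shared selection interface of this kind could be declared once with argparse and reused by each script. The sketch below is a reconstruction from the flags and values visible in this README, not the repository's actual parser. The function names are hypothetical, and the fallback to TCGA-BRCA mirrors the documented default.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Selection flags as they appear in the README commands (sketch)."""
    parser = argparse.ArgumentParser(description="Study and layer selection")
    parser.add_argument("--cohort", choices=["tcga", "target", "all"])
    parser.add_argument("--studies", nargs="+", metavar="STUDY")
    # Layer names are left unrestricted because the full list is not given here.
    parser.add_argument("--layers", nargs="+", metavar="LAYER")
    return parser


def select(argv):
    """Parse argv and apply the documented single-study default."""
    args = build_parser().parse_args(argv)
    if args.studies is None and args.cohort is None:
        args.studies = ["TCGA-BRCA"]  # README default: TCGA-BRCA test run
    return args
```

For example, `select(["--studies", "TCGA-BRCA", "TCGA-LUAD", "--layers", "gene_expression", "mirna"])` yields two studies and two layers, while `select([])` falls back to the single TCGA-BRCA study.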
# Build JSON collections (default: TCGA-BRCA, all layers)
python scripts/build_omics_collections.py
python scripts/build_omics_collections.py --cohort tcga
python scripts/build_omics_collections.py --studies TCGA-BRCA TCGA-LUAD --layers gene_expression mirna
# Load into ArangoDB
python scripts/load_omics_collections_to_arangodb.py --db PKT_main
python scripts/load_omics_collections_to_arangodb.py --cohort tcga --db PKT_main
python scripts/load_omics_collections_to_arangodb.py --studies TCGA-BRCA --layers protein methylation

python scripts/build_kg_transomics.py # per-sample trans-omic subgraph
python scripts/build_transomic_network.py # cohort-level trans-omic network
python scripts/analyze_kg_transomics.py # subgraph descriptive statistics

The three use cases on TCGA-BRCA exercise the framework at predicate, gene, and phenotype granularity. Each script reads from ArangoDB and writes results + figures to results/ucN/.
| Script | What it does |
|---|---|
| use_case_1.py | Predicate-stratified mRNA–mRNA coherence (co-pathway, molecular interaction, genetic interaction, co-disease) vs random baseline; hop-distance decay. |
| use_case_2.py | Per-gene CNV→mRNA→protein discordance classification; GO, pathway, and KG-predicate enrichment on discordant genes. |
| use_case_3.py | Phenotype-anchored two-hop traversal from HP:0003002 (Breast carcinoma); multi-layer projection of the resulting gene set against per-layer random baselines. |
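To make the idea of a two-hop traversal from a phenotype anchor concrete, the sketch below runs a bounded breadth-first search over a toy adjacency list. The graph contents and node names other than HP:0003002 are invented, and edge types and direction are ignored. The real use case runs against ArangoDB and may filter edges differently.

```python
from collections import deque
from typing import Dict, List, Set


def nodes_within_hops(graph: Dict[str, List[str]], start: str, max_hops: int) -> Set[str]:
    """Breadth-first search returning nodes reachable in 1..max_hops steps."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, []):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return {n for n, d in distance.items() if d >= 1}


# Toy graph: all nodes except the anchor are placeholders.
toy_graph = {
    "HP:0003002": ["disease_X"],
    "disease_X": ["gene_1", "gene_2"],
    "gene_1": ["gene_3"],
}
reached = nodes_within_hops(toy_graph, "HP:0003002", 2)
```

Here `reached` contains disease_X, gene_1, and gene_2. gene_3 sits three hops from the anchor and is excluded.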
python use_case_1.py
python use_case_2.py
python use_case_3.py

To regenerate only the figures from cached results (no ArangoDB connection needed):
python use_case_N.py --skip-analysis

The classification of output files (R = result, C = plot cache, F = figure) and which files are required by --skip-analysis are documented in use_case_readme.md.
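The figures-only mode follows a common compute-or-load pattern, sketched below with the standard library. The helper name, the JSON cache format, and the file layout are assumptions for illustration. The files each use case actually needs are listed in use_case_readme.md.

```python
import json
from pathlib import Path
from typing import Any, Callable


def load_or_compute(cache_file: Path, compute: Callable[[], Any], skip_analysis: bool) -> Any:
    """Return cached results when skip_analysis is set, else compute and cache."""
    if skip_analysis:
        # Figures-only mode: the cache must already exist, no database access.
        return json.loads(cache_file.read_text())
    result = compute()  # in the real scripts this step would query ArangoDB
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(json.dumps(result))
    return result
```

With this shape, the plotting code receives the same data structure in both modes and only the analysis branch needs a database connection.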
.
├── data/
│ ├── mappings/ # curated identifier-mapping tables (shipped)
│ ├── pkt/ # PKT build target (external, see above)
│ ├── omics/ # TCGA/TARGET dumps (external)
│ └── ...
├── docs/ # database schema, layer structure, methods
├── scripts/
│ ├── mapping/ # provenance scripts for data/mappings/ (not part of the pipeline)
│ ├── stats/ # database statistics
│ ├── download_omics.py # fetch TCGA/TARGET layers from UCSC Xena GDC hub
│ ├── download_pkt.py # fetch PheKnowLator v3.0.2 build from Zenodo + derive NodeLabels CSV
│ ├── omics_utils.py # read/manipulate locally-downloaded omic matrices
│ ├── pkt_utils.py # PKT tar/RDF reader helpers
│ ├── build_property_graph.py
│ ├── load_graph_to_arangodb.py
│ ├── build_omics_collections.py
│ ├── load_omics_collections_to_arangodb.py
│ ├── build_kg_transomics.py
│ ├── build_transomic_network.py
│ ├── analyze_kg_transomics.py
│ └── arangodb_utils.py
├── use_case_1.py # UC1 — predicate-stratified coherence
├── use_case_2.py # UC2 — multi-layer discordance
├── use_case_3.py # UC3 — phenotype-anchored subgraph
├── results/ # per-use-case CSV/JSON results + figures
└── figures/ # framework architecture figure
De Filippis, G. M., Rinaldi, A. M. Modeling Omics with Semantics for Dynamic Construction of Knowledge-Based Trans-Omic Networks. (under review).
MIT
