a python package for converting gene sets to biological pathway embeddings with enrichment analysis attributes.
gene sets (from clusters, niches, studies) are tested against reactome pathways via fisher's exact test, then embedded into a shared vector space using metapath2vec over the pathway hierarchy graph.
enrichment is either fisher's exact (a binary top-N gene set per cluster/niche) or niche-level AUCell (the full per-niche expression ranking, no gene-set selection step). AUCell here scores each niche's aggregated pseudobulk profile, one score per (niche, pathway); it is not single-cell AUCell (the package never sees individual cells, niche construction is upstream). both enrichment sources feed the same reactome hierarchy graph and metapath2vec embedding.
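to make the niche-level AUCell idea concrete, here is a minimal, dependency-free sketch of an AUCell-style score for one niche and one pathway. it is a simplification for illustration, not the package's implementation (the aucell extra pulls in decoupler for that). the function name `toy_aucell_score`, the `top_frac` parameter, the tie handling and the expression values are all assumptions introduced here.

```python
def toy_aucell_score(expression, gene_set, top_frac=0.05):
    """AUCell-style score for one pseudobulk profile and one gene set.

    expression: dict gene -> aggregated expression for a single niche.
    gene_set:   iterable of pathway genes.
    top_frac:   fraction of the ranking that is scanned (assumed value).
    """
    # rank genes from highest to lowest expression (ties keep dict order)
    ranked = sorted(expression, key=expression.get, reverse=True)
    cutoff = max(1, round(top_frac * len(ranked)))
    members = set(gene_set) & set(ranked)
    if not members:
        return 0.0

    hits = 0
    area = 0
    for gene in ranked[:cutoff]:
        if gene in members:
            hits += 1
        area += hits  # height of the recovery curve at this rank

    # best case: every member sits at the very top of the ranking
    best = sum(min(rank, len(members)) for rank in range(1, cutoff + 1))
    return area / best


# toy pseudobulk profile for one niche (values are made up)
niche_profile = {"EGFR": 9.0, "EGF": 7.5, "CD8A": 0.4, "GZMB": 0.1}
score = toy_aucell_score(niche_profile, {"EGFR", "EGF"}, top_frac=0.5)
```

the score uses the whole ranking rather than a thresholded gene list, which is the contrast with the fisher path described above: no top-N selection happens before scoring.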
gene sets (per cluster/niche/study)
|
v
pathway filtering (level: high/mid/low, gene type: TF, etc.)
|
v
enrichment analysis (fisher's exact test, fdr correction)
|
v
EA matrix: cluster x pathway (1-fdr or odds ratio weights)
|
v
pathway hierarchy graph (reactome) + cluster nodes with EA edges
|
v
metapath2vec random walks (weighted, type-biased)
|
v
skipgram embeddings (default 512-d)
|
v
cluster x 512 embedding matrix
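the enrichment and EA matrix steps in the diagram above can be sketched without dependencies as follows. this assumes a one-sided (over-representation) fisher's exact test and benjamini-hochberg for the fdr correction; the sidedness and the gene universe used by the package are not stated here, so treat both as assumptions. the helper names and the toy pathway ids are hypothetical.

```python
from math import comb


def fisher_enrichment_p(overlap, set_size, pathway_size, universe):
    """One-sided fisher's exact p-value: P(X >= overlap) under the hypergeometric null."""
    total = comb(universe, set_size)
    upper = min(set_size, pathway_size)
    tail = sum(
        comb(pathway_size, i) * comb(universe - pathway_size, set_size - i)
        for i in range(overlap, upper + 1)
    )
    return tail / total


def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def toy_ea_row(gene_set, pathways, universe_size):
    """One EA matrix row: pathway id -> 1 - fdr."""
    ids = list(pathways)
    pvals = [
        fisher_enrichment_p(len(gene_set & pathways[p]), len(gene_set),
                            len(pathways[p]), universe_size)
        for p in ids
    ]
    return {p: 1.0 - q for p, q in zip(ids, benjamini_hochberg(pvals))}


# toy data: pathway ids, memberships and universe size are made up
pathways = {
    "TOY_PATHWAY_A": {"EGFR", "EGF", "FGFR2"},
    "TOY_PATHWAY_B": {"CD8A", "CD8B", "GZMB"},
}
row = toy_ea_row({"EGFR", "EGF", "FGFR2"}, pathways, universe_size=20)
```

stacking one such row per cluster gives the cluster x pathway matrix with 1 - fdr weights; an odds ratio variant would swap the value stored per pathway.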
the Net class builds a heterogeneous networkx graph with two node types:
- pathway nodes (reactome stIds): the reactome homo sapiens pathway hierarchy
- cluster nodes (your gene lists of interest): added from enrichment results
and two edge types:
- pathway - pathway: parent-child relations from the reactome hierarchy
- cluster - pathway: weighted edges from enrichment analysis (1 - fdr or odds ratio). each cluster connects to its significantly enriched pathways.
genes are not in the graph. they are used upstream in the enrichment step to determine which pathways are significant, but only pathways and clusters appear as nodes.
pathway node attributes:
- node_type: "sig" or "notsig" (fdr < 0.05)
- het: 1 (sig), 0 (notsig), or -1 (not in enrichment results)
- feature: fdr value (raw)
- features: [1 - fdr] (inverted, used as weight)
- stId: reactome stable identifier
- pathway_name: human-readable name
- parent_pathway: top-level reactome category
cluster node attributes:
- node_type: "cluster"
- cluster: cluster name
cluster-pathway edge attributes:
- weight: enrichment score (1 - fdr or odds ratio)
the graph can be filtered by pathway level (high/mid/low) and by gene membership before construction. set digraph=True for a directed graph, induce=True to keep only significant pathways.
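as a rough illustration of the node and edge layout described above, the sketch below builds the same structure with plain dicts so it runs without networkx. it is not the Net class: `build_toy_graph`, the `edge_type` attribute, the handling of pathways missing from the enrichment results (labelled "notsig" with empty features) and the toy ids are assumptions made for this example.

```python
def build_toy_graph(hierarchy, fdr_by_pathway, clusters, alpha=0.05, induce=False):
    """Return (nodes, edges) for a pathway + cluster graph.

    hierarchy:      iterable of (parent_stId, child_stId) pairs.
    fdr_by_pathway: dict stId -> fdr for pathways present in the enrichment results.
    clusters:       dict cluster name -> {stId: weight} for its significant pathways.
    """
    nodes, edges = {}, {}

    def add_pathway(st_id):
        fdr = fdr_by_pathway.get(st_id)
        if fdr is None:
            # not in the enrichment results: het = -1 (other fields are assumed)
            attrs = {"node_type": "notsig", "het": -1, "feature": None, "features": None}
        else:
            sig = fdr < alpha
            attrs = {"node_type": "sig" if sig else "notsig", "het": int(sig),
                     "feature": fdr, "features": [1 - fdr]}
        nodes[st_id] = {"stId": st_id, **attrs}

    for parent, child in hierarchy:
        add_pathway(parent)
        add_pathway(child)
        edges[(parent, child)] = {"edge_type": "pathway-pathway"}

    for name, weights in clusters.items():
        nodes[name] = {"node_type": "cluster", "cluster": name}
        for st_id, weight in weights.items():
            if st_id in nodes:  # skip pathways outside the hierarchy
                edges[(name, st_id)] = {"edge_type": "cluster-pathway", "weight": weight}

    if induce:
        # keep clusters and significant pathways; drop edges touching removed nodes
        keep = {n for n, a in nodes.items() if a["node_type"] in ("sig", "cluster")}
        nodes = {n: a for n, a in nodes.items() if n in keep}
        edges = {e: a for e, a in edges.items() if e[0] in keep and e[1] in keep}
    return nodes, edges


# toy hierarchy: ids are placeholders, not real reactome stIds
hierarchy = [("TOY_ROOT", "TOY_MID"), ("TOY_MID", "TOY_LEAF")]
fdr = {"TOY_MID": 0.01, "TOY_LEAF": 0.2}
clusters = {"cluster_0": {"TOY_MID": 0.99}}

nodes, edges = build_toy_graph(hierarchy, fdr, clusters)
induced_nodes, induced_edges = build_toy_graph(hierarchy, fdr, clusters, induce=True)
```

note that genes never appear: the only inputs are pathway ids, fdr values and cluster-to-pathway weights, which matches the statement that genes are used upstream only.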
metapath2vec performs biased random walks on the network, then trains a skipgram model to learn a dense vector for every node.
the walks are type-aware: metapaths like [sig, notsig, sig] or [cluster, sig, sig] guide the walker to follow specific node-type sequences. edge weights from the enrichment analysis bias which neighbors get visited, so clusters with strong signal to specific pathways walk there more often.
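a single type-guided, weight-biased walk can be sketched as below. this is a simplified stand-in, not PathwayMetapath2vec: it follows the metapath once and stops (how the package extends a walk to walk_length is not shown here), it assumes pathway-pathway edges carry weight 1.0, and `metapath_walk` plus the toy graph are hypothetical.

```python
import random


def metapath_walk(adj, node_type, start, metapath, rng):
    """Follow `metapath` (a list of node types) once, starting at `start`.

    adj:       dict node -> {neighbor: weight}.
    node_type: dict node -> type label ("cluster", "sig", "notsig").
    Returns the visited nodes; the walk ends early at a dead end.
    """
    if node_type[start] != metapath[0]:
        return []
    walk = [start]
    for wanted in metapath[1:]:
        current = walk[-1]
        # only neighbors of the required type are eligible for the next step
        candidates = [(n, w) for n, w in adj[current].items()
                      if node_type[n] == wanted and w > 0]
        if not candidates:
            break
        neighbors, weights = zip(*candidates)
        # heavier edges are sampled proportionally more often
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk


# toy graph: one cluster with a strong and a weak enrichment edge
adj = {
    "cluster_0": {"TOY_A": 0.9, "TOY_B": 0.1},
    "TOY_A": {"cluster_0": 0.9, "TOY_C": 1.0},
    "TOY_B": {"cluster_0": 0.1},
    "TOY_C": {"TOY_A": 1.0},
}
node_type = {"cluster_0": "cluster", "TOY_A": "sig", "TOY_B": "sig", "TOY_C": "sig"}

rng = random.Random(1234)
walks = [metapath_walk(adj, node_type, "cluster_0", ["cluster", "sig", "sig"], rng)
         for _ in range(1000)]
first_steps = [w[1] for w in walks]
```

in this toy graph most walks leave cluster_0 through TOY_A because its edge weight is nine times larger, which is the bias the paragraph above describes. the resulting node sequences are what a skipgram model would then consume as sentences.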
the result is a shared embedding space where:
- pathway embeddings (pathway x dim) capture where each pathway sits in the reactome hierarchy and how it relates to other pathways through enrichment patterns. pathways that are structurally close in reactome or co-enriched across clusters end up with similar vectors.
- cluster embeddings (cluster x dim) capture each cluster's biological function as a position in pathway space. two clusters with similar pathway enrichment profiles end up close together, but unlike the raw EA matrix, the embedding also encodes the hierarchical relationships between their enriched pathways. a cluster enriched in "FGFR2 alternative splicing" and one enriched in "signaling by FGFR" will be closer than two clusters enriched in unrelated pathways, even if neither shares the exact same significant pathway.
- EA matrix (cluster x pathway) is the interpretable complement to the embeddings. each row is a cluster's pathway activity profile with explicit scores (1 - fdr or odds ratio). it serves as ground truth for what the embeddings encode and can be used directly for comparison across studies via cosine similarity.
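comparing two EA matrix rows by cosine similarity can be sketched as follows. rows are represented as dicts of pathway id to score, and a pathway missing from one row is treated as 0; that convention, the function name and the toy scores are assumptions for illustration.

```python
from math import sqrt


def cosine_similarity(row_a, row_b):
    """Cosine similarity between two pathway-score rows (dict: pathway id -> score)."""
    shared = set(row_a) | set(row_b)
    dot = sum(row_a.get(p, 0.0) * row_b.get(p, 0.0) for p in shared)
    norm_a = sqrt(sum(v * v for v in row_a.values()))
    norm_b = sqrt(sum(v * v for v in row_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero profile has no direction to compare
    return dot / (norm_a * norm_b)


# toy rows from two studies (pathway ids and scores are made up)
study1_cluster = {"TOY_A": 0.99, "TOY_B": 0.10}
study2_cluster = {"TOY_A": 0.95, "TOY_B": 0.20}
study2_other = {"TOY_C": 0.97}

similar = cosine_similarity(study1_cluster, study2_cluster)
unrelated = cosine_similarity(study1_cluster, study2_other)
```

rows with no pathway in common score exactly 0 here, which is the limitation the cluster embeddings are meant to address through the hierarchy.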
- EA matrix (cluster x pathway): enrichment weights per cluster, available as 1 - fdr or odds ratio
- cluster embeddings (cluster x dim): one dense vector per cluster encoding pathway activity + graph structure
- pathway embeddings (pathway x dim): one dense vector per pathway encoding hierarchical position + enrichment context
pip install -e . # fisher enrichment path
pip install -e '.[aucell]' # adds the niche-level AUCell path (decoupler, anndata)

from gpath2vec.ea import enrich, ea_matrix, filter_pathways
from gpath2vec.net import Net
from gpath2vec.embedder import PathwayMetapath2vec
# gene sets: dict of {name: [genes]}
gene_sets = {
"cluster_0": ["EGFR", "EGF", "FGFR2", ...],
"cluster_1": ["CD8A", "CD8B", "GZMB", ...],
}
# enrichment (filter to low-level pathways containing TF genes)
ea_df = enrich(gene_sets, level="low", gene_filter=tf_genes)
matrix = ea_matrix(ea_df, weight="fdr") # cluster x pathway
matrix_or = ea_matrix(ea_df, weight="oddsratio")
# build graph with cluster nodes
clusters = {}
for _, r in ea_df[ea_df.sig_pathway].iterrows():
clusters.setdefault(r["cluster"], {})[r["stId"]] = 1 - r["fdr_bh"]
enrichment = [{"stId": r["stId"], "entities": {"fdr": r["fdr_bh"]}}
for _, r in ea_df.drop_duplicates("stId").iterrows()]
net = Net(enrichment=enrichment, id="my_study", digraph=True,
level="low", gene_filter=tf_genes, clusters=clusters)
# embeddings (pick a method)
from gpath2vec.embedder import (
PathwayMetapath2vec, SVDEmbedder, SpectralGraphEmbedder, LINEEmbedder
)
# metapath2vec: weighted random walks + skipgram on the graph
embedder = PathwayMetapath2vec(graph=net.graph, name="my_study",
walks_per_node=10, walk_length=100)
walks = embedder.model
embedder.train_embeddings(walks=walks, dimensions=512, epochs=15, lr=0.005)
# svd: truncated svd on the ea matrix (no graph, baseline)
embedder = SVDEmbedder(matrix, dimensions=512)
# spectral: laplacian eigenmaps on the graph (deterministic)
embedder = SpectralGraphEmbedder(net.graph, dimensions=512)
# line: first + second order proximity on the graph (weighted edges)
embedder = LINEEmbedder(net.graph, dimensions=512, epochs=15, lr=0.005)
embeddings = embedder.get_embeddings()

# enrichment
gpath2vec enrichment --genes "EGFR,EGF,FGFR2" --level low --out-path results.json
# network
gpath2vec network --enrichment-path results.json --level low --out-path net.pkl
# embeddings (default: metapath2vec)
gpath2vec embeddings --network-path net.pkl --dimensions 512 --out-path emb.pkl
# embeddings with alternative methods
gpath2vec embeddings --network-path net.pkl --method svd --ea-matrix-path ea_matrix.csv --out-path emb.pkl
gpath2vec embeddings --network-path net.pkl --method spectral --out-path emb.pkl
gpath2vec embeddings --network-path net.pkl --method line --out-path emb.pkl
gpath2vec embeddings --network-path net.pkl --method vae --ea-matrix-path ea_matrix.csv --out-path emb.pkl
# full pipeline with method choice
gpath2vec end2end --genes "EGFR,EGF" --level low --method vae --output-dir output/
# niche pipeline: enrichment -> graph -> embeddings in one command.
# --enrichment fisher : binary top-N gene set per niche, fisher's exact + fdr.
# --enrichment aucell : niche-level AUCell on each niche's aggregated pseudobulk
# (one score per niche, NOT single-cell), per-niche top-k as edges.
# inputs: --niche-matrix (niches x genes .npz/.npy), --genes (.npy gene order),
# --niche-meta (parquet with a niche_id column).
gpath2vec niche-pipeline \
--niche-matrix niches.npz --genes genes.npy --niche-meta niche_meta.parquet \
--enrichment aucell --reactome-level low --topk 50 \
--reactome-dir /path/to/reactome/cache --out-dir output/

| method | input | training | edge weights | deterministic |
|---|---|---|---|---|
| metapath2vec | graph | skipgram on random walks | yes (biases walks) | no |
| svd | ea matrix | truncated svd | n/a (no graph) | yes |
| spectral | graph | laplacian eigenmaps | yes | yes |
| line | graph | first + second order proximity | yes (samples proportional) | no |
| vae | ea matrix | variational autoencoder | n/a (no graph) | no |
- metapath2vec: best for capturing heterogeneous graph structure (pathway hierarchy + cluster nodes). requires training.
- svd: baseline. operates on the ea matrix directly, no graph structure. fast, deterministic. if svd gives the same results as metapath2vec, the graph isn't adding signal.
- spectral: deterministic embedding from the graph laplacian. good comparison point for metapath2vec without training variance.
- line: handles edge weights more explicitly than metapath2vec. two objectives capture both local (direct neighbors) and global (shared neighbor) structure.
- vae: variational autoencoder on the ea matrix. smooth latent space where similar pathway profiles map nearby. provides uncertainty estimates (latent variance per cluster) and can generate new pathway activity profiles. nonlinear alternative to svd.
all methods are more configurable from python than the cli. for example, vae exposes beta (kl divergence weight), hidden_dim, and the full model for downstream use:
from gpath2vec.embedder import VAEEmbedder
# matrix: the cluster x pathway EA matrix returned by ea_matrix(ea_df, ...)
vae = VAEEmbedder(matrix, dimensions=512, beta=0.5, hidden_dim=256)
embeddings = vae.get_embeddings() # latent means
uncertainty = vae.get_uncertainty() # latent variance per cluster

pathway filtering uses reactome's own classification:
- high: pathways with enhanced high level diagrams (ehld)
- mid: pathways between ehld and sbgn
- low: pathways with sbgn diagrams (most specific)
- all: no filtering
restrict the pathway universe to only pathways containing specific genes of interest (ex. transcription factors from pathway commons):
# TF genes from pathway commons SIF (controls-expression-of)
ea_df = enrich(gene_sets, level="low", gene_filter=tf_genes)

reactome data is downloaded once and cached to ~/.gpath2vec/cache/. set GPATH2VEC_REACTOME_DIR to use a custom cache directory:

export GPATH2VEC_REACTOME_DIR=/path/to/reactome/files

clusters like ARC have no internet on compute nodes, so the cache must be staged from a node that does have internet (ex. a login node):

./pull_reactome_cache.sh /shared/path/reactome_cache # run where there IS internet

this drives the package's real fetchers, so the cache matches exactly what gpath2vec expects. on the compute node, point at it without re-downloading:

gpath2vec niche-pipeline ... --reactome-dir /shared/path/reactome_cache
# or: export GPATH2VEC_REACTOME_DIR=/shared/path/reactome_cache

all stochastic embedders (metapath2vec, line, vae) take a seed (default 1234) that pins the python, numpy and torch rngs, so embeddings are bit-reproducible run to run. the seed is re-applied before training (independent of walk-generation rng) and recorded in run_provenance.json. svd and spectral are deterministic by construction. cli: --seed.
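pinning the three rngs mentioned above typically looks like the sketch below. this is a generic pattern, not the package's own seeding code: `seed_everything` is a hypothetical helper, and numpy and torch are treated as optional so the snippet also runs where they are not installed.

```python
import random


def seed_everything(seed=1234):
    """Seed the python, numpy and torch rngs (the latter two only if installed)."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # numpy not available in this environment
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # torch not available in this environment
    return seed


# same seed, same draw: re-seeding before training makes the run repeatable
seed_everything(1234)
first = random.random()
seed_everything(1234)
second = random.random()
```

re-applying the seed right before training, as described above, means the training rng state does not depend on how many random numbers walk generation consumed.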
- edge2vec: edge-type transition-matrix biased walks as an embedding method.
- lorentz (hyperbolic) pipeline
