Skip to content

feat: add FUMA Gene2Func hypergeometric gene-set enrichment method#1229

Open
xyg123 wants to merge 1 commit into
devfrom
xg1_gene2func
Open

feat: add FUMA Gene2Func hypergeometric gene-set enrichment method#1229
xyg123 wants to merge 1 commit into
devfrom
xg1_gene2func

Conversation

@xyg123
Copy link
Copy Markdown
Contributor

@xyg123 xyg123 commented May 28, 2026

✨ Context

To leverage the available gene prioritisation results from the open targets platform and produce relevant biosamples for each study and trait, this PR implements the computationally simple hypergeometric test used in FUMA to infer biosample significance and can be a starting point for the biosample enrichment catalog for studies which do not have full summary statistics.

opentargets/issues#4389

🛠 What does this PR implement

  • This PR adds FUMA Gene2Func tissue enrichment functionality in src/gentropy/method/fuma_gene2func.py, which is a hypergeometric enrichment approach for GWAS-prioritised genes
  • Takes any scored gene DataFrame (L2G predictions or OT association scores) and a long-format gene sets DataFrame (e.g. GTEx DEGs), and tests whether prioritised genes are over-represented in each gene set relative to a per-group background universe.
  • Automatically resolves studyLocusId -> studyId via a credible set DataFrame, and optionally joins to a study index to produce per-(study, disease) results — group columns are inferred from whatever identifiers are present in the input.
  • Returns one row per (group × gene set) with enrichment counts, fold enrichment, one-sided hypergeometric p-value, Bonferroni correction, and BH FDR.

🚦 Before submitting

  • Do these changes cover one single feature (one change at a time)?
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes?
  • Did you make sure there is no commented out code in this PR?
  • Did you follow conventional commits standards in PR title and commit messages?
  • Did you make sure the branch is up-to-date with the dev branch?
  • Did you write any new necessary tests?
  • Did you make sure the changes pass local tests (make test)?
  • Did you make sure the changes pass pre-commit rules (e.g uv run pre-commit run --all-files)?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a new TissueEnrichment class in src/gentropy/method/fuma_gene2func.py implementing FUMA's gene2func hypergeometric gene-set enrichment. Given a scored gene DataFrame (e.g. L2G predictions or OT association scores) and a long-format gene-sets DataFrame (e.g. GTEx DEGs), it tests whether prioritised genes are over-represented in each gene set per group, computing fold enrichment, one-sided hypergeometric p-values, Bonferroni, and per-group BH FDR.

Changes:

  • New module fuma_gene2func.py with TissueEnrichment.tissue_enrichment public entry point and _compute_enrichment core.
  • Automatic identifier resolution: studyLocusId -> studyId via credible_set_df, optional studyId -> diseaseId via study_index_df (with array diseaseIds exploded).
  • Per-group BH FDR implemented with a cumulative-min window over descending ranks; SciPy hypergeom.sf invoked via a regular PySpark UDF.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +102 to +108
# Align gene column name in gene_sets_df to match gene_col.
# gene_sets_df is expected to have exactly two columns: set_col and a gene column.
# If that gene column is named differently from gene_col (e.g. "geneId" vs "targetId"),
# rename it so all subsequent joins use a consistent column name.
gene_sets_gene_cols = [c for c in gene_sets_df.columns if c != set_col]
if len(gene_sets_gene_cols) == 1 and gene_sets_gene_cols[0] != gene_col:
gene_sets_df = gene_sets_df.withColumnRenamed(gene_sets_gene_cols[0], gene_col)
disease_mapping = disease_mapping.withColumnRenamed(
study_disease_col, "diseaseId"
)
working = working.join(disease_mapping, on="studyId", how="left")
Comment on lines +185 to +232
n_sets = float(gene_sets_df.select(set_col).distinct().count())

# BH FDR windows (per group, proper step-down monotonicity)
w_asc = Window.partitionBy(*group_cols).orderBy(f.col("p_value").asc())
w_desc_cummin = (
Window.partitionBy(*group_cols)
.orderBy(f.col("_rank").desc())
.rowsBetween(Window.unboundedPreceding, 0)
)

return (
counts.withColumn(
"expected_overlap",
f.col("n_input").cast(DoubleType())
* f.col("k_gene_set").cast(DoubleType())
/ f.col("n_background").cast(DoubleType()),
)
.withColumn(
"fold_enrichment",
f.when(
f.col("expected_overlap") > 0,
f.col("k_overlap").cast(DoubleType()) / f.col("expected_overlap"),
).otherwise(f.lit(0.0).cast(DoubleType())),
)
.withColumn(
"p_value",
_hypergeom_sf(
f.col("k_overlap").cast("int"),
f.col("n_background").cast("int"),
f.col("k_gene_set").cast("int"),
f.col("n_input").cast("int"),
),
)
.withColumn(
"p_bonferroni",
f.least(f.lit(1.0), f.col("p_value") * f.lit(n_sets)),
)
.withColumn("_rank", f.rank().over(w_asc))
.withColumn(
"_bh_raw",
f.least(
f.lit(1.0),
f.col("p_value")
* f.lit(n_sets)
/ f.col("_rank").cast(DoubleType()),
),
)
.withColumn("p_fdr_bh", f.min("_bh_raw").over(w_desc_cummin))
from pyspark.sql import DataFrame


class TissueEnrichment:
@@ -0,0 +1,355 @@
"""Hypergeometric tissue enrichment for GWAS-prioritised genes (FUMA gene2func approach)."""
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants