feat: add FUMA Gene2Func hypergeometric gene-set enrichment method by xyg123 · Pull Request #1229 · opentargets/gentropy

xyg123 · 2026-05-28T13:28:59Z

✨ Context

To leverage the available gene prioritisation results from the open targets platform and produce relevant biosamples for each study and trait, this PR implements the computationally simple hypergeometric test used in FUMA to infer biosample significance and can be a starting point for the biosample enrichment catalog for studies which do not have full summary statistics.

opentargets/issues#4389

🛠 What does this PR implement

This PR adds FUMA Gene2Func tissue enrichment functionality in src/gentropy/method/fuma_gene2func.py, which is a hypergeometric enrichment approach for GWAS-prioritised genes
Takes any scored gene DataFrame (L2G predictions or OT association scores) and a long-format gene sets DataFrame (e.g. GTEx DEGs), and tests whether prioritised genes are over-represented in each gene set relative to a per-group background universe.
Automatically resolves studyLocusId -> studyId via a credible set DataFrame, and optionally joins to a study index to produce per-(study, disease) results — group columns are inferred from whatever identifiers are present in the input.
Returns one row per (group × gene set) with enrichment counts, fold enrichment, one-sided hypergeometric p-value, Bonferroni correction, and BH FDR.

🚦 Before submitting

Do these changes cover one single feature (one change at a time)?
Did you read the contributor guideline?
Did you make sure to update the documentation with your changes?
Did you make sure there is no commented out code in this PR?
Did you follow conventional commits standards in PR title and commit messages?
Did you make sure the branch is up-to-date with the dev branch?
Did you write any new necessary tests?
Did you make sure the changes pass local tests (make test)?
Did you make sure the changes pass pre-commit rules (e.g uv run pre-commit run --all-files)?

Copilot

Pull request overview

Adds a new TissueEnrichment class in src/gentropy/method/fuma_gene2func.py implementing FUMA's gene2func hypergeometric gene-set enrichment. Given a scored gene DataFrame (e.g. L2G predictions or OT association scores) and a long-format gene-sets DataFrame (e.g. GTEx DEGs), it tests whether prioritised genes are over-represented in each gene set per group, computing fold enrichment, one-sided hypergeometric p-values, Bonferroni, and per-group BH FDR.

Changes:

New module fuma_gene2func.py with TissueEnrichment.tissue_enrichment public entry point and _compute_enrichment core.
Automatic identifier resolution: studyLocusId -> studyId via credible_set_df, optional studyId -> diseaseId via study_index_df (with array diseaseIds exploded).
Per-group BH FDR implemented with a cumulative-min window over descending ranks; SciPy hypergeom.sf invoked via a regular PySpark UDF.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

+        # Align gene column name in gene_sets_df to match gene_col.
+        # gene_sets_df is expected to have exactly two columns: set_col and a gene column.
+        # If that gene column is named differently from gene_col (e.g. "geneId" vs "targetId"),
+        # rename it so all subsequent joins use a consistent column name.
+        gene_sets_gene_cols = [c for c in gene_sets_df.columns if c != set_col]
+        if len(gene_sets_gene_cols) == 1 and gene_sets_gene_cols[0] != gene_col:
+            gene_sets_df = gene_sets_df.withColumnRenamed(gene_sets_gene_cols[0], gene_col)


+                disease_mapping = disease_mapping.withColumnRenamed(
+                    study_disease_col, "diseaseId"
+                )
+            working = working.join(disease_mapping, on="studyId", how="left")


+        n_sets = float(gene_sets_df.select(set_col).distinct().count())
+
+        # BH FDR windows (per group, proper step-down monotonicity)
+        w_asc = Window.partitionBy(*group_cols).orderBy(f.col("p_value").asc())
+        w_desc_cummin = (
+            Window.partitionBy(*group_cols)
+            .orderBy(f.col("_rank").desc())
+            .rowsBetween(Window.unboundedPreceding, 0)
+        )
+
+        return (
+            counts.withColumn(
+                "expected_overlap",
+                f.col("n_input").cast(DoubleType())
+                * f.col("k_gene_set").cast(DoubleType())
+                / f.col("n_background").cast(DoubleType()),
+            )
+            .withColumn(
+                "fold_enrichment",
+                f.when(
+                    f.col("expected_overlap") > 0,
+                    f.col("k_overlap").cast(DoubleType()) / f.col("expected_overlap"),
+                ).otherwise(f.lit(0.0).cast(DoubleType())),
+            )
+            .withColumn(
+                "p_value",
+                _hypergeom_sf(
+                    f.col("k_overlap").cast("int"),
+                    f.col("n_background").cast("int"),
+                    f.col("k_gene_set").cast("int"),
+                    f.col("n_input").cast("int"),
+                ),
+            )
+            .withColumn(
+                "p_bonferroni",
+                f.least(f.lit(1.0), f.col("p_value") * f.lit(n_sets)),
+            )
+            .withColumn("_rank", f.rank().over(w_asc))
+            .withColumn(
+                "_bh_raw",
+                f.least(
+                    f.lit(1.0),
+                    f.col("p_value")
+                    * f.lit(n_sets)
+                    / f.col("_rank").cast(DoubleType()),
+                ),
+            )
+            .withColumn("p_fdr_bh", f.min("_bh_raw").over(w_desc_cummin))


+    from pyspark.sql import DataFrame
+
+
+class TissueEnrichment:


@@ -0,0 +1,355 @@
+"""Hypergeometric tissue enrichment for GWAS-prioritised genes (FUMA gene2func approach)."""


feat: add TissueEnrichment hypergeometric gene-set enrichment method

e8fd4ee

github-actions Bot added size-M Method Feature labels May 28, 2026

xyg123 requested a review from addramir May 28, 2026 13:37

addramir requested a review from Copilot May 28, 2026 15:24

Copilot started reviewing on behalf of addramir May 28, 2026 15:24 View session

Copilot AI reviewed May 28, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add FUMA Gene2Func hypergeometric gene-set enrichment method#1229

feat: add FUMA Gene2Func hypergeometric gene-set enrichment method#1229
xyg123 wants to merge 1 commit into
devfrom
xg1_gene2func

xyg123 commented May 28, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

		@@ -0,0 +1,355 @@
		"""Hypergeometric tissue enrichment for GWAS-prioritised genes (FUMA gene2func approach)."""

Conversation

xyg123 commented May 28, 2026

✨ Context

🛠 What does this PR implement

🚦 Before submitting

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants