feat: add FUMA Gene2Func hypergeometric gene-set enrichment method #1229
Open
xyg123 wants to merge 1 commit into
Conversation
Contributor
Pull request overview
Adds a new TissueEnrichment class in src/gentropy/method/fuma_gene2func.py implementing FUMA's gene2func hypergeometric gene-set enrichment. Given a scored gene DataFrame (e.g. L2G predictions or OT association scores) and a long-format gene-sets DataFrame (e.g. GTEx DEGs), it tests whether prioritised genes are over-represented in each gene set per group, computing fold enrichment, one-sided hypergeometric p-values, Bonferroni, and per-group BH FDR.
Changes:
- New module fuma_gene2func.py with TissueEnrichment.tissue_enrichment public entry point and _compute_enrichment core.
- Automatic identifier resolution: studyLocusId -> studyId via credible_set_df, optional studyId -> diseaseId via study_index_df (with array diseaseIds exploded).
- Per-group BH FDR implemented with a cumulative-min window over descending ranks; SciPy hypergeom.sf invoked via a regular PySpark UDF.
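As a rough illustration of the statistics named in the overview, the sketch below computes the overlap, expected overlap, fold enrichment, one-sided hypergeometric p-value and Bonferroni-adjusted p-value for a single gene set. It is not the PR's implementation: the PR calls SciPy's hypergeom.sf through a PySpark UDF, whereas this stand-in uses math.comb so that it runs without dependencies. It assumes the over-representation convention P(X >= k). The function names (hypergeom_upper_tail, enrich_one_set) and the toy gene identifiers are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k: int, n_background: int, k_gene_set: int, n_input: int) -> float:
    """Return P(X >= k) for X ~ Hypergeometric(N=n_background, K=k_gene_set, n=n_input)."""
    total = comb(n_background, n_input)
    upper = min(k_gene_set, n_input)
    # Sum the probability mass of every overlap size from k up to the maximum possible.
    tail = sum(
        comb(k_gene_set, i) * comb(n_background - k_gene_set, n_input - i)
        for i in range(max(k, 0), upper + 1)
    )
    return tail / total


def enrich_one_set(input_genes, gene_set, background, n_sets):
    """Enrichment statistics for one gene set (hypothetical helper, not from the PR)."""
    background = set(background)
    # Restrict both lists to the background so the counts are comparable.
    input_genes = set(input_genes) & background
    gene_set = set(gene_set) & background
    k_overlap = len(input_genes & gene_set)
    expected = len(input_genes) * len(gene_set) / len(background)
    fold = k_overlap / expected if expected > 0 else 0.0
    p_value = hypergeom_upper_tail(
        k_overlap, len(background), len(gene_set), len(input_genes)
    )
    return {
        "k_overlap": k_overlap,
        "expected_overlap": expected,
        "fold_enrichment": fold,
        "p_value": p_value,
        "p_bonferroni": min(1.0, p_value * n_sets),
    }


# Toy data: 20 background genes, 5 prioritised genes, a 6-gene set sharing 4 of them.
background = [f"g{i}" for i in range(20)]
prioritised = ["g0", "g1", "g2", "g3", "g4"]
tissue_set = ["g0", "g1", "g2", "g3", "g10", "g11"]
result = enrich_one_set(prioritised, tissue_set, background, n_sets=3)
```

With a library available, the tail probability would normally come from SciPy's hypergeom.sf evaluated at k - 1, since the survival function is P(X > x).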
Comment on lines +102 to +108
# Align gene column name in gene_sets_df to match gene_col.
# gene_sets_df is expected to have exactly two columns: set_col and a gene column.
# If that gene column is named differently from gene_col (e.g. "geneId" vs "targetId"),
# rename it so all subsequent joins use a consistent column name.
gene_sets_gene_cols = [c for c in gene_sets_df.columns if c != set_col]
if len(gene_sets_gene_cols) == 1 and gene_sets_gene_cols[0] != gene_col:
    gene_sets_df = gene_sets_df.withColumnRenamed(gene_sets_gene_cols[0], gene_col)

disease_mapping = disease_mapping.withColumnRenamed(
    study_disease_col, "diseaseId"
)
working = working.join(disease_mapping, on="studyId", how="left")
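The identifier resolution described in the overview (studyLocusId -> studyId, then an optional left join to diseaseId with the array column exploded) can be pictured with plain Python dictionaries. This only illustrates the join semantics and is not PySpark code from the PR. The function name resolve_ids, the row layout and all identifiers are made up, and it assumes that studies without a disease mapping keep a row with a null diseaseId, as a left join would give.

```python
def resolve_ids(rows, locus_to_study, study_to_diseases=None):
    """Attach studyId, and optionally one row per diseaseId, to scored gene rows."""
    resolved = []
    for row in rows:
        # studyLocusId -> studyId lookup (None when the locus is unknown).
        study_id = locus_to_study.get(row["studyLocusId"])
        base = {**row, "studyId": study_id}
        if study_to_diseases is None:
            resolved.append(base)
            continue
        # "Explode" the disease array: one output row per disease.
        # A missing or empty mapping keeps the row with diseaseId=None (left join).
        diseases = study_to_diseases.get(study_id) or [None]
        for disease_id in diseases:
            resolved.append({**base, "diseaseId": disease_id})
    return resolved


rows = [
    {"studyLocusId": "sl1", "geneId": "ENSG_A"},
    {"studyLocusId": "sl2", "geneId": "ENSG_B"},
]
locus_to_study = {"sl1": "S1", "sl2": "S2"}
study_to_diseases = {"S1": ["EFO_1", "EFO_2"]}
resolved = resolve_ids(rows, locus_to_study, study_to_diseases)
```

Because each study can map to several diseases, the exploded table can contain the same gene more than once per study, which matters when counting input genes per group.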
Comment on lines +185 to +232
n_sets = float(gene_sets_df.select(set_col).distinct().count())

# BH FDR windows (per group, proper step-down monotonicity)
w_asc = Window.partitionBy(*group_cols).orderBy(f.col("p_value").asc())
w_desc_cummin = (
    Window.partitionBy(*group_cols)
    .orderBy(f.col("_rank").desc())
    .rowsBetween(Window.unboundedPreceding, 0)
)

return (
    counts.withColumn(
        "expected_overlap",
        f.col("n_input").cast(DoubleType())
        * f.col("k_gene_set").cast(DoubleType())
        / f.col("n_background").cast(DoubleType()),
    )
    .withColumn(
        "fold_enrichment",
        f.when(
            f.col("expected_overlap") > 0,
            f.col("k_overlap").cast(DoubleType()) / f.col("expected_overlap"),
        ).otherwise(f.lit(0.0).cast(DoubleType())),
    )
    .withColumn(
        "p_value",
        _hypergeom_sf(
            f.col("k_overlap").cast("int"),
            f.col("n_background").cast("int"),
            f.col("k_gene_set").cast("int"),
            f.col("n_input").cast("int"),
        ),
    )
    .withColumn(
        "p_bonferroni",
        f.least(f.lit(1.0), f.col("p_value") * f.lit(n_sets)),
    )
    .withColumn("_rank", f.rank().over(w_asc))
    .withColumn(
        "_bh_raw",
        f.least(
            f.lit(1.0),
            f.col("p_value")
            * f.lit(n_sets)
            / f.col("_rank").cast(DoubleType()),
        ),
    )
    .withColumn("p_fdr_bh", f.min("_bh_raw").over(w_desc_cummin))
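The window logic in the excerpt above can be mirrored in plain Python for a single group, which makes the cumulative-minimum step easier to follow. This is a minimal sketch rather than the PR's code: the function name bh_adjust is hypothetical, the number of tests m defaults to the number of p-values supplied (the quoted diff multiplies by n_sets instead), and ties receive consecutive ranks rather than the shared ranks that Spark's rank() would assign.

```python
def bh_adjust(p_values, m=None):
    """Benjamini-Hochberg adjusted p-values for one group of tests."""
    m = len(p_values) if m is None else m
    # Indices of the p-values sorted ascending; position + 1 is the rank.
    order = sorted(range(len(p_values)), key=lambda i: p_values[i])
    adjusted = [0.0] * len(p_values)
    running_min = 1.0
    # Walk from the largest rank down, carrying the minimum seen so far.
    # This matches a cumulative min over a window ordered by rank descending.
    for rank in range(len(order), 0, -1):
        idx = order[rank - 1]
        raw = min(1.0, p_values[idx] * m / rank)
        running_min = min(running_min, raw)
        adjusted[idx] = running_min
    return adjusted


# Toy p-values for four gene sets tested within one group.
p_values = [0.01, 0.04, 0.03, 0.5]
p_fdr = bh_adjust(p_values)
```

In this toy case the raw value for the second-smallest p-value (0.03 * 4 / 2 = 0.06) is pulled down to 0.0533 by the cumulative minimum, because the next rank has a smaller raw value. That is the monotonicity the descending window is there to enforce.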

from pyspark.sql import DataFrame


class TissueEnrichment:

@@ -0,0 +1,355 @@
"""Hypergeometric tissue enrichment for GWAS-prioritised genes (FUMA gene2func approach)."""
✨ Context
To leverage the gene prioritisation results available from the Open Targets Platform and identify relevant biosamples for each study and trait, this PR implements the computationally simple hypergeometric test used in FUMA to infer biosample significance. It can serve as a starting point for a biosample enrichment catalog covering studies that do not have full summary statistics.
opentargets/issues#4389
🛠 What does this PR implement
🚦 Before submitting
- … dev branch?
- … (make test)?
- … (uv run pre-commit run --all-files)?