From 740ae9214955fe15daafdd1e4cc2f4a945cebd0d Mon Sep 17 00:00:00 2001
From: Stuart Brown
Date: Fri, 1 May 2026 14:27:47 -0400
Subject: [PATCH] Hide high-fanout PMIDs from the gene Record Literature table
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The PubMed query in the gene Record reads from ApidbTuning.GenePubmed_p,
a denormalized cache of all gene <-> PMID associations across every
PubMed-source ExternalDatabase (gene2pubmed, PMID, PubMed, etc.). A large
fraction of these are high-fanout citations, typically NCBI gene2pubmed
bulk-import rows where the same paper is auto-attached to many unrelated
genes, contributing noise but no biological insight to the Literature
section of gene pages.

This change adds a NOT IN subquery to the existing PubMed query that
excludes any pubmed_id whose distinct-gene fanout in
ApidbTuning.GenePubmed_p is greater than 100. The data is not modified;
the filter is purely at presentation time.

The threshold applies to every PMID in the cache regardless of source.
Curator- and Apollo-submitted citations are typically single-gene or
low-fanout and pass through; the rare curator entry with >100 gene fanout
is filtered alongside the bulk-import noise, which is the intended
behavior since high fanout means low per-gene specificity regardless of
provenance.

Performance: the subquery operates on ApidbTuning.GenePubmed_p, whose
indexes (gpm_gene_idx, gpm_tx_idx) cover pubmed_id and gene_source_id, so
the GROUP BY can be index-scanned. If page load latency becomes a concern
on larger sites, a follow-up PR can promote the high-fanout PMID list to
its own tuning table.
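
The filter's behavior can be sketched on toy data. This is a minimal
illustration only, assuming an in-memory SQLite database in place of the
production one and an unqualified GenePubmed_p table with just the two
relevant columns; visible_pmids and the sample rows are hypothetical and
not part of this patch.

```python
import sqlite3

FANOUT_THRESHOLD = 100  # same cutoff as the patch


def visible_pmids(rows, threshold=FANOUT_THRESHOLD):
    """Return {gene_source_id: [pubmed_id, ...]} after the fanout filter.

    `rows` is an iterable of (gene_source_id, pubmed_id) pairs standing in
    for the tuning table (assumption: all other columns are irrelevant here).
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE GenePubmed_p (gene_source_id TEXT, pubmed_id INTEGER)"
    )
    conn.executemany("INSERT INTO GenePubmed_p VALUES (?, ?)", rows)
    cur = conn.execute(
        """
        SELECT gene_source_id, pubmed_id
        FROM GenePubmed_p
        WHERE pubmed_id NOT IN (
            -- PMIDs attached to more than `threshold` distinct genes
            SELECT pubmed_id
            FROM GenePubmed_p
            GROUP BY pubmed_id
            HAVING COUNT(DISTINCT gene_source_id) > ?
        )
        ORDER BY gene_source_id, pubmed_id
        """,
        (threshold,),
    )
    result = {}
    for gene, pmid in cur:
        result.setdefault(gene, []).append(pmid)
    conn.close()
    return result


# Toy data: PMID 111 is attached to 150 genes (bulk-import style),
# PMID 222 to only two genes (curator style).
rows = [(f"GENE_{i:03d}", 111) for i in range(150)]
rows += [("GENE_000", 222), ("GENE_001", 222)]

shown = visible_pmids(rows)
print(shown)
```

With the default threshold, only the two PMID 222 rows remain. Note that
the comparison is strict: a PMID attached to exactly 100 genes is still
shown.
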
---
 .../lib/wdk/model/records/geneTableQueries.xml | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/Model/lib/wdk/model/records/geneTableQueries.xml b/Model/lib/wdk/model/records/geneTableQueries.xml
index 309b2a94b7..f4af62b34d 100644
--- a/Model/lib/wdk/model/records/geneTableQueries.xml
+++ b/Model/lib/wdk/model/records/geneTableQueries.xml
@@ -3926,6 +3926,23 @@ from (
                  END authors
           FROM ApidbTuning.GenePubmed_p
           WHERE org_abbrev IN (%%PARTITION_KEYS%%)
+            AND pubmed_id NOT IN (
+              -- Hide high-fanout citations: a PubMed ID associated with
+              -- more than 100 distinct genes is almost always generic
+              -- noise (e.g. NCBI gene2pubmed bulk imports that attach
+              -- one paper to a large gene set with no biological
+              -- specificity). The threshold applies to every PMID in
+              -- this tuning table regardless of source -- gene2pubmed,
+              -- curator-submitted, and Apollo-submitted entries are all
+              -- evaluated by fanout. A small number of genuinely
+              -- high-fanout curator entries may be filtered as a side
+              -- effect; this is intentional, since high fanout means
+              -- low per-gene specificity regardless of provenance.
+              SELECT pubmed_id
+              FROM ApidbTuning.GenePubmed_p
+              GROUP BY pubmed_id
+              HAVING COUNT(DISTINCT gene_source_id) > 100
+            )
           ]]>