Motivation
For substitutions, wt_peptide gives a principled 1:1 pair for differential binding. For indels, frameshifts, fusions, viral, and synthetic fragments there's no meaningful WT pair — but cross-reactivity / autoimmune risk is still a concern. The principled generalization is nearest-self: the closest peptide to the mutant in a curated healthy-tissue proteome, paired with that peptide's MHC binding at the same allele.
docs/fragments.md already reserves the self_nearest_* column namespace and DSL scope (self_nearest_peptide, self_nearest_edit_distance, self_nearest_gene, self_nearest_value / _score / _percentile_rank, etc.) but currently instructs producers to populate those columns externally. This issue proposes computing them inside topiary as an opt-in feature.
Proposal
- Bundled reference proteome. Curated healthy-tissue human proteome with CTAs (cancer-testis antigens) explicitly excluded — CTAs are legitimate tumor-specific targets whose testis expression shouldn't count as "self" for cross-reactivity. Shipped as a downloadable artifact (too big for the wheel) with a loader.
- Pluggable matching. Default: exact k-mer index + Hamming fallback (fast, no BLAST dependency). Interface for callers to plug in BLAST or BLOSUM-weighted edit distance.
- Population path. TopiaryPredictor(compute_self_nearest=True, self_reference=…) runs the nearest-self lookup after prediction and populates the full reserved namespace. Rows with no hit within a configurable max distance stay NaN; caller decides via fillna or filters.
- Paired MHC binding. For each self_nearest_peptide, run _predict_raw_peptides at the mutant's allele to populate self_nearest_value / _score / _percentile_rank. Same primitive as the predict_wt path.
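To make the "exact k-mer index + Hamming fallback" default and the pluggable interface concrete, here is a minimal sketch. Everything in it is an assumption for illustration: SelfHit, SelfMatcher, and KmerHammingMatcher are hypothetical names, not existing topiary APIs, the proteome is a toy dict of gene to sequence, and the fallback is a naive linear scan over same-length k-mers (a real implementation would likely need a seeded or otherwise indexed search to be fast on a full proteome).

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Protocol


class SelfHit(NamedTuple):
    """Hypothetical result record for one nearest-self lookup."""
    peptide: str   # would feed self_nearest_peptide
    gene: str      # would feed self_nearest_gene
    distance: int  # would feed self_nearest_edit_distance


class SelfMatcher(Protocol):
    """Hypothetical plug-in interface; a BLAST or BLOSUM-weighted
    matcher would implement the same method."""

    def nearest(self, peptide: str, max_distance: int) -> Optional[SelfHit]:
        ...


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


class KmerHammingMatcher:
    """Exact k-mer lookup first, then a Hamming scan over same-length k-mers."""

    def __init__(self, proteome: Dict[str, str],
                 lengths: Iterable[int] = (8, 9, 10, 11)) -> None:
        # length -> {self k-mer -> first gene seen containing it}
        self._by_length: Dict[int, Dict[str, str]] = defaultdict(dict)
        for gene, seq in proteome.items():
            for k in lengths:
                for i in range(len(seq) - k + 1):
                    self._by_length[k].setdefault(seq[i:i + k], gene)

    def nearest(self, peptide: str, max_distance: int = 2) -> Optional[SelfHit]:
        kmers = self._by_length.get(len(peptide), {})
        # Fast path: the peptide itself occurs in the self proteome.
        if peptide in kmers:
            return SelfHit(peptide, kmers[peptide], 0)
        # Fallback: naive scan for the closest k-mer within max_distance.
        best: Optional[SelfHit] = None
        for kmer, gene in kmers.items():
            d = hamming(peptide, kmer)
            if d <= max_distance and (best is None or d < best.distance):
                best = SelfHit(kmer, gene, d)
        return best  # None means "no hit"; the row would stay NaN


# Toy reference with made-up gene names and sequences.
toy_proteome = {"GENE_A": "MKTAYIAKQRQISFVK", "GENE_B": "GSHMLEDPVDAFKLNG"}
matcher = KmerHammingMatcher(toy_proteome, lengths=(9,))
print(matcher.nearest("KTAYIAKQW", max_distance=2))
# SelfHit(peptide='KTAYIAKQR', gene='GENE_A', distance=1)
```

A None result corresponds to the "no hit within max distance" case, where the self_nearest_* columns would stay NaN. Keeping the matcher behind a single nearest() method is one way to let callers swap in BLAST or a BLOSUM-weighted distance without touching the population path.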
Out of scope
- Healthy-tissue expression data (self_nearest_tissues): separate issue; needs GTEx / HPA integration.
- Non-human reference proteomes.
Why this instead of indel wt_peptide remapping
For non-SNV effects there is no coordinate-aligned "WT peer" that means what wt_peptide means for an SNV. Remapping would give you a residue string, but the biological question ("is this cross-reactive with a healthy self peptide at the same MHC?") is better answered by nearest-self across the whole self proteome than by the single residue span that happens to line up pre-shift.