Computed self_nearest_* columns against a curated non-CTA proteome #124

@iskandr

Motivation

For substitutions, wt_peptide gives a principled 1:1 pair for differential binding. For indels, frameshifts, fusions, and viral or synthetic fragments there is no meaningful WT pair, but cross-reactivity / autoimmune risk is still a concern. The principled generalization is nearest-self: the closest peptide to the mutant in a curated healthy-tissue proteome, paired with that peptide's MHC binding at the same allele.

docs/fragments.md already reserves the self_nearest_* column namespace and DSL scope (self_nearest_peptide, self_nearest_edit_distance, self_nearest_gene, self_nearest_value / _score / _percentile_rank, etc.) but currently instructs producers to populate those columns externally. This issue is about computing them inside topiary as an opt-in feature.

Proposal

  1. Bundled reference proteome. Curated healthy-tissue human proteome with CTAs (cancer-testis antigens) explicitly excluded — CTAs are legitimate tumor-specific targets whose testis expression shouldn't count as "self" for cross-reactivity. Shipped as a downloadable artifact (too big for the wheel) with a loader.
  2. Pluggable matching. Default: exact k-mer index + Hamming fallback (fast, no BLAST dependency). Interface for callers to plug in BLAST or BLOSUM-weighted edit distance.
  3. Population path. TopiaryPredictor(compute_self_nearest=True, self_reference=…) runs the nearest-self lookup after prediction and populates the full reserved namespace. Rows with no hit within a configurable max distance stay NaN; the caller decides how to handle them via fillna or filters.
  4. Paired MHC binding. For each self_nearest_peptide, run _predict_raw_peptides at the mutant's allele to populate self_nearest_value / _score / _percentile_rank. Same primitive as the predict_wt path.
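
A minimal sketch of what items 2 and 3 could look like, to make the proposal concrete. Everything named here (`SelfMatcher`, `KmerHammingMatcher`, `annotate_rows`, the `peptide` row key, the toy protein sequences) is hypothetical and not part of topiary's current API. The fallback is a brute-force Hamming scan over same-length k-mers, which is fine for illustration but would need a smarter index at proteome scale. Paired MHC binding (item 4) is left out.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Protocol, Tuple

# (self_peptide, gene, distance); hypothetical result shape.
Hit = Tuple[str, str, int]


class SelfMatcher(Protocol):
    """Hypothetical plug-in interface: BLAST or BLOSUM-weighted
    matchers would implement the same single method."""

    def nearest(self, peptide: str) -> Optional[Hit]: ...


def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


class KmerHammingMatcher:
    """Exact k-mer lookup first, then a Hamming scan as fallback."""

    def __init__(
        self,
        proteins: Dict[str, str],
        lengths: Iterable[int] = (8, 9, 10, 11),
        max_distance: int = 2,
    ) -> None:
        self.max_distance = max_distance
        # length -> {k-mer: first gene it was seen in}
        self.index: Dict[int, Dict[str, str]] = defaultdict(dict)
        for gene, seq in proteins.items():
            for k in lengths:
                for i in range(len(seq) - k + 1):
                    self.index[k].setdefault(seq[i:i + k], gene)

    def nearest(self, peptide: str) -> Optional[Hit]:
        kmers = self.index.get(len(peptide), {})
        gene = kmers.get(peptide)
        if gene is not None:  # exact hit
            return (peptide, gene, 0)
        best: Optional[Hit] = None
        for candidate, cand_gene in kmers.items():
            d = hamming(peptide, candidate)
            if d <= self.max_distance and (best is None or d < best[2]):
                best = (candidate, cand_gene, d)
        return best  # None when nothing is within max_distance


def annotate_rows(rows: List[dict], matcher: SelfMatcher) -> List[dict]:
    """Add self_nearest_* keys; misses stay None (NaN in a DataFrame)."""
    out = []
    for row in rows:
        hit = matcher.nearest(row["peptide"])
        peptide, gene, dist = hit if hit else (None, None, None)
        out.append({
            **row,
            "self_nearest_peptide": peptide,
            "self_nearest_gene": gene,
            "self_nearest_edit_distance": dist,
        })
    return out


# Toy reference "proteome"; sequences are made up for the example.
toy_proteins = {"GENE_A": "MKTAYIAKQRQISFVK", "GENE_B": "GSHMSLLTEVETPIRNEW"}
matcher = KmerHammingMatcher(toy_proteins, lengths=(9,), max_distance=2)
rows = annotate_rows(
    [{"peptide": "SLLTEVETA"}, {"peptide": "WWWWWWWWW"}], matcher
)
```

Because `annotate_rows` only depends on the `nearest` method, a BLAST-backed or BLOSUM-weighted matcher could be swapped in without touching the population step.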

Out of scope

  • Healthy-tissue expression data (self_nearest_tissues) — separate issue; needs GTEx / HPA integration.
  • Non-human reference proteomes.

Why this instead of indel wt_peptide remapping

For non-SNV effects there is no coordinate-aligned "WT peer" that means what wt_peptide means for an SNV. Remapping would give you a residue string, but the biological question ("is this cross-reactive with a healthy self peptide at the same MHC?") is better answered by nearest-self across the whole self proteome than by the single residue span that happens to line up pre-shift.
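
A toy illustration of that argument, with all sequences invented for the example: the pre-shift span that lines up with a frameshift-derived peptide can be far from it, while an unrelated self peptide elsewhere in the proteome is much closer.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatched positions between equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


# All peptides below are hypothetical, chosen only to show the effect.
mutant = "KLSPRTWAV"        # 9-mer spanning a frameshift junction
aligned_wt = "KLSGQDMEH"    # pre-shift residues at the same coordinates
self_9mers = ["KLSGQDMEH", "ALSPRTWAV", "GTEVHNQLP"]  # toy self proteome

aligned_distance = hamming(mutant, aligned_wt)
nearest = min(self_9mers, key=lambda p: hamming(mutant, p))
nearest_distance = hamming(mutant, nearest)

print(aligned_distance, nearest, nearest_distance)
```

Here the coordinate-aligned span differs at six of nine positions, whereas the nearest self peptide differs at one, so it is the more relevant comparison for cross-reactivity.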
