[BUG] Sanitize '|' inside CATH/InterPro annotation names #58

Description

@tsenoner

Context

Sibling of #56 (semicolon sanitization). ProtSpace encodes categorical annotation values as accession (name)|score1,score2 or accession (name)|EVIDENCE_CODE, joining multiple hits with ;. The | separates the label from a trailing score/evidence suffix.

Frontend parser (packages/core/src/components/data-loader/utils/conversion.ts parseAnnotationValue, ~L186-220):

  • Splits on the last pipe: const lastPipe = trimmed.lastIndexOf('|').
  • The suffix after it is interpreted as numeric scores, or as an evidence code if it matches EVIDENCE_CODE_RE = /^(?:[A-Z]{2,5}|ECO:\d+)$/; otherwise the whole string is kept as the label.

Severity: lower than #56 (defensive / contract integrity)

Unlike ;, an in-name | does not currently corrupt InterPro/CATH output, because:

  1. The parser uses lastIndexOf('|'), so a real trailing |score is always found first and any in-name | stays in the label.
  2. Names are wrapped in (name), and the score is emitted after the closing paren. For a value with no score, the suffix after an in-name | therefore ends in ) (e.g. EXP)), which fails both the numeric check and EVIDENCE_CODE_RE (anchored, and ) is not an allowed character). The parser then falls back to "whole string is the label".

So accession (name)|score is robust against in-name | as long as names stay inside the parentheses and the score stays outside them. The risk is real for any value not in that parenthesized shape, or if that invariant ever changes.

Genuine failure modes the backend should preclude

  • A bare (non-parenthesized) label|suffix value whose label legitimately ends in | + 2–5 uppercase letters → mis-read as an evidence code. Example: a name GO:12345|EXP (no parens, no score) → frontend yields label="GO:12345", evidence="EXP" (wrong).
  • A label ending in |<number> with no real score → the number is stripped as a score.
  • Any future producer change that emits a name outside ( … ) or a score inside it.
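
The parsing rule and the failure modes above can be reproduced with a small Python paraphrase of the frontend logic. This is a sketch based only on the description in this issue, not a port of conversion.ts: the SCORES_RE pattern (comma-separated decimal numbers), the ParsedValue type, and the sample values (ACC1, Subunit, and so on) are assumptions made for illustration.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Evidence-code pattern as quoted in this issue.
EVIDENCE_CODE_RE = re.compile(r"^(?:[A-Z]{2,5}|ECO:\d+)$")
# Assumption: "numeric scores" means comma-separated decimal numbers.
SCORES_RE = re.compile(r"-?\d+(?:\.\d+)?(?:,-?\d+(?:\.\d+)?)*")


@dataclass
class ParsedValue:
    label: str
    scores: List[float] = field(default_factory=list)
    evidence: Optional[str] = None


def parse_annotation_value(raw: str) -> ParsedValue:
    """Paraphrase of the described rule: split on the LAST pipe only."""
    trimmed = raw.strip()
    last_pipe = trimmed.rfind("|")
    if last_pipe == -1:
        return ParsedValue(label=trimmed)
    label, suffix = trimmed[:last_pipe], trimmed[last_pipe + 1:]
    if SCORES_RE.fullmatch(suffix):
        return ParsedValue(label=label, scores=[float(s) for s in suffix.split(",")])
    if EVIDENCE_CODE_RE.fullmatch(suffix):
        return ParsedValue(label=label, evidence=suffix)
    # Unrecognized suffix: keep the whole string as the label.
    return ParsedValue(label=trimmed)


# Parenthesized shape: the in-name pipe stays in the label.
print(parse_annotation_value("ACC1 (alpha|beta domain)|12.5,3"))
# Parenthesized, no score: suffix is "EXP)", so the whole string is the label.
print(parse_annotation_value("ACC1 (alpha|EXP)"))
# Bare values: the tail of the name is mis-read as evidence or as a score.
print(parse_annotation_value("GO:12345|EXP"))
print(parse_annotation_value("Subunit|2"))
```

The first two calls keep the full name in the label, while the last two lose part of the name to evidence="EXP" and scores=[2.0] respectively, which is the behaviour the backend sanitization is meant to rule out.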

Note: the frontend export path only renders already-parsed labels; it does not re-serialize label|score, so there is no round-trip data corruption — the impact is mis-parsed labels/scores/evidence in legend, tooltip, sorting and filtering.

Fix

Mirror the ; handling: sanitize/normalize | out of names before assembling the value, so the accession (name)|score contract is unambiguous regardless of value shape.

  • src/protspace/data/annotations/retrievers/interpro_retriever.py: acc_with_name = f"{acc} ({name})" (~L362) and the |-joined score line (~L369). Strip or replace | in name (e.g. with / or ,).
  • src/protspace/data/annotations/retrievers/cath_names.py: name read verbatim (~L93). Apply the same sanitization at the source.
  • Ideally sanitize ; and | (and normalize) in one shared name-cleaning step.
  • Add tests with names containing |.
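
One possible shape for the shared name-cleaning step is sketched below. The helper names clean_annotation_name and format_hit are hypothetical, "/" is just one of the replacement characters suggested above, and the comma-joined score suffix is an assumption about how the retriever assembles the score line; none of this is taken from the existing retriever code.

```python
import re
from typing import Sequence

# One or more '|' or ';' characters, together with surrounding whitespace.
_STRUCTURAL_CHARS = re.compile(r"(?:\s*[|;]\s*)+")
_WHITESPACE = re.compile(r"\s+")


def clean_annotation_name(name: str, replacement: str = "/") -> str:
    """Hypothetical helper: drop '|' and ';' from a name, normalize whitespace."""
    cleaned = _STRUCTURAL_CHARS.sub(replacement, name)
    return _WHITESPACE.sub(" ", cleaned).strip()


def format_hit(acc: str, name: str, scores: Sequence[float] = ()) -> str:
    """Hypothetical assembly of 'accession (name)|score1,score2'."""
    acc_with_name = f"{acc} ({clean_annotation_name(name)})"
    if not scores:
        return acc_with_name
    # Assumption: scores are joined with ',' after a single structural '|'.
    return acc_with_name + "|" + ",".join(str(s) for s in scores)


print(clean_annotation_name("Alpha|beta hydrolase"))        # Alpha/beta hydrolase
print(clean_annotation_name("Kinase; type  II"))            # Kinase/type II
print(format_hit("ACC1", "alpha|beta; domain", [12.5, 3]))  # ACC1 (alpha/beta/domain)|12.5,3
```

Cleaning inside a single helper means both retrievers would share the same rule, and tests with names containing | or ; only need to target that one function plus the assembled value.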

Related

Acceptance criteria

  • Generated annotation names contain no | (and no ;).
  • Documented contract (docs/annotations.md) holds: the only structural | is the label↔score/evidence separator.
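
The acceptance criteria could be checked in tests with a small validator along these lines. It is a sketch under stated assumptions: the function name contract_violations is hypothetical, accessions are assumed to contain no whitespace, and the expected hit shape is taken from the accession (name)|suffix description in this issue rather than from docs/annotations.md itself.

```python
import re
from typing import List

# Assumed shape of one hit: "accession (name)" plus an optional "|suffix".
_HIT_RE = re.compile(r"(?P<acc>\S+) \((?P<name>.*)\)(?:\|(?P<suffix>[^|()]+))?")


def contract_violations(value: str) -> List[str]:
    """Return a list of problems found in a ';'-joined annotation value."""
    problems = []
    for hit in value.split(";"):
        match = _HIT_RE.fullmatch(hit.strip())
        if match is None:
            # Includes names that contained ';' and were split apart.
            problems.append(f"unexpected shape: {hit!r}")
        elif "|" in match.group("name"):
            problems.append(f"'|' inside name: {hit!r}")
    return problems


print(contract_violations("ACC1 (alpha/beta)|12.5;ACC2 (gamma)|EXP"))  # []
print(contract_violations("ACC1 (alpha|beta)|12.5"))
print(contract_violations("ACC1 (a;b)|1"))
```

An empty list means every hit has exactly one structural |, or none when there is no score or evidence suffix; any in-name | or ; shows up as a reported problem.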
