Context
Sibling of #56 (semicolon sanitization). ProtSpace encodes categorical annotation values as accession (name)|score1,score2 or accession (name)|EVIDENCE_CODE, joining multiple hits with ;. The | separates the label from a trailing score/evidence suffix.
Frontend parser (packages/core/src/components/data-loader/utils/conversion.ts parseAnnotationValue, ~L186-220):
- Splits on the last pipe: const lastPipe = trimmed.lastIndexOf('|').
- Suffix after it is interpreted as numeric scores, or an evidence code matching EVIDENCE_CODE_RE = /^(?:[A-Z]{2,5}|ECO:\d+)$/, otherwise the whole string is kept as the label.
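The parsing rule described above can be modelled in a few lines of Python. This is only a sketch of the described behaviour, not the TypeScript implementation: ParsedAnnotation, NUMBER_RE and the assumption that scores are comma-separated decimal numbers are illustrative, and the example values are made up.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional

# Same pattern as the frontend's EVIDENCE_CODE_RE quoted above.
EVIDENCE_CODE_RE = re.compile(r"^(?:[A-Z]{2,5}|ECO:\d+)$")
# Assumption: a score is a plain decimal number.
NUMBER_RE = re.compile(r"^-?\d+(?:\.\d+)?$")


@dataclass
class ParsedAnnotation:  # hypothetical container, not a ProtSpace type
    label: str
    scores: List[float] = field(default_factory=list)
    evidence: Optional[str] = None


def parse_scores(suffix: str) -> Optional[List[float]]:
    """Return the scores if every comma-separated part is a number, else None."""
    parts = [part.strip() for part in suffix.split(",")]
    if all(NUMBER_RE.match(part) for part in parts):
        return [float(part) for part in parts]
    return None


def parse_annotation_value(raw: str) -> ParsedAnnotation:
    trimmed = raw.strip()
    last_pipe = trimmed.rfind("|")  # Python counterpart of lastIndexOf('|')
    if last_pipe == -1:
        return ParsedAnnotation(label=trimmed)
    label = trimmed[:last_pipe].strip()
    suffix = trimmed[last_pipe + 1:].strip()
    scores = parse_scores(suffix)
    if scores is not None:
        return ParsedAnnotation(label=label, scores=scores)
    if EVIDENCE_CODE_RE.match(suffix):
        return ParsedAnnotation(label=label, evidence=suffix)
    # Suffix is neither scores nor an evidence code: keep the whole string.
    return ParsedAnnotation(label=trimmed)


print(parse_annotation_value("ACC1 (some name)|12.5,3.0"))
print(parse_annotation_value("ACC2 (other name)|EXP"))
```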
Severity: lower than #56 (defensive / contract integrity)
Unlike ;, an in-name | does not currently corrupt InterPro/CATH output, because:
- The parser uses lastIndexOf('|'), so a real trailing |score is always found first and any in-name | stays in the label.
- Names are wrapped in (name), and the score is emitted after the closing paren. For a value with no score, the suffix after an in-name | therefore ends in ) (e.g. EXP)), which fails both the numeric check and EVIDENCE_CODE_RE (anchored, no )). It falls back to "whole string is the label".
So accession (name)|score is robust against in-name | as long as names stay inside the parentheses and the score stays outside them. The risk is real for any value not in that parenthesized shape, or if that invariant ever changes.
Genuine failure modes the backend should preclude
- A bare (non-parenthesized) label|suffix value whose label legitimately ends in | + 2–5 uppercase letters → mis-read as an evidence code. Example: a name GO:12345|EXP (no parens, no score) → frontend yields label="GO:12345", evidence="EXP" (wrong).
- A label ending in |<number> with no real score → the number is stripped as a score.
- Any future producer change that emits a name outside ( … ) or a score inside it.
Note: the frontend export path only renders already-parsed labels; it does not re-serialize label|score, so there is no round-trip data corruption — the impact is mis-parsed labels/scores/evidence in legend, tooltip, sorting and filtering.
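The contrast between the robust parenthesized shape and the first two failure modes can be reproduced with a small Python check. It is a sketch under assumptions: suffix_kind and NUMERIC_SUFFIX_RE are hypothetical, the score format (comma-separated decimals) is assumed, and the accessions and names are invented for illustration.

```python
import re
from typing import Tuple

# Same pattern as the frontend's EVIDENCE_CODE_RE.
EVIDENCE_CODE_RE = re.compile(r"^(?:[A-Z]{2,5}|ECO:\d+)$")
# Assumed score format: one or more comma-separated decimal numbers.
NUMERIC_SUFFIX_RE = re.compile(r"^-?\d+(?:\.\d+)?(?:,-?\d+(?:\.\d+)?)*$")


def suffix_kind(value: str) -> Tuple[str, str]:
    """Return (label, kind), where kind is 'scores', 'evidence' or 'none'."""
    trimmed = value.strip()
    head, sep, suffix = trimmed.rpartition("|")  # split on the last pipe
    if sep and NUMERIC_SUFFIX_RE.match(suffix):
        return head, "scores"
    if sep and EVIDENCE_CODE_RE.match(suffix):
        return head, "evidence"
    return trimmed, "none"  # whole string stays the label


cases = [
    "ACC1 (alpha|EXP)",       # in-name pipe, parenthesized, no score: stays intact
    "ACC1 (alpha|EXP)|12.5",  # in-name pipe plus a real score: only the score is split off
    "GO:12345|EXP",           # bare label ending in |EXP: mis-read as evidence
    "Family|7",               # bare label ending in |<number>: mis-read as a score
]
for case in cases:
    print(case, "->", suffix_kind(case))
```

The first two cases survive only because the closing paren sits between the in-name | and the end of the label; the last two have no such guard.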
Fix
Mirror the ; handling: sanitize/normalize | out of names before assembling the value, so the accession (name)|score contract is unambiguous regardless of value shape.
- src/protspace/data/annotations/retrievers/interpro_retriever.py, at acc_with_name = f"{acc} ({name})" (~L362) and the |-joined score line (~L369): strip/replace | in name (e.g. with / or ,).
- src/protspace/data/annotations/retrievers/cath_names.py, where the name is read verbatim (~L93): same sanitization at the source.
- Ideally sanitize ; and | (and normalize) in one shared name-cleaning step.
- Add tests with names containing |.
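A shared name-cleaning step along these lines might look like the sketch below. Everything in it is hypothetical: clean_annotation_name and format_hit are not existing ProtSpace functions, the choice of / as replacement is just one of the options suggested above, and how #56 actually handles ; may differ.

```python
import re
from typing import Optional, Sequence

# Runs of ';' or '|' (with surrounding whitespace) are structural in
# "accession (name)|score" values joined by ';'.
STRUCTURAL_RUN_RE = re.compile(r"(?:\s*[;|]\s*)+")
WHITESPACE_RE = re.compile(r"\s+")


def clean_annotation_name(name: str, replacement: str = "/") -> str:
    """Hypothetical shared helper: replace ';' and '|' and normalize whitespace."""
    if ";" in replacement or "|" in replacement:
        raise ValueError("replacement must not contain ';' or '|'")
    cleaned = STRUCTURAL_RUN_RE.sub(replacement, name)
    return WHITESPACE_RE.sub(" ", cleaned).strip()


def format_hit(acc: str, name: str, scores: Optional[Sequence[float]] = None) -> str:
    """Assemble one value in the documented shape from a cleaned name."""
    acc_with_name = f"{acc} ({clean_annotation_name(name)})"
    if scores:
        return acc_with_name + "|" + ",".join(str(score) for score in scores)
    return acc_with_name


# pytest-style tests with names containing '|' and ';'.
def test_cleaned_name_has_no_structural_chars():
    cleaned = clean_annotation_name("alpha | beta;gamma")
    assert cleaned == "alpha/beta/gamma"


def test_only_pipe_is_the_score_separator():
    value = format_hit("ACC1", "alpha|EXP", [12.5])
    assert value == "ACC1 (alpha/EXP)|12.5"
    assert value.count("|") == 1
```

Cleaning inside the helper that builds acc_with_name, rather than at each call site, keeps the InterPro and CATH retrievers from drifting apart.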
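Related
- ; in CATH/InterPro names (sibling).
- ; repair for existing bundles: [BUG] CATH-Gene3D categories split incorrectly on ';' inside names protspace_web#282.
Acceptance criteria
- Generated annotation names contain no | (and no ;).
- Documented contract (docs/annotations.md) holds: the only structural | is the label↔score/evidence separator.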