Context
The web frontend splits multi-hit categorical annotation cells on `;` between hits (contract: `accession (name)|score;accession2 (name2)|score`, see `docs/annotations.md:133`). The InterPro retriever builds each hit as `f"{acc} ({name})"` and joins hits with `";".join(...)` (`src/protspace/data/annotations/retrievers/interpro_retriever.py:362,369,375`), inserting `name` verbatim without sanitizing `;`.
However, ~20% of CATH names contain a literal `;` (names are taken verbatim after the colon in `src/protspace/data/annotations/retrievers/cath_names.py:93`), e.g. `Ribosomal Protein L15; Chain: K; domain 2`. This violates the documented contract and causes the web app to shatter one value into bogus categories (`domain 1)`, `domain 2)`, …).
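The failure is easy to reproduce. The sketch below uses Python as a stand-in for a naive split on `;`; it is not the actual frontend code, and the accession `ACC1` and the score `0.9` are placeholders rather than values from a real bundle.

```python
# Python stand-in for a naive split on ";" (not the actual frontend code).
name = "Ribosomal Protein L15; Chain: K; domain 2"  # CATH name quoted in this issue
hit = f"ACC1 ({name})|0.9"  # "ACC1" and "0.9" are placeholder values
cell = ";".join([hit])  # a cell that holds exactly one hit

parts = cell.split(";")
for part in parts:
    print(repr(part))
```

One hit goes in, but the split yields three fragments, and the last one ends in `domain 2)|0.9`, which is where the bogus categories come from.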
Fix
Sanitize names before embedding them so the joined string is unambiguous, e.g. replace `;` inside a name with `,` (or another safe separator) in `interpro_retriever.py` / `cath_names.py`. This makes newly generated bundles correct at the source and keeps the documented contract honest. It also covers the rare name with unbalanced parentheses (`YojJ-like (1`).
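A minimal sketch of the proposed sanitization, assuming a plain character replacement is acceptable. `sanitize_name` and `format_hits` are hypothetical helpers for illustration, not existing functions in `interpro_retriever.py` or `cath_names.py`, and the accessions are placeholders.

```python
def sanitize_name(name: str, replacement: str = ",") -> str:
    """Replace the hit separator ';' inside a name (hypothetical helper)."""
    return name.replace(";", replacement)


def format_hits(hits: list[tuple[str, str]]) -> str:
    """Join (accession, name) pairs the way the issue describes, with sanitized names."""
    return ";".join(f"{acc} ({sanitize_name(name)})" for acc, name in hits)


hits = [
    ("ACC1", "Ribosomal Protein L15; Chain: K; domain 2"),  # placeholder accession
    ("ACC2", "YojJ-like (1"),  # unbalanced-paren name from the issue
]
cell = format_hits(hits)
print(cell)
```

After sanitization the only `;` left in the cell is the separator between the two hits, so a plain split on `;` recovers exactly two entries even when a name has unbalanced parentheses. Whether the replacement belongs in the retriever or in the name loader is left open here, as in the issue text.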
Related
Frontend repair for already-distributed bundles (parenthesis-aware split): tsenoner/protspace_web#282. That fix is required regardless; this issue fixes the data at the source.
Acceptance criteria
- Generated CATH-Gene3D annotation values contain no `;` inside a name.
- A value round-trips to the correct categories in the web app.
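
Both criteria could be checked with a small test along these lines. This is a sketch under stated assumptions: `build_cell` and `round_trips` are hypothetical names, the split on `;` stands in for the frontend, and accessions and scores are placeholders.

```python
def build_cell(hits: list[tuple[str, str, str]]) -> str:
    """Build 'acc (name)|score' entries joined by ';', with ';' removed from names."""
    return ";".join(
        f"{acc} ({name.replace(';', ',')})|{score}" for acc, name, score in hits
    )


def round_trips(hits: list[tuple[str, str, str]]) -> bool:
    """True if splitting the cell on ';' yields exactly one intact entry per hit."""
    parts = build_cell(hits).split(";")
    if len(parts) != len(hits):
        return False
    return all(
        part.startswith(f"{acc} (") and part.endswith(f")|{score}")
        for part, (acc, _name, score) in zip(parts, hits)
    )


sample = [
    ("ACC1", "Ribosomal Protein L15; Chain: K; domain 2", "0.9"),  # placeholders
    ("ACC2", "YojJ-like (1", "0.5"),
]
print(round_trips(sample))
```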