[BUG] Sanitize ';' inside CATH/InterPro annotation names #56

Description

@tsenoner

Context

The web frontend splits multi-hit categorical annotation cells on ; between hits (contract: accession (name)|score;accession2 (name2)|score, see docs/annotations.md:133). The InterPro retriever builds each hit as f"{acc} ({name})" and joins hits with ";".join(...) (src/protspace/data/annotations/retrievers/interpro_retriever.py:362,369,375), inserting name verbatim with no ; sanitization.

But ~20% of CATH names contain a literal ; (names taken verbatim after the colon in src/protspace/data/annotations/retrievers/cath_names.py:93), e.g. Ribosomal Protein L15; Chain: K; domain 2. This violates the documented contract and causes the web app to shatter one value into bogus categories (domain 1), domain 2), …).
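A minimal sketch of the failure, assuming the frontend does a plain split on ; as the contract describes. The accessions and scores are placeholders rather than real entries; only the CATH name is taken from this issue.

```python
# Placeholder accessions/scores (hypothetical); the name is the example from this issue.
name = "Ribosomal Protein L15; Chain: K; domain 2"
hits = [
    f"ACC1 ({name})|1.0",
    "ACC2 (Other domain)|2.0",
]

# What the retriever effectively emits today: hits joined with ';', name inserted verbatim.
cell = ";".join(hits)

# What a plain split on ';' sees: the two hits come back as four fragments.
parts = cell.split(";")
for part in parts:
    print(repr(part))
```

Two hits go in and four fragments come out, which is how fragments such as domain 2) end up as separate categories.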

Fix

Sanitize/normalize names before embedding so the joined string is unambiguous, e.g. replace ; inside a name with , (or another safe separator) in interpro_retriever.py / cath_names.py. This makes newly generated bundles correct at the source and keeps the documented contract honest. It also covers the rare unbalanced-paren name (YojJ-like (1).
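One possible shape for the fix, as a sketch only. The helper names sanitize_name and format_hits are hypothetical and do not exist in the repo, the accessions are placeholders, and the choice of , as replacement is just the option suggested above.

```python
HIT_SEPARATOR = ";"


def sanitize_name(name: str, replacement: str = ",") -> str:
    """Replace the hit separator inside a name so it cannot split a hit.

    Hypothetical helper; ',' is the replacement proposed in this issue.
    """
    return name.replace(HIT_SEPARATOR, replacement)


def format_hits(hits: list[tuple[str, str, float]]) -> str:
    """Build 'accession (name)|score' per hit and join hits with ';'."""
    return HIT_SEPARATOR.join(
        f"{acc} ({sanitize_name(name)})|{score}" for acc, name, score in hits
    )


# Placeholder accessions and scores; the first name is the example from this issue.
cell = format_hits(
    [
        ("ACC1", "Ribosomal Protein L15; Chain: K; domain 2", 1.0),
        ("ACC2", "Other domain", 2.0),
    ]
)
print(cell)
```

With names sanitized before joining, the only ; characters left in the cell are the separators between hits, so a plain split recovers exactly one fragment per hit.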

Related

Frontend repair for already-distributed bundles (parenthesis-aware split): tsenoner/protspace_web#282. That fix is required regardless; this issue fixes the data at the source.

Acceptance criteria

  • Generated CATH-Gene3D annotation values contain no ; inside a name.
  • A value round-trips to the correct categories in the web app.
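Both criteria could be checked along these lines. This is a sketch that assumes a plain split on ; stands in for the web app's parsing; split_hits, categories and violates_contract are hypothetical names, and the accessions are placeholders.

```python
def split_hits(cell: str) -> list[str]:
    """Split a cell the way the documented contract implies: on ';' only."""
    return cell.split(";")


def categories(cell: str) -> list[str]:
    """Drop the trailing '|score' from each hit, leaving 'accession (name)'."""
    return [hit.rsplit("|", 1)[0] for hit in split_hits(cell)]


def violates_contract(cell: str, expected_hits: int) -> bool:
    """True if the split yields a different number of fragments than hits."""
    return len(split_hits(cell)) != expected_hits


# Placeholder accessions/scores; names are the examples mentioned in this issue.
sanitized = "ACC1 (Ribosomal Protein L15, Chain: K, domain 2)|1.0;ACC2 (YojJ-like (1)|2.0"
unsanitized = "ACC1 (Ribosomal Protein L15; Chain: K; domain 2)|1.0;ACC2 (YojJ-like (1)|2.0"

print(violates_contract(sanitized, expected_hits=2))
print(violates_contract(unsanitized, expected_hits=2))
print(categories(sanitized))
```

The sanitized value passes even though one name has an unbalanced parenthesis, because the check never needs to look at parentheses once ; appears only between hits.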
