[BUG] Unnamed CATH superfamilies inherit parent topology name (e.g. 6.20.10.10 mislabeled "Laminin") #57

Description

@tsenoner

Summary

In the default Swiss-Prot dataset, the CATH-Gene3D superfamily 6.20.10.10 is labeled "Laminin", but that superfamily has no name in CATH (6.20.10.10 on cathdb). Only 2.10.25.10 legitimately carries the "Laminin" name (2.10.25.10 on cathdb). The name is being incorrectly propagated onto an unnamed superfamily.

Root cause

_parse_cath_names() in src/protspace/data/annotations/retrievers/cath_names.py fills every unnamed 4-level superfamily with its parent topology (3-level) name (lines 101–105):

# Fill unnamed superfamilies with parent topology name
for code in unnamed_superfamilies:
    parent = ".".join(code.split(".")[:3])
    if parent in names:
        names[code] = names[parent]

Evidence from the CATH names file

From the official cath-names.txt (latest release):

2.10.25       3sovA02   :Laminin
2.10.25.10    3sovA02   :Laminin      <- superfamily explicitly named "Laminin" (correct)
6.20.10       3s6xC01   :Laminin      <- topology named "Laminin"
6.20.10.10    1lmmA01   :             <- superfamily has NO name in CATH
6.20.10.20    3s6xC01   :             <- also unnamed
6.20.10.30    4glxA05   :             <- also unnamed

2.10.25.10 is explicitly assigned "Laminin" → correct. 6.20.10.10 has an empty name → the fallback copies the parent topology 6.20.10's name "Laminin" onto it → wrong.
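The collapse can be reproduced in isolation with a small sketch. Note that `parse_with_inheritance` below is a hypothetical stand-in written for this issue, not the actual `_parse_cath_names()`; only the fallback loop at the end is copied from the snippet quoted above, and the line parsing (split on the first `:`) is an assumption about the file layout based on the excerpt.

```python
# Sample rows taken from the cath-names.txt excerpt above (annotations removed).
SAMPLE = """\
2.10.25       3sovA02   :Laminin
2.10.25.10    3sovA02   :Laminin
6.20.10       3s6xC01   :Laminin
6.20.10.10    1lmmA01   :
6.20.10.20    3s6xC01   :
6.20.10.30    4glxA05   :
"""


def parse_with_inheritance(text):
    """Hypothetical minimal parser that mimics the current fallback."""
    names = {}
    unnamed_superfamilies = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        head, _, name = line.partition(":")
        code = head.split()[0]
        name = name.strip()
        if name:
            names[code] = name
        elif code.count(".") == 3:
            # 4-level code (superfamily) with an empty name
            unnamed_superfamilies.append(code)

    # Fill unnamed superfamilies with parent topology name (current behavior)
    for code in unnamed_superfamilies:
        parent = ".".join(code.split(".")[:3])
        if parent in names:
            names[code] = names[parent]
    return names


names = parse_with_inheritance(SAMPLE)
siblings = ["6.20.10.10", "6.20.10.20", "6.20.10.30"]
labels = {names[code] for code in siblings}
print(labels)  # all three siblings share a single label
```

With this input, the three unnamed siblings end up with one shared label, which is the behavior described in this report.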

Why it matters

  • The label "Laminin" is not assigned to superfamily 6.20.10.10 by CATH; showing it misrepresents the annotation.
  • All sibling unnamed superfamilies under a named topology collapse to the same label — 6.20.10.10, 6.20.10.20, and 6.20.10.30 all become "Laminin", making three distinct superfamilies indistinguishable in the legend.

This is currently intentional behavior (see the module docstring and the test_unnamed_superfamily_inherits_topology test in tests/.../test_cath_names.py), but the resulting labels don't reflect CATH and silently lose information.

Suggested fixes (to discuss)

  1. Don't inherit — keep the superfamily code (e.g. 6.20.10.10) as the label when CATH has no name, so siblings stay distinct and no false name is shown.
  2. Inherit but disambiguate — e.g. Laminin (6.20.10.10) so the parent-derived name is visible but the superfamily remains identifiable.
  3. Code as identity, name as secondary — keep the CATH code as the identity and only use the topology name as a tooltip/secondary display.
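To make options 1 and 2 concrete, here is a rough sketch for discussion. `resolve_label`, its `strategy` parameter, and the `explicit_names` dict are hypothetical names that do not exist in the codebase; the sketch assumes the parser would keep only names that CATH assigns explicitly and leave the fallback decision to the caller. Option 3 is a display-layer change and is not covered here.

```python
def resolve_label(code, explicit_names, strategy="code"):
    """Return a display label for a CATH code (hypothetical helper).

    explicit_names: names taken verbatim from CATH, with no inheritance applied.
    strategy: "code" (option 1) or "disambiguate" (option 2).
    """
    name = explicit_names.get(code)
    if name:
        return name  # CATH assigns a name to this code: use it unchanged
    if strategy == "code":
        return code  # option 1: fall back to the code itself
    if strategy == "disambiguate":
        # option 2: show the parent topology name, qualified by the code
        parent = ".".join(code.split(".")[:3])
        parent_name = explicit_names.get(parent)
        return f"{parent_name} ({code})" if parent_name else code
    raise ValueError(f"unknown strategy: {strategy!r}")


# Only the explicitly named entries from the excerpt above.
explicit_names = {
    "2.10.25": "Laminin",
    "2.10.25.10": "Laminin",
    "6.20.10": "Laminin",
}

print(resolve_label("2.10.25.10", explicit_names))                  # Laminin
print(resolve_label("6.20.10.10", explicit_names, "code"))          # 6.20.10.10
print(resolve_label("6.20.10.10", explicit_names, "disambiguate"))  # Laminin (6.20.10.10)
```

Either strategy keeps 6.20.10.10, 6.20.10.20, and 6.20.10.30 distinct in the legend. The difference is whether the topology name appears in the label at all.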

Affected

  • File: src/protspace/data/annotations/retrievers/cath_names.py (_parse_cath_names, lines ~98–105)
  • Consumers: InterProRetriever._resolve_entry_names(), TedRetriever._resolve_cath_name()
  • Surfaces in the default dataset shipped to protspace.app

Related: #56 (CATH/InterPro annotation name sanitization).
