
_scanvi.py:148 SCANVI.load_query_data missing extend_categories=True rejects unknown_celltype_label #113

Description

@joschkahey

Summary

popv.algorithms._scanvi.SCANVI rejects query data with:

ValueError: Category unknown not found in source registry. Cannot transfer setup without extend_categories = True

This happens whenever prediction_mode="inference" runs against a query in which any cell's _labels_annotation carries the unknown_celltype_label sentinel (default "unassigned" for the Tabula Sapiens pretrained references).

Reproduction

from popv.hub import HubModel

# `query` is the query dataset to annotate (construction not shown).
hub = HubModel.from_pretrained("popv/tabula_sapiens_All_Cells")
hub.annotate_data(
    query,
    save_folder="popv_out",
    prediction_mode="inference",
    methods_list=["KNN_SCVI", "Support_Vector", "XGboost", "CELLTYPIST", "KNN_BBKNN", "KNN_HARMONY", "SCANVI_POPV"],
)

When the SCANVI_POPV voter runs, scvi-tools raises:

File "popv/algorithms/_scanvi.py", line 148, in compute_integration
    self.model = scvi.model.SCANVI.load_query_data(
File ".../scvi/model/_scanvi.py", ...
ValueError: Category unknown not found in source registry. Cannot transfer setup without extend_categories = True

Root cause

popv.preprocessing.Process_Query._setup_dataset:306-366 writes unknown_celltype_label into every query cell's _labels_annotation column and adds the sentinel string to the Categorical's category list. The trained TS scANVI model's source registry contains only real CL labels — the sentinel is not in it.

_scanvi.py:148 calls scvi.model.SCANVI.load_query_data(...) without extend_categories=True in the non-retrain branch:

# popv/algorithms/_scanvi.py:148
self.model = scvi.model.SCANVI.load_query_data(
    query,
    os.path.join(adata.uns["_save_path_trained_models"], "scanvi"),
    freeze_classifier=True,
)

scvi-tools' validation rejects the unseen category and aborts.
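To make the failure mode concrete, here is a minimal pure-Python sketch of a category-transfer check. It is a simplified stand-in and not scvi-tools' actual implementation: the function name `transfer_categories` and the example labels are hypothetical, and only the error text is taken from the traceback above.

```python
def transfer_categories(source_categories, query_labels, extend_categories=False):
    """Simplified stand-in for a category-transfer check (NOT scvi-tools code)."""
    merged = list(source_categories)
    # dict.fromkeys keeps one copy of each label, in first-seen order.
    for label in dict.fromkeys(query_labels):
        if label in merged:
            continue
        if not extend_categories:
            raise ValueError(
                f"Category {label} not found in source registry. "
                "Cannot transfer setup without extend_categories = True"
            )
        # With extension enabled, the unseen label is appended to the registry.
        merged.append(label)
    return merged


# Hypothetical labels: the reference knows only real cell types,
# while every query cell carries the sentinel.
source = ["B cell", "T cell"]
query = ["unknown", "unknown", "unknown"]

try:
    transfer_categories(source, query)
    rejected = False
except ValueError:
    rejected = True

extended = transfer_categories(source, query, extend_categories=True)
```

Under these assumptions the strict call raises, while the extended call returns the source categories with the sentinel appended at the end.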

Why the official tutorial does not hit this

The TS HubModel tutorial defaults to prediction_mode="fast", which skips the SCANVI voter's load_query_data call. Out-of-distribution queries that run prediction_mode="inference" with SCANVI_POPV in methods_list hit this reliably.

Suggested patch

Pass extend_categories=True in the non-retrain branch of _scanvi.py:148:

  self.model = scvi.model.SCANVI.load_query_data(
      query,
      os.path.join(adata.uns["_save_path_trained_models"], "scanvi"),
      freeze_classifier=True,
+     extend_categories=True,
  )

This is the same setting scvi-tools recommends for cross-dataset query adaptation; downstream prediction continues to map the extended category onto the trained label space.

Affected releases

Verified on 0.6.0 and main. No code diff between 0.6.0 and main on popv/algorithms/_scanvi.py beyond a docstring rename (verified via git diff 0.6.0..origin/main popv/algorithms/_scanvi.py).

Context

We carry an in-template monkey-patch for this in our downstream pipeline (Cytoreason nf-core-scdownstream) at modules/local/popv_ensemble/templates/popv_patches.py while waiting for the upstream fix. Happy to provide additional reproduction artifacts if useful.
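For anyone needing a stopgap in the meantime, the general shape of such a monkey-patch can be sketched as follows. This is not the contents of popv_patches.py: `FakeSCANVI` and `force_default_kwarg` are hypothetical names, the stand-in class only echoes its arguments, and whether the real scvi.model.SCANVI.load_query_data accepts extend_categories is taken from this report rather than verified here.

```python
import functools


class FakeSCANVI:
    """Hypothetical stand-in for scvi.model.SCANVI; it only echoes its arguments."""

    @classmethod
    def load_query_data(cls, adata, reference_model, **kwargs):
        return {"adata": adata, "reference_model": reference_model, **kwargs}


def force_default_kwarg(cls, method_name, **defaults):
    """Wrap a classmethod so the given keywords are applied when the caller omits them."""
    original = getattr(cls, method_name)  # already bound to cls

    @functools.wraps(original)
    def wrapper(*args, **kwargs):
        for key, value in defaults.items():
            kwargs.setdefault(key, value)  # an explicit caller value still wins
        return original(*args, **kwargs)

    # staticmethod: `original` is already bound, so no extra cls argument is needed.
    setattr(cls, method_name, staticmethod(wrapper))
    return original  # keep a handle so the patch can be undone


original = force_default_kwarg(FakeSCANVI, "load_query_data", extend_categories=True)

# The call site is unchanged, mirroring the call in _scanvi.py.
result = FakeSCANVI.load_query_data("query", "path/to/scanvi", freeze_classifier=True)
```

In a real pipeline the same helper would be pointed at the scvi-tools class before PopV runs, assuming the keyword is accepted there. Because the original is captured as a bound method, subclasses calling the patched method would still dispatch to the patched class, which is acceptable for a temporary workaround but is a reason to prefer the upstream fix.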
