Hi again! I opened a separate issue about the spot-wise cosine similarity, but I think there may also be an independent issue in the gene alignment.
R implementation
In the R implementation, both matrices are indexed using the same ordered vector of common genes:
cg <- intersect(colnames(ref$X), colnames(srt$X))
srt$X <- srt$X[, cg]
ref$X <- ref$X[, cg]
This ensures that both matrices have the same gene ordering before downstream computations.
Python implementation
In the Python implementation, the common genes are identified first:
common_genes = np.intersect1d(spatial['genes'], ref['genes'])
but each matrix is then filtered independently:
sp_idx = np.where(np.isin(spatial['genes'], common_genes))[0]
rf_idx = np.where(np.isin(ref['genes'], common_genes))[0]
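np.isin() returns a boolean mask in the order of the array being tested, so each set of indices follows that matrix's own gene order rather than the order of common_genes. Column correspondence therefore holds only if the common genes already appear in the same relative order in the spatial and reference matrices.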
This matters because the least-squares initialization, cosine similarities, reconstructed expression, and gradients all assume column-wise correspondence between the two expression matrices.
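To make the concern concrete, here is a minimal sketch with made-up gene names (the arrays `spatial_genes` and `ref_genes` are hypothetical stand-ins for `spatial['genes']` and `ref['genes']`, not data from the package). It applies the same `np.isin` filtering to two gene lists whose shared genes appear in different orders:

```python
import numpy as np

# Hypothetical gene lists: the shared genes (A, B, C) appear in a different order.
spatial_genes = np.array(["A", "B", "C", "D"])
ref_genes = np.array(["C", "A", "E", "B"])

# np.intersect1d returns the sorted, unique common values.
common_genes = np.intersect1d(spatial_genes, ref_genes)

# Same filtering pattern as in the Python implementation quoted above.
sp_idx = np.where(np.isin(spatial_genes, common_genes))[0]
rf_idx = np.where(np.isin(ref_genes, common_genes))[0]

# Each selection keeps the order of its own source array.
sp_selected = spatial_genes[sp_idx]  # A, B, C
rf_selected = ref_genes[rf_idx]      # C, A, B

# False here: column i of one matrix is not the same gene as column i of the other.
columns_aligned = bool(np.array_equal(sp_selected, rf_selected))
```

In this toy case both selections contain the same three genes, so shapes match and nothing fails loudly, but the columns no longer refer to the same genes.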
Different ordering can occur when the datasets have undergone different preprocessing. For example, if reference differential-expression selection returns genes in score-ranked order while the spatial preprocessing preserves the original feature order, the resulting matrices may no longer share the same relative gene ordering.
Unless I'm missing something, it would be safer to index both matrices using the same ordered vector of common genes (as in the R implementation), rather than relying on the existing ordering being identical.
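One possible way to do this, sketched under a few assumptions: genes are stored along the columns of each matrix, gene names are unique within each dataset, and the helper name `align_to_common_genes` is my own invention rather than anything in the package. It uses `np.intersect1d(..., return_indices=True)`, which returns the common values together with their positions in each input, so both matrices end up in the same gene order:

```python
import numpy as np

def align_to_common_genes(sp_X, sp_genes, rf_X, rf_genes):
    """Subset both matrices to their shared genes, in one common order.

    Assumes genes are columns and gene names are unique per dataset
    (with duplicates, np.intersect1d uses the first occurrence).
    """
    sp_genes = np.asarray(sp_genes)
    rf_genes = np.asarray(rf_genes)
    # common_genes is sorted; sp_idx / rf_idx locate those genes in each input.
    common_genes, sp_idx, rf_idx = np.intersect1d(
        sp_genes, rf_genes, return_indices=True
    )
    return common_genes, sp_X[:, sp_idx], rf_X[:, rf_idx]

# Toy example with hypothetical data: one row per matrix, one value per gene.
sp_X = np.array([[1, 2, 3, 4]])      # columns: A, B, C, D
rf_X = np.array([[30, 10, 50, 20]])  # columns: C, A, E, B

genes, sp_aligned, rf_aligned = align_to_common_genes(
    sp_X, ["A", "B", "C", "D"], rf_X, ["C", "A", "E", "B"]
)
# genes is A, B, C; sp_aligned is [[1, 2, 3]]; rf_aligned is [[10, 20, 30]]
```

Note that the resulting order is alphabetical rather than the first matrix's order as with R's `intersect()`, but what matters downstream is only that both matrices share the same order.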