Skip to content

Spin Design

Avirup Sircar edited this page Jun 7, 2026 · 10 revisions

MultiVectorProductSpace Architecture Plan


0. Memory Layout Reference

Symbol glossary

All index lists are written slowest → fastest. Multi-cell global arrays use C row-major. Hamiltonian blocks within a cell use col-major (BLAS convention).

Symbol Meaning Range
M total local DOF count on this MPI rank ≥ 1
S numSpaces = number of spin channels 1 (unpolarized) or 2 (collinear / non-collinear)
N numOrbitals = numVectorsPerSpace ≥ 1
C numCells ≥ 1
d, d_k number of basis DOFs in cell k; written d when uniform ≥ 1
Q quadrature points per cell ≥ 1
K number of XC potential components (1, S, or S²) 1, 2, or 4
dim spatial dimension 3
i global context: local DOF index on this rank 0..M-1
i, j per-cell context: row and column DOF index within cell 0..d-1
s, s_in, s_out spin index 0..S-1
n orbital index 0..N-1
k §0.1–§0.6, §5–§12: cell index 0..C-1
k §4a–§4b: XC component index 0..K-1
c §4a–§4b: cell index 0..C-1
p spatial DOF index within cell (gather loops) 0..d-1
p_v virtual DOF index within cell = p·S + s (non-collinear) 0..S·d-1
q quadrature point index within cell 0..Q-1
row, col row / column of the (S·d)×(S·d) Hamiltonian matrix 0..S·d-1
α Cartesian spatial component 0..dim-1

Symbol scoping note: i and j denote cell-local DOF indices (0..d-1) in all per-cell formulas (§0.3 H_cell, §4a block copies). In global eigenvector formulas (§0.1, §0.2) i is the rank-local DOF index (0..M-1). The context (per-cell vs. global) disambiguates the range unambiguously.


0.1 Eigenvectors (MultiVectorProductSpace / MultiVectorProductSpaceBlocked)

Global:   M  ×  S  ×  N       (M slowest, N fastest)
Flat:     data[i*S*N + s*N + n]
  • i = local DOF index (0..M-1)
  • s = space index (0..S-1)
  • n = orbital index (0..N-1)

Two cell gather formats — selected by the multivector type passed to copyFieldToCellWiseData:

Collinear  (MultiVectorProductSpaceBlocked):
  numComponents = S*N,  cellLocalIds = spatial node indices (Σd_k entries)
  Output (C row-major): Σd_k × S*N   →  per cell: d_k × S*N
  BLAS:  S*N × d_k
  Spin-s block: ptr = xCell + s*N,  lda = S*N   (no copy, valid BLAS strided submatrix)

Non-collinear  (MultiVectorProductSpace):
  numComponents = N,  cellLocalIds = virtual spin-DOF indices i*S+s  (S*Σd_k entries)
  virtualId computed inline as spatialId * S + s — no separate spinAwareCellLocalIds array
  Output (C row-major): S*Σd_k × N  →  per cell: S*d_k × N
  BLAS:  N × S*d_k

Unpolarized  (MultiVectorProductSpace, S=1):
  Same ProductSpace overload as non-collinear; S=1 degenerates:
  virtualId=spatialId*1+0=spatialId, BLAS N × d_k — identical to old plain-MultiVector path

xCell storage — bracket notation (slowest → fastest)

Physical storage per cell is identical for all three modes:

xCell (per cell):  [dof (d_k)]  [spin (S)]  [orbital (N)]

The BLAS k-dim (contracting index) boundary differs by mode:

Mode BLAS k-dim (contracting) BLAS m-dim lda
Collinear [dof] per spin-s diagonal block [orbital (N)] S*N
Non-collinear [dof][spin] (virtual DOF i*S+s) [orbital (N)] N
Unpolarized [dof] [orbital (N)] N

Why non-collinear contracts over spin: Non-collinear H has nonzero off-diagonal spin blocks (H_{01}, H_{10} ≠ 0). The output at spin s_out requires:

(Hψ)_{s_out,n} = Σ_{s_in} Σ_i  H_{(j,s_out),(i,s_in)}  ψ_{i,s_in,n}

Both spatial DOFs and spin channels must be contracted simultaneously. Collinear is the special case: H_{01} = H_{10} = 0, spins decouple, one GEMM per diagonal block with no cross-spin contraction.


0.2 Eigenvalues and Fractional Occupancy

Unified flat format for all cases:

std::vector<RealType> eigenValues           // size Nflattened = S*N
std::vector<RealType> fractionalOccupancy   // size Nflattened = S*N
Flat index: s*N + n
  • Unpolarized (S=1): Nflattened = N, degenerates to existing format.
  • Collinear (S=2): eigenValues[s*N + n] = eigenvalue of space-s orbital-n. Each spin channel has independent eigenvalues and occupancies: fractionalOccupancy[0*N + n] ≠ fractionalOccupancy[1*N + n] in general.
  • NonCollinear (S=2): eigenValues[s*N + n] — only N are physically distinct (one eigenvalue per spinor). By construction eigenValues[0*N+n] == eigenValues[1*N+n] and fractionalOccupancy[0*N+n] == fractionalOccupancy[1*N+n] for all n. The flat S*N format is kept for uniformity; FractionalOccupancyFunction sets both s=0 and s=1 entries to the same value for non-collinear.

RayleighRitzEigenSolver blocked and coupled overloads both output flat std::vector<RealType> of size Nflattened — no std::vector<std::vector<RealType>>.


0.3 Cell-wise Hamiltonian buffer

Owned by KohnShamOperatorContextFE. Same shape for Collinear and NonCollinear:

numCells × (S·d)²      row-major

Within one cell, the (S·d) × (S·d) block matrix is laid out as:

[ H_{00}  H_{01} ]     Collinear (spin-blocked) only: H_{s_in, s_out} occupies
[ H_{10}  H_{11} ]     rows [s_in*d : (s_in+1)*d], cols [s_out*d : (s_out+1)*d].
                        Non-collinear uses DOF-outer ordering — spin blocks are
                        non-contiguous (see H_cell bracket notation below).

Flat offset of element (row, col) within cell k (col-major within cell):
  k*(S*d)^2 + col*(S*d) + row
  where  row, col ∈ 0..S·d-1

Leading dimension (lda/ldb) for BLAS calls: S*d   (shared for all three matrices X, H, Y)

Diagonal block s pointer arithmetic (for collinear GEMM):

H_s_ptr = H_cell_ptr + s*d*(S*d + 1)    // top-left of block (s,s)
ldb     = S*d                            // row stride — native BLAS lda parameter

Collinear: H_{01} = H_{10} = 0, only diagonal blocks H_{00}, H_{11} filled. NonCollinear: all four blocks filled. Unpolarized: S=1, degenerates to numCells × d² (existing layout, no change).

Why not numCells × S × S × d × d? The spin-spin-outer layout would give flat offset k*S²*d² + s_in*S*d² + s_out*d² + i*d + j, which is not a contiguous (S·d)×(S·d) matrix. The single non-collinear GEMM would require a copy or a complex strided call. The numCells × (S·d)² layout avoids this — both the full GEMM and the diagonal sub-GEMMs operate with the same lda = S*d.

H_cell bracket notation (col-major within cell, slowest → fastest)

Two distinct inner orderings, one per spin mode:

Collinear (spin-blocked):
  H_cell:  [cell]  [col_spin (S)]  [col_dof (d)]  [row_spin (S)]  [row_dof (d)]
  flat = (s_out·d + j)·(S·d) + s_in·d + i
  diagonal block s:  ptr = H_cell + s·d·(S·d + 1),  lda = S·d

Non-collinear (DOF-outer):
  H_cell:  [cell]  [col_dof (d)]  [col_spin (S)]  [row_dof (d)]  [row_spin (S)]
  flat = (j·S + s_out)·(S·d) + i·S + s_in

Unpolarized (S=1):
  H_cell:  [cell]  [col_dof (d)]  [row_dof (d)]
  flat = j·d + i

xCell col ↔ H_cell row — must match for GEMM correctness:

Mode xCell k-dim (col) H_cell row index Match
Collinear [dof] within spin-s block s_in·d + i = spin fixed, then dof
Non-collinear [dof][spin] DOF-outer: i·S+s_in i·S + s_in = dof outer, spin inner
Unpolarized [dof] [dof]

The two inner layouts are not interchangeable — each is forced by its gather's virtual DOF ordering and cannot be unified.


0.4 Interpolated wavefunction (QuadratureValuesContainer)

After FEBasisOperations::interpolate, the QVC holds Nflattened = S*N components per quadrature point. Layout within one cell (C contiguous cells):

numCells × Q × S × N      (numCells slowest, N fastest)
Flat: cell*Q*S*N + q*S*N + s*N + n

This matches the global eigenvector layout (S second-fastest, N fastest) so the gather from M × S × N into the QVC is a contiguous block copy per cell.


0.5 Density

Density is stored in DensityDescrAttr::Val as a vector of ncomp QVCs, each with 1 component per quadrature point (scalar field). This matches the existing codebase convention used by ExchangeCorrelationFE and RDM1FE.

Mode ncomp densVal[0] densVal[1] densVal[2] densVal[3]
Unpolarized 1 ρ_total
Collinear 2 ρ_total Mz
Non-collinear 4 ρ_total Mx My Mz

Definitions (computed from wavefunctions at each quad point q):

Unpolarized (S=1):
  ρ_total(q) = Σ_n f_n |ψ_n(q)|²

Collinear (S=2):
  f[s*N+n] independent for each spin s
  ρ_total(q) = Σ_s Σ_n f[s*N+n] |ψ_{s,n}(q)|²
  Mz(q)      = Σ_n f[0*N+n] |ψ_{0,n}(q)|²  −  Σ_n f[1*N+n] |ψ_{1,n}(q)|²

Non-collinear (S=2):
  f[0*N+n] = f[1*N+n] = f_n  (same occupancy for both spin components of spinor n)
  ρ_total(q) = Σ_n f_n ( |ψ_{0,n}(q)|² + |ψ_{1,n}(q)|² )
  Mz(q)      = Σ_n f_n ( |ψ_{0,n}(q)|² − |ψ_{1,n}(q)|² )
  Mx(q)      = Σ_n f_n   2 Re[ ψ*_{0,n}(q) ψ_{1,n}(q) ]
  My(q)      = Σ_n f_n (−2 Im[ ψ*_{0,n}(q) ψ_{1,n}(q) ])

Relationship to spin-density matrix (used internally by XC functionals):

ρ_{↑↑} = (ρ_total + Mz) / 2
ρ_{↓↓} = (ρ_total − Mz) / 2
ρ_{↑↓} = (Mx − i·My) / 2
ρ_{↓↑} = (Mx + i·My) / 2

ExchangeCorrelationFE::reinitField already performs this conversion for collinear:

spinVals[0][i] = 0.5 * (rhoTotal[i] + Mz[i]);   // ρ↑
spinVals[1][i] = 0.5 * (rhoTotal[i] - Mz[i]);    // ρ↓

The non-collinear conversion (Mx, My, Mz → full spin-density matrix) follows the same logic.


0.6 Gradient of density

Same ncomp-vector-of-QVCs convention as density. Each gradient QVC has 3 components per quadrature point (Cartesian α = 0, 1, 2).

Mode ncomp gradDensVal[0] gradDensVal[1] gradDensVal[2] gradDensVal[3]
Unpolarized 1 ∇ρ_total
Collinear 2 ∇ρ_total ∇Mz
Non-collinear 4 ∇ρ_total ∇Mx ∇My ∇Mz
Unpolarized:
  ∇ρ_total(q, α) = Σ_n f_n 2 Re[ψ*_n(q) ∂_α ψ_n(q)]

Collinear:
  ∇ρ_total(q, α) = Σ_s Σ_n f[s*N+n] 2 Re[ψ*_{s,n}(q) ∂_α ψ_{s,n}(q)]
  ∇Mz(q, α)      = Σ_n f[0*N+n] 2 Re[ψ*_{0,n} ∂_α ψ_{0,n}]
                  − Σ_n f[1*N+n] 2 Re[ψ*_{1,n} ∂_α ψ_{1,n}]

Non-collinear:
  ∇ρ_total, ∇Mz same structure as collinear with f_n uniform
  ∇Mx(q, α) = Σ_n f_n 2 Re[∂_α(ψ*_{0,n} ψ_{1,n})]
  ∇My(q, α) = Σ_n f_n (−2 Im[∂_α(ψ*_{0,n} ψ_{1,n})])

0.7 Unified per-mode layout: multiVector → xCell → H_cell → yCell

Per cell k with d = d_k; p = spatial DOF index within cell (0..d-1). All BLAS GEMMs below are H · x → y where H is presented col-major, x/y are presented as col-major (C row-major viewed transposed). lda = leading dimension (column stride).


Collinear (S=2, MultiVectorProductSpaceBlocked)

multiVector  (global, C row-major):
  layout:  [DOF (M)]  [spin (S)]  [orbital (N)]
  flat:    i·S·N + s·N + n        i ∈ 0..M-1

xCell  (per cell, C row-major):
  layout:  [spatial DOF (d)]  [spin (S)]  [orbital (N)]
  flat:    p·S·N + s·N + n        p ∈ 0..d-1
  BLAS view (transposed):  [spin-orbital (S·N)] × [spatial DOF (d)]
  spin-s block:  ptr = xCell + s·N,  lda = S·N   (no copy — valid BLAS strided submatrix)

H_cell  (per cell, col-major, lda = S·d):
  layout:  [col_spin (S)]  [col_dof (d)]  [row_spin (S)]  [row_dof (d)]
  flat:    (s_out·d + j)·(S·d) + s_in·d + i        i,j ∈ 0..d-1
  diagonal block s:  ptr_s = H_cell + s·d·(S·d+1)

GEMM  (batchCount = S×C, one per spin-s block per cell):
  H_s · xCell_s  →  yCell_s
  (d × d) · (d × N) = (d × N)
  contracts over:  [col DOF (d)] of H = [spatial DOF (d)] of xCell_s  ✓
  (off-diagonal blocks H_{s_in≠s_out} = 0; only diagonal blocks used)

yCell  (per cell, C row-major):
  layout:  [spatial DOF (d)]  [spin (S)]  [orbital (N)]    ← identical to xCell
  spin-s block:  ptr = yCell + s·N,  lda = S·N

Non-collinear (S=2, MultiVectorProductSpace)

multiVector  (global, C row-major):
  layout:  [DOF (M)]  [spin (S)]  [orbital (N)]
  flat:    i·S·N + s·N + n        (identical to collinear)

xCell  (per cell, C row-major):
  virtualId = spatialId·S + s  →  DOF-outer within cell: [spatial DOF (d)]  [spin (S)]
  layout:  [spatial DOF (d)]  [spin (S)]  [orbital (N)]
  flat:    (p·S + s)·N + n        p ∈ 0..d-1
  BLAS view (transposed):  [orbital (N)] × [virtual DOF (S·d)]
  lda = N

H_cell  (per cell, col-major, lda = S·d):
  layout:  [col_dof (d)]  [col_spin (S)]  [row_dof (d)]  [row_spin (S)]
  flat:    (j·S + s_out)·(S·d) + i·S + s_in        i,j ∈ 0..d-1

GEMM  (batchCount = C, one per cell):
  H · xCell  →  yCell
  (S·d × S·d) · (S·d × N) = (S·d × N)
  contracts over:  [col virtual DOF (S·d)] of H = [virtual DOF (S·d)] of xCell  ✓
  (off-diagonal H blocks H_{s_in,s_out≠s_in} ≠ 0; spin channels must be contracted together)

yCell  (per cell, C row-major):
  layout:  [spatial DOF (d)]  [spin (S)]  [orbital (N)]    ← identical to xCell
  lda = N

Unpolarized (S=1, MultiVectorProductSpace)

multiVector  (global, C row-major):
  layout:  [DOF (M)]  [orbital (N)]          (S=1 collapses spin axis)
  flat:    i·N + n        i ∈ 0..M-1

xCell  (per cell, C row-major):
  S=1 → virtualId = spatialId·1 + 0 = spatialId  (degenerates to spatial gather)
  layout:  [spatial DOF (d)]  [orbital (N)]
  flat:    p·N + n        p ∈ 0..d-1
  BLAS view (transposed):  [orbital (N)] × [spatial DOF (d)]
  lda = N

H_cell  (per cell, col-major, lda = d):
  layout:  [col_dof (d)]  [row_dof (d)]      (no spin blocks; S=1 → (S·d)²=d²)
  flat:    j·d + i        i,j ∈ 0..d-1

GEMM  (batchCount = C, one per cell):
  H · xCell  →  yCell
  (d × d) · (d × N) = (d × N)
  contracts over:  [col DOF (d)] = [spatial DOF (d)]  ✓

yCell  (per cell, C row-major):
  layout:  [spatial DOF (d)]  [orbital (N)]    ← identical to xCell
  lda = N

Contraction alignment — xCell column ↔ H_cell row (must match for GEMM correctness):

Mode xCell BLAS k-dim (columns of x) H_cell row index Aligned?
Collinear p within spin-s block (0..d-1) s_in·d + i, spin fixed → i ∈ 0..d-1
Non-collinear p·S+s (virtual DOF, 0..S·d-1) i·S+s_in (DOF-outer, 0..S·d-1)
Unpolarized p (0..d-1) i (0..d-1)

1. Goals and Constraints

Enable two storage patterns with minimal forking.

Explicitly unchanged:

  • ChebyshevFilteredEigenSolver / ChebyshevFilter — internal batching logic untouched; correctness for spin is ensured by KohnShamEigenSolver passing eigenVectorBatchSize=S*N
  • FEBasisOperations::interpolate
  • FEBasisOperations::computeFEMatricesAPI/signature unchanged; QuadratureValuesContainer already carries numComponents (K), so passing a K-component V_xc QVC naturally produces K dofs×dofs matrices in one call (basis values computed once). No new overload needed. Implementation extended: K>1 throw removed; output resized to K × basisOverlapSize; K loop added inside the cell block loop (see §4a / Phase 11).

Modified (minimal changes only, existing logic untouched):

  • LanczosExtremeEigenSolverd_initialGuess and constructor/reinit parameter changed from Vector to MultiVector; existing Vector callers unaffected (implicit upcast)
  • RayleighRitzEigenSolver — existing solve(MultiVector&) extended with one dynamic_cast branch (Blocked) + unconditional ProductSpace cast; no new public overloads
  • KohnShamOperatorContextFE — stores d_spinMode; owns H cell buffers; reinit zero-fills then axpy-accumulates fully-assembled buffers from each component's getLocal(); apply dispatches via single private computeAxCellWise with if/else on d_spinMode
  • FECellWiseDataOperations — two typed overloads of copyFieldToCellWiseData / addCellWiseDataToFieldData as clean public API; gather/scatter primitives only
  • KohnShamEigenSolver — constructs typed product-space multivector; sets eigenVectorBatchSize=S*N
  • KohnShamDFT, DensityCalculator, ExchangeCorrelationFE

Dispatch design — three clear rules:

  1. ksdft/d_spinMode is the sole dispatch mechanism in apply. No dynamic_cast, no isBlockDiagonal flag. Single private computeAxCellWise with one if/else on d_spinMode.
  2. linearAlgebra/dynamic_cast on eigenVectors in RayleighRitzEigenSolver::solve. Works because eigenVectors is the original typed object passed by reference all the way down from KohnShamEigenSolver. ChebyshevFilteredEigenSolver is completely unchanged — its d_XinBatch is plain MultiVector; ksdft handles dispatch via d_spinMode.
  3. basis/ — spin-agnostic. FEBasisOperations::interpolate uses numComponents = X.getNumberComponents() = S*N — no dynamic_cast, no spin knowledge. FECellWiseDataOperations typed overloads exist as clean public API for callers with typed objects; they are not the dispatch mechanism in the apply hot path.

Rule: linearAlgebra/ and basis/ layers are physics-agnostic.


2. Class Hierarchy and Memory Layout

2.1 Class hierarchy

MultiVector
  └── MultiVectorProductSpace          (S=1: unpolarized, S=2: non-collinear)
        └── MultiVectorProductSpaceBlocked    (S=2: collinear)
  • MultiVectorProductSpace with S=1 covers the unpolarized case — no separate plain-MultiVector path needed for new code.
  • MultiVectorProductSpaceBlocked inherits from MultiVectorProductSpace so the compiler picks the most-derived overload in RayleighRitzEigenSolver::solve and MultiVectorOps::project / rotate.

2.2 Unified memory layout

All classes use the same axis order:

M  (slowest)  ×  numSpaces  ×  numVectors  (fastest)
  • MlocallyOwnedSize(), DOF count per MPI rank
  • numSpaces — product-space inner dimension (S=1 unpolarized, S=2 collinear/non-collinear)
  • numVectorsPerSpace — number of vectors per space component (N)
  • Nflattened = numSpaces × numVectorsPerSpace ← what the base MultiVector sees as numVectors()

Data for DOF i, space s, vector n sits at flat offset i*S*N + s*N + n.

The only difference between the two classes is the RayleighRitz dispatch:

  • MultiVectorProductSpace — single ELPA call (unpolarized S=1, non-collinear S=2)
  • MultiVectorProductSpaceBlocked — ELPA loop over S independent eigenproblems (collinear S=2)

H buffer shape (numCells × (S·dofs)²) is identical for both classes. computeAxCellWise is a single private method in KohnShamOperatorContextFE that dispatches on d_spinMode — the gather format and GEMM shape differ by spin mode (see §5). RayleighRitz dispatch differs: MultiVectorProductSpaceBlocked → ELPA loop; MultiVectorProductSpace → single ELPA.


3. New Classes in src/linearAlgebra/

3.1 MultiVectorProductSpace

Derives from MultiVector with numVectors_base = numSpaces × numVectorsPerSpace. Covers unpolarized (S=1) and non-collinear (S=2).

Files to add:
  src/linearAlgebra/MultiVectorProductSpace.h
  src/linearAlgebra/MultiVectorProductSpace.t.cpp
template <typename ValueType, utils::MemorySpace memorySpace>
class MultiVectorProductSpace : public MultiVector<ValueType, memorySpace>
{
public:
  MultiVectorProductSpace(
    std::shared_ptr<const utils::mpi::MPIPatternP2P<memorySpace>> mpiPatternP2P,
    std::shared_ptr<LinAlgOpContext<memorySpace>>                  linAlgOpContext,
    size_type                                                      numSpaces,          // S
    size_type                                                      numVectorsPerSpace, // N
    ValueType                                                      initVal = 0);

  size_type numSpaces()          const;   // S
  size_type numVectorsPerSpace() const;   // N
  // base numVectors() = S*N  (Nflattened)

private:
  size_type d_numSpaces;
  size_type d_numVectorsPerSpace;
};

3.2 MultiVectorProductSpaceBlocked

Derives from MultiVectorProductSpace. Identical layout and interface. Differs only in that RayleighRitzEigenSolver::solve and MultiVectorOps::project / rotate pick the blocked overload (ELPA loop) via most-derived overload resolution. Used for collinear (S=2).

Files to add:
  src/linearAlgebra/MultiVectorProductSpaceBlocked.h
  src/linearAlgebra/MultiVectorProductSpaceBlocked.t.cpp
template <typename ValueType, utils::MemorySpace memorySpace>
class MultiVectorProductSpaceBlocked
  : public MultiVectorProductSpace<ValueType, memorySpace>
{
public:
  // Inherits constructor, numSpaces(), numVectorsPerSpace() unchanged.
  using MultiVectorProductSpace<ValueType, memorySpace>::MultiVectorProductSpace;
};

3.3 Inherited operations — no overrides needed

Both classes inherit all MultiVector operations operating on Nflattened = S*N flat columns. No overrides are needed:

add, axpy, scalar ops — element-wise on all Nflattened entries. Correct for both collinear and non-collinear as-is.

l2Norms() / lInfNorms() — returns S*N values, one per flat (s,n) column:

  • Collinear: correct — S*N independent norms, one per spin-orbital.
  • Non-collinear: returns per-spin-component norms, not per-spinor norm (true spinor norm = sqrt(||ψ_{0,n}||² + ||ψ_{1,n}||²)). However ChebyshevFilter uses these only for normalization/convergence checks — per-component norms are acceptable there and cause no correctness issue.

dot() — computes S*N independent dot products, one per flat column pair:

  • Collinear: correct — S*N independent dot products.
  • Non-collinear: the RayleighRitz projection requires P_mn = Σ_s X[:,s*N+m]^H (HX)[:,s*N+n] — a block-trace sum across s that the base dot() does NOT perform. This is why MultiVectorOps::project exists — it handles the coupled sum explicitly and is the only path used for the projection step. The base dot() is never called directly for RayleighRitz projection.

4. Unified H Buffer and Cell Gather

Both MultiVectorProductSpace and MultiVectorProductSpaceBlocked share:

H cell buffer shape (owned by KohnShamOperatorContextFE):

numCells × (S·dofs)²

Layout within one cell: (S·dofs) × (S·dofs) col-major (BLAS convention), block-indexed as H_cell[s_in * dofs : (s_in+1)*dofs, s_out * dofs : (s_out+1)*dofs] (collinear spin-blocked only; non-collinear uses DOF-outer ordering — see §0.3 bracket notation).

Cell gather — format differs by spin mode (see §5 and §0.1):

Collinear  (Blocked overload):  S*N × d_k   spatial-flat; spin-s block via ptr+s*N, lda=S*N
Unpolarized/NonCollinear (ProductSpace overload):  N × S*d_k  spin-aware; virtualId=spatialId*S+s

Cell scatter-back: mirrors gather, same typed overload.

H assembly in KohnShamOperatorContextFE::reinit:

Each Hamiltonian component is constructed with SpinMode and encapsulates its own spin structure. getLocal() always returns a numCells × (S·dofs)² buffer, fully assembled. KohnShamOperatorContextFE::reinit is uniform — it just accumulates:

std::fill(d_HCellBuffer.begin(), d_HCellBuffer.end(), 0);
for (auto& h : d_hamiltonianComponents)
  axpy(1.0, h.getLocal(), d_HCellBuffer);   // all same shape: numCells × (S·dofs)²

SpinMode is encapsulated inside each Hamiltonian component for H assembly. The operator stores d_spinMode for the if/else dispatch in computeAxCellWise (collinear vs else-branch).

What each component does internally in getLocal():

Component Spin structure computeFEMatrices (one call, K outputs) Post-processing in getLocal() Note
KineticFE, ElectrostaticLocalFE No spin index — same potential for all s K=1 → 1 dofs×dofs matrix (same for all s) Copy that same matrix into each of the S diagonal blocks of (S·dofs)²; off-diagonal zero 1 result → S copies
ExchangeCorrelationFE (Collinear) K=2 V_xc components (V_xc^↑ ≠ V_xc^↓) K=2 → 2 different dofs×dofs matrices 2 strided block-copies into diagonal blocks (0,0) and (1,1) of (S·dofs)²; off-diagonal zero 2 distinct results
ExchangeCorrelationFE (NonCollinear) K=4 V_xc components (V_xc^↑↑, V_xc^↑↓, V_xc^↓↑, V_xc^↓↓ all different) K=4 → 4 different dofs×dofs matrices 4 DOF-outer element scatters into all four blocks; element (i,j) of block k=(s_in,s_out) placed at (row=i·S+s_in, col=j·S+s_out) in the (S·dofs)² matrix (stride S; see §4b) 4 distinct results
Any component (Unpolarized, S=1) S=1, (S·dofs)²=dofs² K=1 → 1 dofs×dofs matrix No block-copy needed — buffer IS the H cell matrix degenerate

computeFEMatrices — unchanged; K encoded in QVC numComponents

QuadratureValuesContainer already carries numComponents = K. The existing computeFEMatrices interface is unchanged — getLocal() passes the K-component V_xc QVC directly and receives K dofs×dofs matrices (one per component), with basis values computed once internally. No new overload needed.

Unpolarized (K=1):
  density QVC:  numCells × Q × 1    [ρ]
  V_xc QVC:     numCells × Q × 1    [V_xc]
  output:       numCells × 1 × dofs²  → placed at diagonal block (0,0) of dofs² buffer

Collinear (K=2):
  density QVC:  numCells × Q × 2    [ρ_↑, ρ_↓]   ← ExchangeCorrelationFE internal form;
                                                      converted from densVal [ρ_total, Mz] (§0.5)
                                                      via ρ_↑=(ρ_total+Mz)/2, ρ_↓=(ρ_total−Mz)/2
  V_xc QVC:     numCells × Q × 2    [V_xc^↑, V_xc^↓]
  output:       numCells × 2 × dofs²  → block-copied to diagonal blocks (0,0),(1,1) of (S·dofs)²

NonCollinear (K=4):
  density QVC:  numCells × Q × 4    [ρ_↑↑, ρ_↑↓, ρ_↓↑, ρ_↓↓]   ← ExchangeCorrelationFE internal form;
                                                                       converted from densVal [ρ_total,Mx,My,Mz] (§0.5)
                                                                       via ρ_{↑↑}=(ρ_total+Mz)/2, ρ_{↓↓}=(ρ_total−Mz)/2,
                                                                            ρ_{↑↓}=(Mx−i·My)/2, ρ_{↓↑}=(Mx+i·My)/2
  V_xc QVC:     numCells × Q × 4    [V_xc^{↑↑}, …, V_xc^{↓↓}]
  output:       numCells × 4 × dofs²  → block-copied to all four blocks of (S·dofs)²

Each component's getLocal() calls computeFEMatrices once with the K-component QVC, then does K block-copies (or element scatters) into the (S·dofs)² H buffer:

  • Collinear: lda-change copy to diagonal block (k,k) at (k·dofs, k·dofs) offset.
  • Non-collinear: element scatter (stride S) to DOF-outer block (s_in, s_out) at top-left offset (s_in, s_out) (not (s_in·dofs, s_out·dofs); see §4a / §4b for exact formulas). Spin-block assembly lives entirely in getLocal().

4a. ExchangeCorrelationFE — computeFEMatrices Extension and getLocal()

Why existing computeFEMatrices fails for K > 1

Two hard blockers in the current implementation:

Hard throw for K > 1 (BasisWeakFormKernelWithField, FEBasisOperations.t.cpp):

utils::throwException(
  f.getNumberComponents() == 1,
  "quadValuesContainer f in BasisWeakFormKernelWithField"
  " can be only a scalar field in real space with 1 component.");

Removed when the function is extended to handle K > 1.

Output layout mismatch: existing function resizes cellWiseFEData to Σdofs_k² (one d×d per cell). For collinear S=2, getLocal() must return Σ(S·d_k)² — a (S·d)×(S·d) per cell. A block copy is always required after computeFEMatrices.

K > 1 extension — no new overload

The existing function is modified in place:

  • Throw for K > 1 removed.
  • cellWiseFEData resized to K × basisOverlapSize.
  • K loop placed inside the cell block loop — getBasisDataInCellRange called once per block, reused for all K components.
  • Basis layer remains spin-agnostic: K is a generic component count.

Output storage: [K][cell][d²] (K slowest, element fastest within cell)

flat offset for component k, cell c, col-major element (i,j):
  k * basisOverlapSize + cellOffset_c + j*d_c + i

where  basisOverlapSize = Σ_c d_c²
       cellOffset_c     = Σ_{c'<c} d_{c'}²

K values and k → (s_in, s_out) mapping:

Spin mode K s_in s_out
Unpolarized 1 0 0
Collinear S=2 k k (diagonal only; use s_in=k, s_out=k directly)
Non-collinear S²=4 k/S k%S

Internal structure (K loop inside cell block loop):

K = f.getNumberComponents()
cellWiseFEData.resize(K * basisOverlapSize, 0)

// Build fxJxW for all K components (one device kernel):
// fxJxW[k * numCumulativeQuadCells + q] = f(q, k) * JxW[q]
// f.data() layout: [quad][K], flat index q*K + k
buildAllComponentFxJxW(f, jxwStorage, K, fxJxW)

cell block loop:
  getBasisDataInCellRange(cellStartId, cellEndId, d_basisDataInCellRange)  // ONCE per block

  for k = 0..K-1:
    scaleStridedVarBatched(
      m=1, n=d, k=Q,
      A = fxJxW.data() + k*numCumulativeQuadCells + quadCellsInBlockOffset,
      B = d_basisDataInCellRange,
      C = d_fxJxWxNBlock)

    gemmStridedVarBatched(
      transA='N', transB='C', m=d, n=d, k=Q,  lda=ldb=ldc=d,
      A = d_fxJxWxNBlock,
      B = d_basisDataInCellRange,
      C = cellWiseFEData.begin() + k*basisOverlapSize + NifNjStartOffset)

  NifNjStartOffset        += numCumulativeDofsxDofsCellsInBlock
  quadCellsInBlockOffset  += numCumulativeQuadCellsInBlock
end cell block loop

d_fxJxWxNBlock is reused for each k — same size as current. K=1 degenerates exactly to the existing path.

GGA with K > 1 — no dim loop

Matrix element for GGA component k:

M_k[i,j] = Σ_{alpha} Σ_q  f_{k,alpha}(q)·JxW(q)·N_i(q)·∂N_j/∂x_alpha(q)

d_derExcWithSigmaTimesGradRhoQuadMemspace: K*dim components, layout [quad][K][dim].

Inside the K loop, two parameter changes relative to LDA:

LDA GGA
fxJxW size per component Q Q·dim
scaleStridedVarBatched m 1 dim
d_fxJxWxNBlock per cell d×Q d×(Q·dim)
gemmStridedVarBatched k Q Q·dim
GEMM B basisData gradBasis
Output per (k, cell) d×d d×d (same)

Both A and B have identical column ordering q*dim + alpha, so the double contraction over (q, alpha) is correct with a single GEMM per (k, cell block). No dim loop needed.

Block copy in getLocal(): [K][cell][d²][cell][(S·d)²]

Source: component k, cell c — plain col-major d×d at k*basisOverlapSize + cellOffset_c.

Target: col-major (S·d)×(S·d) per cell, ldb = S·d. Two formulas by spin mode:

Collinear — spin-blocked H (k → diagonal block (k, k)):

Pure lda-change; src and dst both contiguous within each column copy.

For each cell c, column j = 0..d-1:

src:  cellWiseFEData[k*basisOverlapSize + cellSrcOffset + j*d]   (d contiguous)
dst:  cellWiseStorage[cellDstOffset + (k*d+j)*(S*d) + k*d]       (d contiguous)
length: d,  lda_src = d,  lda_dst = S*d

Non-collinear — DOF-outer H (k → block (s_in=k/S, s_out=k%S)):

Element scatter; stride S in dst (50% GPU write coalescing for S=2).

For each cell c, column j = 0..d-1, row i = 0..d-1:

src:  cellWiseFEData[k*basisOverlapSize + cellSrcOffset + j*d + i]      (sequential)
dst:  cellWiseStorage[cellDstOffset + (j*S+s_out)*(S*d) + i*S+s_in]    (stride S)

Source reads are sequential for both modes. Destination write coalescing:

Spin mode Dst stride within column GPU coalescing
Collinear stride 1 (d consecutive) fully coalesced
Non-collinear stride S=2 50% coalesced

Copy cost is negligible relative to the GEMM (ratio ≈ Q = 1728× cheaper).

Off-diagonal blocks for collinear are covered by zero-init of cellWiseStorage on entry.

getLocal() — one computeFEMatrices call, K copyIntoBlock calls

void getLocal(Storage &cellWiseStorage) const
{
  const size_type K = d_xcPotentialQuadMemspace->getNumberComponents();
  const size_type S = d_S;

  cellWiseStorage.resize(hSize, ValueType(0));   // zero-init covers collinear off-diag blocks

  // single call: basis fetched once per cell block, K GEMMs inside
  // output layout: [K][cell][d²]
  d_feBasisOp->computeFEMatrices(
    LinearLocalOp::IDENTITY, VectorMathOp::MULT,
    VectorMathOp::MULT,      LinearLocalOp::IDENTITY,
    *d_xcPotentialQuadMemspace,   // K-component QVC
    d_xcCellWiseTemp,             // Storage member, K × Σdofs²
    *d_linAlgOpContext);

  // K block-copies: [K][cell][d²] → [cell][(S·d)²]
  for (size_type k = 0; k < K; ++k)
    {
      if (K == S) {
        // collinear: component k → diagonal block (k,k); spin-blocked H; lda-change copy
        copyIntoBlockSpinBlocked(d_xcCellWiseTemp, k,
                                 cellWiseStorage, /*s=*/k, S,
                                 d_numCellDofs, *d_linAlgOpContext);
      } else {
        // non-collinear: component k → block (s_in=k/S, s_out=k%S); DOF-outer H; element scatter
        const size_type s_in  = k / S;
        const size_type s_out = k % S;
        copyIntoBlockDOFOuter(d_xcCellWiseTemp, k,
                              cellWiseStorage, s_in, s_out, S,
                              d_numCellDofs, *d_linAlgOpContext);
      }
    }
}

Unpolarized (K=S=1): the K==S branch fires with s=0. copyIntoBlockSpinBlocked places the single d×d block at diagonal (0,0) — no actual copy since H cell IS the d×d buffer (S=1).

Spin logic lives only in getLocal(). computeFEMatrices is spin-agnostic throughout.


4b. getLocal() — Memory layout before and after copyIntoBlock

In §4a notation: c = cell index (0..C-1), k = XC component index (0..K-1), i, j = cell-local DOF indices (0..d-1, where d = d_c).

Before copyIntoBlock — d_xcCellWiseTemp

d_xcCellWiseTemp:  [component (K)]  [all cells]  [d²]
  total size:  K × basisOverlapSize   elements
  basisOverlapSize = Σ_{c=0}^{C-1} d_c²
  cellOffset_c     = Σ_{c'=0}^{c-1} d_{c'}²   (cumulative dofs² before cell c)

Per component k, per cell c:  col-major d_c × d_c block
  element (row i, col j):
    flat = k·basisOverlapSize + cellOffset_c + j·d_c + i
    i,j ∈ 0..d_c-1

This is the raw output of computeFEMatrices — K independent col-major d×d matrices stacked component-first, then cell-interleaved within each component's slice.

After copyIntoBlock — cellWiseStorage

cellWiseStorage:  [all cells]  [(S·d)²]
  total size:  C × (S·d)²   elements   (d uniform for clarity)
  cellDstOffset_c = c·(S·d)²   (each cell occupies exactly (S·d)² elements)

Per cell c:  col-major (S·d) × (S·d) block
  element (row, col):
    flat = cellDstOffset_c + col·(S·d) + row
    row, col ∈ 0..S·d-1

Collinear (K=S=2) — copyIntoBlockSpinBlocked, component k → diagonal block (k,k)

Pure lda-change: src columns are length-d; dst columns are length-S·d but only d entries written (the spin-k sub-column). No element reordering. Full GPU write coalescing.

For cell c, column j ∈ 0..d-1:  (copy d contiguous elements)

  src start:  d_xcCellWiseTemp + k·basisOverlapSize + cellOffset_c + j·d_c
              (element (row=0, col=j) of component k, cell c)

  dst start:  cellWiseStorage + cellDstOffset_c + (k·d+j)·(S·d) + k·d
              (row k·d, col k·d+j of the (S·d)×(S·d) block — top-left of diagonal block k)

  copy length:  d   (rows i=0..d-1 of column j, block-k row offset = k·d)
  src lda = d,  dst lda = S·d

Off-diagonal blocks (row spin ≠ col spin) remain zero from cellWiseStorage zero-init.

Non-collinear (K=S²=4) — copyIntoBlockDOFOuter, component k → block (s_in=k/S, s_out=k%S)

Element scatter: each source element maps to a non-adjacent destination (stride S in dst). 50% GPU write coalescing for S=2.

For cell c, column j ∈ 0..d-1, row i ∈ 0..d-1:  (one scalar write per element)

  src:  d_xcCellWiseTemp + k·basisOverlapSize + cellOffset_c + j·d_c + i
        (element (row i, col j) of component k, cell c)

  dst:  cellWiseStorage + cellDstOffset_c + (j·S + s_out)·(S·d) + i·S + s_in
        (col = j·S+s_out,  row = i·S+s_in  of the (S·d)×(S·d) DOF-outer block)

  dst stride between consecutive i: S   (50% coalescing)

Unpolarized (K=S=1) — copyIntoBlockSpinBlocked with k=0, s=0

S=1 → (S·d)² = d²; cellWiseStorage has the same shape as the d_xcCellWiseTemp k=0 slice. copyIntoBlockSpinBlocked places the d×d block at (0,0) with lda_dst = 1·d = d — src and dst are byte-for-byte identical; the copy is a no-op or a plain memcpy.

Summary

Mode K d_xcCellWiseTemp shape Copy kernel cellWiseStorage inner layout Write coalescing
Unpolarized 1 [1][cell][d²] SpinBlocked(k=0, s=0) col-major d×d (S=1) 100%
Collinear 2 [2][cell][d²] SpinBlocked(k→diag k) col-major (S·d)² spin-blocked 100%
Non-collinear 4 [4][cell][d²] DOFOuter(k→(k/S, k%S)) col-major (S·d)² DOF-outer 50%

5. FECellWiseDataOperations — typed overloads replace raw pointer API

File modified:
  src/basis/FECellWiseDataOperations.h / .cpp / DeviceKernels

Gather API — typed overloads, raw pointer implementations removed

copyFieldToCellWiseData and addCellWiseDataToFieldData are overloaded on the multivector type. The type encodes gather strategy at compile time — no runtime branching, no extra spin parameters, no spinAwareCellLocalIds array.

// ── Collinear (MultiVectorProductSpaceBlocked) ──────────────────────────────
// Spatial-flat gather: numComponents=S*N, spatial node indices, totalCellDofs=Σd_k
// Output BLAS: S*N × d_k per cell
// Spin-s block: ptr = xCell + s*N,  lda = S*N  (valid BLAS strided submatrix, no copy)
static void copyFieldToCellWiseData(
  const MultiVectorProductSpaceBlocked<VT, MS>&  X,
  const size_type*                                cellLocalIdsStartPtr,
  const size_type                                 totalCellDofs,        // Σd_k
  VT*                                             cellWiseStorage);

// ── Unpolarized (S=1) + Non-collinear (S=2) (MultiVectorProductSpace) ───────
// Spin-aware gather: numComponents=N, virtual DOF index i*S+s computed inline
// cellLocalIdsStartPtr: spatial indices unchanged; totalCellDofs=Σd_k (not S*Σd_k)
// Output BLAS: N × S*d_k per cell
// S=1 (unpolarized): virtualId=spatialId*1+0=spatialId — degenerates to existing path
// No spinAwareCellLocalIds array — virtualId = spatialId * S + s computed per thread
static void copyFieldToCellWiseData(
  const MultiVectorProductSpace<VT, MS>&  X,
  const size_type*                         cellLocalIdsStartPtr,
  const size_type                          totalCellDofs,
  VT*                                      cellWiseStorage);

Overload resolution is automatic: MultiVectorProductSpaceBlocked IS-A MultiVectorProductSpace, so the compiler picks the most-derived match — collinear gets the Blocked overload, unpolarized and non-collinear get the ProductSpace overload. No MultiVector base overload needed — all callers hold a typed product-space object.

Symmetric addCellWiseDataToFieldData overloads follow the same two signatures.

Raw pointer overloads removed from the ksdft-facing public API. Every ksdft caller (in computeAxCellWise / computeAxCellWiseOptimized) passes a typed product-space object.

Exception — FEBasisOperations::interpolate: interpolate takes const MultiVector<VT,MS>& X (base type, spin-agnostic). It calls copyFieldToCellWiseData with X.data() and numComponents = X.getNumberComponents() = S*N — a purely spatial gather treating all S*N components as independent. For this case the existing raw pointer overload is retained as an internal utility (not the ksdft hot path). interpolate itself is unchanged.

Internal kernel logic per overload

Overload numComponents DOF loop iterations Address formula
MultiVectorProductSpaceBlocked S*N Σd_k data[spatialId * S*N + comp]
MultiVectorProductSpace (S=1 or S=2) N S * Σd_k (inner loop over s) data[(spatialId*S + s) * N + n]

The non-collinear kernel loops s = 0..S-1 and writes rows s*d_k...(s+1)*d_k - 1 of the output. S = X.numSpaces() is extracted from the multivector object — no separate parameter.

computeAxCellWise — single private method in KohnShamOperatorContextFE, one if/else on d_spinMode

computeAxCellWise is a private method of KohnShamOperatorContextFE. It reads d_spinMode directly — no spin parameter at the call site, no named forks. FECellWiseDataOperations provides only the gather/scatter primitives.

Collinear is the only special case — it exploits block-diagonal H to avoid wasted FLOPs. Unpolarized and non-collinear share the same else-branch: spin-aware gather with S=1 degenerates exactly to the existing spatial gather (virtualId = spatialId*1+0 = spatialId, H_cell is d_k × d_k, GEMM is identical to the existing unpolarized path).

// KohnShamOperatorContextFE::apply
computeAxCellWise(X, Y);   // no spin params — d_spinMode is a member

// KohnShamOperatorContextFE::computeAxCellWise (private)
void computeAxCellWise(const MultiVector<VT,MS>& X, MultiVector<VT,MS>& Y)
{
  if (d_spinMode == SpinMode::Collinear) {
    // spatial-flat gather: numComponents=S*N → xCell BLAS S*N × d_k
    // ONE gemmStridedVarBatched(S*numCells): exploit block-diagonal H, no wasted FLOPs
  }
  else {
    // spin-aware gather: numComponents=N, virtualId=spatialId*S+s → xCell BLAS N × S*d_k
    // ONE gemmStridedVarBatched(numCells): full S*d_k × S*d_k H
    // S=1 unpolarized: degenerates identically to existing path
    // S=2 non-collinear: full off-diagonal H blocks
  }
}

else-branch (Unpolarized S=1 and NonCollinear S=2) — ONE gemmStridedVarBatched(numCells):

gemmStridedVarBatched(batchCount = numCells,
  m[k] = N,     n[k] = S*d_k,  k[k] = S*d_k,
  A[k] = xCell_k,   lda[k] = N,
  B[k] = H_cell_k,  ldb[k] = S*d_k,
  C[k] = yCell_k,   ldc[k] = N,
  alpha=1, beta=0)

S=1: S*d_k = d_k, H_cell is d_k × d_k — identical to existing unpolarized GEMM. S=2: S*d_k = 2*d_k, full 2*d_k × 2*d_k H — FLOPs: 8·N·d_k²

Collinear branch — ONE gemmStridedVarBatched(S×numCells):

gemmStridedVarBatched(batchCount = S * numCells,
  m[b] = N,    n[b] = d_k,   k[b] = d_k,
  A[b] = xCell_flat_k + s*N,          lda[b] = S*N,
  B[b] = H_cell_k + s*d_k*(S*d_k+1),  ldb[b] = S*d_k,
  C[b] = yCell_flat_k + s*N,          ldc[b] = S*N,
  alpha=1, beta=0)

FLOPs per cell k: 4·N·d_k² (S=2) — diagonal H blocks only, no wasted FLOPs


5a. computeAxCellWiseOptimized — Non-Local PSP with Spin

computeAxCellWiseOptimized is a separate private method of KohnShamOperatorContextFE with its own if/else on d_spinMode — it does not call computeAxCellWise. Both routes share the same dispatch pattern but differ structurally: the optimized route interleaves non-local PSP steps between the gather and the H_cell GEMM, so the gather cannot be owned by the GEMM step.

The spatial-flat gather (step 1) is always done first — non-local PSP requires it regardless of spin mode. The H_cell GEMM (step 5) then has its own if/else.

Full flow in computeAxCellWiseOptimized

1. spatialFlatGather(X.data(), S*N, cellLocalIds, totalCellDofs, xCellValues_flat)
       → xCellValues_flat  BLAS S*N × d_k
       [always spatial-flat: C is spatial, S*N cols treated independently]

2. applyCconjtransOnX(cellRange, xCellValues_flat)
       → accumulates d_CX  (S*N × numProj)          [no change]

   [loop over cell blocks complete]

3. applyAllReduceOnCconjtransX()                    [no change]
4. applyVOnCconjtransX()                            [no change for non-SOC]

   [second cell loop]

5. H_cell GEMM — own if/else on d_spinMode:

   if (d_spinMode == SpinMode::Collinear):
     reuse xCellValues_flat from step 1 (lda=S*N offset per spin block)
     ONE gemmStridedVarBatched(S×numCells); m=N, n=d_k, k=d_k

   else (Unpolarized / NonCollinear):
     NonCollinear: additional spinAwareGather → xCellValues_spin BLAS N × S*d_k
     Unpolarized:  S=1 → spatial-flat = spin-aware; reuse xCellValues_flat directly
     ONE gemmStridedVarBatched(numCells); m=N, n=S*d_k, k=S*d_k

6. applyCOnVCconjtransX(cellRange, yCellValues_flat)
       → adds C*V*d_CX to yCellValues_flat          [no change]

7. spatialFlatScatter(yCellValues_flat, S*N, cellLocalIds, totalCellDofs, Y.data())
       [no change to kernel]

Why AtomCenterNonLocalOpContextFE requires zero changes (non-SOC)

cellWiseGEMM always uses mSizes = numVecs. The projector C_{p,j} is purely spatial — no spin index. Setting numVecs = S*N makes it treat all S*N spin-orbital columns as independent vectors contracted against the same C. Spin channels never couple through C or the diagonal V:

d_CX[p, s*N+n]  = Σ_j C_{p,j} * X[DOF_j, s, n]   // correct for all (s,n) simultaneously
V * d_CX         →  scalar multiply per projector    // same V for all s (non-SOC)
Y[DOF_j, s, n] += Σ_p C_{p,j} * V_p * d_CX[p, s*N+n]  // correct scatter

Only change in computeAxCellWiseOptimized: call reinitCX(S*N) instead of reinitCX(numVecs) (where numVecs was previously N).

Scratch buffer sizing

xCellValues_flat:  cellBlockSize × S*N × d_max    // same as current xCellValues with numVecs=S*N
yCellValues_flat:  cellBlockSize × S*N × d_max    // same as current yCellValues with numVecs=S*N

No additional buffers. The two existing scratch buffers are sufficient — the H_cell GEMM writes its result directly into yCellValues_flat via strided submatrix access.

SOC extension (future)

Only applyVOnCconjtransX needs modification. d_CX is restructured as S × N × numProj and V becomes a S × S matrix per projector:

d_CX_out[s, n, p] = Σ_{s'} V_{s, s', p} * d_CX[s', n, p]

Everything else (applyCconjtransOnX, applyCOnVCconjtransX, gather/scatter) is unchanged even for SOC.


6. How FEBasisOperations Changes

6.1 interpolateno change, spin-agnostic

Basis functions φ_j(r) carry no spin index — interpolate is a purely spatial operation. copyFieldToCellWiseData is called with numComponents = X.getNumberComponents() = S*N, treating all S*N orbital columns as independent. No dynamic_cast, no spin knowledge. QuadratureValuesContainer allocated with Nflattened = S*N components at call site. Works identically for unpolarized, collinear, and non-collinear — zero changes.

6.2 computeFEMatrices — API unchanged, implementation extended (see §4 / §4a)

No new public overload. QuadratureValuesContainer already carries K components; passing a K-component V_xc QVC produces K dofs×dofs output matrices in one call. Each Hamiltonian component's getLocal() block-copies those K results into (S·dofs)². Implementation change: K>1 throw removed; output resized to K × basisOverlapSize; K loop added inside cell block loop (Phase 11).


6.3 DensityCalculator and RDM1FE — spin density extension

DensityCalculator directly outputs ρ_total and M (§0.5 convention) via a single extended function. No separate ρ↑/ρ↓ stage. One unified kernel handles all spin modes.

Symbol conventions for this section

ncomp = number of density components = 1 (unpolarized), 2 (collinear), 4 (non-collinear). N = numOrbitals (numVectorsPerSpace), S = numSpaces. psiBatchQuad layout: [S*N spin-orbital columns][numQuadInBlock quad points] (i.e. spin-orbital index is the batch index; existing layout, unchanged). rhoBatch layout: [numQuadInBlock quad points][ncomp]new: ncomp fastest.


DensityCalculatorKernels.h — single unified kernel

Replace the current single-output computeRhoInBatch with one that outputs ncomp values per quad point:

template <typename ValueType, typename RealType, utils::MemorySpace memorySpace>
class DensityCalculatorKernels
{
public:
  static void
  computeRhoInBatch(
    size_type                                        batchSize,    // N (orbitals per spin in this batch)
    size_type                                        numSpaces,    // S
    std::pair<size_type,size_type>                   cellRange,
    const RealType*                                  occupationInBatch,  // S·batchSize values: occupationInBatch[s·batchSize + n] (unpolarized: batchSize values, s=0 only)
    ValueType*                                       psiBatchQuad,       // [S*N][numQuadInBlock]
    RealType*                                        scratchBatchQuad,   // [S*N][numQuadInBlock], existing modPsiSq scratch
    std::shared_ptr<const QuadratureRuleContainer>   quadRuleContainer,
    RealType*                                        rhoBatch,           // [numQuadInBlock][ncomp]
    SpinMode                                         spinMode,
    LinAlgOpContext<memorySpace>&                    linAlgOpContext);
};

Per quad point q, per orbital n in batch — kernel accumulation:

Unpolarized (S=1, ncomp=1):
  ψ = psiBatchQuad[n][q]
  rhoBatch[q][0] += f[n] · |ψ|²

Collinear (S=2, ncomp=2):
  ψ₀ = psiBatchQuad[n    ][q]   ← spin-0 of orbital n
  ψ₁ = psiBatchQuad[N + n][q]   ← spin-1 of orbital n
  f₀ = occupation[n],   f₁ = occupation[N + n]   (different per spin)
  rhoBatch[q][0] += f₀·|ψ₀|² + f₁·|ψ₁|²    ← ρ_total
  rhoBatch[q][1] += f₀·|ψ₀|² − f₁·|ψ₁|²    ← Mz

Non-collinear (S=2, ncomp=4):
  ψ₀ = psiBatchQuad[n    ][q],   ψ₁ = psiBatchQuad[N + n][q]
  f  = occupation[n]              (f[n] == f[N+n])
  rhoBatch[q][0] += f·(|ψ₀|² + |ψ₁|²)              ← ρ_total
  rhoBatch[q][1] += f·2·Re[conj(ψ₀)·ψ₁]            ← Mx
  rhoBatch[q][2] += f·(−2)·Im[conj(ψ₀)·ψ₁]         ← My
  rhoBatch[q][3] += f·(|ψ₀|² − |ψ₁|²)              ← Mz

scratchBatchQuad (existing modPsiSqBatchQuad) is reused for intermediate |ψ|² and cross-term intermediates — no additional scratch pointer. occupation for collinear has size S*N; kernel reads occupation[n] and occupation[N+n]. occupation for non-collinear has size S*N but f[n] == f[N+n]; kernel reads occupation[n]. Unpolarized: occupation size N; kernel reads occupation[n] (existing behaviour).

DEVICE partial specialisation follows the same interface — one specialisation covers all modes.


DensityCalculator.h — extended computeRho

Signature change (single function, no overload):

// Before:
void computeRho(const std::vector<RealType>& occupation,
                const MultiVector<...>& waveFunc,
                QVC& rho);   // 1 component per quad point

// After:
void computeRho(const std::vector<RealType>& occupation,   // size S*N
                const MultiVector<...>& waveFunc,           // S*N components
                QVC&                    rho,                // ncomp components per quad point
                SpinMode                spinMode);

Output rho has ncomp components per quad point — flat index q*ncomp + comp. The caller (RDM1FE) pre-allocates rho with ncomp components.

Private scratch members that change:

// Before:
QVC* d_rhoMemspace;                        // 1 component per quad
MemoryStorage<RealType, memorySpace>* d_rhoBatch;   // flat per-batch accumulator

// After:
QVC* d_rhoMemspace;                        // ncomp components per quad (resized on demand)
MemoryStorage<RealType, memorySpace>* d_rhoBatch;   // [numQuadInBlock × ncomp] per batch

d_rhoMemspace and d_rhoBatch are resized to ncomp × numQuadPoints on first call with a given SpinMode. d_modPsiSqBatchQuad (existing scratch) is reused for the cross-term intermediates in non-collinear — size S*N × numQuadInBlock covers both.

The outer cell-block loop is unchanged. The psi-batch loop changes for spin cases:

  • Unpolarized (S=1): unchanged — loop steps 0..N in increments of waveFuncBatchSize; one stridedBlockCopy per batch; psiBatchInterim has waveFuncBatchSize columns.
  • Collinear / Non-collinear (S=2): the batch unit is N orbitals per spin (= waveFuncBatchSize / S). Per iteration, psiBatchInterim (S·N columns) is populated by two stridedBlockCopy calls — one for spin-0 columns [psiStartId, psiEndId) into psiBatchInterim columns 0..N-1, and one for spin-1 columns [N+psiStartId, N+psiEndId) into psiBatchInterim columns N..2N-1. The loop then runs from psiStartId = 0 to N (not S*N) so each iteration processes the same N orbital indices for both spins simultaneously. computeRhoInBatch is called once per psi-batch with batchSize = N and numSpaces = S.

The axpy accumulation into d_rhoMemspace operates on ncomp × numQuadInBlock elements instead of numQuadInBlock — no further structural change.


RDM1FE::getDescriptors — adapt output to densVal vector format

RDM1FE currently allocates densVal as AttrStorage(ncomp, QVC(quadRuleContainer, 1, 0)) — a vector of ncomp QVCs each with 1 component per quad point.

After calling computeRho (which returns a single ncomp-component QVC), copy each component into densVal[k]:

QVC rhoCombined(quadRuleContainer, ncomp, 0.0);
d_densCalc->computeRho(occ, *this->d_ksOrbs, rhoCombined, spinMode);

for (size_type k = 0; k < ncomp; ++k)
  // copy component k of rhoCombined into densVal[k] (1-component QVC)
  std::copy(rhoCombined.begin() + k,
            rhoCombined.end(),   /* stride ncomp */
            densVal[k].begin());

This keeps the existing densVal vector-of-QVCs interface that ExchangeCorrelationFE already consumes — no change downstream.

GGA: same extension — computeRho also fills gradient components when requested, outputting a 3·ncomp-component QVC (3 Cartesian × ncomp density observables).


Summary of file changes

File Change
DensityCalculatorKernels.h computeRhoInBatch: add numSpaces, SpinMode params; rhoBatch becomes [numQuadInBlock][ncomp]
DensityCalculator.h computeRho: QVC& rhoQVC& rho with ncomp + SpinMode; d_rhoMemspace and d_rhoBatch resized to ncomp
DensityCalculator.t.cpp Single psi-batch loop unchanged; computeRhoInBatch called with SpinMode; axpy on ncomp×numQuad
RDM1FE.t.cpp getDescriptors: pass spinMode; split ncomp-QVC output into densVal vector

7. Rayleigh-Ritz for Product Spaces

ChebyshevFilteredEigenSolver is not modified. It calls ChebyshevFilter then RayleighRitzEigenSolver::solve(MultiVector&) — the existing signature. Because the static type at that call site is MultiVector&, compile-time overload resolution cannot select derived-type overloads. Dispatch must therefore happen at runtime via dynamic_cast inside RayleighRitzEigenSolver::solve.

Call chain

KohnShamEigenSolver::solve(H, BInv, X, eigenValues, ...)
  → ChebyshevFilteredEigenSolver::solve(H, BInv, X, ...)   // unchanged, X seen as MultiVector&
      → ChebyshevFilter(H, BInv, X, ...)                   // unchanged, Nflattened cols
      → RayleighRitzEigenSolver::solve(H, eigenValues, X, ...)
          // same MultiVector& signature; dynamic_cast inside selects path:
          // MultiVectorProductSpaceBlocked* → blocked/collinear ELPA loop
          // MultiVectorProductSpace*        → coupled (unpolarized S=1 + non-collinear S=2)

RayleighRitzEigenSolver::solvedynamic_cast dispatch, no new public overloads

The existing solve(const OpContext&, std::vector<RealType>&, MultiVector<VT,MS>&, bool) signature is kept unchanged. No private helper methods (d_solveBlocked, d_solveCoupled) are needed — after the dynamic_cast, the correctly-typed reference is passed directly to MultiVectorOps static functions, which pick the right specialisation at compile time. The branches are inline blocks inside solve:

EigenSolverError solve(const OpContext& A, std::vector<RealType>& eigenValues,
                       MultiVector<VT,MS>& X, bool computeEigenVectors)
{
    // Check most-derived first (Blocked ⊂ ProductSpace ⊂ MultiVector)
    if (auto* Xb = dynamic_cast<MultiVectorProductSpaceBlocked<VT,MS>*>(&X)) {
        // compiler picks Blocked specialisation — ELPA loop over s=0..S-1
        std::vector<ScaLAPACKMatrix<VT>> Ps(Xb->numSpaces(), ...);
        MultiVectorOps::project(A, *Xb, Ps, *d_elpaScala,
                                *d_XinBatch, *d_XoutBatch, d_scratchSBlock,
                                d_eigenVecBatchSize);
        for (size_type s = 0; s < Xb->numSpaces(); ++s) {
            elpa_eigenvectors(..., Ps[s]..., eigenValues_s...);
            // fill eigenValues[s*N + n]
        }
        if (computeEigenVectors)
            MultiVectorOps::rotate(*Xb, Qs, *d_elpaScala,
                                   d_scratchRotBlock, d_scratchRotMatBlock);
        return success;
    }

    // Blocked check above covers collinear; all remaining callers are ProductSpace
    // (unpolarized S=1 degenerates correctly — single ELPA, N×N projected matrix)
    auto* Xp = dynamic_cast<MultiVectorProductSpace<VT,MS>*>(&X);
    // (assert Xp != nullptr — no plain MultiVector callers exist in the new design)
    ScaLAPACKMatrix<VT> P(...);
    MultiVectorOps::project(A, *Xp, P, *d_elpaScala,
                            *d_XinBatch, *d_XoutBatch, d_scratchSBlock,
                            d_eigenVecBatchSize);
    elpa_eigenvectors(..., P..., eigenValues...);
    if (computeEigenVectors)
        MultiVectorOps::rotate(*Xp, Q, *d_elpaScala,
                               d_scratchRotBlock, d_scratchRotMatBlock);
    return success;
}

dynamic_cast cost is negligible — called once per SCF RayleighRitz step.

No copy for spin-s block access (collinear): Inside MultiVectorOps::project (Blocked specialisation), spin-s columns are at positions [s*N, (s+1)*N). computeXTransOpX receives a raw pointer with explicit lda:

ValueType* Xs_ptr = Xb.data() + s * Xb.numVectorsPerSpace();
size_type  lda    = Xb.numVectors();   // = S*N, true row stride — NOT N
// computeXTransOpX(Xs_ptr, M, N, lda, ...) — no copy into contiguous temp

8. MultiVectorOps src/linearAlgebra/

Motivation: consolidate existing scattered functions

project, rotate, and orthonormalize are not written from scratch. They already exist as private / internal functions, split across three files:

Existing function Lives in What it does
computeXTransOpX RayleighRitzEigenSolver.t.cpp project: X^H Op X, batched
fillParallelOverlapMatrix ElpaScalapackOperations.t.cpp project overlap (no Op)
subspaceRotation ElpaScalapackOperations.t.cpp rotate: X ← X·Q
orthonormalization body OrthonormalizationFunctions.t.cpp orthonormalize

Move these into MultiVectorOps as the canonical location. RayleighRitzEigenSolver, OrthonormalizationFunctions, and the new blocked overloads all call the same MultiVectorOps implementations — no logic duplication.

The one required modification to subspaceRotation and computeXTransOpX: add an explicit lda parameter (defaulting to N). Currently both hardcode lda = N (e.g., X + idof * N in subspaceRotation; numVec as the GEMM lda in computeXTransOpX). With an explicit lda, the blocked overload passes lda = S*N and accesses only the spin-s columns — no copy, same kernel.

Files to add:
  src/linearAlgebra/MultiVectorOps.h
  src/linearAlgebra/MultiVectorOps.t.cpp

Files modified (existing logic moved, not rewritten):
  src/linearAlgebra/ElpaScalapackOperations.t.cpp
    ← subspaceRotation: add lda parameter (default N); body unchanged
    ← fillParallelOverlapMatrix: add lda parameter (default N); body unchanged
  src/linearAlgebra/RayleighRitzEigenSolver.t.cpp
    ← computeXTransOpX: add lda parameter (default numVec); body unchanged
    ← solve(): replace internal calls with MultiVectorOps::project / rotate
  src/linearAlgebra/OrthonormalizationFunctions.t.cpp
    ← body moved to MultiVectorOps::orthonormalize; thin wrapper remains

MultiVectorOps is a stateless class of static functions. All scratch is owned by the caller (RayleighRitzEigenSolver already manages scratch via MultivectorScratch) and passed in explicitly — no internal allocation, consistent with the blasLapack pattern. Dispatch is compile-time via template specialisation on the multivector type. dynamic_cast is NOT inside MultiVectorOps — it is done once in RayleighRitzEigenSolver::solve to unlock the type; after that, the correctly-typed reference is passed to the static functions.

class MultiVectorOps {
public:
    // ---------- project ----------

    // Blocked specialisation (collinear): S independent N×N projected matrices.
    // Loop s=0..S-1: Xs_ptr = X.data()+s*N, lda=S*N → computeXTransOpX, no copy.
    template <typename VTO, typename VTOp, utils::MemorySpace MS>
    static void project(
      const OperatorContext<VTO,VTOp,MS>&                              Op,
      MultiVectorProductSpaceBlocked<VTOp,MS>&                         X,
      std::vector<ScaLAPACKMatrix<scalar_type<VTO,VTOp>>>&             Ps,   // S N×N matrices
      const ElpaScalapackManager&                                      elpa,
      MultiVector<VTOp,MS>&                                            scratchXin,
      MultiVector<VTOp,MS>&                                            scratchXout,
      MemoryStorage<scalar_type<VTO,VTOp>,MS>&                         scratchSBlock,
      size_type                                                         batchSize = 0);

    // Coupled specialisation (non-collinear / unpolarized): single S*N×S*N matrix.
    // computeXTransOpX(X.data(), M, S*N, lda=S*N, Op, P) — no copy.
    template <typename VTO, typename VTOp, utils::MemorySpace MS>
    static void project(
      const OperatorContext<VTO,VTOp,MS>&               Op,
      MultiVectorProductSpace<VTOp,MS>&                 X,
      ScaLAPACKMatrix<scalar_type<VTO,VTOp>>&           P,    // single S*N×S*N matrix
      const ElpaScalapackManager&                       elpa,
      MultiVector<VTOp,MS>&                             scratchXin,
      MultiVector<VTOp,MS>&                             scratchXout,
      MemoryStorage<scalar_type<VTO,VTOp>,MS>&          scratchSBlock,
      size_type                                         batchSize = 0);

    // ---------- rotate ----------

    // Blocked: loop s, subspaceRotation(X.data()+s*N, M, N, lda=S*N, Qs[s]).
    // Scratch temp is M×N only, reused across s — S× smaller than coupled.
    template <typename VT, utils::MemorySpace MS>
    static void rotate(
      MultiVectorProductSpaceBlocked<VT,MS>&            X,
      const std::vector<ScaLAPACKMatrix<VT>>&           Qs,   // S N×N matrices
      const ElpaScalapackManager&                       elpa,
      MemoryStorage<VT,MS>&                             scratchRotBlock,   // M×N temp
      MemoryStorage<VT,MS>&                             scratchRotMatBlock);

    // Coupled: subspaceRotation(X.data(), M, S*N, lda=S*N, Q).
    // One M×S*N scratch temp (unavoidable — in-place X=X*Q always needs output buffer).
    template <typename VT, utils::MemorySpace MS>
    static void rotate(
      MultiVectorProductSpace<VT,MS>&                   X,
      const ScaLAPACKMatrix<VT>&                        Q,
      const ElpaScalapackManager&                       elpa,
      MemoryStorage<VT,MS>&                             scratchRotBlock,   // M×S*N temp
      MemoryStorage<VT,MS>&                             scratchRotMatBlock);

    // ---------- orthonormalize ----------

    // Two specialisations (Blocked / ProductSpace). No base MultiVector overload —
    // all callers hold a typed product-space object.
    // Scratch: overlap ScaLAPACKMatrix + Cholesky temp — all passed in by caller.
    template <typename VT, utils::MemorySpace MS>
    static void orthonormalize(
      MultiVectorProductSpaceBlocked<VT,MS>&            X,
      const ElpaScalapackManager&                       elpa,
      /* scratch: */ ScaLAPACKMatrix<VT>&               scratchOverlap,
      MemoryStorage<VT,MS>&                             scratchSBlock);

    template <typename VT, utils::MemorySpace MS>
    static void orthonormalize(
      MultiVectorProductSpace<VT,MS>&                   X,
      const ElpaScalapackManager&                       elpa,
      ScaLAPACKMatrix<VT>&                              scratchOverlap,
      MemoryStorage<VT,MS>&                             scratchSBlock);
};

Call flow from RayleighRitzEigenSolver:

// One conditional cast (Blocked); ProductSpace cast is unconditional — no plain MultiVector callers.
if (auto* Xb = dynamic_cast<MultiVectorProductSpaceBlocked<VT,MS>*>(&X)) {
    // compiler picks Blocked specialisation at compile time
    MultiVectorOps::project(Op, *Xb, Ps, elpa,
                            d_scratchXin, d_scratchXout, d_scratchSBlock);
    MultiVectorOps::rotate (*Xb, Qs, elpa,
                            d_scratchRotBlock, d_scratchRotMatBlock);
    return success;
}
// All remaining callers are MultiVectorProductSpace (unpolarized S=1 or non-collinear S=2)
auto* Xp = static_cast<MultiVectorProductSpace<VT,MS>*>(&X);
MultiVectorOps::project(Op, *Xp, P, elpa,
                        d_scratchXin, d_scratchXout, d_scratchSBlock);
MultiVectorOps::rotate (*Xp, Q, elpa,
                        d_scratchRotBlock, d_scratchRotMatBlock);

RayleighRitzEigenSolver owns all scratch (as it already does via MultivectorScratch) and passes it through. MultiVectorOps is entirely stateless — no allocations.


8a. No-Copy Audit — All Data Movement Points

Every place where spin-block data is accessed is classified below. Goal: no intermediate contiguous copy of a spin-s block; pointer arithmetic + explicit lda throughout.

Location Copy needed? Reason / mechanism
MultiVectorOps::project blocked No Static specialisation; blocked loop uses Xs_ptr=data+s*N, lda=S*N — valid BLAS strided submatrix; scratch passed by caller
MultiVectorOps::project coupled (ProductSpace) No Full M×S*N buffer; S=1 degenerates to N×N; scratch passed by caller
MultiVectorOps::rotate coupled (ProductSpace) One M×S*N temp (unavoidable) In-place X=X*Q always needs output buffer; passed as scratchRotBlock
MultiVectorOps::rotate blocked One M×N temp, reused across s S× smaller than coupled; same scratchRotBlock reused each s iteration
Orthonormalization (both specialisations) No Blocked: Xs_ptr + lda trick; all scratch passed by caller
ChebyshevFilter scratch No eigenVectorBatchSize=S*N → batch IS the full spin-coupled multivector; stridedBlockCopy copies all S*N columns in one shot; H applied via d_spinMode switch in KohnShamOperatorContextFE::apply — no spin-block extraction, no type inspection of batch buffer
computeAxCellWise collinear branch No gemmStridedVarBatched reads X_s via ptr+s*N, lda=S*N — pointer arithmetic within the cell buffer
computeAxCellWise else-branch (unpolarized/non-collinear) No Single GEMM on full N×S*d_k cell buffer; S=1 degenerates to existing
Cell gather (vectorToCell) collinear Necessary layout transform M×S×N → S*N×d_k (BLAS); spatial-flat, copyFieldToCellWiseData(Blocked)
Cell gather (vectorToCell) non-collinear + unpolarized Necessary layout transform M×S×N → N×Sd_k (BLAS); spin-aware, copyFieldToCellWiseData(ProductSpace), virtualId=spatialIdS+s inline; S=1 degenerates to existing spatial gather
Cell scatter (cellToVector) Necessary layout transform Mirrors gather; typed addCellWiseDataToFieldData overload
getLocal() spin-independent (KineticFE etc.) No intermediate dofs×dofs result placed into diagonal block via strided block-copy: dst_ptr = H_cell + s*dofs*(S*dofs+1), ldb=S*dofs
getLocal() XC non-collinear No intermediate DOF-outer element scatter (stride S): element (i,j) of source block → H_cell + cellDstOffset + (j·S+s_out)·(S·dofs) + i·S+s_in; top-left of block (s_in,s_out) at H_cell + s_out·(S·dofs) + s_in (no dofs factor — DOF-outer, not spin-blocked)
distributeParentToChild/Child No Takes MultiVector&; blockSize=S*N; flat buffer — no spin-block extraction
interpolate No Works on full Nflattened=SN columns; each DOF's SN values contiguous
LanczosExtremeEigenSolver No Initial guess is MultiVectorProductSpace(S, N=1); Lanczos applies H to S-column multivector without spin-block extraction

Rule encoded in MultiVectorOps blocked overloads:

// Spin-s block: pointer into X without copy.
// lda = X.numVectors() (= S*N), NOT X.numVectorsPerSpace() (= N).
const VT* Xs = X.data() + s * X.numVectorsPerSpace();
size_type lda = X.numVectors();

Any future use of spin-s subblock data should follow this pattern.


9. KohnShamEigenSolver — product-space dispatch

KohnShamEigenSolver constructs the right multivector type based on SpinMode and constructs ChebyshevFilteredEigenSolver with eigenVectorBatchSize = S*N for spin cases. The type of X drives all downstream dispatch automatically.

// Unpolarized (S=1):
const size_type S = 1;
MultiVectorProductSpace<VT, MS> X(mpiPattern, ctx, /*numSpaces=*/S, N);
// batchSize = N for unpolarized — existing behaviour unchanged
auto chebyshevSolver = ChebyshevFilteredEigenSolver(..., /*eigenVectorBatchSize=*/N, ...);
chebyshevSolver.solve(H, BInv, X, eigenValues, ...);   // → RayleighRitz coupled overload

// NonCollinear (S=2):
const size_type S = 2;
MultiVectorProductSpace<VT, MS> X(mpiPattern, ctx, /*numSpaces=*/S, N);
// batchSize = S*N: full spin-coupled vector processed at once by ChebyshevFilter
auto chebyshevSolver = ChebyshevFilteredEigenSolver(..., /*eigenVectorBatchSize=*/S*N, ...);
chebyshevSolver.solve(H, BInv, X, eigenValues, ...);   // → RayleighRitz coupled overload

// Collinear (S=2):
const size_type S = 2;
MultiVectorProductSpaceBlocked<VT, MS> X(mpiPattern, ctx, /*numSpaces=*/S, N);
// batchSize = S*N: same reason — operator must see both spin channels to select H_ss block
auto chebyshevSolver = ChebyshevFilteredEigenSolver(..., /*eigenVectorBatchSize=*/S*N, ...);
chebyshevSolver.solve(H, BInv, X, eigenValues, ...);   // → RayleighRitz blocked overload (ELPA loop)

Why eigenVectorBatchSize = S*N is required for spin

ChebyshevFilter calls A.apply(X_batch, Y_batch) where X_batch is the batch buffer (d_XinBatch), a plain MultiVector allocated with eigenVectorBatchSize columns. stridedBlockCopy itself is mechanically correct — it copies the right columns. The problem is downstream:

  • Non-collinear: H has off-diagonal spin blocks coupling spin-0 and spin-1 columns. A batch with fewer than S*N columns is missing one spin channel → filtered vector is wrong.
  • Collinear: H is block-diagonal. d_XinBatch is a plain MultiVector — but KohnShamOperatorContextFE::apply dispatches via d_spinMode, not by inspecting X. With eigenVectorBatchSize = S*N the full spin-coupled vector is always passed, so the collinear path correctly applies both H_00 and H_11 blocks.

Setting eigenVectorBatchSize = S*N makes the "batch" exactly the full spin-coupled multivector — stridedBlockCopy copies all S*N columns in one shot (trivially correct), and A.apply receives a vector where the spin structure is preserved.

ChebyshevFilteredEigenSolver is not modified. Only KohnShamEigenSolver changes the eigenVectorBatchSize argument passed at construction time.

No spin logic, no explicit ELPA calls, no MultiVectorOps calls in KohnShamEigenSolver — everything is driven by the multivector type.


10. Summary Table

Aspect Unpolarized Collinear NonCollinear
MultiVector type MultiVectorProductSpace (S=1) MultiVectorProductSpaceBlocked (S=2) MultiVectorProductSpace (S=2)
H buffer shape numCells × dofs² numCells × (S·dofs)² (off-diag zero) numCells × (S·dofs)² (full)
computeFEMatrices calls per component 1 (K=1 QVC) 1 (K=2 QVC) 1 (K=4 QVC)
Cell gather format (BLAS) N × d_k (spin-aware S=1 degenerate) S*N × d_k (spatial-flat) N × S*d_k (spin-aware, virtual DOF i*S+s)
Cell GEMM path else-branch of computeAxCellWise (numCells batched) if-branch of computeAxCellWise (S×numCells batched) else-branch of computeAxCellWise (numCells batched)
FLOPs per cell (S=2) N·dofs² (S=1 degenerate) 2·N·dofs² (diag blocks only) 4·N·dofs² (full H)
Dispatch in apply d_spinMode if/else (else-branch) d_spinMode if/else (if-branch) d_spinMode if/else (else-branch)
Dispatch in RayleighRitz dynamic_cast on eigenVectors dynamic_cast on eigenVectors dynamic_cast on eigenVectors
interpolate spin-agnostic, numComponents=S*N spin-agnostic, numComponents=S*N spin-agnostic, numComponents=S*N
ELPA calls per SCF 1 S (loop) 1

11. Complete File Inventory

New files

src/linearAlgebra/
  MultiVectorProductSpace.h
  MultiVectorProductSpace.t.cpp
  MultiVectorProductSpaceBlocked.h
  MultiVectorProductSpaceBlocked.t.cpp
  MultiVectorOps.h
  MultiVectorOps.t.cpp

No MultiVectorOpsKernels needed — MultiVectorOps is a thin static wrapper over existing functions (computeXTransOpX, subspaceRotation, existing blasLapack routines) with only the lda parameter addition. New kernels are in FECellWiseDataOperations (gather/scatter) and Hamiltonian components (getLocal block-copy), not in linearAlgebra/.

Files modified (additions only, existing logic untouched)

src/basis/FEBasisOperations.h / .t.cpp
  ← computeFEMatrices: unchanged; QVC numComponents=K already encodes spin blocks;
    each Hamiltonian component's getLocal() passes K-component QVC and block-copies
    the K dofs×dofs outputs into (S·dofs)² (no new overload needed)

src/basis/FECellWiseDataOperations.h / .cpp / DeviceKernels
  ← raw pointer overloads of copyFieldToCellWiseData / addCellWiseDataToFieldData REMOVED
  ← two typed overloads added as clean public API:
      MultiVectorProductSpaceBlocked (spatial-flat, numComponents=S*N),
      MultiVectorProductSpace (spin-aware, numComponents=N, virtualId=spatialId*S+s inline;
        S=1 unpolarized degenerates to existing spatial gather)
  ← no base MultiVector overload — all callers hold a typed product-space object
  ← gather/scatter primitives only (typed public overloads + internal spatialFlatGather /
    spinAwareGather helpers); no computeAxCellWise* functions here

src/ksdft/KohnShamOperatorContextFE.h / .t.cpp
  ← SpinMode enum; stores d_spinMode, d_S, d_N at construction
  ← unified numCells×(S·dofs)² H buffer
  ← reinit: zero-fills H buffer then axpy-accumulates fully-assembled numCells×(S·dofs)²
    buffers returned by each component's getLocal() (block-copy lives inside getLocal())
  ← apply calls single private computeAxCellWise(X, Y); one if/else on d_spinMode:
      if Collinear   → spatial-flat gather + S×numCells GEMM (diag H blocks)
      else           → spin-aware gather + numCells GEMM (full H; S=1 degenerates to existing)
  ← no named spin forks, no isBlockDiagonal parameter anywhere

src/ksdft/KohnShamEigenSolver.h / .t.cpp
  ← constructs typed product-space multivector per SpinMode:
      unpolarized → MultiVectorProductSpace(S=1) → RayleighRitz ProductSpace overload (single ELPA, degenerates to N×N)
      collinear   → MultiVectorProductSpaceBlocked(S=2) → RayleighRitz Blocked overload (ELPA loop)
      non-collinear → MultiVectorProductSpace(S=2) → RayleighRitz ProductSpace overload (single ELPA, S*N×S*N)
  ← sets eigenVectorBatchSize=S*N for spin cases (=N for unpolarized)

src/ksdft/KohnShamDFT.h / .t.cpp
  ← SpinMode propagation

src/ksdft/KineticFE.h / .t.cpp
  ← constructed with SpinMode; getLocal() returns numCells×(S·dofs)²
  ← computes dofs×dofs once (cached), copies into diagonal blocks; off-diagonal zero

src/ksdft/ElectrostaticLocalFE.h / .t.cpp
src/ksdft/ElectrostaticExcFE.h / .t.cpp
  ← same pattern as KineticFE (spin-independent; single compute, diagonal copy)

src/ksdft/ExchangeCorrelationFE.h / .t.cpp
  ← constructed with SpinMode; getLocal() returns numCells×(S·dofs)²
  ← Collinear: S diagonal blocks from per-spin XC potentials
  ← NonCollinear: all S² blocks from full spin-matrix XC potential
  ← per-space density inputs to XC functional

src/ksdft/DensityCalculatorKernels.h
  ← computeRhoInBatch: add numSpaces (S) + SpinMode params; rhoBatch becomes [numQuadInBlock][ncomp];
    kernel accumulates ρ_total/Mz (collinear) or ρ_total/Mx/My/Mz (non-collinear) directly;
    DEVICE partial specialisation follows same interface

src/ksdft/DensityCalculator.h / .t.cpp
  ← computeRho: add SpinMode param; QVC& rho gains ncomp components per quad;
    d_rhoMemspace and d_rhoBatch resized to ncomp×numQuad;
    psi-batch loop: two stridedBlockCopy calls for spin cases (§6.3)

src/ksdft/RDM1FE.h / .t.cpp
  ← getDescriptors: allocate rhoCombined QVC with ncomp components; call computeRho once;
    stride-ncomp copy of component k into densVal[k] to preserve vector-of-1-QVC interface

Explicitly unchanged

src/linearAlgebra/MultiVector.*            (base class, not modified)
src/linearAlgebra/ChebyshevFilteredEigenSolver.*   (unchanged; caller controls batchSize)
src/linearAlgebra/ChebyshevFilter.*
src/basis/FEBasisOperations::interpolate
src/basis/EFEConstraintsLocalDealii.*      (see §11.1 below)
src/basis/CFEConstraintsLocalDealii.*
src/basis/ConstraintsInternal.*

§11.1 — FE Constraints and distributeParentToChild / distributeChildToParent

EFEConstraintsLocalDealii::distributeParentToChild and distributeChildToParent both have the signature:

void distributeParentToChild(
    linearAlgebra::MultiVector<ValueTypeBasisCoeff, memorySpace>& vectorData,
    size_type blockSize) const;

Since MultiVectorProductSpace IS-A MultiVector (public inheritance), the product-space multivectors bind to this parameter directly — no overload or template specialisation needed.

The only change is at the call site (inside KohnShamOperatorContextFE or wherever constraints are applied): pass blockSize = Nflattened = S * numVectorsPerSpace instead of the scalar numVectorsPerSpace.

Why this is correct:

In the M × S × N memory layout, all S*N values for DOF i sit contiguously at byte offset i * S * N * sizeof(ValueType). The ConstraintsInternal kernel processes each constrained DOF by copying blockSize contiguous values from the master DOF, weighted by the constraint coefficient. With blockSize = S*N it applies the same geometric constraint coefficient to every component at that DOF — which is physically correct because FE hanging-node / periodicity constraints are purely spatial.

Constrained DOF i → master DOFs j₀, j₁, … with coefficients c₀, c₁, …
For each master DOF jₖ:
  for sn in 0..S*N-1:           // blockSize = S*N iterations, handled inside kernel
    data[i*S*N + sn] += cₖ * data[jₖ*S*N + sn]

This degenerates to the existing scalar case when S=1 (same blockSize = numVectorsPerSpace).

Summary: no changes to EFEConstraintsLocalDealii, CFEConstraintsLocalDealii, or ConstraintsInternal. Only the caller changes blockSize from N to S*N.

Modified (minimal changes only, existing logic untouched)

src/linearAlgebra/LanczosExtremeEigenSolver.h / .t.cpp
  ← d_initialGuess: Vector → MultiVector
  ← constructor/reinit parameter: const Vector& → const MultiVector&
  ← existing Vector callers unaffected (Vector IS-A MultiVector, implicit upcast)
  ← spin callers pass MultiVectorProductSpace(mpiPattern, ctx, S, /*numVectorsPerSpace=*/1) as initial guess

src/linearAlgebra/RayleighRitzEigenSolver.h / .t.cpp
  ← existing solve(MultiVector&) extended: one if dynamic_cast to Blocked + unconditional
    ProductSpace cast (no plain MultiVector path); no new public overloads, no private helpers
  ← Blocked branch: MultiVectorOps::project/rotate Blocked overloads (ELPA loop)
  ← ProductSpace path: MultiVectorOps::project/rotate ProductSpace overloads (single ELPA;
    S=1 unpolarized degenerates to N×N projected matrix, identical to old scalar path)
  ← computeXTransOpX: add lda parameter (default = numVec); body unchanged

src/linearAlgebra/ElpaScalapackOperations.h / .t.cpp
  ← subspaceRotation: add lda parameter (default = N); body unchanged — lda replaces hardcoded N in X+idof*N
  ← fillParallelOverlapMatrix: add lda parameter (default = N); body unchanged

src/linearAlgebra/OrthonormalizationFunctions.h / .t.cpp
  ← body of orthonormalization moved to MultiVectorOps::orthonormalize (two overloads)
  ← thin wrapper remains calling MultiVectorOps::orthonormalize with lda = X.numVectors()
  ← no base MultiVector overload in MultiVectorOps::orthonormalize

12. Implementation Phases

Phase Work
1 MultiVectorProductSpace — derives from MultiVector; constructor, numSpaces(), numVectorsPerSpace().
2 MultiVectorProductSpaceBlocked — derives from MultiVectorProductSpace; inherits constructor via using.
3 MultiVectorOps: static class wrapping computeXTransOpX / subspaceRotation with lda param; two specialisations per function (Blocked / ProductSpace).
4 MultiVectorOps::project — coupled overload (single SN×SN) and blocked overload (S independent N×N).
5 MultiVectorOps::rotate — coupled overload (same Q) and blocked overload (S independent Q_s).
6 FECellWiseDataOperations: remove raw pointer overloads; add two typed overloads of copyFieldToCellWiseData / addCellWiseDataToFieldData (MultiVectorProductSpaceBlocked spatial-flat; MultiVectorProductSpace spin-aware with inline virtual DOF spatialId*S+s, S=1 degenerates to unpolarized). No base MultiVector overload.
7 KohnShamOperatorContextFESpinMode enum; store d_spinMode, d_S, d_N at construction; unified numCells×(S·dofs)² H buffer; reinit: zero H buffer + axpy over each component's getLocal() (block-copy into (S·dofs)² lives inside each component's getLocal()); private computeAxCellWise with one if/else on d_spinMode (collinear: spatial-flat gather + S×numCells GEMM; else: spin-aware gather + numCells GEMM, S=1 degenerates to existing); private computeAxCellWiseOptimized with own if/else on d_spinMode.
8 RayleighRitzEigenSolver — extend existing solve(MultiVector&) with two inline dynamic_cast branches; BlockedMultiVectorOps::project/rotate blocked overloads (ELPA loop); ProductSpace → coupled overloads (single ELPA).
9 LanczosExtremeEigenSolver — change d_initialGuess and constructor/reinit param from Vector to MultiVector.
10 DensityCalculatorKernels.h: add numSpaces (S) and SpinMode params to computeRhoInBatch; rhoBatch layout becomes [numQuadInBlock][ncomp]; kernel accumulates ρ_total/Mz (collinear) or ρ_total/Mx/My/Mz (non-collinear) directly — no ρ↑/ρ↓ stage. DEVICE partial specialisation follows same interface. DensityCalculator.h/.t.cpp: computeRho gains SpinMode param and QVC& rho gains ncomp components; d_rhoMemspace and d_rhoBatch resized to ncomp × numQuad; axpy accumulation operates on ncomp × numQuadInBlock elements; cell-block/psi-batch loop structure unchanged. RDM1FE.t.cpp: getDescriptors allocates QVC rhoCombined(quadRuleContainer, ncomp, 0), calls computeRho once, then copies component k into densVal[k] (stride-ncomp copy) to preserve the existing vector-of-1-component-QVCs interface consumed by ExchangeCorrelationFE.
11 computeFEMatrices K>1 extension: remove throw; resize output to K × basisOverlapSize; add K loop inside cell block loop; add buildAllComponentFxJxW device kernel. GGA: m=dim in scaleStridedVarBatched, k=Q*dim in GEMM, use gradBasis.
12 copyIntoBlockSpinBlocked kernel: collinear lda-change copy (src lda=d → dst lda=S*d), diagonal block (k,k) at (k*d+j)*(S*d)+k*d. copyIntoBlockDOFOuter kernel: non-collinear element scatter at stride S, block (k/S, k%S) at (j*S+s_out)*(S*d)+i*S+s_in.
13 ExchangeCorrelationFE::getLocal(): add d_xcCellWiseTemp storage member (K × Σdofs²); single computeFEMatrices call + K copyIntoBlock calls; dispatch on K==S (collinear) vs K==S*S (non-collinear).

13. Open Decisions

  1. Uniform numOrbitals per space. Both classes assume the same N for every space index. Different N_s per block would require a partitioned layout.

  2. ElpaScalapackManager sharing. All MultiVectorOps::project blocked-overload calls produce N×N matrices of equal size; they share one ProcessGrid.

  3. Orthogonalisation for product spaces. The Cholesky-GS overlap matrix (B overlap) also needs product-space-aware projection. MultiVectorOps::project (blocked and ProductSpace overloads) handle B as well as H — just pass B as the operator.

  4. Runtime dispatch in apply — resolved. d_spinMode (enum stored in KohnShamOperatorContextFE) is the sole dispatch mechanism. No isBlockDiagonal flag. No compile-time template branching — computeAxCellWise is a single private method with one if/else on d_spinMode.

  5. H block-copy helper in getLocal(). Each Hamiltonian component's getLocal() calls computeFEMatrices to produce dofs×dofs blocks and places them into the (S·dofs)×(S·dofs) per-cell buffer via a block-copy or element-scatter kernel:

    • Spin-independent (KineticFE etc.) and collinear XC: lda-change strided copy to diagonal block (k·dofs, k·dofs) (spin-blocked offsets).
    • Non-collinear XC: element scatter with stride S to DOF-outer block (top-left at (s_out·(S·dofs)+s_in), no dofs factor — see §4b). getLocal() returns the fully assembled numCells×(S·dofs)² buffer. KohnShamOperatorContextFE::reinit only accumulates these buffers with axpy.

Clone this wiki locally