-
Notifications
You must be signed in to change notification settings - Fork 3
Spin Design
All index lists are written slowest → fastest. Multi-cell global arrays use C row-major. Hamiltonian blocks within a cell use col-major (BLAS convention).
| Symbol | Meaning | Range |
|---|---|---|
M |
total local DOF count on this MPI rank | ≥ 1 |
S |
numSpaces = number of spin channels | 1 (unpolarized) or 2 (collinear / non-collinear) |
N |
numOrbitals = numVectorsPerSpace | ≥ 1 |
C |
numCells | ≥ 1 |
d, d_k
|
number of basis DOFs in cell k; written d when uniform |
≥ 1 |
Q |
quadrature points per cell | ≥ 1 |
K |
number of XC potential components (1, S, or S²) | 1, 2, or 4 |
dim |
spatial dimension | 3 |
i |
global context: local DOF index on this rank | 0..M-1 |
i, j
|
per-cell context: row and column DOF index within cell | 0..d-1 |
s, s_in, s_out
|
spin index | 0..S-1 |
n |
orbital index | 0..N-1 |
k |
§0.1–§0.6, §5–§12: cell index | 0..C-1 |
k |
§4a–§4b: XC component index | 0..K-1 |
c |
§4a–§4b: cell index | 0..C-1 |
p |
spatial DOF index within cell (gather loops) | 0..d-1 |
p_v |
virtual DOF index within cell = p·S + s (non-collinear) |
0..S·d-1 |
q |
quadrature point index within cell | 0..Q-1 |
row, col
|
row / column of the (S·d)×(S·d) Hamiltonian matrix | 0..S·d-1 |
α |
Cartesian spatial component | 0..dim-1 |
Symbol scoping note: i and j denote cell-local DOF indices (0..d-1) in all per-cell formulas (§0.3 H_cell, §4a block copies). In global eigenvector formulas (§0.1, §0.2) i is the rank-local DOF index (0..M-1). The context (per-cell vs. global) disambiguates the range unambiguously.
Global: M × S × N (M slowest, N fastest)
Flat: data[i*S*N + s*N + n]
-
i= local DOF index (0..M-1) -
s= space index (0..S-1) -
n= orbital index (0..N-1)
Two cell gather formats — selected by the multivector type passed to copyFieldToCellWiseData:
Collinear (MultiVectorProductSpaceBlocked):
numComponents = S*N, cellLocalIds = spatial node indices (Σd_k entries)
Output (C row-major): Σd_k × S*N → per cell: d_k × S*N
BLAS: S*N × d_k
Spin-s block: ptr = xCell + s*N, lda = S*N (no copy, valid BLAS strided submatrix)
Non-collinear (MultiVectorProductSpace):
numComponents = N, cellLocalIds = virtual spin-DOF indices i*S+s (S*Σd_k entries)
virtualId computed inline as spatialId * S + s — no separate spinAwareCellLocalIds array
Output (C row-major): S*Σd_k × N → per cell: S*d_k × N
BLAS: N × S*d_k
Unpolarized (MultiVectorProductSpace, S=1):
Same ProductSpace overload as non-collinear; S=1 degenerates:
virtualId=spatialId*1+0=spatialId, BLAS N × d_k — identical to old plain-MultiVector path
Physical storage per cell is identical for all three modes:
xCell (per cell): [dof (d_k)] [spin (S)] [orbital (N)]
The BLAS k-dim (contracting index) boundary differs by mode:
| Mode | BLAS k-dim (contracting) | BLAS m-dim | lda |
|---|---|---|---|
| Collinear |
[dof] per spin-s diagonal block |
[orbital (N)] |
S*N |
| Non-collinear |
[dof][spin] (virtual DOF i*S+s) |
[orbital (N)] |
N |
| Unpolarized | [dof] |
[orbital (N)] |
N |
Why non-collinear contracts over spin:
Non-collinear H has nonzero off-diagonal spin blocks (H_{01}, H_{10} ≠ 0). The output at spin s_out requires:
(Hψ)_{s_out,n} = Σ_{s_in} Σ_i H_{(j,s_out),(i,s_in)} ψ_{i,s_in,n}
Both spatial DOFs and spin channels must be contracted simultaneously.
Collinear is the special case: H_{01} = H_{10} = 0, spins decouple, one GEMM per diagonal block with no cross-spin contraction.
Unified flat format for all cases:
std::vector<RealType> eigenValues // size Nflattened = S*N
std::vector<RealType> fractionalOccupancy // size Nflattened = S*N
Flat index: s*N + n
-
Unpolarized (S=1):
Nflattened = N, degenerates to existing format. -
Collinear (S=2):
eigenValues[s*N + n]= eigenvalue of space-s orbital-n. Each spin channel has independent eigenvalues and occupancies:fractionalOccupancy[0*N + n] ≠ fractionalOccupancy[1*N + n]in general. -
NonCollinear (S=2):
eigenValues[s*N + n]— only N are physically distinct (one eigenvalue per spinor). By constructioneigenValues[0*N+n] == eigenValues[1*N+n]andfractionalOccupancy[0*N+n] == fractionalOccupancy[1*N+n]for all n. The flatS*Nformat is kept for uniformity;FractionalOccupancyFunctionsets both s=0 and s=1 entries to the same value for non-collinear.
RayleighRitzEigenSolver blocked and coupled overloads both output flat
std::vector<RealType> of size Nflattened — no std::vector<std::vector<RealType>>.
Owned by KohnShamOperatorContextFE. Same shape for Collinear and NonCollinear:
numCells × (S·d)² row-major
Within one cell, the (S·d) × (S·d) block matrix is laid out as:
[ H_{00} H_{01} ] Collinear (spin-blocked) only: H_{s_in, s_out} occupies
[ H_{10} H_{11} ] rows [s_in*d : (s_in+1)*d], cols [s_out*d : (s_out+1)*d].
Non-collinear uses DOF-outer ordering — spin blocks are
non-contiguous (see H_cell bracket notation below).
Flat offset of element (row, col) within cell k (col-major within cell):
k*(S*d)^2 + col*(S*d) + row
where row, col ∈ 0..S·d-1
Leading dimension (lda/ldb) for BLAS calls: S*d (shared for all three matrices X, H, Y)
Diagonal block s pointer arithmetic (for collinear GEMM):
H_s_ptr = H_cell_ptr + s*d*(S*d + 1) // top-left of block (s,s)
ldb = S*d // row stride — native BLAS lda parameter
Collinear: H_{01} = H_{10} = 0, only diagonal blocks H_{00}, H_{11} filled.
NonCollinear: all four blocks filled.
Unpolarized: S=1, degenerates to numCells × d² (existing layout, no change).
Why not numCells × S × S × d × d?
The spin-spin-outer layout would give flat offset k*S²*d² + s_in*S*d² + s_out*d² + i*d + j,
which is not a contiguous (S·d)×(S·d) matrix. The single non-collinear GEMM would
require a copy or a complex strided call. The numCells × (S·d)² layout avoids this —
both the full GEMM and the diagonal sub-GEMMs operate with the same lda = S*d.
Two distinct inner orderings, one per spin mode:
Collinear (spin-blocked):
H_cell: [cell] [col_spin (S)] [col_dof (d)] [row_spin (S)] [row_dof (d)]
flat = (s_out·d + j)·(S·d) + s_in·d + i
diagonal block s: ptr = H_cell + s·d·(S·d + 1), lda = S·d
Non-collinear (DOF-outer):
H_cell: [cell] [col_dof (d)] [col_spin (S)] [row_dof (d)] [row_spin (S)]
flat = (j·S + s_out)·(S·d) + i·S + s_in
Unpolarized (S=1):
H_cell: [cell] [col_dof (d)] [row_dof (d)]
flat = j·d + i
xCell col ↔ H_cell row — must match for GEMM correctness:
| Mode | xCell k-dim (col) | H_cell row index | Match |
|---|---|---|---|
| Collinear |
[dof] within spin-s block |
s_in·d + i = spin fixed, then dof |
✓ |
| Non-collinear |
[dof][spin] DOF-outer: i·S+s_in
|
i·S + s_in = dof outer, spin inner |
✓ |
| Unpolarized | [dof] |
[dof] |
✓ |
The two inner layouts are not interchangeable — each is forced by its gather's virtual DOF ordering and cannot be unified.
After FEBasisOperations::interpolate, the QVC holds Nflattened = S*N components
per quadrature point. Layout within one cell (C contiguous cells):
numCells × Q × S × N (numCells slowest, N fastest)
Flat: cell*Q*S*N + q*S*N + s*N + n
This matches the global eigenvector layout (S second-fastest, N fastest) so the
gather from M × S × N into the QVC is a contiguous block copy per cell.
Density is stored in DensityDescrAttr::Val as a vector of ncomp QVCs, each with
1 component per quadrature point (scalar field). This matches the existing codebase
convention used by ExchangeCorrelationFE and RDM1FE.
| Mode | ncomp | densVal[0] |
densVal[1] |
densVal[2] |
densVal[3] |
|---|---|---|---|---|---|
| Unpolarized | 1 | ρ_total | — | — | — |
| Collinear | 2 | ρ_total | Mz | — | — |
| Non-collinear | 4 | ρ_total | Mx | My | Mz |
Definitions (computed from wavefunctions at each quad point q):
Unpolarized (S=1):
ρ_total(q) = Σ_n f_n |ψ_n(q)|²
Collinear (S=2):
f[s*N+n] independent for each spin s
ρ_total(q) = Σ_s Σ_n f[s*N+n] |ψ_{s,n}(q)|²
Mz(q) = Σ_n f[0*N+n] |ψ_{0,n}(q)|² − Σ_n f[1*N+n] |ψ_{1,n}(q)|²
Non-collinear (S=2):
f[0*N+n] = f[1*N+n] = f_n (same occupancy for both spin components of spinor n)
ρ_total(q) = Σ_n f_n ( |ψ_{0,n}(q)|² + |ψ_{1,n}(q)|² )
Mz(q) = Σ_n f_n ( |ψ_{0,n}(q)|² − |ψ_{1,n}(q)|² )
Mx(q) = Σ_n f_n 2 Re[ ψ*_{0,n}(q) ψ_{1,n}(q) ]
My(q) = Σ_n f_n (−2 Im[ ψ*_{0,n}(q) ψ_{1,n}(q) ])
Relationship to spin-density matrix (used internally by XC functionals):
ρ_{↑↑} = (ρ_total + Mz) / 2
ρ_{↓↓} = (ρ_total − Mz) / 2
ρ_{↑↓} = (Mx − i·My) / 2
ρ_{↓↑} = (Mx + i·My) / 2
ExchangeCorrelationFE::reinitField already performs this conversion for collinear:
spinVals[0][i] = 0.5 * (rhoTotal[i] + Mz[i]); // ρ↑
spinVals[1][i] = 0.5 * (rhoTotal[i] - Mz[i]); // ρ↓The non-collinear conversion (Mx, My, Mz → full spin-density matrix) follows the same logic.
Same ncomp-vector-of-QVCs convention as density. Each gradient QVC has 3 components
per quadrature point (Cartesian α = 0, 1, 2).
| Mode | ncomp | gradDensVal[0] |
gradDensVal[1] |
gradDensVal[2] |
gradDensVal[3] |
|---|---|---|---|---|---|
| Unpolarized | 1 | ∇ρ_total | — | — | — |
| Collinear | 2 | ∇ρ_total | ∇Mz | — | — |
| Non-collinear | 4 | ∇ρ_total | ∇Mx | ∇My | ∇Mz |
Unpolarized:
∇ρ_total(q, α) = Σ_n f_n 2 Re[ψ*_n(q) ∂_α ψ_n(q)]
Collinear:
∇ρ_total(q, α) = Σ_s Σ_n f[s*N+n] 2 Re[ψ*_{s,n}(q) ∂_α ψ_{s,n}(q)]
∇Mz(q, α) = Σ_n f[0*N+n] 2 Re[ψ*_{0,n} ∂_α ψ_{0,n}]
− Σ_n f[1*N+n] 2 Re[ψ*_{1,n} ∂_α ψ_{1,n}]
Non-collinear:
∇ρ_total, ∇Mz same structure as collinear with f_n uniform
∇Mx(q, α) = Σ_n f_n 2 Re[∂_α(ψ*_{0,n} ψ_{1,n})]
∇My(q, α) = Σ_n f_n (−2 Im[∂_α(ψ*_{0,n} ψ_{1,n})])
Per cell k with d = d_k; p = spatial DOF index within cell (0..d-1).
All BLAS GEMMs below are H · x → y where H is presented col-major, x/y are presented
as col-major (C row-major viewed transposed). lda = leading dimension (column stride).
Collinear (S=2, MultiVectorProductSpaceBlocked)
multiVector (global, C row-major):
layout: [DOF (M)] [spin (S)] [orbital (N)]
flat: i·S·N + s·N + n i ∈ 0..M-1
xCell (per cell, C row-major):
layout: [spatial DOF (d)] [spin (S)] [orbital (N)]
flat: p·S·N + s·N + n p ∈ 0..d-1
BLAS view (transposed): [spin-orbital (S·N)] × [spatial DOF (d)]
spin-s block: ptr = xCell + s·N, lda = S·N (no copy — valid BLAS strided submatrix)
H_cell (per cell, col-major, lda = S·d):
layout: [col_spin (S)] [col_dof (d)] [row_spin (S)] [row_dof (d)]
flat: (s_out·d + j)·(S·d) + s_in·d + i i,j ∈ 0..d-1
diagonal block s: ptr_s = H_cell + s·d·(S·d+1)
GEMM (batchCount = S×C, one per spin-s block per cell):
H_s · xCell_s → yCell_s
(d × d) · (d × N) = (d × N)
contracts over: [col DOF (d)] of H = [spatial DOF (d)] of xCell_s ✓
(off-diagonal blocks H_{s_in≠s_out} = 0; only diagonal blocks used)
yCell (per cell, C row-major):
layout: [spatial DOF (d)] [spin (S)] [orbital (N)] ← identical to xCell
spin-s block: ptr = yCell + s·N, lda = S·N
Non-collinear (S=2, MultiVectorProductSpace)
multiVector (global, C row-major):
layout: [DOF (M)] [spin (S)] [orbital (N)]
flat: i·S·N + s·N + n (identical to collinear)
xCell (per cell, C row-major):
virtualId = spatialId·S + s → DOF-outer within cell: [spatial DOF (d)] [spin (S)]
layout: [spatial DOF (d)] [spin (S)] [orbital (N)]
flat: (p·S + s)·N + n p ∈ 0..d-1
BLAS view (transposed): [orbital (N)] × [virtual DOF (S·d)]
lda = N
H_cell (per cell, col-major, lda = S·d):
layout: [col_dof (d)] [col_spin (S)] [row_dof (d)] [row_spin (S)]
flat: (j·S + s_out)·(S·d) + i·S + s_in i,j ∈ 0..d-1
GEMM (batchCount = C, one per cell):
H · xCell → yCell
(S·d × S·d) · (S·d × N) = (S·d × N)
contracts over: [col virtual DOF (S·d)] of H = [virtual DOF (S·d)] of xCell ✓
(off-diagonal H blocks H_{s_in,s_out≠s_in} ≠ 0; spin channels must be contracted together)
yCell (per cell, C row-major):
layout: [spatial DOF (d)] [spin (S)] [orbital (N)] ← identical to xCell
lda = N
Unpolarized (S=1, MultiVectorProductSpace)
multiVector (global, C row-major):
layout: [DOF (M)] [orbital (N)] (S=1 collapses spin axis)
flat: i·N + n i ∈ 0..M-1
xCell (per cell, C row-major):
S=1 → virtualId = spatialId·1 + 0 = spatialId (degenerates to spatial gather)
layout: [spatial DOF (d)] [orbital (N)]
flat: p·N + n p ∈ 0..d-1
BLAS view (transposed): [orbital (N)] × [spatial DOF (d)]
lda = N
H_cell (per cell, col-major, lda = d):
layout: [col_dof (d)] [row_dof (d)] (no spin blocks; S=1 → (S·d)²=d²)
flat: j·d + i i,j ∈ 0..d-1
GEMM (batchCount = C, one per cell):
H · xCell → yCell
(d × d) · (d × N) = (d × N)
contracts over: [col DOF (d)] = [spatial DOF (d)] ✓
yCell (per cell, C row-major):
layout: [spatial DOF (d)] [orbital (N)] ← identical to xCell
lda = N
Contraction alignment — xCell column ↔ H_cell row (must match for GEMM correctness):
| Mode | xCell BLAS k-dim (columns of x) | H_cell row index | Aligned? |
|---|---|---|---|
| Collinear |
p within spin-s block (0..d-1) |
s_in·d + i, spin fixed → i ∈ 0..d-1 |
✓ |
| Non-collinear |
p·S+s (virtual DOF, 0..S·d-1) |
i·S+s_in (DOF-outer, 0..S·d-1) |
✓ |
| Unpolarized |
p (0..d-1) |
i (0..d-1) |
✓ |
Enable two storage patterns with minimal forking.
Explicitly unchanged:
-
ChebyshevFilteredEigenSolver/ChebyshevFilter— internal batching logic untouched; correctness for spin is ensured byKohnShamEigenSolverpassingeigenVectorBatchSize=S*N FEBasisOperations::interpolate-
FEBasisOperations::computeFEMatrices— API/signature unchanged;QuadratureValuesContaineralready carriesnumComponents(K), so passing a K-component V_xc QVC naturally produces Kdofs×dofsmatrices in one call (basis values computed once). No new overload needed. Implementation extended: K>1 throw removed; output resized toK × basisOverlapSize; K loop added inside the cell block loop (see §4a / Phase 11).
Modified (minimal changes only, existing logic untouched):
-
LanczosExtremeEigenSolver—d_initialGuessand constructor/reinitparameter changed fromVectortoMultiVector; existingVectorcallers unaffected (implicit upcast) -
RayleighRitzEigenSolver— existingsolve(MultiVector&)extended with onedynamic_castbranch (Blocked) + unconditional ProductSpace cast; no new public overloads -
KohnShamOperatorContextFE— storesd_spinMode; owns H cell buffers;reinitzero-fills then axpy-accumulates fully-assembled buffers from each component'sgetLocal();applydispatches via single privatecomputeAxCellWisewith if/else ond_spinMode -
FECellWiseDataOperations— two typed overloads ofcopyFieldToCellWiseData/addCellWiseDataToFieldDataas clean public API; gather/scatter primitives only -
KohnShamEigenSolver— constructs typed product-space multivector; setseigenVectorBatchSize=S*N -
KohnShamDFT,DensityCalculator,ExchangeCorrelationFE
Dispatch design — three clear rules:
-
ksdft/—d_spinModeis the sole dispatch mechanism inapply. Nodynamic_cast, noisBlockDiagonalflag. Single privatecomputeAxCellWisewith one if/else ond_spinMode. -
linearAlgebra/—dynamic_castoneigenVectorsinRayleighRitzEigenSolver::solve. Works becauseeigenVectorsis the original typed object passed by reference all the way down fromKohnShamEigenSolver.ChebyshevFilteredEigenSolveris completely unchanged — itsd_XinBatchis plainMultiVector; ksdft handles dispatch viad_spinMode. -
basis/— spin-agnostic.FEBasisOperations::interpolateusesnumComponents = X.getNumberComponents() = S*N— nodynamic_cast, no spin knowledge.FECellWiseDataOperationstyped overloads exist as clean public API for callers with typed objects; they are not the dispatch mechanism in theapplyhot path.
Rule: linearAlgebra/ and basis/ layers are physics-agnostic.
MultiVector
└── MultiVectorProductSpace (S=1: unpolarized, S=2: non-collinear)
└── MultiVectorProductSpaceBlocked (S=2: collinear)
-
MultiVectorProductSpacewithS=1covers the unpolarized case — no separate plain-MultiVectorpath needed for new code. -
MultiVectorProductSpaceBlockedinherits fromMultiVectorProductSpaceso the compiler picks the most-derived overload inRayleighRitzEigenSolver::solveandMultiVectorOps::project/rotate.
All classes use the same axis order:
M (slowest) × numSpaces × numVectors (fastest)
-
M—locallyOwnedSize(), DOF count per MPI rank -
numSpaces— product-space inner dimension (S=1 unpolarized, S=2 collinear/non-collinear) -
numVectorsPerSpace— number of vectors per space component (N) -
Nflattened=numSpaces × numVectorsPerSpace← what the baseMultiVectorsees asnumVectors()
Data for DOF i, space s, vector n sits at flat offset i*S*N + s*N + n.
The only difference between the two classes is the RayleighRitz dispatch:
-
MultiVectorProductSpace— single ELPA call (unpolarized S=1, non-collinear S=2) -
MultiVectorProductSpaceBlocked— ELPA loop over S independent eigenproblems (collinear S=2)
H buffer shape (numCells × (S·dofs)²) is identical for both classes. computeAxCellWise is a single private method in KohnShamOperatorContextFE that dispatches on d_spinMode — the gather format and GEMM shape differ by spin mode (see §5). RayleighRitz dispatch differs: MultiVectorProductSpaceBlocked → ELPA loop; MultiVectorProductSpace → single ELPA.
Derives from MultiVector with numVectors_base = numSpaces × numVectorsPerSpace.
Covers unpolarized (S=1) and non-collinear (S=2).
Files to add:
src/linearAlgebra/MultiVectorProductSpace.h
src/linearAlgebra/MultiVectorProductSpace.t.cpp
template <typename ValueType, utils::MemorySpace memorySpace>
class MultiVectorProductSpace : public MultiVector<ValueType, memorySpace>
{
public:
MultiVectorProductSpace(
std::shared_ptr<const utils::mpi::MPIPatternP2P<memorySpace>> mpiPatternP2P,
std::shared_ptr<LinAlgOpContext<memorySpace>> linAlgOpContext,
size_type numSpaces, // S
size_type numVectorsPerSpace, // N
ValueType initVal = 0);
size_type numSpaces() const; // S
size_type numVectorsPerSpace() const; // N
// base numVectors() = S*N (Nflattened)
private:
size_type d_numSpaces;
size_type d_numVectorsPerSpace;
};Derives from MultiVectorProductSpace. Identical layout and interface.
Differs only in that RayleighRitzEigenSolver::solve and MultiVectorOps::project
/ rotate pick the blocked overload (ELPA loop) via most-derived overload resolution.
Used for collinear (S=2).
Files to add:
src/linearAlgebra/MultiVectorProductSpaceBlocked.h
src/linearAlgebra/MultiVectorProductSpaceBlocked.t.cpp
template <typename ValueType, utils::MemorySpace memorySpace>
class MultiVectorProductSpaceBlocked
: public MultiVectorProductSpace<ValueType, memorySpace>
{
public:
// Inherits constructor, numSpaces(), numVectorsPerSpace() unchanged.
using MultiVectorProductSpace<ValueType, memorySpace>::MultiVectorProductSpace;
};Both classes inherit all MultiVector operations operating on Nflattened = S*N
flat columns. No overrides are needed:
add, axpy, scalar ops — element-wise on all Nflattened entries.
Correct for both collinear and non-collinear as-is.
l2Norms() / lInfNorms() — returns S*N values, one per flat (s,n) column:
- Collinear: correct — S*N independent norms, one per spin-orbital.
- Non-collinear: returns per-spin-component norms, not per-spinor norm (true spinor
norm =
sqrt(||ψ_{0,n}||² + ||ψ_{1,n}||²)). HoweverChebyshevFilteruses these only for normalization/convergence checks — per-component norms are acceptable there and cause no correctness issue.
dot() — computes S*N independent dot products, one per flat column pair:
- Collinear: correct — S*N independent dot products.
- Non-collinear: the RayleighRitz projection requires
P_mn = Σ_s X[:,s*N+m]^H (HX)[:,s*N+n]— a block-trace sum across s that the basedot()does NOT perform. This is whyMultiVectorOps::projectexists — it handles the coupled sum explicitly and is the only path used for the projection step. The basedot()is never called directly for RayleighRitz projection.
Both MultiVectorProductSpace and MultiVectorProductSpaceBlocked share:
H cell buffer shape (owned by KohnShamOperatorContextFE):
numCells × (S·dofs)²
Layout within one cell: (S·dofs) × (S·dofs) col-major (BLAS convention), block-indexed as
H_cell[s_in * dofs : (s_in+1)*dofs, s_out * dofs : (s_out+1)*dofs]
(collinear spin-blocked only; non-collinear uses DOF-outer ordering — see §0.3 bracket notation).
Cell gather — format differs by spin mode (see §5 and §0.1):
Collinear (Blocked overload): S*N × d_k spatial-flat; spin-s block via ptr+s*N, lda=S*N
Unpolarized/NonCollinear (ProductSpace overload): N × S*d_k spin-aware; virtualId=spatialId*S+s
Cell scatter-back: mirrors gather, same typed overload.
H assembly in KohnShamOperatorContextFE::reinit:
Each Hamiltonian component is constructed with SpinMode and encapsulates its own
spin structure. getLocal() always returns a numCells × (S·dofs)² buffer, fully
assembled. KohnShamOperatorContextFE::reinit is uniform — it just accumulates:
std::fill(d_HCellBuffer.begin(), d_HCellBuffer.end(), 0);
for (auto& h : d_hamiltonianComponents)
axpy(1.0, h.getLocal(), d_HCellBuffer); // all same shape: numCells × (S·dofs)²SpinMode is encapsulated inside each Hamiltonian component for H assembly. The operator
stores d_spinMode for the if/else dispatch in computeAxCellWise (collinear vs else-branch).
What each component does internally in getLocal():
| Component | Spin structure |
computeFEMatrices (one call, K outputs) |
Post-processing in getLocal()
|
Note |
|---|---|---|---|---|
KineticFE, ElectrostaticLocalFE
|
No spin index — same potential for all s | K=1 → 1 dofs×dofs matrix (same for all s) |
Copy that same matrix into each of the S diagonal blocks of (S·dofs)²; off-diagonal zero |
1 result → S copies |
ExchangeCorrelationFE (Collinear) |
K=2 V_xc components (V_xc^↑ ≠ V_xc^↓) | K=2 → 2 different dofs×dofs matrices |
2 strided block-copies into diagonal blocks (0,0) and (1,1) of (S·dofs)²; off-diagonal zero |
2 distinct results |
ExchangeCorrelationFE (NonCollinear) |
K=4 V_xc components (V_xc^↑↑, V_xc^↑↓, V_xc^↓↑, V_xc^↓↓ all different) | K=4 → 4 different dofs×dofs matrices |
4 DOF-outer element scatters into all four blocks; element (i,j) of block k=(s_in,s_out) placed at (row=i·S+s_in, col=j·S+s_out) in the (S·dofs)² matrix (stride S; see §4b) |
4 distinct results |
| Any component (Unpolarized, S=1) | S=1, (S·dofs)²=dofs²
|
K=1 → 1 dofs×dofs matrix |
No block-copy needed — buffer IS the H cell matrix | degenerate |
QuadratureValuesContainer already carries numComponents = K. The existing
computeFEMatrices interface is unchanged — getLocal() passes the K-component
V_xc QVC directly and receives K dofs×dofs matrices (one per component), with basis
values computed once internally. No new overload needed.
Unpolarized (K=1):
density QVC: numCells × Q × 1 [ρ]
V_xc QVC: numCells × Q × 1 [V_xc]
output: numCells × 1 × dofs² → placed at diagonal block (0,0) of dofs² buffer
Collinear (K=2):
density QVC: numCells × Q × 2 [ρ_↑, ρ_↓] ← ExchangeCorrelationFE internal form;
converted from densVal [ρ_total, Mz] (§0.5)
via ρ_↑=(ρ_total+Mz)/2, ρ_↓=(ρ_total−Mz)/2
V_xc QVC: numCells × Q × 2 [V_xc^↑, V_xc^↓]
output: numCells × 2 × dofs² → block-copied to diagonal blocks (0,0),(1,1) of (S·dofs)²
NonCollinear (K=4):
density QVC: numCells × Q × 4 [ρ_↑↑, ρ_↑↓, ρ_↓↑, ρ_↓↓] ← ExchangeCorrelationFE internal form;
converted from densVal [ρ_total,Mx,My,Mz] (§0.5)
via ρ_{↑↑}=(ρ_total+Mz)/2, ρ_{↓↓}=(ρ_total−Mz)/2,
ρ_{↑↓}=(Mx−i·My)/2, ρ_{↓↑}=(Mx+i·My)/2
V_xc QVC: numCells × Q × 4 [V_xc^{↑↑}, …, V_xc^{↓↓}]
output: numCells × 4 × dofs² → block-copied to all four blocks of (S·dofs)²
Each component's getLocal() calls computeFEMatrices once with the K-component
QVC, then does K block-copies (or element scatters) into the (S·dofs)² H buffer:
-
Collinear: lda-change copy to diagonal block
(k,k)at(k·dofs, k·dofs)offset. -
Non-collinear: element scatter (stride S) to DOF-outer block
(s_in, s_out)at top-left offset(s_in, s_out)(not(s_in·dofs, s_out·dofs); see §4a / §4b for exact formulas). Spin-block assembly lives entirely ingetLocal().
Two hard blockers in the current implementation:
Hard throw for K > 1 (BasisWeakFormKernelWithField, FEBasisOperations.t.cpp):
utils::throwException(
f.getNumberComponents() == 1,
"quadValuesContainer f in BasisWeakFormKernelWithField"
" can be only a scalar field in real space with 1 component.");Removed when the function is extended to handle K > 1.
Output layout mismatch: existing function resizes cellWiseFEData to Σdofs_k² (one d×d
per cell). For collinear S=2, getLocal() must return Σ(S·d_k)² — a (S·d)×(S·d) per cell.
A block copy is always required after computeFEMatrices.
The existing function is modified in place:
- Throw for K > 1 removed.
-
cellWiseFEDataresized toK × basisOverlapSize. - K loop placed inside the cell block loop —
getBasisDataInCellRangecalled once per block, reused for all K components. - Basis layer remains spin-agnostic: K is a generic component count.
Output storage: [K][cell][d²] (K slowest, element fastest within cell)
flat offset for component k, cell c, col-major element (i,j):
k * basisOverlapSize + cellOffset_c + j*d_c + i
where basisOverlapSize = Σ_c d_c²
cellOffset_c = Σ_{c'<c} d_{c'}²
K values and k → (s_in, s_out) mapping:
| Spin mode | K | s_in | s_out |
|---|---|---|---|
| Unpolarized | 1 | 0 | 0 |
| Collinear | S=2 | k |
k (diagonal only; use s_in=k, s_out=k directly) |
| Non-collinear | S²=4 | k/S |
k%S |
Internal structure (K loop inside cell block loop):
K = f.getNumberComponents()
cellWiseFEData.resize(K * basisOverlapSize, 0)
// Build fxJxW for all K components (one device kernel):
// fxJxW[k * numCumulativeQuadCells + q] = f(q, k) * JxW[q]
// f.data() layout: [quad][K], flat index q*K + k
buildAllComponentFxJxW(f, jxwStorage, K, fxJxW)
cell block loop:
getBasisDataInCellRange(cellStartId, cellEndId, d_basisDataInCellRange) // ONCE per block
for k = 0..K-1:
scaleStridedVarBatched(
m=1, n=d, k=Q,
A = fxJxW.data() + k*numCumulativeQuadCells + quadCellsInBlockOffset,
B = d_basisDataInCellRange,
C = d_fxJxWxNBlock)
gemmStridedVarBatched(
transA='N', transB='C', m=d, n=d, k=Q, lda=ldb=ldc=d,
A = d_fxJxWxNBlock,
B = d_basisDataInCellRange,
C = cellWiseFEData.begin() + k*basisOverlapSize + NifNjStartOffset)
NifNjStartOffset += numCumulativeDofsxDofsCellsInBlock
quadCellsInBlockOffset += numCumulativeQuadCellsInBlock
end cell block loop
d_fxJxWxNBlock is reused for each k — same size as current. K=1 degenerates exactly to the
existing path.
Matrix element for GGA component k:
M_k[i,j] = Σ_{alpha} Σ_q f_{k,alpha}(q)·JxW(q)·N_i(q)·∂N_j/∂x_alpha(q)
d_derExcWithSigmaTimesGradRhoQuadMemspace: K*dim components, layout [quad][K][dim].
Inside the K loop, two parameter changes relative to LDA:
| LDA | GGA | |
|---|---|---|
| fxJxW size per component | Q | Q·dim |
scaleStridedVarBatched m |
1 | dim |
d_fxJxWxNBlock per cell |
d×Q | d×(Q·dim) |
gemmStridedVarBatched k |
Q | Q·dim |
| GEMM B | basisData | gradBasis |
| Output per (k, cell) | d×d | d×d (same) |
Both A and B have identical column ordering q*dim + alpha, so the double contraction over
(q, alpha) is correct with a single GEMM per (k, cell block). No dim loop needed.
Source: component k, cell c — plain col-major d×d at k*basisOverlapSize + cellOffset_c.
Target: col-major (S·d)×(S·d) per cell, ldb = S·d. Two formulas by spin mode:
Collinear — spin-blocked H (k → diagonal block (k, k)):
Pure lda-change; src and dst both contiguous within each column copy.
For each cell c, column j = 0..d-1:
src: cellWiseFEData[k*basisOverlapSize + cellSrcOffset + j*d] (d contiguous)
dst: cellWiseStorage[cellDstOffset + (k*d+j)*(S*d) + k*d] (d contiguous)
length: d, lda_src = d, lda_dst = S*d
Non-collinear — DOF-outer H (k → block (s_in=k/S, s_out=k%S)):
Element scatter; stride S in dst (50% GPU write coalescing for S=2).
For each cell c, column j = 0..d-1, row i = 0..d-1:
src: cellWiseFEData[k*basisOverlapSize + cellSrcOffset + j*d + i] (sequential)
dst: cellWiseStorage[cellDstOffset + (j*S+s_out)*(S*d) + i*S+s_in] (stride S)
Source reads are sequential for both modes. Destination write coalescing:
| Spin mode | Dst stride within column | GPU coalescing |
|---|---|---|
| Collinear | stride 1 (d consecutive) | fully coalesced |
| Non-collinear | stride S=2 | 50% coalesced |
Copy cost is negligible relative to the GEMM (ratio ≈ Q = 1728× cheaper).
Off-diagonal blocks for collinear are covered by zero-init of cellWiseStorage on entry.
void getLocal(Storage &cellWiseStorage) const
{
const size_type K = d_xcPotentialQuadMemspace->getNumberComponents();
const size_type S = d_S;
cellWiseStorage.resize(hSize, ValueType(0)); // zero-init covers collinear off-diag blocks
// single call: basis fetched once per cell block, K GEMMs inside
// output layout: [K][cell][d²]
d_feBasisOp->computeFEMatrices(
LinearLocalOp::IDENTITY, VectorMathOp::MULT,
VectorMathOp::MULT, LinearLocalOp::IDENTITY,
*d_xcPotentialQuadMemspace, // K-component QVC
d_xcCellWiseTemp, // Storage member, K × Σdofs²
*d_linAlgOpContext);
// K block-copies: [K][cell][d²] → [cell][(S·d)²]
for (size_type k = 0; k < K; ++k)
{
if (K == S) {
// collinear: component k → diagonal block (k,k); spin-blocked H; lda-change copy
copyIntoBlockSpinBlocked(d_xcCellWiseTemp, k,
cellWiseStorage, /*s=*/k, S,
d_numCellDofs, *d_linAlgOpContext);
} else {
// non-collinear: component k → block (s_in=k/S, s_out=k%S); DOF-outer H; element scatter
const size_type s_in = k / S;
const size_type s_out = k % S;
copyIntoBlockDOFOuter(d_xcCellWiseTemp, k,
cellWiseStorage, s_in, s_out, S,
d_numCellDofs, *d_linAlgOpContext);
}
}
}Unpolarized (K=S=1): the K==S branch fires with s=0. copyIntoBlockSpinBlocked places the
single d×d block at diagonal (0,0) — no actual copy since H cell IS the d×d buffer (S=1).
Spin logic lives only in getLocal(). computeFEMatrices is spin-agnostic throughout.
In §4a notation: c = cell index (0..C-1), k = XC component index (0..K-1),
i, j = cell-local DOF indices (0..d-1, where d = d_c).
d_xcCellWiseTemp: [component (K)] [all cells] [d²]
total size: K × basisOverlapSize elements
basisOverlapSize = Σ_{c=0}^{C-1} d_c²
cellOffset_c = Σ_{c'=0}^{c-1} d_{c'}² (cumulative dofs² before cell c)
Per component k, per cell c: col-major d_c × d_c block
element (row i, col j):
flat = k·basisOverlapSize + cellOffset_c + j·d_c + i
i,j ∈ 0..d_c-1
This is the raw output of computeFEMatrices — K independent col-major d×d matrices
stacked component-first, then cell-interleaved within each component's slice.
cellWiseStorage: [all cells] [(S·d)²]
total size: C × (S·d)² elements (d uniform for clarity)
cellDstOffset_c = c·(S·d)² (each cell occupies exactly (S·d)² elements)
Per cell c: col-major (S·d) × (S·d) block
element (row, col):
flat = cellDstOffset_c + col·(S·d) + row
row, col ∈ 0..S·d-1
Pure lda-change: src columns are length-d; dst columns are length-S·d but only d entries written (the spin-k sub-column). No element reordering. Full GPU write coalescing.
For cell c, column j ∈ 0..d-1: (copy d contiguous elements)
src start: d_xcCellWiseTemp + k·basisOverlapSize + cellOffset_c + j·d_c
(element (row=0, col=j) of component k, cell c)
dst start: cellWiseStorage + cellDstOffset_c + (k·d+j)·(S·d) + k·d
(row k·d, col k·d+j of the (S·d)×(S·d) block — top-left of diagonal block k)
copy length: d (rows i=0..d-1 of column j, block-k row offset = k·d)
src lda = d, dst lda = S·d
Off-diagonal blocks (row spin ≠ col spin) remain zero from cellWiseStorage zero-init.
Element scatter: each source element maps to a non-adjacent destination (stride S in dst). 50% GPU write coalescing for S=2.
For cell c, column j ∈ 0..d-1, row i ∈ 0..d-1: (one scalar write per element)
src: d_xcCellWiseTemp + k·basisOverlapSize + cellOffset_c + j·d_c + i
(element (row i, col j) of component k, cell c)
dst: cellWiseStorage + cellDstOffset_c + (j·S + s_out)·(S·d) + i·S + s_in
(col = j·S+s_out, row = i·S+s_in of the (S·d)×(S·d) DOF-outer block)
dst stride between consecutive i: S (50% coalescing)
S=1 → (S·d)² = d²; cellWiseStorage has the same shape as the d_xcCellWiseTemp k=0 slice.
copyIntoBlockSpinBlocked places the d×d block at (0,0) with lda_dst = 1·d = d —
src and dst are byte-for-byte identical; the copy is a no-op or a plain memcpy.
| Mode | K |
d_xcCellWiseTemp shape |
Copy kernel |
cellWiseStorage inner layout |
Write coalescing |
|---|---|---|---|---|---|
| Unpolarized | 1 | [1][cell][d²] |
SpinBlocked(k=0, s=0) |
col-major d×d (S=1) |
100% |
| Collinear | 2 | [2][cell][d²] |
SpinBlocked(k→diag k) |
col-major (S·d)² spin-blocked |
100% |
| Non-collinear | 4 | [4][cell][d²] |
DOFOuter(k→(k/S, k%S)) |
col-major (S·d)² DOF-outer |
50% |
File modified:
src/basis/FECellWiseDataOperations.h / .cpp / DeviceKernels
copyFieldToCellWiseData and addCellWiseDataToFieldData are overloaded on the
multivector type. The type encodes gather strategy at compile time — no runtime
branching, no extra spin parameters, no spinAwareCellLocalIds array.
// ── Collinear (MultiVectorProductSpaceBlocked) ──────────────────────────────
// Spatial-flat gather: numComponents=S*N, spatial node indices, totalCellDofs=Σd_k
// Output BLAS: S*N × d_k per cell
// Spin-s block: ptr = xCell + s*N, lda = S*N (valid BLAS strided submatrix, no copy)
static void copyFieldToCellWiseData(
const MultiVectorProductSpaceBlocked<VT, MS>& X,
const size_type* cellLocalIdsStartPtr,
const size_type totalCellDofs, // Σd_k
VT* cellWiseStorage);
// ── Unpolarized (S=1) + Non-collinear (S=2) (MultiVectorProductSpace) ───────
// Spin-aware gather: numComponents=N, virtual DOF index i*S+s computed inline
// cellLocalIdsStartPtr: spatial indices unchanged; totalCellDofs=Σd_k (not S*Σd_k)
// Output BLAS: N × S*d_k per cell
// S=1 (unpolarized): virtualId=spatialId*1+0=spatialId — degenerates to existing path
// No spinAwareCellLocalIds array — virtualId = spatialId * S + s computed per thread
static void copyFieldToCellWiseData(
const MultiVectorProductSpace<VT, MS>& X,
const size_type* cellLocalIdsStartPtr,
const size_type totalCellDofs,
VT* cellWiseStorage);Overload resolution is automatic: MultiVectorProductSpaceBlocked IS-A MultiVectorProductSpace,
so the compiler picks the most-derived match — collinear gets the Blocked overload, unpolarized
and non-collinear get the ProductSpace overload. No MultiVector base overload needed — all
callers hold a typed product-space object.
Symmetric addCellWiseDataToFieldData overloads follow the same two signatures.
Raw pointer overloads removed from the ksdft-facing public API. Every ksdft caller
(in computeAxCellWise / computeAxCellWiseOptimized) passes a typed product-space object.
Exception — FEBasisOperations::interpolate: interpolate takes const MultiVector<VT,MS>& X
(base type, spin-agnostic). It calls copyFieldToCellWiseData with X.data() and
numComponents = X.getNumberComponents() = S*N — a purely spatial gather treating all S*N
components as independent. For this case the existing raw pointer overload is retained as an
internal utility (not the ksdft hot path). interpolate itself is unchanged.
| Overload | numComponents |
DOF loop iterations | Address formula |
|---|---|---|---|
MultiVectorProductSpaceBlocked |
S*N |
Σd_k |
data[spatialId * S*N + comp] |
MultiVectorProductSpace (S=1 or S=2) |
N |
S * Σd_k (inner loop over s) |
data[(spatialId*S + s) * N + n] |
The non-collinear kernel loops s = 0..S-1 and writes rows s*d_k...(s+1)*d_k - 1
of the output. S = X.numSpaces() is extracted from the multivector object — no
separate parameter.
computeAxCellWise is a private method of KohnShamOperatorContextFE. It reads
d_spinMode directly — no spin parameter at the call site, no named forks.
FECellWiseDataOperations provides only the gather/scatter primitives.
Collinear is the only special case — it exploits block-diagonal H to avoid wasted FLOPs.
Unpolarized and non-collinear share the same else-branch: spin-aware gather with S=1
degenerates exactly to the existing spatial gather (virtualId = spatialId*1+0 = spatialId,
H_cell is d_k × d_k, GEMM is identical to the existing unpolarized path).
// KohnShamOperatorContextFE::apply
computeAxCellWise(X, Y); // no spin params — d_spinMode is a member
// KohnShamOperatorContextFE::computeAxCellWise (private)
void computeAxCellWise(const MultiVector<VT,MS>& X, MultiVector<VT,MS>& Y)
{
if (d_spinMode == SpinMode::Collinear) {
// spatial-flat gather: numComponents=S*N → xCell BLAS S*N × d_k
// ONE gemmStridedVarBatched(S*numCells): exploit block-diagonal H, no wasted FLOPs
}
else {
// spin-aware gather: numComponents=N, virtualId=spatialId*S+s → xCell BLAS N × S*d_k
// ONE gemmStridedVarBatched(numCells): full S*d_k × S*d_k H
// S=1 unpolarized: degenerates identically to existing path
// S=2 non-collinear: full off-diagonal H blocks
}
}else-branch (Unpolarized S=1 and NonCollinear S=2) — ONE gemmStridedVarBatched(numCells):
gemmStridedVarBatched(batchCount = numCells,
m[k] = N, n[k] = S*d_k, k[k] = S*d_k,
A[k] = xCell_k, lda[k] = N,
B[k] = H_cell_k, ldb[k] = S*d_k,
C[k] = yCell_k, ldc[k] = N,
alpha=1, beta=0)
S=1: S*d_k = d_k, H_cell is d_k × d_k — identical to existing unpolarized GEMM.
S=2: S*d_k = 2*d_k, full 2*d_k × 2*d_k H — FLOPs: 8·N·d_k²
Collinear branch — ONE gemmStridedVarBatched(S×numCells):
gemmStridedVarBatched(batchCount = S * numCells,
m[b] = N, n[b] = d_k, k[b] = d_k,
A[b] = xCell_flat_k + s*N, lda[b] = S*N,
B[b] = H_cell_k + s*d_k*(S*d_k+1), ldb[b] = S*d_k,
C[b] = yCell_flat_k + s*N, ldc[b] = S*N,
alpha=1, beta=0)
FLOPs per cell k: 4·N·d_k² (S=2) — diagonal H blocks only, no wasted FLOPs
computeAxCellWiseOptimized is a separate private method of KohnShamOperatorContextFE
with its own if/else on d_spinMode — it does not call computeAxCellWise. Both
routes share the same dispatch pattern but differ structurally: the optimized route
interleaves non-local PSP steps between the gather and the H_cell GEMM, so the gather
cannot be owned by the GEMM step.
The spatial-flat gather (step 1) is always done first — non-local PSP requires it regardless of spin mode. The H_cell GEMM (step 5) then has its own if/else.
1. spatialFlatGather(X.data(), S*N, cellLocalIds, totalCellDofs, xCellValues_flat)
→ xCellValues_flat BLAS S*N × d_k
[always spatial-flat: C is spatial, S*N cols treated independently]
2. applyCconjtransOnX(cellRange, xCellValues_flat)
→ accumulates d_CX (S*N × numProj) [no change]
[loop over cell blocks complete]
3. applyAllReduceOnCconjtransX() [no change]
4. applyVOnCconjtransX() [no change for non-SOC]
[second cell loop]
5. H_cell GEMM — own if/else on d_spinMode:
if (d_spinMode == SpinMode::Collinear):
reuse xCellValues_flat from step 1 (lda=S*N offset per spin block)
ONE gemmStridedVarBatched(S×numCells); m=N, n=d_k, k=d_k
else (Unpolarized / NonCollinear):
NonCollinear: additional spinAwareGather → xCellValues_spin BLAS N × S*d_k
Unpolarized: S=1 → spatial-flat = spin-aware; reuse xCellValues_flat directly
ONE gemmStridedVarBatched(numCells); m=N, n=S*d_k, k=S*d_k
6. applyCOnVCconjtransX(cellRange, yCellValues_flat)
→ adds C*V*d_CX to yCellValues_flat [no change]
7. spatialFlatScatter(yCellValues_flat, S*N, cellLocalIds, totalCellDofs, Y.data())
[no change to kernel]
cellWiseGEMM always uses mSizes = numVecs. The projector C_{p,j} is purely
spatial — no spin index. Setting numVecs = S*N makes it treat all S*N spin-orbital
columns as independent vectors contracted against the same C. Spin channels never
couple through C or the diagonal V:
d_CX[p, s*N+n] = Σ_j C_{p,j} * X[DOF_j, s, n] // correct for all (s,n) simultaneously
V * d_CX → scalar multiply per projector // same V for all s (non-SOC)
Y[DOF_j, s, n] += Σ_p C_{p,j} * V_p * d_CX[p, s*N+n] // correct scatter
Only change in computeAxCellWiseOptimized: call reinitCX(S*N) instead of
reinitCX(numVecs) (where numVecs was previously N).
xCellValues_flat: cellBlockSize × S*N × d_max // same as current xCellValues with numVecs=S*N
yCellValues_flat: cellBlockSize × S*N × d_max // same as current yCellValues with numVecs=S*NNo additional buffers. The two existing scratch buffers are sufficient — the H_cell
GEMM writes its result directly into yCellValues_flat via strided submatrix access.
Only applyVOnCconjtransX needs modification. d_CX is restructured as
S × N × numProj and V becomes a S × S matrix per projector:
d_CX_out[s, n, p] = Σ_{s'} V_{s, s', p} * d_CX[s', n, p]
Everything else (applyCconjtransOnX, applyCOnVCconjtransX, gather/scatter) is
unchanged even for SOC.
Basis functions φ_j(r) carry no spin index — interpolate is a purely spatial operation.
copyFieldToCellWiseData is called with numComponents = X.getNumberComponents() = S*N,
treating all S*N orbital columns as independent. No dynamic_cast, no spin knowledge.
QuadratureValuesContainer allocated with Nflattened = S*N components at call site.
Works identically for unpolarized, collinear, and non-collinear — zero changes.
No new public overload. QuadratureValuesContainer already carries K components; passing a
K-component V_xc QVC produces K dofs×dofs output matrices in one call. Each
Hamiltonian component's getLocal() block-copies those K results into (S·dofs)².
Implementation change: K>1 throw removed; output resized to K × basisOverlapSize;
K loop added inside cell block loop (Phase 11).
DensityCalculator directly outputs ρ_total and M (§0.5 convention) via a single
extended function. No separate ρ↑/ρ↓ stage. One unified kernel handles all spin modes.
ncomp = number of density components = 1 (unpolarized), 2 (collinear), 4 (non-collinear).
N = numOrbitals (numVectorsPerSpace), S = numSpaces.
psiBatchQuad layout: [S*N spin-orbital columns][numQuadInBlock quad points]
(i.e. spin-orbital index is the batch index; existing layout, unchanged).
rhoBatch layout: [numQuadInBlock quad points][ncomp] ← new: ncomp fastest.
Replace the current single-output computeRhoInBatch with one that outputs ncomp
values per quad point:
template <typename ValueType, typename RealType, utils::MemorySpace memorySpace>
class DensityCalculatorKernels
{
public:
static void
computeRhoInBatch(
size_type batchSize, // N (orbitals per spin in this batch)
size_type numSpaces, // S
std::pair<size_type,size_type> cellRange,
const RealType* occupationInBatch, // S·batchSize values: occupationInBatch[s·batchSize + n] (unpolarized: batchSize values, s=0 only)
ValueType* psiBatchQuad, // [S*N][numQuadInBlock]
RealType* scratchBatchQuad, // [S*N][numQuadInBlock], existing modPsiSq scratch
std::shared_ptr<const QuadratureRuleContainer> quadRuleContainer,
RealType* rhoBatch, // [numQuadInBlock][ncomp]
SpinMode spinMode,
LinAlgOpContext<memorySpace>& linAlgOpContext);
};Per quad point q, per orbital n in batch — kernel accumulation:
Unpolarized (S=1, ncomp=1):
ψ = psiBatchQuad[n][q]
rhoBatch[q][0] += f[n] · |ψ|²
Collinear (S=2, ncomp=2):
ψ₀ = psiBatchQuad[n ][q] ← spin-0 of orbital n
ψ₁ = psiBatchQuad[N + n][q] ← spin-1 of orbital n
f₀ = occupation[n], f₁ = occupation[N + n] (different per spin)
rhoBatch[q][0] += f₀·|ψ₀|² + f₁·|ψ₁|² ← ρ_total
rhoBatch[q][1] += f₀·|ψ₀|² − f₁·|ψ₁|² ← Mz
Non-collinear (S=2, ncomp=4):
ψ₀ = psiBatchQuad[n ][q], ψ₁ = psiBatchQuad[N + n][q]
f = occupation[n] (f[n] == f[N+n])
rhoBatch[q][0] += f·(|ψ₀|² + |ψ₁|²) ← ρ_total
rhoBatch[q][1] += f·2·Re[conj(ψ₀)·ψ₁] ← Mx
rhoBatch[q][2] += f·(−2)·Im[conj(ψ₀)·ψ₁] ← My
rhoBatch[q][3] += f·(|ψ₀|² − |ψ₁|²) ← Mz
scratchBatchQuad (existing modPsiSqBatchQuad) is reused for intermediate |ψ|²
and cross-term intermediates — no additional scratch pointer.
occupation for collinear has size S*N; kernel reads occupation[n] and occupation[N+n].
occupation for non-collinear has size S*N but f[n] == f[N+n]; kernel reads occupation[n].
Unpolarized: occupation size N; kernel reads occupation[n] (existing behaviour).
DEVICE partial specialisation follows the same interface — one specialisation covers all modes.
Signature change (single function, no overload):
// Before:
void computeRho(const std::vector<RealType>& occupation,
const MultiVector<...>& waveFunc,
QVC& rho); // 1 component per quad point
// After:
void computeRho(const std::vector<RealType>& occupation, // size S*N
const MultiVector<...>& waveFunc, // S*N components
QVC& rho, // ncomp components per quad point
SpinMode spinMode);Output rho has ncomp components per quad point — flat index q*ncomp + comp.
The caller (RDM1FE) pre-allocates rho with ncomp components.
Private scratch members that change:
// Before:
QVC* d_rhoMemspace; // 1 component per quad
MemoryStorage<RealType, memorySpace>* d_rhoBatch; // flat per-batch accumulator
// After:
QVC* d_rhoMemspace; // ncomp components per quad (resized on demand)
MemoryStorage<RealType, memorySpace>* d_rhoBatch; // [numQuadInBlock × ncomp] per batchd_rhoMemspace and d_rhoBatch are resized to ncomp × numQuadPoints on first call
with a given SpinMode. d_modPsiSqBatchQuad (existing scratch) is reused for the
cross-term intermediates in non-collinear — size S*N × numQuadInBlock covers both.
The outer cell-block loop is unchanged. The psi-batch loop changes for spin cases:
-
Unpolarized (S=1): unchanged — loop steps 0..N in increments of
waveFuncBatchSize; onestridedBlockCopyper batch;psiBatchInterimhaswaveFuncBatchSizecolumns. -
Collinear / Non-collinear (S=2): the batch unit is N orbitals per spin (=
waveFuncBatchSize / S). Per iteration,psiBatchInterim(S·N columns) is populated by twostridedBlockCopycalls — one for spin-0 columns[psiStartId, psiEndId)intopsiBatchInterimcolumns0..N-1, and one for spin-1 columns[N+psiStartId, N+psiEndId)intopsiBatchInterimcolumnsN..2N-1. The loop then runs frompsiStartId = 0toN(notS*N) so each iteration processes the same N orbital indices for both spins simultaneously.computeRhoInBatchis called once per psi-batch withbatchSize = NandnumSpaces = S.
The axpy accumulation into d_rhoMemspace operates on ncomp × numQuadInBlock elements instead of numQuadInBlock — no further structural change.
RDM1FE currently allocates densVal as AttrStorage(ncomp, QVC(quadRuleContainer, 1, 0)) —
a vector of ncomp QVCs each with 1 component per quad point.
After calling computeRho (which returns a single ncomp-component QVC), copy each
component into densVal[k]:
QVC rhoCombined(quadRuleContainer, ncomp, 0.0);
d_densCalc->computeRho(occ, *this->d_ksOrbs, rhoCombined, spinMode);
for (size_type k = 0; k < ncomp; ++k)
// copy component k of rhoCombined into densVal[k] (1-component QVC)
std::copy(rhoCombined.begin() + k,
rhoCombined.end(), /* stride ncomp */
densVal[k].begin());This keeps the existing densVal vector-of-QVCs interface that ExchangeCorrelationFE
already consumes — no change downstream.
GGA: same extension — computeRho also fills gradient components when requested,
outputting a 3·ncomp-component QVC (3 Cartesian × ncomp density observables).
| File | Change |
|---|---|
DensityCalculatorKernels.h |
computeRhoInBatch: add numSpaces, SpinMode params; rhoBatch becomes [numQuadInBlock][ncomp]
|
DensityCalculator.h |
computeRho: QVC& rho → QVC& rho with ncomp + SpinMode; d_rhoMemspace and d_rhoBatch resized to ncomp |
DensityCalculator.t.cpp |
Single psi-batch loop unchanged; computeRhoInBatch called with SpinMode; axpy on ncomp×numQuad |
RDM1FE.t.cpp |
getDescriptors: pass spinMode; split ncomp-QVC output into densVal vector |
ChebyshevFilteredEigenSolver is not modified. It calls ChebyshevFilter then
RayleighRitzEigenSolver::solve(MultiVector&) — the existing signature. Because the
static type at that call site is MultiVector&, compile-time overload resolution cannot
select derived-type overloads. Dispatch must therefore happen at runtime via
dynamic_cast inside RayleighRitzEigenSolver::solve.
KohnShamEigenSolver::solve(H, BInv, X, eigenValues, ...)
→ ChebyshevFilteredEigenSolver::solve(H, BInv, X, ...) // unchanged, X seen as MultiVector&
→ ChebyshevFilter(H, BInv, X, ...) // unchanged, Nflattened cols
→ RayleighRitzEigenSolver::solve(H, eigenValues, X, ...)
// same MultiVector& signature; dynamic_cast inside selects path:
// MultiVectorProductSpaceBlocked* → blocked/collinear ELPA loop
// MultiVectorProductSpace* → coupled (unpolarized S=1 + non-collinear S=2)
The existing solve(const OpContext&, std::vector<RealType>&, MultiVector<VT,MS>&, bool)
signature is kept unchanged. No private helper methods (d_solveBlocked,
d_solveCoupled) are needed — after the dynamic_cast, the correctly-typed reference
is passed directly to MultiVectorOps static functions, which pick the right
specialisation at compile time. The branches are inline blocks inside solve:
EigenSolverError solve(const OpContext& A, std::vector<RealType>& eigenValues,
MultiVector<VT,MS>& X, bool computeEigenVectors)
{
// Check most-derived first (Blocked ⊂ ProductSpace ⊂ MultiVector)
if (auto* Xb = dynamic_cast<MultiVectorProductSpaceBlocked<VT,MS>*>(&X)) {
// compiler picks Blocked specialisation — ELPA loop over s=0..S-1
std::vector<ScaLAPACKMatrix<VT>> Ps(Xb->numSpaces(), ...);
MultiVectorOps::project(A, *Xb, Ps, *d_elpaScala,
*d_XinBatch, *d_XoutBatch, d_scratchSBlock,
d_eigenVecBatchSize);
for (size_type s = 0; s < Xb->numSpaces(); ++s) {
elpa_eigenvectors(..., Ps[s]..., eigenValues_s...);
// fill eigenValues[s*N + n]
}
if (computeEigenVectors)
MultiVectorOps::rotate(*Xb, Qs, *d_elpaScala,
d_scratchRotBlock, d_scratchRotMatBlock);
return success;
}
// Blocked check above covers collinear; all remaining callers are ProductSpace
// (unpolarized S=1 degenerates correctly — single ELPA, N×N projected matrix)
auto* Xp = dynamic_cast<MultiVectorProductSpace<VT,MS>*>(&X);
// (assert Xp != nullptr — no plain MultiVector callers exist in the new design)
ScaLAPACKMatrix<VT> P(...);
MultiVectorOps::project(A, *Xp, P, *d_elpaScala,
*d_XinBatch, *d_XoutBatch, d_scratchSBlock,
d_eigenVecBatchSize);
elpa_eigenvectors(..., P..., eigenValues...);
if (computeEigenVectors)
MultiVectorOps::rotate(*Xp, Q, *d_elpaScala,
d_scratchRotBlock, d_scratchRotMatBlock);
return success;
}dynamic_cast cost is negligible — called once per SCF RayleighRitz step.
No copy for spin-s block access (collinear): Inside MultiVectorOps::project
(Blocked specialisation), spin-s columns are at positions [s*N, (s+1)*N).
computeXTransOpX receives a raw pointer with explicit lda:
ValueType* Xs_ptr = Xb.data() + s * Xb.numVectorsPerSpace();
size_type lda = Xb.numVectors(); // = S*N, true row stride — NOT N
// computeXTransOpX(Xs_ptr, M, N, lda, ...) — no copy into contiguous tempproject, rotate, and orthonormalize are not written from scratch. They already exist
as private / internal functions, split across three files:
| Existing function | Lives in | What it does |
|---|---|---|
computeXTransOpX |
RayleighRitzEigenSolver.t.cpp |
project: X^H Op X, batched |
fillParallelOverlapMatrix |
ElpaScalapackOperations.t.cpp |
project overlap (no Op) |
subspaceRotation |
ElpaScalapackOperations.t.cpp |
rotate: X ← X·Q |
| orthonormalization body | OrthonormalizationFunctions.t.cpp |
orthonormalize |
Move these into MultiVectorOps as the canonical location. RayleighRitzEigenSolver,
OrthonormalizationFunctions, and the new blocked overloads all call the same
MultiVectorOps implementations — no logic duplication.
The one required modification to subspaceRotation and computeXTransOpX:
add an explicit lda parameter (defaulting to N). Currently both hardcode lda = N
(e.g., X + idof * N in subspaceRotation; numVec as the GEMM lda in
computeXTransOpX). With an explicit lda, the blocked overload passes lda = S*N
and accesses only the spin-s columns — no copy, same kernel.
Files to add:
src/linearAlgebra/MultiVectorOps.h
src/linearAlgebra/MultiVectorOps.t.cpp
Files modified (existing logic moved, not rewritten):
src/linearAlgebra/ElpaScalapackOperations.t.cpp
← subspaceRotation: add lda parameter (default N); body unchanged
← fillParallelOverlapMatrix: add lda parameter (default N); body unchanged
src/linearAlgebra/RayleighRitzEigenSolver.t.cpp
← computeXTransOpX: add lda parameter (default numVec); body unchanged
← solve(): replace internal calls with MultiVectorOps::project / rotate
src/linearAlgebra/OrthonormalizationFunctions.t.cpp
← body moved to MultiVectorOps::orthonormalize; thin wrapper remains
MultiVectorOps is a stateless class of static functions. All scratch is owned
by the caller (RayleighRitzEigenSolver already manages scratch via MultivectorScratch)
and passed in explicitly — no internal allocation, consistent with the blasLapack
pattern. Dispatch is compile-time via template specialisation on the multivector type.
dynamic_cast is NOT inside MultiVectorOps — it is done once in
RayleighRitzEigenSolver::solve to unlock the type; after that, the correctly-typed
reference is passed to the static functions.
class MultiVectorOps {
public:
// ---------- project ----------
// Blocked specialisation (collinear): S independent N×N projected matrices.
// Loop s=0..S-1: Xs_ptr = X.data()+s*N, lda=S*N → computeXTransOpX, no copy.
template <typename VTO, typename VTOp, utils::MemorySpace MS>
static void project(
const OperatorContext<VTO,VTOp,MS>& Op,
MultiVectorProductSpaceBlocked<VTOp,MS>& X,
std::vector<ScaLAPACKMatrix<scalar_type<VTO,VTOp>>>& Ps, // S N×N matrices
const ElpaScalapackManager& elpa,
MultiVector<VTOp,MS>& scratchXin,
MultiVector<VTOp,MS>& scratchXout,
MemoryStorage<scalar_type<VTO,VTOp>,MS>& scratchSBlock,
size_type batchSize = 0);
// Coupled specialisation (non-collinear / unpolarized): single S*N×S*N matrix.
// computeXTransOpX(X.data(), M, S*N, lda=S*N, Op, P) — no copy.
template <typename VTO, typename VTOp, utils::MemorySpace MS>
static void project(
const OperatorContext<VTO,VTOp,MS>& Op,
MultiVectorProductSpace<VTOp,MS>& X,
ScaLAPACKMatrix<scalar_type<VTO,VTOp>>& P, // single S*N×S*N matrix
const ElpaScalapackManager& elpa,
MultiVector<VTOp,MS>& scratchXin,
MultiVector<VTOp,MS>& scratchXout,
MemoryStorage<scalar_type<VTO,VTOp>,MS>& scratchSBlock,
size_type batchSize = 0);
// ---------- rotate ----------
// Blocked: loop s, subspaceRotation(X.data()+s*N, M, N, lda=S*N, Qs[s]).
// Scratch temp is M×N only, reused across s — S× smaller than coupled.
template <typename VT, utils::MemorySpace MS>
static void rotate(
MultiVectorProductSpaceBlocked<VT,MS>& X,
const std::vector<ScaLAPACKMatrix<VT>>& Qs, // S N×N matrices
const ElpaScalapackManager& elpa,
MemoryStorage<VT,MS>& scratchRotBlock, // M×N temp
MemoryStorage<VT,MS>& scratchRotMatBlock);
// Coupled: subspaceRotation(X.data(), M, S*N, lda=S*N, Q).
// One M×S*N scratch temp (unavoidable — in-place X=X*Q always needs output buffer).
template <typename VT, utils::MemorySpace MS>
static void rotate(
MultiVectorProductSpace<VT,MS>& X,
const ScaLAPACKMatrix<VT>& Q,
const ElpaScalapackManager& elpa,
MemoryStorage<VT,MS>& scratchRotBlock, // M×S*N temp
MemoryStorage<VT,MS>& scratchRotMatBlock);
// ---------- orthonormalize ----------
// Two specialisations (Blocked / ProductSpace). No base MultiVector overload —
// all callers hold a typed product-space object.
// Scratch: overlap ScaLAPACKMatrix + Cholesky temp — all passed in by caller.
template <typename VT, utils::MemorySpace MS>
static void orthonormalize(
MultiVectorProductSpaceBlocked<VT,MS>& X,
const ElpaScalapackManager& elpa,
/* scratch: */ ScaLAPACKMatrix<VT>& scratchOverlap,
MemoryStorage<VT,MS>& scratchSBlock);
template <typename VT, utils::MemorySpace MS>
static void orthonormalize(
MultiVectorProductSpace<VT,MS>& X,
const ElpaScalapackManager& elpa,
ScaLAPACKMatrix<VT>& scratchOverlap,
MemoryStorage<VT,MS>& scratchSBlock);
};Call flow from RayleighRitzEigenSolver:
// One conditional cast (Blocked); ProductSpace cast is unconditional — no plain MultiVector callers.
if (auto* Xb = dynamic_cast<MultiVectorProductSpaceBlocked<VT,MS>*>(&X)) {
// compiler picks Blocked specialisation at compile time
MultiVectorOps::project(Op, *Xb, Ps, elpa,
d_scratchXin, d_scratchXout, d_scratchSBlock);
MultiVectorOps::rotate (*Xb, Qs, elpa,
d_scratchRotBlock, d_scratchRotMatBlock);
return success;
}
// All remaining callers are MultiVectorProductSpace (unpolarized S=1 or non-collinear S=2)
auto* Xp = static_cast<MultiVectorProductSpace<VT,MS>*>(&X);
MultiVectorOps::project(Op, *Xp, P, elpa,
d_scratchXin, d_scratchXout, d_scratchSBlock);
MultiVectorOps::rotate (*Xp, Q, elpa,
d_scratchRotBlock, d_scratchRotMatBlock);RayleighRitzEigenSolver owns all scratch (as it already does via MultivectorScratch)
and passes it through. MultiVectorOps is entirely stateless — no allocations.
Every place where spin-block data is accessed is classified below.
Goal: no intermediate contiguous copy of a spin-s block; pointer arithmetic + explicit lda throughout.
| Location | Copy needed? | Reason / mechanism |
|---|---|---|
MultiVectorOps::project blocked |
No | Static specialisation; blocked loop uses Xs_ptr=data+s*N, lda=S*N — valid BLAS strided submatrix; scratch passed by caller |
MultiVectorOps::project coupled (ProductSpace) |
No | Full M×S*N buffer; S=1 degenerates to N×N; scratch passed by caller |
MultiVectorOps::rotate coupled (ProductSpace) |
One M×S*N temp (unavoidable) | In-place X=X*Q always needs output buffer; passed as scratchRotBlock
|
MultiVectorOps::rotate blocked |
One M×N temp, reused across s | S× smaller than coupled; same scratchRotBlock reused each s iteration |
| Orthonormalization (both specialisations) | No | Blocked: Xs_ptr + lda trick; all scratch passed by caller |
ChebyshevFilter scratch |
No |
eigenVectorBatchSize=S*N → batch IS the full spin-coupled multivector; stridedBlockCopy copies all S*N columns in one shot; H applied via d_spinMode switch in KohnShamOperatorContextFE::apply — no spin-block extraction, no type inspection of batch buffer |
computeAxCellWise collinear branch |
No |
gemmStridedVarBatched reads X_s via ptr+s*N, lda=S*N — pointer arithmetic within the cell buffer |
computeAxCellWise else-branch (unpolarized/non-collinear) |
No | Single GEMM on full N×S*d_k cell buffer; S=1 degenerates to existing |
Cell gather (vectorToCell) collinear |
Necessary layout transform | M×S×N → S*N×d_k (BLAS); spatial-flat, copyFieldToCellWiseData(Blocked)
|
Cell gather (vectorToCell) non-collinear + unpolarized |
Necessary layout transform | M×S×N → N×Sd_k (BLAS); spin-aware, copyFieldToCellWiseData(ProductSpace), virtualId=spatialIdS+s inline; S=1 degenerates to existing spatial gather |
Cell scatter (cellToVector) |
Necessary layout transform | Mirrors gather; typed addCellWiseDataToFieldData overload |
getLocal() spin-independent (KineticFE etc.) |
No intermediate |
dofs×dofs result placed into diagonal block via strided block-copy: dst_ptr = H_cell + s*dofs*(S*dofs+1), ldb=S*dofs
|
getLocal() XC non-collinear |
No intermediate | DOF-outer element scatter (stride S): element (i,j) of source block → H_cell + cellDstOffset + (j·S+s_out)·(S·dofs) + i·S+s_in; top-left of block (s_in,s_out) at H_cell + s_out·(S·dofs) + s_in (no dofs factor — DOF-outer, not spin-blocked) |
distributeParentToChild/Child |
No | Takes MultiVector&; blockSize=S*N; flat buffer — no spin-block extraction |
interpolate |
No | Works on full Nflattened=SN columns; each DOF's SN values contiguous |
LanczosExtremeEigenSolver |
No | Initial guess is MultiVectorProductSpace(S, N=1); Lanczos applies H to S-column multivector without spin-block extraction |
Rule encoded in MultiVectorOps blocked overloads:
// Spin-s block: pointer into X without copy.
// lda = X.numVectors() (= S*N), NOT X.numVectorsPerSpace() (= N).
const VT* Xs = X.data() + s * X.numVectorsPerSpace();
size_type lda = X.numVectors();Any future use of spin-s subblock data should follow this pattern.
KohnShamEigenSolver constructs the right multivector type based on SpinMode and
constructs ChebyshevFilteredEigenSolver with eigenVectorBatchSize = S*N for spin
cases. The type of X drives all downstream dispatch automatically.
// Unpolarized (S=1):
const size_type S = 1;
MultiVectorProductSpace<VT, MS> X(mpiPattern, ctx, /*numSpaces=*/S, N);
// batchSize = N for unpolarized — existing behaviour unchanged
auto chebyshevSolver = ChebyshevFilteredEigenSolver(..., /*eigenVectorBatchSize=*/N, ...);
chebyshevSolver.solve(H, BInv, X, eigenValues, ...); // → RayleighRitz coupled overload
// NonCollinear (S=2):
const size_type S = 2;
MultiVectorProductSpace<VT, MS> X(mpiPattern, ctx, /*numSpaces=*/S, N);
// batchSize = S*N: full spin-coupled vector processed at once by ChebyshevFilter
auto chebyshevSolver = ChebyshevFilteredEigenSolver(..., /*eigenVectorBatchSize=*/S*N, ...);
chebyshevSolver.solve(H, BInv, X, eigenValues, ...); // → RayleighRitz coupled overload
// Collinear (S=2):
const size_type S = 2;
MultiVectorProductSpaceBlocked<VT, MS> X(mpiPattern, ctx, /*numSpaces=*/S, N);
// batchSize = S*N: same reason — operator must see both spin channels to select H_ss block
auto chebyshevSolver = ChebyshevFilteredEigenSolver(..., /*eigenVectorBatchSize=*/S*N, ...);
chebyshevSolver.solve(H, BInv, X, eigenValues, ...); // → RayleighRitz blocked overload (ELPA loop)ChebyshevFilter calls A.apply(X_batch, Y_batch) where X_batch is the batch buffer
(d_XinBatch), a plain MultiVector allocated with eigenVectorBatchSize columns.
stridedBlockCopy itself is mechanically correct — it copies the right columns.
The problem is downstream:
-
Non-collinear: H has off-diagonal spin blocks coupling spin-0 and spin-1 columns.
A batch with fewer than
S*Ncolumns is missing one spin channel → filtered vector is wrong. -
Collinear: H is block-diagonal.
d_XinBatchis a plainMultiVector— butKohnShamOperatorContextFE::applydispatches viad_spinMode, not by inspecting X. WitheigenVectorBatchSize = S*Nthe full spin-coupled vector is always passed, so the collinear path correctly applies both H_00 and H_11 blocks.
Setting eigenVectorBatchSize = S*N makes the "batch" exactly the full spin-coupled
multivector — stridedBlockCopy copies all S*N columns in one shot (trivially correct),
and A.apply receives a vector where the spin structure is preserved.
ChebyshevFilteredEigenSolver is not modified. Only KohnShamEigenSolver changes
the eigenVectorBatchSize argument passed at construction time.
No spin logic, no explicit ELPA calls, no MultiVectorOps calls in KohnShamEigenSolver
— everything is driven by the multivector type.
| Aspect | Unpolarized | Collinear | NonCollinear |
|---|---|---|---|
| MultiVector type |
MultiVectorProductSpace (S=1) |
MultiVectorProductSpaceBlocked (S=2) |
MultiVectorProductSpace (S=2) |
| H buffer shape | numCells × dofs² |
numCells × (S·dofs)² (off-diag zero) |
numCells × (S·dofs)² (full) |
computeFEMatrices calls per component |
1 (K=1 QVC) | 1 (K=2 QVC) | 1 (K=4 QVC) |
| Cell gather format (BLAS) |
N × d_k (spin-aware S=1 degenerate) |
S*N × d_k (spatial-flat) |
N × S*d_k (spin-aware, virtual DOF i*S+s) |
| Cell GEMM path | else-branch of computeAxCellWise (numCells batched) |
if-branch of computeAxCellWise (S×numCells batched) |
else-branch of computeAxCellWise (numCells batched) |
| FLOPs per cell (S=2) |
N·dofs² (S=1 degenerate) |
2·N·dofs² (diag blocks only) |
4·N·dofs² (full H) |
Dispatch in apply
|
d_spinMode if/else (else-branch) |
d_spinMode if/else (if-branch) |
d_spinMode if/else (else-branch) |
| Dispatch in RayleighRitz |
dynamic_cast on eigenVectors
|
dynamic_cast on eigenVectors
|
dynamic_cast on eigenVectors
|
interpolate |
spin-agnostic, numComponents=S*N | spin-agnostic, numComponents=S*N | spin-agnostic, numComponents=S*N |
| ELPA calls per SCF | 1 | S (loop) | 1 |
src/linearAlgebra/
MultiVectorProductSpace.h
MultiVectorProductSpace.t.cpp
MultiVectorProductSpaceBlocked.h
MultiVectorProductSpaceBlocked.t.cpp
MultiVectorOps.h
MultiVectorOps.t.cpp
No MultiVectorOpsKernels needed — MultiVectorOps is a thin static wrapper over
existing functions (computeXTransOpX, subspaceRotation, existing blasLapack
routines) with only the lda parameter addition. New kernels are in
FECellWiseDataOperations (gather/scatter) and Hamiltonian components (getLocal
block-copy), not in linearAlgebra/.
src/basis/FEBasisOperations.h / .t.cpp
← computeFEMatrices: unchanged; QVC numComponents=K already encodes spin blocks;
each Hamiltonian component's getLocal() passes K-component QVC and block-copies
the K dofs×dofs outputs into (S·dofs)² (no new overload needed)
src/basis/FECellWiseDataOperations.h / .cpp / DeviceKernels
← raw pointer overloads of copyFieldToCellWiseData / addCellWiseDataToFieldData REMOVED
← two typed overloads added as clean public API:
MultiVectorProductSpaceBlocked (spatial-flat, numComponents=S*N),
MultiVectorProductSpace (spin-aware, numComponents=N, virtualId=spatialId*S+s inline;
S=1 unpolarized degenerates to existing spatial gather)
← no base MultiVector overload — all callers hold a typed product-space object
← gather/scatter primitives only (typed public overloads + internal spatialFlatGather /
spinAwareGather helpers); no computeAxCellWise* functions here
src/ksdft/KohnShamOperatorContextFE.h / .t.cpp
← SpinMode enum; stores d_spinMode, d_S, d_N at construction
← unified numCells×(S·dofs)² H buffer
← reinit: zero-fills H buffer then axpy-accumulates fully-assembled numCells×(S·dofs)²
buffers returned by each component's getLocal() (block-copy lives inside getLocal())
← apply calls single private computeAxCellWise(X, Y); one if/else on d_spinMode:
if Collinear → spatial-flat gather + S×numCells GEMM (diag H blocks)
else → spin-aware gather + numCells GEMM (full H; S=1 degenerates to existing)
← no named spin forks, no isBlockDiagonal parameter anywhere
src/ksdft/KohnShamEigenSolver.h / .t.cpp
← constructs typed product-space multivector per SpinMode:
unpolarized → MultiVectorProductSpace(S=1) → RayleighRitz ProductSpace overload (single ELPA, degenerates to N×N)
collinear → MultiVectorProductSpaceBlocked(S=2) → RayleighRitz Blocked overload (ELPA loop)
non-collinear → MultiVectorProductSpace(S=2) → RayleighRitz ProductSpace overload (single ELPA, S*N×S*N)
← sets eigenVectorBatchSize=S*N for spin cases (=N for unpolarized)
src/ksdft/KohnShamDFT.h / .t.cpp
← SpinMode propagation
src/ksdft/KineticFE.h / .t.cpp
← constructed with SpinMode; getLocal() returns numCells×(S·dofs)²
← computes dofs×dofs once (cached), copies into diagonal blocks; off-diagonal zero
src/ksdft/ElectrostaticLocalFE.h / .t.cpp
src/ksdft/ElectrostaticExcFE.h / .t.cpp
← same pattern as KineticFE (spin-independent; single compute, diagonal copy)
src/ksdft/ExchangeCorrelationFE.h / .t.cpp
← constructed with SpinMode; getLocal() returns numCells×(S·dofs)²
← Collinear: S diagonal blocks from per-spin XC potentials
← NonCollinear: all S² blocks from full spin-matrix XC potential
← per-space density inputs to XC functional
src/ksdft/DensityCalculatorKernels.h
← computeRhoInBatch: add numSpaces (S) + SpinMode params; rhoBatch becomes [numQuadInBlock][ncomp];
kernel accumulates ρ_total/Mz (collinear) or ρ_total/Mx/My/Mz (non-collinear) directly;
DEVICE partial specialisation follows same interface
src/ksdft/DensityCalculator.h / .t.cpp
← computeRho: add SpinMode param; QVC& rho gains ncomp components per quad;
d_rhoMemspace and d_rhoBatch resized to ncomp×numQuad;
psi-batch loop: two stridedBlockCopy calls for spin cases (§6.3)
src/ksdft/RDM1FE.h / .t.cpp
← getDescriptors: allocate rhoCombined QVC with ncomp components; call computeRho once;
stride-ncomp copy of component k into densVal[k] to preserve vector-of-1-QVC interface
src/linearAlgebra/MultiVector.* (base class, not modified)
src/linearAlgebra/ChebyshevFilteredEigenSolver.* (unchanged; caller controls batchSize)
src/linearAlgebra/ChebyshevFilter.*
src/basis/FEBasisOperations::interpolate
src/basis/EFEConstraintsLocalDealii.* (see §11.1 below)
src/basis/CFEConstraintsLocalDealii.*
src/basis/ConstraintsInternal.*
EFEConstraintsLocalDealii::distributeParentToChild and distributeChildToParent both have
the signature:
void distributeParentToChild(
linearAlgebra::MultiVector<ValueTypeBasisCoeff, memorySpace>& vectorData,
size_type blockSize) const;Since MultiVectorProductSpace IS-A MultiVector (public inheritance), the product-space
multivectors bind to this parameter directly — no overload or template specialisation needed.
The only change is at the call site (inside KohnShamOperatorContextFE or wherever
constraints are applied): pass blockSize = Nflattened = S * numVectorsPerSpace instead of the scalar numVectorsPerSpace.
Why this is correct:
In the M × S × N memory layout, all S*N values for DOF i sit contiguously at byte
offset i * S * N * sizeof(ValueType). The ConstraintsInternal kernel processes each
constrained DOF by copying blockSize contiguous values from the master DOF, weighted by
the constraint coefficient. With blockSize = S*N it applies the same geometric constraint
coefficient to every component at that DOF — which is physically correct because FE
hanging-node / periodicity constraints are purely spatial.
Constrained DOF i → master DOFs j₀, j₁, … with coefficients c₀, c₁, …
For each master DOF jₖ:
for sn in 0..S*N-1: // blockSize = S*N iterations, handled inside kernel
data[i*S*N + sn] += cₖ * data[jₖ*S*N + sn]
This degenerates to the existing scalar case when S=1 (same blockSize = numVectorsPerSpace).
Summary: no changes to EFEConstraintsLocalDealii, CFEConstraintsLocalDealii, or
ConstraintsInternal. Only the caller changes blockSize from N to S*N.
src/linearAlgebra/LanczosExtremeEigenSolver.h / .t.cpp
← d_initialGuess: Vector → MultiVector
← constructor/reinit parameter: const Vector& → const MultiVector&
← existing Vector callers unaffected (Vector IS-A MultiVector, implicit upcast)
← spin callers pass MultiVectorProductSpace(mpiPattern, ctx, S, /*numVectorsPerSpace=*/1) as initial guess
src/linearAlgebra/RayleighRitzEigenSolver.h / .t.cpp
← existing solve(MultiVector&) extended: one if dynamic_cast to Blocked + unconditional
ProductSpace cast (no plain MultiVector path); no new public overloads, no private helpers
← Blocked branch: MultiVectorOps::project/rotate Blocked overloads (ELPA loop)
← ProductSpace path: MultiVectorOps::project/rotate ProductSpace overloads (single ELPA;
S=1 unpolarized degenerates to N×N projected matrix, identical to old scalar path)
← computeXTransOpX: add lda parameter (default = numVec); body unchanged
src/linearAlgebra/ElpaScalapackOperations.h / .t.cpp
← subspaceRotation: add lda parameter (default = N); body unchanged — lda replaces hardcoded N in X+idof*N
← fillParallelOverlapMatrix: add lda parameter (default = N); body unchanged
src/linearAlgebra/OrthonormalizationFunctions.h / .t.cpp
← body of orthonormalization moved to MultiVectorOps::orthonormalize (two overloads)
← thin wrapper remains calling MultiVectorOps::orthonormalize with lda = X.numVectors()
← no base MultiVector overload in MultiVectorOps::orthonormalize
| Phase | Work |
|---|---|
| 1 |
MultiVectorProductSpace — derives from MultiVector; constructor, numSpaces(), numVectorsPerSpace(). |
| 2 |
MultiVectorProductSpaceBlocked — derives from MultiVectorProductSpace; inherits constructor via using. |
| 3 |
MultiVectorOps: static class wrapping computeXTransOpX / subspaceRotation with lda param; two specialisations per function (Blocked / ProductSpace). |
| 4 |
MultiVectorOps::project — coupled overload (single SN×SN) and blocked overload (S independent N×N). |
| 5 |
MultiVectorOps::rotate — coupled overload (same Q) and blocked overload (S independent Q_s). |
| 6 |
FECellWiseDataOperations: remove raw pointer overloads; add two typed overloads of copyFieldToCellWiseData / addCellWiseDataToFieldData (MultiVectorProductSpaceBlocked spatial-flat; MultiVectorProductSpace spin-aware with inline virtual DOF spatialId*S+s, S=1 degenerates to unpolarized). No base MultiVector overload. |
| 7 |
KohnShamOperatorContextFE — SpinMode enum; store d_spinMode, d_S, d_N at construction; unified numCells×(S·dofs)² H buffer; reinit: zero H buffer + axpy over each component's getLocal() (block-copy into (S·dofs)² lives inside each component's getLocal()); private computeAxCellWise with one if/else on d_spinMode (collinear: spatial-flat gather + S×numCells GEMM; else: spin-aware gather + numCells GEMM, S=1 degenerates to existing); private computeAxCellWiseOptimized with own if/else on d_spinMode. |
| 8 |
RayleighRitzEigenSolver — extend existing solve(MultiVector&) with two inline dynamic_cast branches; Blocked → MultiVectorOps::project/rotate blocked overloads (ELPA loop); ProductSpace → coupled overloads (single ELPA). |
| 9 |
LanczosExtremeEigenSolver — change d_initialGuess and constructor/reinit param from Vector to MultiVector. |
| 10 |
DensityCalculatorKernels.h: add numSpaces (S) and SpinMode params to computeRhoInBatch; rhoBatch layout becomes [numQuadInBlock][ncomp]; kernel accumulates ρ_total/Mz (collinear) or ρ_total/Mx/My/Mz (non-collinear) directly — no ρ↑/ρ↓ stage. DEVICE partial specialisation follows same interface. DensityCalculator.h/.t.cpp: computeRho gains SpinMode param and QVC& rho gains ncomp components; d_rhoMemspace and d_rhoBatch resized to ncomp × numQuad; axpy accumulation operates on ncomp × numQuadInBlock elements; cell-block/psi-batch loop structure unchanged. RDM1FE.t.cpp: getDescriptors allocates QVC rhoCombined(quadRuleContainer, ncomp, 0), calls computeRho once, then copies component k into densVal[k] (stride-ncomp copy) to preserve the existing vector-of-1-component-QVCs interface consumed by ExchangeCorrelationFE. |
| 11 |
computeFEMatrices K>1 extension: remove throw; resize output to K × basisOverlapSize; add K loop inside cell block loop; add buildAllComponentFxJxW device kernel. GGA: m=dim in scaleStridedVarBatched, k=Q*dim in GEMM, use gradBasis. |
| 12 |
copyIntoBlockSpinBlocked kernel: collinear lda-change copy (src lda=d → dst lda=S*d), diagonal block (k,k) at (k*d+j)*(S*d)+k*d. copyIntoBlockDOFOuter kernel: non-collinear element scatter at stride S, block (k/S, k%S) at (j*S+s_out)*(S*d)+i*S+s_in. |
| 13 |
ExchangeCorrelationFE::getLocal(): add d_xcCellWiseTemp storage member (K × Σdofs²); single computeFEMatrices call + K copyIntoBlock calls; dispatch on K==S (collinear) vs K==S*S (non-collinear). |
-
Uniform
numOrbitalsper space. Both classes assume the sameNfor every space index. DifferentN_sper block would require a partitioned layout. -
ElpaScalapackManagersharing. AllMultiVectorOps::projectblocked-overload calls produceN×Nmatrices of equal size; they share oneProcessGrid. -
Orthogonalisation for product spaces. The Cholesky-GS overlap matrix (B overlap) also needs product-space-aware projection.
MultiVectorOps::project(blocked and ProductSpace overloads) handle B as well as H — just pass B as the operator. -
Runtime dispatch in
apply— resolved.d_spinMode(enum stored inKohnShamOperatorContextFE) is the sole dispatch mechanism. NoisBlockDiagonalflag. No compile-time template branching —computeAxCellWiseis a single private method with one if/else ond_spinMode. -
H block-copy helper in
getLocal(). Each Hamiltonian component'sgetLocal()callscomputeFEMatricesto producedofs×dofsblocks and places them into the(S·dofs)×(S·dofs)per-cell buffer via a block-copy or element-scatter kernel:- Spin-independent (KineticFE etc.) and collinear XC: lda-change strided copy to
diagonal block
(k·dofs, k·dofs)(spin-blocked offsets). - Non-collinear XC: element scatter with stride S to DOF-outer block
(top-left at
(s_out·(S·dofs)+s_in), no dofs factor — see §4b).getLocal()returns the fully assemblednumCells×(S·dofs)²buffer.KohnShamOperatorContextFE::reinitonly accumulates these buffers withaxpy.
- Spin-independent (KineticFE etc.) and collinear XC: lda-change strided copy to
diagonal block