Skip to content

Sync master with molresponse-feature-next (v3 development trunk)#7

Merged
ahurta92 merged 188 commits into
masterfrom
molresponse-feature-next
Jun 18, 2026
Merged

Sync master with molresponse-feature-next (v3 development trunk)#7
ahurta92 merged 188 commits into
masterfrom
molresponse-feature-next

Conversation

@ahurta92

Copy link
Copy Markdown
Owner

Bring master up to molresponse-feature-next, the de-facto v3 development
trunk. master was 168 commits behind (its only unique commits were
content-free merge bubbles), so this is a clean catch-up — no divergent work is lost.

Release-sync step of a branch-organization cleanup that also split the former
grab-bag madqc-driver-improvements into one-concern-per-thread branches off
this trunk: exchange, parallel-runtime, madqc-refactor, fix/scf-do-plots,
feat/htg-writer.

🤖 Generated with Claude Code

evaleev and others added 30 commits April 30, 2026 08:17
…at faces/edges/vertices are considered "in"/neighbors of the respective boxes
…pecial-point-at-box-boundaries

amended handling of special points at box boundaries
…rving CLI paths

Several related improvements landed together this session.

ES solver (Stage 1: TDA RHF):
- Add ESSolver.hpp with the bundle iteration loop (orthonormalize,
  Lambda build, subspace diagonalize, rotate, BSH per root, step
  restriction). Includes a 5-iter guess-refinement pre-phase using
  2N trial states, mirroring molresponse_legacy::iterate_guess.
- Add ESSolverGuess.hpp with random-Gaussian-envelope initial guess
  (mirrors molresponse_legacy::create_random_guess).
- Add ResponseKernel.hpp helpers: apply_kinetic, compute_lambda_subspace
  (full Lambda with kinetic and full Fock - distinct from the existing
  Theta-equivalent compute_lambda used for the BSH driver),
  orthonormalize_bundle, rotate_bundle, rotate_vector_bundle,
  bundle_inner, build_es_theta (with shift compensation).
- Add test_es_solver.cpp validation harness for H2 TDA-RHF that
  compares first-N omega against the legacy reference table.
- Add docs/11_excited_state_solver_design.md staged plan
  (TDA RHF -> Full RHF -> TDA UHF -> Full UHF). Stage 1 runs end-to-end
  but eigenvalues are not yet correct; debugging is in progress.

Test driver (test_v3_solver):
- Protocol ramping: walk [1e-4 -> 1e-6 -> ...] up to the archive's
  target thresh, warm-starting each step from the previous.
  ResponseProtocol.hpp adds set_response_protocol,
  default_thresh_for_k / default_k_for_thresh, build_protocol_ramp,
  parse_protocol_csv.
- Spherical-atom shortcut: single-atom systems solve only z and
  propagate to alpha_xx = alpha_yy = alpha_zz.
- Update Li (UHF) reference to ETB-converged HF basis-set limit
  (170.121 / 186.510 / 262.710), with a comment pointing at
  ~/Projects/li_polarizability_study/. Prior 170.020 reference was
  unconverged (Phys Rev A 91, 022501 - 1-D FD code).
- Tighten H2 fixture protocol [1e-4] -> [1e-4, 1e-6].
- Tighten Li fixture l=20 -> 200 and protocol -> [1e-4, 1e-6, 1e-8].
- Make GroundState::read_archive_header public so the test driver
  can inspect the archive header before set_response_protocol.

Case-preserving CLI paths:
- commandlineparser gets a parallel keyval_raw map and value_raw(key)
  accessor. Use for paths/free-form strings; existing value(key)
  keeps lowercasing for chemistry-style enum knobs (--xc=HF, --localize=BOYS).
- Bare-positional input form (`app /Mixed/Path/in.in`) also preserves
  case via the existing if (key=="file"||"input") branch.
- test_solver / test_es_solver use parser.value_raw("archive") and
  parser.value_raw("molecule") for path arguments.
- GroundState.cpp adds a workaround for QCCalculationParametersBase's
  tostring<std::string> which lowercases values: re-set the prefix
  parameter directly via the QCParameter struct so SCF::load_mos gets
  the right case. Phase 2 (param-aware case-folding inside
  QCCalculationParametersBase) is deferred.

Legacy molresponse build patches:
- ground_parameters.h: pre-set FunctionDefaults<3>::set_cubic_cell
  from the archive's L right after reading L, before orbital
  reconstructs. Newer MADNESS Function::load() rejects cell mismatch,
  which broke molresponse_legacy startup.
- src/madness/chem/QCCalculationParametersBase.h: symlink to
  ../mra/QCCalculationParametersBase.h. Legacy includes
  <chem/QCCalculationParametersBase.h>; the file is in mra/ now, so
  the symlink lets the legacy compile without modifying legacy source.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The v1 production molresponse had been bit-rotting: it didn't compile, and
once compiled, the excited-state TDA path crashed in three different places
before producing eigenvalues. This commit gets v1 building and converging on
the H2/4-root TDA fixture to within 5e-5 of the legacy reference values
(0.46802, 0.47785, 0.48134, 0.48134), with restart from a saved checkpoint
also working.

Buildability:
- Re-enable molresponse subdir (src/apps/CMakeLists.txt).
- Rename Exchange<double,3>::Algorithm -> ExchangeAlgorithm (60 sites in 4
  files) to match upstream.
- Add ResponseParameters::get_tag() override.
- Drop a stale omega arg from FrequencyResponse construction in molresponse.cc.

Excited-state TDA correctness:
- ground_parameters.h: set_cubic_cell(-L,L) before reading orbitals from
  archive (Function::load was rejecting cell mismatch).
- ResponseBase::solve: call check_k (which builds `hamiltonian`) before
  initialize(), since the guess-iteration's response-potential build needs
  it; ordering bug that left the tensor empty during iterate_trial.
- response_space(World,m,n) ctor: fill slots with zero functions rather than
  null-impl Functions.
- response_space::push_back / pop_back: keep `active` in sync with
  num_states. transform() and similar call push_back repeatedly; without
  this, to_vector/from_vector silently produced zero-length payloads and
  the eigenvalues stalled at the wrong place.
- ExcitedResponse::initialize: rebuild Chi from scratch after select_functions
  so n_states matches the trimmed x/y.
- ExcitedResponse ctor: set inner-product / J1 / K1 / VXC strategies (was
  missing entirely; FrequencyResponse does this in its ctor).
- ResponseBase::update_residual: handle empty old_residuals tensor on first
  iteration; force canonical active list to bypass any remaining stale-active
  paths.

Validation:
- src/apps/molresponse_v3/tests/fixtures/systems/h2_hf/H2_EXCITED_REPORT.md
  documents the run matrix (cold/restart for both v1 and legacy, plus
  legacy RPA), the fixes summarized above, and the v1 RPA failure mode
  (heap-corruption SIGSEGV inside MacroTaskExchangeSimple K[1] in iter 0 —
  next task).

Also includes earlier in-progress molresponse_v3 ES solver work: switch
default guess to solid harmonics × orbital (mirrors legacy
create_trial_functions), and the kernel/guess refactor that converges H2
toward the legacy reference but still has a residual gap.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
J1StrategyFull and J1StrategyStable in ResponseBase.hpp allocated
`vector_real_function_3d temp_J(3)` — a buffer sized for the 3 Cartesian
directions of FrequencyResponse's dipole case. Excited-state runs with
num_states != 3 wrote past the end and corrupted the heap; the resulting
SIGSEGV manifested in whichever allocator next touched the broken arena
(MacroTaskExchangeSimple's K[1] tile under multiworld, J[1]'s shared_ptr
release under largemem).

After the fix, v1 RPA converges on H2/4-root and matches legacy RPA to
≤ 5e-7 on the lowest three roots, with ω₄ within 3e-6 (both runs hit
maxiter before the strict bsh-residual target — the values are correct,
the convergence test is just stricter than the achieved residual).

Updates the H2_EXCITED_REPORT.md with the v1 RPA result row, the agreement
analysis, and the temp_J(3) root-cause writeup.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…4/C6H6

Adds a benchmark suite under tests/fixtures/benchmarks/ to measure the
rotate-vs-recompute gap between v1 (rotates gamma through subspace
diagonalizations) and legacy (recomputes per cycle). Each system has a
SLURM script that:

  1. Runs moldft once to converge the ground state.
  2. Runs four ES configs in sequence: legacy-TDA, legacy-RPA, v1-TDA,
     v1-RPA — all sharing the same archive.
  3. Captures wall time, iteration count, and final omegas into
     timing.csv with a stable schema.

Per-system tuning (from CLAUDE.md memory budget data):
  - h2o   (n_occ=5):  1 node × 8 tasks/node × 4 cpus/task,  2 hr
  - c2h4  (n_occ=8):  2 nodes × 8 tasks/node × 4 cpus/task, 6 hr
  - c6h6  (n_occ=21): 4 nodes × 4 tasks/node × 8 cpus/task, 24 hr
    (4 tasks/node — at 8 it OOMs on Xeon Max per RUNLOG)

C2H4 fixture is new (geometry, moldft.in, system.json) following the same
shape as h2o_hf and c6h6_hf. H2O and C6H6 fixtures already existed.

Submission edits the SLURM template needs (partition, account, optional
module loads, optional LAUNCHER if `srun` doesn't bind right) are
documented in benchmarks/README.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch from generic Slurm defaults to the user's actual Seawulf setup:
account=ahurtado, partition=hbm-short-96core, source ~/load_xeonmax.sh
for module setup, MAD_NUM_THREADS hardcoded per-system to leave headroom
on 96-core nodes. Drop --cpus-per-task and OMP_NUM_THREADS — explicit
MAD_NUM_THREADS is what madness reads.

Default launcher is now `mpirun --map-by numa numactl --preferred-many=8-15`
(bind by NUMA, prefer the HBM tier on Xeon Max), with `LAUNCHER=srun
sbatch benchmark.slurm` as the override path.

Per-system layout:
  h2o:  4 nodes × 8 tasks × 10 threads, 4 hr,  hbm-short-96core
  c2h4: 4 nodes × 8 tasks × 10 threads, 8 hr,  hbm-short-96core
  c6h6: 8 nodes × 4 tasks × 20 threads, 24 hr, NEEDS long partition
        (4 tasks/node — 8 OOMs per RUNLOG; comment flags this in script)

C6H6's 24 hr time exceeds hbm-short-96core's limit; the script has a
NOTE flagging this — switch --partition before submitting.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…rks to correct walltime and repository paths
After cleaning ~2250 in-source cmake artifacts (CMakeFiles/, .o, generated
headers, etc.) that an accidental in-source `cmake .` had scattered, add
.gitignore patterns so future regenerations don't reappear in `git status`:

- fixture run outputs (mad.*, moldft.std*, *.log, MacroTaskExchangeRow_task.*)
- test executables built into the source tree by mistake
- cmake-generated headers (config.h, tensor_spec.h, run_test_*_mpi2.cmake[.in])

Also tracks the H2 excited-state TDA input file used by the validation in
H2_EXCITED_REPORT.md, so anyone reproducing the run has the canonical input.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Multi-rank legacy runs crashed at tensoriter.h:185 ("first and second
tensors do not conform") immediately after `initialize` on H2O. Single-rank
runs were unaffected because the bug only fires through the
`if (world.size() > 1)` branch in molresponse.cc that calls
`set_protocol(1e-4)` before `set_protocol(protocol[0])`.

Inside set_protocol:
  FunctionDefaults<3>::set_k(6);     # for the 1e-4 first call
  ...
  rho0 = make_ground_density(world, ground_orbitals);

ground_orbitals are still at the archive's k=8 from the moldft run.
make_ground_density does `square(orbitals)` (returns k=8 trees) and
gaxpys them into a fresh `factoryT(world)` density at the new k=6 →
mismatched trees → tensor non-conform.

Fix: project ground_orbitals to FunctionDefaults<NDIM>::get_k() before
make_ground_density when the orbital k differs from the current default.
This mirrors what check_k() already does later in solve_excited_states;
we just need it earlier when set_protocol is called from the load-balance
prep block.

Also adds benchmarks/submit_all.sh — a small driver that submits all
three (h2o, c2h4, c6h6) benchmark.slurm scripts and writes jobs.txt
with the resulting job ids. README updated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- distpm.cc: silence -Wvla / -Wvla-{,cxx-}extension around `double tmp[m]`
- test_sepop.cc: add -Wunknown-warning-option so clang ignores the
  surrounding -Wmaybe-uninitialized pragma cleanly
- test_systolic.cc: silence C++20 -Wvolatile (gcc) / -Wdeprecated-volatile
  (clang) on `++niter` (niter is `volatile int`)
Adds MADNESS_WERROR (default OFF). When ON, an INTERFACE library
`madness_internal_warnings` carries -Werror and is linked PRIVATE-ly to
every MAD*-obj object library and to every executable built via
add_mad_executable. PRIVATE scope is load-bearing: -Werror does NOT
propagate to INTERFACE_COMPILE_OPTIONS of the installed MAD* targets, so
downstream consumers of find_package(madness) (TiledArray, MPQC, ...) are
unaffected.

Verified by inspecting the generated madness-targets.cmake: zero Werror
occurrences, madness_internal_warnings absent from every
INTERFACE_LINK_LIBRARIES.

CI: -DMADNESS_WERROR=ON added to BUILD_CONFIG for all matrix cells.

Also fixes the only remaining source-level warnings on master that would
have blocked -Werror across the CI matrix: ignored posix_memalign returns
in test_mtxmq.cc / test_Zmtxmq.cc (8 -Wunused-result warnings on gcc).
The earlier suppressions in distpm.cc / test_sepop.cc / test_systolic.cc
in this branch cover the rest.

Scope of -Werror is intentionally limited to MAD*-obj and add_mad_executable
outputs, which keeps it off bundled FetchContent code (parsec, nlohmann_json,
cereal) and off pymadness (pybind11-generated code).

Phases 2/3 (adding -Wall, -Wextra) are deferred.
Adds a method on LoadBalanceDeux<NDIM> that walks the aggregator tree
and reports per-rank cost under any pmap as a horizontal bar chart
(rank | bar | cost | x-avg ratio). Single helper that gives the right
answer both before and after load_balance() runs: pre-partition it sums
per-key cost by would-be owner under the supplied pmap; post-partition
it sums each surviving subtree-root's rolled-up cost under the new pmap.
Both totals match.

Useful for diagnosing whether load_balance(fac) actually helps: call
before with the old pmap and after with the returned pmap; if max/avg
drops, the partitioner had material to cut.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…metric

Ports the legacy TDDFT::gram_schmidt(World&, response_space&,
response_space&) from molresponse_legacy/TDDFT.cc:2061-2089. Operates
in-place on the (X, Y) pair using the RPA pseudo-norm

    norm = <f|f> - <g|g>

and projects out using the same indefinite inner product, with the
same scaling and projection coefficients applied jointly to X and Y.

This is the function the previously commented Gram-Schmidt block in
ExcitedResponse::update_response was trying to call (the commented
2-arg form did not exist in v1, which is one of the reasons that
block had been disabled). Required precursor for restoring the
top-of-iter Gram-Schmidt discipline in the RPA path.

NOTE: mirrors legacy in not guarding against non-positive norms;
non-physical RPA states with negative pseudo-norm produce NaN/inf
downstream. Revisit if observed in practice.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…fix RPA gaxpy direction

Two changes in one function:

1. Restructure to match legacy TDDFT::x_space_step_restriction at
   molresponse_legacy/TDDFT.cc:1921-1949. Per-state loop, X and Y blocks
   bounded independently with the same scalar bound. Honors the explicit
   restrict_y argument (legacy's discipline) rather than testing
   r_params.omega() != 0, which is unrelated to whether the call site
   actually wants both blocks restricted.

2. Fix a sign/direction bug in the prior RPA path. Standard step
   restriction wants

       result = s * temp + (1 - s) * old        (mostly old, small step)

   with s = maxrot / ||temp - old|| < 1. The prior RPA branch passed the
   gaxpy_oop arguments in the opposite order, producing mostly-new
   instead of mostly-old, which effectively negated the restriction
   (or worse, pushed the step the wrong direction). The TDA branch was
   already correct; the unified path now uses the correct ordering for
   both X and Y.

Dropping the old code's active-list iteration (for ai in old_Chi.active)
in favor of all-state iteration matches legacy and avoids skipping
converged states where the step would already be negligible anyway.
The work is cheap (a norm and possibly a gaxpy) so this has no
measurable cost.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…table; reformat

Stabilizes ExcitedResponse::iterate by replicating the per-iter
ground-state-cleanup discipline that ExcitedResponse::iterate_trial
already uses. Pre-fix, the v1 main loop drifted to the omega =
-epsilon(core) ghost eigenvector within 2-3 iters on H2O TDA because
the trial basis was allowed to lose orthogonality to the occupied
ground space; KAIN then locked in the contaminated direction.

Substantive changes (in ExcitedResponse::update_response):

1. Top-of-iter Q-projection on Chi. Mirrors iterate_trial:549-551.
   Strips ground-orbital character that accumulated during the previous
   iter's rotate_excited_space + BSH + KAIN. Both X and Y blocks
   for RPA.

2. Top-of-iter Gram-Schmidt + normalize, twice (the "twice is enough"
   idiom). Mirrors iterate_trial:658-670. Replaces a commented-out
   block at the same location which had three latent bugs: used Chi.X
   (field is .x), called a 2-arg gram_schmidt(x, y) that did not exist
   (added in the preceding commit), and discarded the gram_schmidt
   return value. TDA uses the 1-arg form on Chi.x; RPA uses the new
   2-arg form on (Chi.x, Chi.y) under the indefinite metric.

3. Enable step restriction. Flips `if (false)` to `if (iter > 0)` at
   the existing call site, matching legacy's gate at
   molresponse_legacy/TDDFT.cc:1720. The function body was rewritten
   in the preceding commit; this commit wires it back in.

Additional change in ExcitedResponse::create_bsh_operators:

4. Tabular print of BSH exponents mu_{k,p}. Replaces the inline
   per-(state, orbital) print loop that produced m*n lines per call
   (and the function is called many times per outer iter). Now prints
   one table per call, rank-0 only.

Also in ExcitedResponse::initialize: Q-projection now runs BEFORE the
load-balance step, and the LB block uses the new
LoadBalanceDeux::print_cost_per_rank diagnostic to report cost
distribution before and after redistribute(). This was an in-session
experiment that surfaced the fact that load_balance benefits from
diverse refinement patterns across registered functions.

Note: this commit also includes a clang-format pass (4-space to
2-space indent across the whole file) that was applied by the editor
on save during the substantive edits. CLI tooling for clang-format
isn't installed on this host so the reformat could not be cleanly
extracted into its own commit; the diff is bulky as a result. Use
`git diff -w` for review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…gamma_dispatch

Adds ResponseBase::compute_gamma_dispatch, a single helper that
encapsulates the per-(state, orbital) macrotask invocation that was
previously inlined in compute_theta_X. It builds the flat index
vectors, instantiates ResponseComputeGammaX + MacroTask, converts
X_space to/from vector, and dispatches "full" / "static" through the
macrotask while falling back to the serial compute_gamma_tda for TDA
(which needs the J[rho1] term the per-orbital macrotask does not
compute).

Call-site changes:

- compute_theta_X (ResponseBase.cpp:600-644): collapses ~45 lines of
  inline boilerplate into one line. Behaviour unchanged - this is the
  same code, just relocated.

- compute_lambda_X (ResponseBase.cpp:1038-1050) and
  compute_response_potentials (ResponseBase.cpp:1159-1171): both
  previously dispatched serial compute_gamma_full /
  compute_gamma_static / compute_gamma_tda. Now route the "full" and
  "static" branches through the macrotask, gaining the
  state-parallel speedup. TDA still goes serial through
  compute_gamma_tda.

The serial compute_gamma_full / compute_gamma_static implementations
are left in place so we can A/B-compare results. Once the macrotask
path is trusted, they can be removed in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ResponseBase::solve() ran check_k() before load(), so when restarting
the check_k call saw Chi empty and skipped its Chi projection path.
load() then populated Chi at the saved k from the archive, after which
iterate()'s first operator apply hit a tensor-dimension mismatch
(tensoriter.h:185 "first and second tensors do not conform").

Adds a second call to the free `::check_k(world, Chi, ...)` immediately
after load() to project the just-loaded Chi to the current protocol's
k. Symmetric in both directions:

- Downscale: saved at fine thresh (e.g. 1e-6, k=8), restart at coarser
  thresh (1e-4, k=6). Useful for fast-iteration testing.
- Upscale: saved at coarse thresh, restart at finer. Useful when
  resuming a multi-protocol run that was interrupted.

The ground-orbital side was already handled by the prior check_k()
call (which projects ground_orbitals, hamiltonian, mask before the
restart load happens). Subsequent protocols in the same run were also
already handled - the top-of-iter check_k() call sees a non-empty Chi
and projects it via to_vector / from_vector. Only the first-protocol
restart path needed this fix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ity rebuild

set_protocol() rebuilds ground_density at the new protocol's k right
after FunctionDefaults<3>::set_k(k). But ground_orbitals still carry
the moldft-saved k, so make_ground_density operated on mixed-k
functions and crashed with a tensor-conform assertion when the
moldft k differed from the response protocol k (common case: moldft
saved at thresh=1e-6/k=8, response running at thresh=1e-4/k=6 for
fast iteration testing).

Reprojects ground_orbitals to the current k before make_ground_density
when a mismatch is detected, and invalidates the stored hamiltonian
so the subsequent check_k() recomputes it against the reprojected
orbitals (otherwise check_k would see ground orbitals already at the
right k and skip the rebuild path, leaving a stale hamiltonian from
the previous k).

This is the first half of "downscale everything" on restart with a
lower protocol. The Chi-side projection (added in the previous
commit) handles the response vectors after they're loaded.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er analysis

Print gating:
- bsh_update_excited "omega before shifts" + omega tensor dump - the
  worst offender, ~450 lines per typical run, was ungated on rank or
  print_level. Now gated on rank 0 + print_level >= 2.
- create_bsh_operators tabular mu dump - the table I added recently
  fires per call (many per iter); gated to print_level >= 2.
- update_response "Entering Compute Lambda" - moved to print_level >= 2.
- update_response "omega_n before/after transform" pair - moved to
  print_level >= 2.
- iterate() per-iter diagnostics block: split into headline metrics
  (kept at print_level >= 1: thresh, k, Chi_X norms, bsh_residuals,
  relative_bsh, max rotation, residual maxes and targets) and verbose
  per-(state, orbital) tensors (raised to >= 2: xij norms, xij
  residual norms).

End-of-iteration analysis:
- Re-enables the previously-commented analysis block at the end of
  iterate(). Mirrors legacy iterate_excited.cc:281-297: calls
  analysis(world, Chi) (per-state ground-orbital decomposition +
  transition dipoles) followed by analyze_vectors(world, Chi.x[i],
  ...) per state, and same for Chi.y[i] when not TDA. Separators
  between states gated on rank 0 + print_level >= 1.

Fixes the field-name casing inherited from legacy: the commented v1
code used Chi.X[i] (uppercase) but X_space in v1 uses .x lowercase -
this is the same bit-rot pattern that affected the earlier
Gram-Schmidt block.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…_ptr in future.h

C++20 deprecated std::atomic_exchange(std::shared_ptr<T>*, ...) in
favor of std::atomic<std::shared_ptr<T>>. The proper fix is to convert
FutureImpl's owning shared_ptr to std::atomic<std::shared_ptr<...>>;
that's a larger refactor preserved as a FIXME at the call site. Add a
localized pragma suppression so gcc/libstdc++ builds with
MADNESS_WERROR=ON do not fail on this one known-deprecated call.
Four new ES benchmark systems spanning n_occ ∈ {29, 35, 41, 48}, all
closed-shell HF. Geometries are PubChem 3D conformers (MMFF94-optimized)
fetched by madness_studies/scripts/fetch_pubchem_xyz.py; the
moldft.in / system.json / benchmark dir are generated by
madness_studies/scripts/make_fixture_files.py.

Each system gets:
  src/apps/molresponse_v3/tests/fixtures/systems/<name>_hf/
    {geometry.xyz, moldft.in, system.json}
  src/apps/molresponse_v3/tests/fixtures/benchmarks/<name>/
    {response_tda.in, response_rpa.in, benchmark.slurm}

SLURM defaults are sized for Seawulf Xeon Max hbm-medium-96core (4 tasks
per node to fit Xeon Max memory at protocol [1e-6]):

  uracil    (29 occ):  4 nodes,  12 hr
  adenine   (35 occ):  6 nodes,  12 hr
  histidine (41 occ):  8 nodes,  12 hr
  tyrosine  (48 occ):  8 nodes,  24 hr  (tightest memory budget)

Submitting still goes through benchmarks/submit_all.sh (no change needed
there — it now picks up the four new dirs automatically).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… maxsub

Three changes inside ExcitedResponse::iterate():

1. Convergence targets now mirror SCF.cc:2381-2384:

       conv_den            = dconv * max(5, natom)
       relative_max_target = 5 * dconv

   Replaces the prior thresh-floored formulas (max(100 * thresh, dconv)
   etc), which produced relative_bsh targets of 5e-5 at thresh=1e-6 that
   were below the truncation noise floor of MRA functions at that
   protocol. The new targets follow SCF's convention - including its
   "max(5, natom)" floor that keeps small molecules from being held to
   too-tight density-residual targets (per the legacy comment "previous
   conv was too tight for small systems").

   Net effect on C2H4 at dconv=1e-4, thresh=1e-6:
     conv_den            1e-4    -> 6e-4   (6x looser)
     relative_max_target 5e-5    -> 5e-4   (10x looser)

2. The step-restriction bound max_rotation now reads
   r_params.maxrotn() instead of being hardcoded by thresh. The
   protocol-based heuristic (0.05/0.1/0.25/2.0 across thresh tiers)
   is kept as a fallback when the user leaves maxrotn at its default
   value of 0.5 - so existing behavior is preserved when nothing is
   set, and users can override per-protocol via the input file.

3. KAIN's set_maxsub now reads r_params.maxsub() instead of being
   hardcoded at 10. The default in response_parameters.h is 5, so this
   *does* change default behavior; tune via the input file if a deeper
   subspace was relying on the old hardcoded value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ahurta92 and others added 29 commits June 9, 2026 15:49
…+beta+raman+es) [R3b]

Extend the v3 madqc adapter beyond alpha: requested_properties + beta.*/excited.*
knobs now select multiple ResponsePropertyRequests (polarizability,
hyperpolarizability, single-component raman, resonant/excited), merged via
merge_plans. And run_response_with_ground assembles alpha AND beta when both are
present (was alpha-XOR-beta) — assemble_alpha self-guards on dipole FD, so a mixed
plan yields both. excited.tda=false → Full (X,Y) bundle.

Validated (single-node madqc, h2o z @1e-4, engine v3,
requested_properties=[polarizability,hyperpolarizability]): response_metadata
carries BOTH alpha_zz=8.5346 and beta_zzz=7.760 (+ 1 VBC state).

Caveats: raman maps to v3's SINGLE-COMPONENT path (atom 0, z) — won't match v2's
full per-atom Raman tensor (deferred). ES uses the ExecutorSettings default guess
(SolidHarmonics); a VirtualAO/es-guess madqc knob is a follow-up.

cm.sh: cm_madqc_parity now tees live + highlights [ALPHA]/PROTOCOL/engine lines
(was silent until both engines finished).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aman+es) landed

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… (climb + restart segfault)

R4 sweep-2 surfaced two bugs sharing one root — R3a's in-memory GroundState
shortcut:
  (1) multi-protocol climb crashed "tensors do not conform": the from_memory
      path skipped the pristine-MO reload, so climbing back to original_k_
      left scf_->amo at a stale coarse k.
  (2) restart-in-place segfaulted in build_fock_matrices: madqc validates a
      valid SCF archive as Ok and constructs the SCF WITHOUT loading the MOs,
      so scf_->amo was empty.

Fix (one mechanism): the madqc adapter now builds via GroundState::from_archive
(resolve the moldft restartdata from scf_calc->work_dir), exactly like v2's
make_ground_context. from_archive resets from_memory_=false so prepare()
reloads pristine MOs from disk on each protocol climb. Reverted the in-memory
pristine-snapshot band-aid.

Validated (SLURM r4_resume, 2026-06-11): fresh h2o climb k6->k8 via from_archive
completes — Memory Model B measured the k8 footprint, the rung that previously
crashed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…TaskQ scratch

- Add tools/dump_mra_trees.cpp + its CMake target (R2/L4 offline MRA tree-skeleton
  dumper: loads converged orbitals, dumps the adaptive octree per orbital, never
  solves). CMakeLists already referenced it but the source was untracked.
- Broaden the MacroTaskExchangeRow_task.* ignore to any cwd — these scratch files
  land in the repo root / calc dir when a solve runs from there, not only in the
  fixture dirs the existing rule covered.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ard + atomic write, re-enable es_analysis

PENDING cm_build + cm_es h2 3 validation. No physics change is expected; the
es_analysis re-enable is the one part that needs the ES run to confirm no
teardown crash. Amend if the build/run reveals an issue.

Cleanups (no behavior change):
- GroundState: the from_memory_ flag is now vestigial (only from_archive uses the
  bare ctor, which then overrides it) — remove the flag + guard so prepare()
  always reloads pristine MOs, and make the bare ctor private to enforce the
  checkpoint-backed invariant. Removes the latent single-protocol-only footgun.
- calc_executor: delete FdResponseExecutor::action_name (byte-identical to the
  free node_action_name). madqc_adapter: refresh the stale header comment
  (from_archive + multi-property, not in-memory + R3a-alpha-only).
- assemble_beta: accumulate property rows and write response_metadata.json once
  instead of reloading + rewriting it per (vbc, axis).

ES heap-OOB fix (unblocks the parked es_analysis re-enable, R2):
- Root cause: the per-root archives use nio=1, so reloading at a different process
  count than wrote them silently corrupts the WorldContainer and aborts at
  teardown (the --es-analyze-only --es-load-only repro). Record writer_nproc in
  roots.json; load_es_roots broadcasts it and fails cleanly on ALL ranks (no
  divergent-throw deadlock) on a mismatch instead of corrupting the heap. Write
  roots.json atomically (tmp + rename).
- With the cross-np load path guarded, re-enable report_es_analysis: it runs on
  the in-memory converged state at the solve's own np (no bundle reload), so it
  never hits the guarded path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…iterations

compute_V0x applied K[phi_occ, phi_occ] to the response orbitals by constructing
a fresh madness::Exchange on every call, and Exchange::set_bra_and_ket deep-copies
bra+ket (all occupied orbitals) each time — so a full copy of the occupied set was
made on every V0x apply (x2 for Full's x,y) every iteration.

Build K0 once per protocol on ResponseGroundState (K0_alpha; K0_beta for open
shell, null for closed) in build_response_ground_state_*, and apply the cached
operator from the 12 compute_V0x sites via common_ops::apply_ground_exchange. A
null-K0 fallback rebuilds a fresh K[mos,mos], reproducing the old path exactly, so
a missed build site cannot drift the numbers.

Validated bit-identical: cm_equiv h2o green. The saved copies scale with n_occ, so
the win is negligible on h2o and material on c6h6/naphthalene and toward the
large-system target. two_electron's response exchange K[amo,x] is unchanged (ket=x
depends on the iterate — not cacheable).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The first concrete L2/R5 state-parallel target: fan the ES bundle's per-root work
across rank subworlds. Documents the serial-over-roots bottleneck
(es_solver.hpp:328 step_rotate_pieces — M serial collective passes), the
fan-out/gather decomposition (parallel per-root build+BSH; cheap coupled subspace
step), the MacroTaskQ mechanism (MacroTaskExchangeRow pattern at root grain, vs
v2's raw-subworld execute_subgroup_state_solve and why MacroTaskQ fits the
single-shot-per-iteration ES work), the ground-orbital replication constraint
(gated on the 15d pre-flight estimate + node-local shared replica), the
serial-vs-parallel parity contract, the MacroTaskQ gotchas (no universe
collectives in operator(), subworld teardown fences), and a phased Inc 0-4
prototype plan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…+ apply_A±B

Drop the (A−B)(A+B)u = ω²u direction entirely — not pursuing it:
- delete solvers/es_solver_full_rpa.hpp (ESSolverFullRPA + ESProblemFullRPA)
- kernels/full.hpp: remove apply_AplusB / apply_AminusB + INVARIANCE block
- build_response_ground_state.hpp: remove build_es_problem_full_rpa +
  promote_{tda,full}_to_full_rpa_closed_shell + include
- test_v3_es_skeleton: drop --type=full-rpa branch + --load-roots-full
- docs 02/15/19 + operator_contracts: mark removed (2026-06)

Full ES remains the direct paired-(X,Y) ESSolver<Full, ClosedShell>.
Compile-verified: cm_build targets + test_v3_es_skeleton link clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Batch V0x/T0x/BSH over the flattened M*n_occ root bundle (closed-shell
TDA) -- one collective pass per op-class instead of M per-root passes.
These are pure per-function maps (no cross-function reduction), so the
batched result is bit-identical to the per-root path. Gated behind
--es-batch (default off); the per-root loop stays as the numerical
reference. gamma-exchange (K[phi,x_s](phi), a distinct operator per root)
and the focka transforms stay per-root -- those are the Stage-2
(locality-pmap) targets, recorded in docs/19.

Verified (4 ranks x 10 threads): h2 3/3 roots bit-for-bit identical,
legacy-TDA gate 3/3 PASS; h2o converged root bit-identical, ~7% faster
wall. Cuts fence count ~6M -> ~3+3M and fills the task queue with M x
more work per fence; the win grows with root/rank count.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Instrument step_rotate_pieces with fence-marked phase timers (gamma,
build, subspace, rotate, bsh) printed as one rank-0 ES_TIMING line per
iter. Gated behind --es-time (default off); the boundary fences are
added only under the flag, so normal runs are untouched. The per-root
build loop is split into a gamma-pass (density + Coulomb-exchange) and
an ops-pass (V0x/T0x/E0x) purely for attribution -- the kernel calls are
independent, so results are unchanged (re-verified h2: 3/3 roots
bit-identical, omegas unchanged from pre-split).

First measurement (h2o, 3 roots, 4 ranks): build 42%, gamma 36%,
bsh 16%, rotate 6%, subspace 0.5%. => 78% of each iter is in the
per-root building blocks (the Stage-2 root-fanout target); the M*M
subspace coupling is negligible (0.5%), and --es-batch's 7% win lands
almost entirely in build (V0x/T0x), ~nothing in BSH.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Stage-2 draft)

Design draft for the two-pmap / single-World locality approach: what a
distributed Function is, the binary-op alignment rule (inner_local
"ASSUMING same distribution"), the GroupPmap mechanism, NodeReplicated
ground state, the once-per-iter shuffle for the M*M subspace step, an
honest trade-off section (it's a scaling bet; the Exchange operator is
the hard 36%), what is/isn't reusable from v2 subworld state-parallel,
and a phased plan (Inc A-D).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…Stage-2 fork

Verified the MADNESS output-pmap behavior: an op inherits its result pmap from
a SPECIFIC operand (mul_sparse -> left/`a`, via set_impl), not "the per-root
one", and compressed gaxpy (merge_trees, funcimpl.h:1233) ASSERTS same pmap. So
localizing roots alone crashes assemble_lambda (V0x lands on V_local's pmap,
T0x on the root's group -> mismatch). Keeping every per-root intermediate
group-local needs pervasive per-op remapping -> favors subworlds (subworld
default pmap localizes intermediates for free) vs single-World pmaps
(memory-friendly NodeReplicated phi, but fights output-pmap conventions). Real
Stage-2 decision; settle before implementing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…worlds (chosen)

Chosen Stage-2 path (over doc-20 pmaps). Key findings from MADNESS source:
(a) NO shared-memory Function storage exists (no MPI_Win_allocate_shared;
Split_type SHARED only makes intra-node communicators; NodeReplicated = 1
copy/node via active messages, not shared pointers); (b) a subworld cannot
reference universe data — it must be Cloud-copied in (copy_coeffs_different_
world). So literal shm-shared phi is a core-framework project, out of scope.

The GOAL is reachable anyway: NODE-ALIGNED subworlds (Split_type SHARED, one
subworld per node) with phi left DISTRIBUTED within each node-subworld = one
copy per node, per-rank phi identical to single-World (avoids the v2 per-RANK
replication OOM). Intermediate locality is free (subworld default pmap).
Per-iter Cloud gather/scatter for the M*M subspace+rotation (heavier than the
pmap shuffle; the efficiency unknown to measure). Reuses MacroTaskQ/Cloud;
drops v2's per-task phi replication. Phased plan Inc 1-4; first step =
node-aligned create_worlds + map test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
make_node_aligned_subworld() splits the universe into one subworld per
PHYSICAL NODE via MPI_Comm_split_type(SHARED) — node-contiguous, unlike
MacroTaskQ's default color=rank%nsubworld interleaving. This is the
prerequisite for the ES node-phi Stage-2 design: with node-aligned
subworlds, phi Cloud-copied into each subworld lands DISTRIBUTED over
that node's ranks => one copy per node, per-rank phi identical to the
current single-World cost (avoids the v2 per-RANK replication OOM).

verify_one_subworld_per_node() cross-checks each rank's subworld size
against ranks_per_host. New test_node_subworlds (header-only, no solver
change, no MRA solve). Verified: 1 rank PASS; 2 nodes x 4 ranks -> 2
subworlds of size 4, NODE_SUBWORLD_TEST PASS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…export [R2 Phase 2b]

Extend the offline MRA tree-skeleton dumper to also dump converged closed-shell FD response orbitals, so one run yields a combined ground + response product-overlap analysis (the Coulomb/exchange partitioning oracle).

--fd [--fd-calc-dir=DIR] walks response_metadata.json fd_states, reads (thresh,k) from protocols[key] + set_response_protocol, parses the <pert> key back into a Perturbation (parse_perturbation = inverse of Perturbation::description), and try_load_fd_state<Type,ClosedShell> each point — static->ResponseStateX (x only), full->ResponseStateXY (x+y) — dumping every response orbital's recon+comp tree via the existing dump_function_trees, into the same out dir as the ground MOs. TDA is an ES-bundle type (no FDPerturbationOf<TDA>) and is excluded.

Validated on the allocation (cm_run h2o, dipole_z static @k6, alpha_zz=8.5345 == R3a ref): 5 ground MOs + 5 response orbitals -> combined 10x10 overlap matrix, C in [0,1], symmetric, response-of-orbital-i overlaps ground-orbital-i.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…d copy/node (docs/21)

test_node_phi ships φ (synthetic, content-independent) from the universe into
node-aligned subworlds and proves the memory claim the Stage-2 design rests on
via size()/size_local accounting: Σ(local) over a node's ranks == size()
(ratio 1.0), i.e. ONE DISTRIBUTED copy per node, NOT a per-rank replica.

Verified (2 nodes x 4 ranks, 12 orbitals): per-rank φ = 175824 coeffs
(one copy/node) vs 703296 if per-rank replicated (the v2 OOM cost we avoid) —
4x less per rank at 4 ranks/node; ratio per node = 1.000000; NODE_PHI_TEST PASS.
Also confirms the cross-world copy completes without deadlock outside MacroTaskQ.
Teardown clears all Functions before finalize (avoids the RecursiveMutex abort).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…code)

Per-iteration fan/gather/couple/scatter for the BUILD phase only (BSH stays
universe-side until Inc 4). MacroTaskESBuildTDA functor fans each root's build
to a subworld (kernels unchanged — subworld-local => aligned), returns
[Λ|V0x|E0x|γ] (4*n_occ) per root; universe subspace/rotate/θ/BSH stay
byte-identical so bit-identity is easy to reason about. World-bound objects
(coulop, Qa, K0) rebuilt subworld-side; K0=null => direct apply_exchange
(no nested MacroTaskQ). Gated --es-subworlds=G (0=reference). Staged: 3a stock
MacroTaskQ interleaved G=nnodes (flow+bit-identity+Cloud cost, no core change),
3b node-aligned create_worlds (Split_type SHARED) for on-node locality.
Open risk: custom result-stride partitioner. Awaits review before code.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ompute (per user)

The §1-7 MacroTask build fan-out is stateless (recreates subworlds + reships
X/φ every call), which fights the user's goals. §8 captures the real target:
persistent node-subworlds + φ replicated once + X REPLICATED across subworlds;
per iter build Λ locally (recompute, transient), assemble per-subworld A/S
columns, allreduce the M×M SCALAR matrix (Λ never crosses), diagonalize +
rotate + BSH + KAIN locally, re-broadcast updated X. Reuses the existing
step_recompute_pieces (stream_theta). Hard constraint: matrix + dense rotation
couple all roots via co-located-function ops, so X must be replicated (or
gathered) — but the 78% build stays local. MacroTask demoted to optional
coupling-cost measurement.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e (docs/22 §8)

Proves the "collective to construct the matrix" of the persistent-subworld
recompute design: X replicated into node-subworlds, each node computes the
subspace-matrix columns it owns (round-robin) via a subworld-collective
matrix_inner over its replicated X, only subworld-rank-0 contributes, universe
allreduce sums the disjoint columns -> full M×M. Functions NEVER cross worlds;
only M×M scalars move. == direct universe matrix to machine eps.

Verified: 1 rank and 2 nodes x 4 ranks (M=6) -> max|S_dist-S_direct| = 8.9e-16,
SUBSPACE_ALLREDUCE_TEST PASS. Keystone for the rotation+recompute solver
restructure (the remaining Stage-2 steps).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… density [R2 Phase 2b-ii]

Each FD point now also builds and dumps the closed-shell response density rho^(1) as fd_<pert>__<key>__<fkey>_rho1_{recon,comp}.json, so the response density tree feeds the same slice/spectrum/boxes viz and adds a row to the combined product-overlap matrix.

Reuses Kernels<Type,ClosedShell>::compute_density as the single source of truth for the convention (Static folds Y=X -> factor 4; Full uses x+y -> factor 2). It reads only g0.amo, so a 1-field ResponseGroundState suffices (no prepare()/Coulomb/Q). GS orbitals load at native k but the FD state is at the FD k, so a LOCAL copy of the orbitals is reprojected to the active k (project(), as in GroundState::prepare) for the phi*x products; the dumped GS trees stay native.

Validated via SLURM batch (job 2023189, hbm-short, non-exclusive): h2o dipole_z static, rho1 norm2=0.664, 764 leaves, max_level 14 (smooth valence-scale polarization density vs core lvl 18); combined 11x11 matrix, rho1 overlaps O-1s core 0.85, HOMO-response 0.26.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tal + rho^(1) cubes [R2 Phase 2b-iii]

--cube previously emitted Gaussian cubes only for the ground-state MOs. Thread the molecule + cube settings through dump_fd_states -> dump_fd_point -> dump_block so the FD response orbitals x_i (and y_i for Full) and the response density rho^(1) are also written via the same write_cube_file/eval_cube path, with the same stem as their _boxes.vtk (fd_..._x<i>.cube, fd_..._rho1.cube). Lets ParaView/Avogadro render the response functions as signed isosurface lobes like the MOs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Index/table of contents for the MADWorld parallel runtime developer guide
(docs/parallel_runtime_guide/). Chapters and the performance-models companion
follow in subsequent commits.
Chapters 1-5 of the parallel runtime developer guide: MPI/SafeMPI,
active messages, the RMI thread, the task queue & futures, and global
ops & fence. Diagram-driven, grounded in src/madness/world with file:line
anchors.
Dataflow tasks, futures, dependency tracking, thread pool, remote-task
lifecycle; binary-tree collectives and Dijkstra-style fence termination
detection.
…tions)

WorldContainer distributed hash map and pmaps; Function/FunctionImpl tree
representation, key-aware pmaps and load balancing; and the per-operation
deep dive (compress, reconstruct, gaxpy, multiply, inner, apply, truncate)
with sweep diagrams and per-node cost.
Per-operation parameter-tuning playbook (runtime/build/numerical knobs,
symptom->action); and the parallel function I/O chapter: archive system,
the nio single-writer default bottleneck, app restart flows, and an HDF5 /
parallel-HDF5 (MPI-IO) analysis with an I/O performance model.
Quantitative time/communication/memory models for the MADWorld runtime and
the core MRA operations: the compute/comm/sync decomposition, a worked
response-iteration model, the ground-state replication memory model tying
to the documented OOM, and how to instrument each model term. Companion to
the parallel runtime developer guide.
…++17/-Werror)

merge_flag_map captured the structured binding 'shard_proto' from the
enclosing range-for; capturing a structured binding is a C++20 extension
(-Wc++20-extensions), which upstream's new MADNESS_WERROR=ON CI turns into
a hard error under clang. Alias it to a named reference first.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@ahurta92 ahurta92 merged commit fd03e5c into master Jun 18, 2026
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants