Sync master with molresponse-feature-next (v3 development trunk) by ahurta92 · Pull Request #7 · ahurta92/madness

ahurta92 · 2026-06-17T19:00:46Z

Bring master up to molresponse-feature-next, the de-facto v3 development
trunk. master was 168 commits behind (its only unique commits were
content-free merge bubbles), so this is a clean catch-up — no divergent work is lost.

Release-sync step of a branch-organization cleanup that also split the former
grab-bag madqc-driver-improvements into one-concern-per-thread branches off
this trunk: exchange, parallel-runtime, madqc-refactor, fix/scf-do-plots,
feat/htg-writer.

🤖 Generated with Claude Code

…at faces/edges/vertices are considered "in"/neighbors of the respective boxes

…pecial-point-at-box-boundaries amended handling of special points at box boundaries

…rving CLI paths Several related improvements landed together this session. ES solver (Stage 1: TDA RHF): - Add ESSolver.hpp with the bundle iteration loop (orthonormalize, Lambda build, subspace diagonalize, rotate, BSH per root, step restriction). Includes a 5-iter guess-refinement pre-phase using 2N trial states, mirroring molresponse_legacy::iterate_guess. - Add ESSolverGuess.hpp with random-Gaussian-envelope initial guess (mirrors molresponse_legacy::create_random_guess). - Add ResponseKernel.hpp helpers: apply_kinetic, compute_lambda_subspace (full Lambda with kinetic and full Fock - distinct from the existing Theta-equivalent compute_lambda used for the BSH driver), orthonormalize_bundle, rotate_bundle, rotate_vector_bundle, bundle_inner, build_es_theta (with shift compensation). - Add test_es_solver.cpp validation harness for H2 TDA-RHF that compares first-N omega against the legacy reference table. - Add docs/11_excited_state_solver_design.md staged plan (TDA RHF -> Full RHF -> TDA UHF -> Full UHF). Stage 1 runs end-to-end but eigenvalues are not yet correct; debugging is in progress. Test driver (test_v3_solver): - Protocol ramping: walk [1e-4 -> 1e-6 -> ...] up to the archive's target thresh, warm-starting each step from the previous. ResponseProtocol.hpp adds set_response_protocol, default_thresh_for_k / default_k_for_thresh, build_protocol_ramp, parse_protocol_csv. - Spherical-atom shortcut: single-atom systems solve only z and propagate to alpha_xx = alpha_yy = alpha_zz. - Update Li (UHF) reference to ETB-converged HF basis-set limit (170.121 / 186.510 / 262.710), with a comment pointing at ~/Projects/li_polarizability_study/. Prior 170.020 reference was unconverged (Phys Rev A 91, 022501 - 1-D FD code). - Tighten H2 fixture protocol [1e-4] -> [1e-4, 1e-6]. - Tighten Li fixture l=20 -> 200 and protocol -> [1e-4, 1e-6, 1e-8]. - Make GroundState::read_archive_header public so the test driver can inspect the archive header before set_response_protocol. Case-preserving CLI paths: - commandlineparser gets a parallel keyval_raw map and value_raw(key) accessor. Use for paths/free-form strings; existing value(key) keeps lowercasing for chemistry-style enum knobs (--xc=HF, --localize=BOYS). - Bare-positional input form (`app /Mixed/Path/in.in`) also preserves case via the existing if (key=="file"||"input") branch. - test_solver / test_es_solver use parser.value_raw("archive") and parser.value_raw("molecule") for path arguments. - GroundState.cpp adds a workaround for QCCalculationParametersBase's tostring<std::string> which lowercases values: re-set the prefix parameter directly via the QCParameter struct so SCF::load_mos gets the right case. Phase 2 (param-aware case-folding inside QCCalculationParametersBase) is deferred. Legacy molresponse build patches: - ground_parameters.h: pre-set FunctionDefaults<3>::set_cubic_cell from the archive's L right after reading L, before orbital reconstructs. Newer MADNESS Function::load() rejects cell mismatch, which broke molresponse_legacy startup. - src/madness/chem/QCCalculationParametersBase.h: symlink to ../mra/QCCalculationParametersBase.h. Legacy includes <chem/QCCalculationParametersBase.h>; the file is in mra/ now, so the symlink lets the legacy compile without modifying legacy source. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The v1 production molresponse had been bit-rotting: it didn't compile, and once compiled, the excited-state TDA path crashed in three different places before producing eigenvalues. This commit gets v1 building and converging on the H2/4-root TDA fixture to within 5e-5 of the legacy reference values (0.46802, 0.47785, 0.48134, 0.48134), with restart from a saved checkpoint also working. Buildability: - Re-enable molresponse subdir (src/apps/CMakeLists.txt). - Rename Exchange<double,3>::Algorithm -> ExchangeAlgorithm (60 sites in 4 files) to match upstream. - Add ResponseParameters::get_tag() override. - Drop a stale omega arg from FrequencyResponse construction in molresponse.cc. Excited-state TDA correctness: - ground_parameters.h: set_cubic_cell(-L,L) before reading orbitals from archive (Function::load was rejecting cell mismatch). - ResponseBase::solve: call check_k (which builds `hamiltonian`) before initialize(), since the guess-iteration's response-potential build needs it; ordering bug that left the tensor empty during iterate_trial. - response_space(World,m,n) ctor: fill slots with zero functions rather than null-impl Functions. - response_space::push_back / pop_back: keep `active` in sync with num_states. transform() and similar call push_back repeatedly; without this, to_vector/from_vector silently produced zero-length payloads and the eigenvalues stalled at the wrong place. - ExcitedResponse::initialize: rebuild Chi from scratch after select_functions so n_states matches the trimmed x/y. - ExcitedResponse ctor: set inner-product / J1 / K1 / VXC strategies (was missing entirely; FrequencyResponse does this in its ctor). - ResponseBase::update_residual: handle empty old_residuals tensor on first iteration; force canonical active list to bypass any remaining stale-active paths. Validation: - src/apps/molresponse_v3/tests/fixtures/systems/h2_hf/H2_EXCITED_REPORT.md documents the run matrix (cold/restart for both v1 and legacy, plus legacy RPA), the fixes summarized above, and the v1 RPA failure mode (heap-corruption SIGSEGV inside MacroTaskExchangeSimple K[1] in iter 0 — next task). Also includes earlier in-progress molresponse_v3 ES solver work: switch default guess to solid harmonics × orbital (mirrors legacy create_trial_functions), and the kernel/guess refactor that converges H2 toward the legacy reference but still has a residual gap. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

J1StrategyFull and J1StrategyStable in ResponseBase.hpp allocated `vector_real_function_3d temp_J(3)` — a buffer sized for the 3 Cartesian directions of FrequencyResponse's dipole case. Excited-state runs with num_states != 3 wrote past the end and corrupted the heap; the resulting SIGSEGV manifested in whichever allocator next touched the broken arena (MacroTaskExchangeSimple's K[1] tile under multiworld, J[1]'s shared_ptr release under largemem). After the fix, v1 RPA converges on H2/4-root and matches legacy RPA to ≤ 5e-7 on the lowest three roots, with ω₄ within 3e-6 (both runs hit maxiter before the strict bsh-residual target — the values are correct, the convergence test is just stricter than the achieved residual). Updates the H2_EXCITED_REPORT.md with the v1 RPA result row, the agreement analysis, and the temp_J(3) root-cause writeup. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…4/C6H6 Adds a benchmark suite under tests/fixtures/benchmarks/ to measure the rotate-vs-recompute gap between v1 (rotates gamma through subspace diagonalizations) and legacy (recomputes per cycle). Each system has a SLURM script that: 1. Runs moldft once to converge the ground state. 2. Runs four ES configs in sequence: legacy-TDA, legacy-RPA, v1-TDA, v1-RPA — all sharing the same archive. 3. Captures wall time, iteration count, and final omegas into timing.csv with a stable schema. Per-system tuning (from CLAUDE.md memory budget data): - h2o (n_occ=5): 1 node × 8 tasks/node × 4 cpus/task, 2 hr - c2h4 (n_occ=8): 2 nodes × 8 tasks/node × 4 cpus/task, 6 hr - c6h6 (n_occ=21): 4 nodes × 4 tasks/node × 8 cpus/task, 24 hr (4 tasks/node — at 8 it OOMs on Xeon Max per RUNLOG) C2H4 fixture is new (geometry, moldft.in, system.json) following the same shape as h2o_hf and c6h6_hf. H2O and C6H6 fixtures already existed. Submission edits the SLURM template needs (partition, account, optional module loads, optional LAUNCHER if `srun` doesn't bind right) are documented in benchmarks/README.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Switch from generic Slurm defaults to the user's actual Seawulf setup: account=ahurtado, partition=hbm-short-96core, source ~/load_xeonmax.sh for module setup, MAD_NUM_THREADS hardcoded per-system to leave headroom on 96-core nodes. Drop --cpus-per-task and OMP_NUM_THREADS — explicit MAD_NUM_THREADS is what madness reads. Default launcher is now `mpirun --map-by numa numactl --preferred-many=8-15` (bind by NUMA, prefer the HBM tier on Xeon Max), with `LAUNCHER=srun sbatch benchmark.slurm` as the override path. Per-system layout: h2o: 4 nodes × 8 tasks × 10 threads, 4 hr, hbm-short-96core c2h4: 4 nodes × 8 tasks × 10 threads, 8 hr, hbm-short-96core c6h6: 8 nodes × 4 tasks × 20 threads, 24 hr, NEEDS long partition (4 tasks/node — 8 OOMs per RUNLOG; comment flags this in script) C6H6's 24 hr time exceeds hbm-short-96core's limit; the script has a NOTE flagging this — switch --partition before submitting. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rks to correct walltime and repository paths

After cleaning ~2250 in-source cmake artifacts (CMakeFiles/, .o, generated headers, etc.) that an accidental in-source `cmake .` had scattered, add .gitignore patterns so future regenerations don't reappear in `git status`: - fixture run outputs (mad.*, moldft.std*, *.log, MacroTaskExchangeRow_task.*) - test executables built into the source tree by mistake - cmake-generated headers (config.h, tensor_spec.h, run_test_*_mpi2.cmake[.in]) Also tracks the H2 excited-state TDA input file used by the validation in H2_EXCITED_REPORT.md, so anyone reproducing the run has the canonical input. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-next' into molresponse-feature-next

Multi-rank legacy runs crashed at tensoriter.h:185 ("first and second tensors do not conform") immediately after `initialize` on H2O. Single-rank runs were unaffected because the bug only fires through the `if (world.size() > 1)` branch in molresponse.cc that calls `set_protocol(1e-4)` before `set_protocol(protocol[0])`. Inside set_protocol: FunctionDefaults<3>::set_k(6); # for the 1e-4 first call ... rho0 = make_ground_density(world, ground_orbitals); ground_orbitals are still at the archive's k=8 from the moldft run. make_ground_density does `square(orbitals)` (returns k=8 trees) and gaxpys them into a fresh `factoryT(world)` density at the new k=6 → mismatched trees → tensor non-conform. Fix: project ground_orbitals to FunctionDefaults<NDIM>::get_k() before make_ground_density when the orbital k differs from the current default. This mirrors what check_k() already does later in solve_excited_states; we just need it earlier when set_protocol is called from the load-balance prep block. Also adds benchmarks/submit_all.sh — a small driver that submits all three (h2o, c2h4, c6h6) benchmark.slurm scripts and writes jobs.txt with the resulting job ids. README updated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-next' into molresponse-feature-next

- distpm.cc: silence -Wvla / -Wvla-{,cxx-}extension around `double tmp[m]` - test_sepop.cc: add -Wunknown-warning-option so clang ignores the surrounding -Wmaybe-uninitialized pragma cleanly - test_systolic.cc: silence C++20 -Wvolatile (gcc) / -Wdeprecated-volatile (clang) on `++niter` (niter is `volatile int`)

Adds MADNESS_WERROR (default OFF). When ON, an INTERFACE library `madness_internal_warnings` carries -Werror and is linked PRIVATE-ly to every MAD*-obj object library and to every executable built via add_mad_executable. PRIVATE scope is load-bearing: -Werror does NOT propagate to INTERFACE_COMPILE_OPTIONS of the installed MAD* targets, so downstream consumers of find_package(madness) (TiledArray, MPQC, ...) are unaffected. Verified by inspecting the generated madness-targets.cmake: zero Werror occurrences, madness_internal_warnings absent from every INTERFACE_LINK_LIBRARIES. CI: -DMADNESS_WERROR=ON added to BUILD_CONFIG for all matrix cells. Also fixes the only remaining source-level warnings on master that would have blocked -Werror across the CI matrix: ignored posix_memalign returns in test_mtxmq.cc / test_Zmtxmq.cc (8 -Wunused-result warnings on gcc). The earlier suppressions in distpm.cc / test_sepop.cc / test_systolic.cc in this branch cover the rest. Scope of -Werror is intentionally limited to MAD*-obj and add_mad_executable outputs, which keeps it off bundled FetchContent code (parsec, nlohmann_json, cereal) and off pymadness (pybind11-generated code). Phases 2/3 (adding -Wall, -Wextra) are deferred.

Adds a method on LoadBalanceDeux<NDIM> that walks the aggregator tree and reports per-rank cost under any pmap as a horizontal bar chart (rank | bar | cost | x-avg ratio). Single helper that gives the right answer both before and after load_balance() runs: pre-partition it sums per-key cost by would-be owner under the supplied pmap; post-partition it sums each surviving subtree-root's rolled-up cost under the new pmap. Both totals match. Useful for diagnosing whether load_balance(fac) actually helps: call before with the old pmap and after with the returned pmap; if max/avg drops, the partitioner had material to cut. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…metric Ports the legacy TDDFT::gram_schmidt(World&, response_space&, response_space&) from molresponse_legacy/TDDFT.cc:2061-2089. Operates in-place on the (X, Y) pair using the RPA pseudo-norm norm = <f|f> - <g|g> and projects out using the same indefinite inner product, with the same scaling and projection coefficients applied jointly to X and Y. This is the function the previously commented Gram-Schmidt block in ExcitedResponse::update_response was trying to call (the commented 2-arg form did not exist in v1, which is one of the reasons that block had been disabled). Required precursor for restoring the top-of-iter Gram-Schmidt discipline in the RPA path. NOTE: mirrors legacy in not guarding against non-positive norms; non-physical RPA states with negative pseudo-norm produce NaN/inf downstream. Revisit if observed in practice. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…fix RPA gaxpy direction Two changes in one function: 1. Restructure to match legacy TDDFT::x_space_step_restriction at molresponse_legacy/TDDFT.cc:1921-1949. Per-state loop, X and Y blocks bounded independently with the same scalar bound. Honors the explicit restrict_y argument (legacy's discipline) rather than testing r_params.omega() != 0, which is unrelated to whether the call site actually wants both blocks restricted. 2. Fix a sign/direction bug in the prior RPA path. Standard step restriction wants result = s * temp + (1 - s) * old (mostly old, small step) with s = maxrot / ||temp - old|| < 1. The prior RPA branch passed the gaxpy_oop arguments in the opposite order, producing mostly-new instead of mostly-old, which effectively negated the restriction (or worse, pushed the step the wrong direction). The TDA branch was already correct; the unified path now uses the correct ordering for both X and Y. Dropping the old code's active-list iteration (for ai in old_Chi.active) in favor of all-state iteration matches legacy and avoids skipping converged states where the step would already be negligible anyway. The work is cheap (a norm and possibly a gaxpy) so this has no measurable cost. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…table; reformat Stabilizes ExcitedResponse::iterate by replicating the per-iter ground-state-cleanup discipline that ExcitedResponse::iterate_trial already uses. Pre-fix, the v1 main loop drifted to the omega = -epsilon(core) ghost eigenvector within 2-3 iters on H2O TDA because the trial basis was allowed to lose orthogonality to the occupied ground space; KAIN then locked in the contaminated direction. Substantive changes (in ExcitedResponse::update_response): 1. Top-of-iter Q-projection on Chi. Mirrors iterate_trial:549-551. Strips ground-orbital character that accumulated during the previous iter's rotate_excited_space + BSH + KAIN. Both X and Y blocks for RPA. 2. Top-of-iter Gram-Schmidt + normalize, twice (the "twice is enough" idiom). Mirrors iterate_trial:658-670. Replaces a commented-out block at the same location which had three latent bugs: used Chi.X (field is .x), called a 2-arg gram_schmidt(x, y) that did not exist (added in the preceding commit), and discarded the gram_schmidt return value. TDA uses the 1-arg form on Chi.x; RPA uses the new 2-arg form on (Chi.x, Chi.y) under the indefinite metric. 3. Enable step restriction. Flips `if (false)` to `if (iter > 0)` at the existing call site, matching legacy's gate at molresponse_legacy/TDDFT.cc:1720. The function body was rewritten in the preceding commit; this commit wires it back in. Additional change in ExcitedResponse::create_bsh_operators: 4. Tabular print of BSH exponents mu_{k,p}. Replaces the inline per-(state, orbital) print loop that produced m*n lines per call (and the function is called many times per outer iter). Now prints one table per call, rank-0 only. Also in ExcitedResponse::initialize: Q-projection now runs BEFORE the load-balance step, and the LB block uses the new LoadBalanceDeux::print_cost_per_rank diagnostic to report cost distribution before and after redistribute(). This was an in-session experiment that surfaced the fact that load_balance benefits from diverse refinement patterns across registered functions. Note: this commit also includes a clang-format pass (4-space to 2-space indent across the whole file) that was applied by the editor on save during the substantive edits. CLI tooling for clang-format isn't installed on this host so the reformat could not be cleanly extracted into its own commit; the diff is bulky as a result. Use `git diff -w` for review. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…gamma_dispatch Adds ResponseBase::compute_gamma_dispatch, a single helper that encapsulates the per-(state, orbital) macrotask invocation that was previously inlined in compute_theta_X. It builds the flat index vectors, instantiates ResponseComputeGammaX + MacroTask, converts X_space to/from vector, and dispatches "full" / "static" through the macrotask while falling back to the serial compute_gamma_tda for TDA (which needs the J[rho1] term the per-orbital macrotask does not compute). Call-site changes: - compute_theta_X (ResponseBase.cpp:600-644): collapses ~45 lines of inline boilerplate into one line. Behaviour unchanged - this is the same code, just relocated. - compute_lambda_X (ResponseBase.cpp:1038-1050) and compute_response_potentials (ResponseBase.cpp:1159-1171): both previously dispatched serial compute_gamma_full / compute_gamma_static / compute_gamma_tda. Now route the "full" and "static" branches through the macrotask, gaining the state-parallel speedup. TDA still goes serial through compute_gamma_tda. The serial compute_gamma_full / compute_gamma_static implementations are left in place so we can A/B-compare results. Once the macrotask path is trusted, they can be removed in a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ResponseBase::solve() ran check_k() before load(), so when restarting the check_k call saw Chi empty and skipped its Chi projection path. load() then populated Chi at the saved k from the archive, after which iterate()'s first operator apply hit a tensor-dimension mismatch (tensoriter.h:185 "first and second tensors do not conform"). Adds a second call to the free `::check_k(world, Chi, ...)` immediately after load() to project the just-loaded Chi to the current protocol's k. Symmetric in both directions: - Downscale: saved at fine thresh (e.g. 1e-6, k=8), restart at coarser thresh (1e-4, k=6). Useful for fast-iteration testing. - Upscale: saved at coarse thresh, restart at finer. Useful when resuming a multi-protocol run that was interrupted. The ground-orbital side was already handled by the prior check_k() call (which projects ground_orbitals, hamiltonian, mask before the restart load happens). Subsequent protocols in the same run were also already handled - the top-of-iter check_k() call sees a non-empty Chi and projects it via to_vector / from_vector. Only the first-protocol restart path needed this fix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ity rebuild set_protocol() rebuilds ground_density at the new protocol's k right after FunctionDefaults<3>::set_k(k). But ground_orbitals still carry the moldft-saved k, so make_ground_density operated on mixed-k functions and crashed with a tensor-conform assertion when the moldft k differed from the response protocol k (common case: moldft saved at thresh=1e-6/k=8, response running at thresh=1e-4/k=6 for fast iteration testing). Reprojects ground_orbitals to the current k before make_ground_density when a mismatch is detected, and invalidates the stored hamiltonian so the subsequent check_k() recomputes it against the reprojected orbitals (otherwise check_k would see ground orbitals already at the right k and skip the rebuild path, leaving a stale hamiltonian from the previous k). This is the first half of "downscale everything" on restart with a lower protocol. The Chi-side projection (added in the previous commit) handles the response vectors after they're loaded. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…er analysis Print gating: - bsh_update_excited "omega before shifts" + omega tensor dump - the worst offender, ~450 lines per typical run, was ungated on rank or print_level. Now gated on rank 0 + print_level >= 2. - create_bsh_operators tabular mu dump - the table I added recently fires per call (many per iter); gated to print_level >= 2. - update_response "Entering Compute Lambda" - moved to print_level >= 2. - update_response "omega_n before/after transform" pair - moved to print_level >= 2. - iterate() per-iter diagnostics block: split into headline metrics (kept at print_level >= 1: thresh, k, Chi_X norms, bsh_residuals, relative_bsh, max rotation, residual maxes and targets) and verbose per-(state, orbital) tensors (raised to >= 2: xij norms, xij residual norms). End-of-iteration analysis: - Re-enables the previously-commented analysis block at the end of iterate(). Mirrors legacy iterate_excited.cc:281-297: calls analysis(world, Chi) (per-state ground-orbital decomposition + transition dipoles) followed by analyze_vectors(world, Chi.x[i], ...) per state, and same for Chi.y[i] when not TDA. Separators between states gated on rank 0 + print_level >= 1. Fixes the field-name casing inherited from legacy: the commented v1 code used Chi.X[i] (uppercase) but X_space in v1 uses .x lowercase - this is the same bit-rot pattern that affected the earlier Gram-Schmidt block. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…_ptr in future.h C++20 deprecated std::atomic_exchange(std::shared_ptr<T>*, ...) in favor of std::atomic<std::shared_ptr<T>>. The proper fix is to convert FutureImpl's owning shared_ptr to std::atomic<std::shared_ptr<...>>; that's a larger refactor preserved as a FIXME at the call site. Add a localized pragma suppression so gcc/libstdc++ builds with MADNESS_WERROR=ON do not fail on this one known-deprecated call.

Four new ES benchmark systems spanning n_occ ∈ {29, 35, 41, 48}, all closed-shell HF. Geometries are PubChem 3D conformers (MMFF94-optimized) fetched by madness_studies/scripts/fetch_pubchem_xyz.py; the moldft.in / system.json / benchmark dir are generated by madness_studies/scripts/make_fixture_files.py. Each system gets: src/apps/molresponse_v3/tests/fixtures/systems/<name>_hf/ {geometry.xyz, moldft.in, system.json} src/apps/molresponse_v3/tests/fixtures/benchmarks/<name>/ {response_tda.in, response_rpa.in, benchmark.slurm} SLURM defaults are sized for Seawulf Xeon Max hbm-medium-96core (4 tasks per node to fit Xeon Max memory at protocol [1e-6]): uracil (29 occ): 4 nodes, 12 hr adenine (35 occ): 6 nodes, 12 hr histidine (41 occ): 8 nodes, 12 hr tyrosine (48 occ): 8 nodes, 24 hr (tightest memory budget) Submitting still goes through benchmarks/submit_all.sh (no change needed there — it now picks up the four new dirs automatically). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…enchmarks

…ugging outputs

… maxsub Three changes inside ExcitedResponse::iterate(): 1. Convergence targets now mirror SCF.cc:2381-2384: conv_den = dconv * max(5, natom) relative_max_target = 5 * dconv Replaces the prior thresh-floored formulas (max(100 * thresh, dconv) etc), which produced relative_bsh targets of 5e-5 at thresh=1e-6 that were below the truncation noise floor of MRA functions at that protocol. The new targets follow SCF's convention - including its "max(5, natom)" floor that keeps small molecules from being held to too-tight density-residual targets (per the legacy comment "previous conv was too tight for small systems"). Net effect on C2H4 at dconv=1e-4, thresh=1e-6: conv_den 1e-4 -> 6e-4 (6x looser) relative_max_target 5e-5 -> 5e-4 (10x looser) 2. The step-restriction bound max_rotation now reads r_params.maxrotn() instead of being hardcoded by thresh. The protocol-based heuristic (0.05/0.1/0.25/2.0 across thresh tiers) is kept as a fallback when the user leaves maxrotn at its default value of 0.5 - so existing behavior is preserved when nothing is set, and users can override per-protocol via the input file. 3. KAIN's set_maxsub now reads r_params.maxsub() instead of being hardcoded at 10. The default in response_parameters.h is 5, so this *does* change default behavior; tune via the input file if a deeper subspace was relying on the old hardcoded value. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…+beta+raman+es) [R3b] Extend the v3 madqc adapter beyond alpha: requested_properties + beta.*/excited.* knobs now select multiple ResponsePropertyRequests (polarizability, hyperpolarizability, single-component raman, resonant/excited), merged via merge_plans. And run_response_with_ground assembles alpha AND beta when both are present (was alpha-XOR-beta) — assemble_alpha self-guards on dipole FD, so a mixed plan yields both. excited.tda=false → Full (X,Y) bundle. Validated (single-node madqc, h2o z @1e-4, engine v3, requested_properties=[polarizability,hyperpolarizability]): response_metadata carries BOTH alpha_zz=8.5346 and beta_zzz=7.760 (+ 1 VBC state). Caveats: raman maps to v3's SINGLE-COMPONENT path (atom 0, z) — won't match v2's full per-atom Raman tensor (deferred). ES uses the ExecutorSettings default guess (SolidHarmonics); a VirtualAO/es-guess madqc knob is a follow-up. cm.sh: cm_madqc_parity now tees live + highlights [ALPHA]/PROTOCOL/engine lines (was silent until both engines finished). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…aman+es) landed Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… (climb + restart segfault) R4 sweep-2 surfaced two bugs sharing one root — R3a's in-memory GroundState shortcut: (1) multi-protocol climb crashed "tensors do not conform": the from_memory path skipped the pristine-MO reload, so climbing back to original_k_ left scf_->amo at a stale coarse k. (2) restart-in-place segfaulted in build_fock_matrices: madqc validates a valid SCF archive as Ok and constructs the SCF WITHOUT loading the MOs, so scf_->amo was empty. Fix (one mechanism): the madqc adapter now builds via GroundState::from_archive (resolve the moldft restartdata from scf_calc->work_dir), exactly like v2's make_ground_context. from_archive resets from_memory_=false so prepare() reloads pristine MOs from disk on each protocol climb. Reverted the in-memory pristine-snapshot band-aid. Validated (SLURM r4_resume, 2026-06-11): fresh h2o climb k6->k8 via from_archive completes — Memory Model B measured the k8 footprint, the rung that previously crashed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…TaskQ scratch - Add tools/dump_mra_trees.cpp + its CMake target (R2/L4 offline MRA tree-skeleton dumper: loads converged orbitals, dumps the adaptive octree per orbital, never solves). CMakeLists already referenced it but the source was untracked. - Broaden the MacroTaskExchangeRow_task.* ignore to any cwd — these scratch files land in the repo root / calc dir when a solve runs from there, not only in the fixture dirs the existing rule covered. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ard + atomic write, re-enable es_analysis PENDING cm_build + cm_es h2 3 validation. No physics change is expected; the es_analysis re-enable is the one part that needs the ES run to confirm no teardown crash. Amend if the build/run reveals an issue. Cleanups (no behavior change): - GroundState: the from_memory_ flag is now vestigial (only from_archive uses the bare ctor, which then overrides it) — remove the flag + guard so prepare() always reloads pristine MOs, and make the bare ctor private to enforce the checkpoint-backed invariant. Removes the latent single-protocol-only footgun. - calc_executor: delete FdResponseExecutor::action_name (byte-identical to the free node_action_name). madqc_adapter: refresh the stale header comment (from_archive + multi-property, not in-memory + R3a-alpha-only). - assemble_beta: accumulate property rows and write response_metadata.json once instead of reloading + rewriting it per (vbc, axis). ES heap-OOB fix (unblocks the parked es_analysis re-enable, R2): - Root cause: the per-root archives use nio=1, so reloading at a different process count than wrote them silently corrupts the WorldContainer and aborts at teardown (the --es-analyze-only --es-load-only repro). Record writer_nproc in roots.json; load_es_roots broadcasts it and fails cleanly on ALL ranks (no divergent-throw deadlock) on a mismatch instead of corrupting the heap. Write roots.json atomically (tmp + rename). - With the cross-np load path guarded, re-enable report_es_analysis: it runs on the in-memory converged state at the solve's own np (no bundle reload), so it never hits the guarded path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…iterations compute_V0x applied K[phi_occ, phi_occ] to the response orbitals by constructing a fresh madness::Exchange on every call, and Exchange::set_bra_and_ket deep-copies bra+ket (all occupied orbitals) each time — so a full copy of the occupied set was made on every V0x apply (x2 for Full's x,y) every iteration. Build K0 once per protocol on ResponseGroundState (K0_alpha; K0_beta for open shell, null for closed) in build_response_ground_state_*, and apply the cached operator from the 12 compute_V0x sites via common_ops::apply_ground_exchange. A null-K0 fallback rebuilds a fresh K[mos,mos], reproducing the old path exactly, so a missed build site cannot drift the numbers. Validated bit-identical: cm_equiv h2o green. The saved copies scale with n_occ, so the win is negligible on h2o and material on c6h6/naphthalene and toward the large-system target. two_electron's response exchange K[amo,x] is unchanged (ket=x depends on the iterate — not cacheable). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The first concrete L2/R5 state-parallel target: fan the ES bundle's per-root work across rank subworlds. Documents the serial-over-roots bottleneck (es_solver.hpp:328 step_rotate_pieces — M serial collective passes), the fan-out/gather decomposition (parallel per-root build+BSH; cheap coupled subspace step), the MacroTaskQ mechanism (MacroTaskExchangeRow pattern at root grain, vs v2's raw-subworld execute_subgroup_state_solve and why MacroTaskQ fits the single-shot-per-iteration ES work), the ground-orbital replication constraint (gated on the 15d pre-flight estimate + node-local shared replica), the serial-vs-parallel parity contract, the MacroTaskQ gotchas (no universe collectives in operator(), subworld teardown fences), and a phased Inc 0-4 prototype plan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…+ apply_A±B Drop the (A−B)(A+B)u = ω²u direction entirely — not pursuing it: - delete solvers/es_solver_full_rpa.hpp (ESSolverFullRPA + ESProblemFullRPA) - kernels/full.hpp: remove apply_AplusB / apply_AminusB + INVARIANCE block - build_response_ground_state.hpp: remove build_es_problem_full_rpa + promote_{tda,full}_to_full_rpa_closed_shell + include - test_v3_es_skeleton: drop --type=full-rpa branch + --load-roots-full - docs 02/15/19 + operator_contracts: mark removed (2026-06) Full ES remains the direct paired-(X,Y) ESSolver<Full, ClosedShell>. Compile-verified: cm_build targets + test_v3_es_skeleton link clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Batch V0x/T0x/BSH over the flattened M*n_occ root bundle (closed-shell TDA) -- one collective pass per op-class instead of M per-root passes. These are pure per-function maps (no cross-function reduction), so the batched result is bit-identical to the per-root path. Gated behind --es-batch (default off); the per-root loop stays as the numerical reference. gamma-exchange (K[phi,x_s](phi), a distinct operator per root) and the focka transforms stay per-root -- those are the Stage-2 (locality-pmap) targets, recorded in docs/19. Verified (4 ranks x 10 threads): h2 3/3 roots bit-for-bit identical, legacy-TDA gate 3/3 PASS; h2o converged root bit-identical, ~7% faster wall. Cuts fence count ~6M -> ~3+3M and fills the task queue with M x more work per fence; the win grows with root/rank count. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Instrument step_rotate_pieces with fence-marked phase timers (gamma, build, subspace, rotate, bsh) printed as one rank-0 ES_TIMING line per iter. Gated behind --es-time (default off); the boundary fences are added only under the flag, so normal runs are untouched. The per-root build loop is split into a gamma-pass (density + Coulomb-exchange) and an ops-pass (V0x/T0x/E0x) purely for attribution -- the kernel calls are independent, so results are unchanged (re-verified h2: 3/3 roots bit-identical, omegas unchanged from pre-split). First measurement (h2o, 3 roots, 4 ranks): build 42%, gamma 36%, bsh 16%, rotate 6%, subspace 0.5%. => 78% of each iter is in the per-root building blocks (the Stage-2 root-fanout target); the M*M subspace coupling is negligible (0.5%), and --es-batch's 7% win lands almost entirely in build (V0x/T0x), ~nothing in BSH. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…Stage-2 draft) Design draft for the two-pmap / single-World locality approach: what a distributed Function is, the binary-op alignment rule (inner_local "ASSUMING same distribution"), the GroupPmap mechanism, NodeReplicated ground state, the once-per-iter shuffle for the M*M subspace step, an honest trade-off section (it's a scaling bet; the Exchange operator is the hard 36%), what is/isn't reusable from v2 subworld state-parallel, and a phased plan (Inc A-D). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…Stage-2 fork Verified the MADNESS output-pmap behavior: an op inherits its result pmap from a SPECIFIC operand (mul_sparse -> left/`a`, via set_impl), not "the per-root one", and compressed gaxpy (merge_trees, funcimpl.h:1233) ASSERTS same pmap. So localizing roots alone crashes assemble_lambda (V0x lands on V_local's pmap, T0x on the root's group -> mismatch). Keeping every per-root intermediate group-local needs pervasive per-op remapping -> favors subworlds (subworld default pmap localizes intermediates for free) vs single-World pmaps (memory-friendly NodeReplicated phi, but fights output-pmap conventions). Real Stage-2 decision; settle before implementing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…worlds (chosen) Chosen Stage-2 path (over doc-20 pmaps). Key findings from MADNESS source: (a) NO shared-memory Function storage exists (no MPI_Win_allocate_shared; Split_type SHARED only makes intra-node communicators; NodeReplicated = 1 copy/node via active messages, not shared pointers); (b) a subworld cannot reference universe data — it must be Cloud-copied in (copy_coeffs_different_ world). So literal shm-shared phi is a core-framework project, out of scope. The GOAL is reachable anyway: NODE-ALIGNED subworlds (Split_type SHARED, one subworld per node) with phi left DISTRIBUTED within each node-subworld = one copy per node, per-rank phi identical to single-World (avoids the v2 per-RANK replication OOM). Intermediate locality is free (subworld default pmap). Per-iter Cloud gather/scatter for the M*M subspace+rotation (heavier than the pmap shuffle; the efficiency unknown to measure). Reuses MacroTaskQ/Cloud; drops v2's per-task phi replication. Phased plan Inc 1-4; first step = node-aligned create_worlds + map test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

make_node_aligned_subworld() splits the universe into one subworld per PHYSICAL NODE via MPI_Comm_split_type(SHARED) — node-contiguous, unlike MacroTaskQ's default color=rank%nsubworld interleaving. This is the prerequisite for the ES node-phi Stage-2 design: with node-aligned subworlds, phi Cloud-copied into each subworld lands DISTRIBUTED over that node's ranks => one copy per node, per-rank phi identical to the current single-World cost (avoids the v2 per-RANK replication OOM). verify_one_subworld_per_node() cross-checks each rank's subworld size against ranks_per_host. New test_node_subworlds (header-only, no solver change, no MRA solve). Verified: 1 rank PASS; 2 nodes x 4 ranks -> 2 subworlds of size 4, NODE_SUBWORLD_TEST PASS. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@k6

…export [R2 Phase 2b] Extend the offline MRA tree-skeleton dumper to also dump converged closed-shell FD response orbitals, so one run yields a combined ground + response product-overlap analysis (the Coulomb/exchange partitioning oracle). --fd [--fd-calc-dir=DIR] walks response_metadata.json fd_states, reads (thresh,k) from protocols[key] + set_response_protocol, parses the <pert> key back into a Perturbation (parse_perturbation = inverse of Perturbation::description), and try_load_fd_state<Type,ClosedShell> each point — static->ResponseStateX (x only), full->ResponseStateXY (x+y) — dumping every response orbital's recon+comp tree via the existing dump_function_trees, into the same out dir as the ground MOs. TDA is an ES-bundle type (no FDPerturbationOf<TDA>) and is excluded. Validated on the allocation (cm_run h2o, dipole_z static @k6, alpha_zz=8.5345 == R3a ref): 5 ground MOs + 5 response orbitals -> combined 10x10 overlap matrix, C in [0,1], symmetric, response-of-orbital-i overlaps ground-orbital-i. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…d copy/node (docs/21) test_node_phi ships φ (synthetic, content-independent) from the universe into node-aligned subworlds and proves the memory claim the Stage-2 design rests on via size()/size_local accounting: Σ(local) over a node's ranks == size() (ratio 1.0), i.e. ONE DISTRIBUTED copy per node, NOT a per-rank replica. Verified (2 nodes x 4 ranks, 12 orbitals): per-rank φ = 175824 coeffs (one copy/node) vs 703296 if per-rank replicated (the v2 OOM cost we avoid) — 4x less per rank at 4 ranks/node; ratio per node = 1.000000; NODE_PHI_TEST PASS. Also confirms the cross-world copy completes without deadlock outside MacroTaskQ. Teardown clears all Functions before finalize (avoids the RecursiveMutex abort). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…code) Per-iteration fan/gather/couple/scatter for the BUILD phase only (BSH stays universe-side until Inc 4). MacroTaskESBuildTDA functor fans each root's build to a subworld (kernels unchanged — subworld-local => aligned), returns [Λ|V0x|E0x|γ] (4*n_occ) per root; universe subspace/rotate/θ/BSH stay byte-identical so bit-identity is easy to reason about. World-bound objects (coulop, Qa, K0) rebuilt subworld-side; K0=null => direct apply_exchange (no nested MacroTaskQ). Gated --es-subworlds=G (0=reference). Staged: 3a stock MacroTaskQ interleaved G=nnodes (flow+bit-identity+Cloud cost, no core change), 3b node-aligned create_worlds (Split_type SHARED) for on-node locality. Open risk: custom result-stride partitioner. Awaits review before code. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ompute (per user) The §1-7 MacroTask build fan-out is stateless (recreates subworlds + reships X/φ every call), which fights the user's goals. §8 captures the real target: persistent node-subworlds + φ replicated once + X REPLICATED across subworlds; per iter build Λ locally (recompute, transient), assemble per-subworld A/S columns, allreduce the M×M SCALAR matrix (Λ never crosses), diagonalize + rotate + BSH + KAIN locally, re-broadcast updated X. Reuses the existing step_recompute_pieces (stream_theta). Hard constraint: matrix + dense rotation couple all roots via co-located-function ops, so X must be replicated (or gathered) — but the 78% build stays local. MacroTask demoted to optional coupling-cost measurement. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…e (docs/22 §8) Proves the "collective to construct the matrix" of the persistent-subworld recompute design: X replicated into node-subworlds, each node computes the subspace-matrix columns it owns (round-robin) via a subworld-collective matrix_inner over its replicated X, only subworld-rank-0 contributes, universe allreduce sums the disjoint columns -> full M×M. Functions NEVER cross worlds; only M×M scalars move. == direct universe matrix to machine eps. Verified: 1 rank and 2 nodes x 4 ranks (M=6) -> max|S_dist-S_direct| = 8.9e-16, SUBSPACE_ALLREDUCE_TEST PASS. Keystone for the rotation+recompute solver restructure (the remaining Stage-2 steps). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… density [R2 Phase 2b-ii] Each FD point now also builds and dumps the closed-shell response density rho^(1) as fd_<pert>__<key>__<fkey>_rho1_{recon,comp}.json, so the response density tree feeds the same slice/spectrum/boxes viz and adds a row to the combined product-overlap matrix. Reuses Kernels<Type,ClosedShell>::compute_density as the single source of truth for the convention (Static folds Y=X -> factor 4; Full uses x+y -> factor 2). It reads only g0.amo, so a 1-field ResponseGroundState suffices (no prepare()/Coulomb/Q). GS orbitals load at native k but the FD state is at the FD k, so a LOCAL copy of the orbitals is reprojected to the active k (project(), as in GroundState::prepare) for the phi*x products; the dumped GS trees stay native. Validated via SLURM batch (job 2023189, hbm-short, non-exclusive): h2o dipole_z static, rho1 norm2=0.664, 764 leaves, max_level 14 (smooth valence-scale polarization density vs core lvl 18); combined 11x11 matrix, rho1 overlaps O-1s core 0.85, HOMO-response 0.26. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tal + rho^(1) cubes [R2 Phase 2b-iii] --cube previously emitted Gaussian cubes only for the ground-state MOs. Thread the molecule + cube settings through dump_fd_states -> dump_fd_point -> dump_block so the FD response orbitals x_i (and y_i for Full) and the response density rho^(1) are also written via the same write_cube_file/eval_cube path, with the same stem as their _boxes.vtk (fd_..._x<i>.cube, fd_..._rho1.cube). Lets ParaView/Avogadro render the response functions as signed isosurface lobes like the MOs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Index/table of contents for the MADWorld parallel runtime developer guide (docs/parallel_runtime_guide/). Chapters and the performance-models companion follow in subsequent commits.

Chapters 1-5 of the parallel runtime developer guide: MPI/SafeMPI, active messages, the RMI thread, the task queue & futures, and global ops & fence. Diagram-driven, grounded in src/madness/world with file:line anchors.

Dataflow tasks, futures, dependency tracking, thread pool, remote-task lifecycle; binary-tree collectives and Dijkstra-style fence termination detection.

…tions) WorldContainer distributed hash map and pmaps; Function/FunctionImpl tree representation, key-aware pmaps and load balancing; and the per-operation deep dive (compress, reconstruct, gaxpy, multiply, inner, apply, truncate) with sweep diagrams and per-node cost.

Per-operation parameter-tuning playbook (runtime/build/numerical knobs, symptom->action); and the parallel function I/O chapter: archive system, the nio single-writer default bottleneck, app restart flows, and an HDF5 / parallel-HDF5 (MPI-IO) analysis with an I/O performance model.

Quantitative time/communication/memory models for the MADWorld runtime and the core MRA operations: the compute/comm/sync decomposition, a worked response-iteration model, the ground-state replication memory model tying to the documented OOM, and how to instrument each model term. Companion to the parallel runtime developer guide.

…re-next

…++17/-Werror) merge_flag_map captured the structured binding 'shard_proto' from the enclosing range-for; capturing a structured binding is a C++20 extension (-Wc++20-extensions), which upstream's new MADNESS_WERROR=ON CI turns into a hard error under clang. Alias it to a named reference first. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

evaleev and others added 30 commits April 30, 2026 08:17

amended handling of special points at box boundaries; special points …

4742b85

…at faces/edges/vertices are considered "in"/neighbors of the respective boxes

Merge pull request m-a-d-n-e-s-s#692 from m-a-d-n-e-s-s/evaleev/fix/s…

32c1b82

…pecial-point-at-box-boundaries amended handling of special points at box boundaries

fix(benchmarks): update SLURM scripts for C2H4, C6H6, and H2O benchma…

f61c879

…rks to correct walltime and repository paths

feat(benchmarks): add error logs for C2H4, C6H6, and H2O benchmarks

36368df

Merge remote-tracking branch 'refs/remotes/origin/molresponse-feature…

d016a61

…-next' into molresponse-feature-next

feat(benchmarks): add execution time logs for C6H6 benchmark

736e3c2

Merge remote-tracking branch 'refs/remotes/origin/molresponse-feature…

2f164a8

…-next' into molresponse-feature-next

docs: document MADNESS_WERROR option and internal warnings target

cc7446a

fix(benchmarks): adjust resource allocation for tyrosine and uracil b…

4429509

…enchmarks

fix(excitedresponse): unify conditional checks for print level in deb…

ba553c8

…ugging outputs

ahurta92 and others added 29 commits June 9, 2026 15:49

docs(molresponse_v3): R3b (madqc multi-property mapping: alpha+beta+r…

a83ede6

…aman+es) landed Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs: add parallel runtime developer guide index

330f287

Index/table of contents for the MADWorld parallel runtime developer guide (docs/parallel_runtime_guide/). Chapters and the performance-models companion follow in subsequent commits.

docs: add runtime guide chapters 1-5 (comm, AM, RMI, taskq, GOP)

1fcd304

Chapters 1-5 of the parallel runtime developer guide: MPI/SafeMPI, active messages, the RMI thread, the task queue & futures, and global ops & fence. Diagram-driven, grounded in src/madness/world with file:line anchors.

docs: add runtime guide chapters 4-5 (taskq/futures, GOP/fence)

03e068e

Dataflow tasks, futures, dependency tracking, thread pool, remote-task lifecycle; binary-tree collectives and Dijkstra-style fence termination detection.

Merge remote-tracking branch 'upstream/master' into molresponse-featu…

1200592

…re-next

ahurta92 merged commit fd03e5c into master Jun 18, 2026
24 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sync master with molresponse-feature-next (v3 development trunk)#7

Sync master with molresponse-feature-next (v3 development trunk)#7
ahurta92 merged 188 commits into
masterfrom
molresponse-feature-next

ahurta92 commented Jun 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

ahurta92 commented Jun 17, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants