docs: confirm 'no real improvement' verdicts for metal-TS systems (n=10) by ericchansen · Pull Request #287 · ericchansen/q2mm

ericchansen · 2026-05-28T00:32:31Z

Summary

Reruns pd-allyl, rh-conjugate, and heck-relay with --n-evals 10 (q2mm#286, now on master) and rewrites each per-system doc page with the statistically defensible verdict rather than the earlier "within noise" caveat from #283.

Companion data PR: ericchansen/q2mm-data#9

Verdicts (now decisive)

| System | Mean Δ% | CI₉₅ | Verdict |
|---|---:|---:|---|
| pd-allyl | −0.029 % | ±0.34 % | NOT SIGNIFICANT |
| rh-conjugate | −0.080 % | ±1.18 % | NOT SIGNIFICANT |
| heck-relay* | −0.59 % | ±3.26 % | NOT SIGNIFICANT |

* heck-relay run with --ratio-tol none (ratio = 1.378, formally fails default gate); even with the gate bypassed, the JaxLoss surrogate broke down (2 non-finite line-search values) and the result is inside the noise band.
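The verdict column can be reproduced from the table values alone. The sketch below assumes the rule is "NOT SIGNIFICANT when the 95% interval (mean ± half-width) contains zero"; the PR does not spell out the exact rule, and the `Result` class and `verdict` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    system: str
    mean_delta_pct: float   # mean Δ% over the n=10 evaluations
    ci95_half_width: float  # CI₉₅ half-width, in percentage points


def verdict(r: Result) -> str:
    """Assumed rule: a change is significant only if the 95% CI excludes zero."""
    lo = r.mean_delta_pct - r.ci95_half_width
    hi = r.mean_delta_pct + r.ci95_half_width
    return "NOT SIGNIFICANT" if lo <= 0.0 <= hi else "SIGNIFICANT"


# Values copied from the verdict table in this PR.
RESULTS = [
    Result("pd-allyl", -0.029, 0.34),
    Result("rh-conjugate", -0.080, 1.18),
    Result("heck-relay", -0.59, 3.26),
]

for r in RESULTS:
    print(f"{r.system:13s} {r.mean_delta_pct:+.3f} % ± {r.ci95_half_width:.2f} % -> {verdict(r)}")
```

Under this assumed rule, all three intervals straddle zero, which matches the verdicts reported above.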

Doc changes per system

Each page is rewritten so the earlier "within noise floor, cannot claim" caveat becomes a "Confirmed: published Wahlers/Rosales FF sits at a JaxLoss local minimum" success callout. The CI₉₅ excludes any improvement larger than the per-system noise floor, so this is now a defensible "no real improvement available", not "we can't tell".

Specific updates:

pd-allyl: table now shows mean ± CI₉₅; caveat replaced with success callout; gap analysis unchanged
rh-conjugate: table same shape; the 4602-ratio non-determinism caveat is removed (ratio is now stable at 1.01 with n=10); issue #278 ("Investigate rh-conjugate JaxLoss ratio non-determinism (0.46-0.96-4602 across runs)") stays closed
heck-relay: recommendation to keep the default ratio gate (rather than pass --ratio-tol none) strengthened; with statistical rigor in place, the gate bypass demonstrably doesn't unlock useful optimization
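To make the ratio-gate discussion concrete, here is a minimal sketch of how such a gate might behave. It assumes the gate accepts a run when the ratio is within `ratio_tol` of 1.0; that form, and the function name `passes_ratio_gate`, are assumptions for illustration rather than q2mm's actual implementation. The default of 0.15 and the ratios 1.01 and 1.378 are the values quoted in this PR.

```python
import math
from typing import Optional


def passes_ratio_gate(ratio: float, ratio_tol: Optional[float] = 0.15) -> bool:
    """Hypothetical gate: accept when |ratio - 1| <= ratio_tol.

    ratio_tol=None models `--ratio-tol none`, i.e. the gate is bypassed.
    """
    if ratio_tol is None:
        return True
    if not math.isfinite(ratio):
        return False  # a non-finite ratio can never satisfy a finite tolerance
    return abs(ratio - 1.0) <= ratio_tol


print(passes_ratio_gate(1.01))                   # rh-conjugate ratio at n=10
print(passes_ratio_gate(1.378))                  # heck-relay ratio, default gate
print(passes_ratio_gate(1.378, ratio_tol=None))  # heck-relay with the gate bypassed
```

Under this assumption, 1.378 is rejected by the default tolerance and only passes when the gate is disabled, which is consistent with heck-relay "formally failing" the default gate.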

Why these verdicts matter

The earlier #283 results flagged these three systems as "within noise" but couldn't say whether there was a real signal hiding under the per-call GPU noise. The n=10 + Student-t 95% CI confirms there isn't:

All three CIs exclude any improvement larger than the per-system noise floor
All three mean Δ% values are well below any publishable improvement claim
The published Wahlers/Rosales FFs sit at JaxLoss local minima for our engine
Further improvement requires the engine-parity work in #284 ("MM3 energy is non-smooth: JAX grad returns wrong subgradients at converged metal-TS geometries"), not optimizer tweaking
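For readers unfamiliar with the n=10 Student-t interval used above, the following sketch shows the standard calculation: mean, sample standard deviation, standard error, and the two-sided 95% critical value for 9 degrees of freedom (about 2.262). The function name `mean_ci95` and the sample values are hypothetical placeholders, not data from the companion PR, and this is not necessarily how `--n-evals` computes its interval internally.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t for 9 degrees of freedom (n = 10).
T_975_DF9 = 2.262


def mean_ci95(samples):
    """Return (mean, half_width) of a Student-t 95% CI for exactly 10 samples."""
    n = len(samples)
    if n != 10:
        raise ValueError("this sketch hardcodes the t critical value for n = 10")
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(n)  # standard error of the mean
    return mean, T_975_DF9 * sem


# Hypothetical per-evaluation Δ% values, for illustration only.
delta_pct = [0.2, -0.3, 0.1, -0.4, 0.0, 0.3, -0.2, 0.1, -0.1, 0.0]
mean, half = mean_ci95(delta_pct)
print(f"mean Δ% = {mean:+.3f}, CI95 = ±{half:.3f}")
```

A verdict of "no real improvement" then corresponds to the interval `mean ± half` containing zero while being narrow enough to rule out any gain of practical size.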

Validation

ruff check + format clean (no Python changes; docs only)
properdocs build clean (no new warnings)
All numbers in tables traceable to JSON in companion data PR ericchansen/q2mm-data#9

Out of scope

The MM3 non-smooth gradient bug (#284, "MM3 energy is non-smooth: JAX grad returns wrong subgradients at converged metal-TS geometries"): confirmed by these runs to be the real blocker for further improvement on metal-TS systems
Engine-parity rewrite of JaxEngine.minimize: multi-day, separate PR

Copilot

Pull request overview

Updates the published-FF system documentation for three metal-TS benchmarks (pd-allyl, rh-conjugate, heck-relay) to report statistically defensible “no real improvement” verdicts using --n-evals 10 and 95% confidence intervals, replacing the earlier “within noise” caveats.

Changes:

Replace “within noise” wording with decisive “NOT SIGNIFICANT” verdicts based on n=10 + CI₉₅.
Update per-system benchmark tables to include mean ObjectiveFunction values and CI₉₅, plus CI₉₅ on mean Δ%.
Revise narrative callouts/recommendations (including --ratio-tol none discussion for heck-relay).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| docs/systems/rh-conjugate.md | Updates benchmark table to n=10 mean/CI₉₅ and rewrites interpretation as “no detectable improvement.” |
| docs/systems/pd-allyl.md | Updates benchmark table to n=10 mean/CI₉₅ and reframes result as confirmed local minimum. |
| docs/systems/heck-relay.md | Updates benchmark table to n=10 mean/CI₉₅ with `--ratio-tol none` and strengthens recommendation to keep the ratio gate. |


…e/heck-relay (n=10)

Reruns the three "within noise" published-FF systems with the --n-evals 10 statistical evaluation (q2mm#286, landed on master). The n=10 samples give a Student-t 95% CI tight enough to make confident scientific verdicts where the earlier single-call PR #283 could only flag the results as "within noise":

| System          | Mean Δ%  | CI₉₅   | Verdict         |
|-----------------|---------:|-------:|-----------------|
| pd-allyl        | -0.029%  | ±0.34% | NOT SIGNIFICANT |
| rh-conjugate    | -0.080%  | ±1.18% | NOT SIGNIFICANT |
| heck-relay*     | -0.59%   | ±3.26% | NOT SIGNIFICANT |

(*) heck-relay run with --ratio-tol none; even with the gate bypassed JaxLoss broke down (2 non-finite line-search values).

Each per-system page is rewritten:

- The earlier "within noise floor, cannot claim" caveat is replaced with a "Confirmed: published Wahlers/Rosales FF sits at a JaxLoss local minimum" success callout: the CI excludes any improvement larger than the per-system noise floor, so this is now a defensible "no real improvement available", not "we can't tell".
- Metric tables updated to show mean ± CI₉₅ %, not single-call values.
- The 4602-ratio non-determinism caveat on rh-conjugate is removed (with n=10 the ratio is stable at 1.01) and #278 stays closed.
- heck-relay's "keep default ratio_tol=0.15" recommendation is strengthened: with statistical rigor in place, --ratio-tol none demonstrably doesn't unlock useful optimization.

Companion data PR with the regenerated JSON + FFs: ericchansen/q2mm-data#9.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings May 28, 2026 00:32

Copilot started reviewing on behalf of ericchansen May 28, 2026 00:32 View session

Copilot AI reviewed May 28, 2026

Comment thread docs/systems/pd-allyl.md

Comment thread docs/systems/rh-conjugate.md

Comment thread docs/systems/pd-allyl.md Outdated

Comment thread docs/systems/rh-conjugate.md Outdated

ericchansen force-pushed the feat/metals-noise-honest-redo branch from 7e21fdd to f29b03a Compare May 28, 2026 01:29

Copilot AI review requested due to automatic review settings May 28, 2026 03:08

ericchansen force-pushed the feat/metals-noise-honest-redo branch from f29b03a to dde4478 Compare May 28, 2026 03:08

Copilot started reviewing on behalf of ericchansen May 28, 2026 03:08 View session

ericchansen merged commit cdbd5ff into master May 28, 2026
5 of 6 checks passed

ericchansen deleted the feat/metals-noise-honest-redo branch May 28, 2026 03:09

ericchansen review requested due to automatic review settings May 28, 2026 03:35

ericchansen mentioned this pull request May 28, 2026

fix(jax_engine): correct gradients for MM3 angle term at near-collinear geometries #288

Open

ericchansen merged 1 commit into master from feat/metals-noise-honest-redo
