
docs: confirm 'no real improvement' verdicts for metal-TS systems (n=10)#287

Merged
ericchansen merged 1 commit into master from feat/metals-noise-honest-redo
May 28, 2026

Conversation

@ericchansen
Owner

Summary

Reruns pd-allyl, rh-conjugate, and heck-relay with --n-evals 10 (q2mm#286, now on master) and rewrites each per-system doc page with the statistically defensible verdict rather than the earlier "within noise" caveat from #283.

Companion data PR: ericchansen/q2mm-data#9

Verdicts (now decisive)

| System       | Mean Δ%   | CI₉₅     | Verdict         |
|--------------|----------:|---------:|-----------------|
| pd-allyl     | −0.029 %  | ±0.34 %  | NOT SIGNIFICANT |
| rh-conjugate | −0.080 %  | ±1.18 %  | NOT SIGNIFICANT |
| heck-relay*  | −0.59 %   | ±3.26 %  | NOT SIGNIFICANT |

* heck-relay run with --ratio-tol none (ratio = 1.378, formally fails default gate); even with the gate bypassed, the JaxLoss surrogate broke down (2 non-finite line-search values) and the result is inside the noise band.
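The gate behaviour described in the footnote can be sketched as follows. This is a minimal illustration, not q2mm's implementation: it assumes the gate accepts a run when the ratio lies within `ratio_tol` of 1.0 (the commit message quotes a default of `ratio_tol=0.15`), and that `--ratio-tol none` disables the check. The function name `passes_ratio_gate` is hypothetical.

```python
from typing import Optional


def passes_ratio_gate(ratio: float, ratio_tol: Optional[float] = 0.15) -> bool:
    """Hypothetical gate: accept when |ratio - 1| <= ratio_tol.

    ASSUMPTION: the real q2mm gate may be defined differently; only the
    default tolerance (0.15) and the two ratios below come from this PR.
    Passing ratio_tol=None mimics `--ratio-tol none` (gate bypassed).
    """
    if ratio_tol is None:
        return True
    return abs(ratio - 1.0) <= ratio_tol


# Ratios quoted in the PR description.
rh_conjugate_ok = passes_ratio_gate(1.01)           # within tolerance
heck_relay_ok = passes_ratio_gate(1.378)            # outside tolerance
heck_relay_bypassed = passes_ratio_gate(1.378, None)  # gate disabled
```

Under that assumed definition, rh-conjugate's ratio of 1.01 passes the default gate while heck-relay's 1.378 only proceeds when the gate is bypassed, which matches the footnote.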

Doc changes per system

Each page is rewritten so the earlier "within noise floor, cannot claim" caveat becomes a "Confirmed: published Wahlers/Rosales FF sits at a JaxLoss local minimum" success callout. The CI₉₅ excludes any improvement larger than the per-system noise floor, so this is now a defensible "no real improvement available", not "we can't tell".
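One way to read the verdicts above, sketched in Python: a result is called NOT SIGNIFICANT when the interval mean ± CI₉₅ contains zero. That decision rule is an assumption about how the verdict column was derived; the numbers are the ones reported in the table, and the names `REPORTED` and `verdict` are illustrative only.

```python
# (mean delta %, CI95 half-width %) as reported in the PR description.
REPORTED = {
    "pd-allyl": (-0.029, 0.34),
    "rh-conjugate": (-0.080, 1.18),
    "heck-relay": (-0.59, 3.26),
}


def verdict(mean_delta_pct: float, ci95_half_width_pct: float) -> str:
    """ASSUMED rule: not significant if the 95% interval contains zero."""
    lower = mean_delta_pct - ci95_half_width_pct
    upper = mean_delta_pct + ci95_half_width_pct
    return "NOT SIGNIFICANT" if lower <= 0.0 <= upper else "SIGNIFICANT"


verdicts = {name: verdict(mean, half) for name, (mean, half) in REPORTED.items()}
for name, label in verdicts.items():
    print(f"{name}: {label}")
```

For all three systems the half-width is larger than the magnitude of the mean, so zero lies inside each interval.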

Specific updates:

  • pd-allyl: table now shows mean ± CI₉₅; caveat replaced with success callout; gap analysis unchanged
  • rh-conjugate: table has the same shape; the 4602-ratio non-determinism caveat is removed (the ratio is now stable at 1.01 with n=10); issue #278, "Investigate rh-conjugate JaxLoss ratio non-determinism (0.46-0.96-4602 across runs)", stays closed
  • heck-relay: recommendation on --ratio-tol none strengthened; with statistical rigor in place, the gate bypass demonstrably doesn't unlock useful optimization

Why these verdicts matter

The earlier #283 results flagged these three systems as "within noise" but couldn't say whether there was a real signal hiding under the per-call GPU noise. The n=10 + Student-t 95% CI confirms there isn't.
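For reference, a Student-t 95% confidence interval on n=10 samples can be computed as in the sketch below. It is a generic textbook calculation, not the q2mm `--n-evals` code path; the function name `mean_ci95` and the sample values are hypothetical. The critical value 2.262 is the standard two-sided 95% Student-t quantile for 9 degrees of freedom.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for n - 1 = 9 degrees of freedom.
T_CRIT_95_DF9 = 2.262


def mean_ci95(samples: list[float]) -> tuple[float, float]:
    """Return (mean, CI95 half-width) for exactly 10 samples."""
    n = len(samples)
    if n != 10:
        raise ValueError("this sketch hard-codes the critical value for n=10")
    mean = statistics.fmean(samples)
    # Standard error of the mean from the sample (n - 1) standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(n)
    return mean, T_CRIT_95_DF9 * sem


# HYPOTHETICAL per-evaluation delta-% values, for illustration only.
example_deltas = [-1.0, 1.0] * 5
example_mean, example_half_width = mean_ci95(example_deltas)
print(f"mean = {example_mean:+.3f} %, CI95 = ±{example_half_width:.3f} %")
```

A larger n would need a different critical value (for example from `scipy.stats.t.ppf`), which is why the sketch refuses any sample size other than 10.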

Validation

Out of scope

Copilot AI review requested due to automatic review settings May 28, 2026 00:32

Copilot AI left a comment


Pull request overview

Updates the published-FF system documentation for three metal-TS benchmarks (pd-allyl, rh-conjugate, heck-relay) to report statistically defensible “no real improvement” verdicts using --n-evals 10 and 95% confidence intervals, replacing the earlier “within noise” caveats.

Changes:

  • Replace “within noise” wording with decisive “NOT SIGNIFICANT” verdicts based on n=10 + CI₉₅.
  • Update per-system benchmark tables to include mean ObjectiveFunction values and CI₉₅, plus CI₉₅ on mean Δ%.
  • Revise narrative callouts/recommendations (including --ratio-tol none discussion for heck-relay).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
|------|-------------|
| docs/systems/rh-conjugate.md | Updates benchmark table to n=10 mean/CI₉₅ and rewrites interpretation as “no detectable improvement.” |
| docs/systems/pd-allyl.md | Updates benchmark table to n=10 mean/CI₉₅ and reframes result as confirmed local minimum. |
| docs/systems/heck-relay.md | Updates benchmark table to n=10 mean/CI₉₅ with --ratio-tol none and strengthens recommendation to keep the ratio gate. |


Comment thread docs/systems/pd-allyl.md
Comment thread docs/systems/rh-conjugate.md
Comment thread docs/systems/pd-allyl.md Outdated
Comment thread docs/systems/rh-conjugate.md Outdated
@ericchansen ericchansen force-pushed the feat/metals-noise-honest-redo branch from 7e21fdd to f29b03a Compare May 28, 2026 01:29
…e/heck-relay (n=10)

Reruns the three "within noise" published-FF systems with the
--n-evals 10 statistical evaluation (q2mm#286, landed on master).
The n=10 samples give a Student-t 95% CI tight enough to make
confident scientific verdicts where the earlier single-call PR
#283 could only flag the results as "within noise":

| System          | Mean Δ%  | CI₉₅   | Verdict        |
|-----------------|---------:|-------:|----------------|
| pd-allyl        | -0.029%  | ±0.34% | NOT SIGNIFICANT |
| rh-conjugate    | -0.080%  | ±1.18% | NOT SIGNIFICANT |
| heck-relay*     | -0.59%   | ±3.26% | NOT SIGNIFICANT |

(*) heck-relay run with --ratio-tol none; even with the gate
bypassed JaxLoss broke down (2 non-finite line-search values).

Each per-system page is rewritten:

- The earlier "within noise floor, cannot claim" caveat is replaced
  with a "Confirmed: published Wahlers/Rosales FF sits at a JaxLoss
  local minimum" success callout — the CI excludes any improvement
  larger than the per-system noise floor, so this is now a
  defensible "no real improvement available", not "we can't tell".
- Metric tables updated to show mean ± CI₉₅ %, not single-call
  values.
- The 4602-ratio non-determinism caveat on rh-conjugate is removed
  (with n=10 the ratio is stable at 1.01) and #278 stays closed.
- heck-relay's "keep default ratio_tol=0.15" recommendation is
  strengthened: with statistical rigor in place, --ratio-tol none
  demonstrably doesn't unlock useful optimization.

Companion data PR with the regenerated JSON + FFs:
ericchansen/q2mm-data#9.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 28, 2026 03:08
@ericchansen ericchansen force-pushed the feat/metals-noise-honest-redo branch from f29b03a to dde4478 Compare May 28, 2026 03:08
@ericchansen ericchansen merged commit cdbd5ff into master May 28, 2026
5 of 6 checks passed
@ericchansen ericchansen deleted the feat/metals-noise-honest-redo branch May 28, 2026 03:09
@ericchansen ericchansen review requested due to automatic review settings May 28, 2026 03:35
2 participants