docs: confirm 'no real improvement' verdicts for metal-TS systems (n=10)#287
Merged
Conversation
There was a problem hiding this comment.
Pull request overview
Updates the published-FF system documentation for three metal-TS benchmarks (pd-allyl, rh-conjugate, heck-relay) to report statistically defensible “no real improvement” verdicts using --n-evals 10 and 95% confidence intervals, replacing the earlier “within noise” caveats.
Changes:
- Replace “within noise” wording with decisive “NOT SIGNIFICANT” verdicts based on n=10 + CI₉₅.
- Update per-system benchmark tables to include mean ObjectiveFunction values and CI₉₅, plus CI₉₅ on mean Δ%.
- Revise narrative callouts/recommendations (including
--ratio-tol nonediscussion for heck-relay).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| docs/systems/rh-conjugate.md | Updates benchmark table to n=10 mean/CI₉₅ and rewrites interpretation as “no detectable improvement.” |
| docs/systems/pd-allyl.md | Updates benchmark table to n=10 mean/CI₉₅ and reframes result as confirmed local minimum. |
| docs/systems/heck-relay.md | Updates benchmark table to n=10 mean/CI₉₅ with --ratio-tol none and strengthens recommendation to keep the ratio gate. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
7e21fdd to
f29b03a
Compare
…e/heck-relay (n=10) Reruns the three "within noise" published-FF systems with the --n-evals 10 statistical evaluation (q2mm#286, landed on master). The n=10 samples give a Student-t 95% CI tight enough to make confident scientific verdicts where the earlier single-call PR #283 could only flag the results as "within noise": | System | Mean Δ% | CI₉₅ | Verdict | |-----------------|---------:|-------:|----------------| | pd-allyl | -0.029% | ±0.34% | NOT SIGNIFICANT | | rh-conjugate | -0.080% | ±1.18% | NOT SIGNIFICANT | | heck-relay* | -0.59% | ±3.26% | NOT SIGNIFICANT | (*) heck-relay run with --ratio-tol none; even with the gate bypassed JaxLoss broke down (2 non-finite line-search values). Each per-system page is rewritten: - The earlier "within noise floor, cannot claim" caveat is replaced with a "Confirmed: published Wahlers/Rosales FF sits at a JaxLoss local minimum" success callout — the CI excludes any improvement larger than the per-system noise floor, so this is now a defensible "no real improvement available", not "we can't tell". - Metric tables updated to show mean ± CI₉₅ %, not single-call values. - The 4602-ratio non-determinism caveat on rh-conjugate is removed (with n=10 the ratio is stable at 1.01) and #278 stays closed. - heck-relay's "keep default ratio_tol=0.15" recommendation is strengthened: with statistical rigor in place, --ratio-tol none demonstrably doesn't unlock useful optimization. Companion data PR with the regenerated JSON + FFs: ericchansen/q2mm-data#9. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
f29b03a to
dde4478
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Reruns pd-allyl, rh-conjugate, and heck-relay with
--n-evals 10(q2mm#286, now on master) and rewrites each per-system doc page with the statistically defensible verdict rather than the earlier "within noise" caveat from #283.Companion data PR: ericchansen/q2mm-data#9
Verdicts (now decisive)
* heck-relay run with
--ratio-tol none(ratio = 1.378, formally fails default gate); even with the gate bypassed, the JaxLoss surrogate broke down (2 non-finite line-search values) and the result is inside the noise band.Doc changes per system
Each page is rewritten so the earlier "within noise floor, cannot claim" caveat becomes a "Confirmed: published Wahlers/Rosales FF sits at a JaxLoss local minimum" success callout. The CI₉₅ excludes any improvement larger than the per-system noise floor, so this is now a defensible "no real improvement available", not "we can't tell".
Specific updates:
--ratio-tol nonerecommendation strengthened: with statistical rigor in place, the gate bypass demonstrably doesn't unlock useful optimizationWhy these verdicts matter
The earlier #283 results flagged these three systems as "within noise" but couldn't say whether there was a real signal hiding under the per-call GPU noise. The n=10 + Student-t 95% CI confirms there isn't:
Validation
Out of scope
JaxEngine.minimize— multi-day, separate PR