feat(regen): add --n-evals + median+CI reporting to distinguish signal from noise by ericchansen · Pull Request #286 · ericchansen/q2mm

ericchansen · 2026-05-27T21:01:37Z

Summary

Adds post-hoc median-of-N ObjectiveFunction reporting to scripts/regenerate_convergence_results.py so convergence artifacts can distinguish real force-field optimization signal from the per-call engine noise documented in q2mm#284 §2 after the q2mm#283 runs.

What changes

Adds --n-evals N with default 1 for backwards-compatible single-call behavior.
Reuses the optimizer's ObjectiveFunction instance for repeated initial/final evaluations after optimization.
Emits median score, 95% CI half-width, median improvement percentage, and a significance flag in validation_results.json.
Preserves ObjectiveFunction counters/history around repeated post-hoc evaluations.
Updates the optimized INFO log line to report median improvement, CI, and significance when N > 1.
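The repeated post-hoc evaluation described in the list above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: `StubObjective` stands in for the real ObjectiveFunction, and the `n_eval` counter name and the `repeat_evaluations` helper are hypothetical. Only the `--n-evals` flag, its default of 1, and the existence of a `history` attribute come from this PR.

```python
import argparse
import statistics


class StubObjective:
    """Stand-in for ObjectiveFunction; the n_eval attribute name is hypothetical."""

    def __init__(self):
        self.n_eval = 0
        self.history = []

    def __call__(self, params):
        self.n_eval += 1
        score = sum(p * p for p in params)
        self.history.append(score)
        return score


def repeat_evaluations(objective, params, n_evals):
    """Call the objective n_evals times, then restore its counter and history."""
    saved_count = objective.n_eval
    saved_len = len(objective.history)
    try:
        return [objective(params) for _ in range(n_evals)]
    finally:
        # Undo the bookkeeping side effects of the extra calls.
        objective.n_eval = saved_count
        del objective.history[saved_len:]


parser = argparse.ArgumentParser()
parser.add_argument("--n-evals", type=int, default=1)
args = parser.parse_args(["--n-evals", "3"])

objective = StubObjective()
objective([1.0, 2.0])  # pretend the optimizer already made one call
scores = repeat_evaluations(objective, [0.5, 0.5], args.n_evals)
median_score = statistics.median(scores)
```

Restoring the state in a `finally` block means a failing evaluation still leaves the optimizer's own bookkeeping untouched.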

Backwards compatibility

Existing consumers can keep reading these unchanged fields:

initial_obj_score
final_obj_score
improvement_pct

The new fields are additive: initial_obj_score_median, initial_obj_score_ci95, final_obj_score_median, final_obj_score_ci95, improvement_pct_median, and improvement_significant.
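A consumer that wants the new fields when present, but still works on older artifacts, might look like the sketch below. The `read_improvement` helper and the sample payloads are hypothetical, and the numbers are made up. Because the PR description names `_median` fields while the commit message names `_mean` fields, the sketch tries both suffixes rather than assuming which one shipped.

```python
import json

# Key suffixes to try, in order. "_median" is from the PR description and
# "_mean" from the commit message; which one ships is not assumed here.
SUFFIXES = ("_median", "_mean")


def read_improvement(results):
    """Return (improvement_pct, significant) from a validation_results dict."""
    for suffix in SUFFIXES:
        key = "improvement_pct" + suffix
        if key in results:
            return results[key], results.get("improvement_significant")
    # Legacy artifact: single-call value, no significance verdict available.
    return results["improvement_pct"], None


# Illustrative payloads with made-up numbers, not real q2mm output.
legacy = json.loads('{"improvement_pct": 80.0}')
extended = json.loads(
    '{"improvement_pct": 80.0, "improvement_pct_median": 79.5,'
    ' "improvement_significant": true}'
)

legacy_result = read_improvement(legacy)
extended_result = read_improvement(extended)
```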

Validation

/home/eric/repos/q2mm/.venv/bin/python -m ruff check scripts/ q2mm/
/home/eric/repos/q2mm/.venv/bin/python -m ruff format --check scripts/ q2mm/
PYTHONPATH=/home/eric/repos/q2mm-feat-regen-median-of-n /home/eric/repos/q2mm/.venv/bin/python -m pytest test/ -x -q -m "not (openmm or tinker or jax or jax_md or psi4)"
Q2MM_SUPPORTING_INFO=/home/eric/repos/q2mm/validation/supporting-info PYTHONPATH=/home/eric/repos/q2mm-feat-regen-median-of-n /home/eric/repos/q2mm/.venv/bin/python scripts/regenerate_convergence_results.py --system ch3f --output-dir results/regen-n1
Q2MM_SUPPORTING_INFO=/home/eric/repos/q2mm/validation/supporting-info PYTHONPATH=/home/eric/repos/q2mm-feat-regen-median-of-n /home/eric/repos/q2mm/.venv/bin/python scripts/regenerate_convergence_results.py --system ch3f --n-evals 3 --output-dir results/regen-n3

Next step (separate PR)

Phase D will re-run the three metal-TS systems with --n-evals 5 and update the docs with significant / no-improvement / inconclusive verdicts.

Copilot

Pull request overview

This PR enhances scripts/regenerate_convergence_results.py to optionally repeat post-hoc ObjectiveFunction evaluations (--n-evals N) and report median-based scores plus uncertainty metrics, so convergence artifacts can better separate real optimization signal from per-call engine noise.

Changes:

Add --n-evals CLI option (default 1) and record it in provenance.
Re-evaluate ObjectiveFunction at initial/final parameters N times and emit median/CI fields + a significance flag into validation_results.json.
Update the “optimized” log line to include median improvement and CI when N > 1.

…from noise

Add post-hoc repeated ObjectiveFunction evaluation for convergence regeneration so q2mm#284 §2 noise findings from q2mm#283 can be reported as measurement uncertainty instead of single-call verdicts.

The validation JSON keeps the legacy initial_obj_score, final_obj_score, and improvement_pct fields for existing consumers, and adds:

- initial_obj_score_mean, initial_obj_score_ci95
- final_obj_score_mean, final_obj_score_ci95
- improvement_pct_mean, improvement_significant

Reports the sample mean (not median) paired with a Student-t 95% CI half-width: the t-distribution describes the sampling distribution of the mean, not the median. For n ≤ 10 with the bounded engine noise we measure here, sample mean and median are nearly identical; the mean is the right center to pair with a t-CI.

ObjectiveFunction.history is restored between samples by truncating back to its original length (O(1)) rather than copying-then-replacing (O(len)), which is important when the optimizer has accumulated many evaluations.

Validation:

- ruff check + format clean
- 680 unit tests pass (24 new tests from #285 included)
- ch3f smoke run with --n-evals 3 produces both legacy and new fields; ci95 ≈ 1e-15 (deterministic single-mol system); SIGNIFICANT verdict as expected (99.83 % vs ~0 CI)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
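The mean-plus-Student-t reporting in the commit message can be illustrated with a small sketch. It is not the merged implementation: `mean_ci95` and `is_significant` are hypothetical names, the critical values are the standard two-sided 95% table for 1 to 9 degrees of freedom (so n ≤ 10), and the significance rule used here (the two intervals do not overlap) is an assumption, since the PR does not spell out its rule.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
T_CRIT_95 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262,
}


def mean_ci95(samples):
    """Return (sample mean, 95% CI half-width); half-width is None for n < 2."""
    n = len(samples)
    mean = statistics.mean(samples)
    if n < 2:
        return mean, None
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(samples) / math.sqrt(n)
    return mean, T_CRIT_95[n - 1] * sem


def is_significant(initial, final):
    """Assumed rule: the score drop must exceed both half-widths combined."""
    init_mean, init_hw = mean_ci95(initial)
    final_mean, final_hw = mean_ci95(final)
    if init_hw is None or final_hw is None:
        return None  # a single call per point cannot support a verdict
    return (init_mean - final_mean) > (init_hw + final_hw)


# Made-up scores for illustration only.
center, half_width = mean_ci95([1.0, 2.0, 3.0])
verdict = is_significant([10.0, 10.2, 9.8], [1.0, 1.1, 0.9])
```

With N = 1 the half-width is undefined, which is why the sketch returns `None` instead of a verdict in that case.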

Copilot AI review requested due to automatic review settings May 27, 2026 21:01

Copilot started reviewing on behalf of ericchansen May 27, 2026 21:01

Copilot AI reviewed May 27, 2026

Comment thread scripts/regenerate_convergence_results.py Outdated

Comment thread scripts/regenerate_convergence_results.py

ericchansen force-pushed the feat/regen-median-of-n branch from 853c61a to edee1ad Compare May 27, 2026 21:17

ericchansen merged commit 86d8483 into master May 27, 2026
11 checks passed

ericchansen deleted the feat/regen-median-of-n branch May 27, 2026 21:27

ericchansen mentioned this pull request May 28, 2026

fix(jax_engine): correct gradients for MM3 angle term at near-collinear geometries #288

Open

ericchansen merged 1 commit into master from feat/regen-median-of-n
