feat(regen): add --n-evals + median+CI reporting to distinguish signal from noise#286

Merged
ericchansen merged 1 commit into master from feat/regen-median-of-n
May 27, 2026

Conversation

@ericchansen
Owner

Summary

Adds post-hoc median-of-N ObjectiveFunction reporting to scripts/regenerate_convergence_results.py so convergence artifacts can distinguish real force-field optimization signal from the per-call engine noise documented in q2mm#284 §2 after the q2mm#283 runs.

What changes

  • Adds --n-evals N with default 1 for backwards-compatible single-call behavior.
  • Reuses the optimizer's ObjectiveFunction instance for repeated initial/final evaluations after optimization.
  • Emits median score, 95% CI half-width, median improvement percentage, and a significance flag in validation_results.json.
  • Preserves ObjectiveFunction counters/history around repeated post-hoc evaluations.
  • Updates the optimized INFO log line to report median improvement, CI, and significance when N > 1.
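
The flag described above could be wired up roughly as follows. This is a minimal sketch, not the script's actual parser: `--system`, `--output-dir`, and `--n-evals` appear in this PR, but the `positive_int` validator, the help strings, and the `build_parser` helper are assumptions made for illustration.

```python
import argparse


def positive_int(text: str) -> int:
    """Parse a strictly positive integer (hypothetical validator for --n-evals)."""
    value = int(text)
    if value < 1:
        raise argparse.ArgumentTypeError("--n-evals must be >= 1")
    return value


def build_parser() -> argparse.ArgumentParser:
    """Build an illustrative parser; only the flag names come from the PR."""
    parser = argparse.ArgumentParser(
        description="Regenerate convergence results (illustrative sketch)."
    )
    parser.add_argument("--system", help="System to regenerate, e.g. ch3f.")
    parser.add_argument("--output-dir", help="Directory for regenerated artifacts.")
    parser.add_argument(
        "--n-evals",
        type=positive_int,
        default=1,  # default keeps the single-call behavior
        metavar="N",
        help="Number of post-hoc objective evaluations per parameter set.",
    )
    return parser


args = build_parser().parse_args(["--system", "ch3f", "--n-evals", "3"])
print(args.n_evals)
```

With the default of 1, invocations that omit `--n-evals` behave as before.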

Backwards compatibility

Existing consumers can keep reading these unchanged fields:

  • initial_obj_score
  • final_obj_score
  • improvement_pct

The new fields are additive: initial_obj_score_median, initial_obj_score_ci95, final_obj_score_median, final_obj_score_ci95, improvement_pct_median, and improvement_significant.
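
A consumer that wants the new fields when present, but still works on older artifacts, might look like the sketch below. It assumes the fields sit in a flat JSON object, which this PR does not confirm, and it checks both the `_median` names listed here and the `_mean` names listed in the commit message in this thread. The `read_improvement` helper and the numbers in the example payload are placeholders, not real results.

```python
import json


def read_improvement(results: dict) -> dict:
    """Return the legacy improvement plus the repeated-evaluation fields if present."""
    center = None
    # Accept either naming used in this PR thread (assumption: flat keys).
    for key in ("improvement_pct_mean", "improvement_pct_median"):
        if key in results:
            center = results[key]
            break
    return {
        "improvement_pct": results["improvement_pct"],  # legacy field, always read
        "improvement_pct_center": center,  # None on older artifacts
        "improvement_significant": results.get("improvement_significant"),
    }


# Placeholder payload for illustration only; not taken from a real run.
payload = json.loads(
    '{"initial_obj_score": 10.0, "final_obj_score": 5.0, "improvement_pct": 50.0}'
)
print(read_improvement(payload))
```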

Validation

  • /home/eric/repos/q2mm/.venv/bin/python -m ruff check scripts/ q2mm/
  • /home/eric/repos/q2mm/.venv/bin/python -m ruff format --check scripts/ q2mm/
  • PYTHONPATH=/home/eric/repos/q2mm-feat-regen-median-of-n /home/eric/repos/q2mm/.venv/bin/python -m pytest test/ -x -q -m "not (openmm or tinker or jax or jax_md or psi4)"
  • Q2MM_SUPPORTING_INFO=/home/eric/repos/q2mm/validation/supporting-info PYTHONPATH=/home/eric/repos/q2mm-feat-regen-median-of-n /home/eric/repos/q2mm/.venv/bin/python scripts/regenerate_convergence_results.py --system ch3f --output-dir results/regen-n1
  • Q2MM_SUPPORTING_INFO=/home/eric/repos/q2mm/validation/supporting-info PYTHONPATH=/home/eric/repos/q2mm-feat-regen-median-of-n /home/eric/repos/q2mm/.venv/bin/python scripts/regenerate_convergence_results.py --system ch3f --n-evals 3 --output-dir results/regen-n3

Next step (separate PR)

Phase D will re-run the three metal-TS systems with --n-evals 5 and update the docs with significant / no-improvement / inconclusive verdicts.

Copilot AI review requested due to automatic review settings May 27, 2026 21:01

Copilot AI left a comment

Pull request overview

This PR enhances scripts/regenerate_convergence_results.py to optionally repeat post-hoc ObjectiveFunction evaluations (--n-evals N) and report median-based scores plus uncertainty metrics, so convergence artifacts can better separate real optimization signal from per-call engine noise.

Changes:

  • Add --n-evals CLI option (default 1) and record it in provenance.
  • Re-evaluate ObjectiveFunction at initial/final parameters N times and emit median/CI fields + a significance flag into validation_results.json.
  • Update the “optimized” log line to include median improvement and CI when N > 1.

Comment thread scripts/regenerate_convergence_results.py Outdated
Comment thread scripts/regenerate_convergence_results.py
@ericchansen ericchansen force-pushed the feat/regen-median-of-n branch from 853c61a to edee1ad Compare May 27, 2026 21:17
…from noise

Add post-hoc repeated ObjectiveFunction evaluation for convergence
regeneration so q2mm#284 §2 noise findings from q2mm#283 can be
reported as measurement uncertainty instead of single-call verdicts.

The validation JSON keeps the legacy initial_obj_score, final_obj_score,
and improvement_pct fields for existing consumers, and adds:

  - initial_obj_score_mean, initial_obj_score_ci95
  - final_obj_score_mean, final_obj_score_ci95
  - improvement_pct_mean, improvement_significant

Reports the sample mean (not median) paired with a Student-t 95% CI
half-width — the t-distribution describes the sampling distribution
of the mean, not the median.  For n ≤ 10 with the bounded engine
noise we measure here, sample mean and median are nearly identical;
the mean is the right center to pair with a t-CI.
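
As an illustration of the statistic described above, here is a minimal standard-library sketch of a sample mean paired with a Student-t 95% CI half-width, t(0.975, n-1) * s / sqrt(n). The function name, the hard-coded critical-value table (covering n = 2..10), and the choice to return `None` for a single sample are assumptions; the script's own implementation may differ.

```python
import math
import statistics

# Two-sided 95% Student-t critical values t(0.975, df) for df = 1..9 (n = 2..10).
T_CRIT_95 = {
    1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
    6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262,
}


def mean_and_ci95(samples):
    """Return (sample mean, 95% CI half-width) for a small set of scores."""
    n = len(samples)
    if n == 0:
        raise ValueError("need at least one sample")
    mean = statistics.fmean(samples)
    if n == 1:
        # A CI is undefined for one sample; returning None is an assumption.
        return mean, None
    if n - 1 not in T_CRIT_95:
        raise ValueError("table in this sketch only covers n <= 10")
    # Standard error of the mean uses the sample standard deviation (n - 1).
    half_width = T_CRIT_95[n - 1] * statistics.stdev(samples) / math.sqrt(n)
    return mean, half_width


mean, half = mean_and_ci95([1.0, 2.0, 3.0])
print(mean, half)
```

Identical samples give a half-width of exactly zero in this sketch, which is in line with the near-zero ci95 reported for the deterministic ch3f run.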

ObjectiveFunction.history is restored between samples by truncating
back to its original length (O(1)) rather than copying-then-replacing
(O(len)) — important when the optimizer has accumulated many
evaluations.
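
The truncation approach can be sketched as follows. `FakeObjective` is a hypothetical stand-in for ObjectiveFunction that only appends each score to a `history` list, and `sample_without_side_effects` is an illustrative helper name; the real class also keeps counters that this sketch does not model.

```python
class FakeObjective:
    """Hypothetical stand-in: records every evaluation in .history."""

    def __init__(self):
        self.history = []

    def __call__(self, params):
        score = sum(p * p for p in params)
        self.history.append(score)
        return score


def sample_without_side_effects(objective, params, n_evals):
    """Evaluate n_evals times, trimming history back to its prior length each time."""
    original_len = len(objective.history)
    scores = []
    for _ in range(n_evals):
        try:
            scores.append(objective(params))
        finally:
            # Drop only the entries appended by this call; no copy of the
            # existing (possibly long) optimizer history is made.
            del objective.history[original_len:]
    return scores


obj = FakeObjective()
obj([1.0])
obj([2.0])
scores = sample_without_side_effects(obj, [3.0], 4)
print(scores, obj.history)
```

The cost of each restore depends on the entries added by the extra call rather than on the length of the history the optimizer has already accumulated.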

Validation:

  - ruff check + format clean
  - 680 unit tests pass (24 new tests from #285 included)
  - ch3f smoke run with --n-evals 3 produces both legacy and new
    fields; ci95 ≈ 1e-15 (deterministic single-mol system); SIGNIFICANT
    verdict as expected (99.83 % vs ~0 CI)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@ericchansen ericchansen merged commit 86d8483 into master May 27, 2026
11 checks passed
@ericchansen ericchansen deleted the feat/regen-median-of-n branch May 27, 2026 21:27

2 participants