Reproducible benchmark: 5 cases, measure wall-clock time from prompt submission to the first converged result. Compare against a hand-written baseline.
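A minimal sketch of how such a timing harness might look, assuming each case can be driven by one blocking call that returns once the first converged result is available. The names `time_case`, `benchmark`, and `compare` are hypothetical, and so is the assumption that the hand-written baseline is recorded as per-case wall-clock seconds.

```python
import time
from typing import Callable, Dict, Iterable


def time_case(run: Callable[[str], object], case: str) -> float:
    """Return wall-clock seconds from submitting `case` until `run` returns.

    Assumption: `run` blocks until the first converged result is produced.
    """
    start = time.perf_counter()
    run(case)
    return time.perf_counter() - start


def benchmark(run: Callable[[str], object], cases: Iterable[str]) -> Dict[str, float]:
    """Time every case once, in order, and return seconds keyed by case name."""
    return {case: time_case(run, case) for case in cases}


def compare(candidate: Dict[str, float], baseline: Dict[str, float]) -> Dict[str, float]:
    """Per-case ratio candidate / baseline; values below 1.0 mean the candidate was faster.

    Assumption: `baseline` holds a positive time, in seconds, for every case.
    """
    return {case: candidate[case] / baseline[case] for case in candidate}
```

Calling `benchmark` once per run and storing the raw per-case times alongside the case definitions keeps the comparison repeatable. Timing each case several times and reporting the median would reduce noise, though the note above does not say whether repeats are intended.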