wahln · Jun 25, 2026 · Jun 21, 2026
diff --git a/.claude/commands/release.md b/.claude/commands/release.md
@@ -0,0 +1,44 @@
+---
+description: Drive the ipax release workflow (rc branch -> PR -> tag) for a version.
+argument-hint: X.Y.Z
+---
+
+Drive a release of version **$ARGUMENTS** following the **Releasing** section in
+`AGENTS.md` — that section is the source of truth; read it first and keep this command
+in sync with it if it changes.
+
+This workflow has human/async gates (CI, code review, PR merge). **Stop at each gate**
+and report status — do not poll indefinitely or fabricate approval. Resume when I tell
+you the gate has passed.
+
+Phase A — open the RC (do now):
+
+1. From an up-to-date `develop`, create branch `rc/v$ARGUMENTS`.
+2. Push it and open a PR with **base `main`, head `rc/v$ARGUMENTS`**.
+   **Do not bump the version or finalize the changelog yet.**
+3. Report the PR URL and **stop**: waiting for CI + code review.
+
+Phase B — review loop (when I report review feedback):
+
+4. Address the review comments. Stage only the files you changed (**never `git add -A`**),
+   commit, push. Re-summarize and **stop** until the PR is approved.
+
+Phase C — release bump (only once I confirm the PR is approved):
+
+5. Finalize `CHANGELOG.md`: add `## [$ARGUMENTS] - <today>` (move the `[Unreleased]`
+   items into it) and update the compare links at the bottom.
+6. Bump `ipax.__version__` to `$ARGUMENTS` (the single version source; pyproject derives it).
+7. Run `python scripts/check.py` and `pre-commit run kacl-verify --files CHANGELOG.md`.
+   Commit, push, and **stop**: waiting for CI to go green and the PR to be merged.
+
+Phase D — tag & sync (once I confirm the PR is merged):
+
+8. `git switch main && git pull --ff-only origin main`; confirm `__version__` is `$ARGUMENTS`.
+9. Create the annotated tag: `git tag -a v$ARGUMENTS main` with a short summary drawn
+   from the `[$ARGUMENTS]` changelog section.
+10. Merge `main` into `develop` (`--no-ff`).
+11. Push both: `git push origin develop && git push origin v$ARGUMENTS`. The tag push
+    triggers `release.yml` (PyPI + GitHub release) — report that it has started.
+
+If `$ARGUMENTS` is empty or not `X.Y.Z`, ask me for the version before doing anything.
+Stop and ask if any step deviates from `AGENTS.md` rather than improvising.
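The argument check in the closing paragraph of `release.md`, and the branch and tag names that the phases derive from `$ARGUMENTS`, can be sketched as below. This is a minimal illustration, not part of the change: `release_names` and the keys it returns are hypothetical, and the strict `X.Y.Z` pattern (no leading zeros, no pre-release suffix) is an assumption about what counts as a valid version.

```python
import re

# Assumed interpretation of "X.Y.Z": three dot-separated non-negative integers
# without leading zeros. The command file itself only says "X.Y.Z".
_VERSION_RE = re.compile(r"(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)\.(0|[1-9][0-9]*)")


def release_names(version: str) -> dict[str, str]:
    """Validate a version string and derive the names the workflow uses."""
    candidate = version.strip()
    if not _VERSION_RE.fullmatch(candidate):
        # Mirrors "ask me for the version before doing anything".
        raise ValueError(f"expected a version of the form X.Y.Z, got {version!r}")
    return {
        "rc_branch": f"rc/v{candidate}",  # Phase A: branch created from develop
        "pr_base": "main",                # Phase A: PR base
        "tag": f"v{candidate}",           # Phase D: annotated tag
    }


if __name__ == "__main__":
    print(release_names("0.3.0"))
```

Rejecting a malformed argument before any branch is created keeps the failure cheap: nothing has been pushed yet, so there is nothing to clean up.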
diff --git a/AGENTS.md b/AGENTS.md
@@ -177,10 +177,14 @@ Claude Code config lives in `.claude/` and is checked in:
 - **`/verify`** (`.claude/commands/verify.md`) — run `scripts/check.py` and summarize.
 - **`/tdd`** (`.claude/commands/tdd.md`) — drive a change through the mandated
   red→green→verify→regression loop, multi-backend and invariant-aware.
+- **`/release`** (`.claude/commands/release.md`) — drive the release workflow
+  (`rc/vX.Y.Z` branch → PR onto `main` → review loop → version/changelog bump → tag →
+  sync `develop`), stopping at each human/CI gate. Follows the **Releasing** section.
 
 The auditor's rubric is the static half of GPU performance work; the measured half is
-a future `benchmarks/runners/device_efficiency.py` (sync/kernel profiling on real
-hardware), deferred until GPU CI exists.
+`benchmarks/runners/device_efficiency.py` (host-sync counting + per-iteration timing
+on real hardware, kernel-launch counts best-effort via `nsys`). It is GPU-gated, so it
+is a no-op in CI until GPU CI exists; run it locally on a CUDA backend.
 
 ---
 
@@ -250,6 +254,40 @@ core must remain backend-agnostic.
 
 ---
 
+## Releasing
+
+Releases follow a release-candidate PR onto `main`, then a tag. Conventions to know
+before starting:
+
+- **Version is single-sourced** in `ipax/__init__.py` (`__version__`); `pyproject.toml`
+  derives it via `[tool.hatch.version]`. Bump in **exactly that one place**.
+- **`CHANGELOG.md` follows Keep a Changelog** and is enforced by the `kacl-verify`
+  pre-commit/CI hook. Keep entries under `## [Unreleased]` as work lands.
+- **Tags are `vX.Y.Z`**; pushing a tag triggers `release.yml` (PyPI Trusted Publishing
+  + a GitHub release whose notes come from the tag annotation and the changelog).
+
+Steps:
+
+1. **Branch.** From an up-to-date `develop`, create `rc/vX.Y.Z`.
+2. **Open the PR.** Push the branch and open a PR with **base `main`, head `rc/vX.Y.Z`**.
+   **Do not bump the version yet** — review may change the contents.
+3. **Wait for CI + code review** (CI gates + Copilot/human review).
+4. **Address review.** Push fixes to `rc/vX.Y.Z`; repeat 3–4 until the PR is approved.
+   Avoid `git add -A` (it sweeps unrelated working-tree edits into the commit) — stage
+   the files you changed.
+5. **Release bump (only once approved).** Confirm `CHANGELOG.md` covers the release:
+   add `## [X.Y.Z] - YYYY-MM-DD` (move the `[Unreleased]` items into it) and update the
+   compare links at the bottom. Bump `ipax.__version__`. Run `python scripts/check.py`
+   and `pre-commit run kacl-verify --files CHANGELOG.md`. Push; let CI go green.
+6. **Merge** the PR into `main`.
+7. **Tag and sync.** Sync `main` (`git switch main && git pull --ff-only`), create the
+   annotated tag (`git tag -a vX.Y.Z main` with a short changelog-derived summary),
+   merge `main` back into `develop` (`--no-ff`), then push **both** `develop` and the
+   tag (`git push origin develop && git push origin vX.Y.Z`). The tag push runs
+   `release.yml` — confirm that workflow succeeds.
+
+---
+
 ## References
 
 - Wächter, A. & Biegler, L. T. (2006). "On the implementation of an interior-point

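Step 5 of the **Releasing** section (move the `[Unreleased]` items under a dated version heading and update the compare links) can be sketched as a pure text transformation. This is an illustration only: `finalize_changelog` is a hypothetical helper, not a script in the repository, and it assumes the layout visible in the `CHANGELOG.md` diff (a single `## [Unreleased]` heading and a single `[Unreleased]:` compare link).

```python
def finalize_changelog(
    text: str, version: str, previous: str, date: str, repo_url: str
) -> str:
    """Insert a version heading under [Unreleased] and update compare links."""
    heading = "## [Unreleased]\n"
    old_link = f"[Unreleased]: {repo_url}/compare/v{previous}...HEAD"
    if text.count(heading) != 1 or text.count(old_link) != 1:
        # Refuse to guess when the file does not match the assumed layout.
        raise ValueError("changelog does not match the expected layout")

    # Entries that sat under [Unreleased] now sit under the new version heading;
    # [Unreleased] is left empty for the next cycle.
    text = text.replace(heading, f"{heading}\n## [{version}] - {date}\n", 1)

    # [Unreleased] now compares against the new tag, and the new version
    # compares against the previous one.
    new_links = (
        f"[Unreleased]: {repo_url}/compare/v{version}...HEAD\n"
        f"[{version}]: {repo_url}/compare/v{previous}...v{version}"
    )
    return text.replace(old_link, new_links, 1)


if __name__ == "__main__":
    sample = (
        "## [Unreleased]\n\n### Fixed\n- A bug.\n\n## [0.2.0] - 2026-06-21\n\n"
        "[Unreleased]: https://github.com/wahln/ipax/compare/v0.2.0...HEAD\n"
        "[0.2.0]: https://github.com/wahln/ipax/compare/v0.1.1...v0.2.0\n"
    )
    print(
        finalize_changelog(
            sample, "0.3.0", "0.2.0", "2026-06-26", "https://github.com/wahln/ipax"
        )
    )
```

A sketch like this does not replace the `kacl-verify` hook named in the section; the hook remains the check that the result is a valid Keep a Changelog file.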
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,6 +6,135 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 
 ## [Unreleased]
 
+## [0.3.0] - 2026-06-26
+
+### Added
+- Documentation: a published **S2MPJ / CUTEst benchmark** page
+  (`docs/benchmarks/s2mpj.md`) recording the latest full-corpus run — system
+  information, per-configuration metrics for the `{lbfgs, exact} × {dense, krylov,
+  sparse}` matrix, the optimization-vs-feasibility split, and the dataset-sourced
+  scoring methodology.
+- The per-iteration log table reprints its column header every
+  `HEADER_REPEAT_INTERVAL` (10) rows so it stays readable on long runs, and marks
+  any iterate that already satisfies every enabled acceptable-stopping criterion
+  (before the required consecutive count) with a trailing `*`.
+- Expanded the Hock–Schittkowski analytic-oracle set in `ipax.testing.problems`
+  with `HS9`, `HS21`, `HS28`, and `HS71`, covering active bound multipliers, a
+  degenerate (zero) equality multiplier, a non-unique periodic optimum, and the
+  full equality+inequality+bounds constraint mix. Each is exercised across every
+  backend in the integration suite, wired into the QC benchmark corpus, and
+  checked by a new finite-difference derivative-consistency test that also
+  back-fills the previously untested HS oracles.
+- Two-sided linear inequalities (`Problem.linear_ineq`, `l ≤ A x ≤ u`) are now
+  solved: the constant-data block is lowered into the standard one-sided
+  inequality machinery (finite lower rows → `l − A x ≤ 0`, finite upper rows →
+  `A x − u ≤ 0`, both-finite rows yield a range pair), so the IPM, gradient
+  scaling, and every solver route handle it with no special-casing and the block
+  contributes no Lagrangian-Hessian term. Previously `solve` raised
+  `NotImplementedError` despite the interface being documented. A matrix-free
+  (operator) `linear_ineq` matrix still raises with guidance to use
+  `ineq_constraints` instead.
+
+- S2MPJ scoring now uses the **dataset's own documented outcome** instead of
+  convergence alone: the loader parses each source file for the CUTEst
+  classification (`pbclass`), the SIF author's solution objective
+  (`# LO SOLTN`, present on ~72% of the corpus), and an explicit
+  `Solution (infeasible)` / `Source: an infeasible problem` marker. A case is
+  scored *correct* when it reaches the documented objective, or — for a
+  documented-infeasible problem like BURKEHAN — when it **detects infeasibility**
+  (previously flagged as a failure). The report shows the gap to the documented
+  optimum (`Δf*`) and annotates `infeasible (exp)`. The objective-free problems
+  (CUTEst feasibility / nonlinear-equation systems) can now be run via
+  `--include-objective-free` as `min 0` subject to the constraints, and one
+  configuration can be swept per process with `--config` for parallel runs.
+- S2MPJ sweep gained a size-aware run strategy for tractable full-corpus runs:
+  **per-route variable caps** (dense 2000, Krylov 10000, sparse 25000) so each
+  config runs only on problems that fit its route — small problems are
+  cross-validated across every route while larger ones fall through to Krylov and
+  the sparse-direct route — with `--max-vars` kept as a global ceiling; **sized
+  instantiation** (`--size N`, with `PROBLEM(N)` for the scalable problems and a
+  SIF-default fallback for the rest) to reach the sparse route's intended large-`n`
+  regime; an optional subprocess **build-time guard** (`--max-build-seconds`) that
+  abandons a pathological O(n²) pure-Python construction before it stalls an
+  unattended sweep; and per-problem instance **caching** so the per-config fan-out
+  rebuilds each problem once instead of up to five times.
+- S2MPJ benchmark sweep now exercises the **exact Lagrangian Hessian** and the
+  **sparse-direct route**, not only L-BFGS. The adapter (`_S2MPJExactProblem`) wires
+  S2MPJ's `LgHxy`/`LHxyv` (convention `L = f + yᵀc`) into ipax's exact-Hessian route,
+  mapping `(σ, y_eq, y_ineq)` onto S2MPJ's single multiplier vector with the correct
+  signs for lowered inequality sides (lower `−y`, upper `+y`) and honoring `σ` on the
+  objective term so it stays correct under gradient-based scaling. With `sparse=True`
+  the Jacobians and Hessian cross as `SparseOperator`s (true COO sparsity) for the
+  sparse-direct (Feral/cuDSS) factorization. The runner's regular matrix is now
+  `{lbfgs, exact} × {dense, krylov, sparse}` (`exact/sparse` factors true sparsity —
+  raise `--max-vars` to reach the large, sparse models), and its `--scaling` now
+  defaults to `gradient-based` to match the solver default rather than benchmarking a
+  scaling-off configuration users do not get.
+- S2MPJ benchmark corpus: `benchmarks/corpus/s2mpj.py` loads the pure-Python
+  S2MPJ translations of the CUTEst/Hock–Schittkowski problems (no Fortran/SIF
+  toolchain) and bridges their NumPy/SciPy evaluation onto any CPU Array-API
+  backend, mapping S2MPJ's two-sided `clower ≤ c(x) ≤ cupper` constraints onto
+  ipax's eq/ineq split. A `benchmarks/runners/s2mpj.py` L-BFGS accuracy sweep
+  consumes it. `list_s2mpj_problems()` enumerates a checkout and the runner's
+  `--all` flag sweeps the **entire CUTEst set**, with `--max-vars`/`--max-iter`/
+  `--max-time` caps, per-problem isolation, automatic skipping of objective-free
+  problems, and a status summary. Download-gated (`IPAX_S2MPJ_DIR`); not vendored
+  (S2MPJ has no license) and not part of per-PR CI — the loader returns `[]` and
+  the gated tests skip when no checkout is present.
+
+### Changed
+- **Gradient-based scaling is now the default** (`ScalingOptions.method`
+  `"none"` → `"gradient-based"`), matching IPOPT. Across the full CUTEst/S2MPJ
+  corpus this solves a net **+67** problems (≈92 recovered, mostly from the
+  slow-converging `max_iter` bucket; ≈23 regressed, of which only 3–4 genuinely
+  fail — hard nonconvex/minimax cases that diverge under scaling — and the rest
+  merely converge slower). The returned `x`, objective, and multipliers are
+  reported in the original problem's units; pass `scaling="none"` to opt out.
+- Promoted the driver's private vertical-stack operator to a public
+  `ipax.backend.operators.VStack` (now also exposing `row_inf_norms` for
+  gradient scaling), reused by both the equality assembly and the new
+  linear-inequality lowering.
+
+### Fixed
+- A stall at a near-optimal iterate is now reported `ACCEPTABLE` instead of
+  being discarded. Near a solution the condensed system is ill-conditioned (μ
+  driven below the achieved KKT residual), so the Newton step can come out
+  non-finite and the line search can fail to make progress even though the
+  iterate is essentially optimal. The solver now salvages such an iterate —
+  whether the failure is a non-finite **step solve** (previously
+  `NUMERICAL_ERROR`) or the line search **handing off to restoration**
+  (previously a false `INFEASIBLE`) — when its scaled KKT components are within a
+  relaxed multiple (IPOPT `acceptable_tol` ≈ 1e2 × `tol`) of the optimality
+  tolerances, rather than throwing away a usable solution.
+- Fixed variables (`x_L == x_U`) — common in CUTEst-style models — no longer make
+  the solve fail at the first iteration. Such a variable has no strict barrier
+  interior, so `z = μ/(x − x_L)` was singular and the first Newton step came out
+  non-finite (`numerical_error`). The solver now relaxes fixed / near-degenerate
+  bound pairs symmetrically about their midpoint (IPOPT
+  `fixed_variable_treatment='relax_bounds'`), leaving well-separated bounds
+  untouched. Surfaced by the S2MPJ sweep, where it accounted for the bulk of the
+  first-iteration `numerical_error` failures.
+- The filter line-search switching condition no longer raises `OverflowError` on
+  a badly-scaled iterate. Python's `float ** s_phi` raises instead of returning
+  `inf` once the result exceeds the double range (as happens for an enormous
+  directional derivative `dphi`), which crashed the whole solve; the power now
+  uses IEEE overflow semantics (`→ inf`). Surfaced by the S2MPJ INDEF sweep. The
+  S2MPJ benchmark adapter likewise sanitizes overflow in its NumPy-bridged
+  objective/gradient (returning `inf`), so a trial point that overflows the
+  problem's own generated `float**` is rejected rather than crashing
+  (e.g. LUKVLE4C, which then solves).
+- Feasibility restoration no longer crashes on a numerically singular or
+  extreme-scale Gauss-Newton system. The damped (Levenberg–Marquardt) step now
+  treats a failed/non-finite linear solve as a rejected step — growing the
+  damping (up to a ceiling) and retrying — instead of letting the backend's
+  `solve` raise (e.g. NumPy `LinAlgError: Singular matrix` when a constraint
+  Jacobian blows up far from feasibility). Surfaced by the S2MPJ HS7 sweep; the
+  solve now degrades to a reported status rather than raising.
+- `configure_verbosity` no longer attaches a second console handler when the
+  application has already configured its own handler on the `"ipax"` logger,
+  which previously printed every iteration record twice. Propagation to ancestor
+  handlers (and `caplog`) is unchanged.
+
 ## [0.2.0] - 2026-06-21
 
 ### Added
@@ -85,7 +214,8 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 - Contract batteries (`tests/contracts/`) plus unit/property/integration/backends/
   regression layers; benchmark suite (`benchmarks/`, asv); MkDocs documentation.
 
-[Unreleased]: https://github.com/wahln/ipax/compare/v0.2.0...HEAD
+[Unreleased]: https://github.com/wahln/ipax/compare/v0.3.0...HEAD
+[0.3.0]: https://github.com/wahln/ipax/compare/v0.2.0...v0.3.0
 [0.2.0]: https://github.com/wahln/ipax/compare/v0.1.1...v0.2.0
 [0.1.1]: https://github.com/wahln/ipax/compare/v0.1.0...v0.1.1
 [0.1.0]: https://github.com/wahln/ipax/releases/tag/v0.1.0
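The `OverflowError` fix listed under the 0.3.0 "Fixed" entries rests on a Python behavior: a float power raises `OverflowError` when the result exceeds the double range, where IEEE arithmetic would return `inf`. A minimal sketch of a saturating power follows. `saturating_pow` is a hypothetical name, not ipax's implementation, and the sketch assumes a nonnegative base and a positive exponent, so that an overflowed result is always `+inf`.

```python
import math


def saturating_pow(base: float, exponent: float) -> float:
    """Return base ** exponent, saturating to +inf instead of raising.

    Assumes base >= 0 and exponent > 0 (an assumption of this sketch).
    """
    if base < 0.0:
        raise ValueError("saturating_pow assumes a nonnegative base")
    try:
        return base ** exponent
    except OverflowError:
        # Plain float power raises here; IEEE semantics would give +inf.
        return math.inf


if __name__ == "__main__":
    print(saturating_pow(2.0, 3.0))
    print(saturating_pow(1e200, 2.3))
```

Returning `inf` lets a comparison such as a switching condition evaluate to a plain boolean, so an extreme trial point is handled by the normal accept/reject logic rather than aborting the solve.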
diff --git a/benchmarks/corpus/__init__.py b/benchmarks/corpus/__init__.py
@@ -21,8 +21,12 @@
     HS6,
     HS7,
     HS8,
+    HS9,
+    HS21,
+    HS28,
     HS35,
     HS43,
+    HS71,
     BoundConstrainedQP,
     EqualityConstrainedQP,
     UnconstrainedQuadratic,
@@ -61,6 +65,7 @@ class BenchmarkProblem:
         default=lambda _problem: None, repr=False
     )
     backends: tuple[str, ...] | None = None
+    exclude_configs: tuple[str, ...] = ()  # config labels the QC sweep skips here
 
 
 def _known(problem: Problem) -> Array | None:
@@ -149,6 +154,40 @@ def default_corpus() -> list[BenchmarkProblem]:
             tags=("eq", "nonlinear"),
             build=lambda xp: (HS8(xp), _arr(xp, [2.0, 1.0])),
         ),
+        BenchmarkProblem(
+            name="hs9",
+            kind="NLP",
+            tags=("eq", "nonlinear"),
+            build=lambda xp: (HS9(xp), _arr(xp, [0.0, 0.0])),
+        ),
+        BenchmarkProblem(
+            name="hs21",
+            kind="QP",
+            tags=("bounds", "ineq"),
+            build=lambda xp: (HS21(xp), _arr(xp, [3.0, 1.0])),
+            optimum=_known,
+        ),
+        BenchmarkProblem(
+            name="hs28",
+            kind="QP",
+            tags=("eq",),
+            build=lambda xp: (HS28(xp), _arr(xp, [-1.0, 0.5, 0.5])),
+            optimum=_known,
+        ),
+        BenchmarkProblem(
+            name="hs71",
+            kind="NLP",
+            tags=("eq", "ineq", "bounds", "nonlinear"),
+            build=lambda xp: (HS71(xp), _arr(xp, [1.0, 5.0, 5.0, 1.0])),
+            optimum=_known,
+            # The Mehrotra/Gondzio correctors sit at HS71's convergence edge on this
+            # nonconvex problem and stall on some backends/platforms (e.g. CI's
+            # Torch build) while converging on others — a known corrector-robustness
+            # gap, not a per-PR regression. Exclude those configs here so the gate is
+            # deterministic; HS71 is still swept on every stable route, and covered
+            # under the default solve by the integration tests.
+            exclude_configs=("exact/dense+mehrotra", "exact/dense+gondzio"),
+        ),
         _rt_case(),
     ]
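How a sweep might honor the new `exclude_configs` field can be sketched as follows. This is an illustration under assumptions, not the QC runner's code: `Case` is a stand-in for `BenchmarkProblem`, `plan_sweep` is a hypothetical helper, and the label `exact/dense` is assumed alongside the two corrector labels taken from the diff.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    """Hypothetical stand-in for BenchmarkProblem with only the fields used here."""

    name: str
    exclude_configs: tuple[str, ...] = ()


def plan_sweep(cases: list[Case], config_labels: list[str]) -> list[tuple[str, str]]:
    """Return (problem, config) pairs, skipping each case's excluded configs."""
    return [
        (case.name, label)
        for case in cases
        for label in config_labels
        if label not in case.exclude_configs
    ]


if __name__ == "__main__":
    cases = [
        Case("hs8"),
        Case("hs71", exclude_configs=("exact/dense+mehrotra", "exact/dense+gondzio")),
    ]
    labels = ["exact/dense", "exact/dense+mehrotra", "exact/dense+gondzio"]
    for pair in plan_sweep(cases, labels):
        print(pair)
```

Matching on the exact label string keeps the exclusion narrow: a problem is skipped only for the named configurations and still runs on every other route.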