rc/0.2.0: GPU device-efficiency profiling + fraction-to-boundary vectorization by wahln · Pull Request #2 · wahln/ipax

wahln · 2026-06-21T13:41:13Z

Release candidate for 0.2.0. No version bump yet — opening for PR review first; the bump + changelog will follow once the review is settled.

Summary

A measure-first GPU optimization pass, validated on a local RTX 4070 (CuPy), plus the device-efficiency profiling harness it required.

Profiling harness (the deferred `device_efficiency` deliverable)

- Real device detection in `capabilities()` via the Array-API inspection API (`__array_namespace_info__().devices()`); the value was previously hardcoded to `("cpu",)`. `Result.device` is added and shown in the tier-1 summary.
- `benchmarks/harness`: `DeviceMetrics` + `measure_device_solve`, with a host-sync counter (it patches the array type's scalar dunders) and the CuPy GPU/CPU time split.
- `benchmarks/runners/device_efficiency.py` (GPU-gated, so a no-op on CI) and the previously stubbed micro-benchmarks (matvec / dense solve / Newton step).
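
For readers unfamiliar with the host-sync counter idea, here is a minimal sketch of counting scalar conversions by temporarily patching a type's scalar dunders. It is not the code in `benchmarks/harness`: `count_scalar_syncs` and `ToyScalar` are hypothetical names, and the toy class stands in for a device array because built-in extension types may reject this kind of patching.

```python
from contextlib import contextmanager

# Scalar conversions that would force a device-to-host copy on a GPU array.
_SCALAR_DUNDERS = ("__float__", "__int__", "__bool__", "__index__", "__complex__")
_MISSING = object()


@contextmanager
def count_scalar_syncs(array_type):
    """Count scalar conversions on instances of array_type inside the block."""
    counts = {"total": 0}
    saved = {}

    def wrap(original):
        def counted(self, *args, **kwargs):
            counts["total"] += 1
            return original(self, *args, **kwargs)
        return counted

    for name in _SCALAR_DUNDERS:
        original = getattr(array_type, name, None)
        if original is None:
            continue  # the type does not support this conversion
        # Remember whether the class itself defined the dunder, so the
        # restore step leaves the class exactly as it was found.
        saved[name] = array_type.__dict__.get(name, _MISSING)
        setattr(array_type, name, wrap(original))
    try:
        yield counts
    finally:
        for name, previous in saved.items():
            if previous is _MISSING:
                delattr(array_type, name)
            else:
                setattr(array_type, name, previous)


class ToyScalar:
    """Hypothetical stand-in for a 0-d device array."""

    def __init__(self, value):
        self.value = value

    def __float__(self):
        return float(self.value)


values = [ToyScalar(v) for v in (1.5, 2.5, 3.5)]
with count_scalar_syncs(ToyScalar) as counts:
    total = sum(float(v) for v in values)  # three conversions, three "syncs"
```

Restoring the original attributes in the `finally` clause matters because a leaked patch would distort every later measurement in the same process.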

The optimization

- Finding: `fraction_to_boundary` looped over the array in Python, calling `float(v[idx])` per element. That is O(n) host↔device syncs per call, and the function is called 6× per iteration. On GPU this dominated: 5,510 syncs/iter at n=1000, scaling linearly to 27,422 at n=5000.
- Fix: vectorized to a single reduction (Wächter–Biegler 2006, eq. 15), Array-API pure (no concrete backend in core).
- Result on the 4070: krylov syncs/iter 27,422 → 133 (no longer scales with n); wall time 9.04 s → 0.39 s (~23×) at n=5000; identical iteration counts and KKT error. The loop is now ~97% GPU-compute-bound (gpu_time ≈ wall), so further sync consolidation was deliberately not pursued (the data showed <3% headroom).
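
To make the before/after concrete, here is a rough sketch of the two shapes of the computation. It assumes the usual fraction-to-boundary rule (the largest alpha in (0, 1] with x + alpha * dx >= (1 - tau) * x, for x > 0) and uses NumPy as a stand-in namespace. The function names, signatures, and the `xp` parameter are illustrative assumptions, not the actual code in `ipax/ipm/barrier.py`.

```python
import numpy as np


def fraction_to_boundary_loop(x, dx, tau):
    """Per-element reference: every float() is a host sync on a GPU array."""
    alpha = 1.0
    for idx in range(x.shape[0]):
        step = float(dx[idx])
        if step < 0.0:
            alpha = min(alpha, -tau * float(x[idx]) / step)
    return alpha


def fraction_to_boundary_vectorized(x, dx, tau, xp=np):
    """Same result via one reduction and a single scalar conversion."""
    if x.shape[0] == 0:
        return 1.0  # nothing constrains the step
    shrinking = dx < 0
    # Replace non-negative entries by -1 so the division never hits zero;
    # those entries are masked out to +inf right after.
    safe_dx = xp.where(shrinking, dx, -xp.ones_like(dx))
    ratios = xp.where(shrinking, -tau * x / safe_dx, xp.full_like(x, xp.inf))
    return min(1.0, float(xp.min(ratios)))  # the only host sync


x = np.array([1.0, 2.0, 3.0])
dx = np.array([-2.0, 1.0, -1.0])
alpha = fraction_to_boundary_vectorized(x, dx, tau=0.99)
```

In this sketch only `xp.where`, `xp.min`, `xp.ones_like`, `xp.full_like`, and `xp.inf` are used, all of which are part of the Array API standard, so the same body should run on other conforming namespaces.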

Tests / gates

- Device-efficiency harness smoke test, an O(1)-sync regression guard for `fraction_to_boundary`, and a vectorization-equivalence test.
- All gates green: 799 passed; ruff/format/mypy/import-purity clean.

Notes for review

- Invariants held: the core change uses only `array_namespace`/`xp.*`; backend-specific timers live in `benchmarks/`.
- torch/jax CUDA wheels are a separate install step; the harness is already backend-parametric for when they're added.

🤖 Generated with Claude Code

…copes

- Ship root conftest.py in the sdist so the suite collects from an sdist (it now requires the root conftest for --doctest-modules).
- Fix README CI badge URL: niklaswahl/ipax -> wahln/ipax.
- Add explicit `actions: read` to the release.yml jobs that run actions/download-artifact (defensive least-privilege; same-run artifacts use the runtime token, but being explicit de-risks restricted-default orgs).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

…dary

Measure-first GPU optimization pass (CuPy on RTX 4070).

Harness:
- Real device detection in capabilities() via the Array-API inspection API (__array_namespace_info__().devices()); was hardcoded ("cpu",). Result.device added and surfaced in the tier-1 summary.
- benchmarks/harness: DeviceMetrics + measure_device_solve with a host-sync counter (patches the array type scalar dunders) and CuPy gpu/cpu split.
- benchmarks/runners/device_efficiency.py (GPU-gated, no-op on CI) and the previously-stubbed micro-benchmarks (matvec / dense solve / Newton step).

Optimization:
- fraction_to_boundary looped over the array in Python calling float(v[idx]) per element (O(n) host syncs, called 6x/iter). On GPU this dominated: 5,510 syncs/iter at n=1000 scaling linearly to 27,422 at n=5000. Vectorized to a single reduction (Wachter-Biegler 2006 eq. 15), Array-API pure.
- Result on the 4070: krylov syncs/iter 27,422 -> 133 (no longer scales with n), wall 9.04s -> 0.39s (~23x) at n=5000; same iters and KKT error. The loop is now ~97% GPU-compute-bound (gpu_time ~= wall).

Tests: device-efficiency harness smoke test, O(1)-sync regression guard for fraction_to_boundary, and a vectorization-equivalence test. All gates green (799 passed, ruff/format/mypy/purity clean).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

This PR prepares the 0.2.0 release candidate by adding a GPU/device-efficiency profiling harness under benchmarks/, surfacing device information in solver results/logging, and removing an O(n) host↔device sync bottleneck in the interior-point loop by vectorizing fraction_to_boundary.

Changes:

- Vectorize `fraction_to_boundary` to eliminate per-element scalar materializations (dramatically reducing GPU syncs/iter) and add unit + regression coverage.
- Add device reporting: detect available devices in backend capabilities and record the solution device on `Result`, including tier-1 logging output.
- Introduce the device-efficiency study harness/runner and revive micro-benchmarks for kernel-level timing across installed backends.
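
As an illustration of the device-reporting item, the sketch below shows one way a namespace could be queried through the Array-API inspection API with a fallback when it is missing. `detect_devices` and the two fake namespaces are hypothetical and exist only so the example runs without any array library installed; the real `capabilities()` in `ipax/backend/namespace.py` may be structured differently.

```python
from types import SimpleNamespace


def detect_devices(xp, fallback=("cpu",)):
    """Return device names reported by xp, or fallback if inspection fails."""
    info_factory = getattr(xp, "__array_namespace_info__", None)
    if info_factory is None:
        # Namespace predates the inspection API: keep the old default.
        return tuple(fallback)
    try:
        devices = info_factory().devices()
    except Exception:
        return tuple(fallback)
    # Device objects differ per backend, so report them as strings.
    return tuple(str(d) for d in devices) or tuple(fallback)


# Fake namespaces (assumptions for illustration only).
modern_xp = SimpleNamespace(
    __array_namespace_info__=lambda: SimpleNamespace(devices=lambda: ["cpu", "cuda:0"])
)
legacy_xp = SimpleNamespace()

reported = detect_devices(modern_xp)
defaulted = detect_devices(legacy_xp)
```

Falling back instead of raising keeps capability reporting usable on namespaces that do not yet implement the inspection API.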

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `ipax/ipm/barrier.py` | Vectorized `fraction_to_boundary` to avoid per-element host syncs. |
| `tests/unit/test_barrier.py` | Adds empty-vector and reference-equivalence tests for `fraction_to_boundary`. |
| `tests/regression/test_fraction_to_boundary_no_elementwise_sync.py` | Adds regression guard that scalar syncs during `fraction_to_boundary` stay O(1). |
| `benchmarks/harness/__init__.py` | Adds device-efficiency measurement primitives (sync counter, GPU timing/memory, report formatting). |
| `benchmarks/runners/device_efficiency.py` | Adds CLI runner to generate JSON/Markdown device-efficiency reports over backend/route/size grids. |
| `tests/integration/test_device_efficiency.py` | Smoke-tests the harness + runner on CPU backends and validates sync-counter restoration behavior. |
| `ipax/backend/namespace.py` | Implements device discovery via `__array_namespace_info__().devices()` and exposes it in `Capabilities`. |
| `ipax/result.py` | Adds `Result.device` field for reporting where the solve ran. |
| `ipax/solve.py` | Populates `Result.device` from the solution array’s `.device` attribute. |
| `ipax/_logging.py` | Includes `device` in the tier-1 `format_result()` output. |
| `tests/unit/test_logging.py` | Updates logging expectations to include the new `device` line. |
| `tests/unit/test_backend_namespace.py` | Adds tests for real device reporting and fallback behavior in `capabilities()`. |
| `benchmarks/runners/micro/bench_kernels.py` | Implements micro-benchmarks for matvec/dense solve/Newton-step kernels across installed backends. |

…y return

The infeasible-bounds early return in solve() left Result.device as the default empty string while x0 (and its device) was available, making format_result() output inconsistent across failure modes. Populate device from x0 there too, matching the main path. Regression assertion added to the infeasible-bounds test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

codecov · 2026-06-21T13:57:33Z

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment

Thanks for integrating Codecov - We've got you covered ☂️

- Bump ipax.__version__ to 0.2.0 (single-sources the project version via hatch).
- Add the 0.2.0 CHANGELOG section: GPU device-efficiency harness, device reporting, and the fraction_to_boundary vectorization; update compare links.
- Add a Codecov coverage badge to the README.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

wahln and others added 4 commits June 21, 2026 13:22

qol updates in CI and others to version 0.1.1

ab3d580

Merge main into develop after squash-merge of #1 (0.1.1)

2c64c8f

wahln requested a review from Copilot June 21, 2026 13:41

Copilot started reviewing on behalf of wahln June 21, 2026 13:42

Copilot AI reviewed Jun 21, 2026

Comment thread ipax/solve.py

wahln merged commit d22637f into main Jun 21, 2026
11 checks passed

wahln merged 6 commits into main from rc/0.2.0

2 participants
