…copes
- Ship root conftest.py in the sdist so the suite collects from an sdist (it now requires the root conftest for --doctest-modules).
- Fix README CI badge URL: niklaswahl/ipax -> wahln/ipax.
- Add explicit `actions: read` to the release.yml jobs that run actions/download-artifact (defensive least-privilege; same-run artifacts use the runtime token, but being explicit de-risks restricted-default orgs).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…dary
Measure-first GPU optimization pass (CuPy on RTX 4070).
Harness:
- Real device detection in capabilities() via the Array-API inspection API
(__array_namespace_info__().devices()); was hardcoded ("cpu",). Result.device
added and surfaced in the tier-1 summary.
- benchmarks/harness: DeviceMetrics + measure_device_solve with a host-sync
counter (patches the array type scalar dunders) and CuPy gpu/cpu split.
- benchmarks/runners/device_efficiency.py (GPU-gated, no-op on CI) and the
previously-stubbed micro-benchmarks (matvec / dense solve / Newton step).
Optimization:
- fraction_to_boundary looped over the array in Python calling float(v[idx])
per element (O(n) host syncs, called 6x/iter). On GPU this dominated: 5,510
syncs/iter at n=1000 scaling linearly to 27,422 at n=5000. Vectorized to a
single reduction (Wachter-Biegler 2006 eq. 15), Array-API pure.
- Result on the 4070: krylov syncs/iter 27,422 -> 133 (no longer scales with n),
wall 9.04s -> 0.39s (~23x) at n=5000; same iters and KKT error. The loop is
now ~97% GPU-compute-bound (gpu_time ~= wall).
Tests: device-efficiency harness smoke test, O(1)-sync regression guard for
fraction_to_boundary, and a vectorization-equivalence test. All gates green
(799 passed, ruff/format/mypy/purity clean).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
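The loop-to-reduction change described in the commit message above can be sketched as follows. This is a minimal illustration, not the code in `ipax/ipm/barrier.py`: the function names, the `(x, dx, tau)` signature, and the use of NumPy as the array namespace are all assumptions made for the example. It applies the fraction-to-boundary rule in its usual form, taking the largest step in (0, 1] that keeps `x + alpha * dx >= (1 - tau) * x`.

```python
import numpy as np  # assumption: NumPy stands in for any Array-API namespace


def fraction_to_boundary_loop(x, dx, tau):
    """Reference form: one float() per element, i.e. O(n) host syncs on a GPU array."""
    alpha = 1.0
    for i in range(x.shape[0]):
        dxi = float(dx[i])
        if dxi < 0.0:
            alpha = min(alpha, -tau * float(x[i]) / dxi)
    return alpha


def fraction_to_boundary_vec(x, dx, tau, xp=np):
    """Same rule as a single masked reduction: one float() at the very end."""
    if x.shape[0] == 0:
        return 1.0  # nothing constrains the step
    neg = dx < 0
    # Replace non-negative entries with -1 so the division never sees a zero.
    safe_dx = xp.where(neg, dx, -xp.ones_like(dx))
    ratios = xp.where(neg, -tau * x / safe_dx, xp.full_like(x, xp.inf))
    return min(1.0, float(xp.min(ratios)))


x = np.array([1.0, 2.0, 4.0])
dx = np.array([-2.0, 1.0, -1.0])
alpha_loop = fraction_to_boundary_loop(x, dx, 0.99)
alpha_vec = fraction_to_boundary_vec(x, dx, 0.99)
```

Only entries with a negative step component can hit the boundary, so the mask sends every other entry to infinity before the reduction. The single `float()` on the reduced value is the only scalar materialization left.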
Pull request overview
This PR prepares the 0.2.0 release candidate by adding a GPU/device-efficiency profiling harness under benchmarks/, surfacing device information in solver results/logging, and removing an O(n) host↔device sync bottleneck in the interior-point loop by vectorizing fraction_to_boundary.
Changes:
- Vectorize `fraction_to_boundary` to eliminate per-element scalar materializations (dramatically reducing GPU syncs/iter) and add unit + regression coverage.
- Add device reporting: detect available devices in backend capabilities and record the solution device on `Result`, including tier-1 logging output.
- Introduce the device-efficiency study harness/runner and revive micro-benchmarks for kernel-level timing across installed backends.
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `ipax/ipm/barrier.py` | Vectorized fraction_to_boundary to avoid per-element host syncs. |
| `tests/unit/test_barrier.py` | Adds empty-vector and reference-equivalence tests for fraction_to_boundary. |
| `tests/regression/test_fraction_to_boundary_no_elementwise_sync.py` | Adds regression guard that scalar syncs during fraction_to_boundary stay O(1). |
| `benchmarks/harness/__init__.py` | Adds device-efficiency measurement primitives (sync counter, GPU timing/memory, report formatting). |
| `benchmarks/runners/device_efficiency.py` | Adds CLI runner to generate JSON/Markdown device-efficiency reports over backend/route/size grids. |
| `tests/integration/test_device_efficiency.py` | Smoke-tests the harness + runner on CPU backends and validates sync-counter restoration behavior. |
| `ipax/backend/namespace.py` | Implements device discovery via `__array_namespace_info__().devices()` and exposes it in Capabilities. |
| `ipax/result.py` | Adds Result.device field for reporting where the solve ran. |
| `ipax/solve.py` | Populates Result.device from the solution array's .device attribute. |
| `ipax/_logging.py` | Includes device in the tier-1 format_result() output. |
| `tests/unit/test_logging.py` | Updates logging expectations to include the new device line. |
| `tests/unit/test_backend_namespace.py` | Adds tests for real device reporting and fallback behavior in capabilities(). |
| `benchmarks/runners/micro/bench_kernels.py` | Implements micro-benchmarks for matvec/dense solve/Newton-step kernels across installed backends. |
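The sync counter listed for `benchmarks/harness/__init__.py`, and the restoration behavior the integration test checks, can be illustrated with a small context manager. This is a sketch under stated assumptions rather than the harness's actual code: `count_scalar_syncs`, `SCALAR_DUNDERS`, and `ToyDeviceArray` are hypothetical names, and the toy class stands in for a GPU array type whose scalar conversions would each force a device-to-host copy.

```python
from contextlib import contextmanager

# Dunders that turn an array into a Python scalar (hypothetical selection).
SCALAR_DUNDERS = ("__float__", "__int__", "__bool__", "__index__")


@contextmanager
def count_scalar_syncs(array_type):
    """Count scalar conversions on array_type, restoring the originals on exit."""
    counter = {"count": 0}
    originals = {}

    def wrap(orig):
        def wrapper(self, *args, **kwargs):
            counter["count"] += 1
            return orig(self, *args, **kwargs)
        return wrapper

    for name in SCALAR_DUNDERS:
        orig = array_type.__dict__.get(name)  # only dunders defined on the class itself
        if orig is not None:
            originals[name] = orig
            setattr(array_type, name, wrap(orig))
    try:
        yield counter
    finally:
        for name, orig in originals.items():
            setattr(array_type, name, orig)


class ToyDeviceArray:
    """Hypothetical stand-in for a device array; float() models a host sync."""

    def __init__(self, values):
        self.values = list(values)

    def __getitem__(self, i):
        return ToyDeviceArray([self.values[i]])

    def __float__(self):
        return float(self.values[0])


v = ToyDeviceArray([3.0, 1.0, 2.0])
with count_scalar_syncs(ToyDeviceArray) as counter:
    smallest = min(float(v[i]) for i in range(3))  # per-element loop: 3 conversions
syncs_in_loop = counter["count"]
```

Patching the class rather than an instance matters because Python looks up special methods on the type. The `finally` clause puts the original methods back even if the measured code raises.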
…y return
The infeasible-bounds early return in solve() left Result.device as the default empty string while x0 (and its device) was available, making format_result() output inconsistent across failure modes. Populate device from x0 there too, matching the main path. Regression assertion added to the infeasible-bounds test.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
- Bump ipax.__version__ to 0.2.0 (single-sources the project version via hatch).
- Add the 0.2.0 CHANGELOG section: GPU device-efficiency harness, device reporting, and the fraction_to_boundary vectorization; update compare links.
- Add a Codecov coverage badge to the README.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Release candidate for 0.2.0. No version bump yet — opening for PR review first; the bump + changelog will follow once the review is settled.
Summary
A measure-first GPU optimization pass, validated on a local RTX 4070 (CuPy), plus the device-efficiency profiling harness it required.
Profiling harness (the deferred `device_efficiency` deliverable)
- Real device detection in `capabilities()` via the Array-API inspection API (`__array_namespace_info__().devices()`); was hardcoded `("cpu",)`. `Result.device` added and shown in the tier-1 summary.
- `benchmarks/harness`: `DeviceMetrics` + `measure_device_solve`, with a host-sync counter (patches the array type's scalar dunders) and the CuPy GPU/CPU time split.
- `benchmarks/runners/device_efficiency.py` (GPU-gated, so a no-op on CI) and the previously-stubbed micro-benchmarks (matvec / dense solve / Newton step).
The optimization
- `fraction_to_boundary` looped over the array in Python calling `float(v[idx])` per element: O(n) host↔device syncs, called 6× per iteration. On GPU this dominated: 5,510 syncs/iter at n=1000, scaling linearly to 27,422 at n=5000.
- Vectorized to a single reduction (Wachter-Biegler 2006 eq. 15), Array-API pure.
- Result on the 4070: krylov syncs/iter 27,422 -> 133 (no longer scales with n), wall 9.04s -> 0.39s (~23x) at n=5000; same iters and KKT error.
- The loop is now ~97% GPU-compute-bound (`gpu_time ≈ wall`), so further sync consolidation was deliberately not pursued (data showed <3% headroom).
Tests / gates
- Device-efficiency harness smoke test, O(1)-sync regression guard for `fraction_to_boundary`, and a vectorization-equivalence test.
Notes for review
- … `array_namespace`/`xp.*`; backend-specific timers live in `benchmarks/`.
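For reviewers unfamiliar with the inspection API, device discovery through `__array_namespace_info__().devices()` might look like the sketch below. The helper name `discover_devices` and the choice of `("cpu",)` as the fallback are assumptions for illustration (the fallback mirrors the previously hardcoded value), and the fake namespace exists only so the example runs without any array library installed.

```python
from types import SimpleNamespace


def discover_devices(xp):
    """Return device names reported by an Array-API namespace, else ("cpu",)."""
    info_fn = getattr(xp, "__array_namespace_info__", None)
    if info_fn is None:
        return ("cpu",)  # namespace predates the inspection API
    try:
        devices = info_fn().devices()
    except Exception:
        return ("cpu",)  # be conservative if the backend misbehaves
    return tuple(str(d) for d in devices) or ("cpu",)


# Hypothetical namespace objects standing in for real backends.
fake_info = SimpleNamespace(devices=lambda: ["cpu", "cuda:0"])
gpu_like_xp = SimpleNamespace(__array_namespace_info__=lambda: fake_info)
legacy_xp = SimpleNamespace()

gpu_like_devices = discover_devices(gpu_like_xp)
legacy_devices = discover_devices(legacy_xp)
```

Converting each device to `str` keeps the result comparable across backends whose device objects have different types.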