rc/0.2.0: GPU device-efficiency profiling + fraction-to-boundary vectorization#2

Merged
wahln merged 6 commits into main from rc/0.2.0 on Jun 21, 2026

@wahln wahln commented Jun 21, 2026

Owner

Release candidate for 0.2.0. No version bump yet — opening for PR review first; the bump + changelog will follow once the review is settled.

Summary

A measure-first GPU optimization pass, validated on a local RTX 4070 (CuPy), plus the device-efficiency profiling harness it required.

Profiling harness (the deferred device_efficiency deliverable)

  • Real device detection in capabilities() via the Array-API inspection API (__array_namespace_info__().devices()); the device list was previously hardcoded to ("cpu",). Result.device added and shown in the tier-1 summary.
  • benchmarks/harness: DeviceMetrics + measure_device_solve, with a host-sync counter (patches the array type's scalar dunders) and the CuPy GPU/CPU time split.
  • benchmarks/runners/device_efficiency.py (GPU-gated → no-op on CI) and the previously-stubbed micro-benchmarks (matvec / dense solve / Newton step).
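
For reviewers unfamiliar with the inspection API, here is a minimal sketch of the device-detection idea. It is not the code in ipax/backend/namespace.py: the function name `detect_devices` and its `fallback` parameter are hypothetical, and the only thing it assumes about the namespace is the `__array_namespace_info__().devices()` call named above.

```python
def detect_devices(xp, fallback=("cpu",)):
    """Return the devices an array namespace reports, as a tuple of strings.

    Hypothetical helper for illustration. Namespaces that do not implement
    the Array-API inspection utilities get the fallback instead.
    """
    info_fn = getattr(xp, "__array_namespace_info__", None)
    if info_fn is None:
        # Older or non-conforming namespace: nothing to inspect.
        return tuple(fallback)
    try:
        devices = info_fn().devices()
    except Exception:
        # Inspection exists but failed; report the fallback rather than crash.
        return tuple(fallback)
    # Device objects are backend-specific, so normalise them to strings.
    return tuple(str(d) for d in devices) or tuple(fallback)
```

Normalising to strings keeps the result comparable across backends whose device objects have different types; whether ipax does the same is not stated here.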

The optimization

  • Finding: fraction_to_boundary looped over the array in Python calling float(v[idx]) per element — O(n) host↔device syncs, called 6× per iteration. On GPU this dominated: 5,510 syncs/iter at n=1000, scaling linearly to 27,422 at n=5000.
  • Fix: vectorized to a single reduction (Wächter–Biegler 2006 eq. 15), Array-API pure (no concrete backend in core).
  • Result on the 4070: krylov syncs/iter 27,422 → 133 (no longer scales with n); wall 9.04s → 0.39s (~23×) at n=5000; identical iteration counts and KKT error. The loop is now ~97% GPU-compute-bound (gpu_time ≈ wall), so further sync consolidation was deliberately not pursued (data showed <3% headroom).
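
The before/after shape of the change can be sketched as follows. This is an illustration under assumptions, not the code in ipax/ipm/barrier.py: the function names, the `(x, dx, tau)` signature, and the `xp` argument are hypothetical, `x` is assumed strictly positive, and NumPy stands in for CuPy so the sketch runs without a GPU. It uses the usual fraction-to-boundary rule, the largest α in (0, 1] with x + α·dx ≥ (1 − τ)·x.

```python
import numpy as np


def fraction_to_boundary_loop(x, dx, tau):
    """Reference form: one scalar read per element (O(n) host syncs on GPU)."""
    alpha = 1.0
    for i in range(x.shape[0]):
        d = float(dx[i])
        if d < 0.0:
            alpha = min(alpha, -tau * float(x[i]) / d)
    return alpha


def fraction_to_boundary_vectorized(x, dx, tau, xp=np):
    """Single reduction; only the final float() leaves the device."""
    if x.shape[0] == 0:
        return 1.0  # nothing constrains the step
    neg = dx < 0
    # Replace non-negative entries by -1 so the division is always safe.
    safe_dx = xp.where(neg, dx, -xp.ones_like(dx))
    # Entries that do not limit the step contribute the neutral value 1.
    ratios = xp.where(neg, -tau * x / safe_dx, xp.ones_like(x))
    return min(1.0, float(xp.min(ratios)))


x = np.array([1.0, 2.0, 4.0])
dx = np.array([-2.0, 1.0, -1.0])
alpha_loop = fraction_to_boundary_loop(x, dx, 0.99)
alpha_vec = fraction_to_boundary_vectorized(x, dx, 0.99)
```

Both forms agree on this input; the vectorized one performs a single scalar materialization regardless of n, which is the property the sync counts above reflect.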

Tests / gates

  • Device-efficiency harness smoke test, an O(1)-sync regression guard for fraction_to_boundary, and a vectorization-equivalence test.
  • All gates green: 799 passed, ruff/format/mypy/import-purity clean.
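
The idea behind the sync counter and the O(1)-sync guard can be sketched with a toy array type. Everything here is hypothetical (`count_scalar_reads`, `ToyArray`, the set of patched dunders); the real harness in benchmarks/harness patches the backend's array type and may differ in detail.

```python
from contextlib import contextmanager

SCALAR_DUNDERS = ("__float__", "__int__", "__bool__")


@contextmanager
def count_scalar_reads(array_type):
    """Count scalar conversions on array_type, restoring the originals on exit."""
    counter = {"n": 0}
    own = vars(array_type)
    originals = {name: own[name] for name in SCALAR_DUNDERS if name in own}

    def wrap(original):
        def wrapper(self):
            counter["n"] += 1  # each call stands for one host sync
            return original(self)
        return wrapper

    for name, original in originals.items():
        setattr(array_type, name, wrap(original))
    try:
        yield counter
    finally:
        for name, original in originals.items():
            setattr(array_type, name, original)


class ToyArray:
    """Stand-in for a device array; float() models a device-to-host read."""

    def __init__(self, values):
        self.values = list(values)

    def __len__(self):
        return len(self.values)

    def __getitem__(self, i):
        return ToyArray([self.values[i]])

    def __float__(self):
        return float(self.values[0])

    def min(self):
        return ToyArray([min(self.values)])


v = ToyArray([3.0, 1.0, 2.0])
with count_scalar_reads(ToyArray) as loop_counter:
    looped = min(float(v[i]) for i in range(len(v)))  # one read per element
with count_scalar_reads(ToyArray) as reduce_counter:
    reduced = float(v.min())  # one read in total
```

A regression guard in this style would assert that the count for the reduced form stays constant as the array grows.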

Notes for review

  • Invariants held: core change uses only array_namespace/xp.*; backend-specific timers live in benchmarks/.
  • torch/jax CUDA wheels are a separate install step; the harness is already backend-parametric for when they're added.

🤖 Generated with Claude Code

wahln and others added 4 commits June 21, 2026 13:22
…copes

- Ship root conftest.py in the sdist so the suite collects from an sdist
  (it now requires the root conftest for --doctest-modules).
- Fix README CI badge URL: niklaswahl/ipax -> wahln/ipax.
- Add explicit `actions: read` to the release.yml jobs that run
  actions/download-artifact (defensive least-privilege; same-run artifacts
  use the runtime token, but being explicit de-risks restricted-default orgs).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…dary

Measure-first GPU optimization pass (CuPy on RTX 4070).

Harness:
- Real device detection in capabilities() via the Array-API inspection API
  (__array_namespace_info__().devices()); was hardcoded ("cpu",). Result.device
  added and surfaced in the tier-1 summary.
- benchmarks/harness: DeviceMetrics + measure_device_solve with a host-sync
  counter (patches the array type scalar dunders) and CuPy gpu/cpu split.
- benchmarks/runners/device_efficiency.py (GPU-gated, no-op on CI) and the
  previously-stubbed micro-benchmarks (matvec / dense solve / Newton step).

Optimization:
- fraction_to_boundary looped over the array in Python calling float(v[idx])
  per element (O(n) host syncs, called 6x/iter). On GPU this dominated: 5,510
  syncs/iter at n=1000 scaling linearly to 27,422 at n=5000. Vectorized to a
  single reduction (Wachter-Biegler 2006 eq. 15), Array-API pure.
- Result on the 4070: krylov syncs/iter 27,422 -> 133 (no longer scales with n),
  wall 9.04s -> 0.39s (~23x) at n=5000; same iters and KKT error. The loop is
  now ~97% GPU-compute-bound (gpu_time ~= wall).

Tests: device-efficiency harness smoke test, O(1)-sync regression guard for
fraction_to_boundary, and a vectorization-equivalence test. All gates green
(799 passed, ruff/format/mypy/purity clean).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot AI left a comment

Pull request overview

This PR prepares the 0.2.0 release candidate by adding a GPU/device-efficiency profiling harness under benchmarks/, surfacing device information in solver results/logging, and removing an O(n) host↔device sync bottleneck in the interior-point loop by vectorizing fraction_to_boundary.

Changes:

  • Vectorize fraction_to_boundary to eliminate per-element scalar materializations (dramatically reducing GPU syncs/iter) and add unit + regression coverage.
  • Add device reporting: detect available devices in backend capabilities and record the solution device on Result, including tier-1 logging output.
  • Introduce the device-efficiency study harness/runner and revive micro-benchmarks for kernel-level timing across installed backends.

Reviewed changes

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| ipax/ipm/barrier.py | Vectorized fraction_to_boundary to avoid per-element host syncs. |
| tests/unit/test_barrier.py | Adds empty-vector and reference-equivalence tests for fraction_to_boundary. |
| tests/regression/test_fraction_to_boundary_no_elementwise_sync.py | Adds regression guard that scalar syncs during fraction_to_boundary stay O(1). |
| benchmarks/harness/__init__.py | Adds device-efficiency measurement primitives (sync counter, GPU timing/memory, report formatting). |
| benchmarks/runners/device_efficiency.py | Adds CLI runner to generate JSON/Markdown device-efficiency reports over backend/route/size grids. |
| tests/integration/test_device_efficiency.py | Smoke-tests the harness + runner on CPU backends and validates sync-counter restoration behavior. |
| ipax/backend/namespace.py | Implements device discovery via __array_namespace_info__().devices() and exposes it in Capabilities. |
| ipax/result.py | Adds Result.device field for reporting where the solve ran. |
| ipax/solve.py | Populates Result.device from the solution array’s .device attribute. |
| ipax/_logging.py | Includes device in the tier-1 format_result() output. |
| tests/unit/test_logging.py | Updates logging expectations to include the new device line. |
| tests/unit/test_backend_namespace.py | Adds tests for real device reporting and fallback behavior in capabilities(). |
| benchmarks/runners/micro/bench_kernels.py | Implements micro-benchmarks for matvec/dense solve/Newton-step kernels across installed backends. |

Comment thread ipax/solve.py
…y return

The infeasible-bounds early return in solve() left Result.device as the default
empty string while x0 (and its device) was available, making format_result()
output inconsistent across failure modes. Populate device from x0 there too,
matching the main path. Regression assertion added to the infeasible-bounds test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@codecov

codecov Bot commented Jun 21, 2026

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

ℹ️ You can also turn on project coverage checks and project coverage reporting on Pull Request comment

Thanks for integrating Codecov - We've got you covered ☂️

- Bump ipax.__version__ to 0.2.0 (single-sources the project version via hatch).
- Add the 0.2.0 CHANGELOG section: GPU device-efficiency harness, device
  reporting, and the fraction_to_boundary vectorization; update compare links.
- Add a Codecov coverage badge to the README.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wahln wahln merged commit d22637f into main Jun 21, 2026
11 checks passed
2 participants