feat(linalg): matmul + transpose backed by OpenBLAS / Accelerate / cuBLAS#18
feat(linalg): matmul + transpose backed by OpenBLAS / Accelerate / cuBLAS#18Hayden727 wants to merge 12 commits into
Conversation
…BLAS Implements issue #11. - Tensor::T() and ctorch::transpose(x, i, j): zero-copy view ops; T() is a 2-D shorthand that rejects non-2-D inputs with ShapeError. - ctorch::matmul: full PyTorch shape rules (1D×1D, 1D×2D, 2D×1D, 2D×2D, batched N-D with right-aligned broadcast). f32/f64 only; integer/bool raise DTypeError; cross-device raises DeviceError; inner-dim mismatch raises ShapeError with both shapes printed. - BLAS layer: linalg/blas.h thin wrapper, blas_cpu.cpp dispatches into cblas_sgemm/dgemm via find_package(BLAS) (Apple Accelerate / OpenBLAS / MKL all work). blas_cuda.cu dispatches into cublasSgemm/Dgemm via a per-thread per-device handle cache, using the C^T = B^T A^T trick to feed cuBLAS column-major. - CMake: new CTORCH_BLAS option (default ON) gated through cmake/ctorch_blas.cmake; sets ACCELERATE_NEW_LAPACK on macOS to avoid the deprecated cblas_sgemm warning. CUDA build links CUDA::cublas. - Tests: 22 functional tests (transpose semantics, all 5 matmul shape cases, dtype rules, error paths) plus 8 parity cases vs np.matmul. - docs/ops.md gained a Linear algebra section; README roadmap marks the milestone shipped. Closes #11
…e + BLAS-OFF tests Addresses three issues raised by Codex on the issue-#11 PR: - P1: matmul on non-contiguous CUDA operands (e.g. matmul(a, a.T())) used to throw because Tensor::contiguous() also throws on non-CPU tensors. Round-trip through CPU instead — slow but correct, matching the issue's documented examples. Optimisation deferred. - P2: mixed float32 × float64 was accepted on CPU but rejected on CUDA; align both backends by routing CUDA casts through CPU as well. - P2: matmul tests are no-ops when CTORCH_BLAS=OFF (CPU backend is a stub in that mode). Gate the CPU-using matmul tests behind if(CTORCH_BLAS) so the no-BLAS build keeps the suite green. Adds matmul_cuda_test exercising the new fallbacks on real GPU plus asymmetric M/N/K to lock in the cuBLAS column-major C^T = B^T A^T trick. Skipped at runtime on hosts without CUDA.
The matmul P1 fallback `t.to(cpu).contiguous().to(dev)` couldn't run on strided CUDA inputs because `Tensor::to(cpu)` started by calling `contiguous()`, which throws on non-CPU. Extend `Tensor::to`: - Already-contiguous source → fast path (unchanged). - Non-CPU non-contiguous source → bulk-copy the entire underlying storage device-to-host while preserving shape / stride / offset. The destination is non-contiguous on its target device; callers can `.contiguous()` from there (works on CPU since the strided-to- contiguous odometer is CPU-side). - CPU non-contig source → unchanged (materialise on CPU first). This unblocks `matmul(a, a.T())` on CUDA — the test that was failing on dell now has a working round-trip path.
Codex review on PR #18 flagged the BLAS detection as platform-fragile: - P1: macOS unconditionally picked Accelerate's header and defined ACCELERATE_NEW_LAPACK, even when the user passed -DBLA_VENDOR=OpenBLAS / MKL. That compiled against Accelerate's CBLAS prototypes while linking a different ABI. Now check that the resolved BLAS_LIBRARIES actually contain Accelerate AND the vendor isn't explicitly set to a non-Apple choice before opting into the framework header. - P2: check_include_file_cxx("cblas.h") only searched the compiler's default include dirs, so brew/apt/Cellar OpenBLAS installs at non-system prefixes failed configuration. Replace with find_path() carrying explicit HINTS for the common prefixes; surface the resolved path through CTORCH_CBLAS_INCLUDE_DIR so ctorch_core can add it as a private include directory.
…tmul Codex review on the issue-#11 branch flagged two more P2s: - The float-only contract was checked against the *promoted* dtype, which let `int32 × float32` and `bool × float64` slip through after an implicit widen. Move the gate to each operand individually before promotion so any non-float input throws DTypeError as documented. - The CUDA round-trip fallback paths (`Tensor::to(Device::cpu())` and `.to(dev)`) ran without `cuda::DeviceGuard`, so on multi-GPU hosts the temporary download/upload could land on the wrong context when the calling thread's current device differed from `t.device()`. Pin the device for the duration of every fallback path. Adds a regression test for mixed int/bool × float on CPU.
Issue #11 added BLAS-backed matmul; the ctorch_blas.cmake module now calls find_package(BLAS REQUIRED) which fails on the GH Actions Linux images because no BLAS is preinstalled. macOS uses Accelerate so it works out of the box. Add libopenblas-dev to every Linux apt-get line (lint, build-test, asan, valgrind, build-cuda) so CTORCH_BLAS=ON (the default) configures cleanly across the matrix.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a321380f58
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Two issues stacked from review + CI: - Codex P1 on PR #18: `Tensor::to(cuda->cpu)` for a strided view used to allocate `nbytes()` worth of CPU storage and copy the entire underlying allocation. For a small slice into a large CUDA tensor that turned a few-KB transfer into a multi-GB blowup. Compute the byte range actually addressed by the view and copy only that; rebase the destination's offset to 0 so subsequent `.contiguous()` works. - CI clang-16 build-test: OpenBLAS's `cblas.h` transitively includes `openblas_config.h`, which uses `float _Complex` typedefs. Under `-Werror -Wpedantic -Wc99-extensions` clang rejects this even though we never reference the typedefs. Wrap the cblas include in a clang-specific diagnostic pragma to silence -Wc99-extensions for exactly that header. Adds a regression test (matmul_cuda_test) that allocates a 1024x1024 CUDA tensor, slices a 2x2 view, copies to CPU, and asserts the CPU storage size is < 1% of the source — fails clearly if the bounded- range fix regresses.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e72f445615
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codex review on PR #18 caught a P1 in `broadcast_batch`: using `max(da, db)` when one operand is `1` collapses `(0, 1)` to `1` instead of `0`, scheduling GEMMs against empty storage and producing a non-empty output for inputs that should broadcast to zero rows. Pick the non-`1` operand explicitly (matches `ops/broadcast.cpp`), so `(0, 1)` and `(1, 0)` both stay `0`. Adds a regression test: `matmul((0,2,3), (1,3,4))` now produces an empty `(0, 2, 4)` tensor instead of corrupted output.
|
@codex review. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: eb174fe877
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codex P2 on PR #18: backends narrow `MatmulPlan`'s int64 M/N/K to `int` for cblas/cuBLAS. A dim above INT_MAX would silently wrap to a negative value at the call site. Surface it as ShapeError up front in plan_matmul so both backends inherit the guard without duplication. Adds a regression test that constructs a tensor with shape `(INT_MAX+1, 0)` (storage allocates zero bytes thanks to the 0 companion dim) and expects ShapeError instead of corrupted GEMM.
|
@codex review. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2ecaa71214
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codex P1 on PR #18: when K==0 (empty contraction axis), the front-door still dispatched to the BLAS backend, which then issued cblas/cuBLAS calls with K=0 and lda=K=0 — illegal-parameter in both libraries. Mathematically the contraction sum over an empty axis is 0, so the result is an M×N tensor of zeros — exactly what `Tensor`'s zero-init already produces. Add `plan.K == 0` to the early-return guard so the GEMM call is skipped. Adds a regression test for `(2, 0) @ (0, 3)` → all-zero (2, 3).
|
@codex review. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e7cebdf17e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Codex P2 on PR #18: `transpose`'s `dim + rank` overflows signed `int` when `dim == INT_MIN` (UB). The shared `normalise_dim` helper used by slice / select / narrow had the same pattern. Range-check the unnormalised value against the valid window `[-rank, rank)` before adding, mirroring the INT64_MIN fixes already landed in the same file. Adds INT_MIN regression tests for transpose, slice, select, narrow.
|
@codex review. |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a0e005d050
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (plan.M == 0 || plan.N == 0 || plan.K == 0 || plan.a_offsets.empty()) { | ||
| // Empty result OR empty contraction: skip the GEMM call. | ||
| // - M==0 / N==0 / batch==0: the output tensor is empty, so | ||
| // nothing to write. | ||
| // - K==0: the contraction sum is empty so C must be 0; | ||
| // Tensor's zero-init already gives that. Dispatching anyway | ||
| // would feed BLAS `K=0` plus `lda=K=0` (an illegal-parameter | ||
| // error in cblas / cuBLAS). | ||
| return out; |
There was a problem hiding this comment.
Preserve BLAS-off error path for degenerate CPU matmul
Keep the early-exit path from bypassing backend dispatch on CPU builds without BLAS. In -DCTORCH_BLAS=OFF builds, matmul_cpu is the component that raises the documented runtime error, but this pre-dispatch return lets calls like (M,0) @ (0,N) or zero-batch inputs succeed silently instead of surfacing that matmul is unavailable on CPU. That makes behavior shape-dependent in a configuration that is supposed to be a stub and can hide unsupported deployment setups.
Useful? React with 👍 / 👎.
Summary
Implements issue #11 — BLAS-backed matrix multiplication and zero-copy transpose.
Tensor::T()andctorch::transpose(x, i, j)— zero-copy view ops; T() is the 2-D shorthand and rejects non-2-D inputs withShapeError.ctorch::matmul— full PyTorch shape rules (1-D × 1-D dot, 1-D × 2-D, 2-D × 1-D, 2-D × 2-D, batched N-D with right-aligned broadcast). f32/f64 only; integer / bool raiseDTypeError; cross-device raisesDeviceError; inner-dim mismatch raisesShapeErrorwith both shapes printed.linalg/blas.hthin wrapper,blas_cpu.cppdispatches intocblas_sgemm/dgemmviafind_package(BLAS)(Apple Accelerate, OpenBLAS, MKL all work).blas_cuda.cuusescublasSgemm/Dgemmvia a per-thread per-device handle cache, with the standardC^T = B^T A^Ttrick for cuBLAS column-major.CTORCH_BLASoption (default ON) gated throughcmake/ctorch_blas.cmake. HonoursBLA_VENDOR, locatescblas.hin non-system prefixes (Homebrew, Apt). CUDA build linksCUDA::cublas.CTORCH_BLAS=OFFis a supported smoke build (matmul becomes a stub; CPU matmul tests gated off).SumBench.CudaWithin15PercentOfCubperf flake from issue [Feature] Reductions with axis & keepdim (CPU + CUDA) #9, unrelated).Test plan
ctest -j4→ 273/273 pass.CTORCH_CUDA=ON,CTORCH_BLAS=OFF):ctest -j→ 266/267 pass; only failure is the unrelated perf flake.matmul_cuda_test(4 cases):TwoDByTwoDMatchesHandComputed,AsymmetricMNKLockstheColumnMajorTrick(asymmetric M=5/N=11/K=7 to lock in the cuBLAS row→col reinterpretation),MatmulOfTransposeUsesCpuRoundTripFallback(non-contig CUDA),MixedFloatPromotesToFloat64(mixed dtype CUDA) — all pass.np.matmul.Closes #11