feat(linalg): matmul + transpose backed by OpenBLAS / Accelerate / cuBLAS by Hayden727 · Pull Request #18 · Hayden727/ctorch

Hayden727 · 2026-04-30T17:04:19Z

Summary

Implements issue #11 — BLAS-backed matrix multiplication and zero-copy transpose.

Tensor::T() and ctorch::transpose(x, i, j) — zero-copy view ops; T() is the 2-D shorthand and rejects non-2-D inputs with ShapeError.
ctorch::matmul — full PyTorch shape rules (1-D × 1-D dot, 1-D × 2-D, 2-D × 1-D, 2-D × 2-D, batched N-D with right-aligned broadcast). f32/f64 only; integer / bool raise DTypeError; cross-device raises DeviceError; inner-dim mismatch raises ShapeError with both shapes printed.
BLAS layer — linalg/blas.h thin wrapper, blas_cpu.cpp dispatches into cblas_sgemm/dgemm via find_package(BLAS) (Apple Accelerate, OpenBLAS, MKL all work). blas_cuda.cu uses cublasSgemm/Dgemm via a per-thread per-device handle cache, with the standard C^T = B^T A^T trick for cuBLAS column-major.
CMake — new CTORCH_BLAS option (default ON) gated through cmake/ctorch_blas.cmake. Honours BLA_VENDOR, locates cblas.h in non-system prefixes (Homebrew, Apt). CUDA build links CUDA::cublas. CTORCH_BLAS=OFF is a supported smoke build (matmul becomes a stub; CPU matmul tests gated off).
273 mac tests pass; full dell GPU run (CUDA 12.0, sm_75/80/86) is clean for matmul (the only failure is a pre-existing SumBench.CudaWithin15PercentOfCub perf flake from issue [Feature] Reductions with axis & keepdim (CPU + CUDA) #9, unrelated).

Test plan

Mac local CPU build (Apple Accelerate): ctest -j4 → 273/273 pass.
Dell remote CUDA build (CTORCH_CUDA=ON, CTORCH_BLAS=OFF): ctest -j → 266/267 pass; only failure is the unrelated perf flake.
New matmul_cuda_test (4 cases): TwoDByTwoDMatchesHandComputed, AsymmetricMNKLockstheColumnMajorTrick (asymmetric M=5/N=11/K=7 to lock in the cuBLAS row→col reinterpretation), MatmulOfTransposeUsesCpuRoundTripFallback (non-contig CUDA), MixedFloatPromotesToFloat64 (mixed dtype CUDA) — all pass.
All 5 matmul shape rules + 8 parity fixtures via np.matmul.

Closes #11

…BLAS Implements issue #11. - Tensor::T() and ctorch::transpose(x, i, j): zero-copy view ops; T() is a 2-D shorthand that rejects non-2-D inputs with ShapeError. - ctorch::matmul: full PyTorch shape rules (1D×1D, 1D×2D, 2D×1D, 2D×2D, batched N-D with right-aligned broadcast). f32/f64 only; integer/bool raise DTypeError; cross-device raises DeviceError; inner-dim mismatch raises ShapeError with both shapes printed. - BLAS layer: linalg/blas.h thin wrapper, blas_cpu.cpp dispatches into cblas_sgemm/dgemm via find_package(BLAS) (Apple Accelerate / OpenBLAS / MKL all work). blas_cuda.cu dispatches into cublasSgemm/Dgemm via a per-thread per-device handle cache, using the C^T = B^T A^T trick to feed cuBLAS column-major. - CMake: new CTORCH_BLAS option (default ON) gated through cmake/ctorch_blas.cmake; sets ACCELERATE_NEW_LAPACK on macOS to avoid the deprecated cblas_sgemm warning. CUDA build links CUDA::cublas. - Tests: 22 functional tests (transpose semantics, all 5 matmul shape cases, dtype rules, error paths) plus 8 parity cases vs np.matmul. - docs/ops.md gained a Linear algebra section; README roadmap marks the milestone shipped. Closes #11

…e + BLAS-OFF tests Addresses three issues raised by Codex on the issue-#11 PR: - P1: matmul on non-contiguous CUDA operands (e.g. matmul(a, a.T())) used to throw because Tensor::contiguous() also throws on non-CPU tensors. Round-trip through CPU instead — slow but correct, matching the issue's documented examples. Optimisation deferred. - P2: mixed float32 × float64 was accepted on CPU but rejected on CUDA; align both backends by routing CUDA casts through CPU as well. - P2: matmul tests are no-ops when CTORCH_BLAS=OFF (CPU backend is a stub in that mode). Gate the CPU-using matmul tests behind if(CTORCH_BLAS) so the no-BLAS build keeps the suite green. Adds matmul_cuda_test exercising the new fallbacks on real GPU plus asymmetric M/N/K to lock in the cuBLAS column-major C^T = B^T A^T trick. Skipped at runtime on hosts without CUDA.

The matmul P1 fallback `t.to(cpu).contiguous().to(dev)` couldn't run on strided CUDA inputs because `Tensor::to(cpu)` started by calling `contiguous()`, which throws on non-CPU. Extend `Tensor::to`: - Already-contiguous source → fast path (unchanged). - Non-CPU non-contiguous source → bulk-copy the entire underlying storage device-to-host while preserving shape / stride / offset. The destination is non-contiguous on its target device; callers can `.contiguous()` from there (works on CPU since the strided-to- contiguous odometer is CPU-side). - CPU non-contig source → unchanged (materialise on CPU first). This unblocks `matmul(a, a.T())` on CUDA — the test that was failing on dell now has a working round-trip path.

Codex review on PR #18 flagged the BLAS detection as platform-fragile: - P1: macOS unconditionally picked Accelerate's header and defined ACCELERATE_NEW_LAPACK, even when the user passed -DBLA_VENDOR=OpenBLAS / MKL. That compiled against Accelerate's CBLAS prototypes while linking a different ABI. Now check that the resolved BLAS_LIBRARIES actually contain Accelerate AND the vendor isn't explicitly set to a non-Apple choice before opting into the framework header. - P2: check_include_file_cxx("cblas.h") only searched the compiler's default include dirs, so brew/apt/Cellar OpenBLAS installs at non-system prefixes failed configuration. Replace with find_path() carrying explicit HINTS for the common prefixes; surface the resolved path through CTORCH_CBLAS_INCLUDE_DIR so ctorch_core can add it as a private include directory.

…tmul Codex review on the issue-#11 branch flagged two more P2s: - The float-only contract was checked against the *promoted* dtype, which let `int32 × float32` and `bool × float64` slip through after an implicit widen. Move the gate to each operand individually before promotion so any non-float input throws DTypeError as documented. - The CUDA round-trip fallback paths (`Tensor::to(Device::cpu())` and `.to(dev)`) ran without `cuda::DeviceGuard`, so on multi-GPU hosts the temporary download/upload could land on the wrong context when the calling thread's current device differed from `t.device()`. Pin the device for the duration of every fallback path. Adds a regression test for mixed int/bool × float on CPU.

Issue #11 added BLAS-backed matmul; the ctorch_blas.cmake module now calls find_package(BLAS REQUIRED) which fails on the GH Actions Linux images because no BLAS is preinstalled. macOS uses Accelerate so it works out of the box. Add libopenblas-dev to every Linux apt-get line (lint, build-test, asan, valgrind, build-cuda) so CTORCH_BLAS=ON (the default) configures cleanly across the matrix.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a321380f58

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Two issues stacked from review + CI: - Codex P1 on PR #18: `Tensor::to(cuda->cpu)` for a strided view used to allocate `nbytes()` worth of CPU storage and copy the entire underlying allocation. For a small slice into a large CUDA tensor that turned a few-KB transfer into a multi-GB blowup. Compute the byte range actually addressed by the view and copy only that; rebase the destination's offset to 0 so subsequent `.contiguous()` works. - CI clang-16 build-test: OpenBLAS's `cblas.h` transitively includes `openblas_config.h`, which uses `float _Complex` typedefs. Under `-Werror -Wpedantic -Wc99-extensions` clang rejects this even though we never reference the typedefs. Wrap the cblas include in a clang-specific diagnostic pragma to silence -Wc99-extensions for exactly that header. Adds a regression test (matmul_cuda_test) that allocates a 1024x1024 CUDA tensor, slices a 2x2 view, copies to CPU, and asserts the CPU storage size is < 1% of the source — fails clearly if the bounded- range fix regresses.

Hayden727 · 2026-04-30T17:22:42Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e72f445615

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Codex review on PR #18 caught a P1 in `broadcast_batch`: using `max(da, db)` when one operand is `1` collapses `(0, 1)` to `1` instead of `0`, scheduling GEMMs against empty storage and producing a non-empty output for inputs that should broadcast to zero rows. Pick the non-`1` operand explicitly (matches `ops/broadcast.cpp`), so `(0, 1)` and `(1, 0)` both stay `0`. Adds a regression test: `matmul((0,2,3), (1,3,4))` now produces an empty `(0, 2, 4)` tensor instead of corrupted output.

Hayden727 · 2026-05-02T16:08:43Z

@codex review.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb174fe877

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Codex P2 on PR #18: backends narrow `MatmulPlan`'s int64 M/N/K to `int` for cblas/cuBLAS. A dim above INT_MAX would silently wrap to a negative value at the call site. Surface it as ShapeError up front in plan_matmul so both backends inherit the guard without duplication. Adds a regression test that constructs a tensor with shape `(INT_MAX+1, 0)` (storage allocates zero bytes thanks to the 0 companion dim) and expects ShapeError instead of corrupted GEMM.

Hayden727 · 2026-05-03T02:23:47Z

@codex review.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ecaa71214

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Codex P1 on PR #18: when K==0 (empty contraction axis), the front-door still dispatched to the BLAS backend, which then issued cblas/cuBLAS calls with K=0 and lda=K=0 — illegal-parameter in both libraries. Mathematically the contraction sum over an empty axis is 0, so the result is an M×N tensor of zeros — exactly what `Tensor`'s zero-init already produces. Add `plan.K == 0` to the early-return guard so the GEMM call is skipped. Adds a regression test for `(2, 0) @ (0, 3)` → all-zero (2, 3).

Hayden727 · 2026-05-03T05:33:02Z

@codex review.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7cebdf17e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Codex P2 on PR #18: `transpose`'s `dim + rank` overflows signed `int` when `dim == INT_MIN` (UB). The shared `normalise_dim` helper used by slice / select / narrow had the same pattern. Range-check the unnormalised value against the valid window `[-rank, rank)` before adding, mirroring the INT64_MIN fixes already landed in the same file. Adds INT_MIN regression tests for transpose, slice, select, narrow.

Hayden727 · 2026-05-07T09:17:43Z

@codex review.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a0e005d050

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-07T09:24:30Z

+    if (plan.M == 0 || plan.N == 0 || plan.K == 0 || plan.a_offsets.empty()) {
+        // Empty result OR empty contraction: skip the GEMM call.
+        //  - M==0 / N==0 / batch==0: the output tensor is empty, so
+        //    nothing to write.
+        //  - K==0: the contraction sum is empty so C must be 0;
+        //    Tensor's zero-init already gives that. Dispatching anyway
+        //    would feed BLAS `K=0` plus `lda=K=0` (an illegal-parameter
+        //    error in cblas / cuBLAS).
+        return out;


Preserve BLAS-off error path for degenerate CPU matmul

Keep the early-exit path from bypassing backend dispatch on CPU builds without BLAS. In -DCTORCH_BLAS=OFF builds, matmul_cpu is the component that raises the documented runtime error, but this pre-dispatch return lets calls like (M,0) @ (0,N) or zero-batch inputs succeed silently instead of surfacing that matmul is unavailable on CPU. That makes behavior shape-dependent in a configuration that is supposed to be a stub and can hide unsupported deployment setups.

Useful? React with 👍 / 👎.

Hayden727 added 7 commits April 30, 2026 00:38

chore: ignore .claude per-developer agent state

a20395b

chatgpt-codex-connector Bot reviewed Apr 30, 2026

View reviewed changes

Comment thread src/tensor.cpp Outdated

chatgpt-codex-connector Bot reviewed Apr 30, 2026

View reviewed changes

Comment thread src/ops/matmul_shape.cpp Outdated

chatgpt-codex-connector Bot reviewed May 2, 2026

View reviewed changes

Comment thread src/ops/matmul_cpu.cpp

chatgpt-codex-connector Bot reviewed May 3, 2026

View reviewed changes

Comment thread src/ops/matmul.cpp Outdated

chatgpt-codex-connector Bot reviewed May 3, 2026

View reviewed changes

Comment thread src/tensor.cpp Outdated

chatgpt-codex-connector Bot reviewed May 7, 2026

View reviewed changes

Conversation

Hayden727 commented Apr 30, 2026

Summary

Test plan

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Hayden727 commented Apr 30, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Hayden727 commented May 2, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Hayden727 commented May 3, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Hayden727 commented May 3, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Hayden727 commented May 7, 2026

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 7, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant