Skip to content

feat(linalg): matmul + transpose backed by OpenBLAS / Accelerate / cuBLAS#18

Open
Hayden727 wants to merge 12 commits into
masterfrom
feat/linalg-matmul-issue-11
Open

feat(linalg): matmul + transpose backed by OpenBLAS / Accelerate / cuBLAS#18
Hayden727 wants to merge 12 commits into
masterfrom
feat/linalg-matmul-issue-11

Conversation

@Hayden727

Copy link
Copy Markdown
Owner

Summary

Implements issue #11 — BLAS-backed matrix multiplication and zero-copy transpose.

  • Tensor::T() and ctorch::transpose(x, i, j) — zero-copy view ops; T() is the 2-D shorthand and rejects non-2-D inputs with ShapeError.
  • ctorch::matmul — full PyTorch shape rules (1-D × 1-D dot, 1-D × 2-D, 2-D × 1-D, 2-D × 2-D, batched N-D with right-aligned broadcast). f32/f64 only; integer / bool raise DTypeError; cross-device raises DeviceError; inner-dim mismatch raises ShapeError with both shapes printed.
  • BLAS layer — linalg/blas.h thin wrapper, blas_cpu.cpp dispatches into cblas_sgemm/dgemm via find_package(BLAS) (Apple Accelerate, OpenBLAS, MKL all work). blas_cuda.cu uses cublasSgemm/Dgemm via a per-thread per-device handle cache, with the standard C^T = B^T A^T trick for cuBLAS column-major.
  • CMake — new CTORCH_BLAS option (default ON) gated through cmake/ctorch_blas.cmake. Honours BLA_VENDOR, locates cblas.h in non-system prefixes (Homebrew, Apt). CUDA build links CUDA::cublas. CTORCH_BLAS=OFF is a supported smoke build (matmul becomes a stub; CPU matmul tests gated off).
  • 273 mac tests pass; full dell GPU run (CUDA 12.0, sm_75/80/86) is clean for matmul (the only failure is a pre-existing SumBench.CudaWithin15PercentOfCub perf flake from issue [Feature] Reductions with axis & keepdim (CPU + CUDA) #9, unrelated).

Test plan

  • Mac local CPU build (Apple Accelerate): ctest -j4 → 273/273 pass.
  • Dell remote CUDA build (CTORCH_CUDA=ON, CTORCH_BLAS=OFF): ctest -j → 266/267 pass; only failure is the unrelated perf flake.
  • New matmul_cuda_test (4 cases): TwoDByTwoDMatchesHandComputed, AsymmetricMNKLockstheColumnMajorTrick (asymmetric M=5/N=11/K=7 to lock in the cuBLAS row→col reinterpretation), MatmulOfTransposeUsesCpuRoundTripFallback (non-contig CUDA), MixedFloatPromotesToFloat64 (mixed dtype CUDA) — all pass.
  • All 5 matmul shape rules + 8 parity fixtures via np.matmul.

Closes #11

…BLAS

Implements issue #11.

- Tensor::T() and ctorch::transpose(x, i, j): zero-copy view ops; T() is
  a 2-D shorthand that rejects non-2-D inputs with ShapeError.
- ctorch::matmul: full PyTorch shape rules (1D×1D, 1D×2D, 2D×1D, 2D×2D,
  batched N-D with right-aligned broadcast). f32/f64 only; integer/bool
  raise DTypeError; cross-device raises DeviceError; inner-dim mismatch
  raises ShapeError with both shapes printed.
- BLAS layer: linalg/blas.h thin wrapper, blas_cpu.cpp dispatches into
  cblas_sgemm/dgemm via find_package(BLAS) (Apple Accelerate / OpenBLAS
  / MKL all work). blas_cuda.cu dispatches into cublasSgemm/Dgemm via a
  per-thread per-device handle cache, using the C^T = B^T A^T trick to
  feed cuBLAS column-major.
- CMake: new CTORCH_BLAS option (default ON) gated through
  cmake/ctorch_blas.cmake; sets ACCELERATE_NEW_LAPACK on macOS to avoid
  the deprecated cblas_sgemm warning. CUDA build links CUDA::cublas.
- Tests: 22 functional tests (transpose semantics, all 5 matmul shape
  cases, dtype rules, error paths) plus 8 parity cases vs np.matmul.
- docs/ops.md gained a Linear algebra section; README roadmap marks
  the milestone shipped.

Closes #11
…e + BLAS-OFF tests

Addresses three issues raised by Codex on the issue-#11 PR:

- P1: matmul on non-contiguous CUDA operands (e.g. matmul(a, a.T()))
  used to throw because Tensor::contiguous() also throws on non-CPU
  tensors. Round-trip through CPU instead — slow but correct, matching
  the issue's documented examples. Optimisation deferred.
- P2: mixed float32 × float64 was accepted on CPU but rejected on CUDA;
  align both backends by routing CUDA casts through CPU as well.
- P2: matmul tests are no-ops when CTORCH_BLAS=OFF (CPU backend is a
  stub in that mode). Gate the CPU-using matmul tests behind
  if(CTORCH_BLAS) so the no-BLAS build keeps the suite green.

Adds matmul_cuda_test exercising the new fallbacks on real GPU plus
asymmetric M/N/K to lock in the cuBLAS column-major C^T = B^T A^T
trick. Skipped at runtime on hosts without CUDA.
The matmul P1 fallback `t.to(cpu).contiguous().to(dev)` couldn't run on
strided CUDA inputs because `Tensor::to(cpu)` started by calling
`contiguous()`, which throws on non-CPU. Extend `Tensor::to`:

- Already-contiguous source → fast path (unchanged).
- Non-CPU non-contiguous source → bulk-copy the entire underlying
  storage device-to-host while preserving shape / stride / offset.
  The destination is non-contiguous on its target device; callers can
  `.contiguous()` from there (works on CPU since the strided-to-
  contiguous odometer is CPU-side).
- CPU non-contig source → unchanged (materialise on CPU first).

This unblocks `matmul(a, a.T())` on CUDA — the test that was failing
on dell now has a working round-trip path.
Codex review on PR #18 flagged the BLAS detection as platform-fragile:

- P1: macOS unconditionally picked Accelerate's header and defined
  ACCELERATE_NEW_LAPACK, even when the user passed -DBLA_VENDOR=OpenBLAS
  / MKL. That compiled against Accelerate's CBLAS prototypes while
  linking a different ABI. Now check that the resolved BLAS_LIBRARIES
  actually contain Accelerate AND the vendor isn't explicitly set to a
  non-Apple choice before opting into the framework header.
- P2: check_include_file_cxx("cblas.h") only searched the compiler's
  default include dirs, so brew/apt/Cellar OpenBLAS installs at
  non-system prefixes failed configuration. Replace with find_path()
  carrying explicit HINTS for the common prefixes; surface the resolved
  path through CTORCH_CBLAS_INCLUDE_DIR so ctorch_core can add it as a
  private include directory.
…tmul

Codex review on the issue-#11 branch flagged two more P2s:

- The float-only contract was checked against the *promoted* dtype,
  which let `int32 × float32` and `bool × float64` slip through after
  an implicit widen. Move the gate to each operand individually before
  promotion so any non-float input throws DTypeError as documented.
- The CUDA round-trip fallback paths (`Tensor::to(Device::cpu())` and
  `.to(dev)`) ran without `cuda::DeviceGuard`, so on multi-GPU hosts
  the temporary download/upload could land on the wrong context when
  the calling thread's current device differed from `t.device()`.
  Pin the device for the duration of every fallback path.

Adds a regression test for mixed int/bool × float on CPU.
Issue #11 added BLAS-backed matmul; the ctorch_blas.cmake module now
calls find_package(BLAS REQUIRED) which fails on the GH Actions Linux
images because no BLAS is preinstalled. macOS uses Accelerate so it
works out of the box.

Add libopenblas-dev to every Linux apt-get line (lint, build-test,
asan, valgrind, build-cuda) so CTORCH_BLAS=ON (the default) configures
cleanly across the matrix.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a321380f58

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/tensor.cpp Outdated
Two issues stacked from review + CI:

- Codex P1 on PR #18: `Tensor::to(cuda->cpu)` for a strided view used
  to allocate `nbytes()` worth of CPU storage and copy the entire
  underlying allocation. For a small slice into a large CUDA tensor
  that turned a few-KB transfer into a multi-GB blowup. Compute the
  byte range actually addressed by the view and copy only that; rebase
  the destination's offset to 0 so subsequent `.contiguous()` works.
- CI clang-16 build-test: OpenBLAS's `cblas.h` transitively includes
  `openblas_config.h`, which uses `float _Complex` typedefs. Under
  `-Werror -Wpedantic -Wc99-extensions` clang rejects this even though
  we never reference the typedefs. Wrap the cblas include in a
  clang-specific diagnostic pragma to silence -Wc99-extensions for
  exactly that header.

Adds a regression test (matmul_cuda_test) that allocates a 1024x1024
CUDA tensor, slices a 2x2 view, copies to CPU, and asserts the CPU
storage size is < 1% of the source — fails clearly if the bounded-
range fix regresses.
@Hayden727

Copy link
Copy Markdown
Owner Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e72f445615

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/ops/matmul_shape.cpp Outdated
Codex review on PR #18 caught a P1 in `broadcast_batch`: using
`max(da, db)` when one operand is `1` collapses `(0, 1)` to `1`
instead of `0`, scheduling GEMMs against empty storage and producing
a non-empty output for inputs that should broadcast to zero rows.

Pick the non-`1` operand explicitly (matches `ops/broadcast.cpp`),
so `(0, 1)` and `(1, 0)` both stay `0`.

Adds a regression test: `matmul((0,2,3), (1,3,4))` now produces an
empty `(0, 2, 4)` tensor instead of corrupted output.
@Hayden727

Copy link
Copy Markdown
Owner Author

@codex review.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: eb174fe877

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/ops/matmul_cpu.cpp
Codex P2 on PR #18: backends narrow `MatmulPlan`'s int64 M/N/K to
`int` for cblas/cuBLAS. A dim above INT_MAX would silently wrap to a
negative value at the call site. Surface it as ShapeError up front in
plan_matmul so both backends inherit the guard without duplication.

Adds a regression test that constructs a tensor with shape
`(INT_MAX+1, 0)` (storage allocates zero bytes thanks to the 0
companion dim) and expects ShapeError instead of corrupted GEMM.
@Hayden727

Copy link
Copy Markdown
Owner Author

@codex review.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2ecaa71214

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/ops/matmul.cpp Outdated
Codex P1 on PR #18: when K==0 (empty contraction axis), the front-door
still dispatched to the BLAS backend, which then issued cblas/cuBLAS
calls with K=0 and lda=K=0 — illegal-parameter in both libraries.

Mathematically the contraction sum over an empty axis is 0, so the
result is an M×N tensor of zeros — exactly what `Tensor`'s zero-init
already produces. Add `plan.K == 0` to the early-return guard so the
GEMM call is skipped.

Adds a regression test for `(2, 0) @ (0, 3)` → all-zero (2, 3).
@Hayden727

Copy link
Copy Markdown
Owner Author

@codex review.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: e7cebdf17e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/tensor.cpp Outdated
Codex P2 on PR #18: `transpose`'s `dim + rank` overflows signed `int`
when `dim == INT_MIN` (UB). The shared `normalise_dim` helper used by
slice / select / narrow had the same pattern.

Range-check the unnormalised value against the valid window
`[-rank, rank)` before adding, mirroring the INT64_MIN fixes already
landed in the same file.

Adds INT_MIN regression tests for transpose, slice, select, narrow.
@Hayden727

Copy link
Copy Markdown
Owner Author

@codex review.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a0e005d050

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/ops/matmul.cpp
Comment on lines +136 to +144
if (plan.M == 0 || plan.N == 0 || plan.K == 0 || plan.a_offsets.empty()) {
// Empty result OR empty contraction: skip the GEMM call.
// - M==0 / N==0 / batch==0: the output tensor is empty, so
// nothing to write.
// - K==0: the contraction sum is empty so C must be 0;
// Tensor's zero-init already gives that. Dispatching anyway
// would feed BLAS `K=0` plus `lda=K=0` (an illegal-parameter
// error in cblas / cuBLAS).
return out;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve BLAS-off error path for degenerate CPU matmul

Keep the early-exit path from bypassing backend dispatch on CPU builds without BLAS. In -DCTORCH_BLAS=OFF builds, matmul_cpu is the component that raises the documented runtime error, but this pre-dispatch return lets calls like (M,0) @ (0,N) or zero-batch inputs succeed silently instead of surfacing that matmul is unavailable on CPU. That makes behavior shape-dependent in a configuration that is supposed to be a stub and can hide unsupported deployment setups.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Linear algebra: matmul + transpose (OpenBLAS / cuBLAS)

1 participant