linalg/mmm: cache-adaptive 2D-blocking for the single-thread tile walk #2274

Draft
czoli1976 wants to merge 2 commits into sonos:main from czoli1976:feature/mmm-st-cache-blocking

@czoli1976
Contributor

Summary

The single-thread MMM tile walk uses a naive for n_panel { for m_panel } loop, so at large k it re-streams the entire inner operand (all of A in col-outer, all of B in row-outer) once per outer panel — memory/L1-bound. The multithread path already avoids this by 2D-blocking the panel grid (chunk_grid, 16-panel chunks). This PR brings the same blocking to the single-thread path, with the block size cache-derived so it stays L2-resident across hardware.

The change

  • run_single_thread_blocked: walk the m×n panel grid in BLK×BLK blocks (col/row-outer preserved as the within-block inner order).
  • st_block_edge: BLK sized so the block's A+B sub-panels (~BLK·(mr+nr)·k·elem) fit a working-set budget = detected L2 / 3 (sysctl hw.perflevel0.l2cachesize on macOS, sysfs on Linux), clamped to [1, 16], with a conservative 256 KiB fallback when L2 can't be read.
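
The sizing rule above can be sketched as follows. This is a minimal illustration, not the PR's code: the function name `block_edge`, its signature, and the reading of the 256 KiB fallback as a stand-in for the undetected L2 size are all assumptions.

```rust
/// Hypothetical stand-in for the PR's `st_block_edge` (signature assumed).
/// `l2_bytes` is the detected L2 size, or `None` when it could not be read.
fn block_edge(
    l2_bytes: Option<usize>,
    mr: usize,
    nr: usize,
    k: usize,
    elem_bytes: usize,
) -> usize {
    // Assumption: the 256 KiB fallback replaces the missing L2 size.
    const FALLBACK_L2_BYTES: usize = 256 * 1024;
    // Upper bound, matching the 16-panel cap described in the PR.
    const MAX_EDGE: usize = 16;

    // Working-set budget: one third of L2.
    let budget = l2_bytes.unwrap_or(FALLBACK_L2_BYTES) / 3;

    // Each unit of block edge adds one A panel (mr x k) and one B panel
    // (k x nr) to the working set, i.e. about (mr + nr) * k * elem bytes.
    let per_edge = (mr + nr)
        .saturating_mul(k)
        .saturating_mul(elem_bytes)
        .max(1); // guard against division by zero for degenerate shapes

    // Floor of 1 degrades to the naive loop; cap of 16 bounds the block.
    (budget / per_edge).clamp(1, MAX_EDGE)
}
```

With a hypothetical 4 MiB L2 and 16×16 f32 tiles, `block_edge(Some(4 << 20), 16, 16, 1024, 4)` gives 10 and `block_edge(Some(4 << 20), 16, 16, 4096, 4)` gives 2, so the block shrinks as k grows. When detection fails, `block_edge(None, 16, 16, 1024, 4)` gives 1, which is the naive loop.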

Results (single-thread, m=512, n=2048, f32; GFLOP/s, M1 Pro)

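| k    | before | after        | vs Accelerate cblas_sgemm |
|------|--------|--------------|---------------------------|
| 1024 | 1038   | 1245 (+20%)  | 2.0×                      |
| 2048 | 835    | 1216 (+45%)  | 1.67×                     |
| 4096 | 831    | 1084 (+30%)  | 1.4×                      |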

Largest where the old loop thrashed cache; k≤512 within noise. Frame-level, so all kernels (NEON/AMX/SME/…) benefit.

Correctness / safety

  • Bit-identical, not just approximate: only reorders independent tiles — each computes its full-k reduction into a disjoint C region, so order changes no result (the multithread path already iterates in a different, chunked order).
  • Floor = the old loop: BLK clamps to 1 ⇒ exactly the naive nested loop, so a small/unknown cache can never over-block.
  • Bounded by MT parity: BLK ≤ 16 = the chunk_grid blocking already shipped on x86/WASM/ARM via the multithread path.
  • Tests: 5 new large-shape (>16-panel) frame tests in packed_packed.rs exercising the blocked path (16²- and 17²-panel boundaries) against the naive reference — the existing frame proptests only reach 3 panels, below the block threshold. Full tract-linalg suite green.
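
The ordering argument in the bullets above can be checked on tile coordinates alone. The sketch below is an assumption-laden illustration for the col-outer case only: `naive_order` and `blocked_order` are hypothetical helpers that record the visit order, not functions from tract.

```rust
/// Visit order of the naive col-outer walk: for each n panel, every m panel.
fn naive_order(m_panels: usize, n_panels: usize) -> Vec<(usize, usize)> {
    let mut order = Vec::with_capacity(m_panels * n_panels);
    for n in 0..n_panels {
        for m in 0..m_panels {
            order.push((m, n));
        }
    }
    order
}

/// Visit order of a blocked walk over blk x blk blocks of the panel grid,
/// keeping col-outer order both across blocks and inside each block.
fn blocked_order(m_panels: usize, n_panels: usize, blk: usize) -> Vec<(usize, usize)> {
    let blk = blk.max(1); // step_by(0) would panic
    let mut order = Vec::with_capacity(m_panels * n_panels);
    for n0 in (0..n_panels).step_by(blk) {
        for m0 in (0..m_panels).step_by(blk) {
            // Edge blocks are truncated at the grid boundary.
            for n in n0..(n0 + blk).min(n_panels) {
                for m in m0..(m0 + blk).min(m_panels) {
                    order.push((m, n));
                }
            }
        }
    }
    order
}
```

In this sketch, a block edge of 1 reproduces the naive order exactly, as does any grid whose inner (m) panel count fits in one block. A 17×17 grid with a block edge of 16 visits the same set of tiles, each exactly once, in a different order.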

Scope / non-goals

Touches only the single-thread arms of run_with_scratch_space_{col,row}_outer; the multithread path is unchanged. No kc-loop (k-blocking) yet — see the follow-up comment. No new dependencies.

🤖 Generated with Claude Code

The single-thread MMM tile walk used a naive nested loop, re-streaming the
full inner operand (all of A in col-outer / B in row-outer) per panel at
large k, which is memory/L1-bound. The multithread path already 2D-blocks the
panel grid (chunk_grid); this brings the same blocking to the single-thread
path, with the block edge cache-derived (detected L2/3, conservative 256 KiB
fallback) so it stays L2-resident across hardware and never over-blocks a
cache it cannot see.

Bit-identical: it only reorders independent tiles (each computes its full-k
reduction into a disjoint C region). The block-edge floor of 1 degrades
exactly to the naive loop; the cap of 16 matches the multithread chunk_grid
blocking already shipped on all platforms. Frame-level, so all kernels
benefit. +20-45% at large k on Apple Silicon (single-thread); small / GEMV /
multithreaded shapes are unchanged.

Adds 5 large-shape (>16-panel) frame tests exercising the blocked path against
the naive reference (the existing frame proptests only reach 3 panels).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@czoli1976
Contributor Author

Context, caveats & follow-ups (for review)

Where this came from. Profiling tract's MMM against Apple's Accelerate (cblas_sgemm) across a k-sweep showed that tract was competitive or better at k≤2048 but fell behind at large k, and only on the single-thread path, which, unlike the multithread path, had no cache blocking at all. This PR closes that gap.

Activation scope (why it's low-risk). The blocked walk visits tiles in exactly the same order as the old loop unless the inner panel dimension exceeds the block edge, i.e. only large 2D single-thread matmuls are reordered (and even then the results stay bit-identical). GEMV (n==1), small matmuls, and all multithreaded execution follow the same path as before. So embedded/audio/streaming workloads (GEMV-heavy, small shapes) see no change; the win is for large single-thread GEMM.

Validation honesty. Perf is verified on Apple Silicon only — M1 Pro is reliable (consistent re-runs). M4 is P/E-core-scheduling confounded over SSH (the same binary measured ~1772 on the P-cluster vs ~838 on the E-cluster at k=2048; macOS has no user-facing P-core pin — taskpolicy -c only clamps down). Best-of-3 same-session on M4 P-cores shows tract beating Accelerate ~1.2× at k=2048–4096 (a smaller margin than M1 because M4's larger L2 keeps Accelerate strong). It is safe on x86 / WASM / A-class ARM by construction (cache-derived budget, floor = naive loop, ≤ MT-parity blocking) but not yet perf-measured there — a not-a-regression check on those targets before merge would be ideal, and I'm happy to provide it given hardware/CI access. (Marked draft for this reason.)

Known remaining tail — k≥8192 (the real follow-up). At very large k we still lose to Accelerate (~0.83× on M4 at k=8192). Root cause: the microkernel reduces the full k in one pass, so a single tile's A/B panels (mr·k, k·nr ≈ 1 MB each at k=8192) overflow L1 — the inner k-loop is L1-bandwidth-bound, which 2D panel blocking can't fix. The proper fix is a kc loop (k-blocking) — split the reduction into L1-resident k-chunks and accumulate C across them (the classic GotoBLAS/BLIS structure; what Accelerate does). tract's frame has no k-blocking today; adding it touches the microkernel partial-k accumulation path + packed-panel k-slicing, so it's a larger, separate change. Happy to follow up if of interest. (k=8192 single-thread is also an edge case — large-LLM dims normally run multithreaded.)
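
For readers unfamiliar with the kc loop mentioned above, here is a minimal scalar sketch of the idea: split the k reduction into chunks and accumulate into C across them. It is not tract's frame or microkernel; `matmul_k_blocked`, the row-major layout, and the plain triple loop are assumptions chosen for brevity (a real implementation would also slice the packed panels along k).

```rust
/// C += A * B with the reduction split into chunks of at most `kc`.
/// Row-major storage: A is m x k, B is k x n, C is m x n.
fn matmul_k_blocked(
    a: &[f32],
    b: &[f32],
    c: &mut [f32],
    m: usize,
    n: usize,
    k: usize,
    kc: usize,
) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    let kc = kc.max(1); // step_by(0) would panic

    // Outer loop over k-chunks: only a kc-wide slice of A and B is live
    // per pass, which is what would keep the operands L1-resident.
    for k0 in (0..k).step_by(kc) {
        let k1 = (k0 + kc).min(k);
        for i in 0..m {
            for j in 0..n {
                // Partial reduction over this chunk only.
                let mut acc = 0.0f32;
                for p in k0..k1 {
                    acc += a[i * k + p] * b[p * n + j];
                }
                // Accumulate the partial sum into C across chunks.
                c[i * n + j] += acc;
            }
        }
    }
}
```

Unlike the tile reordering in this PR, this sketch changes the association of the floating-point sums (partial sums per chunk are added to C), so for general data its results may differ in the last bits from a single full-k pass. Passing `kc == k` degenerates to the single-pass reduction.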

Other follow-ups (not in this PR):

  • The multithread chunk_grid uses a fixed 16-panel chunk; it likely benefits from the same k-adaptive sizing for the production threaded large-k path.
  • The block budget (L2/3) is a reasonable default but is a natural candidate for a per-device knob — part of a broader on-device auto-tuning direction I've been prototyping (separate RFC to follow).

@czoli1976 czoli1976 marked this pull request as ready for review May 24, 2026 19:45
The cfg(linux) sysfs read in `detect_l2_bytes` was not rustfmt-conformant
(it wasn't run through rustfmt on the macOS dev machine), so `cargo fmt
--check` failed in CI. Pure formatting; no behavior change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@czoli1976
Contributor Author

czoli1976 commented May 25, 2026

@kali this needs verification on Cortex-A5x, thanks!

@kali
Collaborator

kali commented May 26, 2026

Nice, this is an option I had considered in the initial design, since it is central in BLIS which was more or less my model. Let's see what it does.

@kali
Collaborator

kali commented May 26, 2026

On A53 and A55, results are within the noise bracket on audio loads and image classifiers.

@czoli1976
Contributor Author

good

@czoli1976
Contributor Author

Besides Apple Silicon, would you have access to more recent ARM cores like Cortex-A76/A78/A715/A720?
Since those are out-of-order and have larger caches, this should pay off.

@kali
Collaborator

kali commented May 26, 2026

I have a plan to generalize benching to Graviton instances. I also think I can run on an A78AE, but it will take a while: the bench runner was down, so it will have to catch up on a lot of history.

@czoli1976
Contributor Author

can't wait to see how it goes with Graviton

@kali
Collaborator

kali commented May 27, 2026

Switching this one back to draft, waiting for the A78AE bencher to be operational again.

@kali kali marked this pull request as draft May 27, 2026 08:32