linalg/mmm: cache-adaptive 2D-blocking for the single-thread tile walk by czoli1976 · Pull Request #2274 · sonos/tract

czoli1976 · 2026-05-23T14:42:36Z

Summary

The single-thread MMM tile walk uses a naive for n_panel { for m_panel } loop, so at large k it re-streams the entire inner operand (all of A in col-outer, all of B in row-outer) once per outer panel — memory/L1-bound. The multithread path already avoids this by 2D-blocking the panel grid (chunk_grid, 16-panel chunks). This PR brings the same blocking to the single-thread path, with the block size cache-derived so it stays L2-resident across hardware.

The change

run_single_thread_blocked: walk the m×n panel grid in BLK×BLK blocks (col/row-outer preserved as the within-block inner order).
st_block_edge: BLK sized so the block's A+B sub-panels (~BLK·(mr+nr)·k·elem) fit a working-set budget = detected L2 / 3 (sysctl hw.perflevel0.l2cachesize on macOS, sysfs on Linux), clamped to [1, 16], with a conservative 256 KiB fallback when L2 can't be read.
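The sizing rule above can be sketched as follows. This is a minimal illustration, not the PR's code: `block_edge` and `parse_sysfs_size` are hypothetical names, the `mr`/`nr` values in the usage note are made up, and two details are assumptions: that the 256 KiB fallback stands in for the L2 size (rather than for the budget), and that the sysfs cache size arrives as a suffixed string such as `512K`.

```rust
/// Hypothetical sketch of the block-edge rule: pick the largest BLK such that
/// BLK * (mr + nr) * k * elem fits in L2 / 3, then clamp to [1, 16].
fn block_edge(l2_bytes: Option<usize>, mr: usize, nr: usize, k: usize, elem: usize) -> usize {
    // Assumption: the conservative fallback replaces the L2 size itself.
    const FALLBACK_L2: usize = 256 * 1024;
    let budget = l2_bytes.unwrap_or(FALLBACK_L2) / 3;
    // A BLK x BLK block touches BLK A-panels (mr * k) and BLK B-panels (k * nr).
    let per_edge = ((mr + nr) * k * elem).max(1);
    (budget / per_edge).clamp(1, 16)
}

/// Hypothetical parser for a sysfs-style cache size string such as "512K".
/// Returns None on anything it does not understand, so the caller can fall back.
fn parse_sysfs_size(s: &str) -> Option<usize> {
    let s = s.trim();
    let (digits, mult) = match s.chars().last()? {
        'K' => (&s[..s.len() - 1], 1024),
        'M' => (&s[..s.len() - 1], 1024 * 1024),
        _ => (s, 1),
    };
    digits.parse::<usize>().ok().and_then(|v| v.checked_mul(mult))
}
```

With a 4 MiB L2, `mr = nr = 8` and f32 elements, this gives a block edge of 16 at k=1024 and 5 at k=4096; with no readable L2 it collapses to 1 at k=4096, which is the naive loop.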

Results (single-thread, m=512, n=2048, f32; GFLOP/s, M1 Pro)

| k | before | after | vs Accelerate `cblas_sgemm` |
| --- | --- | --- | --- |
| 1024 | 1038 | 1245 (+20%) | 2.0× |
| 2048 | 835 | 1216 (+45%) | 1.67× |
| 4096 | 831 | 1084 (+30%) | 1.4× |

The gain is largest where the old loop thrashed the cache; k≤512 is within noise. The change is at frame level, so all kernels (NEON/AMX/SME/…) benefit.

Correctness / safety

Bit-identical, not just approximate: only reorders independent tiles — each computes its full-k reduction into a disjoint C region, so order changes no result (the multithread path already iterates in a different, chunked order).
Floor = the old loop: BLK clamps to 1 ⇒ exactly the naive nested loop, so a small/unknown cache can never over-block.
Bounded by MT parity: BLK ≤ 16 = the chunk_grid blocking already shipped on x86/WASM/ARM via the multithread path.
Tests: 5 new large-shape (>16-panel) frame tests in packed_packed.rs exercising the blocked path (16²- and 17²-panel boundaries) against the naive reference — the existing frame proptests only reach 3 panels, below the block threshold. Full tract-linalg suite green.
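The reordering argument can be illustrated on the panel grid alone. The sketch below is not the PR's `run_single_thread_blocked`; `naive_walk` and `blocked_walk` are hypothetical helpers that only record the (m, n) visit order for the col-outer case, and the order in which the blocks themselves are visited is an assumption.

```rust
/// Naive col-outer walk: for each n panel, visit every m panel.
fn naive_walk(m_panels: usize, n_panels: usize) -> Vec<(usize, usize)> {
    let mut order = Vec::with_capacity(m_panels * n_panels);
    for n in 0..n_panels {
        for m in 0..m_panels {
            order.push((m, n));
        }
    }
    order
}

/// Blocked walk: the same tiles, visited one blk x blk block at a time,
/// keeping the col-outer order inside each block.
fn blocked_walk(m_panels: usize, n_panels: usize, blk: usize) -> Vec<(usize, usize)> {
    let blk = blk.max(1); // step_by(0) would panic
    let mut order = Vec::with_capacity(m_panels * n_panels);
    for n0 in (0..n_panels).step_by(blk) {
        for m0 in (0..m_panels).step_by(blk) {
            for n in n0..(n0 + blk).min(n_panels) {
                for m in m0..(m0 + blk).min(m_panels) {
                    order.push((m, n));
                }
            }
        }
    }
    order
}
```

With `blk = 1`, or whenever the inner dimension fits within one block, `blocked_walk` returns exactly the naive order. On a 17×17 grid with `blk = 16` the order differs, but every tile is still visited exactly once, which is why the output cannot change when each tile writes a disjoint C region.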

Scope / non-goals

Touches only the single-thread arms of run_with_scratch_space_{col,row}_outer; the multithread path is unchanged. No kc-loop (k-blocking) yet — see the follow-up comment. No new dependencies.

🤖 Generated with Claude Code

The single-thread MMM tile walk used a naive nested loop, re-streaming the full inner operand (all of A in col-outer / B in row-outer) per panel at large k, which is memory/L1-bound. The multithread path already 2D-blocks the panel grid (chunk_grid); this brings the same blocking to the single-thread path, with the block edge cache-derived (detected L2/3, conservative 256 KiB fallback) so it stays L2-resident across hardware and never over-blocks a cache it cannot see. Bit-identical: it only reorders independent tiles (each computes its full-k reduction into a disjoint C region). The block-edge floor of 1 degrades exactly to the naive loop; the cap of 16 matches the multithread chunk_grid blocking already shipped on all platforms. Frame-level, so all kernels benefit. +20-45% at large k on Apple Silicon (single-thread); small / GEMV / multithreaded shapes are unchanged. Adds 5 large-shape (>16-panel) frame tests exercising the blocked path against the naive reference (the existing frame proptests only reach 3 panels). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

czoli1976 · 2026-05-23T14:43:01Z

Context, caveats & follow-ups (for review)

Where this came from. Profiling tract's MMM against Apple's Accelerate (cblas_sgemm) across a k-sweep showed tract competitive-to-better at k≤2048 but falling behind at large k, and only on the single-thread path, which, unlike the multithread path, had no cache blocking at all. This PR closes that gap.

Activation scope (why it's low-risk). The blocked walk visits tiles in exactly the old order unless the inner panel dimension exceeds the block edge, i.e. only large 2D single-thread matmuls are reordered. GEMV (n==1), small matmuls, and all multithreaded execution are byte-for-byte unchanged. So embedded/audio/streaming workloads (GEMV-heavy, small shapes) see no change; the win is for large single-thread GEMM.

Validation honesty. Perf is verified on Apple Silicon only — M1 Pro is reliable (consistent re-runs). M4 is P/E-core-scheduling confounded over SSH (the same binary measured ~1772 on the P-cluster vs ~838 on the E-cluster at k=2048; macOS has no user-facing P-core pin — taskpolicy -c only clamps down). Best-of-3 same-session on M4 P-cores shows tract beating Accelerate ~1.2× at k=2048–4096 (a smaller margin than M1 because M4's larger L2 keeps Accelerate strong). It is safe on x86 / WASM / A-class ARM by construction (cache-derived budget, floor = naive loop, ≤ MT-parity blocking) but not yet perf-measured there — a not-a-regression check on those targets before merge would be ideal, and I'm happy to provide it given hardware/CI access. (Marked draft for this reason.)

Known remaining tail — k≥8192 (the real follow-up). At very large k we still lose to Accelerate (~0.83× on M4 at k=8192). Root cause: the microkernel reduces the full k in one pass, so a single tile's A/B panels (mr·k, k·nr ≈ 1 MB each at k=8192) overflow L1 — the inner k-loop is L1-bandwidth-bound, which 2D panel blocking can't fix. The proper fix is a kc loop (k-blocking) — split the reduction into L1-resident k-chunks and accumulate C across them (the classic GotoBLAS/BLIS structure; what Accelerate does). tract's frame has no k-blocking today; adding it touches the microkernel partial-k accumulation path + packed-panel k-slicing, so it's a larger, separate change. Happy to follow up if of interest. (k=8192 single-thread is also an edge case — large-LLM dims normally run multithreaded.)
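For reference, the kc-loop idea can be sketched on a plain row-major matmul. This is only an illustration of splitting the reduction into k-chunks and accumulating C across them; it is not tract's frame or microkernel API, and `matmul_kc` with its parameters is a hypothetical name.

```rust
/// C (m x n) += A (m x k) * B (k x n), all row-major,
/// with the k reduction split into chunks of `kc`.
fn matmul_kc(a: &[f32], b: &[f32], c: &mut [f32], m: usize, n: usize, k: usize, kc: usize) {
    let kc = kc.max(1); // step_by(0) would panic
    for k0 in (0..k).step_by(kc) {
        let k1 = (k0 + kc).min(k);
        // Only the k0..k1 slices of A and B are touched in this pass,
        // so kc bounds the working set of the inner loop.
        for i in 0..m {
            for j in 0..n {
                let mut acc = c[i * n + j]; // resume from the partial sum
                for p in k0..k1 {
                    acc += a[i * k + p] * b[p * n + j];
                }
                c[i * n + j] = acc;
            }
        }
    }
}
```

In this scalar sketch each C element still accumulates its products in the same p order whatever `kc` is, so calling it with `kc = k` and with a small `kc` on a zeroed C gives the same result. A real kernel would additionally need the partial-k accumulation path and packed-panel k-slicing mentioned above.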

Other follow-ups (not in this PR):

The multithread chunk_grid uses a fixed 16-panel chunk; it likely benefits from the same k-adaptive sizing for the production threaded large-k path.
The block budget (L2/3) is a reasonable default but is a natural candidate for a per-device knob — part of a broader on-device auto-tuning direction I've been prototyping (separate RFC to follow).

The cfg(linux) sysfs read in `detect_l2_bytes` was not rustfmt-conformant (it wasn't run through rustfmt on the macOS dev machine), so `cargo fmt --check` failed in CI. Pure formatting; no behavior change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

czoli1976 · 2026-05-25T07:26:22Z

@kali this needs verification on Cortex-A5x, thanks!

kali · 2026-05-26T06:18:09Z

Nice, this is an option I had considered in the initial design, since it is central in BLIS which was more or less my model. Let's see what it does.

kali · 2026-05-26T07:30:56Z

On A53 and A55, results are within the noise bracket on audio loads and image classifiers.

czoli1976 · 2026-05-26T07:52:33Z

good

czoli1976 · 2026-05-26T08:08:43Z

Besides Apple Silicon, would you have access to more recent ARM cores like Cortex-A76/A78/A715/A720?
As those are out-of-order and have bigger caches, this should pay off there.

kali · 2026-05-26T08:19:51Z

I have a plan to generalize benching to Graviton instances. I also think I can run on an A78AE, but it will take a while: the bench runner was down, so it will have to catch up on a lot of history.

czoli1976 · 2026-05-26T08:23:59Z

can't wait to see how it goes with Graviton

kali · 2026-05-27T08:32:32Z

Switching this one back to draft, waiting for the A78AE bencher to be operational again.

czoli1976 marked this pull request as ready for review May 24, 2026 19:45

czoli1976 mentioned this pull request May 27, 2026

onnx,core: fuse standard ONNX LSTM cell into one LstmEpilogue op #2294

Open

kali marked this pull request as draft May 27, 2026 08:32

czoli1976 wants to merge 2 commits into sonos:main from czoli1976:feature/mmm-st-cache-blocking
