[PERFORMANCE] Templated kernels for grouped Conv1x1/Conv1D #271

Open: rhaist wants to merge 1 commit into sdatkinson:main from rhaist:perf/templated-grouped-conv-kernels

rhaist commented May 24, 2026

Generalizes the depthwise-only fast path from #217 to all grouped (and small dense) Conv1x1 / Conv1D shapes using compile-time-specialized kernels. Targets the "compile-time optimizations" path hinted at by #215.

Closes #215.

Approach

templated_conv1x1_kernel<OutCh, InCh, Groups> and templated_conv1d_tap_kernel<OutCh, InCh, Groups> carry all shape information as template parameters. Because every loop bound is a constexpr value, the compiler can fully unroll the loops and fold the index arithmetic, and the kernel only iterates over the per-group blocks, so it never visits off-block-diagonal zeros.
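To make the idea concrete, here is a minimal sketch of what a shape-templated grouped 1x1 kernel could look like. It is not the code from this PR: the name conv1x1_kernel_sketch, the compact per-group row-major weight layout, and the channel-contiguous frame layout are all assumptions chosen for illustration.

```cpp
#include <cassert>

// Hypothetical sketch, not the PR's kernel.
// Assumed layouts:
//   weights: Groups blocks, each (OutCh/Groups) x (InCh/Groups), row-major
//   in/out:  channel-contiguous per frame, i.e. in[frame * InCh + channel]
template <int OutCh, int InCh, int Groups>
void conv1x1_kernel_sketch(const float* weights, const float* in, float* out, int num_frames)
{
  static_assert(Groups > 0 && OutCh % Groups == 0 && InCh % Groups == 0,
                "groups must divide both channel counts");
  constexpr int OutPerGroup = OutCh / Groups;
  constexpr int InPerGroup = InCh / Groups;
  assert(weights != nullptr && in != nullptr && out != nullptr && num_frames >= 0);

  for (int f = 0; f < num_frames; ++f)
  {
    const float* x = in + f * InCh;
    float* y = out + f * OutCh;
    // Only the diagonal blocks are visited; the inner bounds are compile-time
    // constants, so the optimizer is free to unroll them.
    for (int g = 0; g < Groups; ++g)
    {
      const float* w = weights + g * OutPerGroup * InPerGroup;
      for (int o = 0; o < OutPerGroup; ++o)
      {
        float acc = 0.0f;
        for (int i = 0; i < InPerGroup; ++i)
          acc += w[o * InPerGroup + i] * x[g * InPerGroup + i];
        y[g * OutPerGroup + o] = acc;
      }
    }
  }
}
```

Instantiating conv1x1_kernel_sketch<8, 8, 2> yields a function whose inner loops run over two 4x4 blocks, which is 32 multiply-adds per frame instead of the 64 a dense 8x8 product would need.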

Dispatch is a function pointer (_kernel / _tap_kernel) set at construction by pick_*_kernel(out, in, groups). Unknown shapes return nullptr and fall through to the existing inline-GEMM / Eigen path, so unregistered shapes behave exactly as before. Both the default Eigen build and the NAM_USE_INLINE_GEMM build benefit (there is no #ifdef gate around dispatch).
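The dispatch pattern described above can be sketched roughly as follows. The names Conv1x1KernelFn, kernel_sketch, and pick_kernel_sketch are hypothetical, the kernel body and data layout are assumptions, and only two shapes are listed to keep the example short.

```cpp
#include <cassert>

// One pointer type that can hold any specialization.
using Conv1x1KernelFn = void (*)(const float* weights, const float* in, float* out, int num_frames);

// Compact stand-in for a compile-time-specialized kernel (assumed layout:
// per-group row-major weight blocks, channel-contiguous frames).
template <int OutCh, int InCh, int Groups>
void kernel_sketch(const float* weights, const float* in, float* out, int num_frames)
{
  constexpr int OutPerGroup = OutCh / Groups;
  constexpr int InPerGroup = InCh / Groups;
  assert(weights != nullptr && in != nullptr && out != nullptr);
  for (int f = 0; f < num_frames; ++f)
    for (int g = 0; g < Groups; ++g)
      for (int o = 0; o < OutPerGroup; ++o)
      {
        float acc = 0.0f;
        for (int i = 0; i < InPerGroup; ++i)
          acc += weights[(g * OutPerGroup + o) * InPerGroup + i] * in[f * InCh + g * InPerGroup + i];
        out[f * OutCh + g * OutPerGroup + o] = acc;
      }
}

// Maps a runtime shape to a specialization, or returns nullptr so the caller
// keeps using its generic path. Depthwise (groups == channels) is deliberately
// absent, mirroring the description above.
Conv1x1KernelFn pick_kernel_sketch(int out_ch, int in_ch, int groups)
{
  if (out_ch == 4 && in_ch == 4)
  {
    if (groups == 1) return &kernel_sketch<4, 4, 1>;
    if (groups == 2) return &kernel_sketch<4, 4, 2>;
  }
  if (out_ch == 8 && in_ch == 8)
  {
    if (groups == 1) return &kernel_sketch<8, 8, 1>;
    if (groups == 2) return &kernel_sketch<8, 8, 2>;
    if (groups == 4) return &kernel_sketch<8, 8, 4>;
  }
  return nullptr; // unregistered shape: caller falls back
}
```

A layer would call the picker once at construction and store the result; the per-buffer path then reduces to a null check followed by an indirect call, with the generic path taken when the pointer is null.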

Depthwise (groups == channels) is intentionally not registered — already handled by the existing _is_depthwise fast path from #217.

Registered square shapes: (4,4), (6,6), (8,8), (12,12), (16,16) at groups in {1, 2, 3, 4, 6, 8}.

Microbenchmark (Conv1x1, 64-frame buffer, best of 3 x 2M iters)

| Shape | Baseline STD | Templated STD | Speedup |
|---|---|---|---|
| 8x8 G=1 | 167 ns | 69 ns | 2.4x |
| 8x8 G=2 | 170 ns | 36 ns | 4.7x |
| 8x8 G=4 | 169 ns | 49 ns | 3.5x |
| 16x16 G=2 | ~426 ns | 146 ns | 2.9x |
| 16x16 G=4 | ~426 ns | 66 ns | 6.5x |
| 16x16 G=8 | ~426 ns | 97 ns | 4.4x |
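For readers who want to reproduce numbers of this kind, a "best of N repeats" timing loop can be sketched as below. This is not the PR's tools/bench_conv1x1_groups.cpp; the helper name best_ns_per_iter is hypothetical, and the sketch assumes the timed callable writes to memory the optimizer cannot discard.

```cpp
#include <cassert>
#include <chrono>
#include <limits>

// Runs fn() `iters` times per repeat and returns the lowest mean
// nanoseconds-per-call seen across `repeats` repeats. Taking the minimum
// filters out repeats disturbed by scheduling noise.
template <typename Fn>
double best_ns_per_iter(Fn&& fn, int repeats, long iters)
{
  assert(repeats > 0 && iters > 0);
  using clock = std::chrono::steady_clock;
  double best = std::numeric_limits<double>::infinity();
  for (int r = 0; r < repeats; ++r)
  {
    const auto t0 = clock::now();
    for (long i = 0; i < iters; ++i)
      fn();
    const auto t1 = clock::now();
    const double ns = std::chrono::duration<double, std::nano>(t1 - t0).count() / static_cast<double>(iters);
    if (ns < best)
      best = ns;
  }
  return best;
}
```

With repeats = 3 and iters = 2000000 this matches the "best of 3 x 2M iters" protocol named in the heading, although the actual tool may differ in warm-up and in how it keeps the compiler from eliding the work.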

End-to-end (benchmodel, best of 5 runs, STD build, Apple M-series Release)

v4/1x1_groups/*.nam (varies layer1x1.groups, 8-channel WaveNet):

| G | Baseline | Templated | Delta |
|---|---|---|---|
| 1 | 9.97 ms | 8.76 ms | -12% |
| 2 | 9.83 ms | 8.29 ms | -16% |
| 4 | 9.91 ms | 8.55 ms | -14% |
| 8 | 8.99 ms | 9.13 ms | ~same (depthwise unchanged) |

v4/input_groups/*.nam (varies Conv1D groups_input):

| G | Baseline | Templated | Delta |
|---|---|---|---|
| 1 | 7.42 ms | 6.66 ms | -10% |
| 2 | 6.97 ms | 5.62 ms | -19% |
| 4 | 6.98 ms | 5.41 ms | -23% |
| 8 | 6.66 ms | 5.53 ms | -17% |

v4/channels/*.nam (varies channel width, groups=1 — shows dense gains from bypassing Eigen overhead on small matrices):

| ch | Baseline | Templated | Delta |
|---|---|---|---|
| 4 | 17.21 ms | 12.39 ms | -28% |
| 8 | 29.36 ms | 25.11 ms | -14% |
| 12 | 45.71 ms | 38.40 ms | -16% |
| 16 | 65.50 ms | 65.04 ms | -1% |

The wins hold across Eigen versions: verified against both this repo's pinned Eigen 3.4-dev (87300c93) and a separately-tracked Eigen 5.0.0 bump.

Correctness

| Check | Result |
|---|---|
| run_tests x {STD, NAM_USE_INLINE_GEMM} | pass |
| Conv1x1 templated vs reference dense GEMM (22 shape x groups combinations, random weights) | max diff 1.2e-7 |
| Conv1D templated vs reference dense GEMM (198 shape x groups x K x dilation combinations, random weights) | max diff 2.4e-7 |
| Render parity on 33 production models (baseline a1-{pico,nano,feather,lite,standard}, channels/{1..16}, bottleneck_sizes, 1x1_groups, input_groups, head1x1_groups) | bit-identical WAV output (max diff = 0.0) |
| git-clang-format --diff HEAD | clean |
| New compiler warnings | none |
| Realtime allocations (raw-pointer kernel, no Eigen temporaries) | none |
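The "templated vs reference dense GEMM" rows compare a grouped result against a dense product whose weight matrix is block-diagonal. A minimal sketch of that kind of check follows; the helper names and the compact per-group row-major weight layout are assumptions, not the contents of the PR's check tools.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Expands compact grouped weights into a dense out_ch x in_ch row-major
// matrix with zeros everywhere off the block diagonal.
std::vector<float> expand_to_dense(const std::vector<float>& grouped, int out_ch, int in_ch, int groups)
{
  const int opg = out_ch / groups, ipg = in_ch / groups;
  std::vector<float> dense(static_cast<std::size_t>(out_ch) * in_ch, 0.0f);
  for (int g = 0; g < groups; ++g)
    for (int o = 0; o < opg; ++o)
      for (int i = 0; i < ipg; ++i)
        dense[(g * opg + o) * in_ch + g * ipg + i] = grouped[(g * opg + o) * ipg + i];
  return dense;
}

// Reference path: plain dense matrix-vector product for one frame.
std::vector<float> dense_matvec(const std::vector<float>& dense, const std::vector<float>& x, int out_ch, int in_ch)
{
  std::vector<float> y(out_ch, 0.0f);
  for (int o = 0; o < out_ch; ++o)
    for (int i = 0; i < in_ch; ++i)
      y[o] += dense[o * in_ch + i] * x[i];
  return y;
}

// Path under test: visits only the per-group blocks.
std::vector<float> grouped_matvec(const std::vector<float>& grouped, const std::vector<float>& x, int out_ch,
                                  int in_ch, int groups)
{
  const int opg = out_ch / groups, ipg = in_ch / groups;
  std::vector<float> y(out_ch, 0.0f);
  for (int g = 0; g < groups; ++g)
    for (int o = 0; o < opg; ++o)
      for (int i = 0; i < ipg; ++i)
        y[g * opg + o] += grouped[(g * opg + o) * ipg + i] * x[g * ipg + i];
  return y;
}

// Largest elementwise absolute difference between two equal-length vectors.
float max_abs_diff(const std::vector<float>& a, const std::vector<float>& b)
{
  assert(a.size() == b.size());
  float worst = 0.0f;
  for (std::size_t k = 0; k < a.size(); ++k)
    worst = std::max(worst, std::fabs(a[k] - b[k]));
  return worst;
}
```

With random weights the two paths can differ by a few float ULPs because the dense path accumulates in a different order, which is consistent with differences around 1e-7 rather than exactly zero.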

Diff

  • NAM/dsp.h +10, NAM/dsp.cpp +120/-2
  • NAM/conv1d.h +11, NAM/conv1d.cpp +132
  • tools/CMakeLists.txt +36
  • new tools/bench_conv1x1_groups.cpp (microbench + correctness gate)
  • new tools/check_conv1d_grouped.cpp (Conv1D correctness gate across 198 shapes)

Notes for reviewers

  • pick_*_kernel tables are intentionally narrow — only square shapes that appear in the v4 model sweep. Trivial to extend later. Anything unregistered keeps the existing behavior exactly.
  • The bench/check tools are not wired into run_tests because they need -O2/-O3 (run_tests is -O0 for allocation tracking). They run cleanly as standalone CI steps if you want them gated.

Test plan

  • run_tests passes on STD + NAM_USE_INLINE_GEMM
  • tools/bench_conv1x1_groups correctness gate passes (22 shapes)
  • tools/check_conv1d_grouped passes (198 shapes)
  • Render parity on 33 production models: bit-identical
  • git-clang-format --diff HEAD clean

Compile-time-specialized GEMM kernels for the (out_channels, in_channels,
groups) shapes used by WaveNet models. Generalizes the depthwise-only fast
path from sdatkinson#217 to all grouped (and small dense) cases, addressing sdatkinson#215.

Both the default Eigen path and NAM_USE_INLINE_GEMM build benefit; unknown
shapes fall through to existing behavior.

Render output is bit-identical to main on 33 production models including
the v4 baseline a1-{pico,nano,feather,lite,standard} set.

Development

Successfully merging this pull request may close these issues.

[PERFORMANCE] Grouped convolutions appear dominated by overhead
