[PERFORMANCE] Templated kernels for grouped Conv1x1/Conv1D by rhaist · Pull Request #271 · sdatkinson/NeuralAmpModelerCore

rhaist · 2026-05-24T13:36:28Z

Generalizes the depthwise-only fast path from #217 to all grouped (and small dense) Conv1x1 / Conv1D shapes using compile-time-specialized kernels. Targets the "compile-time optimizations" path hinted at by #215.

Closes #215.

Approach

templated_conv1x1_kernel<OutCh, InCh, Groups> and templated_conv1d_tap_kernel<OutCh, InCh, Groups> carry all shape information as template parameters. Because every loop bound is constexpr, the compiler can fully unroll the loops and fold the index arithmetic, and the kernel never visits the off-block-diagonal zeros that a dense GEMM would multiply through.
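A minimal sketch of what a shape-specialized grouped 1x1 kernel could look like. The name `conv1x1_kernel_sketch`, the weight layout (one row-major block per group), and the frame-major buffer layout are assumptions made for illustration and are not taken from the PR's code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical layouts (assumptions, not the PR's actual storage):
//   weights: Groups blocks, each a row-major (OutCh/Groups) x (InCh/Groups) matrix
//   in/out:  frame-major, i.e. frame f starts at in + f * InCh / out + f * OutCh
template <int OutCh, int InCh, int Groups>
void conv1x1_kernel_sketch(const float* weights, const float* in, float* out, int num_frames)
{
  static_assert(OutCh % Groups == 0 && InCh % Groups == 0, "groups must divide both channel counts");
  constexpr int OutPerGroup = OutCh / Groups;
  constexpr int InPerGroup = InCh / Groups;
  assert(num_frames >= 0);

  for (int f = 0; f < num_frames; ++f)
  {
    const float* x = in + static_cast<std::ptrdiff_t>(f) * InCh;
    float* y = out + static_cast<std::ptrdiff_t>(f) * OutCh;
    // All three inner bounds are compile-time constants, so the compiler
    // is free to unroll them completely.
    for (int g = 0; g < Groups; ++g)
    {
      const float* w = weights + g * OutPerGroup * InPerGroup;
      for (int o = 0; o < OutPerGroup; ++o)
      {
        float acc = 0.0f;
        // Only the inputs of group g are read; the zeros outside the
        // diagonal block are never touched.
        for (int i = 0; i < InPerGroup; ++i)
          acc += w[o * InPerGroup + i] * x[g * InPerGroup + i];
        y[g * OutPerGroup + o] = acc;
      }
    }
  }
}
```

With Groups == 1 the same template degenerates to a small dense matrix-vector product per frame, which is how a single kernel family can cover both the grouped and the small dense cases.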

Dispatch goes through a function pointer (_kernel / _tap_kernel) set at construction by pick_*_kernel(out, in, groups). For unregistered shapes the picker returns nullptr and the layer falls through to the existing inline-GEMM / Eigen path, so their behavior is unchanged. Both the default Eigen build and the NAM_USE_INLINE_GEMM build benefit, since there is no #ifdef gate around the dispatch.
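The dispatch scheme described above can be sketched roughly as follows. All names here (`Conv1x1KernelFn`, `grouped_kernel_sketch`, `pick_kernel_sketch`, `Conv1x1Sketch`, `try_process`) are hypothetical, the registered shapes are a small subset chosen for brevity, and the data layouts are assumed rather than taken from the PR.

```cpp
#include <cassert>

// One pointer type that can hold any specialization.
using Conv1x1KernelFn = void (*)(const float* weights, const float* in, float* out, int num_frames);

// Compact grouped kernel. Assumed layouts: per-group row-major weight
// blocks, frame-major input and output.
template <int OutCh, int InCh, int Groups>
void grouped_kernel_sketch(const float* weights, const float* in, float* out, int num_frames)
{
  static_assert(OutCh % Groups == 0 && InCh % Groups == 0, "groups must divide channels");
  constexpr int OPG = OutCh / Groups;
  constexpr int IPG = InCh / Groups;
  for (int f = 0; f < num_frames; ++f)
    for (int g = 0; g < Groups; ++g)
      for (int o = 0; o < OPG; ++o)
      {
        float acc = 0.0f;
        for (int i = 0; i < IPG; ++i)
          acc += weights[(g * OPG + o) * IPG + i] * in[f * InCh + g * IPG + i];
        out[f * OutCh + g * OPG + o] = acc;
      }
}

// Maps a runtime shape to a compile-time specialization, or nullptr.
inline Conv1x1KernelFn pick_kernel_sketch(int out_ch, int in_ch, int groups)
{
  if (out_ch == 4 && in_ch == 4)
  {
    if (groups == 1) return &grouped_kernel_sketch<4, 4, 1>;
    if (groups == 2) return &grouped_kernel_sketch<4, 4, 2>;
  }
  else if (out_ch == 8 && in_ch == 8)
  {
    if (groups == 1) return &grouped_kernel_sketch<8, 8, 1>;
    if (groups == 2) return &grouped_kernel_sketch<8, 8, 2>;
    if (groups == 4) return &grouped_kernel_sketch<8, 8, 4>;
  }
  return nullptr; // unregistered shape (depthwise included): caller keeps its old path
}

struct Conv1x1Sketch
{
  Conv1x1KernelFn kernel; // chosen once, at construction

  Conv1x1Sketch(int out_ch, int in_ch, int groups)
  : kernel(pick_kernel_sketch(out_ch, in_ch, groups))
  {
  }

  // Returns true if the specialized kernel ran; false tells the caller
  // to use its generic path instead.
  bool try_process(const float* w, const float* in, float* out, int num_frames) const
  {
    assert(num_frames >= 0);
    if (kernel == nullptr)
      return false;
    kernel(w, in, out, num_frames);
    return true;
  }
};
```

Resolving the pointer once at construction keeps the per-buffer cost to a single null check plus an indirect call, and the fallback branch is the only place the generic path needs to know the fast path exists.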

Depthwise (groups == channels) is intentionally not registered — already handled by the existing _is_depthwise fast path from #217.

Registered square shapes: (4,4), (6,6), (8,8), (12,12), (16,16) at groups in {1, 2, 3, 4, 6, 8}.

Microbenchmark (Conv1x1, 64-frame buffer, best of 3 x 2M iters)

| Shape | Baseline STD | Templated STD | Speedup |
|---|---|---|---|
| 8x8 G=1 | 167 ns | 69 ns | 2.4x |
| 8x8 G=2 | 170 ns | 36 ns | 4.7x |
| 8x8 G=4 | 169 ns | 49 ns | 3.5x |
| 16x16 G=2 | ~426 ns | 146 ns | 2.9x |
| 16x16 G=4 | ~426 ns | 66 ns | 6.5x |
| 16x16 G=8 | ~426 ns | 97 ns | 4.4x |
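For readers who want to reproduce numbers of this kind, a "best of N repeats x M iterations" timer might look like the sketch below. The helper name `best_of_ns_per_iter` is hypothetical and this is not the code in tools/bench_conv1x1_groups.cpp; it only illustrates the measurement scheme named in the heading.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <limits>

// Runs fn() `iters` times per repeat and returns the lowest average
// time per call, in nanoseconds, over all repeats. Taking the minimum
// filters out repeats disturbed by scheduling noise.
template <typename Fn>
double best_of_ns_per_iter(Fn&& fn, int repeats, long iters)
{
  assert(repeats > 0 && iters > 0);
  double best = std::numeric_limits<double>::infinity();
  for (int r = 0; r < repeats; ++r)
  {
    const auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
      fn();
    const auto t1 = std::chrono::steady_clock::now();
    const double total_ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    best = std::min(best, total_ns / static_cast<double>(iters));
  }
  return best;
}
```

The callable should write its result somewhere the optimizer cannot discard (for example a volatile sink or an output buffer that is read afterwards), otherwise an optimized build may remove the work being timed.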

End-to-end (`benchmodel`, best of 5 runs, STD build, Apple M-series Release)

v4/1x1_groups/*.nam (varies layer1x1.groups, 8-channel WaveNet):

| G | Baseline | Templated | Delta |
|---|---|---|---|
| 1 | 9.97 ms | 8.76 ms | -12% |
| 2 | 9.83 ms | 8.29 ms | -16% |
| 4 | 9.91 ms | 8.55 ms | -14% |
| 8 | 8.99 ms | 9.13 ms | ~same (depthwise unchanged) |

v4/input_groups/*.nam (varies Conv1D groups_input):

| G | Baseline | Templated | Delta |
|---|---|---|---|
| 1 | 7.42 ms | 6.66 ms | -10% |
| 2 | 6.97 ms | 5.62 ms | -19% |
| 4 | 6.98 ms | 5.41 ms | -23% |
| 8 | 6.66 ms | 5.53 ms | -17% |
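The Conv1D numbers above go through the per-tap kernel. One plausible reading of "tap kernel" is a grouped matrix product that accumulates into the output once per kernel tap, with the input pointer shifted by the dilation. The sketch below follows that reading; the names, the tap-major weight layout, and the pre-padded causal input buffer are all assumptions, not the PR's implementation.

```cpp
#include <cassert>

// Adds one tap's contribution: out[f] += W_tap * in[f] for every frame.
// Assumed layouts: per-group row-major weight blocks, frame-major buffers.
template <int OutCh, int InCh, int Groups>
void tap_accumulate_sketch(const float* tap_weights, const float* in, float* out, int num_frames)
{
  static_assert(OutCh % Groups == 0 && InCh % Groups == 0, "groups must divide channels");
  constexpr int OPG = OutCh / Groups;
  constexpr int IPG = InCh / Groups;
  for (int f = 0; f < num_frames; ++f)
    for (int g = 0; g < Groups; ++g)
      for (int o = 0; o < OPG; ++o)
      {
        float acc = 0.0f;
        for (int i = 0; i < IPG; ++i)
          acc += tap_weights[(g * OPG + o) * IPG + i] * in[f * InCh + g * IPG + i];
        out[f * OutCh + g * OPG + o] += acc; // accumulate, do not overwrite
      }
}

// Causal dilated convolution built from per-tap calls.
// padded_in must hold (kernel_size - 1) * dilation history frames followed
// by num_frames new frames; output frame f reads padded frames f + k * dilation.
template <int OutCh, int InCh, int Groups>
void conv1d_sketch(const float* weights, int kernel_size, int dilation, const float* padded_in, float* out,
                   int num_frames)
{
  assert(kernel_size > 0 && dilation > 0 && num_frames >= 0);
  constexpr int TapSize = (OutCh / Groups) * (InCh / Groups) * Groups; // floats per tap
  for (int i = 0; i < num_frames * OutCh; ++i)
    out[i] = 0.0f;
  for (int k = 0; k < kernel_size; ++k)
    tap_accumulate_sketch<OutCh, InCh, Groups>(weights + k * TapSize, padded_in + k * dilation * InCh, out,
                                               num_frames);
}
```

Keeping kernel size and dilation as runtime arguments while specializing only on (OutCh, InCh, Groups) means one table of specializations can serve every layer that shares a channel shape.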

v4/channels/*.nam (varies channel width, groups=1 — shows dense gains from bypassing Eigen overhead on small matrices):

| ch | Baseline | Templated | Delta |
|---|---|---|---|
| 4 | 17.21 ms | 12.39 ms | -28% |
| 8 | 29.36 ms | 25.11 ms | -14% |
| 12 | 45.71 ms | 38.40 ms | -16% |
| 16 | 65.50 ms | 65.04 ms | -1% |

The gains hold across Eigen versions: they were verified against both this repo's pinned Eigen 3.4-dev (87300c93) and a separately tracked Eigen 5.0.0 bump.

Correctness

| Check | Result |
|---|---|
| `run_tests` x {STD, NAM_USE_INLINE_GEMM} | pass |
| Conv1x1 templated vs reference dense GEMM (22 shape x groups combinations, random weights) | max diff 1.2e-7 |
| Conv1D templated vs reference dense GEMM (198 shape x groups x K x dilation combinations, random weights) | max diff 2.4e-7 |
| Render parity on 33 production models (`baseline a1-{pico,nano,feather,lite,standard}`, `channels/{1..16}`, `bottleneck_sizes`, `1x1_groups`, `input_groups`, `head1x1_groups`) | bit-identical WAV output (max diff = 0.0) |
| `git-clang-format --diff HEAD` | clean |
| New compiler warnings | none |
| Realtime allocations (raw-pointer kernel, no Eigen temporaries) | none |

Diff

NAM/dsp.h +10, NAM/dsp.cpp +120/-2
NAM/conv1d.h +11, NAM/conv1d.cpp +132
tools/CMakeLists.txt +36
new tools/bench_conv1x1_groups.cpp (microbench + correctness gate)
new tools/check_conv1d_grouped.cpp (Conv1D correctness gate across 198 shapes)

Notes for reviewers

The pick_*_kernel tables are intentionally narrow: they cover only the square shapes that appear in the v4 model sweep. They can be extended later, and any unregistered shape keeps the existing behavior exactly.
The bench/check tools are not wired into run_tests because they need -O2/-O3, whereas run_tests is built at -O0 for allocation tracking. They run cleanly as standalone CI steps if you want them gated.

Test plan

run_tests passes on STD + NAM_USE_INLINE_GEMM
tools/bench_conv1x1_groups correctness gate passes (22 shapes)
tools/check_conv1d_grouped passes (198 shapes)
Render parity on 33 production models: bit-identical
git-clang-format --diff HEAD clean

Compile-time-specialized GEMM kernels for the (out_channels, in_channels, groups) shapes used by WaveNet models. Generalizes the depthwise-only fast path from sdatkinson#217 to all grouped (and small dense) cases, addressing sdatkinson#215. Both the default Eigen build and the NAM_USE_INLINE_GEMM build benefit; unregistered shapes fall through to the existing behavior. Render output is bit-identical to main on 33 production models, including the v4 baseline a1-{pico,nano,feather,lite,standard} set.

rhaist wants to merge 1 commit into sdatkinson:main from rhaist:perf/templated-grouped-conv-kernels
