Hopper GEMM kernels, microbenchmarks, and tuning notes in one repo.
Kebab is a CUDA/CuTe playground for studying fast GEMM on NVIDIA Hopper.
The source lives in kebab/; the repo root stays focused on build entrypoints,
config, results, and docs.
- CUDA GEMM versions side by side, from simpler baselines to deeper Hopper kernels
- CuTe implementations, validation harnesses, and benchmark runners
- Focused microbenchmarks for WGMMA, TMA copy paths, sparse MMA, and metadata packing
- One config.yaml for reproducible runs
The interesting paths are Hopper-only. Requirements: sm_90 / sm_90a, CUDA 12.x, and yaml-cpp.
Generated by make bench-gemm with the checked-in config.yaml on NVIDIA H800 PCIe.
FP16, mode RC, all numbers in TFLOP/s.
| M = N = K | cuBLAS | v2 | v3 | v4 | v5 | v10 | Best (TFLOP/s, % of cuBLAS) |
|---|---|---|---|---|---|---|---|
| 2048 | 420.7 | 220.6 | 249.7 | 290.7 | 255.0 | 290.5 | v4 (290.7, 69.1%) |
| 4096 | 473.6 | 237.4 | 294.3 | 334.0 | 379.0 | 309.5 | v5 (379.0, 80.0%) |
| 8192 | 412.5 | 187.0 | 277.9 | 248.8 | 336.7 | 290.6 | v5 (336.7, 81.6%) |
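
As a reading aid, here is a minimal Python sketch of how the last column can be recomputed from the other columns: pick the fastest non-cuBLAS version in a row and express it as a percentage of the cuBLAS number. The names `ROWS`, `best_kernel`, and `gemm_tflops` are hypothetical and not part of the repo. The `2 * M * N * K` FLOP count in `gemm_tflops` is the usual GEMM convention and is an assumption about how the benchmark reports throughput.

```python
# Measured TFLOP/s copied from the table above (FP16, mode RC).
ROWS = {
    2048: {"cuBLAS": 420.7, "v2": 220.6, "v3": 249.7, "v4": 290.7, "v5": 255.0, "v10": 290.5},
    4096: {"cuBLAS": 473.6, "v2": 237.4, "v3": 294.3, "v4": 334.0, "v5": 379.0, "v10": 309.5},
    8192: {"cuBLAS": 412.5, "v2": 187.0, "v3": 277.9, "v4": 248.8, "v5": 336.7, "v10": 290.6},
}


def best_kernel(row):
    """Return (name, tflops, percent_of_cublas) for the fastest non-cuBLAS entry."""
    baseline = row["cuBLAS"]
    candidates = [(name, tflops) for name, tflops in row.items() if name != "cuBLAS"]
    name, tflops = max(candidates, key=lambda item: item[1])
    return name, tflops, round(100.0 * tflops / baseline, 1)


def gemm_tflops(m, n, k, seconds):
    """Throughput in TFLOP/s, assuming 2*M*N*K FLOPs per GEMM (one multiply, one add)."""
    return (2 * m * n * k) / seconds / 1e12


if __name__ == "__main__":
    for size, row in ROWS.items():
        name, tflops, pct = best_kernel(row)
        print(f"{size}: {name} ({tflops}, {pct}%)")
```

Running the script prints one line per problem size in the same form as the Best column, which makes it easy to check a fresh run against the checked-in numbers.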
The point of the repo is the ladder, not one magic kernel: each version exposes a different Hopper idea you can measure, compare, and inspect.
make setup
make bench-gemm

Results land in bench_results/. If you want different sizes, versions, or
precisions, edit config.yaml.
- make mbench-mma-wgmma - raw WGMMA behavior
- make mbench-copy-gmem-to-smem-2d-tma-cute - TMA + CuTe copy path
- make mbench-sparse-mma - Hopper 2:4 sparse MMA experiment
- make mbench-cutlass-meta-probe - CUTLASS metadata packing sanity check
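
For readers new to the 2:4 pattern that the sparse MMA probe targets, the sketch below shows the idea in plain Python: in every group of four consecutive values, two are kept and their positions within the group are recorded as metadata. This is only an illustration of the concept. The function names are hypothetical, magnitude-based pruning is one common way to choose which values to keep, and the real packed metadata layout that CUTLASS and the hardware expect is more involved than a flat index list.

```python
def prune_2_of_4(row):
    """Keep the two largest-magnitude values in each group of four.

    Returns (values, indices): the kept values in order, and for each kept
    value its position (0..3) inside its group of four.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, indices = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Rank positions by descending magnitude, take two, report them ascending.
        ranked = sorted(range(4), key=lambda i: -abs(group[i]))
        keep = sorted(ranked[:2])
        values.extend(group[i] for i in keep)
        indices.extend(keep)
    return values, indices


def expand_2_of_4(values, indices):
    """Rebuild the dense row from compressed values and per-group indices."""
    dense = [0.0] * (len(values) * 2)
    for slot, (value, index) in enumerate(zip(values, indices)):
        group = slot // 2  # two kept values per group of four
        dense[group * 4 + index] = value
    return dense


if __name__ == "__main__":
    row = [0.1, -3.0, 2.0, 0.5, 1.0, 0.0, 0.0, -4.0]
    values, indices = prune_2_of_4(row)
    print(values, indices)
    print(expand_2_of_4(values, indices))
```

Each index fits in two bits, so a group of four carries four bits of metadata alongside its two stored values; that compressed form is what a sparse MMA consumes in place of the dense operand.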
- kebab/lib/cuda - CUDA GEMM versions and baselines
- kebab/lib/cute - CuTe kernels
- kebab/lib/benchmark - operator benchmarks
- kebab/lib/microbench - focused kernel probes
- docs/ - design notes and optimization writeups
- config.yaml - benchmark/runtime config
MIT. See LICENSE.