vulkan: Q2_0 #32
Conversation
Awesome, thanks for giving this a try. Do you have some prompt processing and token generation speeds with llama-bench?
Pull request overview
Adds Vulkan shader-side support for GGML_TYPE_Q2_0 (Ternary-Bonsai / 1.58-bit ternary weights) so Q2_0 tensors can run fully on Vulkan instead of falling back to CPU, plus adds test coverage for correctness.
Changes:
- Implement Q2_0 dequant/quant logic across Vulkan shader paths (matvec, matmul, coopmat2, get_rows/to_fp16, cpy/set_rows).
- Wire shader generation and Vulkan pipeline creation to build and dispatch the new Q2_0 shader variants.
- Add a CPU shader-simulator test and extend backend-op tests to exercise Q2_0 on Vulkan.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test-vulkan-q2_0-shader-sim.cpp | New CPU simulator validating Q2_0 shader logic against CPU reference. |
| tests/test-backend-ops.cpp | Adds Q2_0 to tested type sets. |
| tests/CMakeLists.txt | Registers the new simulator test target. |
| ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp | Adds q2_0 to shader generation and enables Q2_0 copy/set_rows variants. |
| ggml/src/ggml-vulkan/vulkan-shaders/types.glsl | Defines block_q2_0 and associated macros. |
| ggml/src/ggml-vulkan/vulkan-shaders/mul_mm_funcs.glsl | Adds Q2_0 shared-memory load/decode path for matmul. |
| ggml/src/ggml-vulkan/vulkan-shaders/dequant_q2_0.comp | New standalone Q2_0 block dequant kernel. |
| ggml/src/ggml-vulkan/vulkan-shaders/dequant_funcs.glsl | Adds Q2_0 dequantize, dequantize4, and get_dm helpers. |
| ggml/src/ggml-vulkan/vulkan-shaders/dequant_funcs_cm2.glsl | Adds Q2_0 dequant function + dispatch selection for coopmat2. |
| ggml/src/ggml-vulkan/vulkan-shaders/copy_to_quant.comp | Adds Q2_0 quantize path for f32 → Q2_0. |
| ggml/src/ggml-vulkan/ggml-vulkan.cpp | Adds Q2_0 pipeline creation and enables Q2_0 in relevant dispatch switches. |
Thanks for letting me know! Copilot found an actual bug and I just patched it in a new commit. And the llama benchmarks:
Awesome, Copilot is sometimes helpful.
Got it! Just ran the PR #8 KL workflow. I did three runs to separate kernel parity from intrinsic Q2_0 quantization loss:
Rows 1 and 2 are byte-identical, so the Vulkan kernel adds zero measurable loss on top of the existing Q2_0 quantization. Row 3 confirms direct cross-backend parity at ULP noise.
PPL Vulkan Q2_0: 13.285 +/- 0.81
Nice, yeah it looks pretty good. I will do another pass at the code and merge it.
@@ -0,0 +1,959 @@
// cpu simulation of the q2_0 vulkan shader functions.
This is mostly unit tests and end-to-end testing, right?
Eventually I'm planning to send PRs to the main llama.cpp, so I need to see if they want something like this; for our fork it's okay with me to have it.
Yes, this file is just unit tests and property tests of the shader algorithm against the CPU reference :)
khosravipasha left a comment
Thanks, LGTM.
I'm planning to start sending PRs to main llama.cpp; I'll use your commit and add you as author for the Vulkan backend.
One thing to note: they wanted group size 64, so I need to make some changes to all our kernels. I'll probably make the official Q2_0 use group size 64, and then keep PQ2_0 with group size 128 in our fork.
// fp16 <-> fp32 (ieee 754 half precision).
//
// we need a behavioural match with ggml's GGML_FP32_TO_FP16 / GGML_FP16_TO_FP32.
// the simplest deterministic conversion is via memcpy bit-punning between float
// and uint32_t. we use the standard bit-twiddling versions, with one important
// invariant ggml requires: a finite fp32 input that overflows the fp16 range
// must clamp to (signed) infinity, NOT become NaN. a NaN result is reserved
// for actual fp32 NaN inputs (raw exponent 0xff, non-zero mantissa).
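The invariant that comment describes can be sketched as a standalone conversion. This is a hedged illustration, not the ggml macro itself, written to make the clamp-to-infinity rule explicit (rounding here is nearest with ties up, a simplification):

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical fp32 -> fp16 conversion illustrating the invariant: finite
// inputs that overflow fp16 clamp to signed infinity, and only genuine fp32
// NaNs (raw exponent 0xff, non-zero mantissa) yield an fp16 NaN.
static uint16_t fp32_to_fp16_sketch(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);                   // bit-pun without UB
    const uint16_t sign = (uint16_t)((u >> 16) & 0x8000u);
    const uint32_t exp  = (u >> 23) & 0xffu;
    uint32_t mant = u & 0x7fffffu;

    if (exp == 0xff) {                               // fp32 inf or NaN
        return sign | 0x7c00u | (mant ? 0x200u : 0u);  // preserve NaN-ness
    }
    const int e = (int)exp - 127 + 15;               // rebias exponent to fp16
    if (e >= 0x1f) {
        return sign | 0x7c00u;                       // overflow: clamp to inf, never NaN
    }
    if (e <= 0) {                                    // fp16 subnormal or zero
        if (e < -10) return sign;                    // underflow to signed zero
        mant |= 0x800000u;                           // restore implicit leading 1
        const uint32_t shift = (uint32_t)(14 - e);
        uint16_t h = (uint16_t)(mant >> shift);
        if (mant & (1u << (shift - 1))) h++;         // round to nearest (ties up)
        return sign | h;
    }
    uint16_t h = (uint16_t)(((uint32_t)e << 10) | (mant >> 13));
    if (mant & 0x1000u) h++;                         // round to nearest (ties up)
    return sign | h;
}
```

Note the overflow branch returns before any rounding, so a huge finite float can never fall through to the NaN encoding.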
Overview
This PR adds Q2_0 quantization support to the Vulkan backend! With this change, Ternary-Bonsai models (which use Q2_0 to pack 1.58-bit ternary weights) run on any Vulkan-capable GPU instead of falling back to CPU or erroring out at load time.
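As context for what the shaders decode, here is a hedged sketch of a Q2_0 block as a scalar dequant would see it. The layout and the code-to-ternary mapping below are assumptions for illustration only; the authoritative definitions live in `types.glsl` and `ggml-quants.c`:

```cpp
#include <cstdint>

// Hypothetical Q2_0 block: 128 ternary values per block, each stored as a
// 2-bit code, plus one fp16 scale. Assumed mapping: code {0,1,2} -> {-1,0,+1}.
#define QK2_0 128                        // values per block (group size)

struct block_q2_0_sketch {
    uint16_t d;                          // fp16 scale bits (not decoded here)
    uint8_t  qs[QK2_0 / 4];              // 32 bytes = 128 x 2-bit codes
};

// Decode one block into floats, given the scale already converted to fp32.
static void dequant_block_q2_0_sketch(const block_q2_0_sketch * b, float d, float * out) {
    for (int i = 0; i < QK2_0; ++i) {
        const uint8_t byte = b->qs[i / 4];
        const uint8_t code = (byte >> (2 * (i % 4))) & 0x3;  // 4 codes per byte
        out[i] = ((int)code - 1) * d;    // ternary {-1, 0, +1} scaled by d
    }
}
```

The shader paths below all reduce to some vectorized or shared-memory variant of this loop.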
The five GLSL shader files this PR adds or extends:
- `dequant_funcs.glsl` (new Q2_0 case for `dequantize`, `dequantize4`, and `get_dm`, used by the matrix-vector path that runs the token-generation hot loop)
- `dequant_q2_0.comp` (new standalone block dequant kernel, used by `GGML_OP_GET_ROWS` and `to_fp16` conversion)
- `dequant_funcs_cm2.glsl` (new `dequantFuncQ2_0` plus dispatch macro for the NVIDIA cooperative-matrix-2 path)
- `mul_mm_funcs.glsl` (new Q2_0 branch in `load_a_to_shmem`, used by the tiled matmul that drives prompt processing)
- `copy_to_quant.comp` (new Q2_0 `quantize` function for f32 → Q2_0 conversion, used by `GGML_OP_CPY` and `GGML_OP_SET_ROWS`)

The mechanical Q2_0 wiring already on the `prism` branch (block layout in `types.glsl`, codegen entries in `vulkan-shaders-gen.cpp`, and all 32 `GGML_TYPE_Q2_0` registration points in `ggml-vulkan.cpp`) is unchanged. This PR drops in the actual shader logic those registrations were dispatching to.

Additional information
Each shader function is derived from the CPU reference (`ggml-quants.c::dequantize_row_q2_0` and `quantize_row_q2_0_ref`). The standalone dequant kernel `dequant_q2_0.comp` uses a thread map of 16 blocks per workgroup with each thread reading two adjacent bytes (covering all 128 codes per block, no overlaps, no gaps), matching the existing `{256*8, 1, 1}` workgroup denominator already registered in `ggml-vulkan.cpp` for Q2_0.

A standalone CPU simulator (`tests/test-vulkan-q2_0-shader-sim.cpp`) re-implements every shader function in C++ using the same uint8 shifts and masks the GLSL uses (the GLSL source is quoted in the comment immediately above each `sim_*` function so the translation can be verified by inspection). The simulator runs against the CPU reference and currently lands at 262,677 checks passing (0 failed). Coverage includes `dequantize`, `dequantize4`, the cm2 path, the mul_mm load, and the quantize path. The simulator is registered as a CTest target so it runs as part of `ctest`.

GPU-side validation comes from `tests/test-backend-ops.cpp` (Q2_0 added to `all_types`, `base_types`, and `other_types`). Results on Intel Arc 140V (Windows 11, Vulkan):
- `iq1_s` and `iq1_m` failures unrelated to this PR

End-to-end inference of `Ternary-Bonsai-8B-Q2_0.gguf` (399 tensors, full GPU offload via `-ngl 99`) loads cleanly through `llama-server` and produces coherent output (31.5 tok/s generation, 52 tok/s prompt processing on Intel Arc 140V).

Performance comparison via `test-backend-ops perf -o MUL_MAT` (m=4096, k=14336): Q2_0 sits between Q1_0 and Q4_0 on the matvec hot path (memory-bound, since Q2_0 reads twice the bytes per value of Q1_0 and half of Q4_0). Q2_0 wins on larger batches (n at or above 4), where the matmul shared-memory load benefits from the simpler unscaled-then-scale arithmetic and the tighter byte locality.
Requirements
Build requirements
- `glslc` available on PATH for SPIR-V generation
- `VK_KHR_shader_float16_int8` and `VK_KHR_8bit_storage` (already required by the rest of the Vulkan backend, no new extensions added by this PR)

Build invocation (Windows)
Test invocation
System this PR was developed and validated on
- `Ternary-Bonsai-8B-Q2_0.gguf` (399 tensors, fully offloaded with `-ngl 99`)