vulkan: Q2_0 #32
Conversation
Awesome, thanks for giving this a try. Do you have some prompt processing and token generation speeds with llama-bench?
Pull request overview
Adds Vulkan shader-side support for GGML_TYPE_Q2_0 (Ternary-Bonsai / 1.58-bit ternary weights) so Q2_0 tensors can run fully on Vulkan instead of falling back to CPU, plus adds test coverage for correctness.
Changes:
- Implement Q2_0 dequant/quant logic across Vulkan shader paths (matvec, matmul, coopmat2, get_rows/to_fp16, cpy/set_rows).
- Wire shader generation and Vulkan pipeline creation to build and dispatch the new Q2_0 shader variants.
- Add a CPU shader-simulator test and extend backend-op tests to exercise Q2_0 on Vulkan.
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test-vulkan-q2_0-shader-sim.cpp | New CPU simulator validating Q2_0 shader logic against CPU reference. |
| tests/test-backend-ops.cpp | Adds Q2_0 to tested type sets. |
| tests/CMakeLists.txt | Registers the new simulator test target. |
| ggml/src/ggml-vulkan/vulkan-shaders/vulkan-shaders-gen.cpp | Adds q2_0 to shader generation and enables Q2_0 copy/set_rows variants. |
| ggml/src/ggml-vulkan/vulkan-shaders/types.glsl | Defines block_q2_0 and associated macros. |
| ggml/src/ggml-vulkan/vulkan-shaders/mul_mm_funcs.glsl | Adds Q2_0 shared-memory load/decode path for matmul. |
| ggml/src/ggml-vulkan/vulkan-shaders/dequant_q2_0.comp | New standalone Q2_0 block dequant kernel. |
| ggml/src/ggml-vulkan/vulkan-shaders/dequant_funcs.glsl | Adds Q2_0 dequantize, dequantize4, and get_dm helpers. |
| ggml/src/ggml-vulkan/vulkan-shaders/dequant_funcs_cm2.glsl | Adds Q2_0 dequant function + dispatch selection for coopmat2. |
| ggml/src/ggml-vulkan/vulkan-shaders/copy_to_quant.comp | Adds Q2_0 quantize path for f32 → Q2_0. |
| ggml/src/ggml-vulkan/ggml-vulkan.cpp | Adds Q2_0 pipeline creation and enables Q2_0 in relevant dispatch switches. |
Thanks for letting me know! Copilot found an actual bug and I just patched it in a new commit. And the llama benchmarks:
Awesome, Copilot is sometimes helpful.
Got it! Just ran the PR #8 KL workflow. I did three runs to separate kernel parity from intrinsic Q2_0 quantization loss:
Rows 1 and 2 are byte-identical, so the Vulkan kernel adds zero measurable loss on top of the existing Q2_0 quantization. Row 3 confirms direct cross-backend parity at ULP noise.
PPL Vulkan Q2_0: 13.285 +/- 0.81
Nice, yeah it looks pretty good. I will do another pass at the code and merge it.
@@ -0,0 +1,959 @@
// cpu simulation of the q2_0 vulkan shader functions.
This is mostly unit tests and end-to-end testing, right?
Eventually I'm planning to send PRs to the main llama.cpp, so I need to see if they want something like this; for our fork it's okay with me to have it.
Yes, this file is just unit tests and property tests of the shader algorithm against the CPU reference :)
khosravipasha left a comment
Thanks, LGTM.
I'm planning to start sending PRs to main llama.cpp; I'll use your commit and add you as author for the Vulkan backend.
One thing to note: they wanted group size 64, so I need to make some changes to all our kernels. I'll probably make the official Q2_0 use group size 64, and then keep PQ2_0 with group size 128 in our fork.
// fp16 <-> fp32 (ieee 754 half precision).
//
// we need a behavioural match with ggml's GGML_FP32_TO_FP16 / GGML_FP16_TO_FP32.
// the simplest deterministic conversion is via memcpy bit-punning between float
// and uint32_t. we use the standard bit-twiddling versions, with one important
// invariant ggml requires: a finite fp32 input that overflows the fp16 range
// must clamp to (signed) infinity, NOT become NaN. a NaN result is reserved
// for actual fp32 NaN inputs (raw exponent 0xff, non-zero mantissa).
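The invariant that comment describes can be sketched as a standalone conversion. This is a hedged illustration, not the ggml macro itself, written to make the clamp-to-infinity rule explicit (rounding here is nearest with ties up, a simplification):

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical fp32 -> fp16 conversion illustrating the invariant: finite
// inputs that overflow fp16 clamp to signed infinity, and only genuine fp32
// NaNs (raw exponent 0xff, non-zero mantissa) yield an fp16 NaN.
static uint16_t fp32_to_fp16_sketch(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);                   // bit-pun without UB
    const uint16_t sign = (uint16_t)((u >> 16) & 0x8000u);
    const uint32_t exp  = (u >> 23) & 0xffu;
    uint32_t mant = u & 0x7fffffu;

    if (exp == 0xff) {                               // fp32 inf or NaN
        return sign | 0x7c00u | (mant ? 0x200u : 0u);  // preserve NaN-ness
    }
    const int e = (int)exp - 127 + 15;               // rebias exponent to fp16
    if (e >= 0x1f) {
        return sign | 0x7c00u;                       // overflow: clamp to inf, never NaN
    }
    if (e <= 0) {                                    // fp16 subnormal or zero
        if (e < -10) return sign;                    // underflow to signed zero
        mant |= 0x800000u;                           // restore implicit leading 1
        const uint32_t shift = (uint32_t)(14 - e);
        uint16_t h = (uint16_t)(mant >> shift);
        if (mant & (1u << (shift - 1))) h++;         // round to nearest (ties up)
        return sign | h;
    }
    uint16_t h = (uint16_t)(((uint32_t)e << 10) | (mant >> 13));
    if (mant & 0x1000u) h++;                         // round to nearest (ties up)
    return sign | h;
}
```

Note the overflow branch returns before any rounding, so a huge finite float can never fall through to the NaN encoding.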
Overview
This PR adds Q2_0 quantization support to the Vulkan backend! With this change, Ternary-Bonsai models (which use Q2_0 to pack 1.58-bit ternary weights) run on any Vulkan-capable GPU instead of falling back to CPU or erroring out at load time.
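As context for what the shaders decode, here is a hedged sketch of a Q2_0 block as a scalar dequant would see it. The layout and the code-to-ternary mapping below are assumptions for illustration only; the authoritative definitions live in `types.glsl` and `ggml-quants.c`:

```cpp
#include <cstdint>

// Hypothetical Q2_0 block: 128 ternary values per block, each stored as a
// 2-bit code, plus one fp16 scale. Assumed mapping: code {0,1,2} -> {-1,0,+1}.
#define QK2_0 128                        // values per block (group size)

struct block_q2_0_sketch {
    uint16_t d;                          // fp16 scale bits (not decoded here)
    uint8_t  qs[QK2_0 / 4];              // 32 bytes = 128 x 2-bit codes
};

// Decode one block into floats, given the scale already converted to fp32.
static void dequant_block_q2_0_sketch(const block_q2_0_sketch * b, float d, float * out) {
    for (int i = 0; i < QK2_0; ++i) {
        const uint8_t byte = b->qs[i / 4];
        const uint8_t code = (byte >> (2 * (i % 4))) & 0x3;  // 4 codes per byte
        out[i] = ((int)code - 1) * d;    // ternary {-1, 0, +1} scaled by d
    }
}
```

The shader paths below all reduce to some vectorized or shared-memory variant of this loop.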
The five GLSL shader files this PR adds or extends:
- `dequant_funcs.glsl` (new Q2_0 case for `dequantize`, `dequantize4`, and `get_dm`, used by the matrix-vector path that runs the token-generation hot loop)
- `dequant_q2_0.comp` (new standalone block dequant kernel, used by `GGML_OP_GET_ROWS` and `to_fp16` conversion)
- `dequant_funcs_cm2.glsl` (new `dequantFuncQ2_0` plus dispatch macro for the NVIDIA cooperative-matrix-2 path)
- `mul_mm_funcs.glsl` (new Q2_0 branch in `load_a_to_shmem`, used by the tiled matmul that drives prompt processing)
- `copy_to_quant.comp` (new Q2_0 `quantize` function for f32 → Q2_0 conversion, used by `GGML_OP_CPY` and `GGML_OP_SET_ROWS`)

The mechanical Q2_0 wiring already on the `prism` branch (block layout in `types.glsl`, codegen entries in `vulkan-shaders-gen.cpp`, and all 32 `GGML_TYPE_Q2_0` registration points in `ggml-vulkan.cpp`) is unchanged. This PR drops in the actual shader logic those registrations were dispatching to.

Additional information
Each shader function is derived from the CPU reference (`ggml-quants.c::dequantize_row_q2_0` and `quantize_row_q2_0_ref`). The standalone dequant kernel `dequant_q2_0.comp` uses a thread map of 16 blocks per workgroup with each thread reading two adjacent bytes (covering all 128 codes per block, no overlaps, no gaps), matching the existing `{256*8, 1, 1}` workgroup denominator already registered in `ggml-vulkan.cpp` for Q2_0.

A standalone CPU simulator (`tests/test-vulkan-q2_0-shader-sim.cpp`) re-implements every shader function in C++ using the same uint8 shifts and masks the GLSL uses (the GLSL source is quoted in the comment immediately above each `sim_*` function so the translation can be verified by inspection). The simulator runs against the CPU reference and currently lands at 262,677 checks passing (0 failed). Coverage includes `dequantize`, `dequantize4`, the cm2 path, the mul_mm load, and the quantize path. The simulator is registered as a CTest target so it runs as part of `ctest`.

GPU-side validation comes from `tests/test-backend-ops.cpp` (Q2_0 added to `all_types`, `base_types`, and `other_types`). Results on Intel Arc 140V (Windows 11, Vulkan):
- `iq1_s` and `iq1_m` failures unrelated to this PR

End-to-end inference of `Ternary-Bonsai-8B-Q2_0.gguf` (399 tensors, full GPU offload via `-ngl 99`) loads cleanly through `llama-server` and produces coherent output (31.5 tok/s generation, 52 tok/s prompt processing on Intel Arc 140V).

Performance comparison via `test-backend-ops perf -o MUL_MAT` (m=4096, k=14336): Q2_0 sits between Q1_0 and Q4_0 on the matvec hot path (memory-bound, since Q2_0 reads twice the bytes per value of Q1_0 and half of Q4_0). Q2_0 wins on larger batches (n at or above 4), where the matmul shared-memory load benefits from the simpler unscaled-then-scale arithmetic and the tighter byte locality.
Requirements
Build requirements
- `glslc` available on PATH for SPIR-V generation
- `VK_KHR_shader_float16_int8` and `VK_KHR_8bit_storage` (already required by the rest of the Vulkan backend, no new extensions added by this PR)

Build invocation (Windows)
Test invocation
System this PR was developed and validated on
- `Ternary-Bonsai-8B-Q2_0.gguf` (399 tensors, fully offloaded with `-ngl 99`)