feat: add q8 quantized decode path#15
Conversation
Signed-off-by: Zach Bennett <zach@worldflowai.com>
📝 WalkthroughWalkthroughAdds Q8_0 and Q4_K quantized GEMV support (C reference and AArch64 NEON kernels) and FP16↔F32 array conversion kernels with CPU-dispatch. The Llama runtime is refactored to use typed ChangesQuantized GEMV and FP16 Kernel Integration
Sequence Diagram(s)sequenceDiagram
participant pg_llama_eval_token
participant llama_load_token_embedding
participant eval_layer
participant PgFp16DSP
participant PgQgemvDSP
participant KVCache as KV Cache (pg_fp16)
pg_llama_eval_token->>llama_load_token_embedding: token id
note over llama_load_token_embedding: dequant Q8_0/Q4_K row or memcpy f32
llama_load_token_embedding-->>pg_llama_eval_token: f32 hidden state
pg_llama_eval_token->>eval_layer: f32 hidden state
eval_layer->>PgFp16DSP: from_f32(k_vec) → KV Cache (pg_fp16)
eval_layer->>PgFp16DSP: from_f32(v_vec) → KV Cache (pg_fp16)
eval_layer->>KVCache: read key/value slots
eval_layer->>PgFp16DSP: to_f32(key_slot) → key_head scratch
eval_layer->>PgFp16DSP: to_f32(value_slot) → value_head scratch
eval_layer->>PgQgemvDSP: q8_0_gemv / q4_k_gemv (typed PgMat)
PgQgemvDSP-->>eval_layer: output activations
eval_layer-->>pg_llama_eval_token: updated hidden state
Estimated code review effort🎯 5 (Critical) | ⏱️ ~120 minutes Possibly related PRs
Poem
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/tensor/kernels/qgemv/qgemv_ref.c`:
- Around line 16-27: The functions pg_q8_0_row_bytes and pg_q4_k_row_bytes
perform unchecked multiplication operations that can overflow when K is very
large, causing the result to wrap and produce incorrect stride values used in
later validation checks. Add overflow guards to both functions before performing
the multiplication (K / PG_Q8_0_BLOCK) * PG_Q8_0_BLOCK_BYTES and (K /
PG_Q4_K_BLOCK) * PG_Q4_K_BLOCK_BYTES respectively. Check if the multiplication
would overflow and return 0 (or an appropriate error value) if it would,
ensuring that invalid strides are caught early and prevent out-of-bounds reads
in downstream dequantization and dot product loops.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5e39f335-e7c2-4f83-843a-f0f62f005176
📒 Files selected for processing (23)
Makefile.bootstrapinclude/peregrine/model.hmeson.buildsrc/model/model_file.csrc/runtime/llama.csrc/tensor/kernels/fp16/arm/fp16_convert_neon.Ssrc/tensor/kernels/fp16/fp16.hsrc/tensor/kernels/fp16/fp16_init.csrc/tensor/kernels/fp16/fp16_ref.csrc/tensor/kernels/gemm/gemm_fp16.csrc/tensor/kernels/gemm/gemm_fp16.hsrc/tensor/kernels/qgemv/arm/q8_0_dot_neon.Ssrc/tensor/kernels/qgemv/arm/q8_0_gemv_neon.Ssrc/tensor/kernels/qgemv/qgemv.hsrc/tensor/kernels/qgemv/qgemv_driver.csrc/tensor/kernels/qgemv/qgemv_init.csrc/tensor/kernels/qgemv/qgemv_neon.csrc/tensor/kernels/qgemv/qgemv_ref.ctests/checkasm/checkasm.ctests/checkasm/checkasm.htests/checkasm/checkasm_fp16.ctests/checkasm/checkasm_qgemv.ctests/meson.build
| size_t pg_q8_0_row_bytes(size_t K) | ||
| { | ||
| if (K == 0 || K % PG_Q8_0_BLOCK != 0) | ||
| return 0; | ||
| return (K / PG_Q8_0_BLOCK) * PG_Q8_0_BLOCK_BYTES; | ||
| } | ||
|
|
||
| size_t pg_q4_k_row_bytes(size_t K) | ||
| { | ||
| if (K == 0 || K % PG_Q4_K_BLOCK != 0) | ||
| return 0; | ||
| return (K / PG_Q4_K_BLOCK) * PG_Q4_K_BLOCK_BYTES; |
There was a problem hiding this comment.
Add overflow guards in row-byte helpers before multiplication.
pg_q8_0_row_bytes / pg_q4_k_row_bytes can wrap on very large K because (K / block) * block_bytes is unchecked. Since runtime size validation consumes these helpers for stride/expected-byte checks, a wrapped stride can validate incorrectly and lead to out-of-bounds reads later in dequant/dot loops.
💡 Suggested fix
size_t pg_q8_0_row_bytes(size_t K)
{
if (K == 0 || K % PG_Q8_0_BLOCK != 0)
return 0;
- return (K / PG_Q8_0_BLOCK) * PG_Q8_0_BLOCK_BYTES;
+ size_t nb = K / PG_Q8_0_BLOCK;
+ if (nb > SIZE_MAX / PG_Q8_0_BLOCK_BYTES)
+ return 0;
+ return nb * PG_Q8_0_BLOCK_BYTES;
}
size_t pg_q4_k_row_bytes(size_t K)
{
if (K == 0 || K % PG_Q4_K_BLOCK != 0)
return 0;
- return (K / PG_Q4_K_BLOCK) * PG_Q4_K_BLOCK_BYTES;
+ size_t nb = K / PG_Q4_K_BLOCK;
+ if (nb > SIZE_MAX / PG_Q4_K_BLOCK_BYTES)
+ return 0;
+ return nb * PG_Q4_K_BLOCK_BYTES;
}📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| size_t pg_q8_0_row_bytes(size_t K) | |
| { | |
| if (K == 0 || K % PG_Q8_0_BLOCK != 0) | |
| return 0; | |
| return (K / PG_Q8_0_BLOCK) * PG_Q8_0_BLOCK_BYTES; | |
| } | |
| size_t pg_q4_k_row_bytes(size_t K) | |
| { | |
| if (K == 0 || K % PG_Q4_K_BLOCK != 0) | |
| return 0; | |
| return (K / PG_Q4_K_BLOCK) * PG_Q4_K_BLOCK_BYTES; | |
| size_t pg_q8_0_row_bytes(size_t K) | |
| { | |
| if (K == 0 || K % PG_Q8_0_BLOCK != 0) | |
| return 0; | |
| size_t nb = K / PG_Q8_0_BLOCK; | |
| if (nb > SIZE_MAX / PG_Q8_0_BLOCK_BYTES) | |
| return 0; | |
| return nb * PG_Q8_0_BLOCK_BYTES; | |
| } | |
| size_t pg_q4_k_row_bytes(size_t K) | |
| { | |
| if (K == 0 || K % PG_Q4_K_BLOCK != 0) | |
| return 0; | |
| size_t nb = K / PG_Q4_K_BLOCK; | |
| if (nb > SIZE_MAX / PG_Q4_K_BLOCK_BYTES) | |
| return 0; | |
| return nb * PG_Q4_K_BLOCK_BYTES; | |
| } |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/tensor/kernels/qgemv/qgemv_ref.c` around lines 16 - 27, The functions
pg_q8_0_row_bytes and pg_q4_k_row_bytes perform unchecked multiplication
operations that can overflow when K is very large, causing the result to wrap
and produce incorrect stride values used in later validation checks. Add
overflow guards to both functions before performing the multiplication (K /
PG_Q8_0_BLOCK) * PG_Q8_0_BLOCK_BYTES and (K / PG_Q4_K_BLOCK) *
PG_Q4_K_BLOCK_BYTES respectively. Check if the multiplication would overflow and
return 0 (or an appropriate error value) if it would, ensuring that invalid
strides are caught early and prevent out-of-bounds reads in downstream
dequantization and dot product loops.
What & why
Adds the first quantized decode path for M4:
peregrine runcan consume f32, Q8_0, and compatible Q4_K weights.Q4_K note: the local
stories42MQ4_K_M quantization includes Q5_0/Q6_K fallback tensors because the FFN-down rows are 1376-wide, so that mixed model is intentionally rejected until those tensor formats are implemented. Pure/compatible Q4_K tensors are covered by loader + qgemv checkasm.Benchmarks
Machine: Apple M2 Pro, 12 threads. Model: locally quantized
/Users/zach/dev/worldflowai/models/stories42M-q8_0.gguf.Kernel benches:
Decode comparison, fixed-token tg128:
Profile still shows matvec-dominated time: QKV, FFN gate/up/down, logits, and attention output. A trial wiring the 2-row Q8_0 kernel into fused QKV/gate/down loops was slower and was not kept as default; the 2-row path remains only where it measured as a win for full qgemv shapes.
Verification
All checkasm variants pass locally: 39 passed, 0 failed.
Sign-off
Signed-off-by:(DCO)Summary by CodeRabbit
New Features
Performance
Tests