feat: add q8 quantized decode path #15

Open
zbennett10 wants to merge 1 commit into main from m4/quantized-decode

@zbennett10 zbennett10 commented Jun 21, 2026

What & why

Adds the first quantized decode path for M4:

  • GGUF Q8_0 and Q4_K tensor directory support with block row sizing.
  • Q8_0/Q4_K quantized GEMV C references and double-oracle checkasm coverage.
  • AArch64 FEAT_FP16/NEON Q8_0 fused dequant-dot assembly and a 2-row full-GEMV microkernel.
  • Typed Llama matrices so peregrine run can consume f32, Q8_0, and compatible Q4_K weights.
  • f16 KV cache storage with checked fp16 conversion kernels and AArch64 conversion assembly.

Q4_K note: the local stories42M Q4_K_M quantization includes Q5_0/Q6_K fallback tensors because the FFN-down rows are 1376 wide, which is not a multiple of the 256-element Q4_K super-block. That mixed model is intentionally rejected until those tensor formats are implemented. Pure/compatible Q4_K tensors are covered by the loader and the qgemv checkasm tests.
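For readers unfamiliar with the format, here is a minimal scalar sketch of a Q8_0 fused dequant-dot. It assumes the usual GGUF Q8_0 layout (blocks of 32 int8 values preceded by one fp16 scale, 34 bytes per block). The names `q8_0_block`, `half_to_float`, and `q8_0_dot_ref` are illustrative and not taken from this PR; the actual reference lives in `qgemv_ref.c`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { Q8_0_BLOCK = 32 };   /* elements per block (GGUF convention, assumed) */

/* Hypothetical block layout: fp16 scale followed by 32 signed bytes. */
typedef struct {
    uint16_t d;                 /* scale, IEEE binary16 bits */
    int8_t   qs[Q8_0_BLOCK];    /* quantized values */
} q8_0_block;

/* Convert IEEE binary16 bits to float (handles subnormals, Inf, NaN). */
float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1fu;
    uint32_t mant = h & 0x3ffu;
    uint32_t bits;
    float f;

    if (exp == 0) {             /* zero or subnormal: mant * 2^-24 */
        f = (float)mant * 0x1p-24f;
        return sign ? -f : f;
    }
    if (exp == 0x1f)            /* Inf or NaN */
        bits = sign | 0x7f800000u | (mant << 13);
    else                        /* normal: rebias exponent from 15 to 127 */
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Dot product of one quantized row (K elements) with an f32 vector. */
float q8_0_dot_ref(const q8_0_block *row, const float *x, size_t K)
{
    assert(K % Q8_0_BLOCK == 0);
    float acc = 0.0f;
    for (size_t b = 0; b < K / Q8_0_BLOCK; b++) {
        float partial = 0.0f;
        for (size_t i = 0; i < Q8_0_BLOCK; i++)
            partial += (float)row[b].qs[i] * x[b * Q8_0_BLOCK + i];
        acc += half_to_float(row[b].d) * partial;  /* scale applied once per block */
    }
    return acc;
}
```

Applying the scale once per block rather than once per element is what makes the dequantization "fused" into the dot product; a vectorized kernel can follow the same structure.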

Benchmarks

Machine: Apple M2 Pro, 12 threads. Model: locally quantized /Users/zach/dev/worldflowai/models/stories42M-q8_0.gguf.

Kernel benches:

q8_0_gemv 32000x512: neon 140.25 GFLOP/s vs C 17.21 GFLOP/s (8.15x)
q8_0_gemv 4096x4096: neon 133.72 GFLOP/s vs C 12.98 GFLOP/s (10.32x)
fp16_to_f32 4096: neon 120.86 GB/s vs C 3.92 GB/s (30.84x)
f32_to_fp16 4096: neon 148.69 GB/s vs C 2.60 GB/s (57.29x)
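
To relate the GFLOP/s figures to wall-clock time, a short helper is enough. This sketch assumes the common convention of 2*M*K floating-point operations per GEMV (one multiply and one add per weight). The PR does not state which convention the checkasm benchmark uses, and `gemv_gflops` / `gemv_seconds` are made-up names.

```c
#include <assert.h>
#include <stddef.h>

/* GFLOP/s for an M x K GEMV that took `seconds`,
 * counting 2 flops per weight (assumed convention). */
double gemv_gflops(size_t M, size_t K, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * (double)M * (double)K / seconds / 1e9;
}

/* Inverse: seconds per GEMV implied by a reported GFLOP/s figure. */
double gemv_seconds(size_t M, size_t K, double gflops)
{
    assert(gflops > 0.0);
    return 2.0 * (double)M * (double)K / (gflops * 1e9);
}
```

Under that assumption, the 32000x512 NEON result of 140.25 GFLOP/s corresponds to roughly 0.23 ms per GEMV.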

Decode comparison, fixed-token tg128:

Peregrine Q8_0: 554.98 tok/s
llama.cpp CPU Q8_0: 969.25 tok/s ± 202.59
Remaining gap: ~1.75x

Profile still shows matvec-dominated time: QKV, FFN gate/up/down, logits, and attention output. A trial wiring the 2-row Q8_0 kernel into fused QKV/gate/down loops was slower and was not kept as default; the 2-row path remains only where it measured as a win for full qgemv shapes.
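
The idea behind a 2-row microkernel is that each element of the input vector is loaded once and used for two rows. The following plain-C illustration uses f32 weights to keep it short; it is not the Q8_0 NEON kernel from this PR, and `gemv_2row_f32` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* y = W x for a row-major M x K matrix, two rows per pass
 * with a single-row tail when M is odd. */
void gemv_2row_f32(const float *W, const float *x, float *y, size_t M, size_t K)
{
    assert(W != NULL && x != NULL && y != NULL);
    size_t r = 0;
    for (; r + 2 <= M; r += 2) {
        const float *w0 = W + r * K;
        const float *w1 = w0 + K;
        float acc0 = 0.0f, acc1 = 0.0f;   /* one accumulator per row */
        for (size_t k = 0; k < K; k++) {
            float xv = x[k];              /* loaded once, used twice */
            acc0 += w0[k] * xv;
            acc1 += w1[k] * xv;
        }
        y[r] = acc0;
        y[r + 1] = acc1;
    }
    if (r < M) {                          /* odd row count: single-row tail */
        const float *w = W + r * K;
        float acc = 0.0f;
        for (size_t k = 0; k < K; k++)
            acc += w[k] * x[k];
        y[r] = acc;
    }
}
```

The trade-off is register pressure: two rows need twice the accumulators, which may explain why the 2-row path only paid off for full qgemv shapes and not inside the fused QKV/gate/down loops.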

Verification

git diff --check
make -B -f Makefile.bootstrap checkasm && ./checkasm
meson setup build-m4-quant && meson test -C build-m4-quant --print-errorlogs        # Rosetta/x86 path
/usr/bin/python3 -m mesonbuild.mesonmain setup build-m4-quant-arm64
/usr/bin/python3 -m mesonbuild.mesonmain test -C build-m4-quant-arm64 --print-errorlogs
./peregrine run -m /Users/zach/dev/worldflowai/models/stories42M-q8_0.gguf -p "Once upon a time" -n 16 --temp 0

All checkasm variants pass locally: 39 passed, 0 failed.

Sign-off

  • Commits are Signed-off-by: (DCO)

Summary by CodeRabbit

  • New Features

    • Added support for Q8_0 and Q4_K quantized tensor formats, enabling more efficient model loading and inference
    • Added FP16 (half-precision) tensor support with bidirectional conversion capabilities
  • Performance

    • Implemented optimized ARM NEON kernels for quantized matrix-vector operations and FP16 conversions
  • Tests

    • Added comprehensive test suite for quantized operations and FP16 conversions

Signed-off-by: Zach Bennett <zach@worldflowai.com>

coderabbitai Bot commented Jun 21, 2026

Walkthrough

Adds Q8_0 and Q4_K quantized GEMV support (C reference and AArch64 NEON kernels) and FP16↔F32 array conversion kernels with CPU-dispatch. The Llama runtime is refactored to use typed PgMat (replacing PgMatF32), an fp16 KV cache, and quantized embedding lookup. GGUF model parsing gains Q8_0/Q4_K type handling, and checkasm fuzz+benchmark tests cover both new kernel families.
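
As a rough picture of what a typed matrix with "type+stride" can look like, the sketch below carries an element type and a per-row byte stride so callers can address rows without knowing the encoding. All names (`Mat`, `MatType`, `mat_row_bytes`, `mat_init`, `mat_row`) are hypothetical and are not the PR's `PgMat` API. The block sizes are the GGUF conventions (Q8_0: 32 elements in 34 bytes, Q4_K: 256 elements in 144 bytes), assumed here rather than read from `qgemv.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum { MAT_F32, MAT_Q8_0, MAT_Q4_K } MatType;   /* illustrative, not PgTensorType */

typedef struct {
    MatType     type;
    const void *data;
    size_t      rows, cols;
    size_t      row_stride;   /* bytes between consecutive rows */
} Mat;

/* Bytes per row for `cols` elements, or 0 if the width is invalid for the type. */
size_t mat_row_bytes(MatType type, size_t cols)
{
    size_t block, block_bytes;
    switch (type) {
    case MAT_F32:  block = 1;   block_bytes = sizeof(float); break;
    case MAT_Q8_0: block = 32;  block_bytes = 34;  break;   /* GGUF Q8_0, assumed */
    case MAT_Q4_K: block = 256; block_bytes = 144; break;   /* GGUF Q4_K, assumed */
    default:       return 0;
    }
    if (cols == 0 || cols % block != 0)
        return 0;
    size_t nb = cols / block;
    if (nb > SIZE_MAX / block_bytes)      /* reject byte counts that would wrap */
        return 0;
    return nb * block_bytes;
}

/* Fill in a Mat; returns 0 on success, -1 if the shape does not fit the type. */
int mat_init(Mat *m, MatType type, const void *data, size_t rows, size_t cols)
{
    assert(m != NULL);
    size_t stride = mat_row_bytes(type, cols);
    if (stride == 0 || rows == 0)
        return -1;
    m->type = type;
    m->data = data;
    m->rows = rows;
    m->cols = cols;
    m->row_stride = stride;
    return 0;
}

/* Pointer to the first byte of row r, independent of element type. */
const void *mat_row(const Mat *m, size_t r)
{
    assert(r < m->rows);
    return (const unsigned char *)m->data + r * m->row_stride;
}
```

With these assumed block sizes, `mat_init` refuses a 1376-wide Q4_K matrix, which is consistent with the Q4_K note in the PR description.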

Changes

Quantized GEMV and FP16 Kernel Integration

| Layer | File(s) | Summary |
|---|---|---|
| Public types, constants, and kernel headers | include/peregrine/model.h, src/tensor/kernels/qgemv/qgemv.h, src/tensor/kernels/fp16/fp16.h, src/tensor/kernels/gemm/gemm_fp16.h | PgTensorType gains Q8_0/Q4_K enumerators; qgemv.h declares block constants, PgQgemvDSP/PgQgemvVariant structs, and all GEMV/dot/dequant prototypes; fp16.h declares PgFp16DSP/PgFp16ConvertVariant and array conversion prototypes; gemm_fp16.h adds pg_f32_to_fp16 declaration. |
| FP16↔F32 scalar and array conversion kernels | src/tensor/kernels/gemm/gemm_fp16.c, src/tensor/kernels/fp16/fp16_ref.c, src/tensor/kernels/fp16/arm/fp16_convert_neon.S, src/tensor/kernels/fp16/fp16_init.c | pg_f32_to_fp16 implements IEEE bit-level scalar conversion; portable C array loops cover both directions; AArch64 NEON bulk+tail routines (fp16_to_f32_array_neon, f32_to_fp16_array_neon) are added; pg_fp16_dsp_init selects NEON or C implementations via cpu_flags. |
| QGEMV C reference implementations and parallel driver | src/tensor/kernels/qgemv/qgemv_ref.c, src/tensor/kernels/qgemv/qgemv_driver.c | Implements per-row byte sizing, Q8_0/Q4_K dequantization (including Q4_K nibble/scale unpacking), and C reference dot/GEMV for both formats; pg_qgemv_driver parallelizes per-row dot calls via pg_parallel_for above a configurable threshold. |
| Q8_0 NEON assembly kernels and dispatch init | src/tensor/kernels/qgemv/arm/q8_0_dot_neon.S, src/tensor/kernels/qgemv/arm/q8_0_gemv_neon.S, src/tensor/kernels/qgemv/qgemv_neon.c, src/tensor/kernels/qgemv/qgemv_init.c | q8_0_dot_f32_neon accumulates via NEON FMLA with horizontal reduce; q8_0_gemv_2x_neon processes two rows simultaneously with dual accumulator sets; pg_q8_0_gemv_neon dispatches 2-row blocks with a scalar tail via pg_parallel_for; pg_qgemv_dsp_init switches to NEON paths when `PG_CPU_NEON |
| GGUF model file Q8_0/Q4_K support | src/model/model_file.c | Adds block-size/byte constants; extends tensor_nbytes with 2D-block layout validation and byte-count computation; maps raw GGUF type IDs 8 and 12 to new enum values; adds string names to pg_tensor_type_name. |
| Llama runtime: typed matrices, fp16 KV cache, quantized eval paths | src/runtime/llama.c | Replaces PgMatF32 with PgMat (type+stride) throughout layers/embeddings/output; changes KV cache to pg_fp16* with fp32 scratch buffers for attention score/value mixing; adds llama_load_token_embedding for quantized embedding dequantization; refactors GEMV pair/triple/add jobs and logits to use typed-mat helpers; initializes pg_qgemv_dsp_init/pg_fp16_dsp_init on model load. |
| Build wiring and checkasm test harnesses | meson.build, Makefile.bootstrap, tests/meson.build, tests/checkasm/checkasm.c, tests/checkasm/checkasm.h, tests/checkasm/checkasm_fp16.c, tests/checkasm/checkasm_qgemv.c | Adds new C and ARM .S sources to both build systems; registers checkasm_check_qgemv/checkasm_check_fp16 in the test registry; implements fuzz validation (oracle with tolerance) and adaptive-timed benchmarks (GB/s and GFLOP/s) for both kernel families. |
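
One plausible shape for the "oracle with tolerance" check mentioned in the last row is a double-precision reference compared against the kernel's float result, with the tolerance scaled by the magnitude of the summed terms. This is an assumption about the approach, not a description of `checkasm_qgemv.c`; `dot_oracle` and `dot_matches_oracle` are invented names.

```c
#include <assert.h>
#include <stddef.h>

/* Reference dot product accumulated in double precision (the "oracle").
 * Also reports sum |a_i * b_i| so the caller can scale its tolerance. */
double dot_oracle(const float *a, const float *b, size_t n, double *abs_sum)
{
    assert(a != NULL && b != NULL);
    double acc = 0.0, mag = 0.0;
    for (size_t i = 0; i < n; i++) {
        double p = (double)a[i] * (double)b[i];
        acc += p;
        mag += p < 0.0 ? -p : p;
    }
    if (abs_sum != NULL)
        *abs_sum = mag;
    return acc;
}

/* Returns 1 if `got` is within rel_tol * sum|a_i*b_i| of the oracle
 * (plus a tiny absolute floor for all-zero inputs), else 0. */
int dot_matches_oracle(float got, const float *a, const float *b, size_t n, double rel_tol)
{
    double mag;
    double want = dot_oracle(a, b, n, &mag);
    double err = (double)got - want;
    if (err < 0.0)
        err = -err;
    return err <= rel_tol * mag + 1e-30;
}
```

Scaling by the sum of absolute products rather than by the result itself keeps the check meaningful when positive and negative terms cancel and the true dot product is close to zero.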

Sequence Diagram(s)

sequenceDiagram
  participant pg_llama_eval_token
  participant llama_load_token_embedding
  participant eval_layer
  participant PgFp16DSP
  participant PgQgemvDSP
  participant KVCache as KV Cache (pg_fp16)

  pg_llama_eval_token->>llama_load_token_embedding: token id
  note over llama_load_token_embedding: dequant Q8_0/Q4_K row or memcpy f32
  llama_load_token_embedding-->>pg_llama_eval_token: f32 hidden state

  pg_llama_eval_token->>eval_layer: f32 hidden state
  eval_layer->>PgFp16DSP: from_f32(k_vec) → KV Cache (pg_fp16)
  eval_layer->>PgFp16DSP: from_f32(v_vec) → KV Cache (pg_fp16)
  eval_layer->>KVCache: read key/value slots
  eval_layer->>PgFp16DSP: to_f32(key_slot) → key_head scratch
  eval_layer->>PgFp16DSP: to_f32(value_slot) → value_head scratch
  eval_layer->>PgQgemvDSP: q8_0_gemv / q4_k_gemv (typed PgMat)
  PgQgemvDSP-->>eval_layer: output activations
  eval_layer-->>pg_llama_eval_token: updated hidden state
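
The KV-cache steps in the diagram (convert to fp16 on write, widen to f32 scratch on read) can be sketched in portable C as follows. This is a bit-level illustration under stated assumptions, not the PR's `pg_f32_to_fp16` or its NEON routines: `half_bits`, `f32_to_half`, `half_to_f32`, `kv_store`, and `kv_load` are hypothetical names, and the rounding mode shown (round-to-nearest-even, overflow to infinity) is the IEEE default rather than something the PR states.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint16_t half_bits;   /* stand-in for the PR's pg_fp16 storage type */

/* float -> binary16, round-to-nearest-even; overflow goes to Inf. */
half_bits f32_to_half(float f)
{
    uint32_t x;
    memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000u;
    uint32_t a = x & 0x7fffffffu;

    if (a > 0x7f800000u)  return (half_bits)(sign | 0x7e00u);  /* NaN */
    if (a >= 0x477ff000u) return (half_bits)(sign | 0x7c00u);  /* >= 65520, or Inf */
    if (a <= 0x33000000u) return (half_bits)sign;              /* <= 2^-25 rounds to zero */

    uint32_t q, rem, half;
    if (a < 0x38800000u) {                  /* result is a binary16 subnormal */
        uint32_t shift = 126u - (a >> 23);  /* 14..24 */
        uint32_t mant = (a & 0x7fffffu) | 0x800000u;
        q = mant >> shift;
        rem = mant & ((1u << shift) - 1u);
        half = 1u << (shift - 1u);
    } else {                                /* normal: rebias exponent, drop 13 bits */
        uint32_t v = a - 0x38000000u;
        q = v >> 13;
        rem = v & 0x1fffu;
        half = 0x1000u;
    }
    if (rem > half || (rem == half && (q & 1u)))
        q++;                                /* a carry into the exponent is the correct result */
    return (half_bits)(sign | q);
}

/* binary16 -> float; exact for every input. */
float half_to_f32(half_bits h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp = (h >> 10) & 0x1fu;
    uint32_t mant = h & 0x3ffu;
    uint32_t bits;
    float f;
    if (exp == 0) {                         /* zero or subnormal: mant * 2^-24 */
        f = (float)mant * 0x1p-24f;
        return sign ? -f : f;
    }
    if (exp == 0x1f)
        bits = sign | 0x7f800000u | (mant << 13);
    else
        bits = sign | ((exp + 112u) << 23) | (mant << 13);
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Write one f32 key/value vector into an fp16 cache slot. */
void kv_store(half_bits *slot, const float *vec, size_t n)
{
    assert(slot != NULL && vec != NULL);
    for (size_t i = 0; i < n; i++)
        slot[i] = f32_to_half(vec[i]);
}

/* Widen one fp16 cache slot into an f32 scratch buffer for attention. */
void kv_load(float *scratch, const half_bits *slot, size_t n)
{
    assert(scratch != NULL && slot != NULL);
    for (size_t i = 0; i < n; i++)
        scratch[i] = half_to_f32(slot[i]);
}
```

Storing the cache in fp16 halves its memory footprint compared with f32, at the cost of one conversion per element on each write and read.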

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

  • WorldFlowAI/peregrine#12: Adds PG_CPU_FP16 to the CPU flags infrastructure that this PR's pg_fp16_dsp_init and pg_qgemv_dsp_init dispatch logic depends on.

Poem

🐰 A rabbit in half-precision hops,
With Q8 and Q4 blocks that never stop.
NEON vectors widen and narrow with glee,
fp16 caches hold keys with such care!
The Llama runs quantized — fast and free. ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ⚠️ Warning | The title 'feat: add q8 quantized decode path' incompletely describes the changeset, which includes support for both Q8_0 and Q4_K quantization formats, plus significant fp16 conversion infrastructure and AArch64 optimizations, not just Q8. | Revise the title to reflect the full scope, such as 'feat: add quantized decode path with Q8_0/Q4_K and fp16 support' or similar to capture the multi-component nature of the changes. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (3 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/tensor/kernels/qgemv/qgemv_ref.c`:
- Around line 16-27: The functions pg_q8_0_row_bytes and pg_q4_k_row_bytes
perform unchecked multiplication operations that can overflow when K is very
large, causing the result to wrap and produce incorrect stride values used in
later validation checks. Add overflow guards to both functions before performing
the multiplication (K / PG_Q8_0_BLOCK) * PG_Q8_0_BLOCK_BYTES and (K /
PG_Q4_K_BLOCK) * PG_Q4_K_BLOCK_BYTES respectively. Check if the multiplication
would overflow and return 0 (or an appropriate error value) if it would,
ensuring that invalid strides are caught early and prevent out-of-bounds reads
in downstream dequantization and dot product loops.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5e39f335-e7c2-4f83-843a-f0f62f005176

📥 Commits

Reviewing files that changed from the base of the PR and between 3a98d72 and 4ddf682.

📒 Files selected for processing (23)
  • Makefile.bootstrap
  • include/peregrine/model.h
  • meson.build
  • src/model/model_file.c
  • src/runtime/llama.c
  • src/tensor/kernels/fp16/arm/fp16_convert_neon.S
  • src/tensor/kernels/fp16/fp16.h
  • src/tensor/kernels/fp16/fp16_init.c
  • src/tensor/kernels/fp16/fp16_ref.c
  • src/tensor/kernels/gemm/gemm_fp16.c
  • src/tensor/kernels/gemm/gemm_fp16.h
  • src/tensor/kernels/qgemv/arm/q8_0_dot_neon.S
  • src/tensor/kernels/qgemv/arm/q8_0_gemv_neon.S
  • src/tensor/kernels/qgemv/qgemv.h
  • src/tensor/kernels/qgemv/qgemv_driver.c
  • src/tensor/kernels/qgemv/qgemv_init.c
  • src/tensor/kernels/qgemv/qgemv_neon.c
  • src/tensor/kernels/qgemv/qgemv_ref.c
  • tests/checkasm/checkasm.c
  • tests/checkasm/checkasm.h
  • tests/checkasm/checkasm_fp16.c
  • tests/checkasm/checkasm_qgemv.c
  • tests/meson.build

Comment on lines +16 to +27
size_t pg_q8_0_row_bytes(size_t K)
{
    if (K == 0 || K % PG_Q8_0_BLOCK != 0)
        return 0;
    return (K / PG_Q8_0_BLOCK) * PG_Q8_0_BLOCK_BYTES;
}

size_t pg_q4_k_row_bytes(size_t K)
{
    if (K == 0 || K % PG_Q4_K_BLOCK != 0)
        return 0;
    return (K / PG_Q4_K_BLOCK) * PG_Q4_K_BLOCK_BYTES;

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add overflow guards in row-byte helpers before multiplication.

pg_q8_0_row_bytes / pg_q4_k_row_bytes can wrap on very large K because (K / block) * block_bytes is unchecked. Since runtime size validation consumes these helpers for stride/expected-byte checks, a wrapped stride can validate incorrectly and lead to out-of-bounds reads later in dequant/dot loops.

💡 Suggested fix
 size_t pg_q8_0_row_bytes(size_t K)
 {
     if (K == 0 || K % PG_Q8_0_BLOCK != 0)
         return 0;
-    return (K / PG_Q8_0_BLOCK) * PG_Q8_0_BLOCK_BYTES;
+    size_t nb = K / PG_Q8_0_BLOCK;
+    if (nb > SIZE_MAX / PG_Q8_0_BLOCK_BYTES)
+        return 0;
+    return nb * PG_Q8_0_BLOCK_BYTES;
 }
 
 size_t pg_q4_k_row_bytes(size_t K)
 {
     if (K == 0 || K % PG_Q4_K_BLOCK != 0)
         return 0;
-    return (K / PG_Q4_K_BLOCK) * PG_Q4_K_BLOCK_BYTES;
+    size_t nb = K / PG_Q4_K_BLOCK;
+    if (nb > SIZE_MAX / PG_Q4_K_BLOCK_BYTES)
+        return 0;
+    return nb * PG_Q4_K_BLOCK_BYTES;
 }
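
To see the wrap concretely, the guard can be exercised with a K that fits in size_t but whose byte count does not. The sketch below is standalone and uses assumed Q8_0 values (32 elements, 34 bytes per block) instead of the constants from `qgemv.h`; `row_bytes_unchecked`, `row_bytes_checked`, and `first_overflowing_k` are hypothetical names. Note that the suggested fix relies on `SIZE_MAX`, which is defined in `<stdint.h>`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 32, BLOCK_BYTES = 34 };   /* assumed Q8_0 values, not read from qgemv.h */

/* Original shape of the helper: the multiplication can wrap. */
size_t row_bytes_unchecked(size_t K)
{
    assert(K % BLOCK == 0);
    return (K / BLOCK) * BLOCK_BYTES;
}

/* Guarded shape: returns 0 for invalid widths and for byte counts
 * that do not fit in size_t. */
size_t row_bytes_checked(size_t K)
{
    if (K == 0 || K % BLOCK != 0)
        return 0;
    size_t nb = K / BLOCK;
    if (nb > SIZE_MAX / BLOCK_BYTES)
        return 0;
    return nb * BLOCK_BYTES;
}

/* Smallest block-aligned K whose byte count no longer fits in size_t. */
size_t first_overflowing_k(void)
{
    return (SIZE_MAX / BLOCK_BYTES + 1) * BLOCK;
}
```

For `K = first_overflowing_k()`, the unchecked helper returns a small wrapped value that a later stride or expected-byte comparison could accept, while the checked helper returns 0 and the tensor is rejected.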
