feat: add q8 quantized decode path by zbennett10 · Pull Request #15 · WorldFlowAI/peregrine

zbennett10 · 2026-06-21T00:18:17Z

What & why

Adds the first quantized decode path for M4:

GGUF Q8_0 and Q4_K tensor directory support with block row sizing.
Q8_0/Q4_K quantized GEMV C references and double-oracle checkasm coverage.
AArch64 FEAT_FP16/NEON Q8_0 fused dequant-dot assembly and a 2-row full-GEMV microkernel.
Typed Llama matrices so peregrine run can consume f32, Q8_0, and compatible Q4_K weights.
f16 KV cache storage with checked fp16 conversion kernels and AArch64 conversion assembly.

Q4_K note: the local stories42M Q4_K_M quantization includes Q5_0/Q6_K fallback tensors because the FFN-down rows are 1376 wide, which is not a multiple of the 256-element Q4_K block. That mixed model is intentionally rejected until those tensor formats are implemented. Pure/compatible Q4_K tensors are covered by the loader and qgemv checkasm tests.
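For readers unfamiliar with the format, here is a minimal C sketch of a Q8_0 reference dot product. It assumes the standard GGUF Q8_0 layout (32 weights per block, one little-endian fp16 scale followed by 32 signed bytes, 34 bytes per block). The names `half_to_float` and `q8_0_dot_ref` are hypothetical and are not the functions in `qgemv_ref.c`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed Q8_0 layout: fp16 scale + 32 int8 quants = 34 bytes per block. */
enum { Q8_0_BLOCK = 32, Q8_0_BLOCK_BYTES = 2 + Q8_0_BLOCK };

/* Scalar fp16 -> f32 conversion (zero, subnormal, normal, inf/NaN). */
float half_to_float(uint16_t h)
{
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;

    if (exp == 0) {
        if (man == 0) {
            bits = sign;                               /* signed zero */
        } else {
            int shift = 0;                             /* normalize subnormal */
            while ((man & 0x400u) == 0) { man <<= 1; shift++; }
            man &= 0x3FFu;
            bits = sign | ((uint32_t)(113 - shift) << 23) | (man << 13);
        }
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (man << 13);       /* inf or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Dot product of one Q8_0 row with an f32 vector of length K. */
float q8_0_dot_ref(const uint8_t *row, const float *x, size_t K)
{
    assert(K % Q8_0_BLOCK == 0);
    float acc = 0.0f;
    for (size_t b = 0; b < K / Q8_0_BLOCK; b++) {
        const uint8_t *blk = row + b * Q8_0_BLOCK_BYTES;
        uint16_t scale = (uint16_t)(blk[0] | (blk[1] << 8));  /* little-endian */
        const int8_t *q = (const int8_t *)(blk + 2);
        float sum = 0.0f;
        for (size_t i = 0; i < Q8_0_BLOCK; i++)
            sum += (float)q[i] * x[b * Q8_0_BLOCK + i];
        acc += half_to_float(scale) * sum;             /* scale applied per block */
    }
    return acc;
}
```

Applying the scale once per block, rather than per element, is the "fused dequant-dot" idea in scalar form. The NEON kernel in this PR may order its operations differently.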

Benchmarks

Machine: Apple M2 Pro, 12 threads. Model: locally quantized /Users/zach/dev/worldflowai/models/stories42M-q8_0.gguf.

Kernel benches:

q8_0_gemv 32000x512: neon 140.25 GFLOP/s vs C 17.21 GFLOP/s (8.15x)
q8_0_gemv 4096x4096: neon 133.72 GFLOP/s vs C 12.98 GFLOP/s (10.32x)
fp16_to_f32 4096: neon 120.86 GB/s vs C 3.92 GB/s (30.84x)
f32_to_fp16 4096: neon 148.69 GB/s vs C 2.60 GB/s (57.29x)
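As a reading aid for the figures above, the following C sketch shows one common way to turn a GEMV timing into GFLOP/s and to compute a speedup ratio. It assumes a GEMV over an M x K matrix is counted as 2*M*K floating-point operations. That convention is an assumption here; the PR does not state how checkasm counts operations, and both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: one multiply and one add per weight, so 2*M*K operations. */
double gemv_gflops(size_t M, size_t K, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * (double)M * (double)K / seconds / 1e9;
}

/* Ratio of two throughput figures measured in the same unit. */
double speedup(double fast, double slow)
{
    assert(slow > 0.0);
    return fast / slow;
}
```

Under that assumption, a 32000x512 GEMV is about 32.8 million operations per call.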

Decode comparison, fixed-token tg128:

Peregrine Q8_0: 554.98 tok/s
llama.cpp CPU Q8_0: 969.25 tok/s ± 202.59
Remaining gap: ~1.75x

Profile still shows matvec-dominated time: QKV, FFN gate/up/down, logits, and attention output. A trial wiring the 2-row Q8_0 kernel into fused QKV/gate/down loops was slower and was not kept as default; the 2-row path remains only where it measured as a win for full qgemv shapes.
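To illustrate what a 2-row microkernel buys in principle, here is a scalar C sketch in which each activation is loaded once and used for two weight rows. It is not the `q8_0_gemv_2x_neon` assembly. The `Block` struct stores an f32 scale for brevity (the real format uses fp16), and all names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BLK = 32 };

/* Simplified Q8_0-style block: f32 scale (assumption) plus 32 int8 quants. */
typedef struct {
    float  d;
    int8_t q[BLK];
} Block;

/* Compute two row dot products in one pass over x. */
void dot_2rows(const Block *r0, const Block *r1, const float *x,
               size_t K, float out[2])
{
    assert(K % BLK == 0);
    float acc0 = 0.0f, acc1 = 0.0f;
    for (size_t b = 0; b < K / BLK; b++) {
        float s0 = 0.0f, s1 = 0.0f;
        for (size_t i = 0; i < BLK; i++) {
            float xi = x[b * BLK + i];      /* loaded once, used by both rows */
            s0 += (float)r0[b].q[i] * xi;
            s1 += (float)r1[b].q[i] * xi;
        }
        acc0 += r0[b].d * s0;               /* separate accumulator per row */
        acc1 += r1[b].d * s1;
    }
    out[0] = acc0;
    out[1] = acc1;
}
```

Whether this pays off depends on register pressure and on how the surrounding loop is scheduled, which is consistent with the PR keeping the 2-row path only where it measured faster.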

Verification

git diff --check
make -B -f Makefile.bootstrap checkasm && ./checkasm
meson setup build-m4-quant && meson test -C build-m4-quant --print-errorlogs        # Rosetta/x86 path
/usr/bin/python3 -m mesonbuild.mesonmain setup build-m4-quant-arm64
/usr/bin/python3 -m mesonbuild.mesonmain test -C build-m4-quant-arm64 --print-errorlogs
./peregrine run -m /Users/zach/dev/worldflowai/models/stories42M-q8_0.gguf -p "Once upon a time" -n 16 --temp 0

All checkasm variants pass locally: 39 passed, 0 failed.

Sign-off

Commits are Signed-off-by: (DCO)

Summary by CodeRabbit

New Features
- Added support for Q8_0 and Q4_K quantized tensor formats, enabling more efficient model loading and inference
- Added FP16 (half-precision) tensor support with bidirectional conversion capabilities
Performance
- Implemented optimized ARM NEON kernels for quantized matrix-vector operations and FP16 conversions
Tests
- Added comprehensive test suite for quantized operations and FP16 conversions

Signed-off-by: Zach Bennett <zach@worldflowai.com>

coderabbitai · 2026-06-21T00:18:43Z

📝 Walkthrough

Adds Q8_0 and Q4_K quantized GEMV support (C reference and AArch64 NEON kernels) and FP16↔F32 array conversion kernels with CPU-dispatch. The Llama runtime is refactored to use typed PgMat (replacing PgMatF32), an fp16 KV cache, and quantized embedding lookup. GGUF model parsing gains Q8_0/Q4_K type handling, and checkasm fuzz+benchmark tests cover both new kernel families.

Changes

Quantized GEMV and FP16 Kernel Integration

Layer / File(s)	Summary
Public types, constants, and kernel headers `include/peregrine/model.h`, `src/tensor/kernels/qgemv/qgemv.h`, `src/tensor/kernels/fp16/fp16.h`, `src/tensor/kernels/gemm/gemm_fp16.h`	`PgTensorType` gains `Q8_0`/`Q4_K` enumerators; `qgemv.h` declares block constants, `PgQgemvDSP`/`PgQgemvVariant` structs, and all GEMV/dot/dequant prototypes; `fp16.h` declares `PgFp16DSP`/`PgFp16ConvertVariant` and array conversion prototypes; `gemm_fp16.h` adds `pg_f32_to_fp16` declaration.
FP16↔F32 scalar and array conversion kernels `src/tensor/kernels/gemm/gemm_fp16.c`, `src/tensor/kernels/fp16/fp16_ref.c`, `src/tensor/kernels/fp16/arm/fp16_convert_neon.S`, `src/tensor/kernels/fp16/fp16_init.c`	`pg_f32_to_fp16` implements IEEE bit-level scalar conversion; portable C array loops cover both directions; AArch64 NEON bulk+tail routines (`fp16_to_f32_array_neon`, `f32_to_fp16_array_neon`) are added; `pg_fp16_dsp_init` selects NEON or C implementations via `cpu_flags`.
QGEMV C reference implementations and parallel driver `src/tensor/kernels/qgemv/qgemv_ref.c`, `src/tensor/kernels/qgemv/qgemv_driver.c`	Implements per-row byte sizing, Q8_0/Q4_K dequantization (including Q4_K nibble/scale unpacking), and C reference dot/GEMV for both formats; `pg_qgemv_driver` parallelizes per-row dot calls via `pg_parallel_for` above a configurable threshold.
Q8_0 NEON assembly kernels and dispatch init `src/tensor/kernels/qgemv/arm/q8_0_dot_neon.S`, `src/tensor/kernels/qgemv/arm/q8_0_gemv_neon.S`, `src/tensor/kernels/qgemv/qgemv_neon.c`, `src/tensor/kernels/qgemv/qgemv_init.c`	`q8_0_dot_f32_neon` accumulates via NEON FMLA with horizontal reduce; `q8_0_gemv_2x_neon` processes two rows simultaneously with dual accumulator sets; `pg_q8_0_gemv_neon` dispatches 2-row blocks with a scalar tail via `pg_parallel_for`; `pg_qgemv_dsp_init` switches to NEON paths when `PG_CPU_NEON
GGUF model file Q8_0/Q4_K support `src/model/model_file.c`	Adds block-size/byte constants; extends `tensor_nbytes` with 2D-block layout validation and byte-count computation; maps raw GGUF type IDs 8 and 12 to new enum values; adds string names to `pg_tensor_type_name`.
Llama runtime: typed matrices, fp16 KV cache, quantized eval paths `src/runtime/llama.c`	Replaces `PgMatF32` with `PgMat` (type+stride) throughout layers/embeddings/output; changes KV cache to `pg_fp16*` with fp32 scratch buffers for attention score/value mixing; adds `llama_load_token_embedding` for quantized embedding dequantization; refactors GEMV pair/triple/add jobs and logits to use typed-mat helpers; initializes `pg_qgemv_dsp_init`/`pg_fp16_dsp_init` on model load.
Build wiring and checkasm test harnesses `meson.build`, `Makefile.bootstrap`, `tests/meson.build`, `tests/checkasm/checkasm.c`, `tests/checkasm/checkasm.h`, `tests/checkasm/checkasm_fp16.c`, `tests/checkasm/checkasm_qgemv.c`	Adds new C and ARM `.S` sources to both build systems; registers `checkasm_check_qgemv`/`checkasm_check_fp16` in the test registry; implements fuzz validation (oracle with tolerance) and adaptive-timed benchmarks (GB/s and GFLOP/s) for both kernel families.
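The block-layout validation described for `tensor_nbytes` can be sketched roughly as follows. This is a hypothetical illustration, not the code in `src/model/model_file.c`. The enum values are the raw GGUF type IDs mentioned above (plus 0 for f32), and the block sizes are the standard GGUF ones (Q8_0: 32 elements in 34 bytes; Q4_K: 256 elements in 144 bytes), assumed to match the `PG_*` constants.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical enum; the numeric values are raw GGUF type IDs. */
typedef enum { T_F32 = 0, T_Q8_0 = 8, T_Q4_K = 12 } TensorType;

/* Bytes for one row of `cols` elements, or 0 if the shape is invalid. */
size_t row_bytes(TensorType t, size_t cols)
{
    size_t block, block_bytes;
    switch (t) {
    case T_F32:  block = 1;   block_bytes = 4;   break;
    case T_Q8_0: block = 32;  block_bytes = 34;  break;  /* fp16 scale + 32 int8 */
    case T_Q4_K: block = 256; block_bytes = 144; break;  /* assumed super-block */
    default:     return 0;                               /* unsupported type */
    }
    if (cols == 0 || cols % block != 0)
        return 0;                                        /* partial block */
    size_t nb = cols / block;
    if (nb > SIZE_MAX / block_bytes)
        return 0;                                        /* would overflow */
    return nb * block_bytes;
}

/* Bytes for a rows x cols tensor, or 0 if invalid or overflowing. */
size_t tensor_bytes(TensorType t, size_t rows, size_t cols)
{
    size_t rb = row_bytes(t, cols);
    if (rb == 0 || rows == 0 || rows > SIZE_MAX / rb)
        return 0;
    return rows * rb;
}
```

With these assumed sizes, a 1376-wide row is rejected for Q4_K because 1376 is not a multiple of 256, which matches the fallback behaviour noted in the PR description.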

Sequence Diagram(s)

sequenceDiagram
  participant pg_llama_eval_token
  participant llama_load_token_embedding
  participant eval_layer
  participant PgFp16DSP
  participant PgQgemvDSP
  participant KVCache as KV Cache (pg_fp16)

  pg_llama_eval_token->>llama_load_token_embedding: token id
  note over llama_load_token_embedding: dequant Q8_0/Q4_K row or memcpy f32
  llama_load_token_embedding-->>pg_llama_eval_token: f32 hidden state

  pg_llama_eval_token->>eval_layer: f32 hidden state
  eval_layer->>PgFp16DSP: from_f32(k_vec) → KV Cache (pg_fp16)
  eval_layer->>PgFp16DSP: from_f32(v_vec) → KV Cache (pg_fp16)
  eval_layer->>KVCache: read key/value slots
  eval_layer->>PgFp16DSP: to_f32(key_slot) → key_head scratch
  eval_layer->>PgFp16DSP: to_f32(value_slot) → value_head scratch
  eval_layer->>PgQgemvDSP: q8_0_gemv / q4_k_gemv (typed PgMat)
  PgQgemvDSP-->>eval_layer: output activations
  eval_layer-->>pg_llama_eval_token: updated hidden state
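
The fp16 KV cache round trip in the diagram can be sketched in portable C as below: keys and values are narrowed to fp16 on store and widened back to f32 into a scratch buffer on load. The function names are hypothetical stand-ins for the `PgFp16DSP` entry points, and the conversion uses round-to-nearest-even, which is an assumption about the PR's kernels.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* f32 -> fp16 with round-to-nearest-even; overflow becomes infinity. */
uint16_t f32_to_half(float f)
{
    uint32_t x, h;
    memcpy(&x, &f, sizeof x);
    uint32_t sign = (x >> 16) & 0x8000u;
    uint32_t a = x & 0x7FFFFFFFu;

    if (a > 0x7F800000u)  return (uint16_t)(sign | 0x7E00u);  /* NaN */
    if (a >= 0x47800000u) return (uint16_t)(sign | 0x7C00u);  /* >= 65536 or inf */
    if (a >= 0x38800000u) {                                   /* normal fp16 range */
        h = (a - 0x38000000u) >> 13;
        uint32_t rem = a & 0x1FFFu;
        if (rem > 0x1000u || (rem == 0x1000u && (h & 1u))) h++;
    } else {                                                  /* subnormal or zero */
        uint32_t e = a >> 23;
        if (e < 102) return (uint16_t)sign;                   /* below 2^-25 */
        uint32_t man = (a & 0x7FFFFFu) | 0x800000u;
        uint32_t shift = 126 - e;
        h = man >> shift;
        uint32_t rem = man & ((1u << shift) - 1u);
        uint32_t half = 1u << (shift - 1);
        if (rem > half || (rem == half && (h & 1u))) h++;
    }
    return (uint16_t)(sign | h);
}

/* fp16 -> f32; always exact. */
float half_to_float(uint16_t hv)
{
    uint32_t sign = (uint32_t)(hv & 0x8000u) << 16;
    uint32_t exp = (hv >> 10) & 0x1Fu, man = hv & 0x3FFu, bits;
    if (exp == 0 && man == 0) {
        bits = sign;
    } else if (exp == 0) {
        int shift = 0;
        while ((man & 0x400u) == 0) { man <<= 1; shift++; }
        bits = sign | ((uint32_t)(113 - shift) << 23) | ((man & 0x3FFu) << 13);
    } else if (exp == 31) {
        bits = sign | 0x7F800000u | (man << 13);
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Store one key or value vector into its fp16 cache slot. */
void kv_store(uint16_t *slot, const float *v, size_t n)
{
    for (size_t i = 0; i < n; i++) slot[i] = f32_to_half(v[i]);
}

/* Load a cache slot into an f32 scratch buffer for attention math. */
void kv_load(float *scratch, const uint16_t *slot, size_t n)
{
    for (size_t i = 0; i < n; i++) scratch[i] = half_to_float(slot[i]);
}
```

Values that fp16 can represent exactly survive the round trip unchanged. Others are rounded to an 11-bit significand, which is the precision cost of halving the cache footprint.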

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

WorldFlowAI/peregrine#12: Adds PG_CPU_FP16 to the CPU flags infrastructure that this PR's pg_fp16_dsp_init and pg_qgemv_dsp_init dispatch logic depends on.

Poem

🐰 A rabbit in half-precision hops,
With Q8 and Q4 blocks that never stop.
NEON vectors widen and narrow with glee,
fp16 caches hold keys with such care!
The Llama runs quantized — fast and free. ✨

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Title check	⚠️ Warning	The title 'feat: add q8 quantized decode path' incompletely describes the changeset, which includes support for both Q8_0 and Q4_K quantization formats, plus significant fp16 conversion infrastructure and AArch64 optimizations, not just Q8.	Revise the title to reflect the full scope, such as 'feat: add quantized decode path with Q8_0/Q4_K and fp16 support' or similar to capture the multi-component nature of the changes.
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.


coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/tensor/kernels/qgemv/qgemv_ref.c`:
- Around line 16-27: The functions pg_q8_0_row_bytes and pg_q4_k_row_bytes
perform unchecked multiplication operations that can overflow when K is very
large, causing the result to wrap and produce incorrect stride values used in
later validation checks. Add overflow guards to both functions before performing
the multiplication (K / PG_Q8_0_BLOCK) * PG_Q8_0_BLOCK_BYTES and (K /
PG_Q4_K_BLOCK) * PG_Q4_K_BLOCK_BYTES respectively. Check if the multiplication
would overflow and return 0 (or an appropriate error value) if it would,
ensuring that invalid strides are caught early and prevent out-of-bounds reads
in downstream dequantization and dot product loops.


ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5e39f335-e7c2-4f83-843a-f0f62f005176

📥 Commits

Reviewing files that changed from the base of the PR and between 3a98d72 and 4ddf682.

📒 Files selected for processing (23)

Makefile.bootstrap
include/peregrine/model.h
meson.build
src/model/model_file.c
src/runtime/llama.c
src/tensor/kernels/fp16/arm/fp16_convert_neon.S
src/tensor/kernels/fp16/fp16.h
src/tensor/kernels/fp16/fp16_init.c
src/tensor/kernels/fp16/fp16_ref.c
src/tensor/kernels/gemm/gemm_fp16.c
src/tensor/kernels/gemm/gemm_fp16.h
src/tensor/kernels/qgemv/arm/q8_0_dot_neon.S
src/tensor/kernels/qgemv/arm/q8_0_gemv_neon.S
src/tensor/kernels/qgemv/qgemv.h
src/tensor/kernels/qgemv/qgemv_driver.c
src/tensor/kernels/qgemv/qgemv_init.c
src/tensor/kernels/qgemv/qgemv_neon.c
src/tensor/kernels/qgemv/qgemv_ref.c
tests/checkasm/checkasm.c
tests/checkasm/checkasm.h
tests/checkasm/checkasm_fp16.c
tests/checkasm/checkasm_qgemv.c
tests/meson.build

coderabbitai · 2026-06-21T00:24:47Z

+size_t pg_q8_0_row_bytes(size_t K)
+{
+    if (K == 0 || K % PG_Q8_0_BLOCK != 0)
+        return 0;
+    return (K / PG_Q8_0_BLOCK) * PG_Q8_0_BLOCK_BYTES;
+}
+
+size_t pg_q4_k_row_bytes(size_t K)
+{
+    if (K == 0 || K % PG_Q4_K_BLOCK != 0)
+        return 0;
+    return (K / PG_Q4_K_BLOCK) * PG_Q4_K_BLOCK_BYTES;


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add overflow guards in row-byte helpers before multiplication.

pg_q8_0_row_bytes / pg_q4_k_row_bytes can wrap on very large K because (K / block) * block_bytes is unchecked. Since runtime size validation consumes these helpers for stride/expected-byte checks, a wrapped stride can validate incorrectly and lead to out-of-bounds reads later in dequant/dot loops.
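
To make the wrap concrete, here is a small self-contained C sketch contrasting an unchecked and a checked row-size helper. The constants 32 and 34 are the standard Q8_0 block size and byte count, assumed to match `PG_Q8_0_BLOCK` and `PG_Q8_0_BLOCK_BYTES`. The function names are hypothetical. Note that `SIZE_MAX` comes from `<stdint.h>`.

```c
#include <stddef.h>
#include <stdint.h>

enum { BLOCK = 32, BLOCK_BYTES = 34 };   /* assumed Q8_0 values */

/* Mirrors the original helper: the multiplication can wrap. */
size_t row_bytes_unchecked(size_t K)
{
    if (K == 0 || K % BLOCK != 0)
        return 0;
    return (K / BLOCK) * BLOCK_BYTES;
}

/* Mirrors the suggested fix: reject sizes whose byte count would wrap. */
size_t row_bytes_checked(size_t K)
{
    if (K == 0 || K % BLOCK != 0)
        return 0;
    size_t nb = K / BLOCK;
    if (nb > SIZE_MAX / BLOCK_BYTES)
        return 0;
    return nb * BLOCK_BYTES;
}
```

For the largest block-aligned `K`, the unchecked version returns a byte count smaller than `K` itself, even though a Q8_0 row always needs more than one byte per element. The checked version returns 0 for the same input.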

💡 Suggested fix

 size_t pg_q8_0_row_bytes(size_t K)
 {
     if (K == 0 || K % PG_Q8_0_BLOCK != 0)
         return 0;
-    return (K / PG_Q8_0_BLOCK) * PG_Q8_0_BLOCK_BYTES;
+    size_t nb = K / PG_Q8_0_BLOCK;
+    if (nb > SIZE_MAX / PG_Q8_0_BLOCK_BYTES)
+        return 0;
+    return nb * PG_Q8_0_BLOCK_BYTES;
 }

 size_t pg_q4_k_row_bytes(size_t K)
 {
     if (K == 0 || K % PG_Q4_K_BLOCK != 0)
         return 0;
-    return (K / PG_Q4_K_BLOCK) * PG_Q4_K_BLOCK_BYTES;
+    size_t nb = K / PG_Q4_K_BLOCK;
+    if (nb > SIZE_MAX / PG_Q4_K_BLOCK_BYTES)
+        return 0;
+    return nb * PG_Q4_K_BLOCK_BYTES;
 }

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

-size_t pg_q8_0_row_bytes(size_t K)
-{
-    if (K == 0 || K % PG_Q8_0_BLOCK != 0)
-        return 0;
-    return (K / PG_Q8_0_BLOCK) * PG_Q8_0_BLOCK_BYTES;
-}
-
-size_t pg_q4_k_row_bytes(size_t K)
-{
-    if (K == 0 || K % PG_Q4_K_BLOCK != 0)
-        return 0;
-    return (K / PG_Q4_K_BLOCK) * PG_Q4_K_BLOCK_BYTES;
+size_t pg_q8_0_row_bytes(size_t K)
+{
+    if (K == 0 || K % PG_Q8_0_BLOCK != 0)
+        return 0;
+    size_t nb = K / PG_Q8_0_BLOCK;
+    if (nb > SIZE_MAX / PG_Q8_0_BLOCK_BYTES)
+        return 0;
+    return nb * PG_Q8_0_BLOCK_BYTES;
+}
+
+size_t pg_q4_k_row_bytes(size_t K)
+{
+    if (K == 0 || K % PG_Q4_K_BLOCK != 0)
+        return 0;
+    size_t nb = K / PG_Q4_K_BLOCK;
+    if (nb > SIZE_MAX / PG_Q4_K_BLOCK_BYTES)
+        return 0;
+    return nb * PG_Q4_K_BLOCK_BYTES;
+}

feat: add q8 quantized decode path

4ddf682

Signed-off-by: Zach Bennett <zach@worldflowai.com>

coderabbitai Bot reviewed Jun 21, 2026

zbennett10 wants to merge 1 commit into main from m4/quantized-decode (#15)
