Skip to content

feat: add M3 model loading and llama runner#14

Merged
zbennett10 merged 8 commits into
mainfrom
m3/model-loaders
Jun 17, 2026
Merged

feat: add M3 model loading and llama runner#14
zbennett10 merged 8 commits into
mainfrom
m3/model-loaders

Conversation

@zbennett10

@zbennett10 zbennett10 commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

What & why

Builds the first runnable M3 slice: peregrine run -m model.gguf -p "..." now loads an open GGUF model, tokenizes a prompt, executes a f32 Llama-class decoder, and greedily emits tokens.

  • Adds zero-copy GGUF/safetensors model-file loading and typed GGUF metadata retention.
  • Adds a GGUF Llama tokenizer path with token scores and byte fallback.
  • Adds include/peregrine/llama.h plus an f32 GGUF Llama decoder with contiguous KV cache.
  • Wires the CLI run command for greedy generation.
  • Keeps decode hot math on DSP dispatch: GEMV/dot/RMSNorm/softmax/SiLU/mul/add/axpy/RoPE.
  • Adds x86 AVX2 RoPE assembly behind the existing C reference/checkasm entry, fixes RoPE DSP selection, and makes CI require rope_f32.avx2 under forced AVX2.
  • Adds a synthetic GGUF Llama fixture test that exercises model load, one-token eval, and greedy sampling without vendoring a model file.

Current limits are explicit: f32 GGUF Llama only, greedy sampling only, approximate tokenizer parity, and unfused attention. Quantized/f16 decode, top-k/top-p/temp, and llama.cpp parity checks remain follow-ups.

Verification

arch -arm64 /usr/bin/python3 -m mesonbuild.mesonmain test -C build-arm64-m3 --print-errorlogs
# 4/4 passed: model_file, tokenizer, llama, checkasm

meson test -C build-m3 --print-errorlogs
# 4/4 passed on local x86/Rosetta build; AVX2 is not exposed locally

make -f Makefile.bootstrap clean test
# 34 passed, 0 failed

nasm -f elf64 -I src/ src/tensor/kernels/rope/x86/rope_avx2.asm -o /tmp/peregrine_rope_avx2_elf64.o
nasm -f macho64 -I src/ src/tensor/kernels/rope/x86/rope_avx2.asm -o /tmp/peregrine_rope_avx2_macho64.o

git diff --cached --check

git diff --check

./build-arm64-m3/tests/checkasm --test=rope --bench --seed=0x1234abcd
# rope_f32.neon and rope_f32.c passed; NEON is ~2x faster on the actual decode shape in a scratch microbench

./build-arm64-m3/peregrine run -m /Users/zach/dev/worldflowai/models/stories15M.gguf -p "Once upon a time" -n 16
# Once upon a timeese. Oar. He was the bears, and he was the one

Forced x86 AVX2 correctness for the new RoPE assembly is covered by the updated GitHub Actions QEMU/SDE expected-variant lists.

Sign-off

  • Commits have Signed-off-by: (DCO)

Summary by CodeRabbit

  • New Features
    • Added peregrine CLI subcommands: inspect, tokenize, logits, and a full run loop for greedy generation.
    • Added zero-copy model loading for GGUF and safetensors with model/tensor metadata support and typed metadata value helpers.
    • Introduced a public tokenizer API and expanded the C Llama runtime with sampling (temperature/top-k/top-p/seed).
  • Tests
    • Added C tests covering model parsing (GGUF/safetensors, including malformed safetensors) and tokenizer end-to-end behavior.
    • Added Llama smoke tests and an optional llama.cpp parity comparison tool/script.
  • CI
    • Extended RoPE checkasm expectations for an additional AVX2 variant.

Add mmap-backed GGUF and safetensors tensor directory loading, expose tensor views through the public C API, and add a CLI inspect command for the new loader seam.

Signed-off-by: Zach Bennett <zach@worldflowai.com>
@coderabbitai

coderabbitai Bot commented Jun 17, 2026

Copy link
Copy Markdown

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds a complete Llama inference stack: POSIX mmap file utility (PgFileMap), dual-format model parser (GGUF + safetensors) with metadata support, LLaMA tokenizer with SPM-style merge-based encoding/decoding, full Llama-class f32 inference runtime with attention and FFN layers, temperature-based sampling with top-k/top-p nucleus filtering, AVX2 RoPE rotation kernel with runtime dispatch, CLI inspect, tokenize, logits, and run subcommands for model exploration and greedy/sampled text generation, optional llama.cpp parity comparison tooling, comprehensive integration tests, and complete build system wiring into Meson and Makefile.bootstrap.

Changes

Llama inference stack: model loading, tokenization, runtime, and CLI

Layer / File(s) Summary
POSIX file mapping utility
src/util/filemap.h, src/util/filemap.c
Introduces PgFileMap struct and pg_file_map_open/pg_file_map_close with open/fstat/mmap validation, PROT_READ|MAP_PRIVATE mapping, comprehensive error reporting, and safe cleanup.
Public model and metadata API contracts
include/peregrine/model.h, include/peregrine/peregrine.h
Defines PgModelFormat/PgTensorType/PgMetadataType enums; defines PgStringView/PgTensorView/PgMetadataValue/PgMetadataEntry structs; declares opaque PgModelFile handle with query/conversion function signatures; umbrella header adds model.h, tokenizer.h, llama.h includes.
Model file parser: GGUF and safetensors infrastructure
src/model/model_file.c (lines 1–1144)
Implements bounded Reader abstraction, overflow-checked arithmetic, GGUF signature/version/count/alignment/metadata parsing with type mapping and tensor bounds validation; implements bounded JSON parsing, safetensors dtype/shape/offset reading with interval coverage validation enforcing exact contiguous data coverage with no holes/overlaps.
Model file public API and integration tests
src/model/model_file.c (lines 1146–1383), tests/test_model_file.c
Implements pg_model_file_open with mmap and format dispatch, pg_model_file_free, format/metadata/tensor count/lookup accessors, type/format/size name mappers, and typed metadata conversions (u64/i64/f64/string/array-string). Tests create temporary GGUF/safetensors files, validate format/count/tensor properties/float values, and assert parse failure for bounds and gap errors.
Tokenizer: API, LLaMA implementation, and tests
include/peregrine/tokenizer.h, src/token/tokenizer.c, tests/test_tokenizer.c
Introduces PgTokenizer opaque type and PgTokenBuffer struct with encoder/decoder/ID accessor declarations. Implements GGUF tokenizer construction (reading metadata, building hash table, byte-id mapping), LLaMA text normalization (space → SPM "▁"), SPM-style merge-based tokenization with heap-driven bigram scoring and byte-level fallback, token decode with reverse mapping, and accessor functions. Tests write GGUF tokenizer fixture, validate vocabulary/IDs, and verify round-trip encode/decode behavior.
Llama inference runtime: model, context, and evaluation
include/peregrine/llama.h, src/runtime/llama.c
Introduces PgLlamaModel/PgLlamaContext opaque types with API for load/free, model properties, context creation/free, eval_token, logits, position, and greedy sample. Implements f32 GGUF loader with architecture/dimension validation, DSP backend initialization, overflow-safe KV-cache/scratch allocation. Implements per-layer evaluation: RMSNorm, Q/K/V GEMV, RoPE, KV cache, attention, FFN, residual add/norm. Implements pg_llama_eval_token validating inputs, loading embedding, caching RoPE, evaluating layers, computing logits, incrementing position.
Non-greedy sampling: API and implementation
include/peregrine/llama.h, src/runtime/llama.c (lines 746–934)
Adds PgSampler opaque type and PgSamplerParams struct (temperature, top_k, top_p, seed) with sampler constructor/destructor/sampling function declarations. Implements temperature-scaled log-probability sampling with top-k and top-p nucleus filtering, PRNG seeding, finite-logit candidate filtering, probability mass validation, and error handling.
RoPE AVX2 kernel and runtime dispatch
src/tensor/kernels/rope/rope_init.c, src/tensor/kernels/rope/x86/rope_avx2.asm
Adds avx2 variant to x86_64 rope table with PG_CPU_AVX2 requirement; implements runtime dispatch in pg_rope_dsp_init selecting first variant where flags satisfy requirement. Implements complete AVX2 rope_f32 kernel: scalar fallback for interleaved pairs, NeoX vectorized path with 32-byte vmovups/vmulps/vaddps/vsubps and scalar tail loop, per-head cos/sin advancement.
CLI: inspect, tokenize, logits, and run commands
src/cli/peregrine.c
Adds cmd_inspect parsing arguments, opening model, printing format/tensor count/per-tensor properties, and cleanup. Implements numeric argument parsing helpers and safe token output printing. Adds cmd_tokenize parsing -m/-p plus BOS/EOS toggles and --ids-only mode, encoding/decoding token text. Implements cmd_logits extracting and displaying top-N logits per position. Implements cmd_run parsing -m/-p/-n and sampler options, loading model/tokenizer/context, evaluating/sampling tokens, streaming output. Updates usage() and main() dispatch.
Optional parity testing: Peregrine vs llama.cpp
tests/test_llama_cpp_parity.py, tools/llama_cpp_parity.py
Adds Python test runner orchestrating optional parity comparison (skipping when prerequisites absent). Implements comprehensive parity tool with token comparison (exact equality), logits comparison (top-N ID matching + tolerance-based value validation), tokenizer/logits extraction from both Peregrine and llama.cpp, and mode flags for selective testing.
Llama integration test: GGUF generation and model evaluation
tests/test_llama.c
Implements C test creating tiny deterministic GGUF on disk with fixed tensors, evaluating model, sampling greedily, validating output invariants. Includes test_sampler() validating non-greedy sampling across temperature/top-k/top-p configs and asserting invalid params cause construction failure. Returns exit status based on check failures.
Build system wiring and test registration
Makefile.bootstrap, meson.build, tests/meson.build, .github/workflows/ci.yml, tests/checkasm/checkasm_rope.c
Updates both build systems to compile src/util/filemap.c and src/model/model_file.c; adds x86_64 ASM_OBJ entry for rope_avx2.o; registers test_model_file, test_tokenizer, test_llama as test targets; conditionally registers llama_cpp_parity Python test with 120s timeout; updates CI to expect rope_f32.avx2 checkasm variant in QEMU and Intel SDE runs.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~75 minutes

Possibly related PRs

  • WorldFlowAI/peregrine#3: Introduces the RoPE foundation and checkasm infrastructure that this PR extends with AVX2 kernel implementation and rope_init.c dispatch logic.

Poem

🐇 A bunny loads GGUF, parses with care,
Safetensors mapped into memory's layer,
LLaMA tokens dance through transformer's song,
RoPE vectors spin where AVX2 belongs,
run, inspect, tokenize—let inference shine bright!

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title 'feat: add M3 model loading and llama runner' accurately describes the main additions in the PR: M3 phase implementation with GGUF model loading and an end-to-end Llama inference runner with the peregrine run command.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch m3/model-loaders

Comment @coderabbitai help to get the list of available commands and usage tips.

Parse GGUF metadata into model-file views, expose scalar/string accessors, and cover tokenizer string arrays in the model loader tests.

Signed-off-by: Zach Bennett <zach@worldflowai.com>
Add a GGUF llama tokenizer backed by retained metadata, token buffer APIs, encode/decode helpers, and synthetic tokenizer fixtures.

Signed-off-by: Zach Bennett <zach@worldflowai.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/model/model_file.c (1)

681-695: ⚠️ Potential issue | 🔴 Critical

Do not reject valid JSON strings with escapes.

Safetensors headers are JSON, but json_string() fails on any \. This rejects valid files with escaped tensor names/keys or escaped string values inside __metadata__. The code needs escape-aware string consumption and decoding for strings whose logical value is used as a tensor name/key.

The error message "escaped/control JSON strings are not supported yet" suggests this is a known limitation, but there is no documentation explicitly restricting safetensors headers to unescaped JSON only. Any safetensors file with escaped characters (e.g., "format":"p\"t") in the header will fail to load.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/model/model_file.c` around lines 681 - 695, The `json_string()` function
currently rejects any backslash character in JSON strings, which prevents
parsing valid escaped characters like `\"` or `\\`. Instead of treating
backslashes as an error, modify the while loop logic to properly handle escape
sequences by detecting when a backslash is encountered and then skipping over
both the backslash and the following escaped character to continue parsing. This
will allow valid JSON escape sequences in safetensors headers (tensor names,
keys, and metadata values) to be properly consumed and decoded.
🧹 Nitpick comments (2)
include/peregrine/model.h (2)

47-55: ⚡ Quick win

Add a bool conversion helper for the public bool metadata type.

The API exposes PG_METADATA_TYPE_BOOL, but callers must inspect value->as.boolean directly because there is no pg_metadata_as_bool() helper alongside the other scalar conversions.

API shape suggestion
 int pg_metadata_as_u64(const PgMetadataValue *value, uint64_t *out);
 int pg_metadata_as_i64(const PgMetadataValue *value, int64_t *out);
 int pg_metadata_as_f64(const PgMetadataValue *value, double *out);
+int pg_metadata_as_bool(const PgMetadataValue *value, int *out);
 int pg_metadata_as_string(const PgMetadataValue *value, PgStringView *out);
 int pg_metadata_array_string(const PgMetadataValue *value, size_t idx,
                              PgStringView *out);

Also applies to: 110-115

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/peregrine/model.h` around lines 47 - 55, Add a
`pg_metadata_as_bool()` helper function to match the pattern of other scalar
metadata conversion helpers like those for u64, i64, and f64 types. This
function should accept a metadata value parameter and return the boolean value
from the metadata structure, allowing callers to safely convert
PG_METADATA_TYPE_BOOL values without directly accessing the internal union
field.

19-22: ⚡ Quick win

Document the PgStringView lifetime and terminator contract.

PgStringView is now used for metadata keys/values, but only tensor names are documented as non-NUL-terminated. Add the non-owning/lifetime contract so API users do not treat data as a C string or keep it after pg_model_file_free.

Suggested header wording
+/* Non-owning byte view. `data` is not NUL-terminated and remains valid only
+ * while the `PgModelFile` that returned it remains alive. */
 typedef struct PgStringView {
     const char *data;
     size_t len;
 } PgStringView;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/peregrine/model.h` around lines 19 - 22, The PgStringView struct
typedef lacks documentation about its lifetime and NUL-termination contract,
which can lead to API misuse. Add a documentation comment above the PgStringView
typedef that clearly documents: it is a non-owning view into string data, the
data is not guaranteed to be NUL-terminated, the pointer validity is bounded by
the owning structure's lifetime, users should not treat it as a C string, and
the pointer becomes invalid after pg_model_file_free is called.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/model/model_file.c`:
- Around line 356-458: The code performs calloc() allocations for metadata
arrays without first validating that the multiplication of count and the element
size would not overflow. Before each calloc() call in the switch statement for
cases PG_METADATA_TYPE_U64, PG_METADATA_TYPE_I64, PG_METADATA_TYPE_F64,
PG_METADATA_TYPE_BOOL, and PG_METADATA_TYPE_STRING, add an overflow check that
verifies count does not exceed SIZE_MAX divided by the size of each array type's
element (e.g., count > SIZE_MAX / sizeof(uint64_t) for the U64 case). If the
check fails, call reader_err() with an appropriate message and return false
before attempting the allocation.

---

Outside diff comments:
In `@src/model/model_file.c`:
- Around line 681-695: The `json_string()` function currently rejects any
backslash character in JSON strings, which prevents parsing valid escaped
characters like `\"` or `\\`. Instead of treating backslashes as an error,
modify the while loop logic to properly handle escape sequences by detecting
when a backslash is encountered and then skipping over both the backslash and
the following escaped character to continue parsing. This will allow valid JSON
escape sequences in safetensors headers (tensor names, keys, and metadata
values) to be properly consumed and decoded.

---

Nitpick comments:
In `@include/peregrine/model.h`:
- Around line 47-55: Add a `pg_metadata_as_bool()` helper function to match the
pattern of other scalar metadata conversion helpers like those for u64, i64, and
f64 types. This function should accept a metadata value parameter and return the
boolean value from the metadata structure, allowing callers to safely convert
PG_METADATA_TYPE_BOOL values without directly accessing the internal union
field.
- Around line 19-22: The PgStringView struct typedef lacks documentation about
its lifetime and NUL-termination contract, which can lead to API misuse. Add a
documentation comment above the PgStringView typedef that clearly documents: it
is a non-owning view into string data, the data is not guaranteed to be
NUL-terminated, the pointer validity is bounded by the owning structure's
lifetime, users should not treat it as a C string, and the pointer becomes
invalid after pg_model_file_free is called.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 97ac3c8b-d825-4b12-9a73-0d3ecf9d861e

📥 Commits

Reviewing files that changed from the base of the PR and between cf9f346 and 0df910a.

📒 Files selected for processing (3)
  • include/peregrine/model.h
  • src/model/model_file.c
  • tests/test_model_file.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/test_model_file.c

Comment thread src/model/model_file.c
Comment on lines +356 to +458
if (!rd_u32(r, &elem_raw_type) || !rd_u64(r, &len))
return false;
elem_type = gguf_metadata_type(elem_raw_type);
if (elem_type == PG_METADATA_TYPE_UNKNOWN || elem_type == PG_METADATA_TYPE_ARRAY) {
reader_err(r, "unsupported GGUF metadata array type %u", elem_raw_type);
return false;
}
if (len > SIZE_MAX) {
reader_err(r, "metadata array too large");
return false;
}

count = (size_t)len;
value->type = PG_METADATA_TYPE_ARRAY;
value->elem_type = elem_type;
value->count = count;

switch (elem_type) {
case PG_METADATA_TYPE_U64: {
uint64_t *arr = calloc(count ? count : 1, sizeof(*arr));
if (!arr) {
reader_err(r, "out of memory");
return false;
}
for (i = 0; i < count; i++) {
PgMetadataValue elem;
if (!gguf_read_scalar(r, elem_raw_type, &elem)) {
free(arr);
return false;
}
arr[i] = elem.as.u64;
}
value->as.u64_array = arr;
return true;
}
case PG_METADATA_TYPE_I64: {
int64_t *arr = calloc(count ? count : 1, sizeof(*arr));
if (!arr) {
reader_err(r, "out of memory");
return false;
}
for (i = 0; i < count; i++) {
PgMetadataValue elem;
if (!gguf_read_scalar(r, elem_raw_type, &elem)) {
free(arr);
return false;
}
arr[i] = elem.as.i64;
}
value->as.i64_array = arr;
return true;
}
case PG_METADATA_TYPE_F64: {
double *arr = calloc(count ? count : 1, sizeof(*arr));
if (!arr) {
reader_err(r, "out of memory");
return false;
}
for (i = 0; i < count; i++) {
PgMetadataValue elem;
if (!gguf_read_scalar(r, elem_raw_type, &elem)) {
free(arr);
return false;
}
arr[i] = elem.as.f64;
}
value->as.f64_array = arr;
return true;
}
case PG_METADATA_TYPE_BOOL: {
int *arr = calloc(count ? count : 1, sizeof(*arr));
if (!arr) {
reader_err(r, "out of memory");
return false;
}
for (i = 0; i < count; i++) {
PgMetadataValue elem;
if (!gguf_read_scalar(r, elem_raw_type, &elem)) {
free(arr);
return false;
}
arr[i] = elem.as.boolean;
}
value->as.bool_array = arr;
return true;
}
case PG_METADATA_TYPE_STRING: {
PgStringView *arr = calloc(count ? count : 1, sizeof(*arr));
if (!arr) {
reader_err(r, "out of memory");
return false;
}
for (i = 0; i < count; i++) {
PgMetadataValue elem;
if (!gguf_read_scalar(r, elem_raw_type, &elem)) {
free(arr);
return false;
}
arr[i] = elem.as.string;
}
value->as.string_array = arr;
return true;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Preflight GGUF metadata array sizes before allocating.

len is only narrowed to size_t before the per-type calloc(). A malformed GGUF can declare a huge array and force a large allocation attempt before the reader discovers the file does not contain that many elements. Check the minimum encoded payload size against r->end - r->p, and check count <= SIZE_MAX / sizeof(*arr) for each allocated array type before calling calloc().

Guard the declared array length before allocation
+static size_t gguf_scalar_min_size(uint32_t raw_type)
+{
+    switch (raw_type) {
+    case 0:
+    case 1:
+    case 7:
+        return 1;
+    case 2:
+    case 3:
+        return 2;
+    case 4:
+    case 5:
+    case 6:
+        return 4;
+    case 8:
+        return 8; /* string length prefix; payload is checked by rd_string */
+    case 10:
+    case 11:
+    case 12:
+        return 8;
+    default:
+        return 0;
+    }
+}
+
     count = (size_t)len;
+    {
+        size_t min_size = gguf_scalar_min_size(elem_raw_type);
+        size_t min_bytes;
+
+        if (min_size == 0 ||
+            !checked_mul_size(count, min_size, &min_bytes) ||
+            min_bytes > (size_t)(r->end - r->p)) {
+            reader_err(r, "GGUF metadata array extends past end of file");
+            return false;
+        }
+    }
     value->type = PG_METADATA_TYPE_ARRAY;

Apply the same explicit count > SIZE_MAX / sizeof(*arr) guard before each per-type calloc().

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/model/model_file.c` around lines 356 - 458, The code performs calloc()
allocations for metadata arrays without first validating that the multiplication
of count and the element size would not overflow. Before each calloc() call in
the switch statement for cases PG_METADATA_TYPE_U64, PG_METADATA_TYPE_I64,
PG_METADATA_TYPE_F64, PG_METADATA_TYPE_BOOL, and PG_METADATA_TYPE_STRING, add an
overflow check that verifies count does not exceed SIZE_MAX divided by the size
of each array type's element (e.g., count > SIZE_MAX / sizeof(uint64_t) for the
U64 case). If the check fails, call reader_err() with an appropriate message and
return false before attempting the allocation.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/token/tokenizer.c`:
- Around line 259-262: The tokenizer vocabulary is borrowing token text memory
directly from the model file without copying it, creating a dangling pointer
risk if the model is freed before the tokenizer. In the loop where
tok->vocab[i].text is assigned from tokens_entry->value.as.string_array[i], you
need to allocate and copy the string data instead of storing the borrowed
pointer. Ensure proper memory allocation for each token text and use string copy
functions to make the tokenizer own the memory. Apply the same fix to the second
occurrence mentioned in lines 272-279 where similar borrowing occurs.
- Around line 67-68: The buffer capacity doubling arithmetic at line 67
(buf->capacity * 2) and the size multiplication at line 68 (cap * sizeof(*next))
lack overflow protection, which can cause integer wraparound on large inputs and
result in under-allocation followed by out-of-bounds writes. Add overflow checks
before both arithmetic operations to ensure buf->capacity * 2 and cap *
sizeof(*next) do not overflow; if overflow is detected, handle the error
appropriately (such as returning an error or terminating gracefully). Apply the
same fix to the other buffer growth locations mentioned at lines 299-301,
344-346, and 398.
- Around line 238-240: The three metadata_i32 calls for
"tokenizer.ggml.bos_token_id", "tokenizer.ggml.eos_token_id", and
"tokenizer.ggml.unknown_token_id" are casting their return values to void, which
silently ignores any parsing or validation failures. Instead of discarding these
return values, capture the return status from each metadata_i32 call for
tok->bos_id, tok->eos_id, and tok->unk_id, then validate that when these special
token metadata keys are present in the file, they contain valid values (not
malformed or out-of-range). If validation fails, log an appropriate error and
fail the tokenizer initialization rather than falling back to defaults with
potentially corrupt data.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 1e828d22-563d-4b03-af90-62f476e8aa16

📥 Commits

Reviewing files that changed from the base of the PR and between 0df910a and 2e6c7bb.

📒 Files selected for processing (7)
  • Makefile.bootstrap
  • include/peregrine/peregrine.h
  • include/peregrine/tokenizer.h
  • meson.build
  • src/token/tokenizer.c
  • tests/meson.build
  • tests/test_tokenizer.c
✅ Files skipped from review due to trivial changes (1)
  • include/peregrine/peregrine.h
🚧 Files skipped from review as they are similar to previous changes (2)
  • meson.build
  • Makefile.bootstrap

Comment thread src/token/tokenizer.c
Comment thread src/token/tokenizer.c
Comment thread src/token/tokenizer.c
Comment on lines +259 to +262
for (i = 0; i < tok->vocab_size; i++) {
tok->vocab[i].text = tokens_entry->value.as.string_array[i];
tok->vocab[i].score = scores_entry ? scores_entry->value.as.f64_array[i] : 0.0;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

PgTokenizer currently borrows token text memory from PgModelFile, creating a dangling-pointer risk.

Line 260 stores PgStringView pointers owned by the model metadata. If the model is freed before the tokenizer, later encode/decode reads invalid memory.

Suggested fix (make tokenizer own token text)
 struct PgTokenizer {
     TokenEntry *vocab;
+    char **owned_text;
     size_t vocab_size;
@@
     tok->vocab = calloc(tok->vocab_size ? tok->vocab_size : 1, sizeof(*tok->vocab));
+    tok->owned_text = calloc(tok->vocab_size ? tok->vocab_size : 1, sizeof(*tok->owned_text));
-    if (!tok->vocab) {
+    if (!tok->vocab || !tok->owned_text) {
         pg_tokenizer_free(tok);
         set_err(err, err_len, "out of memory");
         return NULL;
     }
@@
     for (i = 0; i < tok->vocab_size; i++) {
-        tok->vocab[i].text = tokens_entry->value.as.string_array[i];
+        PgStringView src = tokens_entry->value.as.string_array[i];
+        char *dup = malloc(src.len ? src.len : 1);
+        if (!dup) {
+            pg_tokenizer_free(tok);
+            set_err(err, err_len, "out of memory");
+            return NULL;
+        }
+        if (src.len)
+            memcpy(dup, src.data, src.len);
+        tok->owned_text[i] = dup;
+        tok->vocab[i].text.data = dup;
+        tok->vocab[i].text.len = src.len;
         tok->vocab[i].score = scores_entry ? scores_entry->value.as.f64_array[i] : 0.0;
     }
@@
 void pg_tokenizer_free(PgTokenizer *tok)
 {
+    size_t i;
     if (!tok)
         return;
+    for (i = 0; i < tok->vocab_size; i++)
+        free(tok->owned_text ? tok->owned_text[i] : NULL);
+    free(tok->owned_text);
     free(tok->vocab);
     free(tok->hash_ids);
     free(tok);
 }

Also applies to: 272-279

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/token/tokenizer.c` around lines 259 - 262, The tokenizer vocabulary is
borrowing token text memory directly from the model file without copying it,
creating a dangling pointer risk if the model is freed before the tokenizer. In
the loop where tok->vocab[i].text is assigned from
tokens_entry->value.as.string_array[i], you need to allocate and copy the string
data instead of storing the borrowed pointer. Ensure proper memory allocation
for each token text and use string copy functions to make the tokenizer own the
memory. Apply the same fix to the second occurrence mentioned in lines 272-279
where similar borrowing occurs.

Signed-off-by: Zach Bennett <zach@worldflowai.com>
@zbennett10 zbennett10 changed the title feat: add zero-copy model file loaders feat: add M3 model loading and llama runner Jun 17, 2026

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
src/cli/peregrine.c (1)

77-89: 💤 Low value

strtoul silently accepts negative numbers.

strtoul("-5", ...) wraps to a very large positive value (≈ULONG_MAX) instead of failing. If a user accidentally passes a negative -n value, generation would run for an extremely long time.

Consider adding a leading minus check:

Proposed defensive check
 static int parse_size_arg(const char *s, size_t *out)
 {
     char *end = NULL;
     unsigned long v;
 
-    if (!s || !*s)
+    if (!s || !*s || *s == '-')
         return -1;
     v = strtoul(s, &end, 10);
     if (!end || *end != '\0')
         return -1;
     *out = (size_t)v;
     return 0;
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cli/peregrine.c` around lines 77 - 89, The parse_size_arg function does
not validate against negative input values, allowing strtoul to silently wrap
negative numbers to large positive values. After the initial null/empty string
check in parse_size_arg, add a validation check to detect if the first character
of the input string s is a minus sign and return an error code if it is. This
prevents negative values from being passed to strtoul and wrapped to
unexpectedly large positive numbers.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/cli/peregrine.c`:
- Line 191: The main usage text for the `run` subcommand is missing the optional
`-n <tokens>` argument which is documented in the internal usage text. Update
the format string in the help text (the line containing "  %s run -m
<model.gguf> -p <txt>  run greedy f32 GGUF inference\n") to include the optional
`-n <tokens>` option between the existing options and the description, ensuring
consistency with the internal usage documentation shown at lines 116/121.

---

Nitpick comments:
In `@src/cli/peregrine.c`:
- Around line 77-89: The parse_size_arg function does not validate against
negative input values, allowing strtoul to silently wrap negative numbers to
large positive values. After the initial null/empty string check in
parse_size_arg, add a validation check to detect if the first character of the
input string s is a minus sign and return an error code if it is. This prevents
negative values from being passed to strtoul and wrapped to unexpectedly large
positive numbers.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: caf89cdb-2416-4c5f-afbf-05d66876e7d3

📥 Commits

Reviewing files that changed from the base of the PR and between 2e6c7bb and 8ecbd1b.

📒 Files selected for processing (12)
  • .github/workflows/ci.yml
  • Makefile.bootstrap
  • include/peregrine/llama.h
  • include/peregrine/peregrine.h
  • meson.build
  • src/cli/peregrine.c
  • src/runtime/llama.c
  • src/tensor/kernels/rope/rope_init.c
  • src/tensor/kernels/rope/x86/rope_avx2.asm
  • tests/checkasm/checkasm_rope.c
  • tests/meson.build
  • tests/test_llama.c
✅ Files skipped from review due to trivial changes (1)
  • tests/checkasm/checkasm_rope.c
🚧 Files skipped from review as they are similar to previous changes (1)
  • include/peregrine/peregrine.h

Comment thread src/cli/peregrine.c Outdated
Signed-off-by: Zach Bennett <zach@worldflowai.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
src/runtime/llama.c (2)

323-334: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject invalid RMSNorm/RoPE scalar metadata at load time.

NaN, inf, zero, or negative values currently load successfully and then propagate invalid logits through RMSNorm/RoPE. Validate the casted floats before accepting the model.

Proposed validation
     if (metadata_f64_default(m->file, "llama.attention.layer_norm_rms_epsilon",
                              1.0e-5, &f) != 0) {
         set_err(err, err_len,
                 "invalid metadata 'llama.attention.layer_norm_rms_epsilon'");
         return -1;
     }
     m->norm_eps = (float)f;
+    if (!isfinite(m->norm_eps) || m->norm_eps <= 0.0f) {
+        set_err(err, err_len,
+                "invalid metadata 'llama.attention.layer_norm_rms_epsilon'");
+        return -1;
+    }
     if (metadata_f64_default(m->file, "llama.rope.freq_base", 10000.0, &f) != 0) {
         set_err(err, err_len, "invalid metadata 'llama.rope.freq_base'");
         return -1;
     }
     m->rope_theta = (float)f;
+    if (!isfinite(m->rope_theta) || m->rope_theta <= 0.0f) {
+        set_err(err, err_len, "invalid metadata 'llama.rope.freq_base'");
+        return -1;
+    }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/runtime/llama.c` around lines 323 - 334, Add validation checks
immediately after casting the metadata values to float in both the norm_eps and
rope_theta assignments. After the cast in the line assigning to m->norm_eps (the
layer_norm_rms_epsilon value), validate that the float is not NaN, not infinite,
and greater than zero. Similarly, after the cast in the line assigning to
m->rope_theta (the rope.freq_base value), validate that the float is not NaN,
not infinite, and greater than zero. If any validation fails, set an appropriate
error message using set_err and return -1 to reject the invalid model at load
time.

375-380: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not silently tie embeddings when output.weight is present but invalid.

Any shape/type mismatch from load_mat() falls back to token_embd.weight, masking malformed or unsupported GGUF tensors and producing wrong logits. Only use tied embeddings when output.weight is actually absent.

Proposed fix
-    if (load_mat(m->file, "output.weight", m->vocab, m->n_embd,
-                 &m->output, NULL, 0) != 0) {
+    if (pg_model_file_find_tensor(m->file, "output.weight",
+                                  strlen("output.weight"))) {
+        if (load_mat(m->file, "output.weight", m->vocab, m->n_embd,
+                     &m->output, err, err_len) != 0)
+            return -1;
+    } else {
         m->output.data = m->token_embd;
         m->output.rows = m->vocab;
         m->output.cols = m->n_embd;
     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/runtime/llama.c` around lines 375 - 380, The current code at the load_mat
call for output.weight unconditionally falls back to using token_embd whenever
load_mat returns non-zero, which masks shape and type mismatches. Distinguish
between output.weight being genuinely absent from the file versus being present
but having incompatible shape or type. Only apply the fallback that ties output
to token_embd when output.weight is actually missing; when load_mat detects a
shape or type mismatch, the error should propagate so the model loading fails
rather than silently producing incorrect logits.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@include/peregrine/llama.h`:
- Around line 20-24: The comments in the PgSamplerParams struct definition on
lines 21 and 23 claim that temperature <= 0 enables greedy mode and top_p <= 0
disables sampling, but the pg_sampler_new() constructor validation actually
rejects temperature < 0 and top_p < 0. Update the comments for the temperature
and top_p fields in the struct to reflect the actual validation behavior by
changing the threshold from <= 0 to < 0, so the documented contract matches what
the constructor enforces.

In `@src/cli/peregrine.c`:
- Around line 91-103: The parse_u64_arg function does not validate for negative
inputs or overflow conditions. Add a check to reject inputs that start with a
minus sign (negative values) before calling strtoull to prevent signed values
like -1 from being wrapped into large unsigned values. Additionally, check the
errno variable after the strtoull call to detect overflow conditions and return
an error code if errno is set to ERANGE, ensuring that values exceeding the
uint64_t range are properly rejected instead of being accepted as valid.

In `@tests/test_llama_cpp_parity.py`:
- Around line 17-23: The env_path function only validates that a path exists but
does not verify it is an executable file, allowing directories or non-executable
files to pass validation and fail later during subprocess execution. Add
validation in the env_path function to check that the path is both a regular
file (using path.is_file()) and is executable (using os.access(path, os.X_OK))
before returning it, ensuring only valid executable paths are returned and
preventing downstream subprocess errors.

---

Outside diff comments:
In `@src/runtime/llama.c`:
- Around line 323-334: Add validation checks immediately after casting the
metadata values to float in both the norm_eps and rope_theta assignments. After
the cast in the line assigning to m->norm_eps (the layer_norm_rms_epsilon
value), validate that the float is not NaN, not infinite, and greater than zero.
Similarly, after the cast in the line assigning to m->rope_theta (the
rope.freq_base value), validate that the float is not NaN, not infinite, and
greater than zero. If any validation fails, set an appropriate error message
using set_err and return -1 to reject the invalid model at load time.
- Around line 375-380: The current code at the load_mat call for output.weight
unconditionally falls back to using token_embd whenever load_mat returns
non-zero, which masks shape and type mismatches. Distinguish between
output.weight being genuinely absent from the file versus being present but
having incompatible shape or type. Only apply the fallback that ties output to
token_embd when output.weight is actually missing; when load_mat detects a shape
or type mismatch, the error should propagate so the model loading fails rather
than silently producing incorrect logits.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: dbad5b90-4d27-438c-ab54-8f81942fa456

📥 Commits

Reviewing files that changed from the base of the PR and between 8ecbd1b and 19763e0.

📒 Files selected for processing (10)
  • include/peregrine/llama.h
  • meson.build
  • src/cli/peregrine.c
  • src/runtime/llama.c
  • src/token/tokenizer.c
  • tests/meson.build
  • tests/test_llama.c
  • tests/test_llama_cpp_parity.py
  • tests/test_tokenizer.c
  • tools/llama_cpp_parity.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • meson.build
  • tests/test_tokenizer.c

Comment thread include/peregrine/llama.h
Comment on lines +20 to +24
typedef struct PgSamplerParams {
float temperature; /* <= 0: greedy */
size_t top_k; /* 0: disabled */
float top_p; /* <= 0 or >= 1: disabled */
uint64_t seed;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Align sampler parameter comments with constructor validation.

Line 21 and Line 23 document negative values as greedy/disabled, but pg_sampler_new() rejects temperature < 0 and top_p < 0. Update the comments or allow those values in the constructor so callers can rely on the public contract.

Proposed header-only fix
 typedef struct PgSamplerParams {
-    float temperature;  /* <= 0: greedy */
+    float temperature;  /* 0: greedy; < 0 invalid */
     size_t top_k;       /* 0: disabled */
-    float top_p;        /* <= 0 or >= 1: disabled */
+    float top_p;        /* 0 or >= 1: disabled; < 0 invalid */
     uint64_t seed;
 } PgSamplerParams;
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
typedef struct PgSamplerParams {
float temperature; /* <= 0: greedy */
size_t top_k; /* 0: disabled */
float top_p; /* <= 0 or >= 1: disabled */
uint64_t seed;
typedef struct PgSamplerParams {
float temperature; /* 0: greedy; < 0 invalid */
size_t top_k; /* 0: disabled */
float top_p; /* 0 or >= 1: disabled; < 0 invalid */
uint64_t seed;
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@include/peregrine/llama.h` around lines 20 - 24, The comments in the
PgSamplerParams struct definition on lines 21 and 23 claim that temperature <= 0
enables greedy mode and top_p <= 0 disables sampling, but the pg_sampler_new()
constructor validation actually rejects temperature < 0 and top_p < 0. Update
the comments for the temperature and top_p fields in the struct to reflect the
actual validation behavior by changing the threshold from <= 0 to < 0, so the
documented contract matches what the constructor enforces.

Comment thread src/cli/peregrine.c
Comment on lines +91 to +103
static int parse_u64_arg(const char *s, uint64_t *out)
{
char *end = NULL;
unsigned long long v;

if (!s || !*s)
return -1;
v = strtoull(s, &end, 0);
if (!end || *end != '\0')
return -1;
*out = (uint64_t)v;
return 0;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Reject signed/overflowed --seed inputs in parse_u64_arg.

parse_u64_arg currently accepts values like -1 (wrapped) and does not reject overflow via errno, so invalid seed input can be treated as a valid large value instead of returning usage error.

Suggested fix
+#include <errno.h>
@@
 static int parse_u64_arg(const char *s, uint64_t *out)
 {
     char *end = NULL;
     unsigned long long v;
 
     if (!s || !*s)
         return -1;
-    v = strtoull(s, &end, 0);
-    if (!end || *end != '\0')
+    if (*s == '-')
+        return -1;
+    errno = 0;
+    v = strtoull(s, &end, 10);
+    if (errno == ERANGE || end == s || !end || *end != '\0')
         return -1;
     *out = (uint64_t)v;
     return 0;
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/cli/peregrine.c` around lines 91 - 103, The parse_u64_arg function does
not validate for negative inputs or overflow conditions. Add a check to reject
inputs that start with a minus sign (negative values) before calling strtoull to
prevent signed values like -1 from being wrapped into large unsigned values.
Additionally, check the errno variable after the strtoull call to detect
overflow conditions and return an error code if errno is set to ERANGE, ensuring
that values exceeding the uint64_t range are properly rejected instead of being
accepted as valid.

Comment on lines +17 to +23
def env_path(name):
value = os.environ.get(name)
if not value:
return None
path = pathlib.Path(value)
return path if path.exists() else None

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Validate executable tool paths before invoking subprocesses.

Current checks only verify existence. A directory or non-executable file can pass validation and then fail at subprocess.run instead of cleanly skipping with a prerequisite message.

Suggested fix
 def env_path(name):
@@
     path = pathlib.Path(value)
     return path if path.exists() else None
+
+
+def is_executable(path):
+    return path is not None and path.is_file() and os.access(path, os.X_OK)
@@
-    if not peregrine.exists():
+    if not is_executable(peregrine):
         return skip("peregrine CLI was not built")
@@
-    if llama_tokenize is None:
+    if not is_executable(llama_tokenize):
         return skip("set PG_LLAMA_CPP_TOKENIZE to llama-tokenize")
-    if llama_debug is None:
+    if not is_executable(llama_debug):
         return skip("set PG_LLAMA_CPP_DEBUG to llama-debug")

Also applies to: 37-44

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@tests/test_llama_cpp_parity.py` around lines 17 - 23, The env_path function
only validates that a path exists but does not verify it is an executable file,
allowing directories or non-executable files to pass validation and fail later
during subprocess execution. Add validation in the env_path function to check
that the path is both a regular file (using path.is_file()) and is executable
(using os.access(path, os.X_OK)) before returning it, ensuring only valid
executable paths are returned and preventing downstream subprocess errors.

Signed-off-by: Zach Bennett <zach@worldflowai.com>
Signed-off-by: Zach Bennett <zach@worldflowai.com>
@zbennett10

Copy link
Copy Markdown
Contributor Author

M3 decode perf update (28f9a5d):

  • Added peregrine bench --profile decode-stage timing.
  • Fused QKV and FFN gate/up row scheduling, direct-accumulated FFN down into the residual, sped up greedy argmax, and retuned GEMV row grain.
  • Local M2 Pro, stories42M f32, 6 threads, 128 decode tokens: Peregrine decode_tok_s: 450.15.
  • Same model with llama.cpp: no-BLAS CPU avg_ts: 555.59, Accelerate/BLAS avg_ts: 545.32. Peregrine is improved but still behind.
  • Tried static row partitioning and a handwritten NEON 4-row GEMV microkernel; both passed correctness where applicable but lost the actual decode benchmark, so they were not shipped.

Validation run locally: ARM release/debug Meson tests, x86 Meson tests, bootstrap checkasm. GitHub arm64/x86_64/CodeQL/CodeRabbit are green.

Signed-off-by: Zach Bennett <zach@worldflowai.com>
@zbennett10

Copy link
Copy Markdown
Contributor Author

Update after 550250c:\n\n- Added AArch64 NEON fixed-depth dot fast paths for the decode-heavy K=512 and K=1376 shapes.\n- Expanded checkasm coverage so those dot sizes are deterministic fuzz/bench cases, and GEMV bench coverage now includes model-shaped 512x512, 512x1376, 1376x512, and 32000x512 cases.\n- Validation: local ARM release/debug Meson tests pass; bootstrap checkasm passes; local x86 Meson tests pass; remote PR CI is green for arm64, x86_64, CodeQL, and CodeRabbit.\n- Measured dot bench on M2 Pro release checkasm: K=512 NEON 27.7 ns / 37.0 GFLOP/s; K=1376 NEON 73.3 ns / 37.6 GFLOP/s.\n- Current 42M f32 decode snapshot: Peregrine 442.28 tok/s vs llama.cpp no-BLAS 541.58 tok/s at 6 threads, 128 gen tokens. Remaining gap is ~18.3% slower.\n\nRejected/not landed: row-pair/fixed GEMV variants, smaller/larger grains, naive threadpool polling, and Darwin QoS hints. They either lost in full decode or were too noisy to justify shipping.

@zbennett10 zbennett10 merged commit 9e6f142 into main Jun 17, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant