PR-E1c: fix kv_live_bytes reporting path by FluffyAIcode · Pull Request #52 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-02T01:19:11Z

Why

PR-E1b's 4h Mac M4 bench (bench_session_4h_1780332893.json, on main) showed min/mean/max kv_live_bytes = 0 across all 480 turns. Reporting bug, not a verifier bug:

Session.kv_live_bytes() reads slab.live_kv_bytes, designed to pass through slab.live_kv_bytes_override when the verifier publishes its real KV byte count.
The HTTP shim's PooledVerifier writes the override on every forward via _sync_slab_bytes.
The gRPC path's AppendTokensCoordinator + GenerationCoordinator never wrote the override. The slab stayed in tensor-write-based accounting, which is 0 because the verifier writes its own SinkWindowKVCache, not slab tensors.
Result: PR-E1b's bench reported kv_bounded=True trivially (0-0 < 10% × max(0,1) = 0.10) instead of an actual empirical bound.

PR-E1c closes this. Architecture is unchanged; the slab placeholder mechanism stays. The fix is two coordinator-level write-throughs plus a verifier-level kv_live_bytes(session) accessor.

Production-side changes

File	Δ	What
`inference_engine/session/coordinator.py`	+27	`VerifierProtocol.kv_live_bytes(session)`; module-level `_sync_slab_bytes(session, verifier)` helper; `AppendTokensCoordinator.append_tokens` calls it after position-advance.
`inference_engine/session/generator.py`	+12	`GenerationCoordinator.generate` calls `_sync_slab_bytes` on both EOS and max-tokens exit.
`inference_engine/session/store.py`	+25	`Session.kv_live_bytes()` docstring updated to reflect the dual HTTP-shim + gRPC sync paths.
`kv_cache_proposer/verifier.py` (CPU)	+38	`__init__` precomputes `_bytes_per_kv_token = num_layers × num_kv_heads × head_dim × dtype.itemsize × 2`. `kv_live_bytes(session)` returns `_cache_seq_length() × _bytes_per_kv_token`. O(1).
`inference_engine/backends/mlx/verifier.py` (MLX)	+30	Same shape; reads dims from the wrapped HF config.

GQA / MQA via num_key_value_heads is honored (Qwen3 / Gemma / DeepSeek).

Tests

Suite	Tests	Where it runs
`tests/inference_engine/session/test_coordinator.py::TestSlabBytesSync`	4	Linux CI ✅
`tests/inference_engine/session/test_generator.py::TestGenerationSyncsSlabBytes`	2	Linux CI ✅
`tests/core/test_verifier.py`	+3 against real Qwen3-0.6B	Mac via `run_platform_tests.sh`
`tests/backends/mlx/test_verifier.py`	+3 against MLX	Apple Silicon only

FakeVerifier gains a kv_live_bytes impl with a constant-prime BYTES_PER_KV_TOKEN = 17 so any off-by-one in dim arithmetic surfaces immediately.

Linux verification

PYTHONPATH=.:sdks/python pytest <Linux CI gate set>:
  701 passed, 100% coverage on 1702 stmts.

Mac M4 reviewer aid — `scripts/review_pr_e1c_on_mac.sh`

Two parts:

Verifier tests — pytest tests/core/test_verifier.py + tests/backends/mlx/test_verifier.py -k 'kv_live_bytes or k_seq_length or cache_inspector'. Validates the closed-form relationship against real Qwen3-0.6B numerics.
5-min gRPC bench — boots the gRPC server, runs bench_session_long_run.py for 300s, asserts min/mean/max kv_live_bytes > 0. Reproduces PR-E1b's reporting bug as fixed.

Part 2 skips gracefully if PR-E1b's start_grpc_runtime_server.py / bench_session_long_run.py aren't on the tree, so this PR can be reviewed on Mac before PR-E1b lands.

Stack — recommended merge order

PR-E1c is branched off main directly so its CI triggers normally. Logically:

#	PR	Status	Why
1	#49 PR-D1	6/6 ✅	Pure cleanup, no dependencies.
2	#51 PR-E1b	5/6 ✅ + 4h evidence on main	Adds the bench machinery.
3	#50 PR-E1	6/6 ✅, awaiting Mac M4 INV-3 evidence	Independent of PR-E1b/c.
4	this PR (PR-E1c)	TBD ⏳	Closes the `kv_live_bytes=0` reporting bug.

After PR-E1c merges, the next 4-hour Mac M4 bench will produce a non-zero kv_live_bytes series and turn kv_bounded into a non-trivial empirical claim alongside the architectural one (capacity = (sink+window) tokens × dims, fixed at construction time).

Reviewer checklist

Part 1 of review_pr_e1c_on_mac.sh passes — all 6 verifier-level tests green.
Part 2 (after PR-E1b merges) shows max_kv_live_bytes > 0 (concretely: with sink=4 + window=64 + Qwen3-0.6B, expect ~6-8 MiB once at capacity).
Linux CI green (the FakeVerifier-based unit tests don't load real models).
No regressions to existing INV-1 / INV-2 / INV-3 enforcement.

PR-E1b's 4h Mac M4 bench surfaced that GetSessionInfo.kv_live_bytes was 0 across all 480 turns. This was a reporting bug, not a verifier bug: - Session.kv_live_bytes() reads slab.live_kv_bytes, designed to pass through slab.live_kv_bytes_override when the verifier publishes its real KV byte count. - The HTTP shim's PooledVerifier writes the override on every forward via _sync_slab_bytes (\u00a72.10). - The gRPC path's AppendTokensCoordinator + GenerationCoordinator never wrote the override. Slab stayed in its tensor-write-based accounting, which is 0 because the verifier writes its own SinkWindowKVCache, not the slab tensors. - Result: PR-E1b's bench reported kv_bounded=True trivially (0 - 0 < 10% \u00d7 max(0,1) = 0.10) instead of an actual empirical bound. PR-E1c closes this. Architecture is unchanged; the slab placeholder mechanism stays. The fix is two coordinator-level write-throughs plus a verifier-level kv_live_bytes(session) accessor. Production-side changes ----------------------- inference_engine/session/coordinator.py + VerifierProtocol.kv_live_bytes(session) -> int. Mirrors k_seq_length's per-session ignored-arg shape. + module-level _sync_slab_bytes(session, verifier) helper. No-op when session.slab is None (pool-less SessionStore in coordinator unit tests). Otherwise writes int(verifier.kv_live_bytes(session)) onto session.slab.live_kv_bytes_override. + AppendTokensCoordinator.append_tokens calls _sync_slab_bytes after the position-advance step on every successful path. Empty appends remain a no-op (slab override unchanged). inference_engine/session/generator.py + GenerationCoordinator.generate calls _sync_slab_bytes both on EOS exit and on max-tokens exit. Once cache hits sink+window capacity the value plateaus, which is precisely what the bench needs to measure to claim kv_bounded empirically. inference_engine/session/store.py Updated Session.kv_live_bytes() docstring to reflect the dual HTTP-shim + gRPC sync paths now feeding the slab override. kv_cache_proposer/verifier.py (CPU) + SinkWindowVerifier.__init__ precomputes _bytes_per_kv_token = num_layers \u00d7 num_kv_heads \u00d7 head_dim \u00d7 dtype.itemsize \u00d7 2. Reads dims from the HF config so GQA / MQA via num_key_value_heads is honored (Qwen3 / Gemma / DeepSeek). + SinkWindowVerifier.kv_live_bytes(session) -> int. Returns self._cache_seq_length() \u00d7 self._bytes_per_kv_token. O(1). inference_engine/backends/mlx/verifier.py (MLX) + Same precomputation in __init__; reads dims from the wrapped HF config (handles both model.config and bare-model layouts). + MLXSinkWindowVerifier.kv_live_bytes(session) mirroring the CPU verifier. Tests ----- tests/inference_engine/session/test_coordinator.py + FakeVerifier.kv_live_bytes implementation (constant-prime BYTES_PER_KV_TOKEN = 17 so any off-by-one in dim arithmetic surfaces immediately). + TestSlabBytesSync class (4 tests): verifies the override is written after first prefill, after subsequent forward_block, skipped gracefully when slab is None, and not overwritten by empty (no-op) appends. tests/inference_engine/session/test_generator.py + TestGenerationSyncsSlabBytes class (2 tests): max-tokens path and EOS path both leave a synced override. tests/core/test_verifier.py (HF-bound, runs on Mac via run_platform_tests.sh, NOT in Linux CI) + 3 tests against the real Qwen3-0.6B verifier: - kv_live_bytes is 0 before prefill - kv_live_bytes equals k_seq_length \u00d7 per-token bytes (computed from HF config the same way the verifier does \u2014 so the closed-form relationship is exercised end-to-end) - kv_live_bytes plateaus at sink+window capacity even after further forward_block calls (the architectural KV-bound claim, now verifiable empirically) tests/backends/mlx/test_verifier.py (Apple Silicon only, runs on Mac via run_platform_tests.sh, NOT in Linux CI) + 3 tests mirroring the CPU verifier suite, against MLXSinkWindowVerifier. Mac M4 reviewer aid ------------------- scripts/review_pr_e1c_on_mac.sh Runs: Part 1: pytest tests/core/test_verifier.py + tests/backends/mlx/test_verifier.py -k 'kv_live_bytes or k_seq_length or cache_inspector'. Part 2: 5-min bench against the running gRPC server, asserting kv_live_bytes is non-zero (PR-E1b's reporting bug REPRODUCED + FIXED). Skipped gracefully when PR-E1b's gRPC server / bench scripts aren't on the tree (PR-E1c stacks logically after PR-E1b but is branched off main so CI triggers \u2014 see PR description). Linux gate ---------- PYTHONPATH=.:sdks/python pytest <Linux CI gate set>: 701 passed, 100% coverage on 1702 stmts (was 700 / 1702; +1 net test reflects the new TestSlabBytesSync + the existing bench tests \u2014 different counts add up the same way thanks to some pre-existing test renames in PR-D1). Per ADR 0008 \u00a79 ---------------- Linux gate covers the coordinator-level dispatch with a deterministic FakeVerifier. Real-numerics validation (kv_live_bytes against actual Qwen3-0.6B / MLX cache state) lives in tests/core/ and tests/backends/mlx/ which only run on Mac \u2014 so this PR's MERGE requires Mac M4 reviewer evidence (Part 1 of the review aid is sufficient; Part 2 is icing). Stack ----- PR-E1c is branched off main. Logically follows PR-E1b (#51) for the bench-based verification. Recommended merge order: 1. PR-D1 (#49) \u2014 pure cleanup, can merge first 2. PR-E1b (#51) \u2014 4h bench evidence already on main 3. PR-E1c (this) \u2014 closes the kv_live_bytes=0 reporting bug Next 4h Mac M4 bench (post-merge) will produce a non-zero kv_live_bytes series and turn kv_bounded into a non-trivial empirical claim alongside the architectural one. Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…tion workflow Closes the loop on automated GA gating. After PR-N1..N4 retired all verifier-protocol test doubles from the Linux gate, the integration suite (tests/integration/) became the binding correctness gate for runtime modules \u2014 inference_engine.session.coordinator, inference_engine.session.generator, inference_engine.scheduler.scheduler, inference_engine.server.{app,engine,tokenizer,streaming}, and kakeya.{client,session}. Until this PR, that suite ran manually via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI on every PR labelled needs-mac-m4. Three artifacts ship: .github/workflows/integration.yaml +136 lines Self-hosted runner workflow targeting [self-hosted, macOS, ARM64, kakeya-mac-m4]. Triggers on PR events when the needs-mac-m4 label is present, plus on workflow_dispatch for manual re-runs. Steps: 1. Checkout (full history). 2. Verify host shape (chip, memory, python version). 3. Verify Qwen/Qwen3-0.6B is in HF cache (HF_HUB_OFFLINE=1 at test time \u2014 no downloads in CI; cache miss fails fast with a clear pre-warm command). 4. pip install -e . + pytest dependencies (warm pip cache keeps this <30 s). 5. pytest -m integration tests/integration/ \u2014 expected runtime 60-120 s on M4 with warm cache. 90-min timeout is a safety margin, not the operating point. 6. Upload JUnit XML artifact. 7. On failure, inline the test names + first-line error messages into the Action log so triage doesn't require downloading the artifact. Concurrency: cancel-in-progress per PR, so a new push supersedes the previous run. .github/workflows/auto-label-mac.yaml +89 lines pull_request_target workflow that auto-applies (or removes) the needs-mac-m4 label based on which paths the PR touches. Trigger paths: inference_engine/ \u2014 runtime, scheduler, session, server sdks/ \u2014 Python + TypeScript SDK proto/ \u2014 wire contract tests/integration/ \u2014 the integration suite itself kv_cache_proposer/ \u2014 verifier + decoder Doc-only or CI-only PRs are NOT labelled \u2014 they skip the integration gate entirely, saving runner time. The label is automatically dropped if a subsequent push removes all verifier-dependent edits. docs/ops/mac-m4-runner-setup.md +137 lines Operator runbook for the self-hosted runner: hardware requirements (24 GB minimum, ~50 GB free disk), runner registration with the kakeya-mac-m4 label, HF cache pre-warm command (Qwen3-0.6B), Python toolchain setup, runtime expectations, cache hygiene cron, runner upgrade procedure, and failure triage steps. CI workflow split rationale --------------------------- The pre-existing .github/workflows/ci.yaml stays as the Linux gate (verifier-independent, runs on github-hosted ubuntu-latest, fires on every PR). PR-E2 adds integration.yaml as a SEPARATE workflow because: 1. Self-hosted runners are slow / few; doc-only PRs shouldn't touch them. 2. The integration gate is intentionally OPT-IN by label; ci.yaml is non-optional. 3. Failure semantics differ: Linux gate failure blocks merge unconditionally; Mac M4 gate failure surfaces a structured report but the merge decision is a human one until v0.3.0 final ships. Together the two workflows form the post-cleanup gating model: - Linux gate (ci.yaml): verifier-independent code; 100% coverage; every PR. - Mac M4 gate (integration.yaml): verifier-dependent code; binding GA gate; PRs touching runtime / SDK / proto / integration tests. Stack ----- PR-E2 is branched off main, independent of the cleanup PRs (#49, #50, #51, #52, #53, #54, #55, #56). The workflow doesn't fail at launch even before PR-E1 lands; it just won't find any tests under tests/integration/ until that PR is merged. Recommended merge order: cleanup PRs first (so the workflow has tests to run), then PR-E2. Per ADR 0008 \u00a79 ---------------- PR-E2 ships ONLY workflow YAML + a runbook \u2014 no Python source changes. No Mac M4 evidence required for this PR (the workflow itself becomes the Mac M4 evidence machinery for ALL future PRs). Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

FluffyAIcode mentioned this pull request Jun 2, 2026

PR-N1: remove verifier-protocol test doubles from Linux CI gate #53

Merged

FluffyAIcode marked this pull request as ready for review June 2, 2026 04:02

FluffyAIcode merged commit ac3b1da into main Jun 2, 2026
6 checks passed

FluffyAIcode deleted the AgentMemory/v030-pr-e1c-fix-kv-live-bytes-reporting-8e7f branch June 2, 2026 04:02

FluffyAIcode mentioned this pull request Jun 2, 2026

PR-E2 (ADR 0008 §6.5): self-hosted Mac M4 GitHub Actions integration workflow #57

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

PR-E1c: fix kv_live_bytes reporting path#52

PR-E1c: fix kv_live_bytes reporting path#52
FluffyAIcode merged 1 commit into
mainfrom
AgentMemory/v030-pr-e1c-fix-kv-live-bytes-reporting-8e7f

FluffyAIcode commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

FluffyAIcode commented Jun 2, 2026

Why

Production-side changes

Tests

Linux verification

Mac M4 reviewer aid — scripts/review_pr_e1c_on_mac.sh

Stack — recommended merge order

Reviewer checklist

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Mac M4 reviewer aid — `scripts/review_pr_e1c_on_mac.sh`