PR-E1c: fix kv_live_bytes reporting path#52
Merged
FluffyAIcode merged 1 commit intoJun 2, 2026
Conversation
PR-E1b's 4h Mac M4 bench surfaced that GetSessionInfo.kv_live_bytes
was 0 across all 480 turns. This was a reporting bug, not a
verifier bug:
- Session.kv_live_bytes() reads slab.live_kv_bytes, designed to
pass through slab.live_kv_bytes_override when the verifier
publishes its real KV byte count.
- The HTTP shim's PooledVerifier writes the override on every
forward via _sync_slab_bytes (\u00a72.10).
- The gRPC path's AppendTokensCoordinator + GenerationCoordinator
never wrote the override. Slab stayed in its tensor-write-based
accounting, which is 0 because the verifier writes its own
SinkWindowKVCache, not the slab tensors.
- Result: PR-E1b's bench reported kv_bounded=True trivially
(0 - 0 < 10% \u00d7 max(0,1) = 0.10) instead of an actual
empirical bound.
PR-E1c closes this. Architecture is unchanged; the slab placeholder
mechanism stays. The fix is two coordinator-level write-throughs
plus a verifier-level kv_live_bytes(session) accessor.
Production-side changes
-----------------------
inference_engine/session/coordinator.py
+ VerifierProtocol.kv_live_bytes(session) -> int. Mirrors
k_seq_length's per-session ignored-arg shape.
+ module-level _sync_slab_bytes(session, verifier) helper. No-op
when session.slab is None (pool-less SessionStore in
coordinator unit tests). Otherwise writes
int(verifier.kv_live_bytes(session)) onto
session.slab.live_kv_bytes_override.
+ AppendTokensCoordinator.append_tokens calls _sync_slab_bytes
after the position-advance step on every successful path.
Empty appends remain a no-op (slab override unchanged).
inference_engine/session/generator.py
+ GenerationCoordinator.generate calls _sync_slab_bytes both on
EOS exit and on max-tokens exit. Once cache hits sink+window
capacity the value plateaus, which is precisely what the bench
needs to measure to claim kv_bounded empirically.
inference_engine/session/store.py
Updated Session.kv_live_bytes() docstring to reflect the dual
HTTP-shim + gRPC sync paths now feeding the slab override.
kv_cache_proposer/verifier.py (CPU)
+ SinkWindowVerifier.__init__ precomputes _bytes_per_kv_token =
num_layers \u00d7 num_kv_heads \u00d7 head_dim \u00d7 dtype.itemsize \u00d7 2.
Reads dims from the HF config so GQA / MQA via
num_key_value_heads is honored (Qwen3 / Gemma / DeepSeek).
+ SinkWindowVerifier.kv_live_bytes(session) -> int. Returns
self._cache_seq_length() \u00d7 self._bytes_per_kv_token. O(1).
inference_engine/backends/mlx/verifier.py (MLX)
+ Same precomputation in __init__; reads dims from the wrapped HF
config (handles both model.config and bare-model layouts).
+ MLXSinkWindowVerifier.kv_live_bytes(session) mirroring the CPU
verifier.
Tests
-----
tests/inference_engine/session/test_coordinator.py
+ FakeVerifier.kv_live_bytes implementation (constant-prime
BYTES_PER_KV_TOKEN = 17 so any off-by-one in dim arithmetic
surfaces immediately).
+ TestSlabBytesSync class (4 tests): verifies the override is
written after first prefill, after subsequent forward_block,
skipped gracefully when slab is None, and not overwritten by
empty (no-op) appends.
tests/inference_engine/session/test_generator.py
+ TestGenerationSyncsSlabBytes class (2 tests): max-tokens path
and EOS path both leave a synced override.
tests/core/test_verifier.py (HF-bound, runs on Mac via
run_platform_tests.sh, NOT in Linux CI)
+ 3 tests against the real Qwen3-0.6B verifier:
- kv_live_bytes is 0 before prefill
- kv_live_bytes equals k_seq_length \u00d7 per-token bytes
(computed from HF config the same way the verifier does \u2014
so the closed-form relationship is exercised end-to-end)
- kv_live_bytes plateaus at sink+window capacity even after
further forward_block calls (the architectural KV-bound
claim, now verifiable empirically)
tests/backends/mlx/test_verifier.py (Apple Silicon only, runs on
Mac via run_platform_tests.sh, NOT in Linux CI)
+ 3 tests mirroring the CPU verifier suite, against
MLXSinkWindowVerifier.
Mac M4 reviewer aid
-------------------
scripts/review_pr_e1c_on_mac.sh
Runs:
Part 1: pytest tests/core/test_verifier.py +
tests/backends/mlx/test_verifier.py -k 'kv_live_bytes
or k_seq_length or cache_inspector'.
Part 2: 5-min bench against the running gRPC server, asserting
kv_live_bytes is non-zero (PR-E1b's reporting bug
REPRODUCED + FIXED). Skipped gracefully when PR-E1b's
gRPC server / bench scripts aren't on the tree (PR-E1c
stacks logically after PR-E1b but is branched off main
so CI triggers \u2014 see PR description).
Linux gate
----------
PYTHONPATH=.:sdks/python pytest <Linux CI gate set>:
701 passed, 100% coverage on 1702 stmts (was 700 / 1702;
+1 net test reflects the new TestSlabBytesSync + the existing
bench tests \u2014 different counts add up the same way thanks to
some pre-existing test renames in PR-D1).
Per ADR 0008 \u00a79
----------------
Linux gate covers the coordinator-level dispatch with a
deterministic FakeVerifier. Real-numerics validation
(kv_live_bytes against actual Qwen3-0.6B / MLX cache state) lives
in tests/core/ and tests/backends/mlx/ which only run on Mac \u2014 so
this PR's MERGE requires Mac M4 reviewer evidence (Part 1 of the
review aid is sufficient; Part 2 is icing).
Stack
-----
PR-E1c is branched off main. Logically follows PR-E1b (#51) for
the bench-based verification. Recommended merge order:
1. PR-D1 (#49) \u2014 pure cleanup, can merge first
2. PR-E1b (#51) \u2014 4h bench evidence already on main
3. PR-E1c (this) \u2014 closes the kv_live_bytes=0 reporting bug
Next 4h Mac M4 bench (post-merge) will produce a non-zero
kv_live_bytes series and turn kv_bounded into a non-trivial
empirical claim alongside the architectural one.
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
4 tasks
FluffyAIcode
added a commit
that referenced
this pull request
Jun 2, 2026
…tion workflow
Closes the loop on automated GA gating. After PR-N1..N4 retired all
verifier-protocol test doubles from the Linux gate, the integration
suite (tests/integration/) became the binding correctness gate for
runtime modules \u2014 inference_engine.session.coordinator,
inference_engine.session.generator,
inference_engine.scheduler.scheduler,
inference_engine.server.{app,engine,tokenizer,streaming}, and
kakeya.{client,session}. Until this PR, that suite ran manually
via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI on every
PR labelled needs-mac-m4.
Three artifacts ship:
.github/workflows/integration.yaml +136 lines
Self-hosted runner workflow targeting [self-hosted, macOS,
ARM64, kakeya-mac-m4]. Triggers on PR events when the
needs-mac-m4 label is present, plus on workflow_dispatch
for manual re-runs. Steps:
1. Checkout (full history).
2. Verify host shape (chip, memory, python version).
3. Verify Qwen/Qwen3-0.6B is in HF cache (HF_HUB_OFFLINE=1
at test time \u2014 no downloads in CI; cache miss fails
fast with a clear pre-warm command).
4. pip install -e . + pytest dependencies (warm pip cache
keeps this <30 s).
5. pytest -m integration tests/integration/ \u2014 expected
runtime 60-120 s on M4 with warm cache. 90-min timeout
is a safety margin, not the operating point.
6. Upload JUnit XML artifact.
7. On failure, inline the test names + first-line error
messages into the Action log so triage doesn't require
downloading the artifact.
Concurrency: cancel-in-progress per PR, so a new push
supersedes the previous run.
.github/workflows/auto-label-mac.yaml +89 lines
pull_request_target workflow that auto-applies (or removes)
the needs-mac-m4 label based on which paths the PR touches.
Trigger paths:
inference_engine/ \u2014 runtime, scheduler, session, server
sdks/ \u2014 Python + TypeScript SDK
proto/ \u2014 wire contract
tests/integration/ \u2014 the integration suite itself
kv_cache_proposer/ \u2014 verifier + decoder
Doc-only or CI-only PRs are NOT labelled \u2014 they skip the
integration gate entirely, saving runner time. The label is
automatically dropped if a subsequent push removes all
verifier-dependent edits.
docs/ops/mac-m4-runner-setup.md +137 lines
Operator runbook for the self-hosted runner: hardware
requirements (24 GB minimum, ~50 GB free disk), runner
registration with the kakeya-mac-m4 label, HF cache
pre-warm command (Qwen3-0.6B), Python toolchain setup,
runtime expectations, cache hygiene cron, runner upgrade
procedure, and failure triage steps.
CI workflow split rationale
---------------------------
The pre-existing .github/workflows/ci.yaml stays as the Linux gate
(verifier-independent, runs on github-hosted ubuntu-latest, fires
on every PR). PR-E2 adds integration.yaml as a SEPARATE workflow
because:
1. Self-hosted runners are slow / few; doc-only PRs shouldn't
touch them.
2. The integration gate is intentionally OPT-IN by label; ci.yaml
is non-optional.
3. Failure semantics differ: Linux gate failure blocks merge
unconditionally; Mac M4 gate failure surfaces a structured
report but the merge decision is a human one until v0.3.0
final ships.
Together the two workflows form the post-cleanup gating model:
- Linux gate (ci.yaml):
verifier-independent code; 100% coverage; every PR.
- Mac M4 gate (integration.yaml):
verifier-dependent code; binding GA gate; PRs touching
runtime / SDK / proto / integration tests.
Stack
-----
PR-E2 is branched off main, independent of the cleanup PRs (#49,
#50, #51, #52, #53, #54, #55, #56). The workflow doesn't fail at
launch even before PR-E1 lands; it just won't find any tests
under tests/integration/ until that PR is merged. Recommended
merge order: cleanup PRs first (so the workflow has tests to
run), then PR-E2.
Per ADR 0008 \u00a79
----------------
PR-E2 ships ONLY workflow YAML + a runbook \u2014 no Python source
changes. No Mac M4 evidence required for this PR (the workflow
itself becomes the Mac M4 evidence machinery for ALL future PRs).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
PR-E1b's 4h Mac M4 bench (
bench_session_4h_1780332893.json, onmain) showedmin/mean/max kv_live_bytes = 0across all 480 turns. Reporting bug, not a verifier bug:Session.kv_live_bytes()readsslab.live_kv_bytes, designed to pass throughslab.live_kv_bytes_overridewhen the verifier publishes its real KV byte count.PooledVerifierwrites the override on every forward via_sync_slab_bytes.AppendTokensCoordinator+GenerationCoordinatornever wrote the override. The slab stayed in tensor-write-based accounting, which is 0 because the verifier writes its ownSinkWindowKVCache, not slab tensors.kv_bounded=Truetrivially (0-0 < 10% × max(0,1) = 0.10) instead of an actual empirical bound.PR-E1c closes this. Architecture is unchanged; the slab placeholder mechanism stays. The fix is two coordinator-level write-throughs plus a verifier-level
kv_live_bytes(session)accessor.Production-side changes
inference_engine/session/coordinator.pyVerifierProtocol.kv_live_bytes(session); module-level_sync_slab_bytes(session, verifier)helper;AppendTokensCoordinator.append_tokenscalls it after position-advance.inference_engine/session/generator.pyGenerationCoordinator.generatecalls_sync_slab_byteson both EOS and max-tokens exit.inference_engine/session/store.pySession.kv_live_bytes()docstring updated to reflect the dual HTTP-shim + gRPC sync paths.kv_cache_proposer/verifier.py(CPU)__init__precomputes_bytes_per_kv_token = num_layers × num_kv_heads × head_dim × dtype.itemsize × 2.kv_live_bytes(session)returns_cache_seq_length() × _bytes_per_kv_token. O(1).inference_engine/backends/mlx/verifier.py(MLX)GQA / MQA via
num_key_value_headsis honored (Qwen3 / Gemma / DeepSeek).Tests
tests/inference_engine/session/test_coordinator.py::TestSlabBytesSynctests/inference_engine/session/test_generator.py::TestGenerationSyncsSlabBytestests/core/test_verifier.pyrun_platform_tests.shtests/backends/mlx/test_verifier.pyFakeVerifiergains akv_live_bytesimpl with a constant-primeBYTES_PER_KV_TOKEN = 17so any off-by-one in dim arithmetic surfaces immediately.Linux verification
Mac M4 reviewer aid —
scripts/review_pr_e1c_on_mac.shTwo parts:
pytest tests/core/test_verifier.py + tests/backends/mlx/test_verifier.py -k 'kv_live_bytes or k_seq_length or cache_inspector'. Validates the closed-form relationship against real Qwen3-0.6B numerics.bench_session_long_run.pyfor 300s, assertsmin/mean/max kv_live_bytes > 0. Reproduces PR-E1b's reporting bug as fixed.Part 2 skips gracefully if PR-E1b's
start_grpc_runtime_server.py/bench_session_long_run.pyaren't on the tree, so this PR can be reviewed on Mac before PR-E1b lands.Stack — recommended merge order
PR-E1c is branched off
maindirectly so its CI triggers normally. Logically:kv_live_bytes=0reporting bug.After PR-E1c merges, the next 4-hour Mac M4 bench will produce a non-zero
kv_live_bytesseries and turnkv_boundedinto a non-trivial empirical claim alongside the architectural one (capacity = (sink+window) tokens × dims, fixed at construction time).Reviewer checklist
review_pr_e1c_on_mac.shpasses — all 6 verifier-level tests green.max_kv_live_bytes > 0(concretely: with sink=4 + window=64 + Qwen3-0.6B, expect ~6-8 MiB once at capacity).