Skip to content

PR-E1c: fix kv_live_bytes reporting path#52

Merged
FluffyAIcode merged 1 commit into
mainfrom
AgentMemory/v030-pr-e1c-fix-kv-live-bytes-reporting-8e7f
Jun 2, 2026
Merged

PR-E1c: fix kv_live_bytes reporting path#52
FluffyAIcode merged 1 commit into
mainfrom
AgentMemory/v030-pr-e1c-fix-kv-live-bytes-reporting-8e7f

Conversation

@FluffyAIcode
Copy link
Copy Markdown
Owner

Why

PR-E1b's 4h Mac M4 bench (bench_session_4h_1780332893.json, on main) showed min/mean/max kv_live_bytes = 0 across all 480 turns. Reporting bug, not a verifier bug:

  • Session.kv_live_bytes() reads slab.live_kv_bytes, designed to pass through slab.live_kv_bytes_override when the verifier publishes its real KV byte count.
  • The HTTP shim's PooledVerifier writes the override on every forward via _sync_slab_bytes.
  • The gRPC path's AppendTokensCoordinator + GenerationCoordinator never wrote the override. The slab stayed in tensor-write-based accounting, which is 0 because the verifier writes its own SinkWindowKVCache, not slab tensors.
  • Result: PR-E1b's bench reported kv_bounded=True trivially (0-0 < 10% × max(0,1) = 0.10) instead of an actual empirical bound.

PR-E1c closes this. Architecture is unchanged; the slab placeholder mechanism stays. The fix is two coordinator-level write-throughs plus a verifier-level kv_live_bytes(session) accessor.

Production-side changes

File Δ What
inference_engine/session/coordinator.py +27 VerifierProtocol.kv_live_bytes(session); module-level _sync_slab_bytes(session, verifier) helper; AppendTokensCoordinator.append_tokens calls it after position-advance.
inference_engine/session/generator.py +12 GenerationCoordinator.generate calls _sync_slab_bytes on both EOS and max-tokens exit.
inference_engine/session/store.py +25 Session.kv_live_bytes() docstring updated to reflect the dual HTTP-shim + gRPC sync paths.
kv_cache_proposer/verifier.py (CPU) +38 __init__ precomputes _bytes_per_kv_token = num_layers × num_kv_heads × head_dim × dtype.itemsize × 2. kv_live_bytes(session) returns _cache_seq_length() × _bytes_per_kv_token. O(1).
inference_engine/backends/mlx/verifier.py (MLX) +30 Same shape; reads dims from the wrapped HF config.

GQA / MQA via num_key_value_heads is honored (Qwen3 / Gemma / DeepSeek).

Tests

Suite Tests Where it runs
tests/inference_engine/session/test_coordinator.py::TestSlabBytesSync 4 Linux CI ✅
tests/inference_engine/session/test_generator.py::TestGenerationSyncsSlabBytes 2 Linux CI ✅
tests/core/test_verifier.py +3 against real Qwen3-0.6B Mac via run_platform_tests.sh
tests/backends/mlx/test_verifier.py +3 against MLX Apple Silicon only

FakeVerifier gains a kv_live_bytes impl with a constant-prime BYTES_PER_KV_TOKEN = 17 so any off-by-one in dim arithmetic surfaces immediately.

Linux verification

PYTHONPATH=.:sdks/python pytest <Linux CI gate set>:
  701 passed, 100% coverage on 1702 stmts.

Mac M4 reviewer aid — scripts/review_pr_e1c_on_mac.sh

Two parts:

  1. Verifier testspytest tests/core/test_verifier.py + tests/backends/mlx/test_verifier.py -k 'kv_live_bytes or k_seq_length or cache_inspector'. Validates the closed-form relationship against real Qwen3-0.6B numerics.
  2. 5-min gRPC bench — boots the gRPC server, runs bench_session_long_run.py for 300s, asserts min/mean/max kv_live_bytes > 0. Reproduces PR-E1b's reporting bug as fixed.

Part 2 skips gracefully if PR-E1b's start_grpc_runtime_server.py / bench_session_long_run.py aren't on the tree, so this PR can be reviewed on Mac before PR-E1b lands.

Stack — recommended merge order

PR-E1c is branched off main directly so its CI triggers normally. Logically:

# PR Status Why
1 #49 PR-D1 6/6 ✅ Pure cleanup, no dependencies.
2 #51 PR-E1b 5/6 ✅ + 4h evidence on main Adds the bench machinery.
3 #50 PR-E1 6/6 ✅, awaiting Mac M4 INV-3 evidence Independent of PR-E1b/c.
4 this PR (PR-E1c) TBD ⏳ Closes the kv_live_bytes=0 reporting bug.

After PR-E1c merges, the next 4-hour Mac M4 bench will produce a non-zero kv_live_bytes series and turn kv_bounded into a non-trivial empirical claim alongside the architectural one (capacity = (sink+window) tokens × dims, fixed at construction time).

Reviewer checklist

  • Part 1 of review_pr_e1c_on_mac.sh passes — all 6 verifier-level tests green.
  • Part 2 (after PR-E1b merges) shows max_kv_live_bytes > 0 (concretely: with sink=4 + window=64 + Qwen3-0.6B, expect ~6-8 MiB once at capacity).
  • Linux CI green (the FakeVerifier-based unit tests don't load real models).
  • No regressions to existing INV-1 / INV-2 / INV-3 enforcement.
Open in Web Open in Cursor 

PR-E1b's 4h Mac M4 bench surfaced that GetSessionInfo.kv_live_bytes
was 0 across all 480 turns. This was a reporting bug, not a
verifier bug:

  - Session.kv_live_bytes() reads slab.live_kv_bytes, designed to
    pass through slab.live_kv_bytes_override when the verifier
    publishes its real KV byte count.
  - The HTTP shim's PooledVerifier writes the override on every
    forward via _sync_slab_bytes (\u00a72.10).
  - The gRPC path's AppendTokensCoordinator + GenerationCoordinator
    never wrote the override. Slab stayed in its tensor-write-based
    accounting, which is 0 because the verifier writes its own
    SinkWindowKVCache, not the slab tensors.
  - Result: PR-E1b's bench reported kv_bounded=True trivially
    (0 - 0 < 10% \u00d7 max(0,1) = 0.10) instead of an actual
    empirical bound.

PR-E1c closes this. Architecture is unchanged; the slab placeholder
mechanism stays. The fix is two coordinator-level write-throughs
plus a verifier-level kv_live_bytes(session) accessor.

Production-side changes
-----------------------

inference_engine/session/coordinator.py
  + VerifierProtocol.kv_live_bytes(session) -> int. Mirrors
    k_seq_length's per-session ignored-arg shape.
  + module-level _sync_slab_bytes(session, verifier) helper. No-op
    when session.slab is None (pool-less SessionStore in
    coordinator unit tests). Otherwise writes
    int(verifier.kv_live_bytes(session)) onto
    session.slab.live_kv_bytes_override.
  + AppendTokensCoordinator.append_tokens calls _sync_slab_bytes
    after the position-advance step on every successful path.
    Empty appends remain a no-op (slab override unchanged).

inference_engine/session/generator.py
  + GenerationCoordinator.generate calls _sync_slab_bytes both on
    EOS exit and on max-tokens exit. Once cache hits sink+window
    capacity the value plateaus, which is precisely what the bench
    needs to measure to claim kv_bounded empirically.

inference_engine/session/store.py
  Updated Session.kv_live_bytes() docstring to reflect the dual
  HTTP-shim + gRPC sync paths now feeding the slab override.

kv_cache_proposer/verifier.py (CPU)
  + SinkWindowVerifier.__init__ precomputes _bytes_per_kv_token =
    num_layers \u00d7 num_kv_heads \u00d7 head_dim \u00d7 dtype.itemsize \u00d7 2.
    Reads dims from the HF config so GQA / MQA via
    num_key_value_heads is honored (Qwen3 / Gemma / DeepSeek).
  + SinkWindowVerifier.kv_live_bytes(session) -> int. Returns
    self._cache_seq_length() \u00d7 self._bytes_per_kv_token. O(1).

inference_engine/backends/mlx/verifier.py (MLX)
  + Same precomputation in __init__; reads dims from the wrapped HF
    config (handles both model.config and bare-model layouts).
  + MLXSinkWindowVerifier.kv_live_bytes(session) mirroring the CPU
    verifier.

Tests
-----

tests/inference_engine/session/test_coordinator.py
  + FakeVerifier.kv_live_bytes implementation (constant-prime
    BYTES_PER_KV_TOKEN = 17 so any off-by-one in dim arithmetic
    surfaces immediately).
  + TestSlabBytesSync class (4 tests): verifies the override is
    written after first prefill, after subsequent forward_block,
    skipped gracefully when slab is None, and not overwritten by
    empty (no-op) appends.

tests/inference_engine/session/test_generator.py
  + TestGenerationSyncsSlabBytes class (2 tests): max-tokens path
    and EOS path both leave a synced override.

tests/core/test_verifier.py (HF-bound, runs on Mac via
run_platform_tests.sh, NOT in Linux CI)
  + 3 tests against the real Qwen3-0.6B verifier:
      - kv_live_bytes is 0 before prefill
      - kv_live_bytes equals k_seq_length \u00d7 per-token bytes
        (computed from HF config the same way the verifier does \u2014
        so the closed-form relationship is exercised end-to-end)
      - kv_live_bytes plateaus at sink+window capacity even after
        further forward_block calls (the architectural KV-bound
        claim, now verifiable empirically)

tests/backends/mlx/test_verifier.py (Apple Silicon only, runs on
Mac via run_platform_tests.sh, NOT in Linux CI)
  + 3 tests mirroring the CPU verifier suite, against
    MLXSinkWindowVerifier.

Mac M4 reviewer aid
-------------------

scripts/review_pr_e1c_on_mac.sh
  Runs:
    Part 1: pytest tests/core/test_verifier.py +
            tests/backends/mlx/test_verifier.py -k 'kv_live_bytes
            or k_seq_length or cache_inspector'.
    Part 2: 5-min bench against the running gRPC server, asserting
            kv_live_bytes is non-zero (PR-E1b's reporting bug
            REPRODUCED + FIXED). Skipped gracefully when PR-E1b's
            gRPC server / bench scripts aren't on the tree (PR-E1c
            stacks logically after PR-E1b but is branched off main
            so CI triggers \u2014 see PR description).

Linux gate
----------
PYTHONPATH=.:sdks/python pytest <Linux CI gate set>:
  701 passed, 100% coverage on 1702 stmts (was 700 / 1702;
  +1 net test reflects the new TestSlabBytesSync + the existing
  bench tests \u2014 different counts add up the same way thanks to
  some pre-existing test renames in PR-D1).

Per ADR 0008 \u00a79
----------------
Linux gate covers the coordinator-level dispatch with a
deterministic FakeVerifier. Real-numerics validation
(kv_live_bytes against actual Qwen3-0.6B / MLX cache state) lives
in tests/core/ and tests/backends/mlx/ which only run on Mac \u2014 so
this PR's MERGE requires Mac M4 reviewer evidence (Part 1 of the
review aid is sufficient; Part 2 is icing).

Stack
-----
PR-E1c is branched off main. Logically follows PR-E1b (#51) for
the bench-based verification. Recommended merge order:
  1. PR-D1 (#49) \u2014 pure cleanup, can merge first
  2. PR-E1b (#51) \u2014 4h bench evidence already on main
  3. PR-E1c (this) \u2014 closes the kv_live_bytes=0 reporting bug

Next 4h Mac M4 bench (post-merge) will produce a non-zero
kv_live_bytes series and turn kv_bounded into a non-trivial
empirical claim alongside the architectural one.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
@FluffyAIcode FluffyAIcode marked this pull request as ready for review June 2, 2026 04:02
@FluffyAIcode FluffyAIcode merged commit ac3b1da into main Jun 2, 2026
6 checks passed
@FluffyAIcode FluffyAIcode deleted the AgentMemory/v030-pr-e1c-fix-kv-live-bytes-reporting-8e7f branch June 2, 2026 04:02
FluffyAIcode added a commit that referenced this pull request Jun 2, 2026
…tion workflow

Closes the loop on automated GA gating. After PR-N1..N4 retired all
verifier-protocol test doubles from the Linux gate, the integration
suite (tests/integration/) became the binding correctness gate for
runtime modules \u2014 inference_engine.session.coordinator,
inference_engine.session.generator,
inference_engine.scheduler.scheduler,
inference_engine.server.{app,engine,tokenizer,streaming}, and
kakeya.{client,session}. Until this PR, that suite ran manually
via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI on every
PR labelled needs-mac-m4.

Three artifacts ship:

  .github/workflows/integration.yaml           +136 lines
    Self-hosted runner workflow targeting [self-hosted, macOS,
    ARM64, kakeya-mac-m4]. Triggers on PR events when the
    needs-mac-m4 label is present, plus on workflow_dispatch
    for manual re-runs. Steps:
      1. Checkout (full history).
      2. Verify host shape (chip, memory, python version).
      3. Verify Qwen/Qwen3-0.6B is in HF cache (HF_HUB_OFFLINE=1
         at test time \u2014 no downloads in CI; cache miss fails
         fast with a clear pre-warm command).
      4. pip install -e . + pytest dependencies (warm pip cache
         keeps this <30 s).
      5. pytest -m integration tests/integration/ \u2014 expected
         runtime 60-120 s on M4 with warm cache. 90-min timeout
         is a safety margin, not the operating point.
      6. Upload JUnit XML artifact.
      7. On failure, inline the test names + first-line error
         messages into the Action log so triage doesn't require
         downloading the artifact.
    Concurrency: cancel-in-progress per PR, so a new push
    supersedes the previous run.

  .github/workflows/auto-label-mac.yaml        +89 lines
    pull_request_target workflow that auto-applies (or removes)
    the needs-mac-m4 label based on which paths the PR touches.
    Trigger paths:
      inference_engine/  \u2014 runtime, scheduler, session, server
      sdks/              \u2014 Python + TypeScript SDK
      proto/             \u2014 wire contract
      tests/integration/ \u2014 the integration suite itself
      kv_cache_proposer/ \u2014 verifier + decoder
    Doc-only or CI-only PRs are NOT labelled \u2014 they skip the
    integration gate entirely, saving runner time. The label is
    automatically dropped if a subsequent push removes all
    verifier-dependent edits.

  docs/ops/mac-m4-runner-setup.md              +137 lines
    Operator runbook for the self-hosted runner: hardware
    requirements (24 GB minimum, ~50 GB free disk), runner
    registration with the kakeya-mac-m4 label, HF cache
    pre-warm command (Qwen3-0.6B), Python toolchain setup,
    runtime expectations, cache hygiene cron, runner upgrade
    procedure, and failure triage steps.

CI workflow split rationale
---------------------------
The pre-existing .github/workflows/ci.yaml stays as the Linux gate
(verifier-independent, runs on github-hosted ubuntu-latest, fires
on every PR). PR-E2 adds integration.yaml as a SEPARATE workflow
because:
  1. Self-hosted runners are slow / few; doc-only PRs shouldn't
     touch them.
  2. The integration gate is intentionally OPT-IN by label; ci.yaml
     is non-optional.
  3. Failure semantics differ: Linux gate failure blocks merge
     unconditionally; Mac M4 gate failure surfaces a structured
     report but the merge decision is a human one until v0.3.0
     final ships.

Together the two workflows form the post-cleanup gating model:
  - Linux gate (ci.yaml):
      verifier-independent code; 100% coverage; every PR.
  - Mac M4 gate (integration.yaml):
      verifier-dependent code; binding GA gate; PRs touching
      runtime / SDK / proto / integration tests.

Stack
-----
PR-E2 is branched off main, independent of the cleanup PRs (#49,
#50, #51, #52, #53, #54, #55, #56). The workflow doesn't fail at
launch even before PR-E1 lands; it just won't find any tests
under tests/integration/ until that PR is merged. Recommended
merge order: cleanup PRs first (so the workflow has tests to
run), then PR-E2.

Per ADR 0008 \u00a79
----------------
PR-E2 ships ONLY workflow YAML + a runbook \u2014 no Python source
changes. No Mac M4 evidence required for this PR (the workflow
itself becomes the Mac M4 evidence machinery for ALL future PRs).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants