PR-E1 (ADR 0008 §6.5): integration suite + INV-3 byte-exact GA gate by FluffyAIcode · Pull Request #50 · FluffyAIcode/Kakeya-LLM-Inference-engine

FluffyAIcode · 2026-06-01T15:58:43Z

What ships

Per ADR 0008 §6.5, PR-E1 delivers:

1. `tests/integration/` directory

__init__.py
conftest.py — auto-applies @pytest.mark.integration to every test in the directory; bare pytest skips them, contributors opt in via pytest -m integration.
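The auto-marking described above could be done with a collection hook along these lines. This is a sketch only, not the repository's actual conftest.py: the helper name `mark_items_under` is hypothetical, and it assumes pytest 7 or later, where collected items expose a `path` attribute. It only attaches the marker; whatever makes a bare `pytest` run skip these tests is not shown here.

```python
from pathlib import Path
from typing import Any, Iterable


def mark_items_under(items: Iterable[Any], root: Path, marker: Any) -> int:
    """Attach `marker` to every collected item whose file lives under `root`.

    Hypothetical helper: returns how many items were marked.
    """
    marked = 0
    for item in items:
        # A conftest hook receives every collected item in the session,
        # so filter down to files below this directory.
        if root in Path(item.path).parents:
            item.add_marker(marker)
            marked += 1
    return marked


def pytest_collection_modifyitems(config, items):
    # Imported lazily so mark_items_under can be exercised without pytest.
    import pytest

    mark_items_under(items, Path(__file__).parent, pytest.mark.integration)
```

With the marker attached, `pytest -m integration` selects exactly the tests under this directory.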

2. `tests/integration/test_inv3_session_determinism_gate.py`

The INV-3 byte-exact GA gate (ADR 0008 §7 G3). Drives two independent SinkWindowVerifier instances (real Qwen3-0.6B weights via fresh_verifier_factory) through identical history fed via different chunkings, then asserts that the resulting greedy token streams are byte-identical. Three tests:

| Test | What it covers |
| --- | --- |
| `test_one_call_vs_two_calls_yield_byte_identical_tokens` | Minimal: 1×10 vs 2×5 chunking, 12 tokens generated greedy. |
| `test_chunking_invariance_across_three_splits` | Stronger: 1×20 / 3×medium / 10×small chunkings, 8 tokens generated. Catches chunk-boundary bugs the 1-vs-2 case might miss (e.g., a chunk crossing a sink+window trim boundary). |
| `test_repeated_runs_with_same_history_byte_identical` | Sanity: same workload twice on same verifier == identical output. |
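To make the chunking-invariance property concrete, here is a toy sketch with no model weights. `ToySinkWindowVerifier`, its `append` / `greedy_generate` methods, the small sink and window sizes, and the 7/7/6 and 10×2 splits are all assumptions chosen for illustration; the real SinkWindowVerifier API and the real chunk sizes are not shown in this PR description.

```python
from collections import deque
from typing import Iterable, List, Sequence


class ToySinkWindowVerifier:
    """Toy stand-in: keeps the first `sink` tokens plus a sliding window."""

    def __init__(self, sink: int = 2, window: int = 6, vocab: int = 101) -> None:
        self.sink_size = sink
        self.sink: List[int] = []
        self.window: deque = deque(maxlen=window)
        self.vocab = vocab

    def append(self, chunk: Iterable[int]) -> None:
        for tok in chunk:
            if len(self.sink) < self.sink_size:
                self.sink.append(tok)
            else:
                self.window.append(tok)  # deque drops the oldest token when full

    def greedy_generate(self, n: int) -> List[int]:
        out = []
        for _ in range(n):
            visible = self.sink + list(self.window)
            # Deterministic "argmax" over the retained tokens only.
            nxt = sum((i + 1) * t for i, t in enumerate(visible)) % self.vocab
            out.append(nxt)
            self.append([nxt])
        return out


def split(history: Sequence[int], sizes: Sequence[int]) -> List[List[int]]:
    """Cut `history` into consecutive chunks of the given sizes."""
    assert sum(sizes) == len(history)
    chunks, start = [], 0
    for size in sizes:
        chunks.append(list(history[start:start + size]))
        start += size
    return chunks


def run(history: Sequence[int], sizes: Sequence[int], n_new: int) -> bytes:
    """Feed `history` in chunks to a fresh verifier, then generate greedily."""
    verifier = ToySinkWindowVerifier()
    for chunk in split(history, sizes):
        verifier.append(chunk)
    return bytes(verifier.greedy_generate(n_new))


history = list(range(1, 21))           # 20-token history
baseline = run(history, [20], 8)       # one call
assert baseline == run(history, [7, 7, 6], 8)   # three medium chunks
assert baseline == run(history, [2] * 10, 8)    # ten small chunks
```

The toy passes by construction because trimming is applied per token. The real gate checks that the same holds once actual KV-cache trimming and Qwen3 numerics are involved.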

This replaces tests/core/test_determinism_gate.py (deleted in PR-A3 along with verifier.path_select). Per ADR 0008 §6.6, the replacement lives in tests/integration/ rather than tests/core/ because integration is where Mac-M4-only GA gates belong per §9.

3. `pytest.ini`

Minimal new file registering the integration marker so opt-in invocations (pytest -m integration) don't trigger PytestUnknownMarkWarning.

4. `scripts/review_pr_e1_on_mac.sh`

Mac M4 reviewer aid. Runs pytest -m integration tests/integration/ and produces pr-e1-mac-integration-tests-<unix>.json under results/platform-tests/. Same coverage-free pattern as review_pr_b3_on_mac.sh.
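The artifact naming can be pictured with a small sketch. The review script itself is a shell script whose contents are not shown here; the function names and the JSON fields below are hypothetical, and only the file-name pattern and directory come from the description above.

```python
import json
import time
from pathlib import Path
from typing import Optional


def evidence_path(results_dir: Path, unix_ts: int) -> Path:
    """Build the artifact path: pr-e1-mac-integration-tests-<unix>.json."""
    return results_dir / f"pr-e1-mac-integration-tests-{unix_ts}.json"


def write_evidence(results_dir: Path, exit_code: int,
                   unix_ts: Optional[int] = None) -> Path:
    """Write a minimal evidence file and return its path."""
    ts = int(time.time()) if unix_ts is None else unix_ts
    results_dir.mkdir(parents=True, exist_ok=True)
    path = evidence_path(results_dir, ts)
    # Hypothetical payload: the real script's JSON schema is not documented here.
    payload = {
        "command": "pytest -m integration tests/integration/",
        "exit_code": exit_code,
        "unix_time": ts,
    }
    path.write_text(json.dumps(payload, indent=2) + "\n")
    return path
```

Putting the Unix timestamp in the file name keeps evidence from repeated review runs side by side instead of overwriting it.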

Independence from PR-D1

PR-E1 was originally stacked on PR-D1 (#49), but on review the two are clearly file-disjoint: PR-E1 only adds new files under tests/integration/, plus pytest.ini and scripts/review_pr_e1_on_mac.sh. It does not depend on PR-D1's deletions, so it has been rebased directly onto main and CI triggers normally.

The two PRs can merge in either order.

Not in this PR (deferred)

scripts/bench_agentic/bench_session_long_run.py: §6.5 also mentions a Mac M4 long-session bench using the gRPC SDK. Splitting it out so PR-E1's diff stays focused on the GA gate. Will land as PR-E1b or rolled into PR-E2.
PR-E2: GitHub Actions self-hosted Mac M4 runner workflow invoking pytest -m integration on every PR labelled needs-mac-m4. Until that workflow lands, the gate runs manually via scripts/review_pr_e1_on_mac.sh.

Linux verification

Linux CI gate (existing 8 test paths):
  682 passed, 100% coverage  ← UNCHANGED. tests/integration/ is not in the Linux paths.

Integration suite collection:
  3 tests collected from tests/integration/, marker auto-applied via conftest.

Mac M4 evidence (REQUIRED for merge — load-bearing for v0.3 GA)

Per ADR 0008 §9, this PR's true validation happens on Mac M4. Linux CI cannot validate INV-3 against real Qwen3 numerics. Reviewer runs:

bash scripts/review_pr_e1_on_mac.sh
git add results/platform-tests/pr-e1-mac-*
git commit -m "Mac M4 review evidence for PR-E1"
git push

…and pushes the JSON evidence to this PR branch before merge. All 3 tests must pass with byte-exact equality. Any failure here means INV-3 is broken on real numerics → BLOCKS v0.3 GA.

Next PR after merge

PR-E1b or fold-in: bench_session_long_run.py against the gRPC SDK.
PR-E2: self-hosted Mac M4 GitHub Actions workflow.
PR-D2 (independent): HTTP shim refactor onto SessionStore.

Stacks on PR-D1 (#49). When this PR merges, PR-D1 lands along with it.

Per ADR 0008 §6.5, PR-E1 ships:

1. tests/integration/: new test directory with pytest.mark.integration marker auto-applied by tests/integration/conftest.py. Every test in this directory is opt-in via 'pytest -m integration'; a bare pytest invocation skips them.

2. tests/integration/test_inv3_session_determinism_gate.py: the INV-3 byte-exact GA gate (ADR 0008 §7 G3). Drives two independent SinkWindowVerifier instances (real Qwen3-0.6B weights) through identical history fed via different chunkings, asserts the resulting greedy token streams are byte-identical. Three tests:

   test_one_call_vs_two_calls_yield_byte_identical_tokens
     Minimal gate: 1×10 vs 2×5 chunking on a 10-token history, 12 tokens of greedy generation.

   test_chunking_invariance_across_three_splits
     Stronger version: 1×20 / 3×medium / 10×small chunkings on a 20-token history, 8 tokens of greedy generation. Catches chunk-boundary bugs the 1-vs-2 case might miss (e.g., a bug that only triggers when a chunk crosses a sink+window trim boundary).

   test_repeated_runs_with_same_history_byte_identical
     Sanity: same workload run twice on the same verifier produces the same output. Greedy decoding has no legitimate source of nondeterminism.

3. pytest.ini: minimal new file registering the 'integration' marker so it doesn't trigger PytestUnknownMarkWarning. Tests opt in via 'pytest -m integration'.

Replaces the deleted tests/core/test_determinism_gate.py (PR-A3 removed it together with verifier.path_select; per ADR 0008 §6.6 PR-E1's replacement is in tests/integration/, not tests/core/, because integration is where Mac-M4-only GA gates belong per §9).

NOT in this PR (deferred):

  scripts/bench_agentic/bench_session_long_run.py
    §6.5 also mentions a Mac M4 long-session bench using the gRPC SDK. Splitting that out as PR-E1b (or rolled into PR-E2's CI workflow PR) so PR-E1's diff stays focused on the GA gate.

  PR-E2: GitHub Actions self-hosted Mac M4 runner workflow that invokes 'pytest -m integration' on every PR labelled 'needs-mac-m4'. Until that lands, the gate is run manually via scripts/review_pr_e1_on_mac.sh.

Mac M4 reviewer aid: scripts/review_pr_e1_on_mac.sh
  Runs 'pytest -m integration tests/integration/' and produces one JSON artifact (pr-e1-mac-integration-tests-<unix>.json) under results/platform-tests/. Same coverage-free pattern as review_pr_b3_on_mac.sh + commit 9d1a250 hotfixes already folded in.

Local verification (Linux VM, py3.12):

  Linux CI gate (existing 8 paths): 682 passed, 100% coverage, UNCHANGED. tests/integration/ is not in the Linux paths.

  Integration suite collection: 3 tests collected from tests/integration/, marker auto-applied via conftest.

  Bare 'pytest' from repo root: would still pick up tests/integration/ if discovered, but the existing project convention is explicit test paths in CI; the marker is the safety net.

Per ADR 0008 §9: this PR ships the test suite that IS the GA gate. Linux CI gate does not exercise the integration tests (HF-cache-bound, real-model-numerics-dependent), so PR-E1's true validation happens on Mac M4. Reviewer pushes scripts/review_pr_e1_on_mac.sh's JSON output to the PR branch as evidence.

Next PR after merge:

  PR-E2: GitHub Actions self-hosted Mac M4 runner workflow.
  PR-E1b (or rolled into PR-E2): bench_session_long_run.py.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>



The Mac smoke run reported a fixture scope mismatch in the INV-3 gate: session_verifier_pair was @pytest.fixture(scope="module") but depended on fresh_verifier_factory, which is function-scoped in tests/conftest.py. Pytest forbids module-scoped fixtures from depending on function-scoped ones and raises ScopeMismatch.

Inlined the verifier build inside session_verifier_pair so the module scope is self-contained. No fixture dependency on the function-scoped factory.

Behavior identical: same VerifierConfig (sink=4, window=64, bf16, CPU), same shared-pair pattern across the 3 tests in this file.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

…tion workflow

Closes the loop on automated GA gating. After PR-N1..N4 retired all verifier-protocol test doubles from the Linux gate, the integration suite (tests/integration/) became the binding correctness gate for runtime modules: inference_engine.session.coordinator, inference_engine.session.generator, inference_engine.scheduler.scheduler, inference_engine.server.{app,engine,tokenizer,streaming}, and kakeya.{client,session}. Until this PR, that suite ran manually via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI on every PR labelled needs-mac-m4.

Three artifacts ship:

.github/workflows/integration.yaml +136 lines
  Self-hosted runner workflow targeting [self-hosted, macOS, ARM64, kakeya-mac-m4]. Triggers on PR events when the needs-mac-m4 label is present, plus on workflow_dispatch for manual re-runs. Steps:
  1. Checkout (full history).
  2. Verify host shape (chip, memory, python version).
  3. Verify Qwen/Qwen3-0.6B is in HF cache (HF_HUB_OFFLINE=1 at test time: no downloads in CI; cache miss fails fast with a clear pre-warm command).
  4. pip install -e . + pytest dependencies (warm pip cache keeps this <30 s).
  5. pytest -m integration tests/integration/ (expected runtime 60-120 s on M4 with warm cache; the 90-min timeout is a safety margin, not the operating point).
  6. Upload JUnit XML artifact.
  7. On failure, inline the test names + first-line error messages into the Action log so triage doesn't require downloading the artifact.
  Concurrency: cancel-in-progress per PR, so a new push supersedes the previous run.

.github/workflows/auto-label-mac.yaml +89 lines
  pull_request_target workflow that auto-applies (or removes) the needs-mac-m4 label based on which paths the PR touches. Trigger paths:
    inference_engine/     runtime, scheduler, session, server
    sdks/                 Python + TypeScript SDK
    proto/                wire contract
    tests/integration/    the integration suite itself
    kv_cache_proposer/    verifier + decoder
  Doc-only or CI-only PRs are NOT labelled; they skip the integration gate entirely, saving runner time. The label is automatically dropped if a subsequent push removes all verifier-dependent edits.

docs/ops/mac-m4-runner-setup.md +137 lines
  Operator runbook for the self-hosted runner: hardware requirements (24 GB minimum, ~50 GB free disk), runner registration with the kakeya-mac-m4 label, HF cache pre-warm command (Qwen3-0.6B), Python toolchain setup, runtime expectations, cache hygiene cron, runner upgrade procedure, and failure triage steps.

CI workflow split rationale
---------------------------
The pre-existing .github/workflows/ci.yaml stays as the Linux gate (verifier-independent, runs on github-hosted ubuntu-latest, fires on every PR). PR-E2 adds integration.yaml as a SEPARATE workflow because:
  1. Self-hosted runners are slow / few; doc-only PRs shouldn't touch them.
  2. The integration gate is intentionally OPT-IN by label; ci.yaml is non-optional.
  3. Failure semantics differ: Linux gate failure blocks merge unconditionally; Mac M4 gate failure surfaces a structured report but the merge decision is a human one until v0.3.0 final ships.

Together the two workflows form the post-cleanup gating model:
  - Linux gate (ci.yaml): verifier-independent code; 100% coverage; every PR.
  - Mac M4 gate (integration.yaml): verifier-dependent code; binding GA gate; PRs touching runtime / SDK / proto / integration tests.

Stack
-----
PR-E2 is branched off main, independent of the cleanup PRs (#49, #50, #51, #52, #53, #54, #55, #56). The workflow doesn't fail at launch even before PR-E1 lands; it just won't find any tests under tests/integration/ until that PR is merged. Recommended merge order: cleanup PRs first (so the workflow has tests to run), then PR-E2.

Per ADR 0008 §9
----------------
PR-E2 ships ONLY workflow YAML + a runbook, with no Python source changes. No Mac M4 evidence required for this PR (the workflow itself becomes the Mac M4 evidence machinery for ALL future PRs).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
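The labelling rule described for auto-label-mac.yaml reduces to a prefix check over the changed paths. The sketch below illustrates that decision in Python; the actual workflow is YAML and its implementation is not shown here, and the name `needs_mac_m4` is hypothetical.

```python
from typing import Iterable

# Trigger paths as listed in the PR-E2 commit message.
TRIGGER_PREFIXES = (
    "inference_engine/",
    "sdks/",
    "proto/",
    "tests/integration/",
    "kv_cache_proposer/",
)


def needs_mac_m4(changed_paths: Iterable[str]) -> bool:
    """Return True if any changed file lives under a trigger path."""
    # str.startswith accepts a tuple of prefixes.
    return any(path.startswith(TRIGGER_PREFIXES) for path in changed_paths)


# A doc-only change stays unlabelled; touching proto/ applies the label.
assert not needs_mac_m4(["docs/ops/mac-m4-runner-setup.md"])
assert needs_mac_m4(["README.md", "proto/session.proto"])
```

Because the check is recomputed from the full set of changed paths, a later push that removes every matching edit makes it return False, which corresponds to the label being dropped.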

cursor Bot force-pushed the AgentMemory/v030-pr-e1-integration-suite-8e7f branch from 66eb1cd to 5aa648c Compare June 1, 2026 16:03

cursor Bot changed the base branch from AgentMemory/v030-pr-d1-remove-adr-0007-server-deadcode-8e7f to main June 1, 2026 16:03

Trigger CI on PR-E1 after rebase to main

ae0947c

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>

This was referenced Jun 1, 2026

PR-E1b (ADR 0008 §6.5): gRPC long-session bench + server CLI #51

Merged

PR-E1c: fix kv_live_bytes reporting path #52

Merged

FluffyAIcode marked this pull request as ready for review June 2, 2026 04:02

FluffyAIcode merged commit 6e9e9e4 into main Jun 2, 2026
6 checks passed

FluffyAIcode deleted the AgentMemory/v030-pr-e1-integration-suite-8e7f branch June 2, 2026 04:02

FluffyAIcode mentioned this pull request Jun 2, 2026

PR-E2 (ADR 0008 §6.5): self-hosted Mac M4 GitHub Actions integration workflow #57

Merged

4 tasks

FluffyAIcode merged 3 commits into main from AgentMemory/v030-pr-e1-integration-suite-8e7f

FluffyAIcode commented Jun 1, 2026 • edited by cursor Bot
