PR-E1 (ADR 0008 §6.5): integration suite + INV-3 byte-exact GA gate #50
Merged
Stacks on PR-D1 (#49). When this PR merges, PR-D1 lands along with it.

Per ADR 0008 §6.5, PR-E1 ships:

1. tests/integration/: new test directory with the pytest.mark.integration
   marker auto-applied by tests/integration/conftest.py. Every test in this
   directory is opt-in via 'pytest -m integration'; a bare pytest invocation
   skips them.

2. tests/integration/test_inv3_session_determinism_gate.py: the INV-3
   byte-exact GA gate (ADR 0008 §7 G3). Drives two independent
   SinkWindowVerifier instances (real Qwen3-0.6B weights) through identical
   history fed via different chunkings, and asserts that the resulting
   greedy token streams are byte-identical. Three tests:

   test_one_call_vs_two_calls_yield_byte_identical_tokens
     Minimal gate: 1×10 vs 2×5 chunking on a 10-token history, 12 tokens
     of greedy generation.

   test_chunking_invariance_across_three_splits
     Stronger version: 1×20 / 3×medium / 10×small chunkings on a 20-token
     history, 8 tokens of greedy generation. Catches chunk-boundary bugs
     the 1-vs-2 case might miss (e.g., a bug that only triggers when a
     chunk crosses a sink+window trim boundary).

   test_repeated_runs_with_same_history_byte_identical
     Sanity: the same workload run twice on the same verifier produces the
     same output. Greedy decoding has no legitimate source of
     nondeterminism.

3. pytest.ini: minimal new file registering the 'integration' marker so it
   doesn't trigger PytestUnknownMarkWarning. Tests opt in via
   'pytest -m integration'.

Replaces the deleted tests/core/test_determinism_gate.py (PR-A3 removed it
together with verifier.path_select; per ADR 0008 §6.6 PR-E1's replacement
is in tests/integration/, not tests/core/, because integration is where
Mac-M4-only GA gates belong per §9).

NOT in this PR (deferred):

  scripts/bench_agentic/bench_session_long_run.py
    §6.5 also mentions a Mac M4 long-session bench using the gRPC SDK.
    Splitting that out as PR-E1b (or rolled into PR-E2's CI workflow PR)
    so PR-E1's diff stays focused on the GA gate.

  PR-E2: GitHub Actions self-hosted Mac M4 runner workflow that invokes
    'pytest -m integration' on every PR labelled 'needs-mac-m4'. Until
    that lands, the gate is run manually via
    scripts/review_pr_e1_on_mac.sh.

Mac M4 reviewer aid:

  scripts/review_pr_e1_on_mac.sh
    Runs 'pytest -m integration tests/integration/' and produces one JSON
    artifact (pr-e1-mac-integration-tests-<unix>.json) under
    results/platform-tests/. Same coverage-free pattern as
    review_pr_b3_on_mac.sh + commit 9d1a250 hotfixes already folded in.

Local verification (Linux VM, py3.12):

  Linux CI gate (existing 8 paths): 682 passed, 100% coverage, UNCHANGED.
    tests/integration/ is not in the Linux paths.
  Integration suite collection: 3 tests collected from tests/integration/,
    marker auto-applied via conftest.
  Bare 'pytest' from repo root: would still pick up tests/integration/ if
    discovered, but the existing project convention is explicit test paths
    in CI; the marker is the safety net.

Per ADR 0008 §9: this PR ships the test suite that IS the GA gate. Linux CI
gate does not exercise the integration tests (HF-cache-bound,
real-model-numerics-dependent), so PR-E1's true validation happens on Mac
M4. Reviewer pushes scripts/review_pr_e1_on_mac.sh's JSON output to the PR
branch as evidence.

Next PR after merge:

  PR-E2: GitHub Actions self-hosted Mac M4 runner workflow.
  PR-E1b (or rolled into PR-E2): bench_session_long_run.py.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
66eb1cd to 5aa648c
This was referenced Jun 1, 2026
The Mac smoke run reported an INV-3 gate fixture scope mismatch: session_verifier_pair was @pytest.fixture(scope="module") but depended on fresh_verifier_factory, which is function-scoped in tests/conftest.py. Pytest forbids module-scoped fixtures from depending on function-scoped ones and raises ScopeMismatch.

Inlined the verifier build inside session_verifier_pair so the module scope is self-contained, with no fixture dependency on the function-scoped factory. Behavior is identical: same VerifierConfig (sink=4, window=64, bf16, CPU), same shared-pair pattern across the 3 tests in this file.

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
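The fix pattern described in this commit can be illustrated with a minimal sketch. It assumes hypothetical stand-ins: FakeVerifier and build_verifier are placeholders invented for illustration and are not the project's SinkWindowVerifier or VerifierConfig API. The point is only that a module-scoped fixture which builds its objects inline has no function-scoped dependency for pytest to reject.

```python
import pytest


class FakeVerifier:
    """Hypothetical stand-in for the real verifier; not the project's class."""

    def __init__(self, sink: int, window: int) -> None:
        self.sink = sink
        self.window = window


def build_verifier() -> FakeVerifier:
    # Plain helper function (not a fixture), so a fixture of any scope may call it.
    return FakeVerifier(sink=4, window=64)


@pytest.fixture(scope="module")
def session_verifier_pair():
    # Built inline: the module-scoped fixture requests no narrower-scoped
    # fixture, so there is nothing for pytest to flag as a ScopeMismatch.
    return build_verifier(), build_verifier()


def test_pair_is_two_independent_instances(session_verifier_pair):
    first, second = session_verifier_pair
    assert first is not second
    assert (first.sink, first.window) == (second.sink, second.window)
```

The pair is created once per module and shared by every test in the file, which matches the shared-pair pattern the commit message describes.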
FluffyAIcode added a commit that referenced this pull request Jun 2, 2026
…tion workflow
Closes the loop on automated GA gating. After PR-N1..N4 retired all
verifier-protocol test doubles from the Linux gate, the integration
suite (tests/integration/) became the binding correctness gate for
runtime modules: inference_engine.session.coordinator,
inference_engine.session.generator,
inference_engine.scheduler.scheduler,
inference_engine.server.{app,engine,tokenizer,streaming}, and
kakeya.{client,session}. Until this PR, that suite ran manually
via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI on every
PR labelled needs-mac-m4.
Three artifacts ship:
.github/workflows/integration.yaml    +136 lines
  Self-hosted runner workflow targeting [self-hosted, macOS,
  ARM64, kakeya-mac-m4]. Triggers on PR events when the
  needs-mac-m4 label is present, plus on workflow_dispatch
  for manual re-runs. Steps:
    1. Checkout (full history).
    2. Verify host shape (chip, memory, python version).
    3. Verify Qwen/Qwen3-0.6B is in HF cache (HF_HUB_OFFLINE=1
       at test time: no downloads in CI; a cache miss fails
       fast with a clear pre-warm command).
    4. pip install -e . + pytest dependencies (warm pip cache
       keeps this <30 s).
    5. pytest -m integration tests/integration/ (expected
       runtime 60-120 s on M4 with warm cache; the 90-min timeout
       is a safety margin, not the operating point).
    6. Upload JUnit XML artifact.
    7. On failure, inline the test names + first-line error
       messages into the Action log so triage doesn't require
       downloading the artifact.
  Concurrency: cancel-in-progress per PR, so a new push
  supersedes the previous run.
.github/workflows/auto-label-mac.yaml    +89 lines
  pull_request_target workflow that auto-applies (or removes)
  the needs-mac-m4 label based on which paths the PR touches.
  Trigger paths:
    inference_engine/    runtime, scheduler, session, server
    sdks/                Python + TypeScript SDK
    proto/               wire contract
    tests/integration/   the integration suite itself
    kv_cache_proposer/   verifier + decoder
  Doc-only or CI-only PRs are NOT labelled: they skip the
  integration gate entirely, saving runner time. The label is
  automatically dropped if a subsequent push removes all
  verifier-dependent edits.
docs/ops/mac-m4-runner-setup.md    +137 lines
  Operator runbook for the self-hosted runner: hardware
  requirements (24 GB minimum, ~50 GB free disk), runner
  registration with the kakeya-mac-m4 label, HF cache
  pre-warm command (Qwen3-0.6B), Python toolchain setup,
  runtime expectations, cache hygiene cron, runner upgrade
  procedure, and failure triage steps.
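
Step 7 of the integration workflow (inlining failed test names and first-line error messages) could be approached along the following lines. This is a sketch only: the function name summarize_failures and the sample test names are hypothetical, and it assumes the JUnit XML layout that pytest's --junitxml option emits (testcase elements with an optional failure or error child). The actual workflow step is not shown in this PR text.

```python
import xml.etree.ElementTree as ET


def summarize_failures(junit_xml: str) -> list[str]:
    """Return one 'test id: first error line' string per failed test case."""
    root = ET.fromstring(junit_xml)
    lines = []
    for case in root.iter("testcase"):
        for kind in ("failure", "error"):
            node = case.find(kind)
            if node is None:
                continue
            # Prefer the message attribute; fall back to the element text.
            raw = (node.get("message") or node.text or "").strip()
            first_line = raw.splitlines()[0] if raw else "(no message)"
            test_id = f"{case.get('classname', '?')}::{case.get('name', '?')}"
            lines.append(f"{test_id}: {first_line}")
            break  # report each test case once
    return lines


# Hypothetical sample report with one passing and one failing test.
SAMPLE = """<testsuites><testsuite name="pytest" tests="2" failures="1">
<testcase classname="tests.integration.test_gate" name="test_ok"/>
<testcase classname="tests.integration.test_gate" name="test_bad">
<failure message="AssertionError: token streams differ">trace</failure>
</testcase>
</testsuite></testsuites>"""

for line in summarize_failures(SAMPLE):
    print(line)
```

Printing one line per failure keeps the Action log short enough to triage without downloading the artifact.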
CI workflow split rationale
---------------------------
The pre-existing .github/workflows/ci.yaml stays as the Linux gate
(verifier-independent, runs on github-hosted ubuntu-latest, fires
on every PR). PR-E2 adds integration.yaml as a SEPARATE workflow
because:
1. Self-hosted runners are slow / few; doc-only PRs shouldn't
touch them.
2. The integration gate is intentionally OPT-IN by label; ci.yaml
is non-optional.
3. Failure semantics differ: Linux gate failure blocks merge
unconditionally; Mac M4 gate failure surfaces a structured
report but the merge decision is a human one until v0.3.0
final ships.
Together the two workflows form the post-cleanup gating model:
- Linux gate (ci.yaml):
verifier-independent code; 100% coverage; every PR.
- Mac M4 gate (integration.yaml):
verifier-dependent code; binding GA gate; PRs touching
runtime / SDK / proto / integration tests.
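
The path-based decision behind the needs-mac-m4 label can be sketched in a few lines. The real logic lives in the auto-label workflow YAML, which is not reproduced here; the function name needs_mac_m4 is hypothetical, and the prefixes are copied from the trigger paths listed in the commit message.

```python
# Trigger paths as listed for auto-label-mac.yaml in the commit message.
TRIGGER_PREFIXES = (
    "inference_engine/",
    "sdks/",
    "proto/",
    "tests/integration/",
    "kv_cache_proposer/",
)


def needs_mac_m4(changed_paths: list[str]) -> bool:
    """True when at least one changed file sits under a trigger path."""
    # str.startswith accepts a tuple and matches any of its prefixes.
    return any(path.startswith(TRIGGER_PREFIXES) for path in changed_paths)


# A doc-only / CI-only change stays on the Linux gate alone.
print(needs_mac_m4(["docs/ops/mac-m4-runner-setup.md", ".github/workflows/ci.yaml"]))
# A wire-contract change is routed to the Mac M4 gate as well.
print(needs_mac_m4(["proto/session.proto", "README.md"]))
```

Re-evaluating the function on every push also covers label removal: once no changed path matches, the result flips back to False.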
Stack
-----
PR-E2 is branched off main, independent of the cleanup PRs (#49,
#50, #51, #52, #53, #54, #55, #56). The workflow doesn't fail at
launch even before PR-E1 lands; it just won't find any tests
under tests/integration/ until that PR is merged. Recommended
merge order: cleanup PRs first (so the workflow has tests to
run), then PR-E2.
Per ADR 0008 §9
---------------
PR-E2 ships ONLY workflow YAML + a runbook: no Python source
changes. No Mac M4 evidence required for this PR (the workflow
itself becomes the Mac M4 evidence machinery for ALL future PRs).
Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
What ships
Per ADR 0008 §6.5, PR-E1 delivers:
1. tests/integration/ directory
   - __init__.py
   - conftest.py: auto-applies @pytest.mark.integration to every test in the directory; bare pytest skips them, contributors opt in via pytest -m integration.
2. tests/integration/test_inv3_session_determinism_gate.py
   The INV-3 byte-exact GA gate (ADR 0008 §7 G3). Drives two independent SinkWindowVerifier instances (real Qwen3-0.6B weights via fresh_verifier_factory) through identical history fed via different chunkings, and asserts that the resulting greedy token streams are byte-identical. Three tests:
   - test_one_call_vs_two_calls_yield_byte_identical_tokens
   - test_chunking_invariance_across_three_splits
   - test_repeated_runs_with_same_history_byte_identical
   This replaces tests/core/test_determinism_gate.py (deleted in PR-A3 along with verifier.path_select). Per ADR 0008 §6.6, the replacement lives in tests/integration/ rather than tests/core/ because integration is where Mac-M4-only GA gates belong per §9.
3. pytest.ini
   Minimal new file registering the integration marker so opt-in invocations (pytest -m integration) don't trigger PytestUnknownMarkWarning.
4. scripts/review_pr_e1_on_mac.sh
   Mac M4 reviewer aid. Runs pytest -m integration tests/integration/ and produces pr-e1-mac-integration-tests-<unix>.json under results/platform-tests/. Same coverage-free pattern as review_pr_b3_on_mac.sh.
Independence from PR-D1
PR-E1 was originally stacked on PR-D1 (#49), but on review it is clear the two are file-disjoint: PR-E1 only adds new files under tests/integration/, plus pytest.ini and scripts/review_pr_e1_on_mac.sh. It does not depend on PR-D1's deletions. Rebased onto main directly so CI triggers normally.
The two PRs can merge in either order.
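
The marker auto-application that conftest.py performs (listed under What ships) might look roughly like the sketch below. The PR text does not show the actual conftest.py, so this is an assumption: the helper is_integration_test is hypothetical, and the hook relies on pytest's documented pytest_collection_modifyitems hook and on item.path, which is available from pytest 7 onward.

```python
from pathlib import PurePath

import pytest


def is_integration_test(test_path: PurePath) -> bool:
    """True when the path contains a tests/integration/ directory pair."""
    parts = test_path.parts
    return any(a == "tests" and b == "integration" for a, b in zip(parts, parts[1:]))


@pytest.hookimpl(tryfirst=True)
def pytest_collection_modifyitems(config, items):
    # The hook receives every collected item in the session, so filter by
    # path before marking; tryfirst lets the marker exist before -m filtering.
    for item in items:
        if is_integration_test(PurePath(str(item.path))):
            item.add_marker(pytest.mark.integration)
```

A marker by itself only makes tests selectable with pytest -m integration. Whether a bare pytest run excludes them depends on additional configuration (for example explicit test paths in CI), which this sketch does not cover.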
Not in this PR (deferred)
scripts/bench_agentic/bench_session_long_run.py: §6.5 also mentions a Mac M4 long-session bench using the gRPC SDK. Splitting it out so PR-E1's diff stays focused on the GA gate. Will land as PR-E1b or rolled into PR-E2.
PR-E2: GitHub Actions self-hosted Mac M4 runner workflow that invokes pytest -m integration on every PR labelled needs-mac-m4. Until that workflow lands, the gate runs manually via scripts/review_pr_e1_on_mac.sh.
Linux verification
Mac M4 evidence (REQUIRED for merge — load-bearing for v0.3 GA)
Per ADR 0008 §9, this PR's true validation happens on Mac M4. Linux CI cannot validate INV-3 against real Qwen3 numerics. Reviewer runs:
…and pushes the JSON evidence to this PR branch before merge. All 3 tests must pass with byte-exact equality. Any failure here means INV-3 is broken on real numerics → BLOCKS v0.3 GA.
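
The byte-exact chunking invariance that the three tests assert can be illustrated with a toy model. Everything here is hypothetical: ToySinkWindowSession and run are invented for illustration, use small sink and window sizes, and replace real greedy decoding with a pure function of the retained context. The real gate runs SinkWindowVerifier with Qwen3-0.6B weights, which this sketch does not touch.

```python
from collections import deque


class ToySinkWindowSession:
    """Hypothetical toy: keeps the first `sink` tokens plus a sliding window."""

    def __init__(self, sink: int = 2, window: int = 4, vocab: int = 50) -> None:
        self.sink_size = sink
        self.sink: list[int] = []
        self.recent: deque[int] = deque(maxlen=window)
        self.vocab = vocab

    def feed(self, chunk: list[int]) -> None:
        for token in chunk:
            if len(self.sink) < self.sink_size:
                self.sink.append(token)
            else:
                self.recent.append(token)  # oldest window token drops out

    def generate(self, steps: int) -> list[int]:
        out = []
        for _ in range(steps):
            context = self.sink + list(self.recent)
            # Deterministic stand-in for argmax: a pure function of the context.
            token = sum((i + 1) * t for i, t in enumerate(context)) % self.vocab
            out.append(token)
            self.feed([token])
        return out


def run(history: list[int], chunk_sizes: list[int], steps: int) -> bytes:
    """Feed `history` split into the given chunk sizes, then generate greedily."""
    assert sum(chunk_sizes) == len(history)
    session = ToySinkWindowSession()
    start = 0
    for size in chunk_sizes:
        session.feed(history[start:start + size])
        start += size
    return bytes(session.generate(steps))


history = list(range(1, 11))  # 10-token history, as in the minimal gate
one_call = run(history, [10], steps=12)
two_calls = run(history, [5, 5], steps=12)
assert one_call == two_calls  # byte-identical streams regardless of chunking
```

Because the toy trims per token rather than per chunk, its retained state cannot depend on where chunk boundaries fall. A chunk-boundary bug of the kind the gate targets would break exactly that property and make the two byte strings differ.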
Next PR after merge
bench_session_long_run.py against the gRPC SDK.
SessionStore.