From a0397c9adec06f867cdb7298ede438af3c63c6e9 Mon Sep 17 00:00:00 2001
From: Cursor Agent <cursoragent@cursor.com>
Date: Tue, 2 Jun 2026 04:11:43 +0000
Subject: [PATCH] PR-E2 (ADR 0008 \u00a76.5): self-hosted Mac M4 GitHub Actions
 integration workflow

Closes the loop on automated GA gating. After PR-N1..N4 retired all
verifier-protocol test doubles from the Linux gate, the integration
suite (tests/integration/) became the binding correctness gate for
runtime modules \u2014 inference_engine.session.coordinator,
inference_engine.session.generator,
inference_engine.scheduler.scheduler,
inference_engine.server.{app,engine,tokenizer,streaming}, and
kakeya.{client,session}. Until this PR, that suite ran manually
via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI on every
PR labelled needs-mac-m4.

Three artifacts ship:

  .github/workflows/integration.yaml           +136 lines
    Self-hosted runner workflow targeting [self-hosted, macOS,
    ARM64, kakeya-mac-m4]. Triggers on PR events when the
    needs-mac-m4 label is present, plus on workflow_dispatch
    for manual re-runs. Steps:
      1. Checkout (full history).
      2. Verify host shape (chip, memory, python version).
      3. Verify Qwen/Qwen3-0.6B is in HF cache (HF_HUB_OFFLINE=1
         at test time \u2014 no downloads in CI; cache miss fails
         fast with a clear pre-warm command).
      4. pip install -e . + pytest dependencies (warm pip cache
         keeps this <30 s).
      5. pytest -m integration tests/integration/ \u2014 expected
         runtime 60-120 s on M4 with warm cache. 90-min timeout
         is a safety margin, not the operating point.
      6. Upload JUnit XML artifact.
      7. On failure, inline the test names + first-line error
         messages into the Action log so triage doesn't require
         downloading the artifact.
    Concurrency: cancel-in-progress per PR, so a new push
    supersedes the previous run.

  .github/workflows/auto-label-mac.yaml        +89 lines
    pull_request_target workflow that auto-applies (or removes)
    the needs-mac-m4 label based on which paths the PR touches.
    Trigger paths:
      inference_engine/  \u2014 runtime, scheduler, session, server
      sdks/              \u2014 Python + TypeScript SDK
      proto/             \u2014 wire contract
      tests/integration/ \u2014 the integration suite itself
      kv_cache_proposer/ \u2014 verifier + decoder
    Doc-only or CI-only PRs are NOT labelled \u2014 they skip the
    integration gate entirely, saving runner time. The label is
    automatically dropped if a subsequent push removes all
    verifier-dependent edits.

  docs/ops/mac-m4-runner-setup.md              +137 lines
    Operator runbook for the self-hosted runner: hardware
    requirements (24 GB minimum, ~50 GB free disk), runner
    registration with the kakeya-mac-m4 label, HF cache
    pre-warm command (Qwen3-0.6B), Python toolchain setup,
    runtime expectations, cache hygiene cron, runner upgrade
    procedure, and failure triage steps.

CI workflow split rationale
---------------------------
The pre-existing .github/workflows/ci.yaml stays as the Linux gate
(verifier-independent, runs on github-hosted ubuntu-latest, fires
on every PR). PR-E2 adds integration.yaml as a SEPARATE workflow
because:
  1. Self-hosted runners are slow / few; doc-only PRs shouldn't
     touch them.
  2. The integration gate is intentionally OPT-IN by label; ci.yaml
     is non-optional.
  3. Failure semantics differ: Linux gate failure blocks merge
     unconditionally; Mac M4 gate failure surfaces a structured
     report but the merge decision is a human one until v0.3.0
     final ships.

Together the two workflows form the post-cleanup gating model:
  - Linux gate (ci.yaml):
      verifier-independent code; 100% coverage; every PR.
  - Mac M4 gate (integration.yaml):
      verifier-dependent code; binding GA gate; PRs touching
      runtime / SDK / proto / integration tests.

Stack
-----
PR-E2 is branched off main, independent of the cleanup PRs (#49,
#50, #51, #52, #53, #54, #55, #56). The workflow doesn't fail at
launch even before PR-E1 lands; it just won't find any tests
under tests/integration/ until that PR is merged. Recommended
merge order: cleanup PRs first (so the workflow has tests to
run), then PR-E2.

Per ADR 0008 \u00a79
----------------
PR-E2 ships ONLY workflow YAML + a runbook \u2014 no Python source
changes. No Mac M4 evidence required for this PR (the workflow
itself becomes the Mac M4 evidence machinery for ALL future PRs).

Co-authored-by: FluffyAIcode <FluffyAIcode@users.noreply.github.com>
---
 .github/workflows/auto-label-mac.yaml |  89 +++++++++++++++++
 .github/workflows/integration.yaml    | 136 +++++++++++++++++++++++++
 docs/ops/mac-m4-runner-setup.md       | 137 ++++++++++++++++++++++++++
 3 files changed, 362 insertions(+)
 create mode 100644 .github/workflows/auto-label-mac.yaml
 create mode 100644 .github/workflows/integration.yaml
 create mode 100644 docs/ops/mac-m4-runner-setup.md

diff --git a/.github/workflows/auto-label-mac.yaml b/.github/workflows/auto-label-mac.yaml
new file mode 100644
index 0000000..511b144
--- /dev/null
+++ b/.github/workflows/auto-label-mac.yaml
@@ -0,0 +1,89 @@
+name: Auto-label needs-mac-m4
+
+# Auto-applies the ``needs-mac-m4`` label to PRs that touch
+# verifier-dependent code paths so the integration workflow
+# (.github/workflows/integration.yaml) runs without contributors
+# remembering to apply the label by hand.
+#
+# A PR opts INTO Mac M4 review by editing files under any of:
+#   inference_engine/  — runtime, scheduler, session, server, etc.
+#   sdks/              — Python + TypeScript SDK
+#   proto/             — protobuf wire contract
+#   tests/integration/ — the integration suite itself
+#   kv_cache_proposer/ — verifier + decoder
+#
+# A doc-only PR (touching only docs/, README.md, etc.) does NOT
+# trigger Mac M4 review, saving runner time.
+#
+# Once labelled, the integration workflow auto-fires; once a PR
+# lands without the label, the integration workflow auto-skips.
+
+on:
+  pull_request_target:
+    types: [opened, synchronize, reopened]
+    branches: [main]
+
+permissions:
+  pull-requests: write
+
+jobs:
+  label:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Label PRs touching verifier-dependent paths
+        uses: actions/github-script@v7
+        with:
+          github-token: ${{ secrets.GITHUB_TOKEN }}
+          script: |
+            const pr = context.payload.pull_request;
+            const labelName = "needs-mac-m4";
+
+            // Pull the diff file list. github-script gives us the
+            // full octokit; pagination matters for >100-file PRs
+            // but in practice the v0.3 PRs are well under that.
+            const { data: files } = await github.rest.pulls.listFiles({
+              owner: context.repo.owner,
+              repo: context.repo.repo,
+              pull_number: pr.number,
+              per_page: 100,
+            });
+
+            const triggers = [
+              "inference_engine/",
+              "sdks/",
+              "proto/",
+              "tests/integration/",
+              "kv_cache_proposer/",
+            ];
+
+            const matched = files.some(f =>
+              triggers.some(t => f.filename.startsWith(t))
+            );
+
+            const hasLabel = pr.labels.some(l => l.name === labelName);
+
+            if (matched && !hasLabel) {
+              core.info(`Adding ${labelName} (PR touches verifier-dependent paths).`);
+              await github.rest.issues.addLabels({
+                owner: context.repo.owner,
+                repo: context.repo.repo,
+                issue_number: pr.number,
+                labels: [labelName],
+              });
+            } else if (!matched && hasLabel) {
+              // PR was previously labelled; subsequent push removed
+              // all verifier-dependent file edits. Drop the label
+              // so the integration workflow doesn't burn runner
+              // time on doc-only updates.
+              core.info(`Removing ${labelName} (no verifier-dependent paths).`);
+              await github.rest.issues.removeLabel({
+                owner: context.repo.owner,
+                repo: context.repo.repo,
+                issue_number: pr.number,
+                name: labelName,
+              });
+            } else {
+              core.info(
+                `No-op: matched=${matched} hasLabel=${hasLabel}.`
+              );
+            }
diff --git a/.github/workflows/integration.yaml b/.github/workflows/integration.yaml
new file mode 100644
index 0000000..13773a5
--- /dev/null
+++ b/.github/workflows/integration.yaml
@@ -0,0 +1,136 @@
+name: Integration (Mac M4)
+
+# Self-hosted runner workflow that runs the integration suite under
+# tests/integration/ against real Qwen3-0.6B on Apple Silicon.
+#
+# Trigger model:
+#   - Pull-request events. Only fires when the PR carries the
+#     ``needs-mac-m4`` label (auto-applied by .github/workflows/
+#     auto-label-mac.yaml when a PR touches inference_engine/,
+#     sdks/, proto/, or tests/integration/). PRs that don't touch
+#     verifier-dependent code skip this gate entirely so the runner
+#     pool isn't burned on doc-only or CI-only PRs.
+#   - Manual workflow_dispatch for re-runs from the Actions UI.
+#
+# Runner requirements (self-hosted):
+#   - macOS 14+ on Apple Silicon (M-series).
+#   - Labels: [self-hosted, macOS, ARM64, kakeya-mac-m4].
+#   - Pre-warmed HF cache containing Qwen/Qwen3-0.6B at
+#     ~/.cache/huggingface/hub/ (avoids 10-minute first-run download).
+#   - Python 3.12+ on PATH.
+#   - At least 24 GB unified memory and ~50 GB free disk.
+#
+# See docs/ops/mac-m4-runner-setup.md for the one-time runner setup.
+
+on:
+  pull_request:
+    # Only run on PR events for branches targeting main.
+    types: [opened, synchronize, reopened, labeled]
+    branches: [main]
+  workflow_dispatch: {}
+
+# Cancel superseded runs on the same PR — saves runner time when
+# the contributor pushes a new commit before the previous run
+# finishes.
+concurrency:
+  group: integration-${{ github.ref }}
+  cancel-in-progress: true
+
+jobs:
+  integration:
+    name: pytest -m integration on Mac M4
+    # Only fire on labeled PRs (this saves the runner pool from
+    # doc-only / CI-only PRs that don't touch verifier-dependent
+    # code). The auto-label workflow adds 'needs-mac-m4' on file
+    # paths that warrant the GA gate.
+    if: |
+      github.event_name == 'workflow_dispatch' ||
+      contains(github.event.pull_request.labels.*.name, 'needs-mac-m4')
+    runs-on: [self-hosted, macOS, ARM64, kakeya-mac-m4]
+    timeout-minutes: 90
+    steps:
+      - uses: actions/checkout@v4
+        with:
+          # Full history so the runner can compare against base for
+          # any future rebase-based gating.
+          fetch-depth: 0
+
+      - name: Verify host shape
+        run: |
+          echo "=== sysctl ==="
+          sysctl -n hw.model || true
+          sysctl -n hw.memsize || true
+          sysctl -n machdep.cpu.brand_string || true
+          echo "=== python ==="
+          python3 --version
+          python3 -c "import platform; print(platform.machine(), platform.platform())"
+
+      - name: Verify Qwen3-0.6B in HF cache
+        run: |
+          # Don't download here; the runner is expected to be
+          # pre-warmed. If the model isn't cached the test loads
+          # would hit HF and exceed the 90-min timeout. Surface a
+          # clear error early.
+          set -e
+          MODEL_DIR="$HOME/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B"
+          if [ ! -d "$MODEL_DIR" ]; then
+            echo "::error::HF cache miss for Qwen/Qwen3-0.6B."
+            echo "::error::Pre-warm the runner: python3 -c 'from transformers import AutoModelForCausalLM, AutoTokenizer; AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen3-0.6B\"); AutoTokenizer.from_pretrained(\"Qwen/Qwen3-0.6B\")'"
+            exit 1
+          fi
+          echo "Found $MODEL_DIR"
+
+      - name: Install Python dependencies
+        run: |
+          # The runner is expected to have a long-lived venv.
+          # If a per-run venv is preferred, swap to ``python3 -m venv .venv``.
+          python3 -m pip install --upgrade pip
+          python3 -m pip install -e .
+          python3 -m pip install pytest pytest-asyncio pytest-timeout coverage
+
+      - name: Run integration suite
+        env:
+          PYTHONPATH: .:sdks/python
+          # No HF download in tests; if we hit a cache miss it's a
+          # bug or a stale runner.
+          HF_HUB_OFFLINE: "1"
+        run: |
+          mkdir -p results/platform-tests
+          stamp=$(date +%s)
+          python3 -m pytest \
+            -m integration \
+            tests/integration/ \
+            --junitxml="results/platform-tests/integration-mac-m4-${stamp}.junit.xml" \
+            -v
+          # Record the artifact path for the upload step below.
+          echo "artifact_stamp=${stamp}" >> "$GITHUB_OUTPUT"
+        id: pytest_run
+
+      - name: Upload JUnit + log artifacts
+        if: always()
+        uses: actions/upload-artifact@v4
+        with:
+          name: integration-mac-m4-${{ steps.pytest_run.outputs.artifact_stamp || github.run_id }}
+          path: |
+            results/platform-tests/integration-mac-m4-*.junit.xml
+          retention-days: 30
+
+      - name: Surface failure summary
+        if: failure()
+        run: |
+          # Tail the last few lines of the JUnit so the failure is
+          # visible in the action log, not just inside the artifact.
+          for f in results/platform-tests/integration-mac-m4-*.junit.xml; do
+            echo "=== $f ==="
+            python3 - "$f" <<'PY'
+          import sys, xml.etree.ElementTree as ET
+          r = ET.parse(sys.argv[1]).getroot()
+          for tc in r.iter("testcase"):
+              for child in tc:
+                  if child.tag in ("failure", "error"):
+                      print(f"[{child.tag.upper()}] {tc.get('classname')}::{tc.get('name')}")
+                      msg = (child.get("message") or "").splitlines()
+                      if msg:
+                          print(f"    {msg[0][:180]}")
+          PY
+          done
diff --git a/docs/ops/mac-m4-runner-setup.md b/docs/ops/mac-m4-runner-setup.md
new file mode 100644
index 0000000..7c10602
--- /dev/null
+++ b/docs/ops/mac-m4-runner-setup.md
@@ -0,0 +1,137 @@
+# Mac M4 self-hosted runner setup
+
+This runner backs the **Integration (Mac M4)** GitHub Actions workflow
+(`.github/workflows/integration.yaml`). It runs `pytest -m integration`
+against real Qwen3-0.6B on every PR labelled `needs-mac-m4`
+(auto-applied by `.github/workflows/auto-label-mac.yaml` when a PR
+touches `inference_engine/`, `sdks/`, `proto/`, `tests/integration/`,
+or `kv_cache_proposer/`).
+
+## Hardware requirements
+
+| Resource | Minimum |
+| --- | --- |
+| Chip | Apple Silicon (M-series); M4 or newer recommended |
+| Unified memory | 24 GB (16 GB works for Qwen3-0.6B alone but no headroom for concurrent work) |
+| Free disk | ~50 GB (HF cache + venv + checkout history) |
+| Network | Reachable to github.com for runner registration; outbound to HF Hub for the one-time pre-warm |
+| OS | macOS 14 (Sonoma) or newer |
+
+## One-time setup
+
+### 1. Register the self-hosted runner
+
+Follow [GitHub's docs](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/adding-self-hosted-runners) to add a runner to the repository:
+
+1. Repository → Settings → Actions → Runners → New self-hosted runner.
+2. Choose macOS / ARM64.
+3. Run the install + configure commands GitHub provides.
+4. **Important**: when prompted for labels, add `kakeya-mac-m4`
+   in addition to the default `self-hosted, macOS, ARM64`. The
+   workflow's `runs-on:` clause specifically requires that label.
+5. Run the runner as a launchd service (`./svc.sh install && ./svc.sh start`)
+   so it survives reboots.
+
+### 2. Pre-warm the HF cache
+
+The integration workflow runs with `HF_HUB_OFFLINE=1` so it never
+hits HuggingFace at test time (avoids 90-min runs blocking on a 4 GB
+download). Pre-warm the cache once per runner:
+
+```bash
+python3 -c "
+from transformers import AutoModelForCausalLM, AutoTokenizer
+AutoModelForCausalLM.from_pretrained('Qwen/Qwen3-0.6B')
+AutoTokenizer.from_pretrained('Qwen/Qwen3-0.6B')
+"
+```
+
+The model lands at `~/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/`.
+The workflow's "Verify Qwen3-0.6B in HF cache" step fails fast with
+a clear error if that directory is missing.
+
+If a future test adds a new model id, update the pre-warm command
+(and the workflow's verify step) accordingly.
+
+### 3. Install Python toolchain
+
+The runner needs Python 3.12+. Use Homebrew or pyenv:
+
+```bash
+brew install python@3.12
+# or:
+pyenv install 3.12.7
+pyenv global 3.12.7
+```
+
+Confirm `python3 --version` returns 3.12.x and `python3 -c 'import platform; print(platform.machine())'` returns `arm64`.
+
+### 4. (Optional) long-lived venv
+
+The workflow currently does `pip install -e .` per run, which is
+~30 s on a warm pip cache. If you want to skip even that, create a
+venv at `~/kakeya-runner-venv` and add a step to the workflow that
+activates it before `pytest`. v0.3 keeps the per-run install for
+simplicity.
+
+## Runtime expectations
+
+| Phase | Wall time on M4 24 GB |
+| --- | --- |
+| Checkout + verify host | <5 s |
+| Verify HF cache | <1 s |
+| `pip install -e .` (warm pip) | 20-40 s |
+| `pytest -m integration` (80 tests, post-PR-N1..N4) | 60-120 s |
+| Artifact upload | <5 s |
+| **Total** | **~2-3 min** |
+
+The 90-minute timeout in the workflow is a safety margin. A run
+that exceeds 5 min should be investigated — likely a model-load
+regression or a runaway test.
+
+## Maintenance
+
+### Cache hygiene
+
+The runner's HF cache and pip cache grow over time. Recommend a
+monthly cron:
+
+```bash
+# ~/clean-kakeya-runner.sh
+find ~/.cache/huggingface/hub -type d -mtime +60 -prune -name 'models--*' -exec rm -rf {} +
+python3 -m pip cache purge
+```
+
+The Qwen3-0.6B cache is touched on every run, so `mtime +60` only
+prunes models added by future test additions that aren't currently
+exercised.
+
+### Runner upgrades
+
+GitHub publishes new runner versions ~monthly. Update via:
+
+```bash
+cd ~/actions-runner
+./svc.sh stop
+./config.sh remove --token <repo-config-token>
+# download the new tarball per GitHub UI instructions
+./config.sh --url https://github.com/<owner>/<repo> --token <new-token>
+./svc.sh install && ./svc.sh start
+```
+
+### Failure triage
+
+Workflow failures are visible at `Actions → Integration (Mac M4)`. The "Surface failure summary" step inlines the test names + first-line error messages so triage doesn't require downloading the JUnit XML.
+
+If the runner itself is offline (queue depth grows, no jobs pick up), check on the Mac:
+
+```bash
+cd ~/actions-runner
+sudo ./svc.sh status
+tail -200 ~/Library/Logs/actions-runner/Runner_*.log
+```
+
+Common causes:
+- macOS auto-update rebooted the host; service didn't auto-start (rare with `launchd` but possible).
+- HF cache was purged; the verify step fails. Re-warm.
+- Disk full from accumulated pip downloads; clear cache.