From a0397c9adec06f867cdb7298ede438af3c63c6e9 Mon Sep 17 00:00:00 2001 From: Cursor Agent Date: Tue, 2 Jun 2026 04:11:43 +0000 Subject: [PATCH] PR-E2 (ADR 0008 \u00a76.5): self-hosted Mac M4 GitHub Actions integration workflow Closes the loop on automated GA gating. After PR-N1..N4 retired all verifier-protocol test doubles from the Linux gate, the integration suite (tests/integration/) became the binding correctness gate for runtime modules \u2014 inference_engine.session.coordinator, inference_engine.session.generator, inference_engine.scheduler.scheduler, inference_engine.server.{app,engine,tokenizer,streaming}, and kakeya.{client,session}. Until this PR, that suite ran manually via scripts/review_pr_*_on_mac.sh; PR-E2 wires it into CI on every PR labelled needs-mac-m4. Three artifacts ship: .github/workflows/integration.yaml +136 lines Self-hosted runner workflow targeting [self-hosted, macOS, ARM64, kakeya-mac-m4]. Triggers on PR events when the needs-mac-m4 label is present, plus on workflow_dispatch for manual re-runs. Steps: 1. Checkout (full history). 2. Verify host shape (chip, memory, python version). 3. Verify Qwen/Qwen3-0.6B is in HF cache (HF_HUB_OFFLINE=1 at test time \u2014 no downloads in CI; cache miss fails fast with a clear pre-warm command). 4. pip install -e . + pytest dependencies (warm pip cache keeps this <30 s). 5. pytest -m integration tests/integration/ \u2014 expected runtime 60-120 s on M4 with warm cache. 90-min timeout is a safety margin, not the operating point. 6. Upload JUnit XML artifact. 7. On failure, inline the test names + first-line error messages into the Action log so triage doesn't require downloading the artifact. Concurrency: cancel-in-progress per PR, so a new push supersedes the previous run. .github/workflows/auto-label-mac.yaml +89 lines pull_request_target workflow that auto-applies (or removes) the needs-mac-m4 label based on which paths the PR touches. Trigger paths: inference_engine/ \u2014 runtime, scheduler, session, server sdks/ \u2014 Python + TypeScript SDK proto/ \u2014 wire contract tests/integration/ \u2014 the integration suite itself kv_cache_proposer/ \u2014 verifier + decoder Doc-only or CI-only PRs are NOT labelled \u2014 they skip the integration gate entirely, saving runner time. The label is automatically dropped if a subsequent push removes all verifier-dependent edits. docs/ops/mac-m4-runner-setup.md +137 lines Operator runbook for the self-hosted runner: hardware requirements (24 GB minimum, ~50 GB free disk), runner registration with the kakeya-mac-m4 label, HF cache pre-warm command (Qwen3-0.6B), Python toolchain setup, runtime expectations, cache hygiene cron, runner upgrade procedure, and failure triage steps. CI workflow split rationale --------------------------- The pre-existing .github/workflows/ci.yaml stays as the Linux gate (verifier-independent, runs on github-hosted ubuntu-latest, fires on every PR). PR-E2 adds integration.yaml as a SEPARATE workflow because: 1. Self-hosted runners are slow / few; doc-only PRs shouldn't touch them. 2. The integration gate is intentionally OPT-IN by label; ci.yaml is non-optional. 3. Failure semantics differ: Linux gate failure blocks merge unconditionally; Mac M4 gate failure surfaces a structured report but the merge decision is a human one until v0.3.0 final ships. Together the two workflows form the post-cleanup gating model: - Linux gate (ci.yaml): verifier-independent code; 100% coverage; every PR. - Mac M4 gate (integration.yaml): verifier-dependent code; binding GA gate; PRs touching runtime / SDK / proto / integration tests. Stack ----- PR-E2 is branched off main, independent of the cleanup PRs (#49, #50, #51, #52, #53, #54, #55, #56). The workflow doesn't fail at launch even before PR-E1 lands; it just won't find any tests under tests/integration/ until that PR is merged. Recommended merge order: cleanup PRs first (so the workflow has tests to run), then PR-E2. Per ADR 0008 \u00a79 ---------------- PR-E2 ships ONLY workflow YAML + a runbook \u2014 no Python source changes. No Mac M4 evidence required for this PR (the workflow itself becomes the Mac M4 evidence machinery for ALL future PRs). Co-authored-by: FluffyAIcode --- .github/workflows/auto-label-mac.yaml | 89 +++++++++++++++++ .github/workflows/integration.yaml | 136 +++++++++++++++++++++++++ docs/ops/mac-m4-runner-setup.md | 137 ++++++++++++++++++++++++++ 3 files changed, 362 insertions(+) create mode 100644 .github/workflows/auto-label-mac.yaml create mode 100644 .github/workflows/integration.yaml create mode 100644 docs/ops/mac-m4-runner-setup.md diff --git a/.github/workflows/auto-label-mac.yaml b/.github/workflows/auto-label-mac.yaml new file mode 100644 index 0000000..511b144 --- /dev/null +++ b/.github/workflows/auto-label-mac.yaml @@ -0,0 +1,89 @@ +name: Auto-label needs-mac-m4 + +# Auto-applies the ``needs-mac-m4`` label to PRs that touch +# verifier-dependent code paths so the integration workflow +# (.github/workflows/integration.yaml) runs without contributors +# remembering to apply the label by hand. +# +# A PR opts INTO Mac M4 review by editing files under any of: +# inference_engine/ — runtime, scheduler, session, server, etc. +# sdks/ — Python + TypeScript SDK +# proto/ — protobuf wire contract +# tests/integration/ — the integration suite itself +# kv_cache_proposer/ — verifier + decoder +# +# A doc-only PR (touching only docs/, README.md, etc.) does NOT +# trigger Mac M4 review, saving runner time. +# +# Once labelled, the integration workflow auto-fires; once a PR +# lands without the label, the integration workflow auto-skips. + +on: + pull_request_target: + types: [opened, synchronize, reopened] + branches: [main] + +permissions: + pull-requests: write + +jobs: + label: + runs-on: ubuntu-latest + steps: + - name: Label PRs touching verifier-dependent paths + uses: actions/github-script@v7 + with: + github-token: ${{ secrets.GITHUB_TOKEN }} + script: | + const pr = context.payload.pull_request; + const labelName = "needs-mac-m4"; + + // Pull the diff file list. github-script gives us the + // full octokit; pagination matters for >100-file PRs + // but in practice the v0.3 PRs are well under that. + const { data: files } = await github.rest.pulls.listFiles({ + owner: context.repo.owner, + repo: context.repo.repo, + pull_number: pr.number, + per_page: 100, + }); + + const triggers = [ + "inference_engine/", + "sdks/", + "proto/", + "tests/integration/", + "kv_cache_proposer/", + ]; + + const matched = files.some(f => + triggers.some(t => f.filename.startsWith(t)) + ); + + const hasLabel = pr.labels.some(l => l.name === labelName); + + if (matched && !hasLabel) { + core.info(`Adding ${labelName} (PR touches verifier-dependent paths).`); + await github.rest.issues.addLabels({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: pr.number, + labels: [labelName], + }); + } else if (!matched && hasLabel) { + // PR was previously labelled; subsequent push removed + // all verifier-dependent file edits. Drop the label + // so the integration workflow doesn't burn runner + // time on doc-only updates. + core.info(`Removing ${labelName} (no verifier-dependent paths).`); + await github.rest.issues.removeLabel({ + owner: context.repo.owner, + repo: context.repo.repo, + issue_number: pr.number, + name: labelName, + }); + } else { + core.info( + `No-op: matched=${matched} hasLabel=${hasLabel}.` + ); + } diff --git a/.github/workflows/integration.yaml b/.github/workflows/integration.yaml new file mode 100644 index 0000000..13773a5 --- /dev/null +++ b/.github/workflows/integration.yaml @@ -0,0 +1,136 @@ +name: Integration (Mac M4) + +# Self-hosted runner workflow that runs the integration suite under +# tests/integration/ against real Qwen3-0.6B on Apple Silicon. +# +# Trigger model: +# - Pull-request events. Only fires when the PR carries the +# ``needs-mac-m4`` label (auto-applied by .github/workflows/ +# auto-label-mac.yaml when a PR touches inference_engine/, +# sdks/, proto/, or tests/integration/). PRs that don't touch +# verifier-dependent code skip this gate entirely so the runner +# pool isn't burned on doc-only or CI-only PRs. +# - Manual workflow_dispatch for re-runs from the Actions UI. +# +# Runner requirements (self-hosted): +# - macOS 14+ on Apple Silicon (M-series). +# - Labels: [self-hosted, macOS, ARM64, kakeya-mac-m4]. +# - Pre-warmed HF cache containing Qwen/Qwen3-0.6B at +# ~/.cache/huggingface/hub/ (avoids 10-minute first-run download). +# - Python 3.12+ on PATH. +# - At least 24 GB unified memory and ~50 GB free disk. +# +# See docs/ops/mac-m4-runner-setup.md for the one-time runner setup. + +on: + pull_request: + # Only run on PR events for branches targeting main. + types: [opened, synchronize, reopened, labeled] + branches: [main] + workflow_dispatch: {} + +# Cancel superseded runs on the same PR — saves runner time when +# the contributor pushes a new commit before the previous run +# finishes. +concurrency: + group: integration-${{ github.ref }} + cancel-in-progress: true + +jobs: + integration: + name: pytest -m integration on Mac M4 + # Only fire on labeled PRs (this saves the runner pool from + # doc-only / CI-only PRs that don't touch verifier-dependent + # code). The auto-label workflow adds 'needs-mac-m4' on file + # paths that warrant the GA gate. + if: | + github.event_name == 'workflow_dispatch' || + contains(github.event.pull_request.labels.*.name, 'needs-mac-m4') + runs-on: [self-hosted, macOS, ARM64, kakeya-mac-m4] + timeout-minutes: 90 + steps: + - uses: actions/checkout@v4 + with: + # Full history so the runner can compare against base for + # any future rebase-based gating. + fetch-depth: 0 + + - name: Verify host shape + run: | + echo "=== sysctl ===" + sysctl -n hw.model || true + sysctl -n hw.memsize || true + sysctl -n machdep.cpu.brand_string || true + echo "=== python ===" + python3 --version + python3 -c "import platform; print(platform.machine(), platform.platform())" + + - name: Verify Qwen3-0.6B in HF cache + run: | + # Don't download here; the runner is expected to be + # pre-warmed. If the model isn't cached the test loads + # would hit HF and exceed the 90-min timeout. Surface a + # clear error early. + set -e + MODEL_DIR="$HOME/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B" + if [ ! -d "$MODEL_DIR" ]; then + echo "::error::HF cache miss for Qwen/Qwen3-0.6B." + echo "::error::Pre-warm the runner: python3 -c 'from transformers import AutoModelForCausalLM, AutoTokenizer; AutoModelForCausalLM.from_pretrained(\"Qwen/Qwen3-0.6B\"); AutoTokenizer.from_pretrained(\"Qwen/Qwen3-0.6B\")'" + exit 1 + fi + echo "Found $MODEL_DIR" + + - name: Install Python dependencies + run: | + # The runner is expected to have a long-lived venv. + # If a per-run venv is preferred, swap to ``python3 -m venv .venv``. + python3 -m pip install --upgrade pip + python3 -m pip install -e . + python3 -m pip install pytest pytest-asyncio pytest-timeout coverage + + - name: Run integration suite + env: + PYTHONPATH: .:sdks/python + # No HF download in tests; if we hit a cache miss it's a + # bug or a stale runner. + HF_HUB_OFFLINE: "1" + run: | + mkdir -p results/platform-tests + stamp=$(date +%s) + python3 -m pytest \ + -m integration \ + tests/integration/ \ + --junitxml="results/platform-tests/integration-mac-m4-${stamp}.junit.xml" \ + -v + # Record the artifact path for the upload step below. + echo "artifact_stamp=${stamp}" >> "$GITHUB_OUTPUT" + id: pytest_run + + - name: Upload JUnit + log artifacts + if: always() + uses: actions/upload-artifact@v4 + with: + name: integration-mac-m4-${{ steps.pytest_run.outputs.artifact_stamp || github.run_id }} + path: | + results/platform-tests/integration-mac-m4-*.junit.xml + retention-days: 30 + + - name: Surface failure summary + if: failure() + run: | + # Tail the last few lines of the JUnit so the failure is + # visible in the action log, not just inside the artifact. + for f in results/platform-tests/integration-mac-m4-*.junit.xml; do + echo "=== $f ===" + python3 - "$f" <<'PY' + import sys, xml.etree.ElementTree as ET + r = ET.parse(sys.argv[1]).getroot() + for tc in r.iter("testcase"): + for child in tc: + if child.tag in ("failure", "error"): + print(f"[{child.tag.upper()}] {tc.get('classname')}::{tc.get('name')}") + msg = (child.get("message") or "").splitlines() + if msg: + print(f" {msg[0][:180]}") + PY + done diff --git a/docs/ops/mac-m4-runner-setup.md b/docs/ops/mac-m4-runner-setup.md new file mode 100644 index 0000000..7c10602 --- /dev/null +++ b/docs/ops/mac-m4-runner-setup.md @@ -0,0 +1,137 @@ +# Mac M4 self-hosted runner setup + +This runner backs the **Integration (Mac M4)** GitHub Actions workflow +(`.github/workflows/integration.yaml`). It runs `pytest -m integration` +against real Qwen3-0.6B on every PR labelled `needs-mac-m4` +(auto-applied by `.github/workflows/auto-label-mac.yaml` when a PR +touches `inference_engine/`, `sdks/`, `proto/`, `tests/integration/`, +or `kv_cache_proposer/`). + +## Hardware requirements + +| Resource | Minimum | +| --- | --- | +| Chip | Apple Silicon (M-series); M4 or newer recommended | +| Unified memory | 24 GB (16 GB works for Qwen3-0.6B alone but no headroom for concurrent work) | +| Free disk | ~50 GB (HF cache + venv + checkout history) | +| Network | Reachable to github.com for runner registration; outbound to HF Hub for the one-time pre-warm | +| OS | macOS 14 (Sonoma) or newer | + +## One-time setup + +### 1. Register the self-hosted runner + +Follow [GitHub's docs](https://docs.github.com/en/actions/hosting-your-own-runners/managing-self-hosted-runners/adding-self-hosted-runners) to add a runner to the repository: + +1. Repository → Settings → Actions → Runners → New self-hosted runner. +2. Choose macOS / ARM64. +3. Run the install + configure commands GitHub provides. +4. **Important**: when prompted for labels, add `kakeya-mac-m4` + in addition to the default `self-hosted, macOS, ARM64`. The + workflow's `runs-on:` clause specifically requires that label. +5. Run the runner as a launchd service (`./svc.sh install && ./svc.sh start`) + so it survives reboots. + +### 2. Pre-warm the HF cache + +The integration workflow runs with `HF_HUB_OFFLINE=1` so it never +hits HuggingFace at test time (avoids 90-min runs blocking on a 4 GB +download). Pre-warm the cache once per runner: + +```bash +python3 -c " +from transformers import AutoModelForCausalLM, AutoTokenizer +AutoModelForCausalLM.from_pretrained('Qwen/Qwen3-0.6B') +AutoTokenizer.from_pretrained('Qwen/Qwen3-0.6B') +" +``` + +The model lands at `~/.cache/huggingface/hub/models--Qwen--Qwen3-0.6B/`. +The workflow's "Verify Qwen3-0.6B in HF cache" step fails fast with +a clear error if that directory is missing. + +If a future test adds a new model id, update the pre-warm command +(and the workflow's verify step) accordingly. + +### 3. Install Python toolchain + +The runner needs Python 3.12+. Use Homebrew or pyenv: + +```bash +brew install python@3.12 +# or: +pyenv install 3.12.7 +pyenv global 3.12.7 +``` + +Confirm `python3 --version` returns 3.12.x and `python3 -c 'import platform; print(platform.machine())'` returns `arm64`. + +### 4. (Optional) long-lived venv + +The workflow currently does `pip install -e .` per run, which is +~30 s on a warm pip cache. If you want to skip even that, create a +venv at `~/kakeya-runner-venv` and add a step to the workflow that +activates it before `pytest`. v0.3 keeps the per-run install for +simplicity. + +## Runtime expectations + +| Phase | Wall time on M4 24 GB | +| --- | --- | +| Checkout + verify host | <5 s | +| Verify HF cache | <1 s | +| `pip install -e .` (warm pip) | 20-40 s | +| `pytest -m integration` (80 tests, post-PR-N1..N4) | 60-120 s | +| Artifact upload | <5 s | +| **Total** | **~2-3 min** | + +The 90-minute timeout in the workflow is a safety margin. A run +that exceeds 5 min should be investigated — likely a model-load +regression or a runaway test. + +## Maintenance + +### Cache hygiene + +The runner's HF cache and pip cache grow over time. Recommend a +monthly cron: + +```bash +# ~/clean-kakeya-runner.sh +find ~/.cache/huggingface/hub -type d -mtime +60 -prune -name 'models--*' -exec rm -rf {} + +python3 -m pip cache purge +``` + +The Qwen3-0.6B cache is touched on every run, so `mtime +60` only +prunes models added by future test additions that aren't currently +exercised. + +### Runner upgrades + +GitHub publishes new runner versions ~monthly. Update via: + +```bash +cd ~/actions-runner +./svc.sh stop +./config.sh remove --token +# download the new tarball per GitHub UI instructions +./config.sh --url https://github.com// --token +./svc.sh install && ./svc.sh start +``` + +### Failure triage + +Workflow failures are visible at `Actions → Integration (Mac M4)`. The "Surface failure summary" step inlines the test names + first-line error messages so triage doesn't require downloading the JUnit XML. + +If the runner itself is offline (queue depth grows, no jobs pick up), check on the Mac: + +```bash +cd ~/actions-runner +sudo ./svc.sh status +tail -200 ~/Library/Logs/actions-runner/Runner_*.log +``` + +Common causes: +- macOS auto-update rebooted the host; service didn't auto-start (rare with `launchd` but possible). +- HF cache was purged; the verify step fails. Re-warm. +- Disk full from accumulated pip downloads; clear cache.