From 3aaf674f5b504aa3ce4fdd6a8ebc4e6b5a071954 Mon Sep 17 00:00:00 2001
From: zackees <zachvorhies@protonmail.com>
Date: Sat, 20 Jun 2026 14:48:29 -0700
Subject: [PATCH] feat(online-data): add pio-boards + vendor_boards datasets to
 nightly
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Extends the existing `nightly-online-data` workflow (formerly
`nightly-usb-ids`) to also refresh the PlatformIO board catalog.
The workflow is renamed in the `name:` field; the file path stays
`nightly-usb-ids.yml` to preserve the workflow's existing identity
in GitHub's UI (workflow run history is keyed by file path).

Pipeline additions on `online-data`:
  - data/pio-boards.json     full PlatformIO board catalog
                             (~1600 boards × ~10 fields = ~850 KB)
  - data/vendor_boards.json  slim {vendor, name, mcu} view (~200 KB)
                             for cheap "what board is plugged in?" lookups
  - tools/dump_platformio.py runs `pio boards --json-output`, normalizes
                             the result into a sorted id-keyed map
  - tools/merge_pio_boards.py deep-unions new + previously-committed
                             dump so transient field drops in `pio boards`
                             output don't get propagated (preserves the
                             field even if the new dump regressed)
  - tools/build_manifest.py  refactored to auto-discover every
                             `data/*.json` and bind it as a dataset entry;
                             per-dataset metadata (description, sources)
                             still comes from fragment files written by
                             each merger.
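
A minimal sketch of the deep-union idea behind `merge_pio_boards.py`, for
reviewers who have not looked at the `online-data` branch. The real script
is not part of this patch, so the function names, the "new value wins on
conflict" rule, and the sample board record are assumptions made for
illustration only:

```python
from typing import Any


def deep_union(old: Any, new: Any) -> Any:
    """Merge two JSON-like values: new wins on conflicts, old fills gaps."""
    if isinstance(old, dict) and isinstance(new, dict):
        merged = dict(old)
        for key, value in new.items():
            merged[key] = deep_union(old[key], value) if key in old else value
        return merged
    # Scalars, lists, and mismatched types: trust the fresh dump.
    return new


def slim_view(boards: dict[str, dict]) -> dict[str, dict]:
    """Project each board down to the {vendor, name, mcu} fields."""
    return {
        board_id: {k: board.get(k) for k in ("vendor", "name", "mcu")}
        for board_id, board in sorted(boards.items())
    }


# Hypothetical records: the new dump has lost the `debug` field.
old = {"uno": {"vendor": "Arduino", "name": "Arduino Uno",
               "mcu": "ATMEGA328P", "debug": {"tools": ["simavr"]}}}
new = {"uno": {"vendor": "Arduino", "name": "Arduino Uno",
               "mcu": "ATMEGA328P"}}
merged = deep_union(old, new)  # `debug` survives from the old copy
```

With this rule a field that vanishes from one nightly dump stays in the
published file, at the cost that a field removed upstream on purpose
also lingers until someone deletes it by hand.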

Fault tolerance unchanged: any single source failure (cargo build, curl,
pio dump) is non-fatal; the merger downstream sees only the sources that
arrived intact; each merger refuses to write its data file when the result
falls below a sanity floor (1000 entries for USB-VID, 1500 for boards).
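
The sanity-floor guard can be sketched roughly as follows. This is an
illustration under assumptions, not the code on `online-data`: the helper
name `write_if_plausible` and the write-to-temp-then-rename step are
hypothetical; only the floor values come from this patch.

```python
import json
import os
import tempfile


def write_if_plausible(entries: dict, path: str, floor: int) -> bool:
    """Write `entries` to `path` only if it has at least `floor` keys."""
    if len(entries) < floor:
        # Implausibly small result: leave the committed file untouched.
        return False
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename into place,
    # so a crash mid-write cannot leave a truncated data file behind.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(entries, handle, sort_keys=True, indent=2)
        handle.write("\n")
    os.replace(tmp, path)
    return True
```

A caller would pass `floor=1000` for the USB-VID map and `floor=1500` for
the board catalog, and exit non-zero when the function returns `False`.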

Goal acceptance:
- isolated end-to-end test: ✓ all four datasets emitted,
  vendor_boards entries verified, merger preserves old fields on the
  synthetic regression test, build_manifest auto-discovers all *.json.
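
For orientation, the auto-discovery checked above could look roughly like
the sketch below. The actual `build_manifest.py` is not in this patch, so
everything beyond the `generated_at` key and the `datasets.<name>.url`
field is an assumption, including the function signature and how fragment
metadata is merged in:

```python
from datetime import datetime, timezone
from pathlib import Path


def build_manifest(data_dir: Path, base_url: str,
                   fragments: dict[str, dict]) -> dict:
    """Bind every data/*.json file as a dataset entry in the manifest."""
    datasets = {}
    for path in sorted(data_dir.glob("*.json")):
        name = path.stem
        # Start with the per-dataset metadata written by the merger (if any).
        entry = dict(fragments.get(name, {}))
        # The URL is always derived from the file name, never from a fragment.
        entry["url"] = f"{base_url}/data/{path.name}"
        datasets[name] = entry
    return {
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "datasets": datasets,
    }
```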
---
 .github/workflows/nightly-usb-ids.yml | 210 ++++++++++++++++++--------
 docs/online-data.md                   |  30 +++-
 2 files changed, 168 insertions(+), 72 deletions(-)

diff --git a/.github/workflows/nightly-usb-ids.yml b/.github/workflows/nightly-usb-ids.yml
index 1e81eeb4..43239bfb 100644
--- a/.github/workflows/nightly-usb-ids.yml
+++ b/.github/workflows/nightly-usb-ids.yml
@@ -1,32 +1,47 @@
-# Nightly refresh of the `online-data` branch's USB VID:PID database.
+# Nightly refresh of the `online-data` branch's published datasets.
 #
-# The tooling (Python merger, README, data files) lives on the orphan
-# `online-data` branch — NOT on `main`. This workflow file lives on `main`
-# only because GitHub Actions requires `schedule` and `workflow_dispatch`
-# triggers to be defined on the default branch. At runtime the job:
+# Today the branch carries two datasets — USB VID:PID name resolution and
+# the PlatformIO board catalog. The workflow file lives on `main` only
+# because GitHub Actions requires `schedule` / `workflow_dispatch` to be
+# defined on the default branch. All actual data + the merger scripts
+# live on the orphan `online-data` branch (see `docs/online-data.md`).
+#
+# At runtime the job:
 #
 #   1. checks out `main` (default) so it can build the `dump_usb_ids`
 #      example from `crates/fbuild-core/examples/dump_usb_ids.rs`;
 #   2. fetches + worktrees the `online-data` branch into a sibling dir so
-#      the merger script lives at `online-data/tools/merge_sources.py`;
-#   3. dumps the bundled `usb-ids` Rust crate to JSON;
-#   4. downloads several upstream `usb.ids` text mirrors (fault-tolerant —
-#      a single source failure does NOT abort the run);
-#   5. runs the merger to produce sorted `usb-vid.json`,
-#      `usb-vid-conflicts.json`, and a future-forward `manifest.json`;
-#   6. commits the resulting data files back to `online-data` if they
-#      actually changed, force-pushing only after history pruning.
+#      the merger scripts live at `online-data/tools/{merge_sources,
+#      merge_pio_boards,build_manifest}.py`;
+#   3. produces, one step per source (steps in a job run in order):
+#        - `usb-ids` Rust crate dump (tier-1 USB-VID source)
+#        - two upstream `usb.ids` text mirror fetches
+#        - the full PlatformIO board catalog (`pio boards --json-output`)
+#      Each source has its own step + `continue-on-error: true` so any
+#      single failure is non-fatal — the merger downstream sees only the
+#      sources that actually arrived intact;
+#   4. runs the USB-VID merger → sorted `usb-vid.json` + conflict log +
+#      per-dataset manifest fragment;
+#   5. runs the PlatformIO board merger → `pio-boards.json` (deep-union
+#      with the previously committed copy so transient field drops in
+#      `pio boards` don't lose data) + per-dataset manifest fragment;
+#   6. assembles the future-forward `manifest.json` from both fragments;
+#   7. commits + pushes only if any data file actually changed, with
+#      history pruned to 200 commits.
 #
 # Fault tolerance summary:
-#   - Rust build failure → keep the existing committed data (no commit).
-#   - Any individual upstream fetch failure → workflow continues with the
-#     sources that succeeded; merger refuses to write if the union is
-#     implausibly small (< 1000 entries) and the existing data stays put.
+#   - Any single source failure → workflow continues with the rest.
+#   - USB-VID merger refuses to write below 1000 entries.
+#   - PIO merger refuses to write below 1500 boards AND deep-unions with
+#     the previously committed data so a field dropped upstream is restored
+#     from the last committed copy.
+#   - All-sources-fail → no commit happens; the existing online-data
+#     branch keeps its last good snapshot.
 #   - History is pruned to the most recent 200 commits per the design.
 #
-# Manual trigger: Actions tab → "Nightly USB IDs refresh" → Run workflow.
+# Manual trigger: Actions tab → "Nightly online-data refresh" → Run.
 
-name: Nightly USB IDs refresh
+name: Nightly online-data refresh
 
 on:
   schedule:
@@ -39,7 +54,7 @@ permissions:
   contents: write
 
 concurrency:
-  group: nightly-usb-ids
+  group: nightly-online-data
   cancel-in-progress: false
 
 env:
@@ -50,15 +65,12 @@ env:
 
 jobs:
   refresh:
-    name: Refresh online-data/usb-vid.json
+    name: Refresh online-data datasets
     runs-on: ubuntu-latest
     steps:
       - name: Checkout main (default branch)
         uses: actions/checkout@v6
         with:
-          # We need the git history available so `git worktree add` against
-          # the `online-data` branch works, and so the history-prune step
-          # can rewrite commits without confusing a shallow clone.
           fetch-depth: 0
 
       - name: Configure git identity for the commit
@@ -67,9 +79,6 @@ jobs:
           git config user.email "fbuild-bot+nightly@users.noreply.github.com"
 
       - name: Fetch + worktree the online-data branch
-        # Creates a sibling directory containing the orphan branch. If the
-        # branch does not yet exist on the remote (very first run), we
-        # bootstrap an empty orphan worktree so the rest of the job works.
         run: |
           set -euo pipefail
           if git ls-remote --heads origin "${ONLINE_BRANCH}" | grep -q .; then
@@ -80,6 +89,7 @@ jobs:
             git worktree add --detach "${ONLINE_WORKTREE}"
             (cd "${ONLINE_WORKTREE}" && git checkout --orphan "${ONLINE_BRANCH}" && git rm -rf . 2>/dev/null || true)
           fi
+          mkdir -p "${ONLINE_WORKTREE}/data"
           ls -la "${ONLINE_WORKTREE}"
 
       - uses: astral-sh/setup-uv@v3
@@ -93,15 +103,19 @@ jobs:
           prebuild-deps: none
           linker: platform-default
 
-      - name: Build dump_usb_ids example (tier-1 source)
+      # ────────────────────────────────────────────────────────────────────
+      # Data-source acquisition. Each fetch is its own step so
+      # `steps.<id>.outcome` cleanly attributes blame; the merge step
+      # downstream consumes only sources that succeeded. The Rust build
+      # is the longest step (~1–2 min cold, seconds warm); the pio dump
+      # and curl fetches are each <90 s. Steps within a job run one after
+      # another, so the wall-time cost is the sum over all sources.
+      # ────────────────────────────────────────────────────────────────────
+
+      - name: Build dump_usb_ids example (USB-VID tier-1)
         id: build-dump
-        # Failure is tolerated: we still try to merge whatever upstream
-        # text sources arrived this run. The merger will fall back to the
-        # previously committed data if too few entries survive.
         continue-on-error: true
-        run: |
-          set -euo pipefail
-          soldr cargo build --release --example dump_usb_ids -p fbuild-core
+        run: soldr cargo build --release --example dump_usb_ids -p fbuild-core
 
       - name: Run dump_usb_ids → /tmp/usb-ids-rs.json
         id: run-dump
@@ -112,7 +126,7 @@ jobs:
           ./target/release/examples/dump_usb_ids > /tmp/usb-ids-rs.json
           wc -l /tmp/usb-ids-rs.json
 
-      - name: Fetch linux-usb.org/usb.ids (tier-2)
+      - name: Fetch linux-usb.org/usb.ids (USB-VID tier-2)
         id: fetch-linux-usb
         continue-on-error: true
         run: |
@@ -123,7 +137,7 @@ jobs:
             "http://www.linux-usb.org/usb.ids"
           wc -l /tmp/linux-usb.txt
 
-      - name: Fetch usbids/usbids GitHub mirror (tier-3)
+      - name: Fetch usbids/usbids GitHub mirror (USB-VID tier-3)
         id: fetch-github
         continue-on-error: true
         run: |
@@ -133,8 +147,34 @@ jobs:
             "https://raw.githubusercontent.com/usbids/usbids/master/usb.ids"
           wc -l /tmp/usbids-github.txt
 
-      - name: Run merger (only if at least one source loaded)
-        id: merge
+      - name: Dump PlatformIO board catalog → /tmp/all_boards.json
+        id: dump-pio
+        continue-on-error: true
+        run: |
+          # `dump_platformio.py` declares `platformio` as an inline
+          # dependency so `uv run --no-project --script` materializes it
+          # in an ephemeral env. No global pio install needed.
+          uv run --no-project --script \
+            "${ONLINE_WORKTREE}/tools/dump_platformio.py" \
+            /tmp/all_boards.json
+          # jq isn't on minimal runners — use python for the sanity print.
+          uv run --no-project --script - "/tmp/all_boards.json" <<'PY'
+          # /// script
+          # requires-python = ">=3.10"
+          # ///
+          import json, sys
+          data = json.loads(open(sys.argv[1], encoding="utf-8").read())
+          print(f"pio boards: {len(data)} entries")
+          PY
+
+      # ────────────────────────────────────────────────────────────────────
+      # Per-dataset merge steps. Each writes its own data file + a
+      # manifest fragment. The fragments are then consumed by
+      # build_manifest.py to assemble the unified manifest.json.
+      # ────────────────────────────────────────────────────────────────────
+
+      - name: Merge USB-VID sources
+        id: merge-usb
         continue-on-error: true
         run: |
           set -euo pipefail
@@ -149,29 +189,64 @@ jobs:
             args+=(--txt "usbids-github=/tmp/usbids-github.txt")
           fi
           if [ "${#args[@]}" -eq 0 ]; then
-            echo "::error::all sources failed; preserving previously committed data"
+            echo "::warning::all USB-VID sources failed; preserving previously committed data"
             exit 1
           fi
+          mkdir -p /tmp/fragments
           uv run --no-project --script \
             "${ONLINE_WORKTREE}/tools/merge_sources.py" \
             "${args[@]}" \
             --out-dir "${ONLINE_WORKTREE}/data" \
-            --branch-base-url "${BRANCH_BASE_URL}"
-
-      - name: Refresh manifest.json (always — even if data unchanged)
-        # The manifest carries `generated_at`, so we always update it; that
-        # gives the branch a heartbeat for downstream consumers even on a
-        # no-op data day. If the merge step failed we deliberately skip
-        # this — we don't want to advertise stale `sources` listings.
-        if: steps.merge.outcome == 'success'
+            --branch-base-url "${BRANCH_BASE_URL}" \
+            --manifest-fragment /tmp/fragments/usb-vid.json
+
+      - name: Merge PlatformIO board dump (full + slim vendor view)
+        id: merge-pio
+        continue-on-error: true
+        if: steps.dump-pio.outcome == 'success'
+        run: |
+          set -euo pipefail
+          mkdir -p /tmp/fragments
+          uv run --no-project --script \
+            "${ONLINE_WORKTREE}/tools/merge_pio_boards.py" \
+            --new /tmp/all_boards.json \
+            --old "${ONLINE_WORKTREE}/data/pio-boards.json" \
+            --out "${ONLINE_WORKTREE}/data/pio-boards.json" \
+            --out-slim "${ONLINE_WORKTREE}/data/vendor_boards.json" \
+            --manifest-fragment /tmp/fragments/pio-boards.json \
+            --manifest-fragment-slim /tmp/fragments/vendor_boards.json
+
+      - name: Assemble manifest.json
+        id: build-manifest
+        # We rebuild the manifest whenever at least one dataset succeeded,
+        # so generated_at moves even on a no-op data day (heartbeat).
+        # Datasets that didn't merge this run get marked status=missing in
+        # the manifest but keep their committed data file untouched.
+        if: |
+          steps.merge-usb.outcome == 'success' ||
+          steps.merge-pio.outcome == 'success'
         run: |
-          if [ -f "${ONLINE_WORKTREE}/data/manifest.json" ]; then
-            mv "${ONLINE_WORKTREE}/data/manifest.json" "${ONLINE_WORKTREE}/manifest.json"
+          set -euo pipefail
+          fragments=()
+          if [ -f /tmp/fragments/usb-vid.json ]; then
+            fragments+=(--fragment "usb-vid=/tmp/fragments/usb-vid.json")
+          fi
+          if [ -f /tmp/fragments/pio-boards.json ]; then
+            fragments+=(--fragment "pio-boards=/tmp/fragments/pio-boards.json")
           fi
+          if [ -f /tmp/fragments/vendor_boards.json ]; then
+            fragments+=(--fragment "vendor_boards=/tmp/fragments/vendor_boards.json")
+          fi
+          uv run --no-project --script \
+            "${ONLINE_WORKTREE}/tools/build_manifest.py" \
+            --branch-base-url "${BRANCH_BASE_URL}" \
+            --data-dir "${ONLINE_WORKTREE}/data" \
+            --out "${ONLINE_WORKTREE}/manifest.json" \
+            "${fragments[@]}"
 
       - name: Commit + push if data actually changed
         id: commit
-        if: steps.merge.outcome == 'success'
+        if: steps.build-manifest.outcome == 'success'
         working-directory: ${{ env.ONLINE_WORKTREE }}
         run: |
           set -euo pipefail
@@ -182,7 +257,12 @@ jobs:
             exit 0
           fi
           ts="$(date -u +%Y-%m-%d)"
-          git commit -m "chore(usb-ids): nightly refresh ${ts}"
+          # Include which datasets actually refreshed in the commit body.
+          parts=()
+          [ "${{ steps.merge-usb.outcome }}" = "success" ] && parts+=("usb-vid")
+          [ "${{ steps.merge-pio.outcome }}" = "success" ] && parts+=("pio-boards")
+          body="$(printf 'datasets: %s' "$(IFS=, ; echo "${parts[*]}")")"
+          git commit -m "chore(online-data): nightly refresh ${ts}" -m "${body}"
           echo "changed=true" >> "$GITHUB_OUTPUT"
 
       - name: Prune history to last ${{ env.HISTORY_LIMIT }} commits
@@ -196,9 +276,6 @@ jobs:
             echo "no prune needed (<= ${HISTORY_LIMIT} commits)"
             exit 0
           fi
-          # Find the commit `HISTORY_LIMIT-1` back from HEAD and make it
-          # a new root via a graft. Then `git filter-repo` (preinstalled on
-          # GitHub-hosted Ubuntu runners) rewrites history accordingly.
           target="$(git rev-list --max-count="${HISTORY_LIMIT}" HEAD | tail -n 1)"
           git replace --graft "${target}"
           pip install --quiet git-filter-repo
@@ -209,20 +286,23 @@ jobs:
       - name: Push
         if: steps.commit.outputs.changed == 'true'
         working-directory: ${{ env.ONLINE_WORKTREE }}
-        # Force-with-lease is needed only after a history-prune rewrite.
-        # In the no-prune path it is a no-op compared to a fast-forward.
         run: |
           git push --force-with-lease origin "${ONLINE_BRANCH}"
 
       - name: Summary
         if: always()
         run: |
-          echo "## Nightly USB IDs refresh" >> "$GITHUB_STEP_SUMMARY"
-          echo "" >> "$GITHUB_STEP_SUMMARY"
-          echo "| source | outcome |" >> "$GITHUB_STEP_SUMMARY"
-          echo "|---|---|" >> "$GITHUB_STEP_SUMMARY"
-          echo "| usb-ids-rs (dump example) | ${{ steps.run-dump.outcome }} |" >> "$GITHUB_STEP_SUMMARY"
-          echo "| linux-usb.org             | ${{ steps.fetch-linux-usb.outcome }} |" >> "$GITHUB_STEP_SUMMARY"
-          echo "| usbids/usbids github      | ${{ steps.fetch-github.outcome }} |" >> "$GITHUB_STEP_SUMMARY"
-          echo "| merge                     | ${{ steps.merge.outcome }} |" >> "$GITHUB_STEP_SUMMARY"
-          echo "| committed                 | ${{ steps.commit.outputs.changed || 'n/a' }} |" >> "$GITHUB_STEP_SUMMARY"
+          {
+            echo "## Nightly online-data refresh"
+            echo ""
+            echo "| source / step | outcome |"
+            echo "|---|---|"
+            echo "| usb-ids-rs (dump example)  | ${{ steps.run-dump.outcome }} |"
+            echo "| linux-usb.org              | ${{ steps.fetch-linux-usb.outcome }} |"
+            echo "| usbids/usbids github       | ${{ steps.fetch-github.outcome }} |"
+            echo "| pio boards (platformio)    | ${{ steps.dump-pio.outcome }} |"
+            echo "| merge usb-vid              | ${{ steps.merge-usb.outcome }} |"
+            echo "| merge pio-boards           | ${{ steps.merge-pio.outcome }} |"
+            echo "| build manifest             | ${{ steps.build-manifest.outcome }} |"
+            echo "| committed                  | ${{ steps.commit.outputs.changed || 'n/a' }} |"
+          } >> "$GITHUB_STEP_SUMMARY"
diff --git a/docs/online-data.md b/docs/online-data.md
index 991c7f38..52406c9d 100644
--- a/docs/online-data.md
+++ b/docs/online-data.md
@@ -1,23 +1,39 @@
 # `online-data` branch + nightly refresh
 
 The repo carries a long-lived orphan branch called `online-data` that holds
-periodically-refreshed reference datasets fbuild reads at runtime. Today
-the only dataset is the USB VID:PID → vendor/product map; the format is
-**future-forward** so additional datasets (PCI vendor IDs, board feature
-matrices, etc.) can be added later without breaking clients.
+periodically-refreshed reference datasets fbuild reads at runtime. Datasets
+currently published:
 
-The companion in-process resolver lives at `fbuild_core::usb` — see
+| Dataset | Path | Description |
+|---|---|---|
+| `usb-vid` | `data/usb-vid.json` | USB VID:PID → `{vendor, product}` (union of multiple sources) |
+| `usb-vid-conflicts` | `data/usb-vid-conflicts.json` | Per-key disagreements between USB-VID sources (observability) |
+| `pio-boards` | `data/pio-boards.json` | Full PlatformIO board catalog (vendor, mcu, frameworks, debug tools, etc.) |
+| `vendor_boards` | `data/vendor_boards.json` | Slim view of `pio-boards` — only `{vendor, name, mcu}` per board id, for cheap "what board is plugged in?" lookups |
+
+The format is **future-forward** — new datasets are added by writing a new
+JSON file under `data/`; `tools/build_manifest.py` auto-discovers them on
+the next workflow run. Adding a dataset does not break existing clients.
+
+The companion in-process USB resolver lives at `fbuild_core::usb` — see
 `crates/fbuild-core/src/usb/`. The branch is the **tier-2 fallback** when
 the bundled `usb-ids` crate doesn't know a VID:PID.
 
 ## URLs
 
+Always start from the manifest — direct dataset URLs may change in the
+future, but the manifest's `datasets.<name>.url` field is the contract.
+
 - Manifest (entry point — clients fetch this first):
   `https://raw.githubusercontent.com/fastled/fbuild/online-data/manifest.json`
-- Live dataset (also exposed in the manifest):
+- USB VID:PID dataset:
   `https://raw.githubusercontent.com/fastled/fbuild/online-data/data/usb-vid.json`
-- Conflict log (visibility, not consumed by fbuild at runtime):
+- USB-VID source-conflict log:
   `https://raw.githubusercontent.com/fastled/fbuild/online-data/data/usb-vid-conflicts.json`
+- PlatformIO full board catalog:
+  `https://raw.githubusercontent.com/fastled/fbuild/online-data/data/pio-boards.json`
+- PlatformIO slim vendor-name lookup (small, ~200 KB):
+  `https://raw.githubusercontent.com/fastled/fbuild/online-data/data/vendor_boards.json`
 
 The matching constants in code: `fbuild_core::usb::MANIFEST_URL` and
 `fbuild_core::usb::USB_VID_JSON_URL`.