Document gcx workflow and add CLAUDE.md #9
Draft
jwmossmoz wants to merge 7 commits into
Adds a `gcx` section to the top-level README covering install, login against the local Yardstick proxy with the 1Password service-account token, and the dashboard commands that work — plus the `--folder` search gotcha against Yardstick's nested-folder layout. Adds CLAUDE.md so future Claude Code sessions land on the right backup tree (yardstick/gdg-based/), know which directories mirror the RelSRE folder hierarchy, and pick up the rules for backup PRs (no `clear`, strip alert id/updated/version, leave Sandboxes untracked).
Yardstick is behind Google IAP, so `http://localhost:3000` only works once `mzcld iap --host yardstick.mozilla.org --proxy --port 3000` is running. Add an install/start step before the `gcx login` instructions and link the SRE Confluence guide. Note `gcloud auth login` as a prereq. Update CLAUDE.md so future sessions know the proxy precondition applies to both `gcx` and the raw `localhost:3000` curl fallback.
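The proxy precondition above can be checked before any API call is attempted. The following is a minimal sketch, not part of this PR: `proxy_is_listening` is a hypothetical helper name, and it only confirms that something accepts TCP connections on the port, not that the listener is actually the `mzcld iap` proxy.

```python
import socket


def proxy_is_listening(host: str = "localhost", port: int = 3000,
                       timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port.

    Assumption: the IAP proxy is expected on localhost:3000, as in the
    commit message. This does not verify who is listening.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timeout, or name resolution failure.
        return False


if __name__ == "__main__":
    if not proxy_is_listening():
        raise SystemExit(
            "Nothing is listening on localhost:3000; start the IAP proxy first."
        )
    print("Proxy port is open.")
```

A check like this fails fast with a readable message instead of a connection error deep inside a backup run.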
Replaces the GDG-based and manual backup trees under yardstick/ with a gcx-native layout backed by a small Makefile and one Python helper.
- yardstick/Makefile drives the workflow: backup / push / validate / diff / discover-uids. Scope is hardcoded via FOLDER_UIDS and DASHBOARD_UIDS so contributors can see the tracked surface at a glance and refresh it with `make discover-uids`.
- yardstick/resources/ is what `gcx resources pull -p resources` writes (flat by kind + API version, one yaml per resource). Both v0alpha1 and v1beta1 dashboard directories are tracked because gcx splits resources across versions.
- yardstick/alerts/ holds one cleaned JSON per RelSRE alert rule. scripts/pull_alerts.py uses /api/v1/provisioning/alert-rules because `gcx resources pull alertrules` embeds live state (lastEvaluation, active alerts) that would churn every backup. Writes are staged in a sibling .alerts.new/ and only swapped in once every in-scope rule has been written, so a fetch/proxy failure cannot leave alerts/ partially populated.
- Updates the top-level README to point at the new yardstick/ flow and drops the GDG section; CLAUDE.md describes the new layout, scope rules, and Yardstick/IAP gotchas.
Known upstream dashboard issues surfaced by adversarial review of the freshly-backed-up resources (faithful captures of live state, not introduced here):
- Pickup-wait formulas in linux/mac/windows/azure/gcp *-pickup-wait-timeline-v1 divide by `(sum(rate(...)) > 0)`, which filters the series instead of acting as a guard. The pickup-wait panel goes blank exactly when a queue is stuck with zero throughput, which is the case responders need to see.
- The Bitbar panel in Android HW - By Provider (beleuqjq6k0zkb) mixes Bitbar pending tasks with running-worker/utilization queries still filtered by `workerType=~".*lambda.*"`.
Both belong to the dashboard owners; fixing them via `gcx resources push` is out of scope for this PR.
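The staged write described for scripts/pull_alerts.py can be sketched roughly as follows. This is an illustration under assumptions, not the script itself: `write_staged` is a hypothetical name, and the staging directory is derived from the target name (`alerts` becomes `.alerts.new`) to match the commit message.

```python
from __future__ import annotations

import json
import shutil
from pathlib import Path
from typing import Iterable


def write_staged(target: Path, rules: Iterable[tuple[str, dict]]) -> None:
    """Write every (filename, payload) pair to a sibling staging directory,
    then swap it in. If iterating `rules` raises (for example because a
    fetch through the proxy fails), `target` is left untouched."""
    staging = target.with_name(f".{target.name}.new")
    if staging.exists():
        shutil.rmtree(staging)  # leftover from an earlier failed run
    staging.mkdir(parents=True)

    for filename, payload in rules:
        text = json.dumps(payload, indent=2, sort_keys=True) + "\n"
        (staging / filename).write_text(text)

    # Only reached once every rule has been written.
    if target.exists():
        shutil.rmtree(target)
    staging.rename(target)
```

The final remove-then-rename is not atomic, so a crash between those two calls could still leave `target` missing; the sketch only guards against failures while fetching and writing.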
`gcx login` already writes the service-account token to ~/.config/gcx/config.yaml, so subsequent gcx commands reuse it automatically. Spell that out in both READMEs since it wasn't obvious from the login command. Also let pull_alerts.py and `make discover-uids` honor GRAFANA_TOKEN when set, falling back to `op read $TOKEN_REF` otherwise — exporting the token once per shell removes the per-backup 1Password prompts.
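The token fallback described above might look something like this in Python. It is a sketch under assumptions: `resolve_token` and `read_from_1password` are hypothetical names, and the only external command used is `op read`, as named in the commit message.

```python
import os
import subprocess
from typing import Callable, Mapping


def read_from_1password(token_ref: str) -> str:
    """Read a secret with the 1Password CLI (`op read <reference>`)."""
    result = subprocess.run(
        ["op", "read", token_ref],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def resolve_token(
    token_ref: str,
    env: Mapping[str, str] = os.environ,
    reader: Callable[[str], str] = read_from_1password,
) -> str:
    """Prefer GRAFANA_TOKEN from the environment; otherwise ask 1Password.

    `env` and `reader` are injectable so the fallback order can be
    exercised without a real 1Password session.
    """
    token = env.get("GRAFANA_TOKEN", "").strip()
    if token:
        return token
    return reader(token_ref)
```

With `GRAFANA_TOKEN` exported once per shell, `reader` is never called, which is what removes the per-backup 1Password prompt.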
Wraps the one-time `gcx login` against the local IAP proxy in a single Makefile recipe so new contributors don't have to copy the multi-line command from the README. Reads the token from `GRAFANA_TOKEN` if set, otherwise from 1Password via `TOKEN_REF`, and confirms with `gcx config check`. Exposes `CONTEXT` (default `yardstick`) for anyone who needs a different context name.
The gcx resources pull/push layout was a Kubernetes-style GVK tree
(dashboards.v0alpha1.dashboard.grafana.app/, dashboards.v1beta1...)
with UID filenames. PR review against that tree is unworkable — you
can't tell from a diff what dashboard changed without opening the
yaml — and the v0alpha1/v1beta1 split tracks Grafana's API migration
churn rather than anything meaningful here. The only thing it bought
us was round-trip `gcx resources push`, which the team doesn't use:
dashboards are authored in the Grafana UI.
Replace it with a single Python script that hits the Grafana REST API
directly and writes one JSON per dashboard / alert into a folder tree
that mirrors RelSRE's Grafana hierarchy:
    yardstick/
      dashboards/{relsre,fxci-cloud-workers/{azure,gcp},
                  fxci-hardware-workers/{linux,mac,windows},
                  relsre-development}/<title-slug>.json
      alerts/<same layout>/<title-slug>.json
scripts/backup.py:
- Hardcoded FOLDERS map (UID -> on-disk slug); intermediate parent
folders map to None and are skipped.
- Fetches dashboards via /api/dashboards/uid/<UID> and alerts via
/api/v1/provisioning/alert-rules, strips server-churn fields,
deduplicates title-slug collisions by appending the UID.
- Stages writes into yardstick/dashboards.new and
yardstick/alerts.new and only swaps them in once every resource
has been written, so a fetch failure cannot leave the working
tree partially populated.
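The field stripping and slug deduplication listed above could be sketched as below. The names are hypothetical and the details are assumptions: the churn fields are taken from the alert rules mentioned earlier in this PR (id/updated/version) and treated as top-level keys, and every member of a colliding group gets its UID appended, which may differ from what scripts/backup.py actually does.

```python
import re

# Assumption: these top-level keys change on the server side and carry
# no reviewable meaning; the real script may strip a different set.
CHURN_FIELDS = ("id", "updated", "version")


def strip_churn(resource: dict) -> dict:
    """Return a copy of the resource without server-churn fields."""
    return {k: v for k, v in resource.items() if k not in CHURN_FIELDS}


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into '-'."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def assign_filenames(resources: list) -> dict:
    """Map '<slug>.json' to the cleaned resource. When two titles produce
    the same slug, append each resource's UID to keep filenames unique."""
    groups = {}
    for resource in resources:
        groups.setdefault(slugify(resource["title"]), []).append(resource)

    files = {}
    for slug, group in groups.items():
        for resource in group:
            name = slug if len(group) == 1 else f"{slug}-{resource['uid']}"
            files[f"{name}.json"] = strip_churn(resource)
    return files
```

Appending the UID only on collision keeps most filenames readable in a diff while still guaranteeing one file per resource.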
Makefile drops `push` and `validate` (gcx-specific) and no longer
depends on gcx for anything load-bearing: `make login` stays for ad-hoc
gcx usage but isn't required for backups. CLAUDE.md and the top-level
README are rewritten around the new flow and call out gcx as an ad-hoc
CLI only.
This reverts commit 9e872de.
Summary
- `gcx` section in the top-level README covering install (`brew install grafana/grafana/gcx`), login against the local Yardstick proxy with the 1Password service-account token, and dashboard list/search/get commands.
- `gcx dashboards search --folder` gotcha against Yardstick's nested folders, with the curl-based `/api/folders` + `/api/search?folderUIDs=…` fallback.
- `CLAUDE.md` for future Claude Code sessions: which tree is active (`yardstick/gdg-based/`), how the directories mirror the RelSRE folder hierarchy, and the rules for backup PRs (no `clear`, strip alert `id`/`updated`/`version`, leave Sandboxes untracked).

Test plan
- `brew install grafana/grafana/gcx` then `gcx login yardstick --server http://localhost:3000 --token "$(op read 'op://RelOps/Grafana Yardstick Service Account Token/credential')" --yes` succeeds with the local proxy running.
- `gcx config check` reports ✔ Connectivity and a Grafana version.
- `gcx dashboards list` and `gcx dashboards search "workers"` return results.
- `/api/folders` + `/api/search?folderUIDs=…` curl pair lists the RelSRE folder tree.