Document gcx workflow and add CLAUDE.md #9

Draft: jwmossmoz wants to merge 7 commits into master from add-gcx-readme-and-claude-md

@jwmossmoz (Contributor)

Summary

  • Add a gcx section to the top-level README covering install (brew install grafana/grafana/gcx), login against the local Yardstick proxy with the 1Password service-account token, and dashboard list/search/get commands.
  • Document the gcx dashboards search --folder gotcha against Yardstick's nested folders, with the curl-based /api/folders + /api/search?folderUIDs=… fallback.
  • Add CLAUDE.md for future Claude Code sessions: which tree is active (yardstick/gdg-based/), how the directories mirror the RelSRE folder hierarchy, and the rules for backup PRs (no clear, strip alert id/updated/version, leave Sandboxes untracked).

Test plan

  • brew install grafana/grafana/gcx then gcx login yardstick --server http://localhost:3000 --token "$(op read 'op://RelOps/Grafana Yardstick Service Account Token/credential')" --yes succeeds with the local proxy running.
  • gcx config check reports ✔ Connectivity and a Grafana version.
  • gcx dashboards list and gcx dashboards search "workers" return results.
  • The documented /api/folders + /api/search?folderUIDs=… curl pair lists the RelSRE folder tree.

jwmossmoz added 4 commits May 26, 2026 10:45
Adds a `gcx` section to the top-level README covering install, login
against the local Yardstick proxy with the 1Password service-account
token, and the dashboard commands that work — plus the
`--folder` search gotcha against Yardstick's nested-folder layout.

Adds CLAUDE.md so future Claude Code sessions land on the right
backup tree (yardstick/gdg-based/), know which directories mirror
the RelSRE folder hierarchy, and pick up the rules for backup PRs
(no `clear`, strip alert id/updated/version, leave Sandboxes
untracked).

Yardstick is behind Google IAP, so `http://localhost:3000` only works
once `mzcld iap --host yardstick.mozilla.org --proxy --port 3000` is
running. Add an install/start step before the `gcx login` instructions
and link the SRE Confluence guide. Note `gcloud auth login` as a
prereq.

Update CLAUDE.md so future sessions know the proxy precondition
applies to both `gcx` and the raw `localhost:3000` curl fallback.

Replaces the GDG-based and manual backup trees under yardstick/ with a
gcx-native layout backed by a small Makefile and one Python helper.

- yardstick/Makefile drives the workflow: backup / push / validate /
  diff / discover-uids. Scope is hardcoded via FOLDER_UIDS and
  DASHBOARD_UIDS so contributors can see the tracked surface at a
  glance and refresh it with `make discover-uids`.
- yardstick/resources/ is what `gcx resources pull -p resources` writes
  (flat by kind + API version, one yaml per resource). Both
  v0alpha1 and v1beta1 dashboard directories are tracked because gcx
  splits resources across versions.
- yardstick/alerts/ holds one cleaned JSON per RelSRE alert rule.
  scripts/pull_alerts.py uses /api/v1/provisioning/alert-rules because
  `gcx resources pull alertrules` embeds live state (lastEvaluation,
  active alerts) that would churn every backup. Writes are staged in
  a sibling .alerts.new/ and only swapped in once every in-scope rule
  has been written, so a fetch/proxy failure cannot leave alerts/
  partially populated.
- Updates the top-level README to point at the new yardstick/ flow and
  drops the GDG section; CLAUDE.md describes the new layout, scope
  rules, and Yardstick/IAP gotchas.
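
The staged write described for yardstick/alerts/ can be sketched roughly as follows. This is a minimal illustration, not the code in scripts/pull_alerts.py: the function name `write_staged` and the one-file-per-key layout are assumptions made for the example. Only the sibling `.alerts.new/` staging directory and the "swap after every rule is written" behaviour come from the description above.

```python
import json
import shutil
from pathlib import Path


def write_staged(target: Path, rules: dict[str, dict]) -> None:
    """Write every rule into a sibling staging dir, then swap it in.

    `write_staged` is a hypothetical helper; for a target named `alerts`
    the staging directory is the sibling `.alerts.new`.
    """
    staging = target.with_name(f".{target.name}.new")
    shutil.rmtree(staging, ignore_errors=True)  # drop leftovers from a failed run
    staging.mkdir(parents=True)

    # If anything raises in this loop, `target` has not been touched yet.
    for name, rule in rules.items():
        (staging / f"{name}.json").write_text(
            json.dumps(rule, indent=2, sort_keys=True) + "\n"
        )

    # Swap only once every file exists. Remove-then-rename is not atomic,
    # but the failure-prone fetch/write phase is already over by this point.
    if target.exists():
        shutil.rmtree(target)
    staging.rename(target)
```

Keeping the staging directory next to the target means the final rename stays on one filesystem, so it is a cheap directory rename rather than a copy.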

Known upstream dashboard issues surfaced by adversarial review of the
freshly-backed-up resources (faithful captures of live state, not
introduced here):

- Pickup-wait formulas in linux/mac/windows/azure/gcp
  *-pickup-wait-timeline-v1 divide by `(sum(rate(...)) > 0)`, which
  filters the series instead of acting as a guard. The pickup-wait
  panel goes blank exactly when a queue is stuck with zero
  throughput — the case responders need to see.
- The Bitbar panel in Android HW - By Provider (beleuqjq6k0zkb) mixes
  Bitbar pending tasks with running-worker/utilization queries still
  filtered by `workerType=~".*lambda.*"`.
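
To illustrate why the `> 0` comparison in the pickup-wait formulas blanks the panel, here is a small Python model of the PromQL semantics involved. It is a sketch, not the dashboards' actual query: the queue names and numbers are invented, and instant vectors are modelled as plain dicts keyed by a label value.

```python
def filter_gt_zero(vector: dict[str, float]) -> dict[str, float]:
    """Model of PromQL `v > 0` without `bool`: non-matching samples are dropped."""
    return {labels: value for labels, value in vector.items() if value > 0}


def divide(lhs: dict[str, float], rhs: dict[str, float]) -> dict[str, float]:
    """Model of PromQL `lhs / rhs` with one-to-one matching.

    A series with no partner on the other side produces no output sample.
    """
    return {labels: lhs[labels] / rhs[labels] for labels in lhs if labels in rhs}


# Hypothetical inputs: queue-b is stuck, so its pickup rate is zero.
wait_seconds_rate = {"queue-a": 120.0, "queue-b": 45.0}
pickup_rate = {"queue-a": 2.0, "queue-b": 0.0}

result = divide(wait_seconds_rate, filter_gt_zero(pickup_rate))
print(result)  # {'queue-a': 60.0}
```

The stuck queue disappears from the result instead of showing a sentinel or a very large value, which matches the blank-panel behaviour described above.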

Both belong to the dashboard owners; fixing them via `gcx resources
push` is out of scope for this PR.

`gcx login` already writes the service-account token to
~/.config/gcx/config.yaml, so subsequent gcx commands reuse it
automatically. Spell that out in both READMEs since it wasn't obvious
from the login command.

Also let pull_alerts.py and `make discover-uids` honor GRAFANA_TOKEN
when set, falling back to `op read $TOKEN_REF` otherwise — exporting
the token once per shell removes the per-backup 1Password prompts.
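
A minimal sketch of that lookup order, assuming a helper along these lines (the name `resolve_token` and the injectable `env`/`run` parameters are illustrative, not taken from pull_alerts.py):

```python
import os
import subprocess


def resolve_token(token_ref: str, env=os.environ, run=subprocess.run) -> str:
    """Return GRAFANA_TOKEN if set, otherwise read the secret via `op read`.

    `token_ref` is a 1Password secret reference (an `op://...` URI).
    `env` and `run` are injectable only so the function can be exercised
    without a real environment or the `op` CLI.
    """
    token = env.get("GRAFANA_TOKEN", "").strip()
    if token:
        return token  # no 1Password prompt needed

    # Fallback: one `op read` call, which may prompt for 1Password unlock.
    result = run(["op", "read", token_ref], check=True, capture_output=True, text=True)
    return result.stdout.strip()
```

With `export GRAFANA_TOKEN=...` in the shell, every later call takes the first branch and never shells out.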
@jwmossmoz jwmossmoz marked this pull request as draft May 26, 2026 15:22
jwmossmoz added 3 commits May 26, 2026 11:23
Wraps the one-time `gcx login` against the local IAP proxy in a single
Makefile recipe so new contributors don't have to copy the multi-line
command from the README. Reads the token from `GRAFANA_TOKEN` if set,
otherwise from 1Password via `TOKEN_REF`, and confirms with
`gcx config check`.

Exposes `CONTEXT` (default `yardstick`) for anyone who needs a
different context name.

The gcx resources pull/push layout was a Kubernetes-style GVK tree
(dashboards.v0alpha1.dashboard.grafana.app/, dashboards.v1beta1...)
with UID filenames. PR review against that tree is unworkable — you
can't tell from a diff what dashboard changed without opening the
yaml — and the v0alpha1/v1beta1 split tracks Grafana's API migration
churn rather than anything meaningful here. The only thing it bought
us was round-trip `gcx resources push`, which the team doesn't use:
dashboards are authored in the Grafana UI.

Replace it with a single Python script that hits the Grafana REST API
directly and writes one JSON per dashboard / alert into a folder tree
that mirrors RelSRE's Grafana hierarchy:

  yardstick/
    dashboards/{relsre,fxci-cloud-workers/{azure,gcp},
                fxci-hardware-workers/{linux,mac,windows},
                relsre-development}/<title-slug>.json
    alerts/<same layout>/<title-slug>.json

scripts/backup.py:
  - Hardcoded FOLDERS map (UID -> on-disk slug); intermediate parent
    folders map to None and are skipped.
  - Fetches dashboards via /api/dashboards/uid/<UID> and alerts via
    /api/v1/provisioning/alert-rules, strips server-churn fields,
    deduplicates title-slug collisions by appending the UID.
  - Stages writes into yardstick/dashboards.new and
    yardstick/alerts.new and only swaps them in once every resource
    has been written, so a fetch failure cannot leave the working
    tree partially populated.
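
The field stripping and slug deduplication can be sketched as below. This is an illustration under assumptions, not the contents of scripts/backup.py: the helper names are invented, the churn-field list is borrowed from the "strip alert id/updated/version" rule, and appending the UID to every colliding title (rather than only to later ones) is a guess at the policy.

```python
import re

# Assumption: taken from the alert rule "strip id/updated/version";
# the real script may strip a different set for dashboards.
CHURN_FIELDS = ("id", "updated", "version")


def slugify(title: str) -> str:
    """Lowercase the title and collapse non-alphanumeric runs into '-'."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug or "untitled"


def strip_churn(resource: dict) -> dict:
    """Drop top-level fields that change on every server-side save."""
    return {k: v for k, v in resource.items() if k not in CHURN_FIELDS}


def assign_filenames(resources: list[dict]) -> dict[str, str]:
    """Map each UID to a filename, appending the UID where slugs collide."""
    counts: dict[str, int] = {}
    for res in resources:
        slug = slugify(res["title"])
        counts[slug] = counts.get(slug, 0) + 1

    names = {}
    for res in resources:
        slug = slugify(res["title"])
        stem = slug if counts[slug] == 1 else f"{slug}-{res['uid']}"
        names[res["uid"]] = f"{stem}.json"
    return names
```

Title-based filenames keep diffs readable in review, and the UID suffix only appears where two titles would otherwise overwrite each other.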

Makefile drops `push` and `validate` (gcx-specific), so nothing in the
backup path depends on gcx any more; `make login` stays for ad-hoc gcx
usage but isn't required for backups. CLAUDE.md and the top-level
README are rewritten around the new flow and call out gcx as an ad-hoc
CLI only.
