feat(aro-hcp): add periodic Grafana datasource cleanup job (AROSLSRE-1138) #80947

Closed
cssjr wants to merge 5 commits into openshift:main from cssjr:aroslsre-1138-grafana-cleanup-periodic

@cssjr

@cssjr cssjr commented Jun 24, 2026

Summary

  • Adds a monthly Prow periodic job that runs grafanactl clean datasources and grafanactl clean fixup-datasources against the DEV Grafana instance (arohcp-dev in resource group global, subscription 1d3378d3-...)
  • Removes orphaned Prometheus datasources left by personal dev environments
  • Follows the existing cleanup-sweeper step pattern in the step registry
  • Reports failures to #aro-hcp-failures-dev Slack channel

New files

  • ci-operator/step-registry/aro-hcp/deprovision/grafana-datasources/ — step registry entry (commands script, ref YAML, metadata, OWNERS)

Modified files

  • ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main__periodic-cleanup.yaml — added clean-grafana-datasources entry with monthly cron (0 6 1 * *)
  • ci-operator/jobs/Azure/ARO-HCP/Azure-ARO-HCP-main-periodics.yaml — auto-regenerated via make jobs
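
The schedule `0 6 1 * *` is a standard five-field cron expression. As a small illustration of how it reads (the `describe_cron` helper is hypothetical and not part of this PR), the fields can be split out like this:

```shell
# Split a five-field cron expression into its named fields.
# set -f disables globbing so the literal "*" fields are not expanded
# into file names during word splitting.
describe_cron() {
  set -f
  set -- $1   # intentional word splitting on whitespace
  set +f
  printf 'minute=%s hour=%s day-of-month=%s month=%s day-of-week=%s\n' \
    "$1" "$2" "$3" "$4" "$5"
}

describe_cron '0 6 1 * *'
```

Read this way, the job fires at minute 0 of hour 6 on day 1 of every month, with no restriction on the day of the week.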

Context

  • Jira: AROSLSRE-1138
  • One-time manual cleanup removed ~3,450 orphaned datasources on 2026-06-08
  • Growth source is personal dev environments only (CI no longer creates datasources)
  • grafanactl lives in Azure/ARO-Tools, entry point in Azure/ARO-HCP

Test plan

  • pj-rehearse validates the job can be scheduled and run
  • Verify grafanactl builds and authenticates in the Prow container
  • Confirm both clean datasources and clean fixup-datasources execute successfully
  • After merge, manually trigger and verify via Prow UI

🤖 Generated with Claude Code

Summary by CodeRabbit

This PR extends the ARO-HCP CI configuration with a monthly automated Grafana cleanup for the DEV Azure Managed Grafana instance, preventing orphaned Prometheus datasource entries from accumulating.

What changes (practically):

  • New monthly periodic job: Updates ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main__periodic-cleanup.yaml to schedule clean-grafana-datasources on the 1st of every month at 06:00 UTC (0 6 1 * *). It reports failure and error states to #aro-hcp-failures-dev and runs the new deprovision step.
  • New step-registry deprovision step: Adds aro-hcp-deprovision-grafana-datasources under ci-operator/step-registry/aro-hcp/deprovision/grafana-datasources/ with:
    • A bash commands script that enables strict mode, derives CLUSTER_PROFILE_DIR from VAULT_SECRET_PROFILE, reads Azure service principal credentials plus GLOBAL_INFRA_SUBSCRIPTION_ID from mounted vault-secret files, logs into Azure via az login --service-principal, builds grafanactl to /tmp/grafanactl, and then runs (in order):
      1. grafanactl clean datasources
      2. grafanactl clean fixup-datasources
    • Step defaults/targeting via env vars for Grafana name arohcp-dev and resource group global (with GRAFANA_NAME/GRAFANA_RESOURCE_GROUP configured in the step ref).
    • Cleanup intent: remove stale AMW integrations and orphaned Managed_Prometheus_* datasources not backed by a live Azure Monitor Workspace.
    • Updated step metadata and OWNERS to include geoberle and deads2k (alongside the existing team reviewer/approver placeholders).
  • Regenerated Prow wiring: Rebuilds ci-operator/jobs/Azure/ARO-HCP/Azure-ARO-HCP-main-periodics.yaml via make jobs so the new periodic entry is reflected in generated Prow job definitions.
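
The actual commands script is not shown in this conversation. The following is only a minimal sketch of the login and ordered cleanup described above. The `run`, `azure_login`, and `clean_grafana_datasources` helpers and the `DRY_RUN` and `GRAFANACTL` variables are hypothetical. The build step and any `grafanactl` targeting flags are left out because the conversation does not show them, and explicit error checks stand in for the strict mode the real script reportedly enables.

```shell
# Print each command, then execute it unless DRY_RUN=1.
run() {
  printf '+ %s\n' "$*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Log in with the service principal. This is not routed through run(),
# so the client secret is never echoed to the job log.
azure_login() {
  az login --service-principal \
    --username "$AZURE_CLIENT_ID" \
    --password "$AZURE_CLIENT_SECRET" \
    --tenant "$AZURE_TENANT_ID" \
    --output none
}

# Run the two cleanup subcommands in the order given in the summary.
# The second runs only if the first succeeds.
# Assumption: real invocations need extra flags that are not shown here.
clean_grafana_datasources() {
  grafanactl_bin=${GRAFANACTL:-/tmp/grafanactl}
  run "$grafanactl_bin" clean datasources || return 1
  run "$grafanactl_bin" clean fixup-datasources
}
```

Calling `DRY_RUN=1 clean_grafana_datasources` prints the two commands without touching Azure or Grafana, which is one way to check the ordering before a rehearsal.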

Rehearsal request:

  • The author invoked /pj-rehearse to validate the periodic job scheduling/execution before merge.

…1138)

Add a monthly Prow periodic that runs grafanactl clean datasources and
clean fixup-datasources against the DEV Grafana instance to remove
orphaned Prometheus datasources left by personal dev environments.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 24, 2026
@openshift-ci

openshift-ci Bot commented Jun 24, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai Bot commented Jun 24, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 271d6877-7319-4fe7-b7b9-7cba5551eb3b

📥 Commits

Reviewing files that changed from the base of the PR and between 58d10aa and 2260d32.

📒 Files selected for processing (2)
  • ci-operator/step-registry/aro-hcp/deprovision/grafana-datasources/OWNERS
  • ci-operator/step-registry/aro-hcp/deprovision/grafana-datasources/aro-hcp-deprovision-grafana-datasources-ref.metadata.json
✅ Files skipped from review due to trivial changes (2)
  • ci-operator/step-registry/aro-hcp/deprovision/grafana-datasources/aro-hcp-deprovision-grafana-datasources-ref.metadata.json
  • ci-operator/step-registry/aro-hcp/deprovision/grafana-datasources/OWNERS

Walkthrough

A new Grafana datasource deprovisioning step is added with Azure authentication and grafanactl cleanup commands, and a monthly periodic CI job is configured to run it.

Changes

ARO HCP Grafana Datasource Cleanup

| Layer / File(s) | Summary |
| --- | --- |
| Step registry and command: ci-operator/step-registry/aro-hcp/deprovision/grafana-datasources/... | Defines the new deprovisioning step: the ref YAML sets the command script, resource requests, credentials mount, and Grafana-related environment variables; the bash script logs into Azure, builds grafanactl, and runs clean datasources plus clean fixup-datasources; the metadata and OWNERS files set the reference target and reviewers. |
| Periodic cleanup job: ci-operator/config/Azure/ARO-HCP/Azure-ARO-HCP-main__periodic-cleanup.yaml | Adds the clean-grafana-datasources test entry with cron 0 6 1 * *, failure reporting to #aro-hcp-failures-dev on failure and error, and a single step invoking aro-hcp-deprovision-grafana-datasources. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

rehearsals-ack, jira/valid-reference

🚥 Pre-merge checks | ✅ 15
✅ Passed checks (15 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly describes the main change: adding a periodic Grafana datasource cleanup job. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | No Ginkgo test files or It/Describe/When titles were added; the PR only changes CI YAML, shell, metadata, and OWNERS files. |
| Test Structure And Quality | ✅ Passed | No Ginkgo test code was changed; the PR only adds CI YAML, shell, JSON, and OWNERS files. |
| Microshift Test Compatibility | ✅ Passed | PR only adds CI job/step-registry YAML and a shell cleanup script; no new Ginkgo tests or unsupported OpenShift APIs/resources are introduced. |
| Single Node Openshift (Sno) Test Compatibility | ✅ Passed | No new Ginkgo/e2e tests were added; the PR only adds YAML/step-registry shell cleanup, and I found no SNO-topology assumptions. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Only CI job/step-registry files were added; they contain no node selectors, affinity, spread constraints, replicas, or PDBs. |
| Ote Binary Stdout Contract | ✅ Passed | The PR only adds CI YAML and a shell step; no Go/TestMain/RunSpecs code was changed, and the new script’s stdout is just step logging, not an OTE JSON stream. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | No new Ginkgo/e2e tests were added; the PR only adds CI config and a shell cleanup step, so the IPv4/disconnected-network test check is not applicable. |
| No-Weak-Crypto | ✅ Passed | Inspected all new/updated files and found no MD5/SHA1/DES/RC4/3DES/Blowfish/ECB use or secret/token comparisons. |
| Container-Privileges | ✅ Passed | Reviewed the new step-registry and periodic job YAMLs; none set privileged, hostPID/Network/IPC, allowPrivilegeEscalation, SYS_ADMIN, or runAsUser: 0. |
| No-Sensitive-Data-In-Logs | ✅ Passed | New script only logs generic status messages; no secrets, tokens, PII, or customer data are printed, and az login uses --output none. |


@cssjr

cssjr commented Jun 24, 2026

Author

/pj-rehearse

@openshift-merge-bot

Contributor

@cssjr: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

The Azure SDK's DefaultAzureCredential with RequireAzureTokenCredentials
requires AZURE_TOKEN_CREDENTIALS to select credential sources. Setting
it to "prod" enables EnvironmentCredential (AZURE_CLIENT_ID/SECRET/TENANT),
which is how all ARO-HCP Prow steps authenticate.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
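
As a rough illustration of the environment this commit message describes, the sketch below exports the variables that EnvironmentCredential reads, plus AZURE_TOKEN_CREDENTIALS=prod. The `export_azure_env` helper and the file names under the profile directory (`client-id`, `client-secret`, `tenant`) are assumptions for illustration only; the conversation does not show the real secret layout.

```shell
# Hypothetical helper: read service principal values from files in a
# mounted profile directory and export them for the Azure SDK.
# The file names below are assumed, not taken from the PR.
export_azure_env() {
  profile_dir=$1
  AZURE_CLIENT_ID=$(cat "${profile_dir}/client-id") || return 1
  AZURE_CLIENT_SECRET=$(cat "${profile_dir}/client-secret") || return 1
  AZURE_TENANT_ID=$(cat "${profile_dir}/tenant") || return 1
  # Per the commit message, "prod" selects EnvironmentCredential.
  AZURE_TOKEN_CREDENTIALS=prod
  export AZURE_CLIENT_ID AZURE_CLIENT_SECRET AZURE_TENANT_ID AZURE_TOKEN_CREDENTIALS
}
```

A step script would call something like `export_azure_env "${CLUSTER_PROFILE_DIR}"` before invoking any tool built on the Azure SDK. None of the values are printed, so secrets stay out of the job log.
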
@cssjr

cssjr commented Jun 24, 2026

Author

/pj-rehearse

@openshift-merge-bot

Contributor

@cssjr: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@cssjr cssjr marked this pull request as ready for review June 24, 2026 02:14
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 24, 2026
@openshift-ci openshift-ci Bot requested review from geoberle and mmazur June 24, 2026 02:14
@cssjr

cssjr commented Jun 24, 2026

Author

/pj-rehearse

@openshift-merge-bot

Contributor

@cssjr: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

Comment thread ci-operator/step-registry/aro-hcp/deprovision/grafana-datasources/OWNERS Outdated
@roivaz

roivaz commented Jun 25, 2026

Contributor

/approve

@roivaz

roivaz commented Jun 25, 2026

Contributor

/lgtm

@openshift-ci openshift-ci Bot added lgtm Indicates that a PR is ready to be merged. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jun 25, 2026
@openshift-ci openshift-ci Bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 25, 2026
@openshift-merge-bot

Contributor

[REHEARSALNOTIFIER]
@cssjr: the pj-rehearse plugin accommodates running rehearsal tests for the changes in this PR. Expand 'Interacting with pj-rehearse' for usage details. The following rehearsable tests have been affected by this change:

| Test name | Repo | Type | Reason |
| --- | --- | --- | --- |
| periodic-ci-Azure-ARO-HCP-main-periodic-cleanup-clean-grafana-datasources | N/A | periodic | Periodic changed |

Interacting with pj-rehearse

Comment: /pj-rehearse to run up to 5 rehearsals
Comment: /pj-rehearse skip to opt-out of rehearsals
Comment: /pj-rehearse {test-name}, with each test separated by a space, to run one or more specific rehearsals
Comment: /pj-rehearse more to run up to 10 rehearsals
Comment: /pj-rehearse max to run up to 25 rehearsals
Comment: /pj-rehearse auto-ack to run up to 5 rehearsals, and add the rehearsals-ack label on success
Comment: /pj-rehearse list to get an up-to-date list of affected jobs
Comment: /pj-rehearse abort to abort all active rehearsals
Comment: /pj-rehearse network-access-allowed to allow rehearsals of tests that have the restrict_network_access field set to false. This must be executed by an openshift org member who is not the PR author

Once you are satisfied with the results of the rehearsals, comment: /pj-rehearse ack to unblock merge. When the rehearsals-ack label is present on your PR, merge will no longer be blocked by rehearsals.
If you would like the rehearsals-ack label removed, comment: /pj-rehearse reject to re-block merging.

@roivaz

roivaz commented Jun 25, 2026

Contributor

/lgtm

@openshift-ci openshift-ci Bot added the lgtm Indicates that a PR is ready to be merged. label Jun 25, 2026
@openshift-merge-bot

Contributor

@cssjr: pj-rehearse could not automatically process this event because the request waited in queue for longer than 5 minutes. Use /pj-rehearse to trigger rehearsals manually.

@openshift-ci

openshift-ci Bot commented Jun 25, 2026

Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cssjr, roivaz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@cssjr

cssjr commented Jun 25, 2026

Author

/pj-rehearse

@openshift-merge-bot

Contributor

@cssjr: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@cssjr

cssjr commented Jun 26, 2026

Author

/pj-rehearse ack

@openshift-merge-bot

Contributor

@cssjr: now processing your pj-rehearse request. Please allow up to 10 minutes for jobs to trigger or cancel.

@openshift-merge-bot openshift-merge-bot Bot added the rehearsals-ack Signifies that rehearsal jobs have been acknowledged label Jun 26, 2026
@cssjr

cssjr commented Jun 26, 2026

Author

/retest

@openshift-ci

openshift-ci Bot commented Jun 26, 2026

Contributor

@cssjr: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/generated-config | dbe206e | link | true | /test generated-config |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@cssjr

cssjr commented Jun 26, 2026

Author

/hold

@openshift-ci openshift-ci Bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 26, 2026
@cssjr

cssjr commented Jun 26, 2026

Author

On hold while we consider an alternative approach which may not require a Prow job.

@cssjr

cssjr commented Jun 26, 2026

Author

Abandoning in favor of Azure/ARO-Tools#258

@cssjr cssjr closed this Jun 26, 2026
Labels

  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
  • lgtm: Indicates that a PR is ready to be merged.
  • rehearsals-ack: Signifies that rehearsal jobs have been acknowledged

2 participants