feat(manifests): add optional Grafana dashboard for controller observability #3653

Open
h0pers wants to merge 4 commits into kubeflow:master from h0pers:feat/grafana-dashboard

@h0pers h0pers commented Jun 26, 2026

Contributor

What this PR does / why we need it:

Adds a default Grafana dashboard shipped as a ConfigMap in the Kubeflow Trainer Helm chart, providing out-of-the-box visibility into controller health and TrainJob lifecycle.

The dashboard is disabled by default (grafanaDashboard.enabled: false) - zero impact on existing deployments. When enabled, the ConfigMap is labeled for Grafana sidecar auto-discovery (grafana_dashboard: "1"). It includes 10 panels across 3 rows (Controller Health, Queue & Backlog, TrainJob Lifecycle) using controller-runtime metrics already exposed by the Trainer controller.
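The gating described above can be sketched outside Helm. This is a minimal Python illustration, not the chart's actual template: the function name `render_dashboard_configmap`, the ConfigMap name, and the data key are hypothetical, and only the `grafanaDashboard.enabled` flag and the `grafana_dashboard: "1"` label are taken from this description.

```python
import json
from typing import Optional


def render_dashboard_configmap(values: dict, dashboard: dict) -> Optional[dict]:
    """Return a ConfigMap manifest for the dashboard, or None when disabled."""
    grafana = values.get("grafanaDashboard", {})
    # Disabled by default: a missing or false flag renders nothing.
    if not grafana.get("enabled", False):
        return None
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            # Hypothetical name; the chart's real naming helper is not shown here.
            "name": "kubeflow-trainer-grafana-dashboard",
            # Label the Grafana sidecar watches for auto-discovery.
            "labels": {"grafana_dashboard": "1"},
        },
        # Hypothetical data key, borrowed from the dashboard file name.
        "data": {"kubeflow-trainer-dashboard.json": json.dumps(dashboard)},
    }
```

With empty values the function returns `None`, which mirrors the "zero impact on existing deployments" behaviour; passing `{"grafanaDashboard": {"enabled": True}}` yields a labeled ConfigMap.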

Changes:

  • charts/kubeflow-trainer/dashboards/kubeflow-trainer-dashboard.json - Grafana dashboard JSON model
  • charts/kubeflow-trainer/templates/grafana/configmap.yaml - Helm template gated by grafanaDashboard.enabled
  • charts/kubeflow-trainer/tests/grafana/configmap_test.yaml - 7 Helm unit tests for the ConfigMap
  • charts/kubeflow-trainer/values.yaml - new grafanaDashboard values section
  • charts/kubeflow-trainer/README.md.gotmpl - documentation for enabling the dashboard

Which issue(s) this PR fixes:
Fixes #3430

Checklist:

  • Docs included if any changes are user facing

…lity

Add a default Grafana dashboard shipped as a Helm-gated ConfigMap
that provides out-of-the-box visibility into controller health and
TrainJob lifecycle using existing controller-runtime metrics.

The dashboard is disabled by default (grafanaDashboard.enabled=false)
and uses the standard grafana_dashboard label for sidecar
auto-discovery. It includes panels for controller scrape status,
goroutines, memory, reconcile rates, workqueue depth, and
TrainJob-specific reconciliation metrics.

Signed-off-by: Dmytro Hryshchenko <dhryshch@redhat.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
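
For readers unfamiliar with the metrics named in the commit message, the sketch below builds plausible PromQL expressions for a few of those panels. It is an assumption-laden illustration: the metric names are the standard ones exposed by controller-runtime and the Prometheus Go client, but the exact expressions, the 5m rate window, and the label values used by the dashboard in this PR are not shown in the conversation, and `panel_queries` is a hypothetical helper.

```python
def panel_queries(controller: str, job: str) -> dict:
    """Map panel names to example PromQL expressions (illustrative only)."""
    return {
        # 1 when Prometheus can scrape the controller, 0 otherwise.
        "scrape status": f'up{{job="{job}"}}',
        "goroutines": f'go_goroutines{{job="{job}"}}',
        "memory": f'process_resident_memory_bytes{{job="{job}"}}',
        # Reconciles per second, split by result (success, error, requeue, ...).
        "reconcile rate": (
            "sum by (result) (rate(controller_runtime_reconcile_total"
            f'{{controller="{controller}"}}[5m]))'
        ),
        # Items currently waiting in the controller's workqueue.
        "workqueue depth": f'workqueue_depth{{name="{controller}"}}',
    }
```

The `controller` and `job` arguments stand in for the dashboard's template variables; their real values depend on how the Trainer controller is registered and scraped.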
Copilot AI review requested due to automatic review settings June 26, 2026 15:44
@google-oss-prow google-oss-prow Bot requested a review from akshaychitneni June 26, 2026 15:44
@google-oss-prow


[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign johnugeorge for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow Bot requested a review from jinchihe June 26, 2026 15:44
@h0pers h0pers changed the title from "feat(charts): add optional Grafana dashboard for controller observability" to "feat(manifests): add optional Grafana dashboard for controller observability" Jun 26, 2026

Copilot AI left a comment

Contributor

Pull request overview

This PR adds an optional, disabled-by-default Grafana dashboard to the Kubeflow Trainer Helm chart by shipping the dashboard JSON in-chart and rendering it as a labeled ConfigMap for Grafana sidecar auto-discovery when enabled.

Changes:

  • Introduces a new grafanaDashboard values section (default enabled: false) to gate dashboard installation and allow label/annotation customization.
  • Adds a Helm template that conditionally renders a dashboard ConfigMap populated from an in-chart JSON file.
  • Adds Helm unit tests covering the rendered/non-rendered behavior and key metadata/data expectations, plus README documentation for enabling.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| charts/kubeflow-trainer/values.yaml | Adds grafanaDashboard values to gate and configure dashboard installation. |
| charts/kubeflow-trainer/templates/grafana/configmap.yaml | Renders the dashboard ConfigMap only when grafanaDashboard.enabled is true and injects the JSON from chart files. |
| charts/kubeflow-trainer/dashboards/kubeflow-trainer-dashboard.json | Provides the default Grafana dashboard model using controller-runtime / workqueue metrics. |
| charts/kubeflow-trainer/tests/grafana/configmap_test.yaml | Adds helm-unittest coverage for enablement gating, labels/annotations, and JSON data key presence. |
| charts/kubeflow-trainer/README.md.gotmpl | Documents how to enable the dashboard via Helm values. |

@Sridhar1030

Member

Thanks for the PR, @h0pers.
I'll review it promptly.
/assign

@Sridhar1030

Member

Could you add a screenshot of the dashboard with sample data? Helps verify layout, panel sizing, and threshold colors render as intended.

Also, the "Docs included" checkbox is checked but the only docs change is to the Helm chart's README.md.gotmpl. The Kubeflow Trainer docs site has no monitoring/observability section for V2, only a legacy V1 Prometheus page with different metrics. Could you either add a page to the website docs or open a follow-up issue tracking it?

@Sridhar1030 Sridhar1030 left a comment

Member

Reviewed with the help of AI.

Overall this is a solid addition -- the PromQL queries are correct against controller-runtime metrics, the Helm template follows existing chart patterns, and the opt-in gating is clean. A few things to address inline.

Comment thread charts/kubeflow-trainer/dashboards/kubeflow-trainer-dashboard.json
Comment thread charts/kubeflow-trainer/README.md.gotmpl
h0pers and others added 2 commits June 30, 2026 13:53
…er-dashboard.json

Co-authored-by: Sridhar Pillai <77533416+Sridhar1030@users.noreply.github.com>
Signed-off-by: Dmytro Hryshchenko <dmytrohryshchenkowork@gmail.com>
…rd section

Signed-off-by: Dmytro Hryshchenko <dhryshch@redhat.com>
@h0pers

h0pers commented Jun 30, 2026

Contributor Author
[Three dashboard screenshots; the last is captioned "Kubeflow Trainer - Controller Health   TrainJob Lifecycle"]

@Sridhar1030 here are some screenshots.

@h0pers h0pers requested a review from Sridhar1030 June 30, 2026 14:16
@Sridhar1030

Member

Thanks for the screenshots. Just one minor change: the suggestion applied in commit 2e41cb1 landed on the wrong line. It replaced "type": "prometheus" in the controller variable's datasource block instead of "regex": ".trainer." in the job variable.

The current dashboard JSON has two issues:

  • The job variable still has "regex": ".trainer." (the original bug, unfixed).
  • The controller variable's datasource lost "type": "prometheus" and now contains a stray "regex": "", which will break the variable query.

Fix needed:

  • Restore "type": "prometheus" in the controller variable's datasource.
  • Change "regex": ".trainer." to "regex": "" in the job variable.
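
The two issues listed above could be caught mechanically before pushing. The following is a rough sketch, assuming the dashboard follows Grafana's usual JSON model with variables under `templating.list`; the helper name `check_dashboard_variables` and the sample `uid` value are hypothetical, and the check only covers the two variables discussed in this comment.

```python
from typing import List


def check_dashboard_variables(dashboard: dict) -> List[str]:
    """Return problems found in the `controller` and `job` template variables."""
    problems = []
    variables = {
        v.get("name"): v for v in dashboard.get("templating", {}).get("list", [])
    }

    # The controller variable's datasource must keep its type and hold no regex.
    datasource = variables.get("controller", {}).get("datasource", {})
    if datasource.get("type") != "prometheus":
        problems.append('controller: datasource is missing "type": "prometheus"')
    if "regex" in datasource:
        problems.append('controller: stray "regex" key inside datasource')

    # The job variable should not filter values with a regex.
    if variables.get("job", {}).get("regex", "") != "":
        problems.append('job: "regex" should be ""')
    return problems


# Sample input mirroring the broken state described in the comment above.
broken = {
    "templating": {
        "list": [
            {"name": "job", "regex": ".trainer."},
            {"name": "controller", "datasource": {"regex": "", "uid": "example"}},
        ]
    }
}
for problem in check_dashboard_variables(broken):
    print(problem)
```

Run against the broken sample, the helper reports all three findings; after the fix it should return an empty list.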

Signed-off-by: Dmytro Hryshchenko <dhryshch@redhat.com>
@h0pers h0pers force-pushed the feat/grafana-dashboard branch from d9ac90a to 6994ed1 Compare June 30, 2026 16:19
Development

Successfully merging this pull request may close these issues.

Ship a default Grafana dashboard for Kubeflow Trainer

3 participants