feat(manifests): add optional Grafana dashboard for controller observability#3653
h0pers wants to merge 4 commits into
…lity Add a default Grafana dashboard shipped as a Helm-gated ConfigMap that provides out-of-the-box visibility into controller health and TrainJob lifecycle using existing controller-runtime metrics. The dashboard is disabled by default (grafanaDashboard.enabled=false) and uses the standard grafana_dashboard label for sidecar auto-discovery. It includes panels for controller scrape status, goroutines, memory, reconcile rates, workqueue depth, and TrainJob-specific reconciliation metrics. Signed-off-by: Dmytro Hryshchenko <dhryshch@redhat.com> Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
Pull request overview
This PR adds an optional, disabled-by-default Grafana dashboard to the Kubeflow Trainer Helm chart by shipping the dashboard JSON in-chart and rendering it as a labeled ConfigMap for Grafana sidecar auto-discovery when enabled.
Changes:
- Introduces a new `grafanaDashboard` values section (default `enabled: false`) to gate dashboard installation and allow label/annotation customization.
- Adds a Helm template that conditionally renders a dashboard ConfigMap populated from an in-chart JSON file.
- Adds Helm unit tests covering the rendered/non-rendered behavior and key metadata/data expectations, plus README documentation for enabling.
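As a rough illustration of the values section described above, a minimal sketch might look like the following. Only `grafanaDashboard.enabled` and its `false` default come from this PR; the `labels` and `annotations` key names are assumptions chosen to match the "label/annotation customization" wording, and the actual chart may name them differently.

```yaml
# Sketch of a values.yaml section for the optional dashboard.
# `enabled` and its default are taken from the PR description;
# `labels` and `annotations` are hypothetical key names.
grafanaDashboard:
  # Dashboard ConfigMap is not rendered unless this is set to true.
  enabled: false
  # Hypothetical: extra labels added to the ConfigMap.
  labels: {}
  # Hypothetical: extra annotations added to the ConfigMap.
  annotations: {}
```

With a layout like this, enabling the dashboard would be a one-line override such as `--set grafanaDashboard.enabled=true`.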
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| `charts/kubeflow-trainer/values.yaml` | Adds `grafanaDashboard` values to gate and configure dashboard installation. |
| `charts/kubeflow-trainer/templates/grafana/configmap.yaml` | Renders the dashboard ConfigMap only when `grafanaDashboard.enabled` is true and injects the JSON from chart files. |
| `charts/kubeflow-trainer/dashboards/kubeflow-trainer-dashboard.json` | Provides the default Grafana dashboard model using controller-runtime / workqueue metrics. |
| `charts/kubeflow-trainer/tests/grafana/configmap_test.yaml` | Adds helm-unittest coverage for enablement gating, labels/annotations, and JSON data key presence. |
| `charts/kubeflow-trainer/README.md.gotmpl` | Documents how to enable the dashboard via Helm values. |
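The gated ConfigMap template summarized in the table could be sketched roughly as below. This is not the template from the PR: the ConfigMap name, the `labels`/`annotations` value keys, and the hard-coded placement of the `grafana_dashboard` label are assumptions for illustration. The `.Files.Get` call and the dashboard path relative to the chart root follow the file listing above.

```yaml
{{- /* Sketch only: render nothing unless the dashboard is explicitly enabled. */}}
{{- if .Values.grafanaDashboard.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
  # Hypothetical name; the real chart may use a naming helper instead.
  name: {{ .Release.Name }}-grafana-dashboard
  namespace: {{ .Release.Namespace }}
  labels:
    # Label watched by the Grafana sidecar for dashboard auto-discovery.
    grafana_dashboard: "1"
    {{- /* Hypothetical value key for user-supplied extra labels. */}}
    {{- with .Values.grafanaDashboard.labels }}
    {{- toYaml . | nindent 4 }}
    {{- end }}
  {{- /* Hypothetical value key for user-supplied annotations. */}}
  {{- with .Values.grafanaDashboard.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
data:
  # The dashboard JSON is read verbatim from the chart's files, not templated.
  kubeflow-trainer-dashboard.json: |-
    {{- .Files.Get "dashboards/kubeflow-trainer-dashboard.json" | nindent 4 }}
{{- end }}
```

Reading the JSON with `.Files.Get` rather than inlining it keeps Grafana's own `{{ ... }}` legend placeholders from being interpreted by Helm's template engine.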
Thanks for the PR @h0pers

Could you add a screenshot of the dashboard with sample data? It helps verify that layout, panel sizing, and threshold colors render as intended.

Also, the "Docs included" checkbox is checked, but the only docs change is to the Helm chart's `README.md.gotmpl`. The Kubeflow Trainer docs site has no monitoring/observability section for V2, only a legacy V1 Prometheus page with different metrics. Could you either add a page to the website docs or open a follow-up issue tracking it?
Sridhar1030 left a comment
Reviewed with the help of AI.
Overall this is a solid addition -- the PromQL queries are correct against controller-runtime metrics, the Helm template follows existing chart patterns, and the opt-in gating is clean. A few things to address inline.
…er-dashboard.json Co-authored-by: Sridhar Pillai <77533416+Sridhar1030@users.noreply.github.com> Signed-off-by: Dmytro Hryshchenko <dmytrohryshchenkowork@gmail.com>
…rd section Signed-off-by: Dmytro Hryshchenko <dhryshch@redhat.com>
@Sridhar1030 here are some screenshots
Thanks for the screenshots, just a minor change. The current state of the dashboard JSON has two issues:
- The `job` variable still has `"regex": ".trainer."` (original bug, unfixed).
- The `controller` variable's datasource is missing `"type": "prometheus"`; please restore it.
Signed-off-by: Dmytro Hryshchenko <dhryshch@redhat.com>
d9ac90a to 6994ed1



What this PR does / why we need it:
Adds a default Grafana dashboard shipped as a ConfigMap in the Kubeflow Trainer Helm chart, providing out-of-the-box visibility into controller health and TrainJob lifecycle.
The dashboard is disabled by default (`grafanaDashboard.enabled: false`) - zero impact on existing deployments. When enabled, the ConfigMap is labeled for Grafana sidecar auto-discovery (`grafana_dashboard: "1"`). It includes 10 panels across 3 rows (Controller Health, Queue & Backlog, TrainJob Lifecycle) using controller-runtime metrics already exposed by the Trainer controller.

Changes:
- `charts/kubeflow-trainer/dashboards/kubeflow-trainer-dashboard.json` - Grafana dashboard JSON model
- `charts/kubeflow-trainer/templates/grafana/configmap.yaml` - Helm template gated by `grafanaDashboard.enabled`
- `charts/kubeflow-trainer/tests/grafana/configmap_test.yaml` - 7 Helm unit tests for the ConfigMap
- `charts/kubeflow-trainer/values.yaml` - new `grafanaDashboard` values section
- `charts/kubeflow-trainer/README.md.gotmpl` - documentation for enabling the dashboard

Which issue(s) this PR fixes:
Fixes #3430
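
For readers unfamiliar with helm-unittest, the enablement gating covered by the ConfigMap tests listed under Changes could be exercised with a suite along these lines. This is a hedged sketch, not the 7 tests from the PR: the suite name and test descriptions are made up, and only the template path, the `grafanaDashboard.enabled` value, and the `grafana_dashboard: "1"` label come from the description above.

```yaml
# Sketch of a helm-unittest suite; names and descriptions are illustrative.
suite: grafana dashboard configmap (sketch)
templates:
  - grafana/configmap.yaml
tests:
  - it: renders nothing with default values
    asserts:
      # With grafanaDashboard.enabled=false the template emits no documents.
      - hasDocuments:
          count: 0

  - it: renders a labeled ConfigMap when enabled
    set:
      grafanaDashboard.enabled: true
    asserts:
      - hasDocuments:
          count: 1
      - isKind:
          of: ConfigMap
      # Label used by the Grafana sidecar for auto-discovery.
      - equal:
          path: metadata.labels.grafana_dashboard
          value: "1"
```

A suite like this would typically be run with `helm unittest charts/kubeflow-trainer`, assuming the helm-unittest plugin is installed.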
Checklist: