Commit d120a3d

zhoward-1, claude, and austingreco authored
docs: add monitoring and observability guide (#1040)
## Summary

Adds `docs/operator-guides/monitoring.md` covering the full observability setup for Michelangelo deployments. Covers:

- **Prometheus scrape configuration**: `ServiceMonitor` for the controller manager (port 8091), health probe endpoints (port 8081), API server gRPC metrics, and Envoy admin stats (port 9901)
- **Key metrics** organized by subsystem: job scheduling, Temporal workflow engine, model serving (Envoy upstream metrics), and controller-runtime health metrics
- **5 alerting rules**: job scheduling backlog, no healthy compute clusters (critical), controller reconcile error rate, inference latency P99, inference 5xx error rate
- **Grafana dashboard** panel recommendations by row (overview, jobs, serving, controller health) with PromQL queries
- **Structured logging** configuration and a table of important log fields to index for log aggregation systems

Part of the operator/contributor guide improvements proposed in #1033.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Austin Greco <austingreco@gmail.com>
1 parent adcb692 commit d120a3d

1 file changed

Lines changed: 233 additions & 0 deletions

File tree

docs/operator-guides/monitoring.md

# Monitoring & Observability

Michelangelo components expose Prometheus metrics that integrate with a standard Kubernetes observability stack. This guide covers scrape configuration, key metrics to monitor, alerting rules, and logging configuration.

## Prometheus Scrape Configuration

### Controller Manager

The controller manager exposes metrics on port `8091` (configured via `metricsBindAddress`). If you are using the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator), create a `ServiceMonitor`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: michelangelo-controllermgr
  namespace: ma-system
  labels:
    app: michelangelo-controllermgr
spec:
  selector:
    matchLabels:
      app: michelangelo-controllermgr
  endpoints:
  - port: metrics  # Must match the Service port name for port 8091
    path: /metrics
    interval: 30s
```

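The `ServiceMonitor` selects a `Service` whose metrics port is named `metrics`. If your deployment does not already define one, a minimal sketch looks like this (the label values are assumptions; match them to whatever labels your controller manager pods carry):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: michelangelo-controllermgr
  namespace: ma-system
  labels:
    app: michelangelo-controllermgr
spec:
  selector:
    app: michelangelo-controllermgr  # assumed pod label
  ports:
  - name: metrics  # the port name the ServiceMonitor endpoint references
    port: 8091
    targetPort: 8091
```
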
### Health Probes

The controller manager exposes health endpoints on port `8083` (configured via `healthProbeBindAddress`):

| Endpoint | Purpose |
|----------|---------|
| `GET :8083/healthz` | Liveness — is the process alive? |
| `GET :8083/readyz` | Readiness — is the controller ready to reconcile? |

These are used by Kubernetes liveness and readiness probes, but you can also poll them from your monitoring stack for coarser-grained health checks.

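For reference, a typical probe stanza for the controller manager container might look like the following (a sketch; tune the delays and periods to your environment):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8083
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /readyz
    port: 8083
  initialDelaySeconds: 5
  periodSeconds: 10
```
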
### API Server

The API server (port `15566`) exposes standard gRPC metrics. If you have a Prometheus scrape job for gRPC services, point it at the API server pod.

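If you manage scrape jobs directly rather than through the Prometheus Operator, a pod-discovery job along these lines is one way to do that. This is a sketch: the `app: michelangelo-apiserver` label and the default `/metrics` path are assumptions, so verify both against your deployment manifests:

```yaml
scrape_configs:
- job_name: michelangelo-apiserver
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names: [ma-system]
  relabel_configs:
  # Keep only API server pods; the label value is an assumption --
  # match whatever labels your manifests actually apply.
  - source_labels: [__meta_kubernetes_pod_label_app]
    action: keep
    regex: michelangelo-apiserver
```
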
### Envoy Proxy

Envoy can expose an admin stats interface for scraping request counts, latency histograms, and upstream error rates. The admin interface is **not enabled by default** in the Michelangelo Envoy configuration — you must add an `admin:` block to your Envoy ConfigMap to enable it. See the [Envoy admin documentation](https://www.envoyproxy.io/docs/envoy/latest/operations/admin) for setup instructions. Once enabled, add a Prometheus scrape job targeting the admin port.

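The admin block itself is a few lines of standard Envoy bootstrap configuration; for example, binding to port `9901` (any free port works):

```yaml
admin:
  address:
    socket_address:
      address: 0.0.0.0  # consider 127.0.0.1 if you scrape via a sidecar
      port_value: 9901
```

Once enabled, Envoy serves Prometheus-format stats at `/stats/prometheus` on that port, which is the endpoint your scrape job should target.
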
---

## Key Metrics

### Pipeline Runs

| Metric | Description | Unit |
|--------|-------------|------|
| `pipelinerun_result_total` | Pipeline run results, by `state`, `pipeline_type`, `environment`, `tier` | Count |
| `pipelinerun_result_failure_total` | Failed pipeline runs, with `failure_reason` label | Count |
| `pipelinerun_duration_seconds` | Pipeline run execution duration (histogram) | Seconds |
| `pipelinerun_failed` | Gauge: 1 if most recent run failed, 0 if succeeded | Gauge |
| `pipelinerun_step_success_total` | Step completions, by `step_name` and `pipeline_type` | Count |
| `pipeline_ready_total` | Pipelines reaching Ready state | Count |

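As a starting point, you might materialize the most-watched views of these counters as Prometheus recording rules. A minimal sketch; the rule names are illustrative, not part of Michelangelo:

```yaml
groups:
- name: michelangelo-pipeline-views
  rules:
  # Failure rate broken out by reason, for dashboards and triage
  - record: pipelinerun:failure_rate5m:by_reason
    expr: sum by (failure_reason) (rate(pipelinerun_result_failure_total[5m]))
  # P99 run duration over a 5m window
  - record: pipelinerun:duration_seconds:p99
    expr: histogram_quantile(0.99, rate(pipelinerun_duration_seconds_bucket[5m]))
```
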
### Workflow Engine

Workflow metrics are emitted by the Cadence or Temporal server, not by Michelangelo. Consult your workflow engine's documentation for its native Prometheus metrics. Michelangelo's worker-side reconcile metrics are captured under the `pipelinerun_*` counters above.

### Model Serving (Envoy)

If you have enabled the Envoy admin interface, these standard Envoy metrics are available:

| Metric | Description | Unit |
|--------|-------------|------|
| `envoy_cluster_upstream_rq_total` | Total requests to inference backends | Count |
| `envoy_cluster_upstream_rq_5xx` | 5xx error responses from inference backends | Count |
| `envoy_cluster_upstream_rq_time` | Request latency histogram to inference servers | Milliseconds |

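For example, the upstream 5xx ratio that the alerting rules below build on can be precomputed as a recording rule (again a sketch; the rule name is illustrative):

```yaml
groups:
- name: michelangelo-serving-views
  rules:
  # Fraction of inference requests answered with a 5xx, over a 5m window
  - record: envoy:inference_5xx_ratio:rate5m
    expr: >
      sum(rate(envoy_cluster_upstream_rq_5xx[5m]))
      / sum(rate(envoy_cluster_upstream_rq_total[5m]))
```
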
### Controller Manager Health

The controller manager uses `controller-runtime` metrics — these are standard across all Kubernetes operators:

| Metric | Description | Unit |
|--------|-------------|------|
| `controller_runtime_reconcile_errors_total` | Reconcile errors, by `controller` label | Count |
| `controller_runtime_reconcile_time_seconds` | Reconcile duration histogram | Seconds |
| `workqueue_depth` | Work items currently queued, by `name` label (one per controller) | Gauge |
| `workqueue_retries_total` | Work item retries — an elevated value indicates persistent failures | Count |

---

## Alerting Rules

Add these rules to your Prometheus configuration:

```yaml
groups:
- name: michelangelo
  rules:

  # Pipeline run failure rate
  - alert: PipelineRunFailureRateHigh
    expr: rate(pipelinerun_result_failure_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pipeline run failures detected"
      description: >
        Pipeline runs are failing at {{ $value | humanize }} failures/sec.
        Check failure reasons: kubectl -n ma-system get pipelineruns --field-selector status.phase=Failed

  # Pipeline run duration: P99 above 1 hour
  - alert: PipelineRunDurationHigh
    expr: >
      histogram_quantile(0.99,
        rate(pipelinerun_duration_seconds_bucket[5m])
      ) > 3600
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pipeline run P99 duration above 1 hour"
      description: >
        The 99th percentile pipeline run duration is {{ $value | humanize }}s.

  # Controller reconcile errors — sustained error rate from any controller
  - alert: ControllerReconcileErrorRate
    expr: rate(controller_runtime_reconcile_errors_total[5m]) > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Controller {{ $labels.controller }} has high reconcile error rate"
      description: >
        The {{ $labels.controller }} controller is failing reconciles at
        {{ $value | humanize }} errors/sec. Check logs:
        kubectl -n ma-system logs deployment/michelangelo-controllermgr

  # Inference latency: P99 above 500ms for 5 minutes
  - alert: InferenceLatencyHigh
    expr: >
      histogram_quantile(0.99,
        rate(envoy_cluster_upstream_rq_time_bucket[5m])
      ) > 500
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Inference P99 latency is above 500ms"
      description: >
        The 99th percentile inference request latency is {{ $value }}ms.
        Check InferenceServer and model-sync sidecar logs.

  # Inference error rate: more than 1% of requests returning 5xx
  - alert: InferenceErrorRateHigh
    expr: >
      rate(envoy_cluster_upstream_rq_5xx[5m])
      / rate(envoy_cluster_upstream_rq_total[5m]) > 0.01
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Inference 5xx error rate above 1%"
      description: >
        {{ $value | humanizePercentage }} of inference requests are returning 5xx errors.
```

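If you deploy Prometheus with the Prometheus Operator (as in the `ServiceMonitor` example above), the same group can be shipped as a `PrometheusRule` resource instead of a static rule file. A minimal wrapper sketch:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: michelangelo-alerts
  namespace: ma-system
spec:
  groups:
  - name: michelangelo
    rules:
    # ... paste the rules from the block above, re-indented under spec.groups
```
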
---

## Grafana Dashboard

Create a Grafana dashboard with these panels to get operational visibility at a glance. Queries written with `0.5/0.99` denote two panel queries, one per quantile.

### Overview row

| Panel | Query | Visualization |
|-------|-------|---------------|
| Pipeline run results | `rate(pipelinerun_result_total[5m])` | Time series |
| Pipeline run failures | `pipelinerun_failed` | Stat |
| Pipeline readiness | `pipeline_ready_total` | Stat |
| Reconcile errors | `rate(controller_runtime_reconcile_errors_total[5m])` | Time series |

### Jobs row

| Panel | Query | Visualization |
|-------|-------|---------------|
| Pipeline run duration P50/P99 | `histogram_quantile(0.5/0.99, rate(pipelinerun_duration_seconds_bucket[5m]))` | Time series |
| Failure rate by reason | `rate(pipelinerun_result_failure_total[5m])` by `failure_reason` | Time series |

### Serving row

| Panel | Query | Visualization |
|-------|-------|---------------|
| Request rate | `rate(envoy_cluster_upstream_rq_total[5m])` | Time series |
| Request latency P50/P99 | `histogram_quantile(0.5/0.99, rate(envoy_cluster_upstream_rq_time_bucket[5m]))` | Time series |
| 5xx error rate | `rate(envoy_cluster_upstream_rq_5xx[5m])` | Time series |
| Active model deployments | `envoy_cluster_upstream_rq_total` (by cluster) | Table |

### Controller health row

| Panel | Query | Visualization |
|-------|-------|---------------|
| Reconcile error rate by controller | `rate(controller_runtime_reconcile_errors_total[5m])` | Time series |
| Reconcile latency P99 | `histogram_quantile(0.99, rate(controller_runtime_reconcile_time_seconds_bucket[5m]))` | Time series |
| Work queue depth | `workqueue_depth` | Time series |

---

## Structured Logging

All Michelangelo components emit structured logs. Configure log format and level in the relevant ConfigMap:

```yaml
logging:
  level: info         # debug | info | warn | error
  development: false  # true enables human-readable console output
  encoding: json      # json for production; console for development
```

For production deployments use `encoding: json` so your log aggregation system (Loki, Elasticsearch, CloudWatch Logs, etc.) can parse and query fields natively.

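For example, a single error-level record from the controller manager might look like this. The line is hypothetical and the field values are illustrative; the fields themselves are described in the table below:

```json
{
  "level": "error",
  "ts": "2024-01-15T10:32:07Z",
  "logger": "pipelinerun-controller",
  "msg": "reconcile failed",
  "namespace": "ma-jobs",
  "name": "train-fraud-model-42",
  "operation": "schedule_job",
  "error": "no compute cluster available"
}
```
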
### Important log fields to index

| Field | Description |
|-------|-------------|
| `level` | Log severity |
| `logger` | Component/controller name |
| `msg` | Log message |
| `namespace` | Kubernetes resource namespace |
| `name` | Kubernetes resource name |
| `operation` | Controller operation (e.g., `create_ray_cluster`, `schedule_job`) |
| `error` | Error message (present on error-level logs) |

Indexing these fields allows you to efficiently query all events for a specific resource (`namespace` + `name`), filter by controller (`logger`), or find all failures across the control plane (`level: error`).