Commit d120a3d

zhoward-1, claude, and austingreco authored
docs: add monitoring and observability guide (#1040)
## Summary

Adds `docs/operator-guides/monitoring.md` covering the full observability setup for Michelangelo deployments. Covers:

- **Prometheus scrape configuration**: `ServiceMonitor` for the controller manager (port 8091), health probe endpoints (port 8081), API server gRPC metrics, and Envoy admin stats (port 9901)
- **Key metrics** organized by subsystem: job scheduling, Temporal workflow engine, model serving (Envoy upstream metrics), and controller-runtime health metrics
- **5 alerting rules**: job scheduling backlog, no healthy compute clusters (critical), controller reconcile error rate, inference latency P99, inference 5xx error rate
- **Grafana dashboard** panel recommendations by row (overview, jobs, serving, controller health) with PromQL queries
- **Structured logging** configuration and a table of important log fields to index for log aggregation systems

Part of the operator/contributor guide improvements proposed in #1033.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Austin Greco <austingreco@gmail.com>
1 parent adcb692 commit d120a3d

1 file changed

Lines changed: 233 additions & 0 deletions

File tree

docs/operator-guides/monitoring.md

# Monitoring & Observability

Michelangelo components expose Prometheus metrics that integrate with a standard Kubernetes observability stack. This guide covers scrape configuration, key metrics to monitor, alerting rules, and logging configuration.

## Prometheus Scrape Configuration

### Controller Manager

The controller manager exposes metrics on port `8091` (configured via `metricsBindAddress`). If you are using the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator), create a `ServiceMonitor`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: michelangelo-controllermgr
  namespace: ma-system
  labels:
    app: michelangelo-controllermgr
spec:
  selector:
    matchLabels:
      app: michelangelo-controllermgr
  endpoints:
  - port: metrics  # Must match the Service port name for port 8091
    path: /metrics
    interval: 30s
```

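The `ServiceMonitor` selects a `Service` whose metrics port is named `metrics`. If your deployment does not already define one, a minimal sketch looks like this (the label values are assumptions; match them to whatever labels your controller manager pods carry):

```yaml
apiVersion: v1
kind: Service
metadata:
  name: michelangelo-controllermgr
  namespace: ma-system
  labels:
    app: michelangelo-controllermgr
spec:
  selector:
    app: michelangelo-controllermgr  # assumed pod label
  ports:
  - name: metrics  # the port name the ServiceMonitor endpoint references
    port: 8091
    targetPort: 8091
```
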
### Health Probes

The controller manager exposes health endpoints on port `8083` (configured via `healthProbeBindAddress`):

| Endpoint | Purpose |
|----------|---------|
| `GET :8083/healthz` | Liveness — is the process alive? |
| `GET :8083/readyz` | Readiness — is the controller ready to reconcile? |

These are used by Kubernetes liveness and readiness probes, but you can also poll them from your monitoring stack for coarser-grained health checks.

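For reference, a typical probe stanza for the controller manager container might look like the following (a sketch; tune the delays and periods to your environment):

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8083
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /readyz
    port: 8083
  initialDelaySeconds: 5
  periodSeconds: 10
```
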
### API Server

The API server (port `15566`) exposes standard gRPC metrics. If you have a Prometheus scrape job for gRPC services, point it at the API server pod.

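If you manage scrape jobs directly rather than through the Prometheus Operator, a pod-discovery job along these lines is one way to do that. This is a sketch: the `app: michelangelo-apiserver` label and the default `/metrics` path are assumptions, so verify both against your deployment manifests:

```yaml
scrape_configs:
- job_name: michelangelo-apiserver
  kubernetes_sd_configs:
  - role: pod
    namespaces:
      names: [ma-system]
  relabel_configs:
  # Keep only API server pods; the label value is an assumption --
  # match whatever labels your manifests actually apply.
  - source_labels: [__meta_kubernetes_pod_label_app]
    action: keep
    regex: michelangelo-apiserver
```
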
### Envoy Proxy

Envoy can expose an admin stats interface for scraping request counts, latency histograms, and upstream error rates. The admin interface is **not enabled by default** in the Michelangelo Envoy configuration — you must add an `admin:` block to your Envoy ConfigMap to enable it. See the [Envoy admin documentation](https://www.envoyproxy.io/docs/envoy/latest/operations/admin) for setup instructions. Once enabled, add a Prometheus scrape job targeting the admin port.

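The admin block itself is a few lines of standard Envoy bootstrap configuration; for example, binding to port `9901` (any free port works):

```yaml
admin:
  address:
    socket_address:
      address: 0.0.0.0  # consider 127.0.0.1 if you scrape via a sidecar
      port_value: 9901
```

Once enabled, Envoy serves Prometheus-format stats at `/stats/prometheus` on that port, which is the endpoint your scrape job should target.
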
---

## Key Metrics

### Pipeline Runs

| Metric | Description | Unit |
|--------|-------------|------|
| `pipelinerun_result_total` | Pipeline run results, by `state`, `pipeline_type`, `environment`, `tier` | Count |
| `pipelinerun_result_failure_total` | Failed pipeline runs, with `failure_reason` label | Count |
| `pipelinerun_duration_seconds` | Pipeline run execution duration (histogram) | Seconds |
| `pipelinerun_failed` | Gauge: 1 if most recent run failed, 0 if succeeded | Gauge |
| `pipelinerun_step_success_total` | Step completions, by `step_name` and `pipeline_type` | Count |
| `pipeline_ready_total` | Pipelines reaching Ready state | Count |

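As a starting point, you might materialize the most-watched views of these counters as Prometheus recording rules. A minimal sketch; the rule names are illustrative, not part of Michelangelo:

```yaml
groups:
- name: michelangelo-pipeline-views
  rules:
  # Failure rate broken out by reason, for dashboards and triage
  - record: pipelinerun:failure_rate5m:by_reason
    expr: sum by (failure_reason) (rate(pipelinerun_result_failure_total[5m]))
  # P99 run duration over a 5m window
  - record: pipelinerun:duration_seconds:p99
    expr: histogram_quantile(0.99, rate(pipelinerun_duration_seconds_bucket[5m]))
```
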
### Workflow Engine

Workflow metrics are emitted by the Cadence or Temporal server, not by Michelangelo. Consult your workflow engine's documentation for its native Prometheus metrics. Michelangelo's worker-side reconcile metrics are captured under the `pipelinerun_*` counters above.

### Model Serving (Envoy)

If you have enabled the Envoy admin interface, these standard Envoy metrics are available:

| Metric | Description | Unit |
|--------|-------------|------|
| `envoy_cluster_upstream_rq_total` | Total requests to inference backends | Count |
| `envoy_cluster_upstream_rq_5xx` | 5xx error responses from inference backends | Count |
| `envoy_cluster_upstream_rq_time` | Request latency histogram to inference servers | Milliseconds |

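For example, the upstream 5xx ratio that the alerting rules below build on can be precomputed as a recording rule (again a sketch; the rule name is illustrative):

```yaml
groups:
- name: michelangelo-serving-views
  rules:
  # Fraction of inference requests answered with a 5xx, over a 5m window
  - record: envoy:inference_5xx_ratio:rate5m
    expr: >
      sum(rate(envoy_cluster_upstream_rq_5xx[5m]))
      / sum(rate(envoy_cluster_upstream_rq_total[5m]))
```
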
### Controller Manager Health

The controller manager uses `controller-runtime` metrics — these are standard across all Kubernetes operators:

| Metric | Description | Unit |
|--------|-------------|------|
| `controller_runtime_reconcile_errors_total` | Reconcile errors, by `controller` label | Count |
| `controller_runtime_reconcile_time_seconds` | Reconcile duration histogram | Seconds |
| `workqueue_depth` | Work items currently queued, by `name` label (one per controller) | Gauge |
| `workqueue_retries_total` | Work item retries — an elevated value indicates persistent failures | Count |

---

## Alerting Rules

Add these rules to your Prometheus configuration:

```yaml
groups:
- name: michelangelo
  rules:

  # Pipeline run failure rate
  - alert: PipelineRunFailureRateHigh
    expr: rate(pipelinerun_result_failure_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pipeline run failures detected"
      description: >
        Pipeline runs are failing at {{ $value | humanize }} failures/sec.
        Check failure reasons: kubectl -n ma-system get pipelineruns --field-selector status.phase=Failed

  # Pipeline run duration: P99 above 1 hour
  - alert: PipelineRunDurationHigh
    expr: >
      histogram_quantile(0.99,
        rate(pipelinerun_duration_seconds_bucket[5m])
      ) > 3600
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pipeline run P99 duration above 1 hour"
      description: >
        The 99th percentile pipeline run duration is {{ $value | humanize }}s.

  # Controller reconcile errors — sustained error rate from any controller
  - alert: ControllerReconcileErrorRate
    expr: rate(controller_runtime_reconcile_errors_total[5m]) > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Controller {{ $labels.controller }} has high reconcile error rate"
      description: >
        The {{ $labels.controller }} controller is failing reconciles at
        {{ $value | humanize }} errors/sec. Check logs:
        kubectl -n ma-system logs deployment/michelangelo-controllermgr

  # Inference latency: P99 above 500ms for 5 minutes
  - alert: InferenceLatencyHigh
    expr: >
      histogram_quantile(0.99,
        rate(envoy_cluster_upstream_rq_time_bucket[5m])
      ) > 500
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Inference P99 latency is above 500ms"
      description: >
        The 99th percentile inference request latency is {{ $value }}ms.
        Check InferenceServer and model-sync sidecar logs.

  # Inference error rate: more than 1% of requests returning 5xx
  - alert: InferenceErrorRateHigh
    expr: >
      rate(envoy_cluster_upstream_rq_5xx[5m])
      / rate(envoy_cluster_upstream_rq_total[5m]) > 0.01
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Inference 5xx error rate above 1%"
      description: >
        {{ $value | humanizePercentage }} of inference requests are returning 5xx errors.
```

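If you deploy Prometheus with the Prometheus Operator (as in the `ServiceMonitor` example above), the same group can be shipped as a `PrometheusRule` resource instead of a static rule file. A minimal wrapper sketch:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: michelangelo-alerts
  namespace: ma-system
spec:
  groups:
  - name: michelangelo
    rules:
    # ... paste the rules from the block above, re-indented under spec.groups
```
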
---

## Grafana Dashboard

Create a Grafana dashboard with these panels to get operational visibility at a glance. Queries written with `0.5/0.99` denote two panel queries, one per quantile.

### Overview row

| Panel | Query | Visualization |
|-------|-------|---------------|
| Pipeline run results | `rate(pipelinerun_result_total[5m])` | Time series |
| Pipeline run failures | `pipelinerun_failed` | Stat |
| Pipeline readiness | `pipeline_ready_total` | Stat |
| Reconcile errors | `rate(controller_runtime_reconcile_errors_total[5m])` | Time series |

### Jobs row

| Panel | Query | Visualization |
|-------|-------|---------------|
| Pipeline run duration P50/P99 | `histogram_quantile(0.5/0.99, rate(pipelinerun_duration_seconds_bucket[5m]))` | Time series |
| Failure rate by reason | `rate(pipelinerun_result_failure_total[5m])` by `failure_reason` | Time series |

### Serving row

| Panel | Query | Visualization |
|-------|-------|---------------|
| Request rate | `rate(envoy_cluster_upstream_rq_total[5m])` | Time series |
| Request latency P50/P99 | `histogram_quantile(0.5/0.99, rate(envoy_cluster_upstream_rq_time_bucket[5m]))` | Time series |
| 5xx error rate | `rate(envoy_cluster_upstream_rq_5xx[5m])` | Time series |
| Active model deployments | `envoy_cluster_upstream_rq_total` (by cluster) | Table |

### Controller health row

| Panel | Query | Visualization |
|-------|-------|---------------|
| Reconcile error rate by controller | `rate(controller_runtime_reconcile_errors_total[5m])` | Time series |
| Reconcile latency P99 | `histogram_quantile(0.99, rate(controller_runtime_reconcile_time_seconds_bucket[5m]))` | Time series |
| Work queue depth | `workqueue_depth` | Time series |

---

## Structured Logging

All Michelangelo components emit structured logs. Configure log format and level in the relevant ConfigMap:

```yaml
logging:
  level: info         # debug | info | warn | error
  development: false  # true enables human-readable console output
  encoding: json      # json for production; console for development
```

For production deployments use `encoding: json` so your log aggregation system (Loki, Elasticsearch, CloudWatch Logs, etc.) can parse and query fields natively.

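For example, a single error-level record from the controller manager might look like this. The line is hypothetical and the field values are illustrative; the fields themselves are described in the table below:

```json
{
  "level": "error",
  "ts": "2024-01-15T10:32:07Z",
  "logger": "pipelinerun-controller",
  "msg": "reconcile failed",
  "namespace": "ma-jobs",
  "name": "train-fraud-model-42",
  "operation": "schedule_job",
  "error": "no compute cluster available"
}
```
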
### Important log fields to index

| Field | Description |
|-------|-------------|
| `level` | Log severity |
| `logger` | Component/controller name |
| `msg` | Log message |
| `namespace` | Kubernetes resource namespace |
| `name` | Kubernetes resource name |
| `operation` | Controller operation (e.g., `create_ray_cluster`, `schedule_job`) |
| `error` | Error message (present on error-level logs) |

Indexing these fields allows you to efficiently query all events for a specific resource (`namespace` + `name`), filter by controller (`logger`), or find all failures across the control plane (`level: error`).