# Monitoring & Observability

Michelangelo components expose Prometheus metrics that integrate with a standard Kubernetes observability stack. This guide covers scrape configuration, key metrics to monitor, alerting rules, and logging configuration.

## Prometheus Scrape Configuration

### Controller Manager

The controller manager exposes metrics on port `8091` (configured via `metricsBindAddress`). If you are using the [Prometheus Operator](https://github.com/prometheus-operator/prometheus-operator), create a `ServiceMonitor`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: michelangelo-controllermgr
  namespace: ma-system
  labels:
    app: michelangelo-controllermgr
spec:
  selector:
    matchLabels:
      app: michelangelo-controllermgr
  endpoints:
  - port: metrics  # Must match the Service port name for port 8091
    path: /metrics
    interval: 30s
```

### Health Probes

The controller manager exposes health endpoints on port `8083` (configured via `healthProbeBindAddress`):

| Endpoint | Purpose |
|----------|---------|
| `GET :8083/healthz` | Liveness — is the process alive? |
| `GET :8083/readyz` | Readiness — is the controller ready to reconcile? |

These are used by Kubernetes liveness and readiness probes, but you can also poll them from your monitoring stack for coarser-grained health checks.
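
For reference, probes against these endpoints typically look like the following in the controller manager container spec. This is a sketch; the timing values are illustrative, not the shipped defaults:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8083
  initialDelaySeconds: 15
  periodSeconds: 20
readinessProbe:
  httpGet:
    path: /readyz
    port: 8083
  initialDelaySeconds: 5
  periodSeconds: 10
```

You can also check the endpoints by hand with `kubectl -n ma-system port-forward deployment/michelangelo-controllermgr 8083` followed by `curl localhost:8083/readyz`.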

### API Server

The API server (port `15566`) exposes standard gRPC metrics. If you have a Prometheus scrape job for gRPC services, point it at the API server pod.
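
With the Prometheus Operator, that scrape job can be another `ServiceMonitor`. The sketch below assumes the API server Service carries an `app: michelangelo-apiserver` label and exposes a port named `metrics` serving `/metrics` over HTTP; both are assumptions, so adjust them to match your deployment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: michelangelo-apiserver
  namespace: ma-system
spec:
  selector:
    matchLabels:
      app: michelangelo-apiserver   # assumed label on the API server Service
  endpoints:
  - port: metrics                   # assumed port name for the HTTP metrics endpoint
    path: /metrics
    interval: 30s
```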
| 43 | + |
| 44 | +### Envoy Proxy |
| 45 | + |
| 46 | +Envoy can expose an admin stats interface for scraping request counts, latency histograms, and upstream error rates. The admin interface is **not enabled by default** in the Michelangelo Envoy configuration — you must add an `admin:` block to your Envoy ConfigMap to enable it. See the [Envoy admin documentation](https://www.envoyproxy.io/docs/envoy/latest/operations/admin) for setup instructions. Once enabled, add a Prometheus scrape job targeting the admin port. |
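
A minimal `admin:` block looks like this; the port value `9901` is an assumption, any free port works:

```yaml
admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9901   # assumed admin port; expose it on the Envoy pod as well
```

Envoy serves Prometheus-format stats at `/stats/prometheus` on the admin listener, so set `metrics_path: /stats/prometheus` in the scrape job that targets this port.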

---

## Key Metrics

### Pipeline Runs

| Metric | Description | Unit |
|--------|-------------|------|
| `pipelinerun_result_total` | Pipeline run results, by `state`, `pipeline_type`, `environment`, `tier` | Count |
| `pipelinerun_result_failure_total` | Failed pipeline runs, with `failure_reason` label | Count |
| `pipelinerun_duration_seconds` | Pipeline run execution duration (histogram) | Seconds |
| `pipelinerun_failed` | Gauge: 1 if most recent run failed, 0 if succeeded | Gauge |
| `pipelinerun_step_success_total` | Step completions, by `step_name` and `pipeline_type` | Count |
| `pipeline_ready_total` | Pipelines reaching Ready state | Count |
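
If you reuse the same ratios in several dashboards or alerts, it can be convenient to precompute them with Prometheus recording rules. A sketch built from the counters above (the rule names are arbitrary):

```yaml
groups:
- name: michelangelo-pipeline-recording
  rules:
  # Failure rate over 5 minutes, broken down by reason.
  - record: michelangelo:pipelinerun_failure_rate5m
    expr: sum by (failure_reason) (rate(pipelinerun_result_failure_total[5m]))
  # Fraction of all pipeline runs that fail.
  - record: michelangelo:pipelinerun_failure_ratio5m
    expr: >
      sum(rate(pipelinerun_result_failure_total[5m]))
      /
      sum(rate(pipelinerun_result_total[5m]))
```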

### Workflow Engine

Workflow metrics are emitted by the Cadence or Temporal server, not by Michelangelo. Consult your workflow engine's documentation for its native Prometheus metrics. Michelangelo's worker-side reconcile metrics are captured under the `pipelinerun_*` counters above.

### Model Serving (Envoy)

If you have enabled the Envoy admin interface, these standard Envoy metrics are available:

| Metric | Description | Unit |
|--------|-------------|------|
| `envoy_cluster_upstream_rq_total` | Total requests to inference backends | Count |
| `envoy_cluster_upstream_rq_5xx` | 5xx error responses from inference backends | Count |
| `envoy_cluster_upstream_rq_time` | Request latency histogram to inference servers | Milliseconds |

### Controller Manager Health

The controller manager uses `controller-runtime` metrics — these are standard across all Kubernetes operators:

| Metric | Description | Unit |
|--------|-------------|------|
| `controller_runtime_reconcile_errors_total` | Reconcile errors, by `controller` label | Count |
| `controller_runtime_reconcile_time_seconds` | Reconcile duration histogram | Seconds |
| `workqueue_depth` | Work items queued, by `name` label (one per controller) | Count |
| `workqueue_retries_total` | Work item retries — elevated value indicates persistent failures | Count |

---

## Alerting Rules

Add these rules to your Prometheus configuration:

```yaml
groups:
- name: michelangelo
  rules:

  # Pipeline run failure rate
  - alert: PipelineRunFailureRateHigh
    expr: rate(pipelinerun_result_failure_total[5m]) > 0
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Pipeline run failures detected"
      description: >
        Pipeline runs are failing at {{ $value | humanize }} failures/sec.
        Check failure reasons: kubectl -n ma-system get pipelineruns --field-selector status.phase=Failed

  # Pipeline run duration: P99 above 1 hour
  - alert: PipelineRunDurationHigh
    expr: >
      histogram_quantile(0.99,
        rate(pipelinerun_duration_seconds_bucket[5m])
      ) > 3600
    for: 10m
    labels:
      severity: warning
    annotations:
      summary: "Pipeline run P99 duration above 1 hour"
      description: >
        The 99th percentile pipeline run duration is {{ $value | humanize }}s.

  # Controller reconcile errors — sustained error rate from any controller
  - alert: ControllerReconcileErrorRate
    expr: rate(controller_runtime_reconcile_errors_total[5m]) > 0.1
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Controller {{ $labels.controller }} has high reconcile error rate"
      description: >
        The {{ $labels.controller }} controller is failing reconciles at
        {{ $value | humanize }} errors/sec. Check logs:
        kubectl -n ma-system logs deployment/michelangelo-controllermgr

  # Inference latency: P99 above 500ms for 5 minutes
  - alert: InferenceLatencyHigh
    expr: >
      histogram_quantile(0.99,
        rate(envoy_cluster_upstream_rq_time_bucket[5m])
      ) > 500
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Inference P99 latency is above 500ms"
      description: >
        The 99th percentile inference request latency is {{ $value }}ms.
        Check InferenceServer and model-sync sidecar logs.

  # Inference error rate: more than 1% of requests returning 5xx
  - alert: InferenceErrorRateHigh
    expr: >
      rate(envoy_cluster_upstream_rq_5xx[5m])
      / rate(envoy_cluster_upstream_rq_total[5m]) > 0.01
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Inference 5xx error rate above 1%"
      annotations_note: ""
      description: >
        {{ $value | humanizePercentage }} of inference requests are returning 5xx errors.
```
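
If Prometheus is managed by the Prometheus Operator, the same rules can be shipped as a `PrometheusRule` resource instead of editing the Prometheus config file directly. A minimal wrapper sketch; the `release: prometheus` label is an assumption and must match your Prometheus instance's `ruleSelector`:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: michelangelo-alerts
  namespace: ma-system
  labels:
    release: prometheus   # assumed label; must match the Prometheus ruleSelector
spec:
  groups:
  - name: michelangelo
    rules:
    # Paste the alert rules from the block above here, unchanged. One shown as an example:
    - alert: PipelineRunFailureRateHigh
      expr: rate(pipelinerun_result_failure_total[5m]) > 0
      for: 5m
      labels:
        severity: warning
```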

---

## Grafana Dashboard

Create a Grafana dashboard with these panels to get operational visibility at a glance.

### Overview row

| Panel | Query | Visualization |
|-------|-------|---------------|
| Pipeline run results | `rate(pipelinerun_result_total[5m])` | Time series |
| Pipeline run failures | `pipelinerun_failed` | Stat |
| Pipeline readiness | `pipeline_ready_total` | Stat |
| Reconcile errors | `rate(controller_runtime_reconcile_errors_total[5m])` | Time series |

### Jobs row

| Panel | Query | Visualization |
|-------|-------|---------------|
| Pipeline run duration P50/P99 | `histogram_quantile(0.5, rate(pipelinerun_duration_seconds_bucket[5m]))` (repeat with `0.99` for P99) | Time series |
| Failure rate by reason | `sum by (failure_reason) (rate(pipelinerun_result_failure_total[5m]))` | Time series |

### Serving row

| Panel | Query | Visualization |
|-------|-------|---------------|
| Request rate | `rate(envoy_cluster_upstream_rq_total[5m])` | Time series |
| Request latency P50/P99 | `histogram_quantile(0.5, rate(envoy_cluster_upstream_rq_time_bucket[5m]))` (repeat with `0.99` for P99) | Time series |
| 5xx error rate | `rate(envoy_cluster_upstream_rq_5xx[5m])` | Time series |
| Active model deployments | `envoy_cluster_upstream_rq_total` (by cluster) | Table |

### Controller health row

| Panel | Query | Visualization |
|-------|-------|---------------|
| Reconcile error rate by controller | `rate(controller_runtime_reconcile_errors_total[5m])` | Time series |
| Reconcile latency P99 | `histogram_quantile(0.99, rate(controller_runtime_reconcile_time_seconds_bucket[5m]))` | Time series |
| Work queue depth | `workqueue_depth` | Time series |

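If Grafana runs with the dashboard-provisioning sidecar (as in kube-prometheus-stack), the dashboard can be managed declaratively as a ConfigMap. A sketch; the `grafana_dashboard` label key is an assumption and depends on how your Grafana sidecar is configured:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: michelangelo-dashboard
  namespace: ma-system
  labels:
    grafana_dashboard: "1"   # assumed sidecar label; adjust to your Grafana setup
data:
  # Export the finished dashboard JSON from Grafana and paste it as the value below.
  michelangelo.json: |
    { "title": "Michelangelo", "panels": [] }
```
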
---

## Structured Logging

All Michelangelo components emit structured logs. Configure log format and level in the relevant ConfigMap:

```yaml
logging:
  level: info          # debug | info | warn | error
  development: false   # true enables human-readable console output
  encoding: json       # json for production; console for development
```

For production deployments use `encoding: json` so your log aggregation system (Loki, Elasticsearch, CloudWatch Logs, etc.) can parse and query fields natively.

### Important log fields to index

| Field | Description |
|-------|-------------|
| `level` | Log severity |
| `logger` | Component/controller name |
| `msg` | Log message |
| `namespace` | Kubernetes resource namespace |
| `name` | Kubernetes resource name |
| `operation` | Controller operation (e.g., `create_ray_cluster`, `schedule_job`) |
| `error` | Error message (present on error-level logs) |

Indexing these fields allows you to efficiently query all events for a specific resource (`namespace` + `name`), filter by controller (`logger`), or find all failures across the control plane (`level: error`).
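
As one concrete example of indexing these fields, a Promtail scrape config can parse the JSON log line and promote the low-cardinality fields to Loki labels. This is a sketch under the assumption that you ship logs with Promtail; the job name and label choices are illustrative:

```yaml
scrape_configs:
- job_name: michelangelo
  kubernetes_sd_configs:
  - role: pod
  pipeline_stages:
  # Parse the JSON log line and pull out the fields documented above.
  - json:
      expressions:
        level: level
        logger: logger
        namespace: namespace
  # Promote low-cardinality fields to Loki labels for cheap filtering.
  - labels:
      level:
      logger:
      namespace:
```

High-cardinality fields such as `name` are better left out of labels and queried at read time, for example with LogQL's `json` parser.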