Cluster-side OpenTelemetry Collector distribution and MCP-queryable event store for GPU clusters
kubernetes machine-learning gpu sre observability anomaly-detection distributed-training opentelemetry opentelemetry-collector otlp llm-inference gpu-observability straggler-detection
Updated May 16, 2026 - Go