Commit b5226e5
docs: add operator troubleshooting guide (#1038)
## Summary

Adds `docs/operator-guides/troubleshooting.md` covering nine common failure scenarios operators encounter when deploying and running Michelangelo. Each scenario has:

- **Symptoms** — what the operator observes
- **Diagnostics** — `kubectl` commands to run, with expected output explained
- **Likely causes** — ordered from most to least common

Scenarios covered:

1. Jobs not being scheduled (no cluster assignment)
2. Compute cluster registration failures
3. Ray pods not starting on the compute cluster
4. Worker cannot connect to the API server
5. Temporal/Cadence connectivity issues
6. InferenceServer not becoming healthy
7. Model not loading (Deployment stuck in Asset Preparation)
8. S3/object store errors
9. UI not loading or API calls failing

Part of the operator/contributor guide improvements proposed in #1033.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Craig Marker <craig@marker.org>
1 parent bf7d671 · commit b5226e5

1 file changed: 232 additions & 0 deletions
# Troubleshooting

This guide covers common failure scenarios with diagnostic steps and likely causes. All commands assume `kubectl` is configured for the control plane cluster unless otherwise noted.

---

## Jobs not being scheduled

**Symptoms**: Jobs are submitted but remain in a pending state with no cluster assignment.

**Diagnostics**:

```bash
# Check for scheduler errors in the controller manager
kubectl -n ma-system logs deployment/michelangelo-controllermgr | grep -i "scheduler\|assign\|enqueue"

# List registered compute clusters and their status
kubectl -n ma-system get clusters

# Inspect a specific cluster's status conditions
kubectl -n ma-system describe cluster <cluster-name>
```

**Likely causes**:

- No compute clusters are registered — complete the [cluster registration steps](jobs/register-a-compute-cluster-to-michelangelo-control-plane.md)
- The Cluster CRD has the wrong `host` or `port` — the control plane cannot reach the compute cluster's API server
- The `ray-manager` token Secret in the control plane is missing or has expired — see the check below
- The job requested resources (GPU, CPU) that no registered cluster can satisfy
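
To test the expired-token cause, confirm the token Secret exists and decodes to something non-empty. A minimal sketch — the Secret name below is a placeholder, not Michelangelo's actual name; substitute whatever your registration step created:

```bash
# Hypothetical Secret name — use the one created during cluster registration
kubectl -n ma-system get secret <ray-manager-token-secret> \
  -o jsonpath='{.data.token}' | base64 -d | cut -c1-16; echo

# Empty output means the token was never populated; re-run the
# registration steps to mint a fresh token
```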

---

## Compute cluster registration failures

**Symptoms**: The Cluster CRD is created but the cluster status shows unhealthy or unknown.

**Diagnostics**:

```bash
# Inspect cluster conditions
kubectl -n ma-system describe cluster <cluster-name>

# Verify the token and CA secrets exist in the control plane
kubectl -n default get secret cluster-<cluster-name>-client-token
kubectl -n default get secret cluster-<cluster-name>-ca-data

# Confirm the token secret is populated (output should be > 0)
kubectl -n default get secret cluster-<cluster-name>-client-token \
  -o jsonpath='{.data.token}' | wc -c

# Test network connectivity from the control plane to the compute API server
kubectl -n ma-system run connectivity-test --rm -it --restart=Never \
  --image=curlimages/curl -- curl -k https://<compute-host>:<port>/healthz
```

**Likely causes**:

- Network policy or firewall is blocking the control plane from reaching the compute cluster's API server
- The token Secret is missing the `token` key or was not populated (check the `kubernetes.io/service-account.name` annotation on the Secret)
- The CA data does not match the compute cluster's TLS certificate — see the verification sketch below
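
One way to verify the CA-mismatch cause is to check that the certificate the compute API server presents actually chains to the stored CA. A sketch, assuming the Secret stores the certificate under a `ca.crt` key (the key name is an assumption) and `openssl` is available on your workstation:

```bash
# Extract the CA the control plane is configured to trust
kubectl -n default get secret cluster-<cluster-name>-ca-data \
  -o jsonpath='{.data.ca\.crt}' | base64 -d > /tmp/compute-ca.crt

# "Verify return code: 0 (ok)" means the CA data matches the server's cert
openssl s_client -connect <compute-host>:<port> \
  -CAfile /tmp/compute-ca.crt </dev/null 2>/dev/null | grep "Verify return code"
```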

---

## Ray pods not starting on the compute cluster

**Symptoms**: A RayCluster or RayJob resource is created on the compute cluster, but head or worker pods remain Pending or enter CrashLoopBackOff.

**Diagnostics**:

```bash
# Check on the compute cluster (use its kubectl context)
kubectl --context <compute-context> get rayclusters,rayjobs
kubectl --context <compute-context> describe raycluster <name>

# List pods for the cluster
kubectl --context <compute-context> get pods -l ray.io/cluster=<cluster-name>

# Check head pod logs
kubectl --context <compute-context> logs <head-pod-name>

# Verify storage config is present on the compute cluster
kubectl --context <compute-context> get configmap michelangelo-config
kubectl --context <compute-context> get secret aws-credentials
```

**Likely causes**:

- The `michelangelo-config` ConfigMap is missing or has the wrong `AWS_ENDPOINT_URL`
- The container image cannot be pulled (wrong registry, missing imagePullSecret)
- Insufficient CPU or memory quota on the compute cluster — check `kubectl --context <compute-context> describe nodes`, or use the triage sketch below
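
For Pending pods in particular, the scheduler records the blocker in the pod's events, which is usually faster than reading full node output. A minimal triage sketch using standard `kubectl` output:

```bash
# The Events section names the blocker (FailedScheduling, ImagePullBackOff, ...)
kubectl --context <compute-context> describe pod <pod-name> | sed -n '/Events:/,$p'

# Requested vs. allocatable resources, summarized per node
kubectl --context <compute-context> describe nodes | grep -A 8 "Allocated resources"
```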

---

## Worker cannot connect to the API server

**Symptoms**: Worker pods crash-loop or restart repeatedly. Logs show connection refused, TLS errors, or authentication failures connecting to the API server.

**Diagnostics**:

```bash
# Check recent worker logs
kubectl -n ma-system logs deployment/michelangelo-worker --tail=100

# Verify the worker's configured API server address
kubectl -n ma-system get configmap michelangelo-worker-config -o yaml | grep -A3 "worker:"

# Confirm the API server deployment is running
kubectl -n ma-system get deployment michelangelo-apiserver
kubectl -n ma-system get pods -l app=michelangelo-apiserver
```

**Likely causes**:

- `worker.address` in the worker ConfigMap points to the wrong hostname or port — it must resolve to the API server from within the `ma-system` namespace (see the probe below)
- `worker.useTLS: true` is set but the API server's certificate is not trusted — ensure the CA bundle is mounted into the worker pod
- The API server is not yet ready (check its pod status and readiness probe)
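
To separate DNS failures from network or TLS ones, probe the address from inside a worker pod. A sketch, assuming the worker image ships `nc` and that the Service is named `michelangelo-apiserver` listening on port 443 — adjust both to match your `worker.address`:

```bash
# DNS resolution + TCP reachability from the worker's own network namespace
kubectl -n ma-system exec deployment/michelangelo-worker -- \
  nc -zv michelangelo-apiserver.ma-system.svc.cluster.local 443
```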

---

## Temporal / Cadence connectivity issues

**Symptoms**: Workflows fail to start. Worker logs contain errors like `failed to connect to temporal`, `context deadline exceeded`, or `domain not found`.

**Diagnostics**:

```bash
# Check worker logs for workflow engine errors
kubectl -n ma-system logs deployment/michelangelo-worker | grep -i "temporal\|cadence\|workflow"

# Inspect the configured workflow engine endpoint
kubectl -n ma-system get configmap michelangelo-worker-config -o yaml \
  | grep -A8 "workflow-engine:"

# Test TCP connectivity to Temporal from a worker pod
kubectl -n ma-system exec deployment/michelangelo-worker -- \
  nc -zv temporal.your-domain.com 7233
```

**Likely causes**:

- `workflow-engine.host` has the wrong hostname or port (Temporal's default is `7233`)
- The Temporal domain (e.g. `uniflow` or `default`) has not been created — create it with the Temporal CLI or admin tools, as sketched below
- Network policy in `ma-system` is blocking egress to the Temporal endpoint
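
A hedged remediation sketch for the missing-domain cause, using the Temporal CLI (`tctl`); the endpoint and namespace name mirror the placeholders above — substitute your configured values. Cadence deployments use `cadence --domain <name> domain register` instead:

```bash
# List the namespaces Temporal knows about
tctl --address temporal.your-domain.com:7233 namespace list

# Register the missing namespace
tctl --address temporal.your-domain.com:7233 --namespace uniflow namespace register
```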

---

## InferenceServer not becoming healthy

**Symptoms**: An InferenceServer resource is created but stays in a non-Ready state. The Deployment controller cannot deploy models to it because the server is not healthy.

**Diagnostics**:

```bash
# Check InferenceServer status and conditions
kubectl get inferenceservers
kubectl describe inferenceserver <name>

# Check the underlying Kubernetes Deployment
kubectl get deployment -l app=<inferenceserver-name>
kubectl describe deployment <inferenceserver-deployment>

# Check model-sync sidecar logs
kubectl logs <inferenceserver-pod-name> -c model-sync
```

**Likely causes**:

- The backend type is not registered in the controller manager — check controller manager logs for `unknown backend type`
- The inference server container image cannot be pulled
- The model-sync sidecar cannot connect to S3 to download models (see [S3 errors](#s3--object-store-errors) below)
- Insufficient GPU resources on the node — check `kubectl describe node` for allocatable GPU count, or use the one-liner below
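
For the GPU cause, a one-liner that tabulates allocatable GPUs per node — this assumes NVIDIA GPUs advertised as the `nvidia.com/gpu` extended resource; other vendors use different resource names:

```bash
# <none> means the device plugin on that node is not advertising GPUs
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```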

---

## Model not loading (Deployment stuck in Asset Preparation)

**Symptoms**: A Deployment resource is created but remains in the `AssetPreparation` or `ResourceAcquisition` stage indefinitely.

**Diagnostics**:

```bash
# Check Deployment status
kubectl get deployments.michelangelo.api
kubectl describe deployment.michelangelo.api <name>

# Check model-sync sidecar for download errors
kubectl logs <inferenceserver-pod> -c model-sync

# Verify the model config ConfigMap was created
kubectl get configmap <inferenceserver-name>-model-config -o yaml
```

**Likely causes**:

- The model artifact is not at the expected S3 path — verify the registered model's `artifactUri` matches what is actually in S3 (see the sketch below)
- S3 credentials in the inference pod do not have `s3:GetObject` permission on the model bucket
- The inference server has reached its maximum number of loaded models — check the serving framework's capacity limits
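
To check the artifact-path cause, list what actually exists under the registered URI. A sketch with placeholder bucket, prefix, and endpoint — fill these in from the model's `artifactUri` and your storage config:

```bash
# No output means nothing was ever uploaded under that prefix
aws s3 ls s3://<bucket>/<artifact-prefix>/ --recursive \
  --endpoint-url <endpoint-url>
```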

---

## S3 / object store errors

**Symptoms**: Jobs fail with access denied or endpoint unreachable errors. Model downloads fail in the model-sync sidecar.

**Diagnostics**:

```bash
# Check controller manager storage config
kubectl -n ma-system get configmap michelangelo-controllermgr-config -o yaml \
  | grep -A5 "minio:"

# Test S3 access from a worker pod
kubectl -n ma-system exec deployment/michelangelo-worker -- \
  aws s3 ls s3://your-bucket/ --endpoint-url http://your-minio-endpoint

# Check for IAM role annotation on the relevant ServiceAccount
kubectl -n ma-system get serviceaccount michelangelo-controllermgr -o yaml \
  | grep -i iam
```

**Likely causes**:

- `useIam: true` is set but the pod's ServiceAccount does not have an IAM role annotation, so no credentials are injected (see the check below)
- `awsEndpointUrl` is missing the URL scheme (`http://` or `https://`) or has the wrong port
- The S3 bucket does not exist or is in a different region than `awsRegion` specifies
- Pod-level network policy is blocking outbound traffic to the S3 endpoint
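
A more targeted version of the IAM check above, assuming EKS-style IRSA where credentials are injected via an `eks.amazonaws.com/role-arn` annotation (other clouds and vault-based setups use different mechanisms):

```bash
# Empty output while useIam is true means no credentials will be injected
kubectl -n ma-system get serviceaccount michelangelo-controllermgr \
  -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}'; echo
```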

---

## UI not loading or API calls failing

**Symptoms**: The Michelangelo UI shows a blank page, a CORS error in the browser console, or API calls return 502/504.

**Diagnostics**:

```bash
# Check Envoy and UI pod status
kubectl get pods | grep -E "envoy|ui|apiserver"
kubectl logs deployment/michelangelo-ui

# Check Envoy configuration
kubectl get configmap envoy-config -o yaml
```

**Likely causes**:

- `apiBaseUrl` in the UI's `config.json` does not match the actual Envoy ingress hostname — they must match exactly (see the comparison sketch below)
- The Envoy cluster's `socket_address.address` for `michelangelo-apiserver` is wrong — it must be the Kubernetes service name for the API server within the cluster
- CORS allowed origins in the Envoy config do not include the origin from which users are accessing the UI
- The Ingress resource for the UI or API server is misconfigured (wrong hostname, missing TLS secret)
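
To compare the two values from the first cause, read the `config.json` the UI pod is actually serving and the hostname the Ingress exposes. The static-asset path below is an assumption about the UI image — locate the real file inside the pod if it differs:

```bash
# The apiBaseUrl the browser will be told to use (path is a guess)
kubectl exec deployment/michelangelo-ui -- cat /usr/share/nginx/html/config.json

# The hostname users actually reach the UI and API through
kubectl get ingress
```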
