# Troubleshooting

This guide covers common failure scenarios with diagnostic steps and likely causes. All commands assume `kubectl` is configured for the control plane cluster unless otherwise noted.

---

## Jobs not being scheduled

**Symptoms**: Jobs are submitted but remain in a pending state with no cluster assignment.

**Diagnostics**:
```bash
# Check for scheduler errors in the controller manager
kubectl -n ma-system logs deployment/michelangelo-controllermgr | grep -i "scheduler\|assign\|enqueue"

# List registered compute clusters and their status
kubectl -n ma-system get clusters

# Inspect a specific cluster's status conditions
kubectl -n ma-system describe cluster <cluster-name>
```

**Likely causes**:
- No compute clusters are registered — complete the [cluster registration steps](jobs/register-a-compute-cluster-to-michelangelo-control-plane.md)
- The Cluster resource has the wrong `host` or `port` — the control plane cannot reach the compute cluster's API server
- The `ray-manager` token Secret in the control plane is missing or has expired
- The job requested resources (GPU, CPU) that no registered cluster can satisfy — quick checks for the last two causes are sketched below
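
If you suspect either of the last two causes, the sketch below may help. It is a hedged sketch: it assumes the compute cluster's kubeconfig context is available locally, that the registration ServiceAccount is named `ray-manager`, and that GPUs are exposed as `nvidia.com/gpu` — adjust these names to your deployment.

```bash
# Mint a fresh token for the ray-manager ServiceAccount on the compute
# cluster, to copy into the control plane Secret (requires kubectl >= 1.24;
# the ServiceAccount name is an assumption)
kubectl --context <compute-context> create token ray-manager --duration=24h

# List each node's allocatable GPUs to confirm the job's resource request
# can be satisfied somewhere (the GPU resource name is an assumption)
kubectl --context <compute-context> get nodes \
  -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
```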

---

## Compute cluster registration failures

**Symptoms**: The Cluster resource is created, but the cluster status shows unhealthy or unknown.

**Diagnostics**:
```bash
# Inspect cluster conditions
kubectl -n ma-system describe cluster <cluster-name>

# Verify the token and CA secrets exist in the control plane
kubectl -n default get secret cluster-<cluster-name>-client-token
kubectl -n default get secret cluster-<cluster-name>-ca-data

# Confirm the token secret is populated (output should be > 0)
kubectl -n default get secret cluster-<cluster-name>-client-token \
  -o jsonpath='{.data.token}' | wc -c

# Test network connectivity from the control plane to the compute API server
kubectl -n ma-system run connectivity-test --rm -it --restart=Never \
  --image=curlimages/curl -- curl -k https://<compute-host>:<port>/healthz
```

**Likely causes**:
- Network policy or firewall is blocking the control plane from reaching the compute cluster's API server
- The token Secret is missing the `token` key or was not populated (check the `kubernetes.io/service-account.name` annotation on the Secret) — a recovery sketch follows this list
- The CA data does not match the compute cluster's TLS certificate
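
If the token Secret never populated, one recovery path is to recreate it as a long-lived service-account token on the compute cluster. This is a hedged sketch: the Secret name, ServiceAccount name, and namespace are assumptions.

```bash
# Recreate a long-lived token Secret bound to the registration
# ServiceAccount on the compute cluster; Kubernetes populates the `token`
# key automatically once the annotation matches an existing ServiceAccount
kubectl --context <compute-context> apply -f - <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: ray-manager-token    # assumed name
  namespace: default         # assumed namespace
  annotations:
    kubernetes.io/service-account.name: ray-manager
type: kubernetes.io/service-account-token
EOF

# For a suspected CA mismatch: print the compute cluster's root CA
# (kube-root-ca.crt exists in every namespace on modern clusters) and
# compare it with the control plane's cluster-<cluster-name>-ca-data Secret
kubectl --context <compute-context> get configmap kube-root-ca.crt \
  -o jsonpath='{.data.ca\.crt}'
```

Once the Secret is populated, copy the `token` value back into the control plane's `cluster-<cluster-name>-client-token` Secret.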

---

## Ray pods not starting on the compute cluster

**Symptoms**: A RayCluster or RayJob resource is created on the compute cluster, but head or worker pods remain Pending or enter CrashLoopBackOff.

**Diagnostics**:
```bash
# Check on the compute cluster (use its kubectl context)
kubectl --context <compute-context> get rayclusters,rayjobs
kubectl --context <compute-context> describe raycluster <name>

# List pods for the cluster
kubectl --context <compute-context> get pods -l ray.io/cluster=<cluster-name>

# Check head pod logs
kubectl --context <compute-context> logs <head-pod-name>

# Verify storage config is present on the compute cluster
kubectl --context <compute-context> get configmap michelangelo-config
kubectl --context <compute-context> get secret aws-credentials
```

**Likely causes**:
- The `michelangelo-config` ConfigMap is missing or has the wrong `AWS_ENDPOINT_URL`
- The container image cannot be pulled (wrong registry, missing imagePullSecret) — see the event sketch below
- Insufficient CPU or memory quota on the compute cluster — check `kubectl --context <compute-context> describe nodes`
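
Both image-pull and quota problems surface as pod events. A quick triage sketch (the pod name is a placeholder):

```bash
# Dump events for the stuck pod in timestamp order — look for
# ImagePullBackOff, FailedScheduling, or quota-exceeded messages
kubectl --context <compute-context> get events \
  --field-selector involvedObject.name=<head-pod-name> \
  --sort-by=.lastTimestamp
```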

---

## Worker cannot connect to the API server

**Symptoms**: Worker pods crash-loop or restart repeatedly. Logs show connection refused, TLS errors, or authentication failures when connecting to the API server.

**Diagnostics**:
```bash
# Check recent worker logs
kubectl -n ma-system logs deployment/michelangelo-worker --tail=100

# Verify the worker's configured API server address
kubectl -n ma-system get configmap michelangelo-worker-config -o yaml | grep -A3 "worker:"

# Confirm the API server deployment is running
kubectl -n ma-system get deployment michelangelo-apiserver
kubectl -n ma-system get pods -l app=michelangelo-apiserver
```

**Likely causes**:
- `worker.address` in the worker ConfigMap points to the wrong hostname or port — it must resolve to the API server from within the `ma-system` namespace (a reachability sketch follows this list)
- `worker.useTLS: true` is set but the API server's certificate is not trusted — ensure the CA bundle is mounted into the worker pod
- The API server is not yet ready (check its pod status and readiness probe)
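
To test the configured address from the worker's own network context, a hedged sketch — it assumes the worker image ships `nc`, and the service name and port below are placeholders:

```bash
# Probe the API server address exactly as the worker would resolve it;
# "connection refused" here points at the address/port, not TLS or auth
kubectl -n ma-system exec deployment/michelangelo-worker -- \
  nc -zv michelangelo-apiserver.ma-system.svc.cluster.local 8080
```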

---

## Temporal / Cadence connectivity issues

**Symptoms**: Workflows fail to start. Worker logs contain errors like `failed to connect to temporal`, `context deadline exceeded`, or `domain not found`.

**Diagnostics**:
```bash
# Check worker logs for workflow engine errors
kubectl -n ma-system logs deployment/michelangelo-worker | grep -i "temporal\|cadence\|workflow"

# Inspect the configured workflow engine endpoint
kubectl -n ma-system get configmap michelangelo-worker-config -o yaml \
  | grep -A8 "workflow-engine:"

# Test TCP connectivity to Temporal from a worker pod
kubectl -n ma-system exec deployment/michelangelo-worker -- \
  nc -zv temporal.your-domain.com 7233
```

**Likely causes**:
- `workflow-engine.host` has the wrong hostname or port (the Temporal default is `7233`)
- The workflow domain/namespace (e.g. `uniflow` or `default`) has not been created — create it with the Temporal or Cadence CLI (see the sketch below)
- Network policy in `ma-system` is blocking egress to the Temporal endpoint
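
One way to register the namespace, using Temporal's legacy `tctl` tool — the namespace name, address, and retention are assumptions, and flag spellings can vary across tctl versions:

```bash
# Register the uniflow namespace against the Temporal frontend
# (retention is in days; adjust to your policy)
tctl --address temporal.your-domain.com:7233 --namespace uniflow \
  namespace register --retention 3
```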

---

## InferenceServer not becoming healthy

**Symptoms**: An InferenceServer resource is created but stays in a non-Ready state. The Deployment controller cannot deploy models to it because the server is not healthy.

**Diagnostics**:
```bash
# Check InferenceServer status and conditions
kubectl get inferenceservers
kubectl describe inferenceserver <name>

# Check the underlying Kubernetes Deployment
kubectl get deployment -l app=<inferenceserver-name>
kubectl describe deployment <inferenceserver-deployment>

# Check model-sync sidecar logs
kubectl logs <inferenceserver-pod-name> -c model-sync
```

**Likely causes**:
- The backend type is not registered in the controller manager — check controller manager logs for `unknown backend type` (the sketch below greps for this)
- The inference server container image cannot be pulled
- The model-sync sidecar cannot connect to S3 to download models (see [S3 errors](#s3--object-store-errors) below)
- Insufficient GPU resources on the node — check `kubectl describe node` for allocatable GPU count
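
Two quick checks for the first and last causes (the node name is a placeholder):

```bash
# Surface backend registration errors in the controller manager
kubectl -n ma-system logs deployment/michelangelo-controllermgr \
  | grep -i "unknown backend type"

# Compare allocatable GPUs against what is already requested on the node
kubectl describe node <node-name> | grep -A8 "Allocated resources"
```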

---

## Model not loading (Deployment stuck in Asset Preparation)

**Symptoms**: A Deployment resource is created but remains in the `AssetPreparation` or `ResourceAcquisition` stage indefinitely.

**Diagnostics**:
```bash
# Check Deployment status
kubectl get deployments.michelangelo.api
kubectl describe deployment.michelangelo.api <name>

# Check model-sync sidecar for download errors
kubectl logs <inferenceserver-pod> -c model-sync

# Verify the model config ConfigMap was created
kubectl get configmap <inferenceserver-name>-model-config -o yaml
```

**Likely causes**:
- The model artifact is not at the expected S3 path — verify that the registered model's `artifactUri` matches what is actually in S3 (the listing sketch below confirms this)
- S3 credentials in the inference pod do not have `s3:GetObject` permission on the model bucket
- The inference server has reached its maximum number of loaded models — check the serving framework's capacity limits
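
To confirm the artifact exists where the control plane thinks it does — the bucket, prefix, and endpoint below are placeholders:

```bash
# List the artifact path the registered model points at; an empty listing
# means the artifactUri and the actual S3 layout disagree
aws s3 ls s3://your-bucket/<artifact-prefix>/ --recursive \
  --endpoint-url http://your-minio-endpoint
```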

---

## S3 / object store errors

**Symptoms**: Jobs fail with access-denied or endpoint-unreachable errors. Model downloads fail in the model-sync sidecar.

**Diagnostics**:
```bash
# Check controller manager storage config
kubectl -n ma-system get configmap michelangelo-controllermgr-config -o yaml \
  | grep -A5 "minio:"

# Test S3 access from a worker pod
kubectl -n ma-system exec deployment/michelangelo-worker -- \
  aws s3 ls s3://your-bucket/ --endpoint-url http://your-minio-endpoint

# Check for IAM role annotation on the relevant ServiceAccount
kubectl -n ma-system get serviceaccount michelangelo-controllermgr -o yaml \
  | grep -i iam
```

**Likely causes**:
- `useIam: true` is set but the pod's ServiceAccount has no IAM role annotation, so no credentials are injected (an annotation sketch follows this list)
- `awsEndpointUrl` is missing the URL scheme (`http://` or `https://`) or has the wrong port
- The S3 bucket does not exist or is in a different region than `awsRegion` specifies
- Pod-level network policy is blocking outbound traffic to the S3 endpoint
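
On EKS, `useIam` credentials typically flow through IRSA. A hedged sketch of attaching the role — the role ARN is a placeholder, and your platform may inject credentials differently:

```bash
# Annotate the ServiceAccount with an IAM role so pods using it receive
# web-identity credentials (EKS/IRSA; the ARN is a placeholder)
kubectl -n ma-system annotate serviceaccount michelangelo-controllermgr \
  eks.amazonaws.com/role-arn=arn:aws:iam::<account-id>:role/<role-name>

# Restart the deployment so new pods pick up the credentials
kubectl -n ma-system rollout restart deployment/michelangelo-controllermgr
```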

---

## UI not loading or API calls failing

**Symptoms**: The Michelangelo UI shows a blank page, a CORS error in the browser console, or API calls return 502/504.

**Diagnostics**:
```bash
# Check Envoy and UI pod status
kubectl get pods | grep -E "envoy|ui|apiserver"
kubectl logs deployment/michelangelo-ui

# Check Envoy configuration
kubectl get configmap envoy-config -o yaml
```

**Likely causes**:
- `apiBaseUrl` in the UI's `config.json` does not match the actual Envoy ingress hostname — they must match exactly (a quick check is sketched below)
- The Envoy cluster's `socket_address.address` for `michelangelo-apiserver` is wrong — it must be the Kubernetes service name for the API server within the cluster
- CORS allowed origins in the Envoy config do not include the origin from which users access the UI
- The Ingress resource for the UI or API server is misconfigured (wrong hostname, missing TLS secret)
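
To see the `apiBaseUrl` the browser actually receives, and the CORS headers Envoy returns for that origin — all hostnames and the request path below are placeholders:

```bash
# Fetch the UI's runtime config as the browser would and inspect the
# API base URL it will call
curl -s https://<ui-hostname>/config.json | grep apiBaseUrl

# Dump response headers for a request carrying the UI's origin; a missing
# access-control-allow-origin header indicates a CORS misconfiguration
curl -s -o /dev/null -D - -H "Origin: https://<ui-hostname>" \
  https://<api-hostname>/<api-path> | grep -i "access-control"
```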