From 0c674f8df7cf91139c393ea42674d30ad9bdbd56 Mon Sep 17 00:00:00 2001 From: Evan Lezar Date: Tue, 19 May 2026 15:07:39 +0200 Subject: [PATCH] docs(agents): add Docker GPU CDI debug hints Signed-off-by: Evan Lezar --- .../skills/debug-openshell-cluster/SKILL.md | 29 +++++++++++++++++++ 1 file changed, 29 insertions(+) diff --git a/.agents/skills/debug-openshell-cluster/SKILL.md b/.agents/skills/debug-openshell-cluster/SKILL.md index 177c52f31..6c9f09233 100644 --- a/.agents/skills/debug-openshell-cluster/SKILL.md +++ b/.agents/skills/debug-openshell-cluster/SKILL.md @@ -67,6 +67,33 @@ docker run --rm --entrypoint /openshell-sandbox "${OPENSHELL_DOCKER_SUPERVISOR_I openshell status ``` +For Docker GPU failures, check CDI support and NVIDIA CDI discovery separately: + +```bash +docker info --format '{{json .CDISpecDirs}}' +docker info --format '{{json .DiscoveredDevices}}' +for dir in /etc/cdi /var/run/cdi; do + if [ -d "$dir" ]; then + find "$dir" -maxdepth 1 -type f \( -name '*.yaml' -o -name '*.json' \) -print + else + echo "$dir missing" + fi +done +systemctl is-enabled nvidia-cdi-refresh.service nvidia-cdi-refresh.path || true +systemctl is-active nvidia-cdi-refresh.service nvidia-cdi-refresh.path || true +systemctl status nvidia-cdi-refresh.service nvidia-cdi-refresh.path --no-pager --lines=50 +journalctl -u nvidia-cdi-refresh.service --no-pager --lines=100 +``` + +When the NVIDIA Container Toolkit CDI refresh units are not enabled or no NVIDIA CDI spec has been generated, enable them and trigger a refresh: + +```bash +sudo systemctl enable --now nvidia-cdi-refresh.path +sudo systemctl enable --now nvidia-cdi-refresh.service +sudo systemctl restart nvidia-cdi-refresh.service +docker info --format '{{json .DiscoveredDevices}}' +``` + Common findings: - Docker daemon unavailable: start Docker Desktop or Docker Engine. @@ -75,6 +102,7 @@ Common findings: - Docker driver cannot initialize because it cannot find `openshell-sandbox`: verify `OPENSHELL_DOCKER_SUPERVISOR_BIN`, the sibling binary next to `openshell-gateway`, or the configured supervisor image contains `/openshell-sandbox`. - Sandbox never registers: check gateway logs and supervisor callback endpoint. - Supervisor image exits before printing `openshell-sandbox --version`: the image should be the scratch supervisor image from `deploy/docker/Dockerfile.supervisor` and must contain a static executable at `/openshell-sandbox`. +- `mise run e2e:docker:gpu` fails with `docker info --format json did not report any discovered NVIDIA CDI GPU devices`: Docker may report `CDISpecDirs` while still having no generated NVIDIA CDI specs. Verify `.DiscoveredDevices` contains entries such as `nvidia.com/gpu=all`, verify `/etc/cdi` or `/var/run/cdi` contains a generated NVIDIA spec, and check that `nvidia-cdi-refresh.service` and `nvidia-cdi-refresh.path` from NVIDIA Container Toolkit are enabled and healthy. The service is a one-shot unit, so `inactive (dead)` can be normal after a successful run; use `systemctl status` and `journalctl` to distinguish success from a skipped or failed refresh. NVIDIA recommends enabling the path and service units, and restarting `nvidia-cdi-refresh.service` to regenerate missing or stale CDI specs. If specs are generated but Docker still reports no discovered devices, restart Docker or reload the daemon and re-check `docker info`. For source checkout development, restart the local gateway with: @@ -200,6 +228,7 @@ openshell logs | `openshell status` fails | Gateway endpoint unreachable or auth mismatch | `openshell gateway info`, gateway logs | | Gateway starts but sandbox create fails | Compute driver cannot reach runtime | Docker/Podman/Kubernetes/VM driver logs | | Docker or Podman sandbox never registers | Wrong callback endpoint or supervisor startup failure | Gateway logs and sandbox container logs | +| Docker GPU e2e fails before GPU sandbox comparison | NVIDIA CDI specs are missing or Docker has not discovered them | `docker info --format '{{json .DiscoveredDevices}}'`, `/etc/cdi`, `/var/run/cdi`, `nvidia-cdi-refresh.service` | | Kubernetes gateway pod pending | PVC unbound, taint, selector, or insufficient resources | `kubectl -n openshell describe pod ` | | Kubernetes gateway pod crash loops | Missing secret, bad DB URL, bad TLS config | `kubectl -n openshell logs statefulset/openshell` | | CLI TLS error | Local mTLS bundle does not match server cert/CA | Check `~/.config/openshell/gateways//mtls/` |