nvidia-device-plugin fails with NVIDIA drivers >= 595.x due to missing versioned libnvidia-gpucomp.so in WSL mount
Environment
- Windows 11 25H2 (build 26200.7840)
- AKS-EE 1.11.247.0 (Linux node)
- NVIDIA driver 595.79 (595.54 internal)
- nvidia-device-plugin 0.18.2
Problem
The nvidia-device-plugin daemonset pod fails to start with:
nvidia-container-cli: mount error: lstat failed:
/var/.eflow/custom-configs/usr/lib/wsl/lib/libnvidia-gpucomp.so.595.54:
no such file or directory
Root cause
NVIDIA drivers >= 595.x introduce libnvidia-gpucomp.so in the WSL library mount (C:\Windows\System32\lxss\lib). This file is exposed in the AKS-EE Linux node at /usr/lib/wsl/lib/libnvidia-gpucomp.so — without a version number in the filename. However, the ELF SONAME embedded in the binary is libnvidia-gpucomp.so.595.54 — with a driver-specific version number.
When ldconfig scans /usr/lib/wsl/lib at boot, it registers the library as:
libnvidia-gpucomp.so.595.54 => /usr/lib/wsl/lib/libnvidia-gpucomp.so.595.54
The nvidia-container-runtime then attempts to bind-mount that versioned path into the container — but that file does not exist. All other libraries in lxss\lib (e.g. libnvidia-ml.so.1, libnvidia-encode.so.1) have a fixed version suffix in their filename that matches their SONAME, so they do not have this problem.
This worked correctly with NVIDIA driver 572.83, which does not include libnvidia-gpucomp.so at all.
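The mismatch described above can be checked on the Linux node by looking for ldconfig cache entries whose target file is missing. The following is a minimal sketch, not part of AKS-EE or the NVIDIA tooling: the function name `dangling_entries` is hypothetical, and it assumes the usual `ldconfig -p` line format (`soname (flags) => path`).

```python
import os
import re

# One `ldconfig -p` entry looks like:
#   "\tlibfoo.so.1 (libc6,x86-64) => /usr/lib/libfoo.so.1"
# The summary line at the top of the output has no " => " and is skipped.
ENTRY = re.compile(r"^\s*(\S+) \(.*\) => (\S+)$")


def dangling_entries(ldconfig_output, exists=os.path.exists):
    """Return (soname, path) pairs whose cached path is missing on disk.

    `exists` is injectable so the check can be exercised without a real node.
    """
    missing = []
    for line in ldconfig_output.splitlines():
        match = ENTRY.match(line)
        if match and not exists(match.group(2)):
            missing.append((match.group(1), match.group(2)))
    return missing
```

Passing the text printed by `ldconfig -p` on an affected node should, if the analysis above holds, list `libnvidia-gpucomp.so.595.54` and none of the other libraries in /usr/lib/wsl/lib.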
Workaround
Copy the library with the versioned filename into lxss\lib on Windows (as Administrator):
Copy-Item "C:\Windows\System32\lxss\lib\libnvidia-gpucomp.so" `
"C:\Windows\System32\lxss\lib\libnvidia-gpucomp.so.595.54"
Then restart the nvidia-device-plugin pod. No AKS-EE restart required.
Note: this workaround must be reapplied after every NVIDIA driver upgrade, as the version suffix changes with each driver version.
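Because the suffix changes with each driver, the copy command has to be rebuilt after every upgrade. As a rough sketch, the versioned filename can be taken from the error message itself. The helper name `copy_command_from_error` is hypothetical, and the code assumes the error keeps the "lstat failed: <path>: no such file" shape shown in this report.

```python
import re
from pathlib import PureWindowsPath

# Windows directory backing /usr/lib/wsl/lib, as stated in this report.
LXSS_LIB = PureWindowsPath(r"C:\Windows\System32\lxss\lib")


def copy_command_from_error(error_text):
    """Build the Copy-Item workaround command from an 'lstat failed' error."""
    # The error may be wrapped across several lines, so collapse whitespace.
    flat = " ".join(error_text.split())
    match = re.search(r"lstat failed: (\S+): no such file", flat)
    if match is None:
        raise ValueError("no 'lstat failed' path found in error text")
    # Last path component, e.g. libnvidia-gpucomp.so.595.54
    versioned = match.group(1).rsplit("/", 1)[-1]
    # Drop everything after ".so" to get the unversioned filename.
    unversioned = versioned[: versioned.index(".so") + len(".so")]
    return f'Copy-Item "{LXSS_LIB / unversioned}" "{LXSS_LIB / versioned}"'
```

The returned string is only the command text; it still has to be run in an elevated PowerShell session on the Windows host.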
Expected behavior
Either:
- The WSL mount should expose libnvidia-gpucomp.so under a versioned filename consistent with its SONAME (a fix on the NVIDIA/Microsoft side), or
- AKS-EE should handle the case where a versioned library path registered by ldconfig does not exist as a physical file, and fall back to the unversioned .so file.