
[BUG] nvidia-device-plugin fails with NVIDIA drivers due to missing versioned libnvidia-gpucomp.so in WSL mount #274

@dj-vandijk

nvidia-device-plugin fails with NVIDIA drivers >= 595.x due to missing versioned libnvidia-gpucomp.so in WSL mount

Environment

  • Windows 11 25H2 (build 26200.7840)
  • AKS-EE 1.11.247.0 (Linux node)
  • NVIDIA driver 595.79 (595.54 internal)
  • nvidia-device-plugin 0.18.2

Problem

The nvidia-device-plugin daemonset pod fails to start with:

nvidia-container-cli: mount error: lstat failed:
/var/.eflow/custom-configs/usr/lib/wsl/lib/libnvidia-gpucomp.so.595.54:
no such file or directory

Root cause

NVIDIA drivers >= 595.x introduce libnvidia-gpucomp.so in the WSL library mount (C:\Windows\System32\lxss\lib). This file is exposed in the AKS-EE Linux node at /usr/lib/wsl/lib/libnvidia-gpucomp.so without a version number in the filename. However, the ELF SONAME embedded in the binary is libnvidia-gpucomp.so.595.54 with a driver-specific version number.

When ldconfig scans /usr/lib/wsl/lib at boot, it registers the library as:

libnvidia-gpucomp.so.595.54 => /usr/lib/wsl/lib/libnvidia-gpucomp.so.595.54

The nvidia-container-runtime then attempts to bind-mount that versioned path into the container — but that file does not exist. All other libraries in lxss\lib (e.g. libnvidia-ml.so.1, libnvidia-encode.so.1) have a fixed version suffix in their filename that matches their SONAME, so they do not have this problem.

This worked correctly with NVIDIA driver 572.83, which does not include libnvidia-gpucomp.so at all.
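
To check for this condition on the Linux node, one option is to list ldconfig cache entries whose target file is missing. The following is a minimal Python sketch, assuming Python 3 is available on the node (not verified for AKS-EE). The function name find_dangling_entries is illustrative, and the parser expects text in the format printed by `ldconfig -p`:

```python
import os
import re

# Matches lines of `ldconfig -p` output such as:
#   libnvidia-ml.so.1 (libc6,x86-64) => /usr/lib/wsl/lib/libnvidia-ml.so.1
_ENTRY = re.compile(r"^\s*(\S+)\s+\(.*?\)\s+=>\s+(\S+)\s*$")


def find_dangling_entries(ldconfig_output, exists=os.path.exists):
    """Return (soname, path) pairs whose registered path is not on disk.

    `exists` is injectable so the check can be exercised without the
    real filesystem.
    """
    dangling = []
    for line in ldconfig_output.splitlines():
        m = _ENTRY.match(line)
        if m and not exists(m.group(2)):
            dangling.append((m.group(1), m.group(2)))
    return dangling
```

Feeding it the captured output of `ldconfig -p` on an affected node should report libnvidia-gpucomp.so.595.54 as registered but missing, while entries such as libnvidia-ml.so.1 are not reported.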

Workaround

Copy the library with the versioned filename into lxss\lib on Windows (as Administrator):

Copy-Item "C:\Windows\System32\lxss\lib\libnvidia-gpucomp.so" `
          "C:\Windows\System32\lxss\lib\libnvidia-gpucomp.so.595.54"

Then restart the nvidia-device-plugin pod. No AKS-EE restart required.

Note: this workaround must be reapplied after every NVIDIA driver upgrade, as the version suffix changes with each driver version.
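
Because the suffix changes with each driver, the copy step could be scripted so the version is not hard-coded. The following is a rough Python sketch, assuming Python 3 is installed on the Windows host (an assumption, not part of the setup above). The helper names embedded_soname and ensure_versioned_copy are hypothetical. It finds the versioned name by searching the binary for the first string of the form libnvidia-gpucomp.so.<version>, which is a heuristic rather than a proper read of the ELF dynamic section:

```python
import re
import shutil
from pathlib import Path

# Heuristic pattern: the versioned name as it appears inside the binary,
# e.g. b"libnvidia-gpucomp.so.595.54".
SONAME_RE = re.compile(rb"libnvidia-gpucomp\.so\.\d+(?:\.\d+)*")


def embedded_soname(lib_path):
    """Return the first versioned name found in the file, or None."""
    data = Path(lib_path).read_bytes()
    m = SONAME_RE.search(data)
    return m.group(0).decode("ascii") if m else None


def ensure_versioned_copy(lib_dir, base_name="libnvidia-gpucomp.so"):
    """Copy the unversioned library to its versioned name if it is missing.

    Returns the versioned path, or None when no versioned name was found.
    """
    src = Path(lib_dir) / base_name
    soname = embedded_soname(src)
    if soname is None:
        return None
    dst = Path(lib_dir) / soname
    if not dst.exists():
        shutil.copy2(src, dst)
    return dst
```

Run from an elevated prompt, a call such as ensure_versioned_copy(r"C:\Windows\System32\lxss\lib") would stand in for the manual Copy-Item step. The nvidia-device-plugin pod still needs to be restarted afterwards.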

Expected behavior

Either:

  1. The WSL mount should expose libnvidia-gpucomp.so under a versioned filename that matches its SONAME (a fix that would have to come from NVIDIA/Microsoft), or
  2. AKS-EE should handle the case where a versioned library path registered by ldconfig does not exist as a physical file, and fall back to the unversioned .so
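
To illustrate what the fallback in option 2 could look like, here is a small Python sketch of the path resolution only. It is not how AKS-EE or nvidia-container-cli is implemented, and resolve_library_path is a hypothetical name:

```python
import os
import posixpath


def resolve_library_path(registered_path, exists=os.path.exists):
    """Return a path that exists for a library registered by ldconfig.

    If the registered (versioned) path is missing, try the unversioned
    name, e.g. libfoo.so.1.2 -> libfoo.so. Returns None if neither exists.
    """
    if exists(registered_path):
        return registered_path
    directory, name = posixpath.split(registered_path)
    base, sep, _version = name.partition(".so.")
    if sep:
        fallback = posixpath.join(directory, base + ".so")
        if exists(fallback):
            return fallback
    return None
```

With this approach the unversioned file would be used as the mount source, while the destination inside the container would presumably keep the versioned name so that it still matches the SONAME.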
