Support Read-Only, systemd-less Systems #382

Open: Arc676 wants to merge 2 commits into NVIDIA:main from Arc676:talos

Arc676 commented May 19, 2026

Motivation

Informally, this PR adds (partial) support for Talos. Closes #356.

More formally, this PR adds support for systems with read-only filesystems and systems that do not run systemd.

Description

The MIG manager assumes that it will be able to copy the mig-parted binary to the host and use systemd to restart host-side GPU services. Neither of these is true for Talos, which is an immutable OS that doesn't run systemd. Proper Talos support would introduce a dependency on the Talos API, but that is beyond the scope of this PR and likely falls beyond the scope of what this tool should support.

This PR adds support for systems like Talos by introducing two new flags (both of which are required for Talos):

  1. Identifying the host as read-only: prevent the manager from attempting to copy data to the host
  2. Flagging the absence of systemd: tell the manager to skip all systemd operations that would otherwise cause the program to hang, since there would be no response on DBus
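As a rough illustration of how two such switches could gate the work, here is a minimal Go sketch. The `Options` type, its field names, and the step descriptions are hypothetical and are not taken from the patch; they only mirror the behaviour described in the list above.

```go
package main

import "fmt"

// Options holds two hypothetical switches mirroring the flags described above.
// The names are illustrative and are not the identifiers used in the patch.
type Options struct {
	HostReadOnly       bool // skip anything that writes to the host filesystem
	SystemdUnavailable bool // skip anything that talks to systemd over DBus
}

// planSteps returns the steps a reconfiguration would run for the given options.
// The step descriptions are placeholders, not the manager's real call sequence.
func planSteps(o Options) []string {
	var steps []string
	if !o.HostReadOnly {
		steps = append(steps, "copy mig-parted binary to host")
	}
	if !o.SystemdUnavailable {
		steps = append(steps, "stop host GPU services via systemd")
	}
	// The MIG work itself is independent of both flags.
	steps = append(steps, "apply MIG configuration")
	if !o.SystemdUnavailable {
		steps = append(steps, "restart host GPU services via systemd")
	}
	return steps
}

func main() {
	// With both flags set (the Talos case), only the MIG step remains.
	for _, s := range planSteps(Options{HostReadOnly: true, SystemdUnavailable: true}) {
		fmt.Println(s)
	}
}
```

Keeping the two switches independent means a host that is read-only but does run systemd, or the reverse, can set only the one that applies.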

This PR includes nil-checks for the systemd manager that were not present before. In the original code, these checks are effectively unnecessary because this member is always initialized and the entire program blocks on this initialization if systemd is not present.
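The nil-check pattern could look roughly like the following Go sketch. The `serviceManager` interface, the `reconfigurer` type, and the unit name are assumptions made for illustration; the real manager's types and method names may differ.

```go
package main

import (
	"errors"
	"fmt"
)

// serviceManager is a hypothetical stand-in for the systemd manager member.
type serviceManager interface {
	Restart(unit string) error
}

// reconfigurer leaves systemd nil when systemd is flagged as absent.
type reconfigurer struct {
	systemd serviceManager
}

// restartHostService reports whether a restart was actually performed.
// With a nil manager it returns (false, nil) instead of dereferencing it.
func (r *reconfigurer) restartHostService(unit string) (bool, error) {
	if r.systemd == nil {
		return false, nil // nothing to do on a systemd-less host
	}
	if err := r.systemd.Restart(unit); err != nil {
		return false, fmt.Errorf("restarting %s: %w", unit, err)
	}
	return true, nil
}

// fakeManager lets the sketch run without a real DBus connection.
type fakeManager struct{ fail bool }

func (f fakeManager) Restart(unit string) error {
	if f.fail {
		return errors.New("unit not found")
	}
	return nil
}

func main() {
	noSystemd := &reconfigurer{} // systemd member left nil
	done, err := noSystemd.restartHostService("example.service")
	fmt.Println(done, err)

	withSystemd := &reconfigurer{systemd: fakeManager{}}
	done, err = withSystemd.restartHostService("example.service")
	fmt.Println(done, err)
}
```

One caveat with this shape: the check only works if the field is left unset. Assigning a typed nil pointer to the interface field would make `r.systemd == nil` false.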

Improvements

This is the simplest possible solution to the problem described in the linked issue. All the MIG- and GPU-related operations work fine (see footnote 1) on Talos. We simply need to skip over the parts that can't work on Talos. The obvious alternatives or improvements over this PR are:

  1. Specifically catching the "read-only FS" error when attempting to copy the binary instead of requiring a flag to skip the operation entirely
  2. Detecting the presence or absence of systemd, either by inspecting the running processes or by introducing a timeout on the DBus connection, and adjusting accordingly, instead of requiring a flag
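Both alternatives can be sketched in Go without committing to a particular DBus library. In the sketch below, `isReadOnlyFS` checks for the standard `EROFS` errno, and `withTimeout` bounds an arbitrary blocking call. `simulatedCopy`, the path it mentions, and the chosen durations are hypothetical stand-ins; the real copy and connection code would replace them.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"os"
	"syscall"
	"time"
)

// isReadOnlyFS reports whether err wraps the "read-only file system" errno.
func isReadOnlyFS(err error) bool {
	return errors.Is(err, syscall.EROFS)
}

// simulatedCopy stands in for the real host copy so the sketch can run anywhere.
// The path is a placeholder, not one the manager actually uses.
func simulatedCopy(readOnly bool) error {
	if readOnly {
		return &os.PathError{Op: "open", Path: "/hypothetical/host/path", Err: syscall.EROFS}
	}
	return nil
}

// withTimeout runs connect in a goroutine and stops waiting after d.
// It unblocks the caller only; a connect call that never returns stays parked.
func withTimeout(d time.Duration, connect func() error) error {
	ctx, cancel := context.WithTimeout(context.Background(), d)
	defer cancel()
	done := make(chan error, 1) // buffered so a late result does not block the goroutine
	go func() { done <- connect() }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	if err := simulatedCopy(true); isReadOnlyFS(err) {
		fmt.Println("host is read-only, skipping copy:", err)
	}

	err := withTimeout(50*time.Millisecond, func() error {
		time.Sleep(time.Second) // pretend the bus never answers
		return nil
	})
	if errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("no response in time, treating systemd as absent")
	}
}
```

The trade-off against explicit flags is that auto-detection has to pick a timeout, and a slow but present systemd could be misclassified as absent.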

Caveats

This PR exists more for discussion than with the goal of being merged. These changes were made based on a very cursory reading and superficial understanding of the MIG manager. There is likely a cleaner and more elegant way to achieve this. However, I'll submit the patch as a proof-of-concept: by disabling the host-copies and all systemd features, the MIG manager works properly on Talos. This is, at least for us, an important starting point.

Footnotes

  1. CUDA validation yields ERROR: init 250 result=11s. I haven't yet figured out what this means, but so far it hasn't impacted the use of the GPU. The GPU workloads still run fine, as does the CUDA validation pod.

copy-pr-bot Bot commented May 19, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Arc676 added 2 commits May 19, 2026 16:12

  • Add flag to skip all host-mutating operations
    Signed-off-by: Alessandro Vinciguerra <alessandro.vinciguerra@postfinance.ch>
  • Add flag to skip all systemd operations
    Signed-off-by: Alessandro Vinciguerra <alessandro.vinciguerra@postfinance.ch>

@linkages

Just want to add my feedback to this. I just tested this with the following setup:

Environment:

  • OS: Talos Linux v1.13.0
  • Kubernetes: v1.34.0
  • GPUs: NVIDIA B300 NVL and RTX 6000 Pro Server
  • NVIDIA driver/toolkit: provided by Talos system extensions (580.159.03)
  • Kernel Version: 6.18.29-talos
  • GPU Operator: Helm chart v26.3.1

I had to build a new k8s-mig-manager image from @Arc676's repo, which I pushed to docker.io/linkages/k8s-mig-manager:v0.14.1.

Then, when deploying the gpu-operator, I set the following values for the Helm chart:

driver:
  enabled: false

toolkit:
  enabled: false

hostPaths:
  driverInstallDir: /usr/local

mig:
  strategy: mixed

migManager:
  enabled: true
  repository: docker.io/linkages
  version: v0.14.1
  env:
    - name: READONLY_ROOTFS
      value: "true"
    - name: SYSTEMD_UNAVAILABLE
      value: "true"

operator:
  cleanupCRD: true
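For readers wondering how the two environment variables above might be consumed, here is a hedged Go sketch. It assumes the variables are parsed as booleans; the actual patch may wire them through a CLI flag library instead, and `envBoolFrom` is a hypothetical helper. The lookup function is injected so the parsing can be exercised without touching the process environment.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envBoolFrom parses a boolean setting obtained through lookup.
// An unset or empty variable yields def; an unparsable value is an error.
func envBoolFrom(lookup func(string) (string, bool), name string, def bool) (bool, error) {
	raw, ok := lookup(name)
	if !ok || raw == "" {
		return def, nil
	}
	v, err := strconv.ParseBool(raw)
	if err != nil {
		return def, fmt.Errorf("invalid boolean %q for %s: %w", raw, name, err)
	}
	return v, nil
}

func main() {
	// Variable names are taken from the values file above; defaults are assumed false.
	for _, name := range []string{"READONLY_ROOTFS", "SYSTEMD_UNAVAILABLE"} {
		v, err := envBoolFrom(os.LookupEnv, name, false)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			os.Exit(1)
		}
		fmt.Printf("%s=%t\n", name, v)
	}
}
```

The quotes around "true" in the values file matter: Kubernetes environment variable values are strings, so the boolean has to be parsed on the receiving side.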

I then set the nvidia.com/mig.config label on all nodes to all-balanced. The mig-manager did the right thing: it waited for all the operator components to stop, adjusted the MIG settings, and then restarted everything. Shortly afterwards, the gpu-feature-discovery controller set the correct labels on the nodes.

This was tested on 2 different types of nodes in the same cluster:

  • 2 x Lenovo nodes with 8 x NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs
  • 1 x DGX B300 with 8 x NVIDIA B300 SXM6 AC

Thank you @Arc676 for this patch. I hope it or a more elegant version of this gets pulled upstream. For now this solves my problem.

Development

Successfully merging this pull request may close these issues.

[Bug]: k8s-mig-manager v0.14.0 observes nvidia.com/mig.config label but does not apply geometry on Talos