User/jocelynb/upstream to aurelien by JocelynBerrendonner · Pull Request #460 · microsoft/kata-containers

JocelynBerrendonner · 2026-06-15T18:27:47Z

Merge Checklist

Followed patch format from upstream recommendation: https://github.com/kata-containers/community/blob/main/CONTRIBUTING.md#patch-format
Included a single commit in a given PR - at least unless there are related commits and each makes sense as a change on its own.
Merged using "create a merge commit" rather than "squash and merge" (or similar)
genPolicy only: Builds on Windows
genPolicy only: Updated sample YAMLs' policy annotations, if applicable

Summary

Bring the runtime-rs OpenVMM backend up to working end-to-end VFIO cold-plug on
HGX A100 / H100 baseboards (8 GPUs + 6 NVSwitches), plus the in-process VMM
cgroup handling it needs to boot under kubelet memory limits. Five
commits, grouped by topic so each one stands on its own for review and
bisect:

kata-types: add openvmm to TopologyConfigInfo allowlist
Without this, every VFIO attach to an openvmm sandbox failed with
"VFIO device requires a PCIe topology but none was provided" because the
hardcoded hypervisor allowlist in TopologyConfigInfo::new() silently
returned None for openvmm.
runtime-rs/openvmm: cold-plug VFIO devices with correct guest PciPath
The existing cold-plug arm only matched the legacy DeviceType::Vfio,
not the DeviceType::VfioModern(Arc<Mutex<…>>) that the real producers
(prepare_coldplug_cdi_devices and prepare_coldplug_raw_vfio_devices)
actually emit, so every real workload silently dropped its VFIO devices.
The new arm also writes config.guest_pci_path back onto the device,
matching what the CH backend has been doing all along — without this the
container-create step fails with "VFIO device has no guest PCI path
assigned" even though the VM is up.
runtime-rs/openvmm: PCIe root port layout, slot field, MMIO sizing and ACS fixes
Five inter-related root-complex topology fixes that only show up at scale
(8 GPUs + 6 NVSwitches):
- Enlarge the PCIe low_mmio window from 320 MB to 640 MB; the old window
  ran out during PCI resource assignment ("low_mmio MMIO exhaustion: need 0x14500000 bytes, have 0x14000000").
- Rebalance static / block-hotplug / cold-plug port counts to fit in
  PCI's 5-bit slot field (was 5+24+16=45, now 5+4+23=32).
- Pack root ports across PCI functions to match what OpenVMM actually
  programs into the guest's config space (port i → device i/8,
  function i%8); extend PciSlot with a function number and add an
  openvmm_port_pci_path helper used by both block-hotplug and VFIO
  cold-plug.
- Fix the two remaining PciSlot tuple-struct callsites (topology.rs,
  swap.rs) after the named-field refactor.
- Advertise ACS capabilities (0x5f = SV|TB|RR|CR|UF|DT) on every
  PcieRootPortConfig. Without this, the in-guest NVIDIA driver silently
  declines GPU peer-to-peer DMA: fabricmanager never programs the
  NVSwitch fabric, nvidia-smi nvlink --status reports every link
  inactive, nvidia-smi topo -m shows PHB everywhere, and NCCL falls
  back to slow PCIe-over-CPU paths. (Reported by John Starks.)
runtime-rs/openvmm: exit shim when guest halts on its own
OpenVMM is in-process so there is no child to reap; the mesh
Receiver<HaltReason> was stored as _notify_recv and never read, so
guest-initiated power-off / panic / OOM-killed-init left the shim hung
on pod teardown. Spawn a halt_watcher task that forwards terminal
halt reasons to exit_notify.
runtime-rs/openvmm: default sandbox_cgroup_only=false for in-process VMM
- Flip the default in the generated openvmm TOML to false, matching the
  default of the corresponding Rust field. With the previous (true) default the runtime PID
  lives in the pod cgroup; OpenVMM's shmem-backed guest RAM is then
  accounted to container.limits.memory and the kernel OOM-killer
  fires during VM construction (misc mshv: pNN: Failed to populate memory region: -12, Killed process … shmem-rss:8472576kB),
  producing FailedCreatePodSandBox.
- Teach CgroupsResourceInner::setup_after_start_vm to recognise
  in-process VMMs (those whose pid list is just process::id()) so the
  new default doesn't trip the existing "hypervisor cannot be moved to
  sandbox cgroup" fatal — that check is only meaningful for
  out-of-process VMMs whose vCPU threads belong to a distinct child.
Documented trade-off: container limits.cpu no longer constrains vCPU
host CPU usage for in-process VMMs when sandbox_cgroup_only=false,
because vCPU threads share a cgroup with the unconstrained runtime.
Operators that need both vCPU limits AND large guest RAM should run
an out-of-process VMM (clh). Dragonball is unchanged (force-pinned to
sandbox_cgroup_only=true in cgroups/mod.rs). CLH default is left
alone — it's affected in principle but most CLH deployments use small
guest RAM, so a separate change can flip it after broader discussion.

Associated issues

N/A

Links to CVEs

N/A

Test Methodology

End-to-end on an HGX A100 8-GPU / 6-NVSwitch L1 VH test bench:

Built runtime-rs (openvmm feature) and the kata containerd shim from
this branch; deployed via the standard kata-deploy flow.
Launched a GPU pod with all 8 GPUs + 6 NVSwitches passed through via
raw OCI VFIO (ctr --device /dev/vfio/N) and via CDI (kubelet
device-plugin path); both producers now reach Running with the
agent receiving correct vfio-pci device_options
(HOST_BDF=GUEST_PCI_PATH).
Confirmed in-guest:
- lspci shows all 14 endpoints, each behind its own root port packed
  into multi-function device slots as expected.
- nvidia-smi nvlink --status reports all links active; fabricmanager
  programs the NVSwitch fabric.
- nvidia-smi topo -m shows NV12 between GPU pairs (not PHB).
Cgroup behaviour verified with systemd-cgls / /proc/<shim-pid>/cgroup:
shim lives in /kata_overhead/<sid>, RSS climbs to ~27 GB for a 24 GiB
guest with no OOM, where the previous default OOM-killed the shim in
~20 s on a 10 GiB guest under an 8 GiB pod memory limit.
All pod teardown paths (kubectl delete pod, in-guest shutdown -h now,
induced guest panic) cleanly exit the shim; no more hung containerd-shim
processes.
No code changes outside the openvmm path; CH / Dragonball / qemu / fc
configs and behaviour are unchanged. Existing CH cold-plug pods on the
same bench continue to work.
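
The cgroup placement check described above can be scripted. This is a minimal sketch, assuming the cgroupfs naming reported in this PR (`/kata_overhead/<sid>`); the function name `in_kata_overhead` is hypothetical, and the systemd slice naming is not handled.

```rust
/// Returns true if any entry in the text of /proc/<pid>/cgroup places the
/// process under a `kata_overhead` cgroup (cgroupfs naming, as reported in
/// this PR). Each line has the form `hierarchy-id:controllers:path`.
/// Hypothetical helper for illustration; not part of runtime-rs.
fn in_kata_overhead(proc_cgroup: &str) -> bool {
    proc_cgroup
        .lines()
        // The path is the third colon-separated field on each line.
        .filter_map(|line| line.splitn(3, ':').nth(2))
        .any(|path| path.trim_start_matches('/').starts_with("kata_overhead/"))
}
```

The input would typically come from `std::fs::read_to_string(format!("/proc/{}/cgroup", shim_pid))`, where `shim_pid` is the containerd-shim PID under test.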

`TopologyConfigInfo::new()` in `src/libs/kata-types/src/config/hypervisor/mod.rs`
gates PCIe topology construction on a hardcoded allowlist of hypervisor names.
Hypervisors not in the list silently get `None`, which is then propagated into
`DeviceManager::pcie_topology`, which then causes every VFIO device attach to
fail with:

    Caused by:
        0: set up device before start vm
        1: do handle vfio device failed.
        2: failed to add device
        3: VFIO device requires a PCIe topology but none was provided

(thrown from `VfioDeviceModernHandle::attach` in
`src/runtime-rs/crates/hypervisor/src/device/driver/vfio_device/device.rs`.)

`openvmm` was missing from the allowlist, so any VFIO device assignment to an
openvmm sandbox would fail at sandbox-create time. Add it so the generic
runtime-rs device manager produces a real `PCIeTopology` for openvmm too. The
openvmm hypervisor backend already knows how to honour the topology's
`pcie_root_port` count when laying out the in-process machine.

Signed-off-by: Jocelyn Berrendonner <jocelynb@microsoft.com>
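
The failure mode is easy to reproduce in miniature. The sketch below is an illustration only: `TopologyInfo`, `TOPOLOGY_HYPERVISORS` and `topology_for` are hypothetical stand-ins, and the list contents other than `openvmm` are assumptions, not the real allowlist in kata-types.

```rust
/// Hypothetical stand-in for the topology info; the real struct carries
/// more fields than this.
#[derive(Debug, PartialEq)]
struct TopologyInfo {
    hypervisor_name: String,
}

/// Illustrative allowlist. Only the need for "openvmm" to be a member is
/// taken from this PR; the other entries are assumptions.
const TOPOLOGY_HYPERVISORS: &[&str] = &["qemu", "cloud-hypervisor", "openvmm"];

/// Builds topology info only for allowlisted hypervisors. Anything else
/// gets `None` with no error, which is what made the omission hard to spot.
fn topology_for(hypervisor_name: &str) -> Option<TopologyInfo> {
    if !TOPOLOGY_HYPERVISORS.contains(&hypervisor_name) {
        return None;
    }
    Some(TopologyInfo {
        hypervisor_name: hypervisor_name.to_string(),
    })
}
```

With a gate of this shape, the missing entry only surfaces later, when a VFIO attach finds no topology to allocate from.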

The openvmm cold-plug branch in inner_hypervisor.rs::start_vm did two things
wrong, both invisible at VM-boot time but fatal at container create:

1. It only matched DeviceType::Vfio (the legacy struct from
   device/driver/vfio.rs), but the actual producers --
   prepare_coldplug_cdi_devices() (kubelet CDI grants) and
   prepare_coldplug_raw_vfio_devices() (`ctr --device /dev/vfio/N`) --
   emit ResourceConfig::VfioDeviceModern, which becomes
   DeviceType::VfioModern(Arc<Mutex<VfioDeviceModern>>). So the existing arm
   never fired for any real workload, and devices fell into the `other` warn
   branch and were silently dropped.

2. Even if a producer somewhere did emit the legacy variant, neither arm wrote
   `config.guest_pci_path` back onto the device. The CH driver in
   ch/inner_device.rs does this assignment as the last step of its cold-plug;
   openvmm was missing it. With `guest_pci_path = None`, the
   container-create-time call to handler_devices() in resource::manager_inner
   fails with "VFIO device has no guest PCI path assigned" even though the VM
   is up and the BAR space is mapped.

Add a new DeviceType::VfioModern arm that:

* locks the Arc<Mutex<VfioDeviceModern>>,
* iterates `vfio_device.device.devices` (or the primary if the devices vec is
  empty),
* reserves the next pre-allocated vfio<N> root port for each PCI function,
* opens the /dev/vfio/<group> fd and pushes a PcieDeviceConfig with the
  segment-qualified BDF (e.g. "0001:00:00.0", as produced by
  BdfAddress::Display),
* computes the guest PciPath [root_slot, 0] -- matching openvmm's root-bus
  slot layout (STATIC + BLOCK_HOTPLUG + port_index) and the two-slot PciPath
  convention block hotplug already uses,
* writes that PciPath back onto `config.guest_pci_path` for the IOMMU-group
  primary (handler_devices exposes one device per VfioDeviceModern, matching
  the primary).

The legacy DeviceType::Vfio arm is left in place as a no-op safety net. It is
unreachable today but the cost of keeping it is small, and removing it would
change the public Vfio-handling surface of the openvmm shim for no functional
gain.

End-to-end effect: NVIDIA GPU + NVSwitch passthrough pods now reach the
container create step with the agent receiving correct vfio-pci device_options
("HOST_BDF=GUEST_PCI_PATH"), matching what the CH back-end has been doing all
along.

Build fixes folded in:

* Re-export DeviceAddress from crate::device::driver::vfio_device (the `core`
  submodule is private; only re-exports from mod.rs are public).
* Lock the Arc<Mutex<VfioDeviceModern>> directly instead of going through a
  non-existent `.inner` field -- DeviceType::VfioModern carries the
  Arc<Mutex<...>> directly (see device/mod.rs:63), not the
  VfioDeviceModernHandle newtype, matching what ch/inner_device.rs already
  does in its own cold-plug branch.

Signed-off-by: Jocelyn Berrendonner <jocelynb@microsoft.com>
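
The shape of the new arm can be sketched with simplified types. Everything below is a hypothetical reduction for illustration: the real `DeviceType`, `VfioDeviceModern` and `PciPath` carry far more state, and `cold_plug` and `first_vfio_slot` are invented names. It shows only the two points the commit makes: match the `Arc<Mutex<…>>` variant, and write the guest path back onto the device.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical simplified guest PCI path: a list of slot numbers.
#[derive(Debug, Clone, PartialEq)]
struct PciPath(Vec<u8>);

/// Hypothetical reduction of the modern VFIO device config.
#[derive(Debug, Default)]
struct VfioModern {
    guest_pci_path: Option<PciPath>,
}

/// Placeholder for the legacy VFIO struct.
struct LegacyVfio;

enum DeviceType {
    Vfio(LegacyVfio),
    VfioModern(Arc<Mutex<VfioModern>>),
    Other,
}

/// Assigns each modern VFIO device the next root-bus slot and records the
/// path on the device. Returns how many devices were cold-plugged.
fn cold_plug(devices: &[DeviceType], first_vfio_slot: u8) -> usize {
    let mut next_port: u8 = 0;
    for dev in devices {
        match dev {
            DeviceType::VfioModern(shared) => {
                // Lock the Arc<Mutex<..>> carried by the variant directly.
                let mut cfg = shared.lock().unwrap();
                // Two-element path [root_slot, 0], written back so the
                // container-create step can find it later.
                cfg.guest_pci_path = Some(PciPath(vec![first_vfio_slot + next_port, 0]));
                next_port += 1;
            }
            // Legacy arm kept as a no-op safety net.
            DeviceType::Vfio(_) => {}
            DeviceType::Other => {}
        }
    }
    usize::from(next_port)
}
```

Because the device sits behind an `Arc<Mutex<…>>`, the write is visible to every other holder of the handle, which is what lets the later container-create step read the path without another lookup.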

…d ACS fixes

A bundle of related fixes to the OpenVMM PCIe root-complex topology that
runtime-rs hands to OpenVMM. Pre-fix, an HGX A100 8-GPU + 6-NVSwitch baseboard
(14 VFIO devices) either could not allocate enough MMIO, overflowed the 5-bit
PCI slot field, programmed the wrong PciPath into the guest, or silently
disabled GPU peer-to-peer DMA. This commit addresses all of those.

1. Enlarge PCIe low_mmio window to 640 MB
-----------------------------------------

The openvmm PCIe root complex was configured with a 320 MB non-prefetchable
MMIO window (0xc000_0000..0xd400_0000). Cold-plugging 8 GPUs + 6 NVSwitches on
the test bench overruns it and the worker aborts before VM boot:

    PCI resource assignment failed low_mmio MMIO exhaustion: need 0x14500000 bytes, have 0x14000000

The need comes mostly from bridge MMIO32 windows that the root ports
pre-reserve (8x 16 MB GPU BAR0 + 6x 32 MB NVSwitch BAR0 = 320 MB on their own,
plus bridge alignment padding for every static / block-hotplug /
vfio-coldplug port). 64-bit prefetchable BARs (H100 BAR1/2 = 128 GB,
BAR3/4 = 32 MB) go into high_mmio and do not count.

Widen the window to 640 MB (0xc000_0000..0xe800_0000) for ~2x headroom; still
fits before ECAM. high_mmio is untouched -- its ~125 TB range continues to
absorb the 1 TB+ of prefetchable BARs an 8-GPU pod brings.

2. Fit PCIe root ports within 5-bit slot field
----------------------------------------------

PCI slot numbers are 5 bits (0..0x1f = 0..31). The OpenVMM shim assigns one
slot per root port on the PCIe root bus, so the total of static +
block-hotplug + VFIO cold-plug ports must be <= 32. The previous layout
(5 + 24 + 16 = 45) overflowed and triggered 'Failed to parse PCI path from QOM
path '24/00' Caused by: PCI slot 36 should be in range [0..0x1f]' in
kata-agent the moment the shim tried to cold-plug the 4th VFIO device on an
HGX A100 8-GPU baseboard.

Rebalance to (5 + 4 + 23 = 32) and document the budget on the constants so the
next person adding a new port class cannot accidentally overflow PCI again.

3. Pack PCIe root ports across PCI functions
--------------------------------------------

OpenVMM packs PCIe root ports into multi-function device slots: port `i` sits
at `device = i/8, function = i%8` on bus 0 of the root complex (see
microsoft/openvmm
`vm/devices/pci/pcie/src/root.rs::GenericPcieRootComplex::new`, where
`first_port_device_number = 0` on the x86_64 no-IOMMU path the kata runtime
uses).

The previous kata-shim layout assigned single-function slots:
`root_slot = STATIC + BLOCK_HOTPLUG + port_index`. That:

a. Overflowed PCI's 5-bit slot field once a few VFIO ports were allocated on
   top of the 24-slot block-hotplug pool.
b. Disagreed with what OpenVMM actually programs into the guest's PCIe config
   space, so slots below 32 only worked by coincidence.

Extend PciSlot with a function number (defaulting to 0; wire format stays `xx`
for function 0 and `xx.f` for multi-function; kata-agent's
`SlotFn::from_str` already accepts both). Add an `openvmm_port_pci_path`
helper that mirrors OpenVMM's packing, and use it from both the block-hotplug
allocator and the VFIO cold-plug code path. This restores the original
24 block / 32 VFIO budgets and unlocks OpenVMM's full 256-port-per-complex
capacity.

4. PciSlot callsite fixups after tuple -> named-field refactor
--------------------------------------------------------------

PciSlot was refactored from a tuple struct to a named-field struct in change
kata-containers#3. Update the two remaining callsites that still used tuple
syntax:

* topology.rs (static PCIe-topology allocator) -- use `PciSlot::new` instead
  of `PciSlot(v as u8)`.
* swap.rs (AddSwap path) -- send the device number alone via the `device()`
  accessor; the swap RPC's wire format predates multi-function and
  kata-agent's do_add_swap hardcodes function=0, so the agent expectation is
  unchanged.

5. Advertise ACS capabilities on every PCIe root port
-----------------------------------------------------

Set `acs_capabilities_supported: Some(0x5f)` on every `PcieRootPortConfig` we
hand to OpenVMM (5 static + N block-hotplug + M VFIO cold-plug).
0x5f = SV | TB | RR | CR | UF | DT, i.e. the standard PCIe ACS bitmask for
downstream ports (everything except EC).

Previously kata left this as `None`, which OpenVMM treats as "do not
synthesize the ACS Extended Capability at all" (see microsoft/openvmm
`vm/devices/pci/pcie/src/port.rs::PcieDownstreamPort::new` where
`acs_supported` is only added to the cap list when non-zero).

Without ACS bits visible on the upstream root port, the in-guest NVIDIA
driver's P2P-enablement code concludes that peer-to-peer DMA between assigned
PCI devices is not safely supported and silently declines to enable it. On an
HGX A100 8-GPU + 6-NVSwitch baseboard the symptom is:

* fabricmanager never programs the NVSwitch fabric
* `nvidia-smi nvlink --status` reports every link inactive
* `nvidia-smi topo -m` shows PHB (PCIe host-bridge) everywhere instead of
  NV12 between any GPU pair
* NCCL falls back to slow PCIe-over-CPU paths

OpenVMM's own CLI parser defaults to 0x5f when `--pcie-root-port` is passed
without an explicit `acs=` value, so this just brings the kata-driven path in
line with the documented default.

Reported by John Starks.

Signed-off-by: Jocelyn Berrendonner <jocelynb@microsoft.com>
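
The arithmetic behind sections 2, 3 and 5 can be checked with a few lines of Rust. This is a sketch under stated assumptions: the constant and function names are hypothetical and are not the ones used in runtime-rs or OpenVMM. The ACS bit positions follow the PCIe ACS Capability Register layout, and the packing rule is the `device = i/8, function = i%8` formula quoted above.

```rust
/// PCI device (slot) numbers are 5 bits wide, so a bus has 32 slots.
const MAX_PCI_SLOTS: u32 = 32;
/// A PCI device exposes up to 8 functions.
const FUNCTIONS_PER_DEVICE: u32 = 8;

// ACS capability bits, by position in the ACS Capability Register.
const ACS_SV: u16 = 1 << 0; // Source Validation
const ACS_TB: u16 = 1 << 1; // Translation Blocking
const ACS_RR: u16 = 1 << 2; // P2P Request Redirect
const ACS_CR: u16 = 1 << 3; // P2P Completion Redirect
const ACS_UF: u16 = 1 << 4; // Upstream Forwarding
const ACS_EC: u16 = 1 << 5; // P2P Egress Control (not advertised)
const ACS_DT: u16 = 1 << 6; // Direct Translated P2P

/// Every ACS bit except Egress Control.
fn acs_mask() -> u16 {
    ACS_SV | ACS_TB | ACS_RR | ACS_CR | ACS_UF | ACS_DT
}

/// True if one slot per port fits on the root bus (single-function layout).
fn fits_single_function_slots(static_ports: u32, block_hotplug: u32, vfio: u32) -> bool {
    static_ports + block_hotplug + vfio <= MAX_PCI_SLOTS
}

/// Multi-function packing: port `i` maps to (device i/8, function i%8).
/// Returns None once the device number no longer fits in 5 bits.
fn port_device_function(port_index: u32) -> Option<(u8, u8)> {
    let device = port_index / FUNCTIONS_PER_DEVICE;
    let function = port_index % FUNCTIONS_PER_DEVICE;
    if device >= MAX_PCI_SLOTS {
        None
    } else {
        Some((device as u8, function as u8))
    }
}
```

Under these definitions `acs_mask()` evaluates to 0x5f. The old 5 + 24 + 16 layout fails the single-function budget check while 5 + 4 + 23 passes. Packing 8 functions into each of 32 device numbers gives the 256-port capacity mentioned in section 3.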

The OpenVMM backend embeds the VmWorker in-process, so unlike the qemu/CH
backends there is no child process for the shim to reap. `OpenVmm::wait_vm()`
blocks on an `exit_notify` mpsc channel that is only ever signalled from
`VmmInstance::stop()` -- i.e. only when something externally calls
`stop_vm()`. The mesh `Receiver<HaltReason>` that the VmWorker uses to publish
halt events was stored as `_notify_recv` and never read.

As a result, when the guest powered itself off (agent-initiated shutdown,
kernel panic, triple fault, OOM-killed init, ...) the VmWorker emitted
`HaltReason::PowerOff` into a dead-letter channel, `wait_vm()` kept waiting
forever and the containerd-shim hung on pod teardown.

Replace the unread `_notify_recv` field with a `halt_watcher` tokio task
spawned during `launch()`. The watcher consumes halt notifications and fires
`exit_notify` on any terminal reason. `HaltReason::Reset` is ignored because
the worker is configured with `automatic_guest_reset = true` and handles
resets internally; the explicit match is defensive in case that ever changes.

`stop()` aborts the watcher before its own `try_send(0)` so the two paths do
not race. Double-signaling would be harmless anyway -- `exit_notify` is a
capacity-1 channel and `OpenVmm::wait_vm()` caches the first received exit
code -- but aborting keeps the logs clean.

OpenVMM itself has no `--exit-on-halt` flag; its REPL only logs "guest halted"
and stays alive waiting for an interactive `q`. This fix is the equivalent for
the embedded VmWorker.
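
The forwarding pattern can be sketched with the standard library alone. This is not the commit's code: it swaps the tokio task and mesh channel for a plain thread and `std::sync::mpsc`, the `HaltReason` enum is a hypothetical subset, and the exit code 0 sent on halt is an assumption borrowed from the `try_send(0)` that `stop()` uses.

```rust
use std::sync::mpsc::{Receiver, SyncSender};
use std::thread::{self, JoinHandle};

/// Hypothetical subset of halt reasons, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum HaltReason {
    PowerOff,
    Reset,
    TripleFault,
}

/// Spawns a watcher that drains halt notifications and signals
/// `exit_notify` on the first terminal reason. Resets are skipped because
/// the worker is assumed to handle them internally.
fn spawn_halt_watcher(
    halt_rx: Receiver<HaltReason>,
    exit_notify: SyncSender<i32>,
) -> JoinHandle<()> {
    thread::spawn(move || {
        for reason in halt_rx {
            match reason {
                HaltReason::Reset => continue,
                _terminal => {
                    // try_send never blocks; if the capacity-1 channel is
                    // already full, the exit was signalled by someone else.
                    let _ = exit_notify.try_send(0);
                    return;
                }
            }
        }
    })
}
```

Using a bounded channel with `try_send` mirrors the capacity-1 behaviour described above: a second signal is dropped rather than blocking the watcher.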

Two changes folded into one: flip the OpenVMM default and teach the cgroup
setup code to recognise in-process VMMs so the new default does not regress
sandbox creation.

1. Default sandbox_cgroup_only=false in the OpenVMM TOML
--------------------------------------------------------

When sandbox_cgroup_only=true the runtime-rs shim PID lives inside the pod
cgroup, which is bounded by container.limits.memory. For OpenVMM this is
fatal: OpenVMM runs in-process inside the shim and allocates guest RAM as
shmem mappings owned by that PID. With any non-trivial guest RAM size the
kernel OOM-killer fires during VM construction:

    misc mshv: pNN: Failed to populate memory region: -12
    Memory cgroup out of memory: Killed process NNN (containerd-shim) shmem-rss:8472576kB

The runtime then exits and the pod sandbox creation fails with
"ttrpc: closed" / FailedCreatePodSandBox.

The cgroup separation that would solve this is already implemented:
crates/resource/src/cgroups/resource_inner.rs::new_cgroup_managers moves the
runtime PID to an unconstrained kata-overhead cgroup (systemd:
kata-overhead.slice:runtime-rs:<sid>; cgroupfs: kata_overhead/<path>) when
sandbox_cgroup_only=false. That is also the default of the corresponding Rust
field (libs/kata-types/src/config/runtime.rs::sandbox_cgroup_only), so we are
simply matching the language default in the generated TOML.

Container processes still get sized correctly because they continue to be
placed in the pod cgroup as usual; only the runtime+VMM thread group moves to
the overhead cgroup.

Verified on an HGX A100 8-GPU / 6-NVSwitch L1 VH:

- Before: limits.memory=8Gi -> openvmm mem_size=10GiB -> cgroup OOM in ~20s;
  every sandbox-creation retry repeats the kill cycle.
- After: limits.memory=24Gi -> shim cgroup is /kata_overhead/<sid>; shim RSS
  climbs to ~27GB (24GB guest RAM + VMM overhead) with no OOM; VM
  construction, VFIO IOMMU map, and per-GPU MSI-X setup all complete; pod
  reaches Running.

Dragonball is intentionally not touched: it is force-pinned to
sandbox_cgroup_only=true in crates/resource/src/cgroups/mod.rs because the
Dragonball VMM shares its address space with the runtime process and uses a
different memory accounting model.

CLH is also out of scope for this commit. CLH execs the VMM as a child
process, so it suffers the same cgroup inheritance issue in principle but is
far less acute in practice (most CLH deployments use small guest RAM). A
separate change can flip the CLH default after broader discussion.

2. Allow sandbox_cgroup_only=false for in-process VMM
-----------------------------------------------------

After the default flip above, pod sandbox creation fails with:

    failed to handle message start sandbox in task handler
    Caused by:
        0: setup device after start vm
        1: setup cgroups after start vm
        2: hypervisor cannot be moved to sandbox cgroup

The error originates in `CgroupsResourceInner::setup_after_start_vm` which
expects every VMM with an overhead cgroup to report at least one vCPU thread
ID. The pre-existing comment correctly explains the intent: when an overhead
cgroup exists, vCPU threads must be moved to the sandbox cgroup so container
resource limits apply to them; an empty `VcpuThreadIds` after VM start would
otherwise let vCPUs run unbounded in the overhead cgroup.

That intent is sound for out-of-process VMMs (clh, qemu, fc), whose vCPU
threads belong to a distinct child process that can be moved without affecting
the runtime. It does not work for in-process VMMs (OpenVMM, and Dragonball
when opted into sandbox_cgroup_only=false): the runtime IS the VMM, the vCPU
threads are runtime threads, and moving them would drag the runtime (including
the in-process VMM holding gigabytes of guest RAM as shmem) into the sandbox
cgroup. The OpenVMM driver therefore returns an empty `VcpuThreadIds` from
`get_thread_ids` by design (see openvmm/inner_hypervisor.rs:792), the same
convention Dragonball already follows.

Distinguish the two cases by asking the hypervisor for its PID list:
in-process VMMs report only `process::id()` (the runtime's own PID), while
out-of-process VMMs report a distinct child. Only out-of-process VMMs warrant
the fatal error; in-process VMMs intentionally leave the runtime (and its vCPU
threads) in the overhead cgroup, with the sandbox cgroup materialising later
when container processes are added.

Trade-off documented for in-process VMMs: container `limits.cpu` no longer
constrains guest vCPU host CPU usage when sandbox_cgroup_only=false, because
the vCPU threads share a cgroup with the unconstrained runtime. This is the
price of allowing large guest RAM under container memory limits, and matches
the behaviour the user opts into by setting sandbox_cgroup_only=false.
Operators that need both vCPU limits AND large guest RAM should run an
out-of-process VMM (clh).

Verified on the same HGX A100 8-GPU / 6-NVSwitch bench as change
kata-containers#1: with both changes in place, sandbox creation completes
(shim in /kata_overhead/<sid>, ~27GB RSS for 24GB guest RAM, no
FailedCreatePodSandBox events).

Signed-off-by: Jocelyn Berrendonner <jocelynb@microsoft.com>
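
The in-process test described in section 2 reduces to a small predicate. The sketch below is illustrative only: `is_in_process_vmm` and `check_vcpu_threads` are hypothetical names, the signatures are simplified to plain slices, and the real `setup_after_start_vm` does more than this check.

```rust
/// An in-process VMM reports exactly one PID: the runtime's own.
fn is_in_process_vmm(vmm_pids: &[u32], runtime_pid: u32) -> bool {
    matches!(vmm_pids, [pid] if *pid == runtime_pid)
}

/// Simplified version of the post-start check. With an overhead cgroup in
/// place, an empty vCPU thread list is fatal only for out-of-process VMMs.
fn check_vcpu_threads(
    vmm_pids: &[u32],
    vcpu_thread_ids: &[u32],
    has_overhead_cgroup: bool,
) -> Result<(), String> {
    if !has_overhead_cgroup || !vcpu_thread_ids.is_empty() {
        return Ok(());
    }
    if is_in_process_vmm(vmm_pids, std::process::id()) {
        // The runtime is the VMM: leave it and its vCPU threads in the
        // overhead cgroup on purpose.
        return Ok(());
    }
    Err("hypervisor cannot be moved to sandbox cgroup".to_string())
}
```

Keying the decision on the PID list rather than on the hypervisor name means any in-process backend gets the same treatment without a second allowlist.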

JocelynBerrendonner added 5 commits June 15, 2026 11:19

sprt approved these changes Jun 17, 2026

sprt self-requested a review June 17, 2026 21:46

JocelynBerrendonner wants to merge 5 commits into microsoft:sprt/openvmm from JocelynBerrendonner:user/jocelynb/upstream-to-aurelien
