User/jocelynb/upstream to aurelien#460
Open
JocelynBerrendonner wants to merge 5 commits into
`TopologyConfigInfo::new()` in `src/libs/kata-types/src/config/hypervisor/mod.rs`
gates PCIe topology construction on a hardcoded allowlist of hypervisor
names. Hypervisors not in the list silently get `None`, which is then
propagated into `DeviceManager::pcie_topology`, which then causes every
VFIO device attach to fail with:
Caused by:
0: set up device before start vm
1: do handle vfio device failed.
2: failed to add device
3: VFIO device requires a PCIe topology but none was provided
(thrown from `VfioDeviceModernHandle::attach` in
`src/runtime-rs/crates/hypervisor/src/device/driver/vfio_device/device.rs`.)
`openvmm` was missing from the allowlist, so any VFIO device assignment
to an openvmm sandbox would fail at sandbox-create time. Add it so the
generic runtime-rs device manager produces a real `PCIeTopology` for
openvmm too. The openvmm hypervisor backend already knows how to honour
the topology's `pcie_root_port` count when laying out the in-process
machine.
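The failure mode above can be sketched in a few lines. This is a minimal illustration, not the code in `TopologyConfigInfo::new()`: the type `PcieTopologySketch`, the functions `topology_for` and `attach_vfio`, and the allowlist entries other than `openvmm` are all assumptions made for the example.

```rust
/// Hypothetical stand-in for the real PCIe topology type.
#[derive(Debug, PartialEq)]
pub struct PcieTopologySketch {
    pub root_ports: u32,
}

/// Illustrative allowlist; the real list lives in `TopologyConfigInfo::new()`
/// and its exact entries are not reproduced here.
const TOPOLOGY_HYPERVISORS: &[&str] = &["qemu", "cloud-hypervisor", "openvmm"];

/// Returns a topology only for allowlisted hypervisors. Anything else gets
/// `None`, which is what later surfaces as the VFIO attach failure.
pub fn topology_for(hypervisor: &str, root_ports: u32) -> Option<PcieTopologySketch> {
    if TOPOLOGY_HYPERVISORS.contains(&hypervisor) {
        Some(PcieTopologySketch { root_ports })
    } else {
        None
    }
}

/// Mirrors the attach-time check: a missing topology is a hard error.
pub fn attach_vfio(topology: Option<&PcieTopologySketch>) -> Result<(), String> {
    topology
        .map(|_| ())
        .ok_or_else(|| "VFIO device requires a PCIe topology but none was provided".to_string())
}
```

The point of the sketch is that the `None` is produced far from where it is reported, which is why the missing allowlist entry only showed up at device-attach time.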
Signed-off-by: Jocelyn Berrendonner <jocelynb@microsoft.com>
The openvmm cold-plug branch in inner_hypervisor.rs::start_vm did two
things wrong, both invisible at VM-boot time but fatal at container
create:
1. It only matched DeviceType::Vfio (the legacy struct from
device/driver/vfio.rs), but the actual producers
-- prepare_coldplug_cdi_devices() (kubelet CDI grants) and
prepare_coldplug_raw_vfio_devices() (`ctr --device /dev/vfio/N`)
-- emit ResourceConfig::VfioDeviceModern, which becomes
DeviceType::VfioModern(Arc<Mutex<VfioDeviceModern>>). So the
existing arm never fired for any real workload, and devices fell
into the `other` warn branch and were silently dropped.
2. Even if a producer somewhere did emit the legacy variant, neither
arm wrote `config.guest_pci_path` back onto the device. The CH
driver in ch/inner_device.rs does this assignment as the last
step of its cold-plug; openvmm was missing it. With
`guest_pci_path = None`, the container-create-time call to
handler_devices() in resource::manager_inner fails with
"VFIO device has no guest PCI path assigned"
even though the VM is up and the BAR space is mapped.
Add a new DeviceType::VfioModern arm that:
* locks the Arc<Mutex<VfioDeviceModern>>,
* iterates `vfio_device.device.devices` (or the primary if the
devices vec is empty),
* reserves the next pre-allocated vfio<N> root port for each PCI
function,
* opens the /dev/vfio/<group> fd and pushes a PcieDeviceConfig with
the segment-qualified BDF (e.g. "0001:00:00.0", as produced by
BdfAddress::Display),
* computes the guest PciPath [root_slot, 0] -- matching openvmm's
root-bus slot layout (STATIC + BLOCK_HOTPLUG + port_index) and the
two-slot PciPath convention block hotplug already uses,
* writes that PciPath back onto `config.guest_pci_path` for the
IOMMU-group primary (handler_devices exposes one device per
VfioDeviceModern, matching the primary).
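A reduced sketch of the new arm, assuming heavily simplified types: `DeviceTypeSketch`, `VfioModernSketch`, the `(slot, function)` tuple standing in for PciPath, and the two port-count constants are illustrative only. The fd handling and per-function iteration described above are left out.

```rust
use std::sync::{Arc, Mutex};

/// Simplified stand-in for the runtime-rs device; field names are illustrative.
#[derive(Debug, Default)]
pub struct VfioModernSketch {
    pub host_bdf: String,
    pub guest_pci_path: Option<(u8, u8)>,
}

pub enum DeviceTypeSketch {
    /// Legacy variant: kept as a no-op arm.
    Vfio,
    /// Variant the real producers emit.
    VfioModern(Arc<Mutex<VfioModernSketch>>),
    Other,
}

// Illustrative port counts, not the values used by the shim.
const STATIC_PORTS: u8 = 5;
const BLOCK_HOTPLUG_PORTS: u8 = 4;

/// Cold-plugs every device and returns how many VFIO devices were wired up.
pub fn coldplug(devices: &[DeviceTypeSketch]) -> usize {
    let mut next_port: u8 = 0;
    for dev in devices {
        match dev {
            DeviceTypeSketch::VfioModern(handle) => {
                // Lock the Arc<Mutex<..>> directly; there is no wrapper field.
                let mut vfio = handle.lock().expect("vfio mutex poisoned");
                let root_slot = STATIC_PORTS + BLOCK_HOTPLUG_PORTS + next_port;
                // Write the guest path back so container-create can find it.
                vfio.guest_pci_path = Some((root_slot, 0));
                next_port += 1;
            }
            DeviceTypeSketch::Vfio => { /* unreachable today; kept as a safety net */ }
            DeviceTypeSketch::Other => { /* the real code logs a warning here */ }
        }
    }
    usize::from(next_port)
}
```

The write-back inside the lock is the step that was missing: without it the device object leaves `start_vm` with `guest_pci_path` still `None`.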
The legacy DeviceType::Vfio arm is left in place as a no-op safety
net. It is unreachable today but the cost of keeping it is small,
and removing it would change the public Vfio-handling surface of the
openvmm shim for no functional gain.
End-to-end effect: NVIDIA GPU + NVSwitch passthrough pods now reach
the container create step with the agent receiving correct vfio-pci
device_options ("HOST_BDF=GUEST_PCI_PATH"), matching what the CH
back-end has been doing all along.
Build fixes folded in:
* Re-export DeviceAddress from
crate::device::driver::vfio_device (the `core` submodule is
private; only re-exports from mod.rs are public).
* Lock the Arc<Mutex<VfioDeviceModern>> directly instead of going
through a non-existent `.inner` field -- DeviceType::VfioModern
carries the Arc<Mutex<...>> directly (see device/mod.rs:63), not
the VfioDeviceModernHandle newtype, matching what
ch/inner_device.rs already does in its own cold-plug branch.
Signed-off-by: Jocelyn Berrendonner <jocelynb@microsoft.com>
…d ACS fixes
A bundle of related fixes to the OpenVMM PCIe root-complex topology
that runtime-rs hands to OpenVMM. Pre-fix, an HGX A100 8-GPU + 6-NVSwitch
baseboard (14 VFIO devices) either could not allocate enough MMIO,
overflowed the 5-bit PCI slot field, programmed the wrong PciPath into
the guest, or silently disabled GPU peer-to-peer DMA. This commit
addresses all of those.
1. Enlarge PCIe low_mmio window to 640 MB
----------------------------------------
The openvmm PCIe root complex was configured with a 320 MB
non-prefetchable MMIO window (0xc000_0000..0xd400_0000). Cold-plugging
8 GPUs + 6 NVSwitches on the test bench overruns it and the worker
aborts before VM boot:
PCI resource assignment failed
low_mmio MMIO exhaustion: need 0x14500000 bytes, have 0x14000000
The need comes mostly from bridge MMIO32 windows that the root ports
pre-reserve (8x 16 MB GPU BAR0 + 6x 32 MB NVSwitch BAR0 = 320 MB on
their own, plus bridge alignment padding for every static / block-
hotplug / vfio-coldplug port).
64-bit prefetchable BARs (H100 BAR1/2 = 128 GB, BAR3/4 = 32 MB) go
into high_mmio and do not count.
Widen the window to 640 MB (0xc000_0000..0xe800_0000) for ~2x
headroom; still fits before ECAM. high_mmio is untouched -- its
~125 TB range continues to absorb the 1 TB+ of prefetchable BARs
an 8-GPU pod brings.
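The window arithmetic can be checked mechanically. The sketch below only restates the numbers quoted above; the constant and function names are made up for the example and do not correspond to anything in the shim.

```rust
const MIB: u64 = 1024 * 1024;

/// Old and new non-prefetchable windows, as half-open `[start, end)` ranges.
pub const OLD_LOW_MMIO: (u64, u64) = (0xc000_0000, 0xd400_0000);
pub const NEW_LOW_MMIO: (u64, u64) = (0xc000_0000, 0xe800_0000);

/// Size in bytes of a half-open window.
pub const fn window_len(w: (u64, u64)) -> u64 {
    w.1 - w.0
}

/// BAR0 demand from the message: 8 GPUs at 16 MB plus 6 NVSwitches at 32 MB.
pub const fn bar0_demand() -> u64 {
    8 * 16 * MIB + 6 * 32 * MIB
}

/// Whether `need` bytes fit in the window.
pub const fn fits(w: (u64, u64), need: u64) -> bool {
    need <= window_len(w)
}
```

With these definitions the BAR0 demand alone already equals the old 320 MB window, so any bridge alignment padding pushes it over; the reported need of 0x14500000 bytes fits only in the widened window.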
2. Fit PCIe root ports within 5-bit slot field
--------------------------------------------
PCI slot numbers are 5 bits (0..0x1f = 0..31). The OpenVMM shim
assigns one slot per root port on the PCIe root bus, so the total
of static + block-hotplug + VFIO cold-plug ports must be <= 32.
The previous layout (5 + 24 + 16 = 45) overflowed and triggered
'Failed to parse PCI path from QOM path '24/00'
Caused by: PCI slot 36 should be in range [0..0x1f]'
in kata-agent the moment the shim tried to cold-plug the 4th VFIO
device on an HGX A100 8-GPU baseboard.
Rebalance to (5 + 4 + 23 = 32) and document the budget on the
constants so the next person adding a new port class cannot
accidentally overflow PCI again.
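One way to make such a budget self-enforcing is a compile-time assertion. This is a sketch of the idea only: the constant names are hypothetical and the commit itself is described as documenting the budget on the constants, not necessarily asserting it.

```rust
/// PCI device (slot) numbers are 5 bits wide, so a bus has 32 slots.
pub const PCI_MAX_SLOTS: u32 = 1 << 5;

// Port classes; names are hypothetical, values are the rebalanced ones.
pub const STATIC_PORTS: u32 = 5;
pub const BLOCK_HOTPLUG_PORTS: u32 = 4;
pub const VFIO_COLDPLUG_PORTS: u32 = 23;

/// Total number of root-bus slots a layout consumes.
pub const fn total_ports(fixed: u32, block: u32, vfio: u32) -> u32 {
    fixed + block + vfio
}

// Compile-time guard: the build fails if a port class overflows the budget.
const _: () = assert!(
    total_ports(STATIC_PORTS, BLOCK_HOTPLUG_PORTS, VFIO_COLDPLUG_PORTS) <= PCI_MAX_SLOTS
);
```

Plugging the previous layout into `total_ports(5, 24, 16)` gives 45, which is the overflow described above.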
3. Pack PCIe root ports across PCI functions
------------------------------------------
OpenVMM packs PCIe root ports into multi-function device slots:
port `i` sits at `device = i/8, function = i%8` on bus 0 of the
root complex (see microsoft/openvmm
`vm/devices/pci/pcie/src/root.rs::GenericPcieRootComplex::new`,
where `first_port_device_number = 0` on the x86_64 no-IOMMU path
the kata runtime uses).
The previous kata-shim layout assigned single-function slots:
`root_slot = STATIC + BLOCK_HOTPLUG + port_index`. That:
a. Overflowed PCI's 5-bit slot field once a few VFIO ports were
allocated on top of the 24-slot block-hotplug pool.
b. Disagreed with what OpenVMM actually programs into the guest's
PCIe config space, so slots below 32 only worked by coincidence.
Extend PciSlot with a function number (defaulting to 0; wire format
stays `xx` for function 0 and `xx.f` for multi-function; kata-agent's
`SlotFn::from_str` already accepts both). Add an
`openvmm_port_pci_path` helper that mirrors OpenVMM's packing, and
use it from both the block-hotplug allocator and the VFIO cold-plug
code path. This restores the original 24 block / 32 VFIO budgets and
unlocks OpenVMM's full 256-port-per-complex capacity.
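The packing rule is small enough to sketch. The names `SlotFnSketch`, `port_to_slot` and `wire_format` are hypothetical stand-ins and not the `PciSlot` / `openvmm_port_pci_path` code itself; the use of two hex digits for the device number is likewise an assumption.

```rust
/// Ports packed per multi-function device: PCI functions are 3 bits (0..=7).
const FUNCTIONS_PER_DEVICE: u32 = 8;

/// Simplified stand-in for a slot that carries a function number.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SlotFnSketch {
    pub device: u8,
    pub function: u8,
}

/// Maps a root-port index to its device/function, or `None` past 256 ports.
pub fn port_to_slot(port_index: u32) -> Option<SlotFnSketch> {
    let device = port_index / FUNCTIONS_PER_DEVICE;
    if device > 0x1f {
        return None; // device number no longer fits in 5 bits
    }
    Some(SlotFnSketch {
        device: device as u8,
        function: (port_index % FUNCTIONS_PER_DEVICE) as u8,
    })
}

/// Wire format: `xx` for function 0, `xx.f` otherwise (hex digits assumed).
pub fn wire_format(slot: SlotFnSketch) -> String {
    if slot.function == 0 {
        format!("{:02x}", slot.device)
    } else {
        format!("{:02x}.{:x}", slot.device, slot.function)
    }
}
```

Under this rule 32 devices times 8 functions gives the 256-port capacity mentioned above, and port 13 lands at device 1, function 5 instead of consuming a slot of its own.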
4. PciSlot callsite fixups after tuple -> named-field refactor
------------------------------------------------------------
PciSlot was refactored from a tuple struct to a named-field struct
in change #3. Update the two remaining callsites that still used
tuple syntax:
* topology.rs (static PCIe-topology allocator) -- use
`PciSlot::new` instead of `PciSlot(v as u8)`.
* swap.rs (AddSwap path) -- send the device number alone via
the `device()` accessor; the swap RPC's wire format predates
multi-function and kata-agent's do_add_swap hardcodes
function=0, so the agent expectation is unchanged.
5. Advertise ACS capabilities on every PCIe root port
---------------------------------------------------
Set `acs_capabilities_supported: Some(0x5f)` on every
`PcieRootPortConfig` we hand to OpenVMM (5 static + N block-hotplug
+ M VFIO cold-plug). 0x5f = SV | TB | RR | CR | UF | DT, i.e. the
standard PCIe ACS bitmask for downstream ports (everything except
EC).
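For reference, the mask decomposes as follows. The bit positions are those of the PCIe ACS Capability register; the constant names and the `u16` width are choices made for this sketch, not the types OpenVMM uses.

```rust
// ACS Capability register bits, by position.
pub const ACS_SV: u16 = 1 << 0; // Source Validation
pub const ACS_TB: u16 = 1 << 1; // Translation Blocking
pub const ACS_RR: u16 = 1 << 2; // P2P Request Redirect
pub const ACS_CR: u16 = 1 << 3; // P2P Completion Redirect
pub const ACS_UF: u16 = 1 << 4; // Upstream Forwarding
pub const ACS_EC: u16 = 1 << 5; // P2P Egress Control (left out of the mask)
pub const ACS_DT: u16 = 1 << 6; // Direct Translated P2P

/// The mask advertised on every root port: everything except EC.
pub const fn acs_mask() -> u16 {
    ACS_SV | ACS_TB | ACS_RR | ACS_CR | ACS_UF | ACS_DT
}
```

Bit 5 (EC) is the only gap in the low seven bits, which is why the value is 0x5f and not 0x7f.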
Previously kata left this as `None`, which OpenVMM treats as
"do not synthesize the ACS Extended Capability at all" (see
microsoft/openvmm
`vm/devices/pci/pcie/src/port.rs::PcieDownstreamPort::new` where
`acs_supported` is only added to the cap list when non-zero).
Without ACS bits visible on the upstream root port, the in-guest
NVIDIA driver's P2P-enablement code concludes that peer-to-peer DMA
between assigned PCI devices is not safely supported and silently
declines to enable it. On an HGX A100 8-GPU + 6-NVSwitch baseboard
the symptom is:
* fabricmanager never programs the NVSwitch fabric
* `nvidia-smi nvlink --status` reports every link inactive
* `nvidia-smi topo -m` shows PHB (PCIe host-bridge) everywhere
instead of NV12 between any GPU pair
* NCCL falls back to slow PCIe-over-CPU paths
OpenVMM's own CLI parser defaults to 0x5f when `--pcie-root-port`
is passed without an explicit `acs=` value, so this just brings the
kata-driven path in line with the documented default.
Reported by John Starks.
Signed-off-by: Jocelyn Berrendonner <jocelynb@microsoft.com>
The OpenVMM backend embeds the VmWorker in-process, so unlike the
qemu/CH backends there is no child process for the shim to reap.
`OpenVmm::wait_vm()` blocks on an `exit_notify` mpsc channel that is
only ever signalled from `VmmInstance::stop()` -- i.e. only when
something externally calls `stop_vm()`. The mesh `Receiver<HaltReason>`
that the VmWorker uses to publish halt events was stored as
`_notify_recv` and never read.

As a result, when the guest powered itself off (agent-initiated
shutdown, kernel panic, triple fault, OOM-killed init, ...) the VmWorker
emitted `HaltReason::PowerOff` into a dead-letter channel, `wait_vm()`
kept waiting forever and the containerd-shim hung on pod teardown.

Replace the unread `_notify_recv` field with a `halt_watcher` tokio task
spawned during `launch()`. The watcher consumes halt notifications and
fires `exit_notify` on any terminal reason. `HaltReason::Reset` is
ignored because the worker is configured with
`automatic_guest_reset = true` and handles resets internally; the
explicit match is defensive in case that ever changes.

`stop()` aborts the watcher before its own `try_send(0)` so the two
paths do not race. Double-signaling would be harmless anyway --
`exit_notify` is a capacity-1 channel and `OpenVmm::wait_vm()` caches
the first received exit code -- but aborting keeps the logs clean.

OpenVMM itself has no `--exit-on-halt` flag; its REPL only logs "guest
halted" and stays alive waiting for an interactive `q`. This fix is the
equivalent for the embedded VmWorker.
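The forwarding pattern can be sketched with the standard library alone. Note the substitutions: a `std::thread` stands in for the tokio task, `std::sync::mpsc` stands in for the mesh receiver, and `HaltReasonSketch` is a cut-down enum invented for the example. Aborting the watcher from `stop()` is not modelled.

```rust
use std::sync::mpsc::{Receiver, SyncSender};
use std::thread::{self, JoinHandle};

/// Simplified halt reasons; the real enum has more variants.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum HaltReasonSketch {
    PowerOff,
    Reset,
    TripleFault,
}

/// Spawns a watcher that forwards the first terminal halt to `exit_notify`.
pub fn spawn_halt_watcher(
    halts: Receiver<HaltReasonSketch>,
    exit_notify: SyncSender<i32>,
) -> JoinHandle<()> {
    thread::spawn(move || {
        // The loop ends when every halt sender is dropped or a terminal
        // reason arrives, whichever comes first.
        for reason in halts {
            match reason {
                // Resets are handled by the worker itself; keep watching.
                HaltReasonSketch::Reset => continue,
                // Any other reason is terminal: wake whoever is waiting.
                _ => {
                    // try_send never blocks; a full channel means an exit
                    // code was already delivered, so the error is ignored.
                    let _ = exit_notify.try_send(0);
                    break;
                }
            }
        }
    })
}
```

Using `try_send` on a capacity-1 channel is what makes a second signal harmless in this sketch: the first exit code stays buffered and later attempts are dropped.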
Two changes folded into one: flip the OpenVMM default and teach the
cgroup setup code to recognise in-process VMMs so the new default
does not regress sandbox creation.
1. Default sandbox_cgroup_only=false in the OpenVMM TOML
------------------------------------------------------
When sandbox_cgroup_only=true the runtime-rs shim PID lives inside
the pod cgroup, which is bounded by container.limits.memory. For
OpenVMM this is fatal: OpenVMM runs in-process inside the shim and
allocates guest RAM as shmem mappings owned by that PID. With any
non-trivial guest RAM size the kernel OOM-killer fires during VM
construction:
misc mshv: pNN: Failed to populate memory region: -12
Memory cgroup out of memory: Killed process NNN (containerd-shim)
shmem-rss:8472576kB
The runtime then exits and the pod sandbox creation fails with
"ttrpc: closed" / FailedCreatePodSandBox.
The cgroup separation that would solve this is already implemented:
crates/resource/src/cgroups/resource_inner.rs::new_cgroup_managers
moves the runtime PID to an unconstrained kata-overhead cgroup
(systemd: kata-overhead.slice:runtime-rs:<sid>; cgroupfs:
kata_overhead/<path>) when sandbox_cgroup_only=false. That is also
the default of the corresponding Rust field
(libs/kata-types/src/config/runtime.rs::sandbox_cgroup_only), so
we are simply matching the language default in the generated TOML.
Container processes still get sized correctly because they continue
to be placed in the pod cgroup as usual; only the runtime+VMM
thread group moves to the overhead cgroup.
Verified on an HGX A100 8-GPU / 6-NVSwitch L1 VH:
- Before: limits.memory=8Gi -> openvmm mem_size=10GiB -> cgroup
OOM in ~20s; every sandbox-creation retry repeats the kill
cycle.
- After: limits.memory=24Gi -> shim cgroup is /kata_overhead/<sid>;
shim RSS climbs to ~27GB (24GB guest RAM + VMM overhead) with no
OOM; VM construction, VFIO IOMMU map, and per-GPU MSI-X setup
all complete; pod reaches Running.
Dragonball is intentionally not touched: it is force-pinned to
sandbox_cgroup_only=true in crates/resource/src/cgroups/mod.rs
because the Dragonball VMM shares its address space with the
runtime process and uses a different memory accounting model.
CLH is also out of scope for this commit. CLH execs the VMM as a
child process, so it suffers the same cgroup inheritance issue in
principle but is far less acute in practice (most CLH deployments
use small guest RAM). A separate change can flip the CLH default
after broader discussion.
2. Allow sandbox_cgroup_only=false for in-process VMM
---------------------------------------------------
After the default flip above, pod sandbox creation fails with:
failed to handle message start sandbox in task handler
Caused by:
0: setup device after start vm
1: setup cgroups after start vm
2: hypervisor cannot be moved to sandbox cgroup
The error originates in
`CgroupsResourceInner::setup_after_start_vm` which expects every
VMM with an overhead cgroup to report at least one vCPU thread ID.
The pre-existing comment correctly explains the intent: when an
overhead cgroup exists, vCPU threads must be moved to the sandbox
cgroup so container resource limits apply to them; an empty
`VcpuThreadIds` after VM start would otherwise let vCPUs run
unbounded in the overhead cgroup.
That intent is sound for out-of-process VMMs (clh, qemu, fc),
whose vCPU threads belong to a distinct child process that can be
moved without affecting the runtime. It does not work for
in-process VMMs (OpenVMM, and Dragonball when opted into
sandbox_cgroup_only=false): the runtime IS the VMM, the vCPU
threads are runtime threads, and moving them would drag the
runtime (including the in-process VMM holding gigabytes of guest
RAM as shmem) into the sandbox cgroup. The OpenVMM driver
therefore returns an empty `VcpuThreadIds` from `get_thread_ids`
by design (see openvmm/inner_hypervisor.rs:792), the same
convention Dragonball already follows.
Distinguish the two cases by asking the hypervisor for its PID
list: in-process VMMs report only `process::id()` (the runtime's
own PID), while out-of-process VMMs report a distinct child. Only
out-of-process VMMs warrant the fatal error; in-process VMMs
intentionally leave the runtime (and its vCPU threads) in the
overhead cgroup, with the sandbox cgroup materialising later when
container processes are added.
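The distinction reduces to a comparison against the runtime's own PID. A minimal sketch, assuming the hypervisor's PID list is available as a slice of `u32`; the enum and function names are hypothetical and do not mirror `setup_after_start_vm`.

```rust
use std::process;

/// How cgroup setup should treat a VMM that reports no vCPU thread IDs.
#[derive(Debug, PartialEq)]
pub enum EmptyVcpuPolicy {
    /// In-process VMM: leave the runtime and its vCPU threads in the
    /// overhead cgroup.
    LeaveInOverhead,
    /// Out-of-process VMM: an empty list is a real error.
    Fatal,
}

/// An in-process VMM reports exactly one PID: the runtime's own.
pub fn is_in_process_vmm(hypervisor_pids: &[u32], runtime_pid: u32) -> bool {
    hypervisor_pids == [runtime_pid].as_slice()
}

/// Decides what an empty vCPU thread list means for this hypervisor.
pub fn policy_for_empty_vcpus(hypervisor_pids: &[u32]) -> EmptyVcpuPolicy {
    if is_in_process_vmm(hypervisor_pids, process::id()) {
        EmptyVcpuPolicy::LeaveInOverhead
    } else {
        EmptyVcpuPolicy::Fatal
    }
}
```

An empty PID list is deliberately treated as out-of-process here, so the fatal path stays the default for anything that does not positively identify itself as the runtime.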
Trade-off documented for in-process VMMs: container `limits.cpu`
no longer constrains guest vCPU host CPU usage when
sandbox_cgroup_only=false, because the vCPU threads share a cgroup
with the unconstrained runtime. This is the price of allowing
large guest RAM under container memory limits, and matches the
behaviour the user opts into by setting sandbox_cgroup_only=false.
Operators that need both vCPU limits AND large guest RAM should
run an out-of-process VMM (clh).
Verified on the same HGX A100 8-GPU / 6-NVSwitch bench as
change #1: with both changes in place, sandbox creation completes (shim
in /kata_overhead/<sid>, ~27GB RSS for 24GB guest RAM, no
FailedCreatePodSandBox events).
Signed-off-by: Jocelyn Berrendonner <jocelynb@microsoft.com>
sprt
approved these changes
Jun 17, 2026
Merge Checklist
Summary
Bring the runtime-rs OpenVMM backend up to working end-to-end VFIO cold-plug on
HGX A100 / H100 baseboards (8 GPUs + 6 NVSwitches), plus the in-process VMM
cgroup story it needs to actually boot under kubelet memory limits. Five
commits, grouped by topic so each one stands on its own for review and
bisect:
kata-types: add openvmm to TopologyConfigInfo allowlist
Without this, every VFIO attach to an openvmm sandbox failed with
"VFIO device requires a PCIe topology but none was provided" because the
hardcoded hypervisor allowlist in `TopologyConfigInfo::new()` silently
returned `None` for openvmm.
runtime-rs/openvmm: cold-plug VFIO devices with correct guest PciPath
The existing cold-plug arm only matched the legacy `DeviceType::Vfio`,
not the `DeviceType::VfioModern(Arc<Mutex<…>>)` that the real producers
(`prepare_coldplug_cdi_devices` and `prepare_coldplug_raw_vfio_devices`)
actually emit, so every real workload silently dropped its VFIO devices.
The new arm also writes `config.guest_pci_path` back onto the device,
matching what the CH backend has been doing all along. Without this the
container-create step fails with "VFIO device has no guest PCI path
assigned" even though the VM is up.
runtime-rs/openvmm: PCIe root port layout, slot field, MMIO sizing and ACS fixes
Five inter-related root-complex topology fixes that only show up at scale
(8 GPUs + 6 NVSwitches):
* Enlarge the PCIe low_mmio window to 640 MB; the previous 320 MB window
  overflowed at
  `low_mmio MMIO exhaustion: need 0x14500000 bytes, have 0x14000000`.
* Fit the PCIe root ports within PCI's 5-bit slot field (was 5+24+16=45,
  now 5+4+23=32).
* Pack root ports across PCI functions to match what OpenVMM actually
  programs into the guest's config space (port `i` → device `i/8`,
  function `i%8`); extend `PciSlot` with a function number and add an
  `openvmm_port_pci_path` helper used by both block-hotplug and VFIO
  cold-plug.
* Fix up the remaining `PciSlot` tuple-struct callsites (`topology.rs`,
  `swap.rs`) after the named-field refactor.
* Advertise ACS capabilities on every `PcieRootPortConfig`. Without this,
  the in-guest NVIDIA driver silently declines GPU peer-to-peer DMA:
  fabricmanager never programs the NVSwitch fabric,
  `nvidia-smi nvlink --status` reports every link inactive,
  `nvidia-smi topo -m` shows PHB everywhere, and NCCL falls back to slow
  PCIe-over-CPU paths. (Reported by John Starks.)
runtime-rs/openvmm: exit shim when guest halts on its own
OpenVMM is in-process so there is no child to reap; the mesh
`Receiver<HaltReason>` was stored as `_notify_recv` and never read, so
guest-initiated power-off / panic / OOM-killed-init left the shim hung
on pod teardown. Spawn a `halt_watcher` task that forwards terminal
halt reasons to `exit_notify`.
runtime-rs/openvmm: default sandbox_cgroup_only=false for in-process VMM
* Default `sandbox_cgroup_only=false` in the OpenVMM TOML, matching the
  Rust field's language default. With the previous (true) default the
  runtime PID lives in the pod cgroup; OpenVMM's shmem-backed guest RAM
  is then accounted to `container.limits.memory` and the kernel
  OOM-killer fires during VM construction
  (`misc mshv: pNN: Failed to populate memory region: -12`,
  `Killed process … shmem-rss:8472576kB`), producing
  `FailedCreatePodSandBox`.
* Teach `CgroupsResourceInner::setup_after_start_vm` to recognise
  in-process VMMs (those whose pid list is just `process::id()`) so the
  new default doesn't trip the existing "hypervisor cannot be moved to
  sandbox cgroup" fatal. That check is only meaningful for
  out-of-process VMMs whose vCPU threads belong to a distinct child.
Documented trade-off: container `limits.cpu` no longer constrains vCPU
host CPU usage for in-process VMMs when `sandbox_cgroup_only=false`,
because vCPU threads share a cgroup with the unconstrained runtime.
Operators that need both vCPU limits AND large guest RAM should run
an out-of-process VMM (clh). Dragonball is unchanged (force-pinned to
`sandbox_cgroup_only=true` in `cgroups/mod.rs`). CLH default is left
alone: it's affected in principle but most CLH deployments use small
guest RAM, so a separate change can flip it after broader discussion.
Associated issues
N/A
Links to CVEs
N/A
Test Methodology
End-to-end on an HGX A100 8-GPU / 6-NVSwitch L1 VH test bench:
* `runtime-rs` (openvmm feature) and the kata containerd shim from
  this branch; deployed via the standard `kata-deploy` flow.
* Raw OCI VFIO (`ctr --device /dev/vfio/N`) and CDI (kubelet
  device-plugin path): both producers now reach `Running` with the
  agent receiving correct `vfio-pci` `device_options`
  (`HOST_BDF=GUEST_PCI_PATH`).
* `lspci` shows all 14 endpoints, each behind its own root port packed
  into multi-function device slots as expected.
* `nvidia-smi nvlink --status` reports all links active; fabricmanager
  programs the NVSwitch fabric.
* `nvidia-smi topo -m` shows NV12 between GPU pairs (not PHB).
* `systemd-cgls` / `/proc/<shim-pid>/cgroup`: shim lives in
  `/kata_overhead/<sid>`, RSS climbs to ~27 GB for a 24 GiB guest with
  no OOM, where the previous default OOM-killed the shim in ~20 s on a
  10 GiB guest under an 8 GiB pod memory limit.
* `kubectl delete pod`, in-guest `shutdown -h now`, and an induced
  guest panic all cleanly exit the shim; no more hung containerd-shim
  processes.
configs and behaviour are unchanged. Existing CH cold-plug pods on the
same bench continue to work.