User/jocelynb/upstream to aurelien #460

Open
JocelynBerrendonner wants to merge 5 commits into microsoft:sprt/openvmm from JocelynBerrendonner:user/jocelynb/upstream-to-aurelien

Conversation

@JocelynBerrendonner

Member
Merge Checklist
  • Followed patch format from upstream recommendation: https://github.com/kata-containers/community/blob/main/CONTRIBUTING.md#patch-format
  • Included a single commit in a given PR - at least unless there are related commits and each makes sense as a change on its own.
  • Merged using "create a merge commit" rather than "squash and merge" (or similar)
  • genPolicy only: Builds on Windows
  • genPolicy only: Updated sample YAMLs' policy annotations, if applicable
Summary

Bring the runtime-rs OpenVMM backend up to working end-to-end VFIO cold-plug on
HGX A100 / H100 baseboards (8 GPUs + 6 NVSwitches), plus the in-process VMM
cgroup story it needs to actually boot under kubelet memory limits. Five
commits, grouped by topic so each one stands on its own for review and
bisect:

  1. kata-types: add openvmm to TopologyConfigInfo allowlist
    Without this, every VFIO attach to an openvmm sandbox failed with
    "VFIO device requires a PCIe topology but none was provided" because the
    hardcoded hypervisor allowlist in TopologyConfigInfo::new() silently
    returned None for openvmm.

  2. runtime-rs/openvmm: cold-plug VFIO devices with correct guest PciPath
    The existing cold-plug arm only matched the legacy DeviceType::Vfio,
    not the DeviceType::VfioModern(Arc<Mutex<…>>) that the real producers
    (prepare_coldplug_cdi_devices and prepare_coldplug_raw_vfio_devices)
    actually emit, so every real workload silently dropped its VFIO devices.
    The new arm also writes config.guest_pci_path back onto the device,
    matching what the CH backend has been doing all along — without this the
    container-create step fails with "VFIO device has no guest PCI path
    assigned" even though the VM is up.

  3. runtime-rs/openvmm: PCIe root port layout, slot field, MMIO sizing and ACS fixes
    Five inter-related root-complex topology fixes that only show up at scale
    (8 GPUs + 6 NVSwitches):

    • Enlarge the PCIe low_mmio window from 320 MB to 640 MB; the old window
      overflowed before VM boot with "low_mmio MMIO exhaustion: need
      0x14500000 bytes, have 0x14000000".
    • Rebalance static / block-hotplug / cold-plug port counts to fit in
      PCI's 5-bit slot field (was 5+24+16=45, now 5+4+23=32).
    • Pack root ports across PCI functions to match what OpenVMM actually
      programs into the guest's config space (port i → device i/8,
      function i%8); extend PciSlot with a function number and add an
      openvmm_port_pci_path helper used by both block-hotplug and VFIO
      cold-plug.
    • Fix the two remaining PciSlot tuple-struct callsites (topology.rs,
      swap.rs) after the named-field refactor.
    • Advertise ACS capabilities (0x5f = SV|TB|RR|CR|UF|DT) on every
      PcieRootPortConfig. Without this, the in-guest NVIDIA driver silently
      declines GPU peer-to-peer DMA: fabricmanager never programs the
      NVSwitch fabric, nvidia-smi nvlink --status reports every link
      inactive, nvidia-smi topo -m shows PHB everywhere, and NCCL falls
      back to slow PCIe-over-CPU paths. (Reported by John Starks.)

  4. runtime-rs/openvmm: exit shim when guest halts on its own
    OpenVMM is in-process so there is no child to reap; the mesh
    Receiver<HaltReason> was stored as _notify_recv and never read, so
    guest-initiated power-off / panic / OOM-killed-init left the shim hung
    on pod teardown. Spawn a halt_watcher task that forwards terminal
    halt reasons to exit_notify.

  5. runtime-rs/openvmm: default sandbox_cgroup_only=false for in-process VMM

    • Flip the default in the generated openvmm TOML to match the default of
      the corresponding Rust field. With the previous (true) default the runtime PID
      lives in the pod cgroup; OpenVMM's shmem-backed guest RAM is then
      accounted to container.limits.memory and the kernel OOM-killer
      fires during VM construction (misc mshv: pNN: Failed to populate memory region: -12, Killed process … shmem-rss:8472576kB),
      producing FailedCreatePodSandBox.
    • Teach CgroupsResourceInner::setup_after_start_vm to recognise
      in-process VMMs (those whose pid list is just process::id()) so the
      new default doesn't trip the existing "hypervisor cannot be moved to
      sandbox cgroup" fatal — that check is only meaningful for
      out-of-process VMMs whose vCPU threads belong to a distinct child.

    Documented trade-off: container limits.cpu no longer constrains vCPU
    host CPU usage for in-process VMMs when sandbox_cgroup_only=false,
    because vCPU threads share a cgroup with the unconstrained runtime.
    Operators that need both vCPU limits AND large guest RAM should run
    an out-of-process VMM (clh). Dragonball is unchanged (force-pinned to
    sandbox_cgroup_only=true in cgroups/mod.rs). CLH default is left
    alone — it's affected in principle but most CLH deployments use small
    guest RAM, so a separate change can flip it after broader discussion.

Associated issues

N/A

Links to CVEs

N/A

Test Methodology

End-to-end on an HGX A100 8-GPU / 6-NVSwitch L1 VH test bench:

  • Built runtime-rs (openvmm feature) and the kata containerd shim from
    this branch; deployed via the standard kata-deploy flow.
  • Launched a GPU pod with all 8 GPUs + 6 NVSwitches passed through via
    raw OCI VFIO (ctr --device /dev/vfio/N) and via CDI (kubelet
    device-plugin path); both producers now reach Running with the
    agent receiving correct vfio-pci device_options
    (HOST_BDF=GUEST_PCI_PATH).
  • Confirmed in-guest:
    • lspci shows all 14 endpoints, each behind its own root port packed
      into multi-function device slots as expected.
    • nvidia-smi nvlink --status reports all links active; fabricmanager
      programs the NVSwitch fabric.
    • nvidia-smi topo -m shows NV12 between GPU pairs (not PHB).
  • Cgroup behaviour verified with systemd-cgls / /proc/<shim-pid>/cgroup:
    shim lives in /kata_overhead/<sid>, RSS climbs to ~27 GB for a 24 GiB
    guest with no OOM, where the previous default OOM-killed the shim in
    ~20 s on a 10 GiB guest under an 8 GiB pod memory limit.
  • Pod teardown (kubectl delete pod, in-guest shutdown -h now, induced
    guest panic) all cleanly exit the shim; no more hung containerd-shim
    processes.
  • No code changes outside the openvmm path; CH / Dragonball / qemu / fc
    configs and behaviour are unchanged. Existing CH cold-plug pods on the
    same bench continue to work.

`TopologyConfigInfo::new()` in `src/libs/kata-types/src/config/hypervisor/mod.rs`
gates PCIe topology construction on a hardcoded allowlist of hypervisor
names. Hypervisors not in the list silently get `None`, which is then
propagated into `DeviceManager::pcie_topology`, which then causes every
VFIO device attach to fail with:

  Caused by:
      0: set up device before start vm
      1: do handle vfio device failed.
      2: failed to add device
      3: VFIO device requires a PCIe topology but none was provided

(thrown from `VfioDeviceModernHandle::attach` in
`src/runtime-rs/crates/hypervisor/src/device/driver/vfio_device/device.rs`.)

`openvmm` was missing from the allowlist, so any VFIO device assignment
to an openvmm sandbox would fail at sandbox-create time. Add it so the
generic runtime-rs device manager produces a real `PCIeTopology` for
openvmm too. The openvmm hypervisor backend already knows how to honour
the topology's `pcie_root_port` count when laying out the in-process
machine.
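
The gating behaviour described above can be sketched as follows. This is a minimal illustration, not the kata-types code: `TopologySketch`, `TOPOLOGY_HYPERVISORS`, `topology_for` and `attach_vfio` are hypothetical names, and the allowlist entries other than `openvmm` are assumptions.

```rust
/// Hypothetical stand-in for the PCIe topology handed to the device manager.
#[derive(Debug, PartialEq)]
struct TopologySketch {
    hypervisor: String,
    pcie_root_ports: u32,
}

/// Assumed allowlist contents; the real list lives in `TopologyConfigInfo::new()`.
const TOPOLOGY_HYPERVISORS: &[&str] = &["qemu", "cloud-hypervisor", "openvmm"];

/// Returns `None` for any hypervisor that is not on the allowlist,
/// which is the silent failure mode this commit removes for openvmm.
fn topology_for(hypervisor: &str, pcie_root_ports: u32) -> Option<TopologySketch> {
    if !TOPOLOGY_HYPERVISORS.iter().any(|h| *h == hypervisor) {
        return None;
    }
    Some(TopologySketch {
        hypervisor: hypervisor.to_string(),
        pcie_root_ports,
    })
}

/// A VFIO attach cannot proceed without a topology.
fn attach_vfio(topology: Option<&TopologySketch>) -> Result<(), String> {
    topology
        .map(|_| ())
        .ok_or_else(|| "VFIO device requires a PCIe topology but none was provided".to_string())
}
```

With `openvmm` on the list, `topology_for("openvmm", n)` yields `Some(..)` and the attach succeeds; a name that is not listed still produces `None` and the error shown above.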

Signed-off-by: Jocelyn Berrendonner <jocelynb@microsoft.com>

The openvmm cold-plug branch in inner_hypervisor.rs::start_vm did two
things wrong, both invisible at VM-boot time but fatal at container
create:

  1. It only matched DeviceType::Vfio (the legacy struct from
     device/driver/vfio.rs), but the actual producers
     -- prepare_coldplug_cdi_devices() (kubelet CDI grants) and
        prepare_coldplug_raw_vfio_devices() (`ctr --device /dev/vfio/N`)
     -- emit ResourceConfig::VfioDeviceModern, which becomes
     DeviceType::VfioModern(Arc<Mutex<VfioDeviceModern>>). So the
     existing arm never fired for any real workload, and devices fell
     into the `other` warn branch and were silently dropped.

  2. Even if a producer somewhere did emit the legacy variant, the
     existing arm never wrote `config.guest_pci_path` back onto the device. The CH
     driver in ch/inner_device.rs does this assignment as the last
     step of its cold-plug; openvmm was missing it. With
     `guest_pci_path = None`, the container-create-time call to
     handler_devices() in resource::manager_inner fails with
        "VFIO device has no guest PCI path assigned"
     even though the VM is up and the BAR space is mapped.

Add a new DeviceType::VfioModern arm that:
  * locks the Arc<Mutex<VfioDeviceModern>>,
  * iterates `vfio_device.device.devices` (or the primary if the
    devices vec is empty),
  * reserves the next pre-allocated vfio<N> root port for each PCI
    function,
  * opens the /dev/vfio/<group> fd and pushes a PcieDeviceConfig with
    the segment-qualified BDF (e.g. "0001:00:00.0", as produced by
    BdfAddress::Display),
  * computes the guest PciPath [root_slot, 0] -- matching openvmm's
    root-bus slot layout (STATIC + BLOCK_HOTPLUG + port_index) and the
    two-slot PciPath convention block hotplug already uses,
  * writes that PciPath back onto `config.guest_pci_path` for the
    IOMMU-group primary (handler_devices exposes one device per
    VfioDeviceModern, matching the primary).
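
The shape of the new arm can be sketched as below. This is an illustration under stated assumptions, not the shim code: `DeviceSketch`, `VfioModernSketch` and `cold_plug` are hypothetical names, the guest path is reduced to a string in `slot/00` form, and opening the VFIO group fd is left out.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical, simplified stand-in for `VfioDeviceModern`.
#[derive(Debug)]
struct VfioModernSketch {
    host_bdf: String,
    guest_pci_path: Option<String>,
}

/// Hypothetical, simplified stand-in for `DeviceType`.
enum DeviceSketch {
    Vfio, // legacy variant, kept as a no-op
    VfioModern(Arc<Mutex<VfioModernSketch>>),
    Other,
}

/// Walks the device list, assigns each modern VFIO device the next
/// root-bus slot and writes the guest path back onto the device.
/// Returns the "HOST_BDF=GUEST_PCI_PATH" options the agent would receive.
fn cold_plug(devices: &[DeviceSketch], first_vfio_slot: u32) -> Vec<String> {
    let mut options = Vec::new();
    let mut port_index = 0u32;
    for dev in devices {
        match dev {
            DeviceSketch::VfioModern(shared) => {
                // Lock the Arc<Mutex<..>> directly; there is no wrapper field.
                let mut vfio = shared.lock().expect("device mutex poisoned");
                let path = format!("{:02x}/00", first_vfio_slot + port_index);
                vfio.guest_pci_path = Some(path.clone());
                options.push(format!("{}={}", vfio.host_bdf, path));
                port_index += 1;
            }
            // The legacy arm and unrelated devices do nothing here.
            DeviceSketch::Vfio | DeviceSketch::Other => {}
        }
    }
    options
}
```

The write-back is the important part: because the device is shared through the `Arc`, the path set during cold-plug is visible to whoever later reads the same device at container-create time.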

The legacy DeviceType::Vfio arm is left in place as a no-op safety
net. It is unreachable today but the cost of keeping it is small,
and removing it would change the public Vfio-handling surface of the
openvmm shim for no functional gain.

End-to-end effect: NVIDIA GPU + NVSwitch passthrough pods now reach
the container create step with the agent receiving correct vfio-pci
device_options ("HOST_BDF=GUEST_PCI_PATH"), matching what the CH
back-end has been doing all along.

Build fixes folded in:
  * Re-export DeviceAddress from
    crate::device::driver::vfio_device (the `core` submodule is
    private; only re-exports from mod.rs are public).
  * Lock the Arc<Mutex<VfioDeviceModern>> directly instead of going
    through a non-existent `.inner` field -- DeviceType::VfioModern
    carries the Arc<Mutex<...>> directly (see device/mod.rs:63), not
    the VfioDeviceModernHandle newtype, matching what
    ch/inner_device.rs already does in its own cold-plug branch.

Signed-off-by: Jocelyn Berrendonner <jocelynb@microsoft.com>

…d ACS fixes

A bundle of related fixes to the OpenVMM PCIe root-complex topology
that runtime-rs hands to OpenVMM. Pre-fix, an HGX A100 8-GPU + 6-NVSwitch
baseboard (14 VFIO devices) either could not allocate enough MMIO,
overflowed the 5-bit PCI slot field, programmed the wrong PciPath into
the guest, or silently disabled GPU peer-to-peer DMA. This commit
addresses all of those.

1. Enlarge PCIe low_mmio window to 640 MB
   ----------------------------------------
   The openvmm PCIe root complex was configured with a 320 MB
   non-prefetchable MMIO window (0xc000_0000..0xd400_0000). Cold-plugging
   8 GPUs + 6 NVSwitches on the test bench overruns it and the worker
   aborts before VM boot:

       PCI resource assignment failed
       low_mmio MMIO exhaustion: need 0x14500000 bytes, have 0x14000000

   The need comes mostly from bridge MMIO32 windows that the root ports
   pre-reserve (8x 16 MB GPU BAR0 + 6x 32 MB NVSwitch BAR0 = 320 MB on
   their own, plus bridge alignment padding for every static / block-
   hotplug / vfio-coldplug port).

   64-bit prefetchable BARs (H100 BAR1/2 = 128 GB, BAR3/4 = 32 MB) go
   into high_mmio and do not count.

   Widen the window to 640 MB (0xc000_0000..0xe800_0000) for ~2x
   headroom; still fits before ECAM. high_mmio is untouched -- its
   ~125 TB range continues to absorb the 1 TB+ of prefetchable BARs
   an 8-GPU pod brings.
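
The window arithmetic above can be checked with a few constants. This is only a worked calculation using the addresses and sizes quoted in this message; the constant and function names are hypothetical.

```rust
const MIB: u64 = 1024 * 1024;

// Window bounds quoted in the commit message.
const LOW_MMIO_START: u64 = 0xc000_0000;
const OLD_LOW_MMIO_END: u64 = 0xd400_0000;
const NEW_LOW_MMIO_END: u64 = 0xe800_0000;

// Requirement reported by the failing resource assignment.
const REPORTED_NEED: u64 = 0x1450_0000;

/// Length of a half-open MMIO window `[start, end)`.
fn window_len(start: u64, end: u64) -> u64 {
    end - start
}

/// True when `need` bytes fit in the window `[start, end)`.
fn fits(need: u64, start: u64, end: u64) -> bool {
    need <= window_len(start, end)
}

/// BAR0 total for 8 GPUs (16 MB each) plus 6 NVSwitches (32 MB each), in MiB.
fn bar0_total_mib() -> u64 {
    8 * 16 + 6 * 32
}
```

The old window is exactly 320 MiB, so the BAR0 total alone fills it and the reported need of 325 MiB (0x14500000) cannot fit; the 640 MiB window leaves roughly 2x headroom.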

2. Fit PCIe root ports within 5-bit slot field
   --------------------------------------------
   PCI slot numbers are 5 bits (0..0x1f = 0..31). The OpenVMM shim
   assigns one slot per root port on the PCIe root bus, so the total
   of static + block-hotplug + VFIO cold-plug ports must be <= 32.
   The previous layout (5 + 24 + 16 = 45) overflowed and triggered

       'Failed to parse PCI path from QOM path '24/00'
        Caused by: PCI slot 36 should be in range [0..0x1f]'

   in kata-agent the moment the shim tried to cold-plug the 4th VFIO
   device on an HGX A100 8-GPU baseboard.

   Rebalance to (5 + 4 + 23 = 32) and document the budget on the
   constants so the next person adding a new port class cannot
   accidentally overflow PCI again.
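
One way to document the budget on the constants is a compile-time assertion. The sketch below assumes hypothetical constant names; only the values (5, 4, 23 and the 32-slot limit) come from this message.

```rust
// Hypothetical constant names; the values are the rebalanced budget.
const STATIC_PORTS: u32 = 5;
const BLOCK_HOTPLUG_PORTS: u32 = 4;
const VFIO_COLDPLUG_PORTS: u32 = 23;

/// A PCI slot (device) number is 5 bits wide, so a bus has 32 slots.
const MAX_PCI_SLOTS: u32 = 1 << 5;

// Fails the build if a future port class pushes the total past 32.
const _: () = assert!(STATIC_PORTS + BLOCK_HOTPLUG_PORTS + VFIO_COLDPLUG_PORTS <= MAX_PCI_SLOTS);

/// Runtime form of the same check, usable for arbitrary layouts.
fn fits_slot_field(static_ports: u32, block_ports: u32, vfio_ports: u32) -> bool {
    static_ports + block_ports + vfio_ports <= MAX_PCI_SLOTS
}
```

With one slot per root port, the previous 5 + 24 + 16 layout fails this check while 5 + 4 + 23 passes it exactly.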

3. Pack PCIe root ports across PCI functions
   ------------------------------------------
   OpenVMM packs PCIe root ports into multi-function device slots:
   port `i` sits at `device = i/8, function = i%8` on bus 0 of the
   root complex (see microsoft/openvmm
   `vm/devices/pci/pcie/src/root.rs::GenericPcieRootComplex::new`,
   where `first_port_device_number = 0` on the x86_64 no-IOMMU path
   the kata runtime uses).

   The previous kata-shim layout assigned single-function slots:
   `root_slot = STATIC + BLOCK_HOTPLUG + port_index`. That:

     a. Overflowed PCI's 5-bit slot field once a few VFIO ports were
        allocated on top of the 24-slot block-hotplug pool.
     b. Disagreed with what OpenVMM actually programs into the guest's
        PCIe config space, so slots below 32 only worked by coincidence.

   Extend PciSlot with a function number (defaulting to 0; wire format
   stays `xx` for function 0 and `xx.f` for multi-function; kata-agent's
   `SlotFn::from_str` already accepts both). Add an
   `openvmm_port_pci_path` helper that mirrors OpenVMM's packing, and
   use it from both the block-hotplug allocator and the VFIO cold-plug
   code path. This restores the original 24 block / 32 VFIO budgets and
   unlocks OpenVMM's full 256-port-per-complex capacity.
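
The packing rule can be sketched as follows. `SlotSketch` and `from_port_index` are hypothetical names, not the actual `PciSlot` or `openvmm_port_pci_path` definitions, and the two-digit hexadecimal rendering of the device number is an assumption about the `xx` / `xx.f` wire format.

```rust
use std::fmt;

/// Hypothetical stand-in for a PciSlot that carries a function number.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct SlotSketch {
    device: u8,
    function: u8,
}

impl SlotSketch {
    /// Mirrors the packing described above: port i sits at
    /// device i / 8, function i % 8. Returns None once the device
    /// number no longer fits in the 5-bit slot field.
    fn from_port_index(port_index: u32) -> Option<Self> {
        let device = port_index / 8;
        let function = port_index % 8;
        if device > 0x1f {
            return None;
        }
        Some(Self {
            device: device as u8,
            function: function as u8,
        })
    }
}

impl fmt::Display for SlotSketch {
    /// Function 0 keeps the plain `xx` form; other functions use `xx.f`.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        if self.function == 0 {
            write!(f, "{:02x}", self.device)
        } else {
            write!(f, "{:02x}.{:x}", self.device, self.function)
        }
    }
}
```

Under this rule 32 devices times 8 functions gives 256 addressable ports, which is where the 256-port-per-complex figure comes from.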

4. PciSlot callsite fixups after tuple -> named-field refactor
   ------------------------------------------------------------
   PciSlot was refactored from a tuple struct to a named-field struct
   in change #3. Update the two remaining callsites that still used
   tuple syntax:
     * topology.rs (static PCIe-topology allocator) -- use
       `PciSlot::new` instead of `PciSlot(v as u8)`.
     * swap.rs (AddSwap path) -- send the device number alone via
       the `device()` accessor; the swap RPC's wire format predates
       multi-function and kata-agent's do_add_swap hardcodes
       function=0, so the agent expectation is unchanged.

5. Advertise ACS capabilities on every PCIe root port
   ---------------------------------------------------
   Set `acs_capabilities_supported: Some(0x5f)` on every
   `PcieRootPortConfig` we hand to OpenVMM (5 static + N block-hotplug
   + M VFIO cold-plug). 0x5f = SV | TB | RR | CR | UF | DT, i.e. the
   standard PCIe ACS bitmask for downstream ports (everything except
   EC).
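
The composition of 0x5f can be spelled out bit by bit. The bit positions below follow the PCIe ACS Capability register layout; the constant and function names are hypothetical and are not taken from OpenVMM or kata.

```rust
// ACS Capability register bits, by position.
const ACS_SV: u16 = 1 << 0; // Source Validation
const ACS_TB: u16 = 1 << 1; // Translation Blocking
const ACS_RR: u16 = 1 << 2; // P2P Request Redirect
const ACS_CR: u16 = 1 << 3; // P2P Completion Redirect
const ACS_UF: u16 = 1 << 4; // Upstream Forwarding
const ACS_EC: u16 = 1 << 5; // P2P Egress Control
const ACS_DT: u16 = 1 << 6; // Direct Translated P2P

/// Everything except Egress Control.
const ACS_DOWNSTREAM_DEFAULT: u16 = ACS_SV | ACS_TB | ACS_RR | ACS_CR | ACS_UF | ACS_DT;

/// Lists the short names of the bits set in `mask`, lowest bit first.
fn acs_names(mask: u16) -> Vec<&'static str> {
    let table = [
        (ACS_SV, "SV"),
        (ACS_TB, "TB"),
        (ACS_RR, "RR"),
        (ACS_CR, "CR"),
        (ACS_UF, "UF"),
        (ACS_EC, "EC"),
        (ACS_DT, "DT"),
    ];
    table
        .iter()
        .filter(|(bit, _)| mask & *bit != 0)
        .map(|(_, name)| *name)
        .collect()
}
```

Bit 5 (EC, value 0x20) is the only one of the seven left clear, which is why the mask is 0x5f rather than 0x7f.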

   Previously kata left this as `None`, which OpenVMM treats as
   "do not synthesize the ACS Extended Capability at all" (see
   microsoft/openvmm
   `vm/devices/pci/pcie/src/port.rs::PcieDownstreamPort::new` where
   `acs_supported` is only added to the cap list when non-zero).

   Without ACS bits visible on the upstream root port, the in-guest
   NVIDIA driver's P2P-enablement code concludes that peer-to-peer DMA
   between assigned PCI devices is not safely supported and silently
   declines to enable it. On an HGX A100 8-GPU + 6-NVSwitch baseboard
   the symptom is:

     * fabricmanager never programs the NVSwitch fabric
     * `nvidia-smi nvlink --status` reports every link inactive
     * `nvidia-smi topo -m` shows PHB (PCIe host-bridge) everywhere
       instead of NV12 between any GPU pair
     * NCCL falls back to slow PCIe-over-CPU paths

   OpenVMM's own CLI parser defaults to 0x5f when `--pcie-root-port`
   is passed without an explicit `acs=` value, so this just brings the
   kata-driven path in line with the documented default.

   Reported by John Starks.

Signed-off-by: Jocelyn Berrendonner <jocelynb@microsoft.com>

The OpenVMM backend embeds the VmWorker in-process, so unlike the
qemu/CH backends there is no child process for the shim to reap.
`OpenVmm::wait_vm()` blocks on an `exit_notify` mpsc channel that is
only ever signalled from `VmmInstance::stop()` -- i.e. only when
something externally calls `stop_vm()`.

The mesh `Receiver<HaltReason>` that the VmWorker uses to publish
halt events was stored as `_notify_recv` and never read. As a result,
when the guest powered itself off (agent-initiated shutdown, kernel
panic, triple fault, OOM-killed init, ...) the VmWorker emitted a
`HaltReason` (e.g. `HaltReason::PowerOff`) into a dead-letter channel, `wait_vm()` kept
waiting forever and the containerd-shim hung on pod teardown.

Replace the unread `_notify_recv` field with a `halt_watcher` tokio
task spawned during `launch()`. The watcher consumes halt notifications
and fires `exit_notify` on any terminal reason. `HaltReason::Reset` is
ignored because the worker is configured with
`automatic_guest_reset = true` and handles resets internally; the
explicit match is defensive in case that ever changes.

`stop()` aborts the watcher before its own `try_send(0)` so the two
paths do not race. Double-signaling would be harmless anyway --
`exit_notify` is a capacity-1 channel and `OpenVmm::wait_vm()` caches
the first received exit code -- but aborting keeps the logs clean.
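
The forwarding logic can be sketched with standard-library threads and channels. This is a simplified stand-in, not the shim code: the real watcher is a tokio task reading a mesh `Receiver<HaltReason>`, while `HaltSketch`, `is_terminal` and `spawn_halt_watcher` here are hypothetical names, and sending exit code 0 from the watcher is an assumption.

```rust
use std::sync::mpsc::{Receiver, SyncSender};
use std::thread;

/// Hypothetical, reduced set of halt reasons.
#[derive(Debug, Clone, Copy, PartialEq)]
enum HaltSketch {
    PowerOff,
    Reset,
    TripleFault,
}

/// Resets are handled inside the worker, so only the rest end the VM.
fn is_terminal(reason: HaltSketch) -> bool {
    !matches!(reason, HaltSketch::Reset)
}

/// Consumes halt notifications and signals `exit_notify` on the first
/// terminal reason. `try_send` never blocks: if the capacity-1 channel
/// already holds an exit code, the extra signal is simply dropped.
fn spawn_halt_watcher(
    halts: Receiver<HaltSketch>,
    exit_notify: SyncSender<i32>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for reason in halts {
            if is_terminal(reason) {
                let _ = exit_notify.try_send(0);
                break;
            }
        }
    })
}
```

Feeding the watcher a `Reset` followed by a `PowerOff` leaves exactly one exit code in the channel, and a second `try_send` on the full channel returns an error instead of blocking, which is the harmless double-signal case described above.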

OpenVMM itself has no `--exit-on-halt` flag; its REPL only logs
"guest halted" and stays alive waiting for an interactive `q`. This
fix is the equivalent for the embedded VmWorker.

Two changes folded into one: flip the OpenVMM default and teach the
cgroup setup code to recognise in-process VMMs so the new default
does not regress sandbox creation.

1. Default sandbox_cgroup_only=false in the OpenVMM TOML
   ------------------------------------------------------
   When sandbox_cgroup_only=true the runtime-rs shim PID lives inside
   the pod cgroup, which is bounded by container.limits.memory. For
   OpenVMM this is fatal: OpenVMM runs in-process inside the shim and
   allocates guest RAM as shmem mappings owned by that PID. With any
   non-trivial guest RAM size the kernel OOM-killer fires during VM
   construction:

     misc mshv: pNN: Failed to populate memory region: -12
     Memory cgroup out of memory: Killed process NNN (containerd-shim)
                                   shmem-rss:8472576kB

   The runtime then exits and the pod sandbox creation fails with
   "ttrpc: closed" / FailedCreatePodSandBox.

   The cgroup separation that would solve this is already implemented:
   crates/resource/src/cgroups/resource_inner.rs::new_cgroup_managers
   moves the runtime PID to an unconstrained kata-overhead cgroup
   (systemd: kata-overhead.slice:runtime-rs:<sid>; cgroupfs:
   kata_overhead/<path>) when sandbox_cgroup_only=false. That is also
   the default of the corresponding Rust field
   (libs/kata-types/src/config/runtime.rs::sandbox_cgroup_only), so
   we are simply matching the Rust default in the generated TOML.

   Container processes still get sized correctly because they continue
   to be placed in the pod cgroup as usual; only the runtime+VMM
   thread group moves to the overhead cgroup.

   Verified on an HGX A100 8-GPU / 6-NVSwitch L1 VH:
    - Before: limits.memory=8Gi -> openvmm mem_size=10GiB -> cgroup
      OOM in ~20s; every sandbox-creation retry repeats the kill
      cycle.
    - After:  limits.memory=24Gi -> shim cgroup is /kata_overhead/<sid>;
      shim RSS climbs to ~27GB (24GB guest RAM + VMM overhead) with no
      OOM; VM construction, VFIO IOMMU map, and per-GPU MSI-X setup
      all complete; pod reaches Running.

   Dragonball is intentionally not touched: it is force-pinned to
   sandbox_cgroup_only=true in crates/resource/src/cgroups/mod.rs
   because the Dragonball VMM shares its address space with the
   runtime process and uses a different memory accounting model.

   CLH is also out of scope for this commit. CLH execs the VMM as a
   child process, so it suffers the same cgroup inheritance issue in
   principle, but the problem is far less acute in practice (most CLH deployments
   use small guest RAM). A separate change can flip the CLH default
   after broader discussion.

2. Allow sandbox_cgroup_only=false for in-process VMM
   ---------------------------------------------------
   After the default flip above, pod sandbox creation fails with:

     failed to handle message start sandbox in task handler
     Caused by:
       0: setup device after start vm
       1: setup cgroups after start vm
       2: hypervisor cannot be moved to sandbox cgroup

   The error originates in
   `CgroupsResourceInner::setup_after_start_vm` which expects every
   VMM with an overhead cgroup to report at least one vCPU thread ID.
   The pre-existing comment correctly explains the intent: when an
   overhead cgroup exists, vCPU threads must be moved to the sandbox
   cgroup so container resource limits apply to them; an empty
   `VcpuThreadIds` after VM start would otherwise let vCPUs run
   unbounded in the overhead cgroup.

   That intent is sound for out-of-process VMMs (clh, qemu, fc),
   whose vCPU threads belong to a distinct child process that can be
   moved without affecting the runtime. It does not work for
   in-process VMMs (OpenVMM, and Dragonball when opted into
   sandbox_cgroup_only=false): the runtime IS the VMM, the vCPU
   threads are runtime threads, and moving them would drag the
   runtime (including the in-process VMM holding gigabytes of guest
   RAM as shmem) into the sandbox cgroup. The OpenVMM driver
   therefore returns an empty `VcpuThreadIds` from `get_thread_ids`
   by design (see openvmm/inner_hypervisor.rs:792), the same
   convention Dragonball already follows.

   Distinguish the two cases by asking the hypervisor for its PID
   list: in-process VMMs report only `process::id()` (the runtime's
   own PID), while out-of-process VMMs report a distinct child. Only
   out-of-process VMMs warrant the fatal error; in-process VMMs
   intentionally leave the runtime (and its vCPU threads) in the
   overhead cgroup, with the sandbox cgroup materialising later when
   container processes are added.
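
The distinction can be sketched as a small predicate. This is an illustration of the check described above, not the body of `setup_after_start_vm`; `is_in_process_vmm` and `check_vcpu_threads` are hypothetical names and the PID and thread lists are passed in as plain slices.

```rust
use std::process;

/// An in-process VMM reports only the runtime's own PID.
fn is_in_process_vmm(vmm_pids: &[u32]) -> bool {
    vmm_pids.len() == 1 && vmm_pids[0] == process::id()
}

/// With an overhead cgroup in use, an empty vCPU thread list is only
/// fatal for out-of-process VMMs, whose vCPU threads should have been
/// moved into the sandbox cgroup.
fn check_vcpu_threads(vcpu_thread_ids: &[u32], vmm_pids: &[u32]) -> Result<(), String> {
    if !vcpu_thread_ids.is_empty() {
        return Ok(());
    }
    if is_in_process_vmm(vmm_pids) {
        // Intentional: the runtime and its vCPU threads stay in the
        // overhead cgroup.
        return Ok(());
    }
    Err("hypervisor cannot be moved to sandbox cgroup".to_string())
}
```

An out-of-process VMM that reports a child PID but no vCPU threads still hits the fatal error, so the existing safeguard for clh, qemu and fc is preserved.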

   Trade-off documented for in-process VMMs: container `limits.cpu`
   no longer constrains guest vCPU host CPU usage when
   sandbox_cgroup_only=false, because the vCPU threads share a cgroup
   with the unconstrained runtime. This is the price of allowing
   large guest RAM under container memory limits, and matches the
   behaviour the user opts into by setting sandbox_cgroup_only=false.
   Operators that need both vCPU limits AND large guest RAM should
   run an out-of-process VMM (clh).

   Verified on the same HGX A100 8-GPU / 6-NVSwitch bench as change
   #1: with both changes in place, sandbox creation completes (shim
   in /kata_overhead/<sid>, ~27GB RSS for 24GB guest RAM, no
   FailedCreatePodSandBox events).

Signed-off-by: Jocelyn Berrendonner <jocelynb@microsoft.com>
Comment thread src/runtime-rs/crates/hypervisor/src/openvmm/inner_hypervisor.rs
Comment thread src/runtime-rs/crates/hypervisor/src/openvmm/inner_hypervisor.rs
@sprt sprt self-requested a review June 17, 2026 21:46