Node-labeller amd.com/gpu.vram undercounts addressable memory on dpx_nps1 (divides by compute-partition count, ignores memory-partition mode) #555

@jitesh-gupta

Summary

The AMD GPU Operator's node-labeller emits amd.com/gpu.vram=144G on an MI355X node configured for DPX compute partitioning and NPS1 memory partitioning. The correct value is 288G: in NPS1, memory is a single NUMA domain shared across all compute partitions, so each DPX partition can address the full VRAM of the physical GPU.

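The label appears to be computed as physical_vram / compute_partition_count without consulting memory_partition_mode. That is incorrect for any NPS1 configuration with more than one compute partition; spx_nps1 only comes out right because its compute partition count is 1.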

Environment

| Component | Value |
| --- | --- |
| GPU | AMD Instinct MI355X OAM (PCI device-id 75a3) |
| Host OS / driver | Linux, AMDGPU driver 6.14.14 |
| Kubernetes | v1.29 |
| AMD GPU Operator | manually installed (~90 days old) |
| node-labeller image | bundled with the above operator release |

Steps to reproduce

  1. Two MI355X nodes on the same cluster, identical hardware (same device-id, same chassis SKU), differing only in partition mode:
    • Node A: dpx_nps1 (DPX compute, NPS1 memory)
    • Node B: spx_nps1 (SPX compute, NPS1 memory)
  2. Dump the amd.com/gpu.* labels on each:
    kubectl get node <node> -o jsonpath='{.metadata.labels}' \
      | jq 'with_entries(select(.key | startswith("amd.com/gpu")))'
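To compare the two nodes directly rather than eyeballing two dumps, the same filtering can be scripted. The following is a minimal sketch, not part of the operator: the helper names `gpu_labels`, `diff_labels`, and `fetch_labels` are hypothetical, and `fetch_labels` assumes `kubectl` is on the PATH with access to the cluster.

```python
import json
import subprocess

GPU_PREFIX = "amd.com/gpu"


def gpu_labels(labels):
    """Keep only the amd.com/gpu* entries of a node's label map."""
    return {k: v for k, v in labels.items() if k.startswith(GPU_PREFIX)}


def diff_labels(labels_a, labels_b):
    """Return {key: (value_on_a, value_on_b)} for GPU labels that differ."""
    a, b = gpu_labels(labels_a), gpu_labels(labels_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


def fetch_labels(node):
    """Fetch a node's labels with kubectl (requires cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "node", node, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)["metadata"]["labels"]
```

Calling `diff_labels(fetch_labels("<node-a>"), fetch_labels("<node-b>"))` returns only the labels that differ between the two nodes, which makes a halved vram value stand out next to the expected cu-count and simd-count differences.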

Expected behavior

The amd.com/gpu.vram label should reflect per-partition addressable VRAM, which depends on the memory partition mode:

| Memory partition | Per-partition addressable VRAM |
| --- | --- |
| nps1 | full physical VRAM (memory is not partitioned) |
| nps2 | physical VRAM ÷ 2 |
| nps4 | physical VRAM ÷ 4 |

For dpx_nps1 on MI355X: vram=288G (same as SPX, since NPS1 memory is shared across DPX compute partitions).
For dpx_nps2: vram=144G.
For dpx_nps4: vram=72G.
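The expected values above follow mechanically from the memory half of the partition mode string. As a small worked sketch (the function name `expected_vram_label` and the integer "G" formatting are assumptions made for illustration, not the labeller's actual code):

```python
# Number of slices each NPS mode cuts the physical VRAM into.
MEMORY_PARTITIONS = {"nps1": 1, "nps2": 2, "nps4": 4}


def expected_vram_label(physical_vram_gb, partition_mode):
    """Expected amd.com/gpu.vram for a '<compute>_<memory>' mode such as 'dpx_nps1'."""
    # The compute half (spx, dpx, ...) is deliberately ignored for VRAM.
    _compute_mode, memory_mode = partition_mode.lower().split("_", 1)
    return f"{physical_vram_gb // MEMORY_PARTITIONS[memory_mode]}G"


for mode in ("spx_nps1", "dpx_nps1", "dpx_nps2", "dpx_nps4"):
    print(mode, expected_vram_label(288, mode))
# spx_nps1 288G
# dpx_nps1 288G
# dpx_nps2 144G
# dpx_nps4 72G
```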

Actual behavior

On the DPX+NPS1 node, vram=144G — i.e., physical VRAM divided by the compute partition count, ignoring the memory partition mode entirely.

Side-by-side diff (same hardware, different partition mode):

| Label | SPX/NPS1 node | DPX/NPS1 node | Halving correct? |
| --- | --- | --- | --- |
| amd.com/gpu.compute-memory-partition | spx_nps1 | dpx_nps1 | |
| amd.com/gpu.cu-count | 256 | 128 | Yes: compute IS partitioned by DPX |
| amd.com/gpu.simd-count | 1024 | 512 | Yes: same reason |
| amd.com/gpu.vram | 288G | 144G | No: NPS1 means memory is NOT partitioned |
| amd.com/gpu.device-id | 75a3 | 75a3 | Same hardware |
| amd.com/gpu.driver-version | 6.14.14 | 6.14.14 | Same driver |

Raw outputs:

SPX/NPS1 node:

{
  "amd.com/gpu.compute-memory-partition": "spx_nps1",
  "amd.com/gpu.compute-partitioning-supported": "true",
  "amd.com/gpu.cu-count": "256",
  "amd.com/gpu.device-id": "75a3",
  "amd.com/gpu.driver-version": "6.14.14",
  "amd.com/gpu.family": "AI",
  "amd.com/gpu.memory-partitioning-supported": "true",
  "amd.com/gpu.product-name": "AMD_Instinct_MI355_OAM",
  "amd.com/gpu.simd-count": "1024",
  "amd.com/gpu.vram": "288G"
}

DPX/NPS1 node:

{
  "amd.com/gpu.compute-memory-partition": "dpx_nps1",
  "amd.com/gpu.compute-partitioning-supported": "true",
  "amd.com/gpu.cu-count": "128",
  "amd.com/gpu.device-id": "75a3",
  "amd.com/gpu.driver-version": "6.14.14",
  "amd.com/gpu.family": "AI",
  "amd.com/gpu.memory-partitioning-supported": "true",
  "amd.com/gpu.product-name": "AMD_Instinct_MI355_OAM",
  "amd.com/gpu.simd-count": "512",
  "amd.com/gpu.vram": "144G"
}

Impact

Anything that consumes amd.com/gpu.vram for capacity planning, scheduler hinting, or workload-fit decisions on DPX+NPS1 nodes will see half of the memory that is actually addressable. The CU and SIMD labels are correct, so compute-fit decisions are unaffected.

Suggested fix

amd.com/gpu.vram should be derived from the memory partition mode, not the compute partition mode. Pseudocode:

physical_vram_per_gpu  = read from amdgpu sysfs
memory_partitions      = {nps1: 1, nps2: 2, nps4: 4}[memory_partition_mode]
vram_label             = physical_vram_per_gpu / memory_partitions

Compute-partition–related labels (cu-count, simd-count) remain divided by the compute partition count, as they are today.
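For illustration, the pseudocode could look roughly like this when reading from sysfs. This is a sketch under stated assumptions, not the labeller's implementation: it assumes the GPU's sysfs device directory exposes `mem_info_vram_total` (in bytes) and `current_memory_partition` (for example `NPS1`), that `mem_info_vram_total` reports the whole physical GPU's VRAM rather than a per-partition share, and that the label is formatted as whole GiB with a "G" suffix. The function name `vram_label` is hypothetical.

```python
from pathlib import Path

MEMORY_PARTITIONS = {"nps1": 1, "nps2": 2, "nps4": 4}
GIB = 1024 ** 3


def vram_label(device_dir):
    """Derive the VRAM label from one GPU's sysfs device directory.

    ASSUMPTIONS: device_dir contains mem_info_vram_total (bytes, whole
    physical GPU) and current_memory_partition (e.g. 'NPS1').
    """
    device_dir = Path(device_dir)
    total_bytes = int((device_dir / "mem_info_vram_total").read_text().strip())
    memory_mode = (device_dir / "current_memory_partition").read_text().strip().lower()
    # Divide by the memory partition count only; compute mode plays no part.
    per_partition_bytes = total_bytes // MEMORY_PARTITIONS[memory_mode]
    return f"{per_partition_bytes // GIB}G"
```

On a real host the argument would be something like `/sys/class/drm/card0/device`; whether the reported total already accounts for partitioning on a given driver version would need to be checked before relying on it.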
