Summary
The AMD GPU Operator's node-labeller emits amd.com/gpu.vram=144G on an MI355X node configured for DPX compute partitioning + NPS1 memory partitioning. The correct value is 288G: in NPS1, memory is a single NUMA domain shared across all compute partitions, so each DPX partition can address the full VRAM of the physical GPU.
The label appears to be computed as physical_vram / compute_partition_count without consulting memory_partition_mode, which gives the wrong value for any NPS1 configuration with more than one compute partition.
Environment
| Component | Details |
|---|---|
| GPU | AMD Instinct MI355X OAM (PCI device-id 75a3) |
| Host OS / driver | Linux, AMDGPU driver 6.14.14 |
| Kubernetes | v1.29 |
| AMD GPU Operator | manually installed (~90 days old) |
| node-labeller image | bundled with the above operator release |
Steps to reproduce
- Two MI355X nodes on the same cluster, identical hardware (same device-id, same chassis SKU), differing only in partition mode:
  - Node A: dpx_nps1 (DPX compute, NPS1 memory)
  - Node B: spx_nps1 (SPX compute, NPS1 memory)
- Dump the amd.com/gpu.* labels on each:
kubectl get node <node> -o jsonpath='{.metadata.labels}' \
| jq '. | to_entries | map(select(.key | startswith("amd.com/gpu"))) | from_entries'
Expected behavior
The amd.com/gpu.vram label should reflect per-partition addressable VRAM, which depends on the memory partition mode:
| memory partition | per-partition addressable VRAM |
|---|---|
| nps1 | full physical VRAM (memory is not partitioned) |
| nps2 | physical VRAM ÷ 2 |
| nps4 | physical VRAM ÷ 4 |
For dpx_nps1 on MI355X: vram=288G (same as SPX, since NPS1 memory is shared across DPX compute partitions).
For dpx_nps2: vram=144G.
For dpx_nps4: vram=72G.
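The expected values above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: `expected_vram_label` and `MEMORY_PARTITIONS` are hypothetical names, not anything taken from the node-labeller source, and the 288 GB figure is the per-GPU VRAM reported on the SPX node.

```python
# Hypothetical helper, for illustration only (not the node-labeller's code).
# Number of memory partitions implied by each NPS mode, per the table above.
MEMORY_PARTITIONS = {"nps1": 1, "nps2": 2, "nps4": 4}


def expected_vram_label(physical_vram_gb: int, partition_mode: str) -> str:
    """Return the expected amd.com/gpu.vram value for a mode such as 'dpx_nps1'."""
    # The combined label is '<compute mode>_<memory mode>'; only the
    # memory half matters for VRAM.
    _compute_mode, memory_mode = partition_mode.lower().split("_", 1)
    return f"{physical_vram_gb // MEMORY_PARTITIONS[memory_mode]}G"


if __name__ == "__main__":
    for mode in ("spx_nps1", "dpx_nps1", "dpx_nps2", "dpx_nps4"):
        print(mode, expected_vram_label(288, mode))
```

Note that the compute half of the mode string is parsed but deliberately unused.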
Actual behavior
On the DPX+NPS1 node, vram=144G — i.e., physical VRAM divided by the compute partition count, ignoring the memory partition mode entirely.
Side-by-side diff (same hardware, different partition mode):
| Label | SPX/NPS1 node | DPX/NPS1 node | Halving correct? |
|---|---|---|---|
| amd.com/gpu.compute-memory-partition | spx_nps1 | dpx_nps1 | n/a |
| amd.com/gpu.cu-count | 256 | 128 | Yes: compute IS partitioned by DPX |
| amd.com/gpu.simd-count | 1024 | 512 | Yes: same reason |
| amd.com/gpu.vram | 288G | 144G | No: NPS1 means memory is NOT partitioned |
| amd.com/gpu.device-id | 75a3 | 75a3 | Same hardware |
| amd.com/gpu.driver-version | 6.14.14 | 6.14.14 | Same driver |
Raw outputs:
SPX/NPS1 node:
{
"amd.com/gpu.compute-memory-partition": "spx_nps1",
"amd.com/gpu.compute-partitioning-supported": "true",
"amd.com/gpu.cu-count": "256",
"amd.com/gpu.device-id": "75a3",
"amd.com/gpu.driver-version": "6.14.14",
"amd.com/gpu.family": "AI",
"amd.com/gpu.memory-partitioning-supported": "true",
"amd.com/gpu.product-name": "AMD_Instinct_MI355_OAM",
"amd.com/gpu.simd-count": "1024",
"amd.com/gpu.vram": "288G"
}
DPX/NPS1 node:
{
"amd.com/gpu.compute-memory-partition": "dpx_nps1",
"amd.com/gpu.compute-partitioning-supported": "true",
"amd.com/gpu.cu-count": "128",
"amd.com/gpu.device-id": "75a3",
"amd.com/gpu.driver-version": "6.14.14",
"amd.com/gpu.family": "AI",
"amd.com/gpu.memory-partitioning-supported": "true",
"amd.com/gpu.product-name": "AMD_Instinct_MI355_OAM",
"amd.com/gpu.simd-count": "512",
"amd.com/gpu.vram": "144G"
}
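For anyone who wants to check the comparison mechanically, the sketch below diffs the two raw outputs and compares the VRAM ratio against the partition counts. The helper names are hypothetical, and the DPX count of 2 is an assumption inferred from the halved CU and SIMD counts.

```python
# Labels copied from the raw outputs above.
SPX_LABELS = {
    "amd.com/gpu.compute-memory-partition": "spx_nps1",
    "amd.com/gpu.compute-partitioning-supported": "true",
    "amd.com/gpu.cu-count": "256",
    "amd.com/gpu.device-id": "75a3",
    "amd.com/gpu.driver-version": "6.14.14",
    "amd.com/gpu.family": "AI",
    "amd.com/gpu.memory-partitioning-supported": "true",
    "amd.com/gpu.product-name": "AMD_Instinct_MI355_OAM",
    "amd.com/gpu.simd-count": "1024",
    "amd.com/gpu.vram": "288G",
}
DPX_LABELS = {
    **SPX_LABELS,
    "amd.com/gpu.compute-memory-partition": "dpx_nps1",
    "amd.com/gpu.cu-count": "128",
    "amd.com/gpu.simd-count": "512",
    "amd.com/gpu.vram": "144G",
}

# Assumed partition counts (hypothetical tables, not read from the driver).
COMPUTE_PARTITIONS = {"spx": 1, "dpx": 2}
MEMORY_PARTITIONS = {"nps1": 1, "nps2": 2, "nps4": 4}


def differing_labels(a: dict, b: dict) -> dict:
    """Return {label: (value_in_a, value_in_b)} for labels whose values differ."""
    return {k: (a[k], b[k]) for k in sorted(a.keys() & b.keys()) if a[k] != b[k]}


def vram_gb(label_value: str) -> int:
    """Parse a label value such as '288G' into an integer number of GB."""
    return int(label_value.rstrip("G"))


observed_ratio = vram_gb(SPX_LABELS["amd.com/gpu.vram"]) / vram_gb(DPX_LABELS["amd.com/gpu.vram"])

if __name__ == "__main__":
    for label, (spx_value, dpx_value) in differing_labels(SPX_LABELS, DPX_LABELS).items():
        print(f"{label}: {spx_value} -> {dpx_value}")
    print("observed VRAM ratio:", observed_ratio)
    print("matches compute partition count:", observed_ratio == COMPUTE_PARTITIONS["dpx"])
    print("matches memory partition count:", observed_ratio == MEMORY_PARTITIONS["nps1"])
```

The observed ratio tracks the compute partition count rather than the memory partition count, which is what the hypothesis in the Summary predicts.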
Impact
Anything that consumes amd.com/gpu.vram for capacity planning, scheduler hinting, or workload-fit decisions on DPX+NPS1 nodes will see half of actual addressable memory. CU/SIMD labels are correct, so compute-fit decisions are unaffected.
Suggested fix
amd.com/gpu.vram should be derived from the memory partition mode, not the compute partition mode. Pseudocode:
physical_vram_per_gpu = read from amdgpu sysfs
memory_partitions = {nps1: 1, nps2: 2, nps4: 4}[memory_partition_mode]
vram_label = physical_vram_per_gpu / memory_partitions
Compute-partition–related labels (cu-count, simd-count) remain divided by the compute partition count, as they are today.
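A slightly fuller sketch of the proposed split, in Python, with compute labels divided by the compute partition count and VRAM divided by the memory partition count. Everything here is illustrative: `PhysicalGpu`, `derive_labels`, and the two lookup tables are hypothetical names, the physical totals are taken from the SPX/NPS1 node above, and the real labeller would read them from the driver rather than hard-code them.

```python
from dataclasses import dataclass

# Assumed partition counts (hypothetical tables for illustration).
COMPUTE_PARTITIONS = {"spx": 1, "dpx": 2}
MEMORY_PARTITIONS = {"nps1": 1, "nps2": 2, "nps4": 4}


@dataclass(frozen=True)
class PhysicalGpu:
    """Whole-GPU totals, before any partitioning is applied."""
    cu_count: int
    simd_count: int
    vram_gb: int


def derive_labels(gpu: PhysicalGpu, compute_mode: str, memory_mode: str) -> dict[str, str]:
    """Derive per-partition labels from physical totals and both partition modes."""
    compute_parts = COMPUTE_PARTITIONS[compute_mode]
    memory_parts = MEMORY_PARTITIONS[memory_mode]
    return {
        "amd.com/gpu.compute-memory-partition": f"{compute_mode}_{memory_mode}",
        # Compute resources are split by the compute partition mode.
        "amd.com/gpu.cu-count": str(gpu.cu_count // compute_parts),
        "amd.com/gpu.simd-count": str(gpu.simd_count // compute_parts),
        # VRAM is split by the memory partition mode only.
        "amd.com/gpu.vram": f"{gpu.vram_gb // memory_parts}G",
    }


if __name__ == "__main__":
    mi355x = PhysicalGpu(cu_count=256, simd_count=1024, vram_gb=288)
    for key, value in derive_labels(mi355x, "dpx", "nps1").items():
        print(f"{key}={value}")
```

With these inputs, the CU and SIMD labels stay at their current (halved) DPX values while the VRAM label comes out as 288G.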