
[Bug] Ascend310P Pod triggers an index-out-of-range panic in fetchContainerInfo; existing PR #35 does not cover the Ascend path #94

@cygnushan

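Problem description

The HAMi-WebUI backend (hami-webui-be-oss) panics when it scans a Pod that uses an Ascend310P device, and then enters CrashLoopBackOff.

A similar symptom in the NVIDIA scenario was already fixed by PR #35 (see #45 / #52). The root cause here is different: the AscendGPUDevice / Ascend310PGPUDevice branch lacks the container-count truncation, so this is not the same code as the NVIDIA path that was fixed.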

Environment

  • HAMi-WebUI version: v1.2.0 (and earlier)
  • Device type: Ascend310P
  • Pods present in the cluster: ascend-pytorch, ascend-soft-slice-pod, gpu-pod, gpu-pod-5 (on nodes node-3 / node-5)

Reproduction log

INFO ts=2026-05-19T14:14:10+08:00 caller=data/pod.go:96 msg=Pod added: Name: gpu-pod-5, UID: 8c44a9bb-2e8d-48da-a566-5fbfbfd2f822, Namespace: default, NodeID: node-5
INFO ts=2026-05-19T14:14:10+08:00 caller=util/util.go:396 msg=Decoded pod annos: poddevices map[Ascend310P:[[{0 D7E96E64-2060C1F1-E8E618E4-AED8030A-94003019 Ascend310P 21527 100 }]]]
E0519 14:14:11.496133       1 runtime.go:79] Observed a panic: runtime.boundsError{x:1, y:1, signed:true, code:0x0} (runtime error: index out of range [1] with length 1)
panic({0x173f260?, 0x4000d727c8?})
    /usr/local/go/src/runtime/panic.go:783 +0x120
vgpu/internal/data.(*podRepo).fetchContainerInfo(0x4000020800, 0x4000ad4908)
    /src/internal/data/pod.go:148 +0x7a8
vgpu/internal/data.(*podRepo).addPod(0x4000020800, 0x4000ad4908, {0x40006e57ca, 0x6}, 0x4000d81bc0)
    /src/internal/data/pod.go:93 +0x74
vgpu/internal/data.(*podRepo).onAddPod(0x4000020800, {0x183fbc0?, 0x4000ad4908?})
    /src/internal/data/pod.go:70 +0x198
k8s.io/client-go/tools/cache.ResourceEventHandlerFuncs.OnAdd(...)
    /go/pkg/mod/k8s.io/client-go@v0.30.1/tools/cache/controller.go:239
k8s.io/client-go/tools/cache.(*processorListener).run.func1()
    /go/pkg/mod/k8s.io/client-go@v0.30.1/tools/cache/shared_informer.go:978 +0x94
k8s.io/apimachinery/pkg/util/wait.BackoffUntil(...)
    /go/pkg/mod/k8s.io/apimachinery@v0.30.1/pkg/util/wait/backoff.go:227 +0x90

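Root cause analysis

1. DecodePodDevices lacks container truncation on the Ascend path

In the DecodePodDevices function in server/internal/provider/util/util.go, the NVIDIA / Hygon / Metax branches all have the truncation check if i >= len(pod.Spec.Containers) { break }, but the AscendGPUDevice / Ascend310PGPUDevice branch does not: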

// server/internal/provider/util/util.go:319-329
case AscendGPUDevice, Ascend310PGPUDevice:
    for _, s := range strings.Split(str, OnePodMultiContainerSplitSymbol) {
        cd, err := DecodeNpuContainerDevices(s)
        ...
        pd[devType] = append(pd[devType], cd)   // no upper-bound check
    }

When the annotation splits into more segments than there are entries in pod.Spec.Containers, pd[devType] (a PodSingleDevice, i.e. []ContainerDevices) ends up longer than the actual number of containers.
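The truncation described above could be sketched roughly as follows. This is an illustration only, not the project's code: ContainerDevices is simplified to a slice of raw strings, splitSymbol is a placeholder for OnePodMultiContainerSplitSymbol (whose real value is not shown in this issue), and decodeTruncated is a hypothetical helper standing in for the Ascend branch of DecodePodDevices.

```go
package main

import (
	"fmt"
	"strings"
)

// ContainerDevices is a simplified stand-in for the project's type;
// the real one holds decoded device structs, not raw strings.
type ContainerDevices []string

// splitSymbol is an assumed placeholder for OnePodMultiContainerSplitSymbol.
const splitSymbol = ";"

// decodeTruncated (hypothetical) splits an annotation value into
// per-container segments and stops after containerCount segments.
func decodeTruncated(anno string, containerCount int) []ContainerDevices {
	out := make([]ContainerDevices, 0, containerCount)
	for i, s := range strings.Split(anno, splitSymbol) {
		if i >= containerCount {
			break // same guard the NVIDIA / Hygon / Metax branches use
		}
		out = append(out, ContainerDevices{s})
	}
	return out
}

func main() {
	// Three segments in the annotation, but only two containers in the spec.
	got := decodeTruncated("dev-a;dev-b;dev-c", 2)
	fmt.Println(len(got)) // prints 2
}
```

The guard only caps the slice length. It does not pad the slice when the annotation yields fewer segments than there are containers, so a caller that indexes by container position still needs its own check.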

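2. fetchContainerInfo indexes out of bounds while iterating

server/internal/data/pod.go:145-148 indexes bizContainerDevices by each container's position in the Pod spec. When the slice returned by DecodePodDevices does not match the container count in length, the access panics with index out of range [1] with length 1.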
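A defensive lookup on the consumer side might look like the sketch below. It is not the actual fetchContainerInfo code: devicesFor is a hypothetical helper, ContainerDevices is simplified to strings, and the two-container Pod with a single decoded entry is an assumed scenario chosen to match the index 1 / length 1 values in the panic message.

```go
package main

import "fmt"

// ContainerDevices is a simplified stand-in for the project's type.
type ContainerDevices []string

// devicesFor (hypothetical) returns the devices decoded for container
// index i, and false when the decoded slice has no entry at that index.
func devicesFor(decoded []ContainerDevices, i int) (ContainerDevices, bool) {
	if i < 0 || i >= len(decoded) {
		return nil, false
	}
	return decoded[i], true
}

func main() {
	// Assumed scenario: two containers in the spec, one decoded entry.
	containers := []string{"main", "sidecar"}
	decoded := []ContainerDevices{{"npu-0"}}

	for i, name := range containers {
		devs, ok := devicesFor(decoded, i)
		if !ok {
			// Skip instead of panicking on decoded[i].
			fmt.Printf("%s: no decoded devices, skipping\n", name)
			continue
		}
		fmt.Printf("%s: %v\n", name, devs)
	}
}
```

Checking the index at the point of use keeps the informer callback from crashing on a length mismatch in either direction, whichever device branch produced the slice.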

Related issues
