fix: make /metrics scrape-safe and cut page-side query fanout by luohua13 · Pull Request #96 · Project-HAMi/HAMi-WebUI

luohua13 · 2026-05-21T13:41:23Z

When HAMi-WebUI is wired to a large Prometheus / VictoriaMetrics backend, two classes of problems showed up that this change addresses end to end.

/metrics caused vmselect OOM and request timeouts ------------------------------------------------------ The /metrics handler called GenerateMetrics synchronously on every scrape, and GenerateMetrics fanned out one PromQL per device/container metric. At ~150 GPUs that is ~1800 instant queries per scrape, all sharing the incoming request's context. With the chart default of server.http.timeout=1s, that context was cancelled after one second, so most queries were aborted mid-flight; the burst also pushed vmselect into OOM. Prometheus pages and the platform's own monitor dashboards do not behave this way because they issue a few aggregated queries, not a per-entity fanout.

server/internal/exporter: MetricsGenerator now implements the kratos transport.Server interface. A single background goroutine refreshes the in-memory Prometheus registry on a configurable interval (default 30s) using its own bounded context (default 60s), and refuses to overlap cycles so a slow refresh can never stack parallel fanouts onto the upstream.
server/internal/server/http.go: /metrics now only serializes the registry (promhttp.Handler). Scrape latency is O(1) and no longer tied to server.http.timeout or to upstream query latency.
server/cmd/server/main.go: the collector is registered as a kratos server so it starts and drains with the app lifecycle.

Metric series flickered in and out (charts intermittently empty) ------------------------------------------------------------------- GenerateMetrics began each cycle with reset(), which wipes every GaugeVec. That was safe only while generation ran inside the scrape handler (scrape saw the fully repopulated state). Once generation moved to the background, every scrape landing between reset() and the end of repopulation observed partial or empty data, so series flickered at scrape boundaries and UI charts went to 0.

server/internal/exporter/cells.go (new): each Set now goes through a helper that records the (gauge, label tuple) it wrote. After a cycle completes, tuples present in the previous cycle but not the current one are removed via DeleteLabelValues; a cycle cut short by ctx cancellation is dropped so the last known-good snapshot is preserved. Existing series are overwritten in place, so there is no window where a known device is missing.
server/internal/exporter/metrics.go: reset() removed.

GPU / node list endpoints issued an N+1 query fanout ------------------------------------------------------- GetAllGPUs and GetAllNodes queried Prometheus once or twice per entity inside their loops and called StatisticsByDeviceId (which re-lists every container on each call) once per device. At scale the GPU list issued 300+ serial instant queries; this is why the compute-management page spun for ~10s.

server/internal/service/card.go, node.go: fetch the container list once and reuse biz.ContainersStatisticsInfo; batch the per-entity hami_core_size / hami_memory_size lookups into one query each, grouped by device_uuid (cards) or node (nodes), then map by label in memory. Per-endpoint query count drops from O(N) to O(1) while preserving the original avg/sum-by-instance values.

Informer-backed caches were read without locking --------------------------------------------------- The node and pod caches are filled by informer event handlers / a background rebuild, but the list/get readers iterated the maps without holding the mutex. Now that the list endpoints route their hot path through these reads, that race could trigger Go's "concurrent map read and map write" fatal error.

server/internal/data/node.go, pod.go: take r.mutex.RLock in every reader and Lock around the node map swap. Also harden podRepo.FindOne, which previously dereferenced a missing map entry and panicked for an unknown podUID.

Timeouts are no longer hardcoded -----------------------------------

charts/hami-webui: server.http/grpc.timeout default to 60s (the old 1s killed legitimate page-side APIs that take a few seconds), and the new metricsExporter.{interval,timeout} / externalPrometheus.timeout are exposed.
server/internal/conf/conf.proto, config/config.yaml: add the exporter config block with safe defaults (interval 30s, timeout 60s).
packages/web: the axios global timeout is configurable via VUE_APP_REQUEST_TIMEOUT (default 60000) instead of the hardcoded 5000, which was the direct cause of UI timeouts even after the backend was fixed.

hami-robot · 2026-05-21T13:41:34Z

Welcome @luohua13! It looks like this is your first PR to Project-HAMi/HAMi-WebUI 🎉

When HAMi-WebUI is wired to a large Prometheus / VictoriaMetrics backend, two classes of problems showed up that this change addresses end to end. 1. /metrics caused vmselect OOM and request timeouts ------------------------------------------------------ The /metrics handler called GenerateMetrics synchronously on every scrape, and GenerateMetrics fanned out one PromQL per device/container metric. At ~150 GPUs that is ~1800 instant queries per scrape, all sharing the incoming request's context. With the chart default of server.http.timeout=1s, that context was cancelled after one second, so most queries were aborted mid-flight; the burst also pushed vmselect into OOM. Prometheus pages and the platform's own monitor dashboards do not behave this way because they issue a few aggregated queries, not a per-entity fanout. * server/internal/exporter: MetricsGenerator now implements the kratos transport.Server interface. A single background goroutine refreshes the in-memory Prometheus registry on a configurable interval (default 30s) using its own bounded context (default 60s), and refuses to overlap cycles so a slow refresh can never stack parallel fanouts onto the upstream. * server/internal/server/http.go: /metrics now only serializes the registry (promhttp.Handler). Scrape latency is O(1) and no longer tied to server.http.timeout or to upstream query latency. * server/cmd/server/main.go: the collector is registered as a kratos server so it starts and drains with the app lifecycle. 2. Metric series flickered in and out (charts intermittently empty) ------------------------------------------------------------------- GenerateMetrics began each cycle with reset(), which wipes every GaugeVec. That was safe only while generation ran inside the scrape handler (scrape saw the fully repopulated state). Once generation moved to the background, every scrape landing between reset() and the end of repopulation observed partial or empty data, so series flickered at scrape boundaries and UI charts went to 0. * server/internal/exporter/cells.go (new): each Set now goes through a helper that records the (gauge, label tuple) it wrote. After a cycle completes, tuples present in the previous cycle but not the current one are removed via DeleteLabelValues; a cycle cut short by ctx cancellation is dropped so the last known-good snapshot is preserved. Existing series are overwritten in place, so there is no window where a known device is missing. * server/internal/exporter/metrics.go: reset() removed. 3. GPU / node list endpoints issued an N+1 query fanout ------------------------------------------------------- GetAllGPUs and GetAllNodes queried Prometheus once or twice per entity inside their loops and called StatisticsByDeviceId (which re-lists every container on each call) once per device. At scale the GPU list issued 300+ serial instant queries; this is why the compute-management page spun for ~10s. * server/internal/service/card.go, node.go: fetch the container list once and reuse biz.ContainersStatisticsInfo; batch the per-entity hami_core_size / hami_memory_size lookups into one query each, grouped by device_uuid (cards) or node (nodes), then map by label in memory. Per-endpoint query count drops from O(N) to O(1) while preserving the original avg/sum-by-instance values. 4. Informer-backed caches were read without locking --------------------------------------------------- The node and pod caches are filled by informer event handlers / a background rebuild, but the list/get readers iterated the maps without holding the mutex. Now that the list endpoints route their hot path through these reads, that race could trigger Go's "concurrent map read and map write" fatal error. * server/internal/data/node.go, pod.go: take r.mutex.RLock in every reader and Lock around the node map swap. Also harden podRepo.FindOne, which previously dereferenced a missing map entry and panicked for an unknown podUID. 5. Timeouts are no longer hardcoded ----------------------------------- * charts/hami-webui: server.http/grpc.timeout default to 60s (the old 1s killed legitimate page-side APIs that take a few seconds), and the new metricsExporter.{interval,timeout} / externalPrometheus.timeout are exposed. * server/internal/conf/conf.proto, config/config.yaml: add the exporter config block with safe defaults (interval 30s, timeout 60s). * packages/web: the axios global timeout is configurable via VUE_APP_REQUEST_TIMEOUT (default 60000) instead of the hardcoded 5000, which was the direct cause of UI timeouts even after the backend was fixed. 6. Frontend got "connection refused" before the backend was listening ---------------------------------------------------------------------- The backend only starts listening after its informer caches finish syncing, which can take seconds on a large cluster. The frontend BFF comes up first, so early UI requests proxied to the not-yet-listening backend and surfaced a wave of connection-refused popups until the backend caught up. * server/internal/server/http.go: add a /readyz endpoint that returns 200 once the server is accepting requests (reaching the handler implies informer caches have synced). * charts/hami-webui: add a configurable readinessProbe on the backend container hitting /readyz, so the pod stays out of the Service until the backend can serve. Deliberately gated on "listening", not on the metrics collector's first cycle, to avoid coupling UI availability to Prometheus. Signed-off-by: luohua13 <jcwang@alauda.io>

luohua13 · 2026-05-28T06:52:47Z

@archlitchi @FouoF When will this PR be merged?

archlitchi

/lgtm

hami-robot · 2026-06-01T02:48:03Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: archlitchi, FouoF, luohua13

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hami-robot Bot added the dco-signoff: yes label May 21, 2026

hami-robot Bot added the size/XL label May 21, 2026

luohua13 force-pushed the fix/metrics-timeout-oom-and-list-perf branch from e5757ec to 9ab66ed Compare May 22, 2026 08:09

FouoF approved these changes May 26, 2026

View reviewed changes

archlitchi self-requested a review June 1, 2026 02:47

archlitchi approved these changes Jun 1, 2026

View reviewed changes

hami-robot Bot assigned archlitchi Jun 1, 2026

hami-robot Bot added the lgtm label Jun 1, 2026

archlitchi merged commit 08bd9b4 into Project-HAMi:main Jun 1, 2026
7 of 8 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: make /metrics scrape-safe and cut page-side query fanout#96

fix: make /metrics scrape-safe and cut page-side query fanout#96
archlitchi merged 1 commit into
Project-HAMi:mainfrom
luohua13:fix/metrics-timeout-oom-and-list-perf

luohua13 commented May 21, 2026

Uh oh!

hami-robot Bot commented May 21, 2026

Uh oh!

luohua13 commented May 28, 2026 •

edited

Loading

Uh oh!

archlitchi left a comment

Uh oh!

hami-robot Bot commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

luohua13 commented May 21, 2026

Uh oh!

hami-robot Bot commented May 21, 2026

Uh oh!

luohua13 commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

archlitchi left a comment

Choose a reason for hiding this comment

Uh oh!

hami-robot Bot commented Jun 1, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

luohua13 commented May 28, 2026 •

edited

Loading