Skip to content

fix: make /metrics scrape-safe and cut page-side query fanout#96

Merged
archlitchi merged 1 commit into
Project-HAMi:mainfrom
luohua13:fix/metrics-timeout-oom-and-list-perf
Jun 1, 2026
Merged

fix: make /metrics scrape-safe and cut page-side query fanout#96
archlitchi merged 1 commit into
Project-HAMi:mainfrom
luohua13:fix/metrics-timeout-oom-and-list-perf

Conversation

@luohua13

Copy link
Copy Markdown
Contributor

When HAMi-WebUI is wired to a large Prometheus / VictoriaMetrics backend, two classes of problems showed up that this change addresses end to end.

  1. /metrics caused vmselect OOM and request timeouts ------------------------------------------------------ The /metrics handler called GenerateMetrics synchronously on every scrape, and GenerateMetrics fanned out one PromQL per device/container metric. At ~150 GPUs that is ~1800 instant queries per scrape, all sharing the incoming request's context. With the chart default of server.http.timeout=1s, that context was cancelled after one second, so most queries were aborted mid-flight; the burst also pushed vmselect into OOM. Prometheus pages and the platform's own monitor dashboards do not behave this way because they issue a few aggregated queries, not a per-entity fanout.
  • server/internal/exporter: MetricsGenerator now implements the kratos transport.Server interface. A single background goroutine refreshes the in-memory Prometheus registry on a configurable interval (default 30s) using its own bounded context (default 60s), and refuses to overlap cycles so a slow refresh can never stack parallel fanouts onto the upstream.
  • server/internal/server/http.go: /metrics now only serializes the registry (promhttp.Handler). Scrape latency is O(1) and no longer tied to server.http.timeout or to upstream query latency.
  • server/cmd/server/main.go: the collector is registered as a kratos server so it starts and drains with the app lifecycle.
  1. Metric series flickered in and out (charts intermittently empty) ------------------------------------------------------------------- GenerateMetrics began each cycle with reset(), which wipes every GaugeVec. That was safe only while generation ran inside the scrape handler (scrape saw the fully repopulated state). Once generation moved to the background, every scrape landing between reset() and the end of repopulation observed partial or empty data, so series flickered at scrape boundaries and UI charts went to 0.
  • server/internal/exporter/cells.go (new): each Set now goes through a helper that records the (gauge, label tuple) it wrote. After a cycle completes, tuples present in the previous cycle but not the current one are removed via DeleteLabelValues; a cycle cut short by ctx cancellation is dropped so the last known-good snapshot is preserved. Existing series are overwritten in place, so there is no window where a known device is missing.
  • server/internal/exporter/metrics.go: reset() removed.
  1. GPU / node list endpoints issued an N+1 query fanout ------------------------------------------------------- GetAllGPUs and GetAllNodes queried Prometheus once or twice per entity inside their loops and called StatisticsByDeviceId (which re-lists every container on each call) once per device. At scale the GPU list issued 300+ serial instant queries; this is why the compute-management page spun for ~10s.
  • server/internal/service/card.go, node.go: fetch the container list once and reuse biz.ContainersStatisticsInfo; batch the per-entity hami_core_size / hami_memory_size lookups into one query each, grouped by device_uuid (cards) or node (nodes), then map by label in memory. Per-endpoint query count drops from O(N) to O(1) while preserving the original avg/sum-by-instance values.
  1. Informer-backed caches were read without locking --------------------------------------------------- The node and pod caches are filled by informer event handlers / a background rebuild, but the list/get readers iterated the maps without holding the mutex. Now that the list endpoints route their hot path through these reads, that race could trigger Go's "concurrent map read and map write" fatal error.
  • server/internal/data/node.go, pod.go: take r.mutex.RLock in every reader and Lock around the node map swap. Also harden podRepo.FindOne, which previously dereferenced a missing map entry and panicked for an unknown podUID.
  1. Timeouts are no longer hardcoded -----------------------------------
  • charts/hami-webui: server.http/grpc.timeout default to 60s (the old 1s killed legitimate page-side APIs that take a few seconds), and the new metricsExporter.{interval,timeout} / externalPrometheus.timeout are exposed.
  • server/internal/conf/conf.proto, config/config.yaml: add the exporter config block with safe defaults (interval 30s, timeout 60s).
  • packages/web: the axios global timeout is configurable via VUE_APP_REQUEST_TIMEOUT (default 60000) instead of the hardcoded 5000, which was the direct cause of UI timeouts even after the backend was fixed.

@hami-robot

hami-robot Bot commented May 21, 2026

Copy link
Copy Markdown

Welcome @luohua13! It looks like this is your first PR to Project-HAMi/HAMi-WebUI 🎉

@hami-robot hami-robot Bot added the size/XL label May 21, 2026
When HAMi-WebUI is wired to a large Prometheus / VictoriaMetrics backend, two
classes of problems showed up that this change addresses end to end.

1. /metrics caused vmselect OOM and request timeouts
------------------------------------------------------
The /metrics handler called GenerateMetrics synchronously on every scrape, and
GenerateMetrics fanned out one PromQL per device/container metric. At ~150 GPUs
that is ~1800 instant queries per scrape, all sharing the incoming request's
context. With the chart default of server.http.timeout=1s, that context was
cancelled after one second, so most queries were aborted mid-flight; the burst
also pushed vmselect into OOM. Prometheus pages and the platform's own monitor
dashboards do not behave this way because they issue a few aggregated queries,
not a per-entity fanout.

  * server/internal/exporter: MetricsGenerator now implements the kratos
    transport.Server interface. A single background goroutine refreshes the
    in-memory Prometheus registry on a configurable interval (default 30s)
    using its own bounded context (default 60s), and refuses to overlap cycles
    so a slow refresh can never stack parallel fanouts onto the upstream.
  * server/internal/server/http.go: /metrics now only serializes the registry
    (promhttp.Handler). Scrape latency is O(1) and no longer tied to
    server.http.timeout or to upstream query latency.
  * server/cmd/server/main.go: the collector is registered as a kratos server
    so it starts and drains with the app lifecycle.

2. Metric series flickered in and out (charts intermittently empty)
-------------------------------------------------------------------
GenerateMetrics began each cycle with reset(), which wipes every GaugeVec.
That was safe only while generation ran inside the scrape handler (scrape saw
the fully repopulated state). Once generation moved to the background, every
scrape landing between reset() and the end of repopulation observed partial or
empty data, so series flickered at scrape boundaries and UI charts went to 0.

  * server/internal/exporter/cells.go (new): each Set now goes through a helper
    that records the (gauge, label tuple) it wrote. After a cycle completes,
    tuples present in the previous cycle but not the current one are removed via
    DeleteLabelValues; a cycle cut short by ctx cancellation is dropped so the
    last known-good snapshot is preserved. Existing series are overwritten in
    place, so there is no window where a known device is missing.
  * server/internal/exporter/metrics.go: reset() removed.

3. GPU / node list endpoints issued an N+1 query fanout
-------------------------------------------------------
GetAllGPUs and GetAllNodes queried Prometheus once or twice per entity inside
their loops and called StatisticsByDeviceId (which re-lists every container on
each call) once per device. At scale the GPU list issued 300+ serial instant
queries; this is why the compute-management page spun for ~10s.

  * server/internal/service/card.go, node.go: fetch the container list once and
    reuse biz.ContainersStatisticsInfo; batch the per-entity hami_core_size /
    hami_memory_size lookups into one query each, grouped by device_uuid (cards)
    or node (nodes), then map by label in memory. Per-endpoint query count drops
    from O(N) to O(1) while preserving the original avg/sum-by-instance values.

4. Informer-backed caches were read without locking
---------------------------------------------------
The node and pod caches are filled by informer event handlers / a background
rebuild, but the list/get readers iterated the maps without holding the mutex.
Now that the list endpoints route their hot path through these reads, that race
could trigger Go's "concurrent map read and map write" fatal error.

  * server/internal/data/node.go, pod.go: take r.mutex.RLock in every reader and
    Lock around the node map swap. Also harden podRepo.FindOne, which previously
    dereferenced a missing map entry and panicked for an unknown podUID.

5. Timeouts are no longer hardcoded
-----------------------------------
  * charts/hami-webui: server.http/grpc.timeout default to 60s (the old 1s
    killed legitimate page-side APIs that take a few seconds), and the new
    metricsExporter.{interval,timeout} / externalPrometheus.timeout are exposed.
  * server/internal/conf/conf.proto, config/config.yaml: add the exporter config
    block with safe defaults (interval 30s, timeout 60s).
  * packages/web: the axios global timeout is configurable via
    VUE_APP_REQUEST_TIMEOUT (default 60000) instead of the hardcoded 5000, which
    was the direct cause of UI timeouts even after the backend was fixed.

6. Frontend got "connection refused" before the backend was listening
----------------------------------------------------------------------
The backend only starts listening after its informer caches finish syncing,
which can take seconds on a large cluster. The frontend BFF comes up first, so
early UI requests proxied to the not-yet-listening backend and surfaced a wave
of connection-refused popups until the backend caught up.

  * server/internal/server/http.go: add a /readyz endpoint that returns 200 once
    the server is accepting requests (reaching the handler implies informer
    caches have synced).
  * charts/hami-webui: add a configurable readinessProbe on the backend
    container hitting /readyz, so the pod stays out of the Service until the
    backend can serve. Deliberately gated on "listening", not on the metrics
    collector's first cycle, to avoid coupling UI availability to Prometheus.

Signed-off-by: luohua13 <jcwang@alauda.io>
@luohua13 luohua13 force-pushed the fix/metrics-timeout-oom-and-list-perf branch from e5757ec to 9ab66ed Compare May 22, 2026 08:09
@luohua13

luohua13 commented May 28, 2026

Copy link
Copy Markdown
Contributor Author

@archlitchi @FouoF When will this PR be merged?

@archlitchi archlitchi self-requested a review June 1, 2026 02:47

@archlitchi archlitchi left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@hami-robot

hami-robot Bot commented Jun 1, 2026

Copy link
Copy Markdown

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: archlitchi, FouoF, luohua13

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@archlitchi archlitchi merged commit 08bd9b4 into Project-HAMi:main Jun 1, 2026
7 of 8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants