fix: make /metrics scrape-safe and cut page-side query fanout#96
Merged
archlitchi merged 1 commit intoJun 1, 2026
Merged
Conversation
|
Welcome @luohua13! It looks like this is your first PR to Project-HAMi/HAMi-WebUI 🎉 |
When HAMi-WebUI is wired to a large Prometheus / VictoriaMetrics backend, two
classes of problems showed up that this change addresses end to end.
1. /metrics caused vmselect OOM and request timeouts
------------------------------------------------------
The /metrics handler called GenerateMetrics synchronously on every scrape, and
GenerateMetrics fanned out one PromQL per device/container metric. At ~150 GPUs
that is ~1800 instant queries per scrape, all sharing the incoming request's
context. With the chart default of server.http.timeout=1s, that context was
cancelled after one second, so most queries were aborted mid-flight; the burst
also pushed vmselect into OOM. Prometheus pages and the platform's own monitor
dashboards do not behave this way because they issue a few aggregated queries,
not a per-entity fanout.
* server/internal/exporter: MetricsGenerator now implements the kratos
transport.Server interface. A single background goroutine refreshes the
in-memory Prometheus registry on a configurable interval (default 30s)
using its own bounded context (default 60s), and refuses to overlap cycles
so a slow refresh can never stack parallel fanouts onto the upstream.
* server/internal/server/http.go: /metrics now only serializes the registry
(promhttp.Handler). Scrape latency is O(1) and no longer tied to
server.http.timeout or to upstream query latency.
* server/cmd/server/main.go: the collector is registered as a kratos server
so it starts and drains with the app lifecycle.
2. Metric series flickered in and out (charts intermittently empty)
-------------------------------------------------------------------
GenerateMetrics began each cycle with reset(), which wipes every GaugeVec.
That was safe only while generation ran inside the scrape handler (scrape saw
the fully repopulated state). Once generation moved to the background, every
scrape landing between reset() and the end of repopulation observed partial or
empty data, so series flickered at scrape boundaries and UI charts went to 0.
* server/internal/exporter/cells.go (new): each Set now goes through a helper
that records the (gauge, label tuple) it wrote. After a cycle completes,
tuples present in the previous cycle but not the current one are removed via
DeleteLabelValues; a cycle cut short by ctx cancellation is dropped so the
last known-good snapshot is preserved. Existing series are overwritten in
place, so there is no window where a known device is missing.
* server/internal/exporter/metrics.go: reset() removed.
3. GPU / node list endpoints issued an N+1 query fanout
-------------------------------------------------------
GetAllGPUs and GetAllNodes queried Prometheus once or twice per entity inside
their loops and called StatisticsByDeviceId (which re-lists every container on
each call) once per device. At scale the GPU list issued 300+ serial instant
queries; this is why the compute-management page spun for ~10s.
* server/internal/service/card.go, node.go: fetch the container list once and
reuse biz.ContainersStatisticsInfo; batch the per-entity hami_core_size /
hami_memory_size lookups into one query each, grouped by device_uuid (cards)
or node (nodes), then map by label in memory. Per-endpoint query count drops
from O(N) to O(1) while preserving the original avg/sum-by-instance values.
4. Informer-backed caches were read without locking
---------------------------------------------------
The node and pod caches are filled by informer event handlers / a background
rebuild, but the list/get readers iterated the maps without holding the mutex.
Now that the list endpoints route their hot path through these reads, that race
could trigger Go's "concurrent map read and map write" fatal error.
* server/internal/data/node.go, pod.go: take r.mutex.RLock in every reader and
Lock around the node map swap. Also harden podRepo.FindOne, which previously
dereferenced a missing map entry and panicked for an unknown podUID.
5. Timeouts are no longer hardcoded
-----------------------------------
* charts/hami-webui: server.http/grpc.timeout default to 60s (the old 1s
killed legitimate page-side APIs that take a few seconds), and the new
metricsExporter.{interval,timeout} / externalPrometheus.timeout are exposed.
* server/internal/conf/conf.proto, config/config.yaml: add the exporter config
block with safe defaults (interval 30s, timeout 60s).
* packages/web: the axios global timeout is configurable via
VUE_APP_REQUEST_TIMEOUT (default 60000) instead of the hardcoded 5000, which
was the direct cause of UI timeouts even after the backend was fixed.
6. Frontend got "connection refused" before the backend was listening
----------------------------------------------------------------------
The backend only starts listening after its informer caches finish syncing,
which can take seconds on a large cluster. The frontend BFF comes up first, so
early UI requests proxied to the not-yet-listening backend and surfaced a wave
of connection-refused popups until the backend caught up.
* server/internal/server/http.go: add a /readyz endpoint that returns 200 once
the server is accepting requests (reaching the handler implies informer
caches have synced).
* charts/hami-webui: add a configurable readinessProbe on the backend
container hitting /readyz, so the pod stays out of the Service until the
backend can serve. Deliberately gated on "listening", not on the metrics
collector's first cycle, to avoid coupling UI availability to Prometheus.
Signed-off-by: luohua13 <jcwang@alauda.io>
e5757ec to
9ab66ed
Compare
FouoF
approved these changes
May 26, 2026
Contributor
Author
|
@archlitchi @FouoF When will this PR be merged? |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: archlitchi, FouoF, luohua13 The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
When HAMi-WebUI is wired to a large Prometheus / VictoriaMetrics backend, two classes of problems showed up that this change addresses end to end.