Skip to content

Rust load generator + soak mode; FE-worker gribjump/metadata; worker/broker/frontend updates#200

Merged
jameshawkes merged 28 commits into
upstreamfrom
feature/rust-loadgen
Jun 24, 2026
Merged

Rust load generator + soak mode; FE-worker gribjump/metadata; worker/broker/frontend updates#200
jameshawkes merged 28 commits into
upstreamfrom
feature/rust-loadgen

Conversation

@jameshawkes

Copy link
Copy Markdown
Contributor

Changes

Load generator

  • Add Rust load generator crate; document and package it.
  • Emit live progress snapshots.
  • Add peak windowed throughput metric.
  • Add duration-based soak mode.

FE worker

  • Bump gribjump/FDB stack to 0.12.0 / 5.21.3 for new-system extraction.
  • Lock down FE worker metadata pass-through from Rust to PyO3.
  • Fix metadata options merge when base config has no options.
  • Add tests for Python FE worker metadata options merging.

Workers / broker

  • Route worker callbacks to dispatching broker.
  • Rebalance broker polling between jobs.
  • Debounce broker-poll failure warnings (warn only after 30s).
  • Propagate MARS stream errors through worker delivery.
  • Emit delivered byte count on worker.job.completed.
  • Keep BOBS create connection pool warm; re-load-balance creates across BOBS.
  • Bump bits for worker error completion semantics.

Frontend

  • Add isolated internal poll listener.
  • Add Polytope-Mock-User header for synthetic user impersonation; require Polytope-Mock-Roles when set.
  • Copy loadgen workspace member into the Docker build.

Observability

  • Per-stage timing for workers and frontend submit path.

test-worker

  • Weighted per-job response_bytes_choices sampling.
  • Remove max_response_bytes cap (response is streamed).

CI

  • cargo fmt --all -- --check
  • cargo test -p polytope-server
  • cargo test -p polytope-server-integration-tests

…extraction

The old fe-worker FDB/gribjump client (gribjump 0.10.2, fdb 5.14.0, eckit 1.28.3,
metkit 1.11.22) hangs on remote extraction against new-system stores: extract()
stalls on the FDB-remote data channel with a broken-pipe TCPException. Bump the
whole stack to the versions the working mn5 deployment runs:

  eckit 1.33.2, eccodes 2.46.0, metkit 1.18.1, fdb 5.21.3, gribjump 0.12.0,
  ecbuild 3.12.0, pyfdb 5.21.0.19 (PyPI), pygribjump 0.12.0,
  polytope-mars 0.3.12, polytope-python 2.1.15, covjsonkit 0.2.20.

metkit 1.18.1 links eckit::geo::grid::ORCA (mars2grib/geo) which our standalone
eckit build does not provide; disable tests, build-tools and mars2grib in the
metkit build (and tests in fdb/gribjump) so the runtime libs build without the
unused ORCA dependency. pyfdb has no git tag matching FDB 5.21.x, so install the
matching PyPI wheel instead of cloning a tag.

Verified against mn5-dev: feature extraction (timeseries/boundingbox/
verticalprofile/polygon) and direct path both pass.
Workers used one long-lived reqwest client for every poll and callback. Since the
broker URL is the frontend ClusterIP service, kube-proxy pins that pooled TCP
connection to one frontend pod, so each worker lane can stay attached to one
frontend-broker forever. With multiple frontends and low worker counts this leaves
some brokers uncovered and jobs can hang.

Build a fresh reqwest client at the start of each poll cycle. The poll and any
job it returns still use that same client through heartbeat, delivery and
completion, preserving the required in-job stickiness. The next loop iteration
then drops the old pool and re-samples the frontend service.

Also lower the worker default long-poll timeout from 30s to 3s. The timeout is
only reached on empty brokers, so this improves idle-worker coverage convergence
without delaying busy brokers. Covers mars, fdb, feature and test workers.

Verified:
- cargo fmt --all -- --check
- cargo test -p polytope-worker-common
- cargo test -p polytope-fe-worker -p fdb-worker -p test-worker
- cargo test --manifest-path workers/mars-worker/Cargo.toml attempted; fails locally
  due missing native eckit headers, so Docker image build remains the compile gate
- rebuilt/pushed majh-dev worker images and mn5-test 3-frontend/2-worker verify
  passed 3x: 24/24 each run.
Adds low-overhead structured timing to localise where stress-test requests
spend time, without flooding normal operation:

- worker.job.completed now carries idle_ms (pickup wait since last job,
  same-pod clock so skew-free), poll_wait_ms, process_ms, deliver_ms
  (BOBS write) and complete_ms (broker completion POST) — one line per job.
- api.job.submitted carries enqueue_ms (route_handle.submit duration).
- High-frequency probes (worker.broker.poll.empty per empty poll,
  api.request.profiled with auth_ms/total_ms per request) log at DEBUG so
  they are available when profiling but silent by default.

These together showed the frontend/bits submit path is ~24ms (auth cache
hit + instant enqueue); the apparent "submit latency" under load was a
client-side connection-pool limit, not server time.
Adds a polytope-mock-user request header that overrides the effective username,
alongside the existing polytope-mock-roles (realm/roles). Gated by the same
admin_bypass_roles check, so only a configured admin can use it. Constrained to
names prefixed "mock-" so an admin can impersonate clearly-synthetic identities
(e.g. load-test users) but never a real account. Audited: MockRolesAudit retains
the real username and the api.auth.mock_accepted event logs the mocked username.

Enables stress tests on real-auth (DestinE) clusters to run as many distinct,
log-distinguishable mock-<n> users under one admin credential.
Hardening (Warp residual #2): a mock username without mock roles would inherit
the admin caller's own realm/roles. Reject that combination so a synthetic
mock-<n> identity always has an explicitly-scoped realm/roles and never silently
carries admin authority.
The streaming (bobs) delivery path forwards the result body without ever
measuring it, so worker.job.completed carried no size and only the
covjson fe-worker path logged any byte count. Add encode_stream_counted,
which tallies post-encoding bytes as the body streams to delivery, and
log the settled total as the 'bytes' field on worker.job.completed.

Gives a per-job, all-worker data-volume signal once log ingestion
surfaces attributes as first-class fields.
The poll loop warned on every non-2xx/non-204 broker response and every
connection error, so a frontend rollout or a momentary 503 produced a burst
of 'worker poll returned unexpected status' WARN lines. Workers deliberately
rebuild their client each cycle to fan out across broker replicas, so a single
failed cycle is normal and self-heals on the next.

Add PollHealth: failures log at debug (silent at RUST_LOG=info) until they
persist continuously for 30s, then warn once and re-warn at most every 30s;
any successful poll (work or a clean 204) resets it. Because the per-cycle
reconnect keeps landing the worker on healthy replicas during a rollout,
transient failures never reach the threshold -- so rollouts are silent with
no rollout-specific handling. Threshold logic takes an injected now and has a
unit test.
Expose a minimal auth-exempt GET /internal/poll/{id} router on a dedicated optional listener for broker-to-broker polling. The public app and auth middleware remain unchanged; the internal router shares the same AppState/Bits instance and mounts no public routes.

This intentionally does not alter bits or propagate user auth. Network isolation is enforced by the chart layer.
Temporarily point authotron-client/types at fix/coalesce-auth-cache so the mn5-test frontend image includes the auth cache single-flight fix while PR #72 is pending. Revert to branch=main after the PR lands.
The frontend Docker build copies Cargo.toml for the whole workspace, so it must include every workspace member referenced by the manifest. Copy loadgen/ so cargo build -p polytope-server works inside the container.
The BOBS create_client was built with pool_max_idle_per_host(0), so every
create ran on a cold TCP+h2 connection while write/complete reused the warm
body_client pool. Under load the cold create path failed with 'error sending
request for url (.../create)', which the broker surfaced as 'dispatch failed'
+ 'job failed' (and the worker side as 'result delivery failed'). Keep the
create pool warm (still a separate client so 16 MiB bodies don't HoL-block
creates).
…ss BOBS

The pooled h2 create_client pinned ALL of a worker's creates to a single BOBS
pod (one persistent connection to the service), concentrating every spool that
worker produced -- and its memory -- on one pod, driving per-pod OOM spikes.
Restore pool_max_idle_per_host(0): a fresh connection per create re-load-balances
across BOBS pods so objects spread evenly. (Each object's write/complete still
follow its create's per-pod write_url, so a single object's writes stay on one
BOBS, as required.)
response_bytes drives a lazy StressStream (one reused chunk buffer), so
there is nothing to bound and the silent .min(max_response_bytes) clamp was
just a footgun -- a request for a large object got clamped with no error.
Drop the field, its default/const, and the clamp; keep max_chunk_bytes
(chunk is a real per-chunk allocation). Removes the clamp-only test.
Enable mixed object-size load from a single loadgen payload: if the
request's stress block carries response_bytes_choices ([{bytes, weight}]),
the worker samples a size per job (keyed by job id -> deterministic yet
well-spread) instead of using a fixed response_bytes. The loadgen only
sends one payload, so the size distribution has to live in the worker
where the object is generated. No new deps (std DefaultHasher).
aggregate_mibps divides total bytes by the full first-read-to-last-read
span, so a heavy-tailed mix (one large object draining alone for minutes)
dilutes it. Add peak_windowed_mibps: the max average read throughput over
any 10s window, modeling each download as constant-rate across its
interval and summing concurrent reads. Captures the dense concurrent-read
plateau independent of ramp-up and the single-stream tail.
Preserve per-request MARS source errors out-of-band from the streamed body so
BOBS/S3/direct delivery failures do not mask the actual mars-client message.
When a source stream fails mid-body, delivery now posts Completion::Error with
the mapped user message instead of a generic reqwest/BOBS transport error or a
half-written object redirect.

Mars errors are classified by approved string matches and surfaced through the
existing JobResult::Error -> HTTP 400 contract. Adds regression coverage for
BOBS source-error precedence and the worker loop posting /complete/error rather
than /complete/redirect.
Use bits ec18c4c so remote-pool /complete/error maps to JobResult::Error
(HTTP 400 with the worker message) instead of JobResult::Failed/500 with a
canned internal-server-error response.
Merge/rename duplicate test module scope and include callback_url in the test
WorkItem literal so mars-worker compiles in the native container build.
Emit LOADGEN_PROGRESS JSON during measured runs so external tools can display
true live stress-test metrics. The progress schema is credential-free, contains
window_s=60, cumulative counts/bytes, window RPS/MiB/s, and raw ready latency
samples for downstream rolling p95 aggregation. LOADGEN_SUMMARY remains
unchanged; LOADGEN_PROGRESS_INTERVAL_MS=0 disables progress.

cargo test -p loadgen passed.
- Fix pyo3_round_trip compile error by adding explicit callback_url: None
- Strengthen test to verify metadata is serialized into PyO3 payload
- Add Python assertions that fail if metadata field is missing or wrong
- Test now fails if metadata is removed from serialization (verified)

Addresses task 2 from lumi-test-deploy plan.
Tests prove:
- Trusted metadata can override datacube and options per request
- Client request fields (polytope_mars, metadata, pre_path, use_catalogue) do not influence config
- Two sequential requests on one datasource use different metadata blocks
- self.config is never mutated by metadata overlay
- change_grids() intra-dataset refinement runs after metadata overlay
- Validation errors for non-dict metadata values

All 9 tests pass. Completes task 3 from lumi-test-deploy plan.
Add a wall-clock duration run mode to the in-cluster Rust loadgen so it can
sustain steady offered load until a deadline instead of running a fixed
iteration count.

- RunLimit parses LOADGEN_DURATION_S (+ optional LOADGEN_RPS); iteration mode
  preserved when duration is unset.
- rps!=null: open-model scheduler at target rate, capped by concurrency,
  counting missed_starts/backpressure rather than queueing unbounded work.
- rps==null: closed-loop concurrency fill until deadline.
- Stops scheduling at the deadline, then gracefully drains in-flight
  submit/poll/download cycles.
- Summary/config adds run_limit, target_duration_s, submission/drain
  durations, scheduled, missed_starts, and time-bucketed metrics
  (throughput/error/bytes/ready-p95) for early-vs-late degradation checks.
- Security: auth header / POLYTOPE_EMAIL / POLYTOPE_KEY excluded from
  summary/progress/config serialisation, with tests asserting no leakage.
The single-FE-pool LUMI deployment supplies every per-dataset datacube/options
block via job metadata (set_metadata action), so the static pool config has no
'options' key. The merge did polytope_mars_config['options'].update(...), which
raised KeyError: 'options' on every FE request, and it left the metadata-supplied
pre_path (a list of axis names) unconverted. setdefault('options', {}) and build
the per-request pre_path dict from the metadata-supplied (or base) axis list.
… unset

A multi-broker worker_server needs advertised_addr (from POD_IP) so workers send
completions to the dispatching broker. If POD_IP is missing the broker advertised
no direct callback and workers silently fell back to the load-balanced URL,
misrouting completions and stranding jobs. Refuse to start instead (+test).
Logs worker.callback.base.resolved (outcome=direct|lb_fallback) so we can see at
runtime whether a worker is posting completions to the broker pod IP or the LB.
@jameshawkes jameshawkes merged commit 46ccd6c into upstream Jun 24, 2026
2 of 4 checks passed
@jameshawkes jameshawkes deleted the feature/rust-loadgen branch June 24, 2026 14:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant