Rust load generator + soak mode; FE-worker gribjump/metadata; worker/broker/frontend updates by jameshawkes · Pull Request #200 · ecmwf/polytope-server

jameshawkes · 2026-06-22T20:24:57Z

Changes

Load generator

Add Rust load generator crate; document and package it.
Emit live progress snapshots.
Add peak windowed throughput metric.
Add duration-based soak mode.

FE worker

Bump gribjump/FDB stack to 0.12.0 / 5.21.3 for new-system extraction.
Lock down FE worker metadata pass-through from Rust to PyO3.
Fix metadata options merge when base config has no options.
Add tests for Python FE worker metadata options merging.

Workers / broker

Route worker callbacks to dispatching broker.
Rebalance broker polling between jobs.
Debounce broker-poll failure warnings (warn only after 30s).
Propagate MARS stream errors through worker delivery.
Emit delivered byte count on worker.job.completed.
Keep BOBS create connection pool warm; re-load-balance creates across BOBS.
Bump bits for worker error completion semantics.

Frontend

Add isolated internal poll listener.
Add Polytope-Mock-User header for synthetic user impersonation; require Polytope-Mock-Roles when set.
Copy loadgen workspace member into the Docker build.

Observability

Per-stage timing for workers and frontend submit path.

test-worker

Weighted per-job response_bytes_choices sampling.
Remove max_response_bytes cap (response is streamed).

CI

cargo fmt --all -- --check
cargo test -p polytope-server
cargo test -p polytope-server-integration-tests

…extraction

The old fe-worker FDB/gribjump client (gribjump 0.10.2, fdb 5.14.0, eckit 1.28.3, metkit 1.11.22) hangs on remote extraction against new-system stores: extract() stalls on the FDB-remote data channel with a broken-pipe TCPException.

Bump the whole stack to the versions the working mn5 deployment runs: eckit 1.33.2, eccodes 2.46.0, metkit 1.18.1, fdb 5.21.3, gribjump 0.12.0, ecbuild 3.12.0, pyfdb 5.21.0.19 (PyPI), pygribjump 0.12.0, polytope-mars 0.3.12, polytope-python 2.1.15, covjsonkit 0.2.20.

metkit 1.18.1 links eckit::geo::grid::ORCA (mars2grib/geo) which our standalone eckit build does not provide; disable tests, build-tools and mars2grib in the metkit build (and tests in fdb/gribjump) so the runtime libs build without the unused ORCA dependency. pyfdb has no git tag matching FDB 5.21.x, so install the matching PyPI wheel instead of cloning a tag.

Verified against mn5-dev: feature extraction (timeseries/boundingbox/verticalprofile/polygon) and direct path both pass.

Workers used one long-lived reqwest client for every poll and callback. Since the broker URL is the frontend ClusterIP service, kube-proxy pins that pooled TCP connection to one frontend pod, so each worker lane can stay attached to one frontend-broker forever. With multiple frontends and low worker counts this leaves some brokers uncovered and jobs can hang.

Build a fresh reqwest client at the start of each poll cycle. The poll and any job it returns still use that same client through heartbeat, delivery and completion, preserving the required in-job stickiness. The next loop iteration then drops the old pool and re-samples the frontend service.

Also lower the worker default long-poll timeout from 30s to 3s. The timeout is only reached on empty brokers, so this improves idle-worker coverage convergence without delaying busy brokers.

Covers mars, fdb, feature and test workers.

Verified:
- cargo fmt --all -- --check
- cargo test -p polytope-worker-common
- cargo test -p polytope-fe-worker -p fdb-worker -p test-worker
- cargo test --manifest-path workers/mars-worker/Cargo.toml attempted; fails locally due to missing native eckit headers, so Docker image build remains the compile gate
- rebuilt/pushed majh-dev worker images and mn5-test 3-frontend/2-worker verify passed 3x: 24/24 each run.

Adds low-overhead structured timing to localise where stress-test requests spend time, without flooding normal operation:
- worker.job.completed now carries idle_ms (pickup wait since last job, same-pod clock so skew-free), poll_wait_ms, process_ms, deliver_ms (BOBS write) and complete_ms (broker completion POST), one line per job.
- api.job.submitted carries enqueue_ms (route_handle.submit duration).
- High-frequency probes (worker.broker.poll.empty per empty poll, api.request.profiled with auth_ms/total_ms per request) log at DEBUG so they are available when profiling but silent by default.

These together showed the frontend/bits submit path is ~24ms (auth cache hit + instant enqueue); the apparent "submit latency" under load was a client-side connection-pool limit, not server time.

Adds a polytope-mock-user request header that overrides the effective username, alongside the existing polytope-mock-roles (realm/roles). Gated by the same admin_bypass_roles check, so only a configured admin can use it. Constrained to names prefixed "mock-" so an admin can impersonate clearly-synthetic identities (e.g. load-test users) but never a real account. Audited: MockRolesAudit retains the real username and the api.auth.mock_accepted event logs the mocked username. Enables stress tests on real-auth (DestinE) clusters to run as many distinct, log-distinguishable mock-<n> users under one admin credential.

Hardening (Warp residual #2): a mock username without mock roles would inherit the admin caller's own realm/roles. Reject that combination so a synthetic mock-<n> identity always has an explicitly-scoped realm/roles and never silently carries admin authority.
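The mock-user rules described above (admin-gated, "mock-" prefix only, roles required) can be sketched as a single pure check. This is only an illustration under stated assumptions: `resolve_mock_user`, `MockError` and the boolean `caller_is_admin` are hypothetical names standing in for the admin_bypass_roles check and header parsing, and rejecting (rather than ignoring) a non-admin's header is an assumption, not confirmed behaviour.

```rust
/// Hypothetical rejection reasons for a mock-user request.
#[derive(Debug, PartialEq, Eq)]
pub enum MockError {
    NotAdmin,
    BadPrefix,
    MissingRoles,
}

/// Decide the effective mocked username, if any.
/// `mock_user` / `mock_roles` stand for the raw values of the
/// Polytope-Mock-User / Polytope-Mock-Roles headers.
pub fn resolve_mock_user<'a>(
    caller_is_admin: bool,
    mock_user: Option<&'a str>,
    mock_roles: Option<&str>,
) -> Result<Option<&'a str>, MockError> {
    // No mock-user header: nothing to override.
    let Some(user) = mock_user else {
        return Ok(None);
    };
    // Assumption: a non-admin caller sending the header is rejected.
    if !caller_is_admin {
        return Err(MockError::NotAdmin);
    }
    // Only clearly synthetic identities may be impersonated.
    if !user.starts_with("mock-") {
        return Err(MockError::BadPrefix);
    }
    // A mock user must carry explicit roles, so it never inherits
    // the admin caller's own realm/roles.
    match mock_roles {
        Some(roles) if !roles.trim().is_empty() => Ok(Some(user)),
        _ => Err(MockError::MissingRoles),
    }
}
```

Keeping the check free of I/O makes each rejection path easy to unit test; the audit record with the real username would be written by the caller once this returns `Ok(Some(_))`.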

The streaming (bobs) delivery path forwards the result body without ever measuring it, so worker.job.completed carried no size and only the covjson fe-worker path logged any byte count. Add encode_stream_counted, which tallies post-encoding bytes as the body streams to delivery, and log the settled total as the 'bytes' field on worker.job.completed. Gives a per-job, all-worker data-volume signal once log ingestion surfaces attributes as first-class fields.
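The counting idea behind encode_stream_counted can be illustrated with a synchronous stand-in. The sketch below is not the actual implementation: the real path is an async stream, while this uses a plain `Iterator` of chunks, and `Counted` / `counted` are hypothetical names. It only shows how bytes can be tallied as they pass through, with the total read after the body has been fully consumed.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Wraps a chunk iterator and tallies the bytes of every successful chunk.
pub struct Counted<I> {
    inner: I,
    total: Arc<AtomicU64>,
}

impl<I, E> Iterator for Counted<I>
where
    I: Iterator<Item = Result<Vec<u8>, E>>,
{
    type Item = Result<Vec<u8>, E>;

    fn next(&mut self) -> Option<Self::Item> {
        let item = self.inner.next()?;
        // Count only chunks that were actually produced; errors pass through.
        if let Ok(chunk) = &item {
            self.total.fetch_add(chunk.len() as u64, Ordering::Relaxed);
        }
        Some(item)
    }
}

/// Returns the wrapped stream plus a handle to read the settled total
/// once delivery has drained the stream.
pub fn counted<I>(inner: I) -> (Counted<I>, Arc<AtomicU64>) {
    let total = Arc::new(AtomicU64::new(0));
    let stream = Counted {
        inner,
        total: Arc::clone(&total),
    };
    (stream, total)
}
```

The shared counter is read only after the delivery call returns, which is what makes it a "settled" total suitable for the bytes field.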

The poll loop warned on every non-2xx/non-204 broker response and every connection error, so a frontend rollout or a momentary 503 produced a burst of 'worker poll returned unexpected status' WARN lines. Workers deliberately rebuild their client each cycle to fan out across broker replicas, so a single failed cycle is normal and self-heals on the next. Add PollHealth: failures log at debug (silent at RUST_LOG=info) until they persist continuously for 30s, then warn once and re-warn at most every 30s; any successful poll (work or a clean 204) resets it. Because the per-cycle reconnect keeps landing the worker on healthy replicas during a rollout, transient failures never reach the threshold -- so rollouts are silent with no rollout-specific handling. Threshold logic takes an injected now and has a unit test.
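A minimal sketch of the debounce logic, assuming the shape described above: the name PollHealth comes from the commit message, but the fields, the `FailureLog` enum and the method names are illustrative guesses, not the actual code. Time is injected as `now` so the threshold can be tested without sleeping.

```rust
use std::time::{Duration, Instant};

/// Level the caller should log a failed poll at.
#[derive(Debug, PartialEq, Eq)]
pub enum FailureLog {
    Debug,
    Warn,
}

/// Tracks how long broker polls have been failing continuously.
pub struct PollHealth {
    threshold: Duration,
    failing_since: Option<Instant>,
    last_warned: Option<Instant>,
}

impl PollHealth {
    pub fn new(threshold: Duration) -> Self {
        Self {
            threshold,
            failing_since: None,
            last_warned: None,
        }
    }

    /// Any successful poll (work or a clean 204) resets the streak.
    pub fn record_success(&mut self) {
        self.failing_since = None;
        self.last_warned = None;
    }

    /// Record a failure observed at `now` and pick the log level.
    pub fn record_failure(&mut self, now: Instant) -> FailureLog {
        // Start of the current failure streak (set on the first failure).
        let since = *self.failing_since.get_or_insert(now);
        if now.duration_since(since) < self.threshold {
            return FailureLog::Debug;
        }
        match self.last_warned {
            // Already warned recently: stay quiet until the interval elapses.
            Some(last) if now.duration_since(last) < self.threshold => FailureLog::Debug,
            _ => {
                self.last_warned = Some(now);
                FailureLog::Warn
            }
        }
    }
}
```

With a 30s threshold, failures at t=0 and t=29s stay at debug, the one at t=30s warns, t=45s is quiet again, and t=60s re-warns.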

Expose a minimal auth-exempt GET /internal/poll/{id} router on a dedicated optional listener for broker-to-broker polling. The public app and auth middleware remain unchanged; the internal router shares the same AppState/Bits instance and mounts no public routes. This intentionally does not alter bits or propagate user auth. Network isolation is enforced by the chart layer.

Temporarily point authotron-client/types at fix/coalesce-auth-cache so the mn5-test frontend image includes the auth cache single-flight fix while PR #72 is pending. Revert to branch=main after the PR lands.

The frontend Docker build copies Cargo.toml for the whole workspace, so it must include every workspace member referenced by the manifest. Copy loadgen/ so cargo build -p polytope-server works inside the container.

The BOBS create_client was built with pool_max_idle_per_host(0), so every create ran on a cold TCP+h2 connection while write/complete reused the warm body_client pool. Under load the cold create path failed with 'error sending request for url (.../create)', which the broker surfaced as 'dispatch failed' + 'job failed' (and the worker side as 'result delivery failed'). Keep the create pool warm (still a separate client so 16 MiB bodies don't HoL-block creates).

…ss BOBS

The pooled h2 create_client pinned ALL of a worker's creates to a single BOBS pod (one persistent connection to the service), concentrating every spool that worker produced -- and its memory -- on one pod, driving per-pod OOM spikes. Restore pool_max_idle_per_host(0): a fresh connection per create re-load-balances across BOBS pods so objects spread evenly. (Each object's write/complete still follow its create's per-pod write_url, so a single object's writes stay on one BOBS, as required.)

response_bytes drives a lazy StressStream (one reused chunk buffer), so there is nothing to bound and the silent .min(max_response_bytes) clamp was just a footgun -- a request for a large object got clamped with no error. Drop the field, its default/const, and the clamp; keep max_chunk_bytes (chunk is a real per-chunk allocation). Removes the clamp-only test.

Enable mixed object-size load from a single loadgen payload: if the request's stress block carries response_bytes_choices ([{bytes, weight}]), the worker samples a size per job (keyed by job id -> deterministic yet well-spread) instead of using a fixed response_bytes. The loadgen only sends one payload, so the size distribution has to live in the worker where the object is generated. No new deps (std DefaultHasher).
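One way the per-job weighted sampling could look, as a sketch only: `SizeChoice` and `sample_response_bytes` are hypothetical names, and the integer field types are assumed from the `[{bytes, weight}]` shape. The job id is hashed with std's `DefaultHasher` and the hash is mapped onto the cumulative weights, so the same job id always yields the same size within a given build.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One entry of response_bytes_choices (field types are assumptions).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SizeChoice {
    pub bytes: u64,
    pub weight: u64,
}

/// Pick a response size for `job_id`, proportionally to the weights.
/// Returns None when there are no choices or all weights are zero.
pub fn sample_response_bytes(job_id: &str, choices: &[SizeChoice]) -> Option<u64> {
    let total: u64 = choices.iter().map(|c| c.weight).sum();
    if total == 0 {
        return None;
    }
    // Hash the job id; DefaultHasher::new() uses fixed keys, so this is
    // deterministic for a given toolchain (not guaranteed across releases).
    let mut hasher = DefaultHasher::new();
    job_id.hash(&mut hasher);
    let mut ticket = hasher.finish() % total;
    // Walk the cumulative weights until the ticket falls in a bucket.
    for c in choices {
        if ticket < c.weight {
            return Some(c.bytes);
        }
        ticket -= c.weight;
    }
    None // not reached when total > 0
}
```

When this returns `None`, a caller would fall back to the fixed response_bytes value.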

aggregate_mibps divides total bytes by the full first-read-to-last-read span, so a heavy-tailed mix (one large object draining alone for minutes) dilutes it. Add peak_windowed_mibps: the max average read throughput over any 10s window, modeling each download as constant-rate across its interval and summing concurrent reads. Captures the dense concurrent-read plateau independent of ramp-up and the single-stream tail.
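The windowed metric can be sketched as follows, assuming each download is recorded as a (start, end, bytes) interval; the `Download` struct and the evaluation strategy are illustrative, not taken from the loadgen source. Because every download is modeled as constant-rate, the bytes inside a sliding window are piecewise linear in the window start, so the maximum is reached at a window that begins or ends on some download's start or end; the sketch simply evaluates those candidates.

```rust
/// One completed download; times are seconds since the run started.
#[derive(Debug, Clone, Copy)]
pub struct Download {
    pub start_s: f64,
    pub end_s: f64,
    pub bytes: u64,
}

/// Bytes of `d` attributed to the window [t, t + w] under a constant-rate model.
fn bytes_in_window(d: &Download, t: f64, w: f64) -> f64 {
    let dur = d.end_s - d.start_s;
    if dur <= 0.0 {
        // Instantaneous download: counts fully if it falls inside the window.
        let inside = d.start_s >= t && d.start_s <= t + w;
        return if inside { d.bytes as f64 } else { 0.0 };
    }
    let overlap = (d.end_s.min(t + w) - d.start_s.max(t)).max(0.0);
    d.bytes as f64 * overlap / dur
}

/// Max average read throughput (MiB/s) over any window of `window_s` seconds.
pub fn peak_windowed_mibps(downloads: &[Download], window_s: f64) -> f64 {
    if downloads.is_empty() || window_s <= 0.0 {
        return 0.0;
    }
    // Candidate window starts: windows beginning or ending on an interval edge.
    let mut candidates = Vec::with_capacity(downloads.len() * 4);
    for d in downloads {
        candidates.extend([d.start_s, d.end_s, d.start_s - window_s, d.end_s - window_s]);
    }
    let mut peak_bytes_per_s = 0.0f64;
    for &t in &candidates {
        let bytes: f64 = downloads.iter().map(|d| bytes_in_window(d, t, window_s)).sum();
        peak_bytes_per_s = peak_bytes_per_s.max(bytes / window_s);
    }
    peak_bytes_per_s / (1024.0 * 1024.0)
}
```

For example, a 10 MiB download over 0-10s running alongside a 100 MiB download over 0-100s gives an aggregate of 110 MiB / 100s = 1.1 MiB/s, while the 10s window starting at 0 holds 20 MiB, so the peak is 2.0 MiB/s. This brute-force evaluation is quadratic in the number of downloads; a sweep over sorted edges would scale better.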

Preserve per-request MARS source errors out-of-band from the streamed body so BOBS/S3/direct delivery failures do not mask the actual mars-client message. When a source stream fails mid-body, delivery now posts Completion::Error with the mapped user message instead of a generic reqwest/BOBS transport error or a half-written object redirect. Mars errors are classified by approved string matches and surfaced through the existing JobResult::Error -> HTTP 400 contract. Adds regression coverage for BOBS source-error precedence and the worker loop posting /complete/error rather than /complete/redirect.

Use bits ec18c4c so remote-pool /complete/error maps to JobResult::Error (HTTP 400 with the worker message) instead of JobResult::Failed/500 with a canned internal-server-error response.

Merge/rename duplicate test module scope and include callback_url in the test WorkItem literal so mars-worker compiles in the native container build.

Emit LOADGEN_PROGRESS JSON during measured runs so external tools can display true live stress-test metrics. The progress schema is credential-free, contains window_s=60, cumulative counts/bytes, window RPS/MiB/s, and raw ready latency samples for downstream rolling p95 aggregation. LOADGEN_SUMMARY remains unchanged; LOADGEN_PROGRESS_INTERVAL_MS=0 disables progress. cargo test -p loadgen passed.

- Fix pyo3_round_trip compile error by adding explicit callback_url: None
- Strengthen test to verify metadata is serialized into PyO3 payload
- Add Python assertions that fail if metadata field is missing or wrong
- Test now fails if metadata is removed from serialization (verified)

Addresses task 2 from lumi-test-deploy plan.

Tests prove:
- Trusted metadata can override datacube and options per request
- Client request fields (polytope_mars, metadata, pre_path, use_catalogue) do not influence config
- Two sequential requests on one datasource use different metadata blocks
- self.config is never mutated by metadata overlay
- change_grids() intra-dataset refinement runs after metadata overlay
- Non-dict metadata values raise validation errors

All 9 tests pass. Completes task 3 from lumi-test-deploy plan.

Add a wall-clock duration run mode to the in-cluster Rust loadgen so it can sustain steady offered load until a deadline instead of running a fixed iteration count.
- RunLimit parses LOADGEN_DURATION_S (+ optional LOADGEN_RPS); iteration mode preserved when duration is unset.
- rps!=null: open-model scheduler at target rate, capped by concurrency, counting missed_starts/backpressure rather than queueing unbounded work.
- rps==null: closed-loop concurrency fill until deadline.
- Stops scheduling at the deadline, then gracefully drains in-flight submit/poll/download cycles.
- Summary/config adds run_limit, target_duration_s, submission/drain durations, scheduled, missed_starts, and time-bucketed metrics (throughput/error/bytes/ready-p95) for early-vs-late degradation checks.
- Security: auth header / POLYTOPE_EMAIL / POLYTOPE_KEY excluded from summary/progress/config serialisation, with tests asserting no leakage.
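The run-limit selection described above might be parsed along these lines. This is a hedged sketch: RunLimit is named in the commit message, but its variants, the `parse_run_limit` function and the validation rules are assumptions. The raw values of LOADGEN_DURATION_S and LOADGEN_RPS are passed in as arguments (rather than read from the environment) so the logic stays testable.

```rust
use std::time::Duration;

/// How a run terminates (variant names are hypothetical).
#[derive(Debug, PartialEq)]
pub enum RunLimit {
    /// Fixed iteration count: the pre-existing mode.
    Iterations(u64),
    /// Run until a wall-clock deadline; `rps` selects open-model (Some)
    /// or closed-loop (None) scheduling.
    Timed { duration: Duration, rps: Option<f64> },
}

/// `duration_s` / `rps` are the raw LOADGEN_DURATION_S / LOADGEN_RPS values.
pub fn parse_run_limit(
    duration_s: Option<&str>,
    rps: Option<&str>,
    iterations: u64,
) -> Result<RunLimit, String> {
    // Duration unset: keep iteration mode, ignoring any rps value.
    let Some(raw) = duration_s.map(str::trim).filter(|s| !s.is_empty()) else {
        return Ok(RunLimit::Iterations(iterations));
    };
    let secs: u64 = raw
        .parse()
        .map_err(|e| format!("invalid LOADGEN_DURATION_S {raw:?}: {e}"))?;
    if secs == 0 {
        return Err("LOADGEN_DURATION_S must be greater than zero".to_string());
    }
    let rps = match rps.map(str::trim).filter(|s| !s.is_empty()) {
        None => None,
        Some(raw_rps) => {
            let value: f64 = raw_rps
                .parse()
                .map_err(|e| format!("invalid LOADGEN_RPS {raw_rps:?}: {e}"))?;
            // Reject zero, negative, NaN and infinite rates.
            if !value.is_finite() || value <= 0.0 {
                return Err("LOADGEN_RPS must be a positive finite number".to_string());
            }
            Some(value)
        }
    };
    Ok(RunLimit::Timed {
        duration: Duration::from_secs(secs),
        rps,
    })
}
```

A caller would pass `std::env::var("LOADGEN_DURATION_S").ok().as_deref()` and the equivalent for LOADGEN_RPS.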

The single-FE-pool LUMI deployment supplies every per-dataset datacube/options block via job metadata (set_metadata action), so the static pool config has no 'options' key. The merge did polytope_mars_config['options'].update(...), which raised KeyError: 'options' on every FE request, and it left the metadata-supplied pre_path (a list of axis names) unconverted. Use setdefault('options', {}) instead, and build the per-request pre_path dict from the metadata-supplied (or base) axis list.

… unset

A multi-broker worker_server needs advertised_addr (from POD_IP) so workers send completions to the dispatching broker. If POD_IP is missing the broker advertised no direct callback and workers silently fell back to the load-balanced URL, misrouting completions and stranding jobs. Refuse to start instead (+test).

Logs worker.callback.base.resolved (outcome=direct|lb_fallback) so we can see at runtime whether a worker is posting completions to the broker pod IP or the LB.
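Both sides of the callback routing can be sketched as two small pure functions. The names `advertised_addr` (as a function), `resolve_callback_base` and `CallbackOutcome`, the `port` parameter and the URL format are all illustrative assumptions; only the behaviour (fail fast when POD_IP is missing, report direct vs lb_fallback) comes from the commit messages.

```rust
/// Broker side: build the address workers should call back on.
/// `pod_ip` stands for the raw POD_IP value; `port` is a hypothetical parameter.
pub fn advertised_addr(pod_ip: Option<&str>, port: u16) -> Result<String, String> {
    match pod_ip.map(str::trim) {
        Some(ip) if !ip.is_empty() => Ok(format!("http://{ip}:{port}")),
        // Refuse to start rather than silently advertising nothing.
        _ => Err("POD_IP is unset; refusing to start a multi-broker worker_server".to_string()),
    }
}

/// Which base URL a worker ended up using for completions.
#[derive(Debug, PartialEq, Eq)]
pub enum CallbackOutcome {
    Direct,
    LbFallback,
}

impl CallbackOutcome {
    /// Value for the `outcome` field of worker.callback.base.resolved.
    pub fn as_str(&self) -> &'static str {
        match self {
            CallbackOutcome::Direct => "direct",
            CallbackOutcome::LbFallback => "lb_fallback",
        }
    }
}

/// Worker side: prefer the dispatching broker's callback URL when the
/// work item carries one, otherwise fall back to the load-balanced URL.
pub fn resolve_callback_base<'a>(
    callback_url: Option<&'a str>,
    lb_url: &'a str,
) -> (&'a str, CallbackOutcome) {
    match callback_url {
        Some(url) if !url.is_empty() => (url, CallbackOutcome::Direct),
        _ => (lb_url, CallbackOutcome::LbFallback),
    }
}
```

Returning the outcome alongside the URL lets the caller log it at debug level without re-deriving which branch was taken.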

jameshawkes added 28 commits June 16, 2026 22:32

Add Rust load generator crate

8785b25

Document and package load generator

aed9989

test(frontend): pin auth-o-tron cache fix for mn5 validation

e96a9dd

fix(frontend): copy loadgen workspace member into Docker build

e9e877d

Route worker callbacks to dispatching broker

85548c2

Bump bits for worker error completion semantics

811a4f9

Fix mars-worker test module after error mapping change

44bd2b7

obs(worker): debug-log the resolved callback base url

a7d58e9

jameshawkes merged commit 46ccd6c into upstream Jun 24, 2026
2 of 4 checks passed

jameshawkes deleted the feature/rust-loadgen branch June 24, 2026 14:40
