Rust load generator + soak mode; FE-worker gribjump/metadata; worker/broker/frontend updates#200
Merged
Merged
Conversation
…extraction The old fe-worker FDB/gribjump client (gribjump 0.10.2, fdb 5.14.0, eckit 1.28.3, metkit 1.11.22) hangs on remote extraction against new-system stores: extract() stalls on the FDB-remote data channel with a broken-pipe TCPException. Bump the whole stack to the versions the working mn5 deployment runs: eckit 1.33.2, eccodes 2.46.0, metkit 1.18.1, fdb 5.21.3, gribjump 0.12.0, ecbuild 3.12.0, pyfdb 5.21.0.19 (PyPI), pygribjump 0.12.0, polytope-mars 0.3.12, polytope-python 2.1.15, covjsonkit 0.2.20. metkit 1.18.1 links eckit::geo::grid::ORCA (mars2grib/geo) which our standalone eckit build does not provide; disable tests, build-tools and mars2grib in the metkit build (and tests in fdb/gribjump) so the runtime libs build without the unused ORCA dependency. pyfdb has no git tag matching FDB 5.21.x, so install the matching PyPI wheel instead of cloning a tag. Verified against mn5-dev: feature extraction (timeseries/boundingbox/ verticalprofile/polygon) and direct path both pass.
Workers used one long-lived reqwest client for every poll and callback. Since the broker URL is the frontend ClusterIP service, kube-proxy pins that pooled TCP connection to one frontend pod, so each worker lane can stay attached to one frontend-broker forever. With multiple frontends and low worker counts this leaves some brokers uncovered and jobs can hang. Build a fresh reqwest client at the start of each poll cycle. The poll and any job it returns still use that same client through heartbeat, delivery and completion, preserving the required in-job stickiness. The next loop iteration then drops the old pool and re-samples the frontend service. Also lower the worker default long-poll timeout from 30s to 3s. The timeout is only reached on empty brokers, so this improves idle-worker coverage convergence without delaying busy brokers. Covers mars, fdb, feature and test workers. Verified: - cargo fmt --all -- --check - cargo test -p polytope-worker-common - cargo test -p polytope-fe-worker -p fdb-worker -p test-worker - cargo test --manifest-path workers/mars-worker/Cargo.toml attempted; fails locally due missing native eckit headers, so Docker image build remains the compile gate - rebuilt/pushed majh-dev worker images and mn5-test 3-frontend/2-worker verify passed 3x: 24/24 each run.
Adds low-overhead structured timing to localise where stress-test requests spend time, without flooding normal operation: - worker.job.completed now carries idle_ms (pickup wait since last job, same-pod clock so skew-free), poll_wait_ms, process_ms, deliver_ms (BOBS write) and complete_ms (broker completion POST) — one line per job. - api.job.submitted carries enqueue_ms (route_handle.submit duration). - High-frequency probes (worker.broker.poll.empty per empty poll, api.request.profiled with auth_ms/total_ms per request) log at DEBUG so they are available when profiling but silent by default. These together showed the frontend/bits submit path is ~24ms (auth cache hit + instant enqueue); the apparent "submit latency" under load was a client-side connection-pool limit, not server time.
Adds a polytope-mock-user request header that overrides the effective username, alongside the existing polytope-mock-roles (realm/roles). Gated by the same admin_bypass_roles check, so only a configured admin can use it. Constrained to names prefixed "mock-" so an admin can impersonate clearly-synthetic identities (e.g. load-test users) but never a real account. Audited: MockRolesAudit retains the real username and the api.auth.mock_accepted event logs the mocked username. Enables stress tests on real-auth (DestinE) clusters to run as many distinct, log-distinguishable mock-<n> users under one admin credential.
Hardening (Warp residual #2): a mock username without mock roles would inherit the admin caller's own realm/roles. Reject that combination so a synthetic mock-<n> identity always has an explicitly-scoped realm/roles and never silently carries admin authority.
The streaming (bobs) delivery path forwards the result body without ever measuring it, so worker.job.completed carried no size and only the covjson fe-worker path logged any byte count. Add encode_stream_counted, which tallies post-encoding bytes as the body streams to delivery, and log the settled total as the 'bytes' field on worker.job.completed. Gives a per-job, all-worker data-volume signal once log ingestion surfaces attributes as first-class fields.
The poll loop warned on every non-2xx/non-204 broker response and every connection error, so a frontend rollout or a momentary 503 produced a burst of 'worker poll returned unexpected status' WARN lines. Workers deliberately rebuild their client each cycle to fan out across broker replicas, so a single failed cycle is normal and self-heals on the next. Add PollHealth: failures log at debug (silent at RUST_LOG=info) until they persist continuously for 30s, then warn once and re-warn at most every 30s; any successful poll (work or a clean 204) resets it. Because the per-cycle reconnect keeps landing the worker on healthy replicas during a rollout, transient failures never reach the threshold -- so rollouts are silent with no rollout-specific handling. Threshold logic takes an injected now and has a unit test.
Expose a minimal auth-exempt GET /internal/poll/{id} router on a dedicated optional listener for broker-to-broker polling. The public app and auth middleware remain unchanged; the internal router shares the same AppState/Bits instance and mounts no public routes.
This intentionally does not alter bits or propagate user auth. Network isolation is enforced by the chart layer.
Temporarily point authotron-client/types at fix/coalesce-auth-cache so the mn5-test frontend image includes the auth cache single-flight fix while PR #72 is pending. Revert to branch=main after the PR lands.
The frontend Docker build copies Cargo.toml for the whole workspace, so it must include every workspace member referenced by the manifest. Copy loadgen/ so cargo build -p polytope-server works inside the container.
The BOBS create_client was built with pool_max_idle_per_host(0), so every create ran on a cold TCP+h2 connection while write/complete reused the warm body_client pool. Under load the cold create path failed with 'error sending request for url (.../create)', which the broker surfaced as 'dispatch failed' + 'job failed' (and the worker side as 'result delivery failed'). Keep the create pool warm (still a separate client so 16 MiB bodies don't HoL-block creates).
…ss BOBS The pooled h2 create_client pinned ALL of a worker's creates to a single BOBS pod (one persistent connection to the service), concentrating every spool that worker produced -- and its memory -- on one pod, driving per-pod OOM spikes. Restore pool_max_idle_per_host(0): a fresh connection per create re-load-balances across BOBS pods so objects spread evenly. (Each object's write/complete still follow its create's per-pod write_url, so a single object's writes stay on one BOBS, as required.)
response_bytes drives a lazy StressStream (one reused chunk buffer), so there is nothing to bound and the silent .min(max_response_bytes) clamp was just a footgun -- a request for a large object got clamped with no error. Drop the field, its default/const, and the clamp; keep max_chunk_bytes (chunk is a real per-chunk allocation). Removes the clamp-only test.
Enable mixed object-size load from a single loadgen payload: if the
request's stress block carries response_bytes_choices ([{bytes, weight}]),
the worker samples a size per job (keyed by job id -> deterministic yet
well-spread) instead of using a fixed response_bytes. The loadgen only
sends one payload, so the size distribution has to live in the worker
where the object is generated. No new deps (std DefaultHasher).
aggregate_mibps divides total bytes by the full first-read-to-last-read span, so a heavy-tailed mix (one large object draining alone for minutes) dilutes it. Add peak_windowed_mibps: the max average read throughput over any 10s window, modeling each download as constant-rate across its interval and summing concurrent reads. Captures the dense concurrent-read plateau independent of ramp-up and the single-stream tail.
Preserve per-request MARS source errors out-of-band from the streamed body so BOBS/S3/direct delivery failures do not mask the actual mars-client message. When a source stream fails mid-body, delivery now posts Completion::Error with the mapped user message instead of a generic reqwest/BOBS transport error or a half-written object redirect. Mars errors are classified by approved string matches and surfaced through the existing JobResult::Error -> HTTP 400 contract. Adds regression coverage for BOBS source-error precedence and the worker loop posting /complete/error rather than /complete/redirect.
Use bits ec18c4c so remote-pool /complete/error maps to JobResult::Error (HTTP 400 with the worker message) instead of JobResult::Failed/500 with a canned internal-server-error response.
Merge/rename duplicate test module scope and include callback_url in the test WorkItem literal so mars-worker compiles in the native container build.
Emit LOADGEN_PROGRESS JSON during measured runs so external tools can display true live stress-test metrics. The progress schema is credential-free, contains window_s=60, cumulative counts/bytes, window RPS/MiB/s, and raw ready latency samples for downstream rolling p95 aggregation. LOADGEN_SUMMARY remains unchanged; LOADGEN_PROGRESS_INTERVAL_MS=0 disables progress. cargo test -p loadgen passed.
- Fix pyo3_round_trip compile error by adding explicit callback_url: None - Strengthen test to verify metadata is serialized into PyO3 payload - Add Python assertions that fail if metadata field is missing or wrong - Test now fails if metadata is removed from serialization (verified) Addresses task 2 from lumi-test-deploy plan.
Tests prove: - Trusted metadata can override datacube and options per request - Client request fields (polytope_mars, metadata, pre_path, use_catalogue) do not influence config - Two sequential requests on one datasource use different metadata blocks - self.config is never mutated by metadata overlay - change_grids() intra-dataset refinement runs after metadata overlay - Validation errors for non-dict metadata values All 9 tests pass. Completes task 3 from lumi-test-deploy plan.
Add a wall-clock duration run mode to the in-cluster Rust loadgen so it can sustain steady offered load until a deadline instead of running a fixed iteration count. - RunLimit parses LOADGEN_DURATION_S (+ optional LOADGEN_RPS); iteration mode preserved when duration is unset. - rps!=null: open-model scheduler at target rate, capped by concurrency, counting missed_starts/backpressure rather than queueing unbounded work. - rps==null: closed-loop concurrency fill until deadline. - Stops scheduling at the deadline, then gracefully drains in-flight submit/poll/download cycles. - Summary/config adds run_limit, target_duration_s, submission/drain durations, scheduled, missed_starts, and time-bucketed metrics (throughput/error/bytes/ready-p95) for early-vs-late degradation checks. - Security: auth header / POLYTOPE_EMAIL / POLYTOPE_KEY excluded from summary/progress/config serialisation, with tests asserting no leakage.
The single-FE-pool LUMI deployment supplies every per-dataset datacube/options
block via job metadata (set_metadata action), so the static pool config has no
'options' key. The merge did polytope_mars_config['options'].update(...), which
raised KeyError: 'options' on every FE request, and it left the metadata-supplied
pre_path (a list of axis names) unconverted. setdefault('options', {}) and build
the per-request pre_path dict from the metadata-supplied (or base) axis list.
… unset A multi-broker worker_server needs advertised_addr (from POD_IP) so workers send completions to the dispatching broker. If POD_IP is missing the broker advertised no direct callback and workers silently fell back to the load-balanced URL, misrouting completions and stranding jobs. Refuse to start instead (+test).
Logs worker.callback.base.resolved (outcome=direct|lb_fallback) so we can see at runtime whether a worker is posting completions to the broker pod IP or the LB.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Changes
Load generator
FE worker
Workers / broker
worker.job.completed.Frontend
Polytope-Mock-Userheader for synthetic user impersonation; requirePolytope-Mock-Roleswhen set.Observability
test-worker
response_bytes_choicessampling.max_response_bytescap (response is streamed).CI
cargo fmt --all -- --checkcargo test -p polytope-servercargo test -p polytope-server-integration-tests