
fix(http_utils): disable httpx keepalive to spread load across uvicorn workers#29

Open
rmfan wants to merge 1 commit into prod from fix/http-client-no-keepalive



rmfan commented May 29, 2026

Summary

init_http_client builds a process-wide httpx.AsyncClient singleton that leaves HTTP/1.1 keepalive at its default (enabled). When that client targets a uvicorn --workers N server, all /run traffic gets pinned to the small subset of workers that won the accept() race for the pooled TCP connections, because:

  • uvicorn's multi-worker supervisor binds one listening socket in the parent (uvicorn/config.py: bind_socket sets SO_REUSEADDR only, not SO_REUSEPORT) and shares its fd with all worker children.
  • Workers race on accept() against the shared listen queue. Dispatch is per-TCP-connection, not per-request. Once a connection lands on a given worker, every HTTP/1.1 keepalive request on that connection stays on that worker for the connection's lifetime.
  • There is no work-stealing between workers, so an idle worker cannot take over requests arriving on a busy worker's connections.
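
The pinning effect described in the bullets above can be illustrated with a toy simulation. This is a sketch under stated assumptions, not uvicorn's actual behavior: it assumes each new connection is accepted by a uniformly random worker (real accept() races are usually less even), and the pool size of 3 and the function name simulate are hypothetical values chosen for illustration.

```python
import random
from collections import Counter


def simulate(n_workers, n_requests, pool_size, keepalive, seed=0):
    """Return a Counter mapping worker index -> requests handled.

    Toy model (an assumption): every new TCP connection is accepted by a
    uniformly random worker. With keepalive, the client reuses a fixed pool
    of connections, so a request lands on whichever worker accepted the
    connection it travels on.
    """
    rng = random.Random(seed)
    handled = Counter()
    if keepalive:
        # Connections are accepted once; their owning workers never change.
        pool = [rng.randrange(n_workers) for _ in range(pool_size)]
        for _ in range(n_requests):
            handled[rng.choice(pool)] += 1
    else:
        # Each request opens a fresh connection and re-runs the accept() race.
        for _ in range(n_requests):
            handled[rng.randrange(n_workers)] += 1
    return handled


pinned = simulate(n_workers=32, n_requests=10_000, pool_size=3, keepalive=True)
spread = simulate(n_workers=32, n_requests=10_000, pool_size=3, keepalive=False)
print(len(pinned), "workers used with keepalive")
print(len(spread), "workers used without keepalive")
```

With keepalive the number of busy workers can never exceed the number of pooled connections, regardless of how many requests are sent.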

max_keepalive_connections=0 closes the TCP connection after each response, so every /run opens a new connection that goes through its own accept() race, and load actually spreads across workers.
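
A minimal configuration sketch of the change, assuming the client is constructed roughly like this. The function name build_no_keepalive_client is hypothetical, and the real init_http_client may pass further options (timeouts, headers) that are not shown in this PR description.

```python
import httpx


def build_no_keepalive_client(client_concurrency: int) -> httpx.AsyncClient:
    """Build an AsyncClient that never returns connections to the idle pool.

    Hypothetical helper for illustration; only the two Limits fields below
    are taken from the PR description.
    """
    limits = httpx.Limits(
        # Unchanged: the cap on simultaneously open connections.
        max_connections=client_concurrency,
        # New: keep zero idle connections, so each response closes its socket.
        max_keepalive_connections=0,
    )
    return httpx.AsyncClient(limits=limits)
```

Callers keep using the same post(...) interface; only the connection lifetime changes.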

Observed impact (harbor_server, RL360 slurm_job 1694138, 2026-05-29)

Distinct workers per minute that executed _run_inflight += 1 (i.e. admitted at least one /run):

| stat | value |
|---|---|
| min | 1 |
| p50 | 3 |
| p95 | 6 |
| max | 32 |
| n_minutes | 164 |

i.e. in half of the minutes at most 3 of the 32 workers handled any /run, and in 95% of the minutes at most 6 did. A single-worker peak of inflight_after_acquire=32 (the per-worker Semaphore(max_concurrent=32) cap) showed up against n_workers_active=2, meaning the cluster's effective ceiling at that point was 2 × 32 = 64 trials, not the nominal 32 × 32 = 1024. The other 30 workers sat idle.

Source instrumentation: harbor_server.py:49 (module-level _run_inflight), harbor_server.py:1107-1113 (acquire + counter), log_format.py:54,147 (per-record pid).
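
For reference, a summary of this shape can be derived from per-record pids roughly as follows. This is a sketch only: the (minute, pid) record layout, the toy data, and the nearest-rank percentile are assumptions, and the actual query may compute percentiles differently.

```python
from collections import defaultdict
from math import ceil


def workers_per_minute(records):
    """Return the sorted count of distinct pids seen in each minute.

    records: iterable of (minute, pid) pairs, one per /run acquire
    (an assumed layout, not the real log schema).
    """
    pids_by_minute = defaultdict(set)
    for minute, pid in records:
        pids_by_minute[minute].add(pid)
    return sorted(len(pids) for pids in pids_by_minute.values())


def percentile(sorted_values, q):
    """Nearest-rank percentile for q in (0, 100]."""
    rank = max(1, ceil(q / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


# Toy records, invented for illustration only.
records = [
    ("21:00", 101), ("21:00", 101), ("21:00", 102),
    ("21:01", 101),
    ("21:02", 101), ("21:02", 102), ("21:02", 103),
]
counts = workers_per_minute(records)
summary = {
    "min": counts[0],
    "p50": percentile(counts, 50),
    "p95": percentile(counts, 95),
    "max": counts[-1],
    "n_minutes": len(counts),
}
print(summary)
```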

Risk

  • Cost per request: one extra TCP handshake. In-VPC this is ~1ms — negligible compared to the per-trial /run latency (seconds to minutes).
  • No API change: still the same _http_client.post(...) interface.
  • Connection-pool sizing: max_connections=_client_concurrency is unchanged, so the high-water concurrency cap is the same.
  • Scope: only affects the main _http_client singleton. The Ray-distributed _HttpPosterActor path (http_utils.py:265-266) has the same pattern and likely the same issue — left out of this PR because (a) the user request was specifically the main client and (b) it's gated behind use_distributed_post. Worth a follow-up if the deployment uses it.
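
The handshake-cost estimate in the first bullet can be checked empirically with a small timing sketch. The helper name time_tcp_connect is hypothetical, and the demo below targets a throwaway loopback listener, so its numbers say nothing about in-VPC latency; point it at the real host and port to measure that.

```python
import socket
import time


def time_tcp_connect(host, port, samples=20):
    """Return TCP connect latencies in milliseconds, one per sample."""
    latencies_ms = []
    for _ in range(samples):
        start = time.perf_counter()
        sock = socket.create_connection((host, port), timeout=5)
        # Stop the clock once the handshake has completed, before closing.
        latencies_ms.append((time.perf_counter() - start) * 1000)
        sock.close()
    return latencies_ms


# Demo target: a local listener whose backlog is larger than the sample
# count, so every handshake completes without an accept() call.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(64)
host, port = listener.getsockname()

latencies = time_tcp_connect(host, port, samples=20)
listener.close()

median_ms = sorted(latencies)[len(latencies) // 2]
print(f"median connect time: {median_ms:.3f} ms")
```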

Test plan

  • Confirm no regression in single-trial /run latency
  • Re-run the 800-task verification workload that surfaced the imbalance and re-query ~/scripts/athena_harbor_samples.py — expect n_workers_active p50 to climb from 3 toward 32, wait_secs p99 to drop substantially
  • ss -tn dport = :<harbor_port> on the caller during a hot minute should show short-lived rather than long-lived connections

🤖 Generated with Claude Code

…n workers

A pooled httpx.AsyncClient against a uvicorn --workers N server pins all
requests to the small subset of workers that accept()-won the pooled TCP
connections (uvicorn shares one listen socket across workers; no
SO_REUSEPORT, no work-stealing). Observed in a harbor_server run:
n_workers_active = 2 of 32 for most minutes, with those 2 workers
saturated at their per-process Semaphore cap while the other 30 sat idle.

Setting max_keepalive_connections=0 closes the TCP after each response,
so every /run gets its own accept() race and load spreads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@rmfan rmfan requested a review from a team as a code owner May 29, 2026 21:50
