fix(http_utils): disable httpx keepalive to spread load across uvicorn workers by rmfan · Pull Request #29 · LLM360/miles

rmfan · 2026-05-29T21:50:06Z

Summary

init_http_client builds a process-wide httpx.AsyncClient singleton and leaves HTTP/1.1 keepalive at its default (enabled). When that client targets a uvicorn --workers N server, all /run traffic gets pinned to the small subset of workers that originally won the accept() race for the pooled TCP connections, because:

uvicorn's multi-worker supervisor binds one listening socket in the parent (uvicorn/config.py: bind_socket sets SO_REUSEADDR only, not SO_REUSEPORT) and shares its fd with all worker children.
Workers race on accept() against the shared listen queue. Dispatch is per-TCP-connection, not per-request. Once a connection lands on worker N, every HTTP/1.1 keepalive request on that connection stays on worker N for the connection's lifetime.
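There is no work-stealing between workers, so an idle worker cannot take over requests arriving on a connection that a busy worker owns.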
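The pinning effect described above can be illustrated with a toy simulation. This is only a sketch: it models the accept() race as a uniform random choice among workers, which is a simplifying assumption (real kernels may favor some workers), and the function name and parameters are hypothetical, not taken from the codebase.

```python
import random
from collections import Counter


def simulate(n_workers, n_requests, pool_size, keepalive, seed=0):
    """Count requests handled per worker under a toy dispatch model."""
    rng = random.Random(seed)
    # With keepalive, each pooled connection is owned by whichever worker
    # won accept() when the connection was first opened.
    pool_owner = [rng.randrange(n_workers) for _ in range(pool_size)]
    hits = Counter()
    for i in range(n_requests):
        if keepalive:
            # Reused connection: the request stays on the owning worker.
            worker = pool_owner[i % pool_size]
        else:
            # Fresh connection per request: a new accept() race each time.
            worker = rng.randrange(n_workers)
        hits[worker] += 1
    return hits


pinned = simulate(n_workers=32, n_requests=10_000, pool_size=4, keepalive=True)
spread = simulate(n_workers=32, n_requests=10_000, pool_size=4, keepalive=False)
print(len(pinned), "workers used with keepalive")
print(len(spread), "workers used without keepalive")
```

With keepalive the number of workers used can never exceed the number of pooled connections, regardless of how many requests are sent.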

max_keepalive_connections=0 closes the TCP connection after each response, so every /run runs its own accept() race and load actually spreads across workers.

Observed impact (harbor_server, RL360 slurm_job 1694138, 2026-05-29)

Distinct workers per minute that executed _run_inflight += 1 at least once:

stat	value
min	1
p50	3
p95	6
max	32
n_minutes	164

That is, 75% of minutes used ≤3 of the 32 workers. A single-worker peak of inflight_after_acquire=32 (the per-worker Semaphore(max_concurrent=32) cap) coincided with n_workers_active=2, meaning the cluster's effective ceiling was 2 × 32 = 64 trials, not the nominal 32 × 32 = 1024. The other 30 workers sat idle.
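The ceiling arithmetic can be written out as a short calculation. The function name is hypothetical; the inputs (2 active workers, 32 total workers, a per-worker cap of 32) are the figures quoted above.

```python
def effective_ceiling(n_workers_active: int, per_worker_cap: int) -> int:
    # Each active worker admits at most per_worker_cap concurrent trials,
    # so the ceiling scales with the number of workers receiving traffic.
    return n_workers_active * per_worker_cap


observed = effective_ceiling(n_workers_active=2, per_worker_cap=32)
nominal = effective_ceiling(n_workers_active=32, per_worker_cap=32)
print(observed, nominal, observed / nominal)
```

Under these figures the server could reach only 1/16 of its nominal concurrency.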

Source instrumentation: harbor_server.py:49 (module-level _run_inflight), harbor_server.py:1107-1113 (acquire + counter), log_format.py:54,147 (per-record pid).
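One way the per-minute statistic could be derived from per-record pids is sketched below. The record shape, a pair of ISO timestamp and pid, is an assumption for illustration and is not the actual log_format.py schema. The percentile uses the nearest-rank definition, since the source does not say which definition produced the table.

```python
import math
from collections import defaultdict
from datetime import datetime


def distinct_workers_per_minute(records):
    """records: iterable of (iso_timestamp, pid) pairs (hypothetical shape)."""
    pids_by_minute = defaultdict(set)
    for ts, pid in records:
        # Truncate the timestamp to its minute and collect the pids seen.
        minute = datetime.fromisoformat(ts).replace(second=0, microsecond=0)
        pids_by_minute[minute].add(pid)
    return {minute: len(pids) for minute, pids in pids_by_minute.items()}


def nearest_rank(values, pct):
    """Nearest-rank percentile of a non-empty list, pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(counts):
    return {
        "min": min(counts),
        "p50": nearest_rank(counts, 50),
        "p95": nearest_rank(counts, 95),
        "max": max(counts),
        "n_minutes": len(counts),
    }


records = [
    ("2026-05-29T21:50:06", 101),
    ("2026-05-29T21:50:40", 101),
    ("2026-05-29T21:50:59", 102),
    ("2026-05-29T21:51:01", 101),
]
per_minute = distinct_workers_per_minute(records)
print(summarize(list(per_minute.values())))
```

The sample records are made up; they only show that two pids in one minute count as two active workers.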

Risk

Cost per request: one extra TCP handshake. In-VPC this is ~1 ms, which is negligible compared to the per-trial /run latency (seconds to minutes).
No API change: still the same _http_client.post(...) interface.
Connection-pool sizing: max_connections=_client_concurrency is unchanged, so the high-water concurrency cap is the same.
Scope: only affects the main _http_client singleton. The Ray-distributed _HttpPosterActor path (http_utils.py:265-266) has the same pattern and likely the same issue. It is left out of this PR because (a) the user request was specifically about the main client and (b) it's gated behind use_distributed_post. Worth a follow-up if the deployment uses it.

Test plan

Confirm no regression in single-trial /run latency
Re-run the 800-task verification workload that surfaced the imbalance and re-query ~/scripts/athena_harbor_samples.py. Expect n_workers_active p50 to climb from 3 toward 32 and wait_secs p99 to drop substantially
ss -tn dport = :<harbor_port> on the caller during a hot minute should show short-lived rather than long-lived connections

🤖 Generated with Claude Code

…n workers

A pooled httpx.AsyncClient against a uvicorn --workers N server pins all requests to the small subset of workers that accept()-won the pooled TCP connections (uvicorn shares one listen socket across workers; no SO_REUSEPORT, no work-stealing). Observed in a harbor_server run: n_workers_active = 2 of 32 for most minutes, with those 2 workers saturated at their per-process Semaphore cap while the other 30 sat idle. Setting max_keepalive_connections=0 closes the TCP after each response, so every /run gets its own accept() race and load spreads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

rmfan requested a review from a team as a code owner May 29, 2026 21:50

rmfan wants to merge 1 commit into prod from fix/http-client-no-keepalive
