feat(client): share adaptive concurrency controllers per (session, endpoint) #871

Open
assafvayner wants to merge 3 commits into main from worktree-cosmic-foraging-glade
@assafvayner assafvayner commented Jun 11, 2026

Contributor

Human Summary:

All downloads and all uploads within the same XetSession now share one concurrency controller per direction (this covers all download groups and all upload commit builders under that session).

Two download groups or two upload commits running concurrently share the same semaphores for downloading segments or uploading objects. More importantly, they cannot interfere with each other.

When one group or upload finishes and another starts, the next one uses the previous adaptive concurrency state as its baseline, so there is no need to ramp back up (or down).

Uses the dynamic cache system to allocate the adaptors. The adaptor is per endpoint, for the very rare case where we end up hitting two separate endpoints.

A minor refactor was required: the adaptive concurrency controller held a XetContext Arc, which under the new additions would cause a circular Arc dependency (a guaranteed memory leak). Only the XetConfig was used, so the controller now holds the XetConfig instead of the XetContext.

Summary

  • Adds a runtime-scoped, endpoint-keyed cache of AdaptiveConcurrencyControllers (xet_client/src/cas_client/adaptive_concurrency/cache.rs), stored via the existing XetCommon::cache_get_or_create mechanism.
  • RemoteClient::new_with_socket now fetches its upload/download controllers from this cache instead of constructing fresh ones, so all upload commits, file download groups, and download stream groups created from one XetSession and targeting the same CAS endpoint share one adaptive concurrency model and one concurrency limit (uploads and downloads remain independent).
  • AdaptiveConcurrencyController now stores Arc<XetConfig> instead of a full XetContext. Since the cache lives inside XetCommon, a controller holding XetContext (which owns Arc<XetCommon>) would form a strong reference cycle that keeps each session's XetCommon alive for the process lifetime. The controller only ever read ctx.config, so this drops the over-broad dependency; a regression test pins the property (drop the context, assert Weak<XetCommon> no longer upgrades).
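The sharing rules in the bullets above can be illustrated with a small standalone sketch. This is not the code in cache.rs: `Controller`, `ControllerCache`, and the endpoint strings are hypothetical stand-ins, and the real implementation goes through XetCommon::cache_get_or_create rather than a hand-rolled map. It only shows the get-or-create shape: one map per direction, keyed by endpoint, owned by the session.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for the real AdaptiveConcurrencyController.
#[derive(Debug)]
pub struct Controller {
    pub max_concurrency: usize,
}

type EndpointMap = Mutex<HashMap<String, Arc<Controller>>>;

/// Hypothetical session-scoped cache: separate maps for uploads and
/// downloads, each keyed by endpoint.
#[derive(Default)]
pub struct ControllerCache {
    upload: EndpointMap,
    download: EndpointMap,
}

impl ControllerCache {
    /// Return the cached controller for `endpoint`, creating it on first use.
    fn get_or_create(map: &EndpointMap, endpoint: &str, max: usize) -> Arc<Controller> {
        let mut guard = map.lock().unwrap();
        let controller = guard
            .entry(endpoint.to_owned())
            .or_insert_with(|| Arc::new(Controller { max_concurrency: max }))
            .clone();
        controller
    }

    pub fn upload(&self, endpoint: &str, max: usize) -> Arc<Controller> {
        Self::get_or_create(&self.upload, endpoint, max)
    }

    pub fn download(&self, endpoint: &str, max: usize) -> Arc<Controller> {
        Self::get_or_create(&self.download, endpoint, max)
    }
}

fn main() {
    let session = ControllerCache::default();
    let other_session = ControllerCache::default();
    let endpoint = "https://cas.example"; // placeholder endpoint

    // Same session + same endpoint: one shared controller.
    let a = session.upload(endpoint, 64);
    let b = session.upload(endpoint, 64);
    assert!(Arc::ptr_eq(&a, &b));

    // Different endpoint, different direction, or different session: distinct.
    assert!(!Arc::ptr_eq(&a, &session.upload("https://cas2.example", 64)));
    assert!(!Arc::ptr_eq(&a, &session.download(endpoint, 64)));
    assert!(!Arc::ptr_eq(&a, &other_session.upload(endpoint, 64)));
    println!("sharing matrix holds");
}
```

The assertions in `main` mirror the sharing matrix listed in the test plan (same ctx+endpoint shares, everything else is distinct).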

Behavior change

Previously each XetUploadCommit / download group builder ramped its own controller independently, so a session's effective connection ceiling was N builders × ac_max_*_concurrency with the controllers competing blindly. Now ac_max_upload_concurrency / ac_max_download_concurrency bound the total in-flight requests per (session, endpoint), and later builders warm-start from the already-learned concurrency level instead of cold-starting.
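To make the ceiling arithmetic concrete, here is a rough sketch using a fixed-size limiter. `Limiter` and `total_admitted` are hypothetical names introduced for illustration; the real controller adapts its limit over time, which this sketch deliberately ignores. It only compares N independent limiters against one shared limiter when no request ever completes.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical fixed-size limiter standing in for a controller's semaphore.
pub struct Limiter {
    max: usize,
    in_flight: AtomicUsize,
}

impl Limiter {
    pub fn new(max: usize) -> Arc<Self> {
        Arc::new(Self { max, in_flight: AtomicUsize::new(0) })
    }

    /// Admit one request if fewer than `max` are currently in flight.
    pub fn try_acquire(&self) -> bool {
        self.in_flight
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                (n < self.max).then_some(n + 1)
            })
            .is_ok()
    }
}

/// Each builder tries to start `attempts` requests and none of them finish.
/// Returns how many requests were admitted in total.
pub fn total_admitted(builders: &[Arc<Limiter>], attempts: usize) -> usize {
    builders
        .iter()
        .map(|limiter| (0..attempts).filter(|_| limiter.try_acquire()).count())
        .sum()
}

fn main() {
    let max = 4;

    // Old behavior: three builders, each with its own limiter.
    let independent = [Limiter::new(max), Limiter::new(max), Limiter::new(max)];
    println!("independent: {}", total_admitted(&independent, 10)); // prints 12

    // New behavior: three builders sharing one limiter.
    let shared = Limiter::new(max);
    let sharing = [shared.clone(), shared.clone(), shared];
    println!("shared: {}", total_admitted(&sharing, 10)); // prints 4
}
```

With three builders and a maximum of 4, the independent case admits 3 × 4 = 12 requests while the shared case admits 4, which is the difference between the old N × ac_max_*_concurrency ceiling and the new per-(session, endpoint) bound.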

Notes:

  • File-level limits (max_concurrent_file_ingestion / max_concurrent_file_downloads) were already session-shared via XetCommon and are untouched.
  • LocalClient / MemoryClient intentionally keep per-instance controllers.
  • The simulation local_server test path creates two RemoteClients on one ctx+endpoint and now shares controllers between them. This was audited: no test asserts independent ramping.

Test Plan

  • New unit tests for the cache: same ctx+endpoint shares (Arc::ptr_eq), different endpoint distinct, different ctx distinct, upload vs download distinct
  • New regression test that cached controllers don't keep XetCommon alive (no Arc cycle)
  • New RemoteClient test covering the sharing matrix end-to-end through the constructor, including non-eviction when a second endpoint is added
  • cargo test -p xet-client: 138 passed, 0 failed (4 network tests ignored as usual)
  • cargo check -p xet-data -p hf-xet clean; cargo +nightly fmt + cargo clippy -p xet-client --all-targets show no warnings in changed files

Note

Medium Risk
Changes how concurrent uploads/downloads compete within a session (global semaphore per direction/endpoint instead of per RemoteClient), which can alter throughput and ramp-up behavior for multi-group workloads.

Overview
Session-wide adaptive concurrency for RemoteClient: upload and download AdaptiveConcurrencyController instances are now looked up from a new endpoint-keyed cache on XetCommon (cache.rs) instead of being created per client. All clients in the same XetSession and CAS endpoint share one upload and one download controller (separate maps), so ac_max_*_concurrency caps total in-flight CAS requests and later work reuses the learned limit instead of cold-starting per upload commit or download group.

Refactor to avoid leaks: AdaptiveConcurrencyController stores Arc<XetConfig> instead of XetContext, because caching controllers inside XetCommon would otherwise create an Arc cycle (XetContext → XetCommon → controller → XetContext). Tests assert sharing by ctx/endpoint, non-eviction when adding a second endpoint, and that dropping the context releases XetCommon.

Reviewed by Cursor Bugbot for commit 22b55fe.

AdaptiveConcurrencyController only ever read ctx.config, but storing the
full XetContext gave cached controllers a strong Arc<XetCommon> back-edge:
XetCommon -> runtime_cache -> controller -> ctx.common, keeping every
session's XetCommon alive for the process lifetime. Store Arc<XetConfig>
instead, and pin the property with a Weak-upgrade regression test.
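
The cycle described in this commit message, and the Weak-upgrade check that pins its absence, can be reproduced in miniature. The types below (`Config`, `Common`, `Context`, `Controller`, `Held`) are simplified hypothetical stand-ins, not the real XetConfig / XetCommon / XetContext definitions, and the actual regression test in the PR may be structured differently.

```rust
use std::sync::{Arc, Mutex, Weak};

#[allow(dead_code)]
struct Config {
    max_concurrency: usize,
}

/// Stand-in for XetCommon: owns the cache of controllers.
struct Common {
    cache: Mutex<Vec<Arc<Controller>>>,
}

/// Stand-in for XetContext: owns strong references to config and common.
#[derive(Clone)]
struct Context {
    config: Arc<Config>,
    common: Arc<Common>,
}

/// What the controller keeps alive: the whole context (old) or only the config (new).
#[allow(dead_code)]
enum Held {
    FullContext(Context),
    ConfigOnly(Arc<Config>),
}

#[allow(dead_code)]
struct Controller {
    held: Held,
}

/// Cache one controller inside `Common`, drop the context, and report whether
/// `Common` is still alive. `true` means a strong reference cycle leaked it.
pub fn common_outlives_context(store_full_context: bool) -> bool {
    let ctx = Context {
        config: Arc::new(Config { max_concurrency: 8 }),
        common: Arc::new(Common { cache: Mutex::new(Vec::new()) }),
    };
    let held = if store_full_context {
        Held::FullContext(ctx.clone()) // back-edge: controller -> ctx.common
    } else {
        Held::ConfigOnly(Arc::clone(&ctx.config))
    };
    ctx.common.cache.lock().unwrap().push(Arc::new(Controller { held }));

    let weak: Weak<Common> = Arc::downgrade(&ctx.common);
    drop(ctx);
    weak.upgrade().is_some()
}

fn main() {
    // Holding the full context forms a cycle, so Common is never freed
    // (this branch leaks on purpose to demonstrate the problem).
    assert!(common_outlives_context(true));
    // Holding only the config breaks the cycle, so Common is released.
    assert!(!common_outlives_context(false));
    println!("no cycle when only the config is stored");
}
```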
@assafvayner assafvayner requested review from hoytak and seanses June 11, 2026 20:50
@assafvayner assafvayner marked this pull request as ready for review June 11, 2026 20:51