[TENT] add priority and byte credits to runtime queue dispatch #2655
zbchi wants to merge 15 commits
Code Review
This pull request introduces a weighted byte-deficit scheduler with priority aging to the admission queue, adds bounded queue support to the proxy manager shards, and optimizes the progress worker loop. The review feedback highlights critical concurrency issues in ProxyManager where stage_buffers_ is accessed without synchronization, a logic bug in ProgressWorker that could stall the runtime queue, load-balancing limitations due to the use of thread_local in shard selection, and potential crashes from unhandled JSON parsing exceptions in the control plane.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Description
#2132
This PR extends the TENT runtime queue from FIFO dispatch to priority-aware dispatch with byte credits, and tightens the staging path when queue pressure builds up.
Queued owners are picked by `Request.priority`. User work and staging-internal work keep separate lanes inside each priority, so internal staging traffic can be accounted for separately without overriding the original request priority. Medium and low priority owners can also age into high priority through config.

Large queued owners now need enough byte credit before dispatch. This keeps dispatch charged by transfer size, not only owner count, while preserving FIFO order within each lane.
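For readers who want a concrete picture of the lane, aging, and byte-credit rules described above, here is a minimal sketch. It is not the PR's implementation: `SketchQueue`, `Owner`, `Lane`, `quantum_bytes`, and `aging_ticks` are hypothetical names, the credit rule is a simple deficit-style counter chosen for illustration, and the code assumes C++17 and a monotonically increasing tick.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <deque>
#include <optional>

// Hypothetical types for illustration only; not the PR's definitions.
enum Priority { PRIO_HIGH = 0, PRIO_MEDIUM = 1, PRIO_LOW = 2 };
enum Lane { LANE_USER = 0, LANE_INTERNAL = 1 };

struct Owner {
    uint64_t id;
    uint64_t bytes;         // transfer size charged against byte credit
    Priority priority;      // original request priority, never rewritten
    Lane lane;              // user work vs. staging-internal work
    uint64_t enqueue_tick;  // used for aging
};

class SketchQueue {
public:
    SketchQueue(uint64_t quantum_bytes, uint64_t aging_ticks)
        : quantum_(quantum_bytes), aging_ticks_(aging_ticks) {
        assert(quantum_bytes > 0);  // a zero quantum would never dispatch
    }

    void push(Owner o) { lanes_[o.priority][o.lane].fifo.push_back(o); }

    // Returns the next dispatchable owner, or nullopt if every non-empty
    // lane is still saving up byte credit for its head owner.
    std::optional<Owner> pop(uint64_t now) {
        age(now);
        for (auto& prio : lanes_) {        // high, then medium, then low
            for (auto& lane : prio) {      // user lane, then internal lane
                if (lane.fifo.empty()) continue;
                lane.credit += quantum_;   // earn credit on each visit
                if (lane.credit < lane.fifo.front().bytes) continue;
                Owner o = lane.fifo.front();  // FIFO within the lane
                lane.fifo.pop_front();
                lane.credit = lane.fifo.empty() ? 0 : lane.credit - o.bytes;
                return o;
            }
        }
        return std::nullopt;
    }

private:
    struct LaneState {
        std::deque<Owner> fifo;
        uint64_t credit = 0;
    };

    // Move medium/low heads that waited long enough to the back of the
    // high-priority FIFO of the same lane.
    void age(uint64_t now) {
        for (int p = PRIO_MEDIUM; p <= PRIO_LOW; ++p) {
            for (int l = 0; l < 2; ++l) {
                auto& from = lanes_[p][l];
                while (!from.fifo.empty() &&
                       now - from.fifo.front().enqueue_tick >= aging_ticks_) {
                    lanes_[PRIO_HIGH][l].fifo.push_back(from.fifo.front());
                    from.fifo.pop_front();
                }
                if (from.fifo.empty()) from.credit = 0;
            }
        }
    }

    uint64_t quantum_;
    uint64_t aging_ticks_;
    std::array<std::array<LaneState, 2>, 3> lanes_{};
};
```

With a 4096-byte quantum, a 10000-byte owner at the head of a lane is skipped on two passes and dispatched on the third, while 4KB owners dispatch on the first pass. That is the sense in which dispatch is charged by transfer size and not only by owner count.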
The staging path now reports pressure back to the runtime queue. `ProxyManager::submit()` has a bounded shard queue and returns `TooManyRequests` when the shard is full. Runtime-queued staged owners are requeued on temporary proxy pressure; non-retryable proxy submit failures are completed as failures. Stage-buffer pin errors are returned as status errors instead of aborting the process.

The latest changes also reduce runtime queue overhead by keeping public task lookup state per batch and draining progress-worker notifications in batches, avoiding extra map work and repeated progress passes.
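The backpressure path described above (bounded shard queue, requeue on temporary pressure, fail on non-retryable errors) can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `BoundedShard`, `StagingTask`, `Outcome`, `dispatch_staged`, and the `Status` enum are hypothetical stand-ins, and requeuing at the front of the runtime queue is one possible choice for keeping FIFO order.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>

// Hypothetical stand-ins for illustration; not the PR's types.
enum class Status { OK, TooManyRequests, InvalidArgument };

struct StagingTask {
    int id;
    bool valid;  // models a non-retryable submit error when false
};

class BoundedShard {
public:
    explicit BoundedShard(std::size_t max_queued) : max_queued_(max_queued) {}

    // Rejects instead of growing without bound when the shard is full.
    Status submit(const StagingTask& t) {
        if (!t.valid) return Status::InvalidArgument;
        std::lock_guard<std::mutex> lock(mu_);
        if (queue_.size() >= max_queued_) return Status::TooManyRequests;
        queue_.push_back(t);
        return Status::OK;
    }

    // Called by the shard worker to take the next task.
    bool pop(StagingTask& out) {
        std::lock_guard<std::mutex> lock(mu_);
        if (queue_.empty()) return false;
        out = queue_.front();
        queue_.pop_front();
        return true;
    }

private:
    std::mutex mu_;
    std::deque<StagingTask> queue_;
    std::size_t max_queued_;
};

enum class Outcome { Dispatched, Requeued, Failed };

// Runtime-queue side: temporary pressure requeues the owner, anything
// else that is not OK completes the owner as a failure.
inline Outcome dispatch_staged(BoundedShard& shard, const StagingTask& t,
                               std::deque<StagingTask>& runtime_queue) {
    switch (shard.submit(t)) {
        case Status::OK:
            return Outcome::Dispatched;
        case Status::TooManyRequests:
            runtime_queue.push_front(t);  // assumption: retry first next pass
            return Outcome::Requeued;
        default:
            return Outcome::Failed;
    }
}
```

The point of the split is that only `TooManyRequests` is treated as retryable. Every other error surfaces to the caller as a failed owner instead of looping in the queue.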
Main changes:
- `PRIO_HIGH`, `PRIO_MEDIUM`, and `PRIO_LOW` priority levels
- dispatch by `Request.priority`
- bounded `ProxyManager` shard queues with `staging/max_queued_tasks_per_shard`
- `TooManyRequests` when a shard queue is full

The runtime queue remains disabled by default, and public transfer APIs are unchanged.
Module
mooncake-transfer-engine

How Has This Been Tested?
I also ran two Aliyun eRDMA benchmark checks with TENT tebench (16 vCPU / 64GB RAM).
First, I compared direct submit with the runtime queue path on a steady 4KB workload:
Results, 4KB request size, batch size 64, 1 thread, 5s per run:
In this steady small-request run, the runtime queue path was effectively even with direct submit for throughput and average latency.
I also ran a queue-specific burst benchmark from `tent-runtime-queue-bench` against both the direct path and the runtime queue path. This uses the same workload for both modes: each burst submits low-priority work first, then high-priority work, and only polls completions after the burst is submitted.

For the runtime queue run, the dispatch window was capped at 64 owners. That is large enough to keep the transport busy, but still leaves backlog for the queue to schedule.
Runtime queue settings for this run:

    {
      "enable_runtime_queue": true,
      "enable_progress_worker": true,
      "runtime_queue": {
        "max_dispatch_owners": 64,
        "max_dispatch_bytes": 1073741824
      }
    }

Results, 4KB request size, batch size 16, burst depth 512, 4 threads, 5s per run, averaged over 5 runs:
`Throughput Lat` is the latency derived from total throughput. The `Batch Tx` columns measure each burst batch from submit time to observed completion, so they include backlog and polling delay.

In this burst-backlog workload, the runtime queue traded 6% throughput for lower observed batch completion latency and better high-priority latency. This is expected because the queue keeps the transport in-flight window bounded instead of flooding the whole burst into the transport at once. High-priority completion latency dropped from 32.3 ms to 13.3 ms on average, and high-priority P99 dropped from 36.9 ms to 24.4 ms.
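As a rough illustration of what a bounded in-flight window means here, the sketch below admits owners only while both an owner cap and a byte cap hold. The field names mirror the `max_dispatch_owners` and `max_dispatch_bytes` settings used in this run, but `DispatchWindow`, `try_admit`, and `complete` are hypothetical, and the rule that a single oversized owner may pass when nothing is in flight is an assumption made to keep the sketch deadlock-free.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical sketch of a dispatch window; not the PR's implementation.
struct DispatchWindow {
    uint64_t max_owners;  // mirrors runtime_queue.max_dispatch_owners
    uint64_t max_bytes;   // mirrors runtime_queue.max_dispatch_bytes
    uint64_t inflight_owners = 0;
    uint64_t inflight_bytes = 0;

    // Returns true and accounts for the owner if it fits in the window.
    bool try_admit(uint64_t bytes) {
        if (inflight_owners >= max_owners) return false;
        // Assumption: an owner larger than max_bytes is still admitted when
        // the window is empty, so it cannot block the queue forever.
        if (inflight_owners > 0 && inflight_bytes + bytes > max_bytes) {
            return false;
        }
        ++inflight_owners;
        inflight_bytes += bytes;
        return true;
    }

    // Called when the transport reports completion for an admitted owner.
    void complete(uint64_t bytes) {
        assert(inflight_owners > 0 && inflight_bytes >= bytes);
        --inflight_owners;
        inflight_bytes -= bytes;
    }
};
```

With the settings from this run and a burst of 512 owners of 4KB each, such a window would hand 64 owners to the transport and leave the rest queued, where priority ordering can still act on them. That is consistent with the lower high-priority latency reported above.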
Checklist
- `./scripts/code_format.sh`
- `pre-commit run --all-files` and all hooks pass

AI Assistance Disclosure
Claude Code was used to assist with implementation and review. I reviewed and validated the final changes myself.