[TENT] add priority and byte credits to runtime queue dispatch #2655

Open
zbchi wants to merge 15 commits into kvcache-ai:main from zbchi:tent-queue-3

@zbchi zbchi commented Jun 28, 2026

Description

#2132
This PR extends the TENT runtime queue from FIFO dispatch to priority-aware dispatch with byte credits, and tightens the staging path when queue pressure builds up.

Queued owners are picked by Request.priority. User work and staging-internal work keep separate lanes inside each priority, so internal staging traffic can be accounted for separately without overriding the original request priority. Medium- and low-priority owners can also age into high priority; the aging is configurable.
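As a rough illustration of the selection order described above, here is a minimal sketch. All type and member names (`Owner`, `PriorityLanes`, `aging_threshold`) are hypothetical and are not taken from the PR's code. It also assumes that the two lanes inside a priority are served alternately and that aging is a single wait-time threshold; the actual policy and config keys may differ.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <deque>
#include <optional>

// Hypothetical types for illustration only.
enum Priority { PRIO_HIGH = 0, PRIO_MEDIUM = 1, PRIO_LOW = 2 };
enum Lane { LANE_USER = 0, LANE_STAGING_INTERNAL = 1 };

struct Owner {
    std::uint64_t id;
    Priority priority;  // original request priority, never rewritten
    Lane lane;
    std::chrono::steady_clock::time_point enqueued_at;
};

class PriorityLanes {
public:
    explicit PriorityLanes(std::chrono::milliseconds aging_threshold)
        : aging_threshold_(aging_threshold) {}

    void push(const Owner& o) {
        assert(o.priority >= PRIO_HIGH && o.priority <= PRIO_LOW);
        lanes_[o.priority][o.lane].push_back(o);
    }

    // Move medium/low lane heads that have waited at least the threshold
    // to the back of the high-priority lane of the same kind.
    void age(std::chrono::steady_clock::time_point now) {
        for (int p = PRIO_MEDIUM; p <= PRIO_LOW; ++p) {
            for (int l = 0; l < 2; ++l) {
                auto& q = lanes_[p][l];
                while (!q.empty() &&
                       now - q.front().enqueued_at >= aging_threshold_) {
                    lanes_[PRIO_HIGH][l].push_back(q.front());
                    q.pop_front();
                }
            }
        }
    }

    // Highest non-empty priority wins; lanes inside it alternate (assumption).
    std::optional<Owner> pop() {
        for (int p = PRIO_HIGH; p <= PRIO_LOW; ++p) {
            for (int i = 0; i < 2; ++i) {
                const int l = (next_lane_[p] + i) % 2;
                auto& q = lanes_[p][l];
                if (q.empty()) continue;
                Owner o = q.front();  // FIFO within the lane
                q.pop_front();
                next_lane_[p] = (l + 1) % 2;
                return o;
            }
        }
        return std::nullopt;
    }

private:
    std::array<std::array<std::deque<Owner>, 2>, 3> lanes_;
    std::array<int, 3> next_lane_{};
    std::chrono::milliseconds aging_threshold_;
};
```

In this sketch an aged owner changes lanes but keeps its `priority` field, which mirrors the point that owner kind and aging do not overwrite the original request priority.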

Large queued owners now need enough byte credit before dispatch. This keeps dispatch charged by transfer size, not only owner count, while preserving FIFO order within each lane.
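The byte-credit gate can be pictured with a small deficit-style sketch. This is an assumption about the general shape, not the PR's implementation: `CreditLane`, `quantum_bytes`, and the rule that an idle lane does not bank credit are all hypothetical choices made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical owner record: only the fields the sketch needs.
struct QueuedOwner {
    std::uint64_t id;
    std::uint64_t bytes;
};

class CreditLane {
public:
    explicit CreditLane(std::uint64_t quantum_bytes) : quantum_(quantum_bytes) {
        assert(quantum_bytes > 0);
    }

    void push(QueuedOwner o) { queue_.push_back(o); }

    // One scheduling round: grant a quantum of credit, then dispatch from
    // the head while the head owner is covered. The head is never skipped,
    // so FIFO order inside the lane is preserved.
    std::vector<std::uint64_t> run_round() {
        std::vector<std::uint64_t> dispatched;
        if (queue_.empty()) {
            credit_ = 0;
            return dispatched;
        }
        credit_ += quantum_;
        while (!queue_.empty() && queue_.front().bytes <= credit_) {
            credit_ -= queue_.front().bytes;
            dispatched.push_back(queue_.front().id);
            queue_.pop_front();
        }
        if (queue_.empty()) credit_ = 0;  // assumption: idle lanes keep no credit
        return dispatched;
    }

    std::uint64_t credit() const { return credit_; }

private:
    std::uint64_t quantum_;
    std::uint64_t credit_ = 0;
    std::deque<QueuedOwner> queue_;
};
```

With a 4096-byte quantum, a 10000-byte owner at the head waits for three rounds before it is dispatched, and a small owner queued behind it does not jump ahead.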

The staging path now reports pressure back to the runtime queue. ProxyManager::submit() has a bounded shard queue and returns TooManyRequests when the shard is full. Runtime-queued staged owners are requeued on temporary proxy pressure; non-retryable proxy submit failures are completed as failures. Stage-buffer pin errors are returned as status errors instead of aborting the process.
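The back-pressure flow above can be sketched as follows. `BoundedShard`, `SubmitStatus`, `OwnerAction`, and `on_submit_result` are hypothetical stand-ins; the real ProxyManager::submit() signature and status type are not shown in this description, so treat this only as an outline of the retryable versus non-retryable split.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>

// Hypothetical status and action enums for illustration.
enum class SubmitStatus { Ok, TooManyRequests, InvalidArgument };
enum class OwnerAction { InFlight, Requeue, CompleteAsFailed };

struct StagedTask {
    int id;
    bool valid;  // stands in for any non-retryable submit precondition
};

class BoundedShard {
public:
    explicit BoundedShard(std::size_t max_queued) : max_queued_(max_queued) {
        assert(max_queued > 0);
    }

    SubmitStatus submit(const StagedTask& t) {
        if (!t.valid) return SubmitStatus::InvalidArgument;
        // A full shard reports pressure instead of growing without bound.
        if (queue_.size() >= max_queued_) return SubmitStatus::TooManyRequests;
        queue_.push_back(t);
        return SubmitStatus::Ok;
    }

    // Simulates the shard worker finishing one task.
    void pop_one() {
        if (!queue_.empty()) queue_.pop_front();
    }

private:
    std::size_t max_queued_;
    std::deque<StagedTask> queue_;
};

// Runtime-queue side: temporary pressure is retried, other errors are final.
inline OwnerAction on_submit_result(SubmitStatus s) {
    switch (s) {
        case SubmitStatus::Ok:
            return OwnerAction::InFlight;
        case SubmitStatus::TooManyRequests:
            return OwnerAction::Requeue;
        default:
            return OwnerAction::CompleteAsFailed;
    }
}
```

The key design point is that only the full-shard status puts the owner back in the runtime queue; every other failure completes the owner as failed so it cannot loop forever.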

The latest changes also reduce runtime queue overhead by keeping public task lookup state per batch and draining progress-worker notifications in batches, avoiding extra map work and repeated progress passes.
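One way to picture batched draining is the sketch below. It is an assumption about the pattern, not the PR's code: `NotificationDrain` is a hypothetical class, and collapsing duplicate batch ids into one entry is a choice made here to show how repeated progress passes could be avoided.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <unordered_set>
#include <vector>

class NotificationDrain {
public:
    // Called by producers when a batch has new progress to report.
    void notify(std::uint64_t batch_id) {
        std::lock_guard<std::mutex> lock(mu_);
        pending_.push_back(batch_id);
    }

    // Takes everything pending under a single lock acquisition and returns
    // each batch id once, in first-seen order, so the caller can run one
    // progress pass per batch.
    std::vector<std::uint64_t> drain() {
        std::vector<std::uint64_t> taken;
        {
            std::lock_guard<std::mutex> lock(mu_);
            taken.swap(pending_);
            assert(pending_.empty());  // taken started empty, so the swap clears pending_
        }
        std::unordered_set<std::uint64_t> seen;
        std::vector<std::uint64_t> unique_ids;
        for (std::uint64_t id : taken) {
            if (seen.insert(id).second) unique_ids.push_back(id);
        }
        return unique_ids;
    }

private:
    std::mutex mu_;
    std::vector<std::uint64_t> pending_;
};
```

Three notifications for batches 7, 7, and 9 drain as two entries, so the worker would wake once and make one pass over each batch.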

Main changes:

  • dispatch queued owners by PRIO_HIGH, PRIO_MEDIUM, and PRIO_LOW
  • add byte credits to dispatch selection
  • keep separate user and staging-internal lanes within each priority
  • add configurable aging for queued owners
  • preserve staging-internal admission reserves without treating owner kind as priority
  • make internal staging requests inherit Request.priority
  • bound ProxyManager shard queues with staging/max_queued_tasks_per_shard
  • requeue runtime-queued staged owners on proxy TooManyRequests
  • reduce queue overhead in public task lookup and progress-worker wakeups

The runtime queue remains disabled by default, and public transfer APIs are unchanged.

Module

  • Transfer Engine (mooncake-transfer-engine)

How Has This Been Tested?

cmake --build build --target admission_queue_test tent_runtime_queue_dispatch_test tent_progress_worker_test tebench

./build/mooncake-transfer-engine/tent/tests/admission_queue_test
./build/mooncake-transfer-engine/tent/tests/tent_runtime_queue_dispatch_test
./build/mooncake-transfer-engine/tent/tests/tent_progress_worker_test

I also ran two Aliyun eRDMA benchmark checks with TENT tebench (16 vCPU / 64 GB RAM).

First, I compared direct submit with the runtime queue path on a steady 4KB workload:
Results, 4KB request size, batch size 64, 1 thread, 5s per run:

| Mode | Run 1 GB/s | Run 2 GB/s | Run 3 GB/s | Avg GB/s | Avg Lat |
| --- | --- | --- | --- | --- | --- |
| direct | 3.239529 | 3.239593 | 3.239587 | 3.239570 | 80.9 us |
| runtime queue | 3.239583 | 3.239519 | 3.239504 | 3.239535 | 80.9 us |

In this steady small-request run, the runtime queue path was effectively even with direct submit for throughput and average latency.

I also ran a queue-specific burst benchmark from tent-runtime-queue-bench against both the direct path and the runtime queue path. This uses the same workload for both modes: each burst submits low-priority work first, then high-priority work, and only polls completions after the burst is submitted.

For the runtime queue run, the dispatch window was capped at 64 owners. That is large enough to keep the transport busy, but still leaves backlog for the queue to schedule.

Runtime queue settings for this run:

{
  "enable_runtime_queue": true,
  "enable_progress_worker": true,
  "runtime_queue": {
    "max_dispatch_owners": 64,
    "max_dispatch_bytes": 1073741824
  }
}
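To make the two limits concrete, here is a minimal sketch of a dispatch window that honors an owner cap and a byte cap. The semantics are assumed, not confirmed by the PR: `DispatchWindow` is a hypothetical class, both limits are treated as caps on in-flight work, and an owner larger than the byte cap is allowed to run alone so it cannot starve.

```cpp
#include <cassert>
#include <cstdint>

class DispatchWindow {
public:
    DispatchWindow(std::uint64_t max_owners, std::uint64_t max_bytes)
        : max_owners_(max_owners), max_bytes_(max_bytes) {
        assert(max_owners > 0 && max_bytes > 0);
    }

    // Returns true and reserves window space if the owner may be dispatched.
    bool try_admit(std::uint64_t bytes) {
        if (inflight_owners_ >= max_owners_) return false;
        const bool fits = inflight_bytes_ <= max_bytes_ &&
                          bytes <= max_bytes_ - inflight_bytes_;
        // Assumption: an oversized owner is admitted only into an empty window.
        if (!fits && inflight_owners_ != 0) return false;
        ++inflight_owners_;
        inflight_bytes_ += bytes;
        return true;
    }

    // Releases the space held by a completed owner.
    void complete(std::uint64_t bytes) {
        assert(inflight_owners_ > 0 && inflight_bytes_ >= bytes);
        --inflight_owners_;
        inflight_bytes_ -= bytes;
    }

    std::uint64_t inflight_owners() const { return inflight_owners_; }

private:
    std::uint64_t max_owners_;
    std::uint64_t max_bytes_;
    std::uint64_t inflight_owners_ = 0;
    std::uint64_t inflight_bytes_ = 0;
};
```

With the settings above (64 owners, 1073741824 bytes) and 4KB requests, the owner cap is the limit that binds: the 65th owner waits in the queue until one of the first 64 completes, which is what leaves backlog for the scheduler to reorder.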

Results, 4KB request size, batch size 16, burst depth 512, 4 threads, 5s per run, averaged over 5 runs:
Throughput Lat is the latency derived from total throughput. The Batch Tx columns measure each burst batch from submit time to observed completion, so they include backlog and polling delay.

| Mode | Avg BW GB/s | Throughput Lat | Batch Avg Tx | Batch P99 Tx | Batch P999 Tx | High Avg Tx | High P99 Tx | Low Avg Tx | Low P99 Tx |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| direct | 3.238890 | 80.9 us | 34148.5 us | 37063.2 us | 38758.8 us | 32272.0 us | 36881.5 us | 34416.6 us | 37075.8 us |
| runtime queue | 3.042866 | 86.3 us | 20126.3 us | 29177.5 us | 31852.7 us | 13268.5 us | 24400.4 us | 21106.0 us | 29377.8 us |

In this burst-backlog workload, the runtime queue traded about 6% of throughput (3.239 GB/s to 3.043 GB/s) for lower observed batch completion latency and better high-priority latency. This is expected because the queue keeps the transport in-flight window bounded instead of flooding the whole burst into the transport at once. High-priority completion latency dropped from 32.3 ms to 13.3 ms on average, and high-priority P99 dropped from 36.9 ms to 24.4 ms.
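The percentages behind this summary can be rechecked from the table with a small helper. `reduction_percent` is a hypothetical function written for this check, not part of tebench or the PR.

```cpp
#include <cassert>
#include <cmath>

// Relative reduction of `candidate` versus `baseline`, in percent.
// Positive means the candidate value is lower than the baseline.
inline double reduction_percent(double baseline, double candidate) {
    assert(baseline > 0.0);
    return (baseline - candidate) / baseline * 100.0;
}

// Inputs copied from the burst benchmark table (direct vs runtime queue).
inline double throughput_loss() { return reduction_percent(3.238890, 3.042866); }
inline double high_avg_gain() { return reduction_percent(32272.0, 13268.5); }
inline double high_p99_gain() { return reduction_percent(36881.5, 24400.4); }
```

Evaluated on the table values, these come out to roughly 6% for throughput, 59% for high-priority average latency, and 34% for high-priority P99.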

Checklist

  • I have performed a self-review of my own code
  • I have formatted my code using ./scripts/code_format.sh
  • I have run pre-commit run --all-files and all hooks pass
  • I have updated the documentation (if applicable)
  • I have added tests to prove my changes are effective
  • For changes >500 LOC: I have filed an RFC issue

AI Assistance Disclosure

  • No AI tools were used
  • AI tools were used (specify below)
    Claude Code was used to assist with implementation and review. The final changes were reviewed and validated by myself.

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a weighted byte-deficit scheduler with priority aging to the admission queue, adds bounded queue support to the proxy manager shards, and optimizes the progress worker loop. The review feedback highlights critical concurrency issues in ProxyManager where stage_buffers_ is accessed without synchronization, a logic bug in ProgressWorker that could stall the runtime queue, load-balancing limitations due to the use of thread_local in shard selection, and potential crashes from unhandled JSON parsing exceptions in the control plane.

Comment thread mooncake-transfer-engine/tent/src/runtime/proxy_manager.cpp
Comment thread mooncake-transfer-engine/tent/include/tent/runtime/proxy_manager.h
Comment thread mooncake-transfer-engine/tent/src/runtime/proxy_manager.cpp Outdated
Comment thread mooncake-transfer-engine/tent/src/runtime/proxy_manager.cpp
Comment thread mooncake-transfer-engine/tent/src/runtime/progress_worker.cpp Outdated
Comment thread mooncake-transfer-engine/tent/include/tent/runtime/proxy_manager.h
Comment thread mooncake-transfer-engine/tent/src/runtime/proxy_manager.cpp Outdated
Comment thread mooncake-transfer-engine/tent/src/runtime/control_plane.cpp Outdated
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.

Comment thread mooncake-transfer-engine/tent/src/runtime/control_plane.cpp Outdated
Comment thread mooncake-transfer-engine/tent/src/runtime/admission_queue.cpp Outdated