poll2() tight loop can bypass Tokio cooperative scheduling — exploring yield-between-frames for streaming workloads
Summary
On long-lived streaming connections (SSE, gRPC streaming, IoT feeds), I'm seeing a bimodal inter-frame latency pattern with h2: frames delivered within a TCP burst arrive in microseconds, but the gap between bursts can be hundreds of milliseconds.
From reading the code path, a likely contributor is that Connection::poll2() will process all already-buffered frames in one poll loop before yielding back to the runtime. In bursty conditions this can:
monopolize the connection task for the duration of the burst,
delay poll_complete() (and therefore flushing WINDOW_UPDATEs),
reduce fairness vs other tasks on the same Tokio runtime.
This is great for throughput / request-response, but can be awkward for high-frequency small-message streaming where tail latency matters more than peak throughput. I'm exploring whether an opt-in or ecosystem-standard yielding approach makes sense here, and would love maintainer guidance on preferred direction.
Evidence
All runs use tcp_nodelay(true) (so this shouldn't be Nagle-related; cf. hyper #3187 discussions). Same sender, same wire data.
Benchmark results (200s, 324K+ samples each)
Measured inter-arrival time between consecutive body.frame().await returns on the receiver.
| HTTP/2 library | p50 latency | p95 latency | Max gap | Jitter (σ) |
| --- | --- | --- | --- | --- |
| h2 0.4.x | 1 μs | 1,242 μs | 440,904 μs | 36,692 μs |
| proxygen (C++) | 6 μs | 72 μs | 101,360 μs | 3,542 μs |
Key observation: h2's p50 is faster (1 μs vs 6 μs), but its tail is worse: p95 is ~17× higher (1,242 μs vs 72 μs) and the max gap is ~4× higher (440,904 μs vs 101,360 μs) under this burst pattern.
Environment & methodology
h2: 0.4.13, hyper: 1.8.1
Tokio: multi-threaded runtime
OS: macOS 14 (arm64) and Linux (similar behavior)
Workload: 800 Hz sensor stream over USB (low latency / high bandwidth tends to amplify TCP burstiness)
Measurement: monotonic clock; inter-arrival between consecutive body.frame().await returns; stable across runs
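For reference, the statistics above can be derived from raw receive timestamps roughly as follows. This is a minimal sketch, not the actual benchmark harness: `GapStats`, `summarize`, and `gaps_from_instants` are illustrative names, and the nearest-rank percentile definition is an assumption (other definitions shift the numbers slightly).

```rust
use std::time::Instant;

/// Summary of inter-arrival gaps, all in microseconds (illustrative type).
#[derive(Debug)]
struct GapStats {
    p50: u64,
    p95: u64,
    max: u64,
    std_dev: f64,
}

/// Nearest-rank percentile over an already sorted, non-empty slice.
fn percentile(sorted: &[u64], q: f64) -> u64 {
    let rank = ((q * sorted.len() as f64).ceil() as usize).clamp(1, sorted.len());
    sorted[rank - 1]
}

/// Turn monotonic timestamps (one per `body.frame().await` return) into gaps.
fn gaps_from_instants(stamps: &[Instant]) -> Vec<u64> {
    stamps
        .windows(2)
        .map(|w| w[1].duration_since(w[0]).as_micros() as u64)
        .collect()
}

/// Compute p50, p95, max, and population standard deviation of the gaps.
fn summarize(gaps_us: &[u64]) -> Option<GapStats> {
    if gaps_us.is_empty() {
        return None;
    }
    let mut sorted = gaps_us.to_vec();
    sorted.sort_unstable();
    let n = sorted.len() as f64;
    let mean = sorted.iter().sum::<u64>() as f64 / n;
    let variance = sorted
        .iter()
        .map(|&g| (g as f64 - mean).powi(2))
        .sum::<f64>()
        / n;
    Some(GapStats {
        p50: percentile(&sorted, 0.50),
        p95: percentile(&sorted, 0.95),
        max: *sorted.last().unwrap(),
        std_dev: variance.sqrt(),
    })
}
```

A bimodal distribution shows up here as a tiny p50 next to a large p95 and max, which is the shape reported in the table.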
Analysis (what I think is happening)
In src/proto/connection.rs, Connection::poll2() runs a tight loop and continues as long as codec.poll_next(cx) is Ready:
When TCP delivers a burst containing N complete HTTP/2 frames, a single socket read can fill the codec buffer. After that, multiple poll_next() calls may decode frames from memory without additional I/O. In that situation, Tokio's cooperative scheduling budget, which is normally consumed by repeated poll_read calls on the socket, may not come into play mid-burst, so poll2() can drain the entire burst before yielding.
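The drain behavior described above can be modeled with a toy codec whose frames are already in memory. This is only a sketch of the loop's shape, not h2's code: `BufferedCodec`, `poll_drain_all`, and `noop_waker` are hypothetical names, and frames are reduced to plain integers.

```rust
use std::collections::VecDeque;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Toy stand-in for the codec: frames already decoded from one socket read.
struct BufferedCodec {
    frames: VecDeque<u32>,
}

impl BufferedCodec {
    /// Ready while frames are buffered; Pending once a real read would be needed.
    fn poll_next(&mut self, _cx: &mut Context<'_>) -> Poll<u32> {
        match self.frames.pop_front() {
            Some(frame) => Poll::Ready(frame),
            None => Poll::Pending,
        }
    }
}

/// Same shape as the loop under discussion: continue while the codec is Ready.
/// Returns how many frames were handled inside this single poll.
fn poll_drain_all(codec: &mut BufferedCodec, cx: &mut Context<'_>) -> usize {
    let mut handled = 0;
    while let Poll::Ready(_frame) = codec.poll_next(cx) {
        handled += 1;
    }
    handled
}

/// Waker that does nothing, just enough to build a Context for the model.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

fn noop_waker() -> Waker {
    Waker::from(Arc::new(NoopWake))
}
```

With 500 buffered frames, one call to `poll_drain_all` handles all 500 before control returns to the caller, because nothing in the loop touches the socket or any scheduling budget.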
Secondary contributor: WINDOW_UPDATE timing
Separately, WINDOW_UPDATEs are intentionally batched (e.g. UNCLAIMED_DENOMINATOR = 2), and they're flushed from poll_complete(). If poll2() stays in its loop for a long burst, poll_complete() (and thus WINDOW_UPDATE emission) is delayed until the loop finishes. For streaming, yielding more frequently could improve flow-control responsiveness.
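To make the batching concrete, here is a simplified model of a threshold-based WINDOW_UPDATE decision. It is an assumption-laden sketch rather than h2's actual bookkeeping: `pending_window_update` is a hypothetical function, and the rule "emit only once unclaimed capacity reaches the advertised window divided by the denominator" is my reading of the intent.

```rust
/// Batching factor as referenced above; the rest of this model is illustrative.
const UNCLAIMED_DENOMINATOR: u32 = 2;

/// `advertised`: the window the peer currently believes it has.
/// `available`: the capacity the receiver is actually able to accept.
/// Returns the increment to send, or None while the update is still batched.
fn pending_window_update(advertised: u32, available: u32) -> Option<u32> {
    // Nothing to hand back if the peer already knows about all capacity.
    let unclaimed = available.checked_sub(advertised)?;
    let threshold = advertised / UNCLAIMED_DENOMINATOR;
    if unclaimed == 0 || unclaimed < threshold {
        None // keep batching
    } else {
        Some(unclaimed) // worth a WINDOW_UPDATE frame
    }
}
```

Even when this decision says "send", the frame only goes out once the flush path runs, which is why a long drain loop can postpone it.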
Why "typical" gRPC streaming may not show this
At lower rates (e.g., <100 Hz) and/or with larger messages, each poll tends to process only a small number of frames, so you don't hit the "many small frames coalesced into one read buffer" regime. This seems to show up most strongly with high-frequency, small DATA frames.
This pattern matches what the tonic project discussed/changed around streaming tail latency (frame batching + yielding).
Possible approaches
Option A (preferred to discuss): participate in Tokio cooperative scheduling (no new API)
Use Tokio's coop budget mechanism to occasionally yield from the loop:
poll_proceed is a public API in Tokio, gated behind the rt feature (which h2 already enables). This is the idiomatic way to avoid tight-loop monopolization in Tokio-based crates, without adding any user-facing knobs.
If maintainers prefer to keep h2 more runtime-agnostic, a manual budget counter with cx.waker().wake_by_ref() + Poll::Pending after N iterations would achieve the same effect without the Tokio coupling — though the budget would be disconnected from Tokio's actual task-level coop budget.
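A runtime-agnostic version of the manual budget could look roughly like this. It is a sketch under stated assumptions: `FRAME_BUDGET`, `BufferedCodec`, `poll_with_budget`, and `WakeCounter` are hypothetical names, the budget of 32 is arbitrary, and the toy codec reports end-of-stream instead of registering for socket readiness.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical per-poll frame budget; the right value would need measuring.
const FRAME_BUDGET: usize = 32;

/// Toy codec: yields buffered frames, then signals end-of-stream with None.
struct BufferedCodec {
    frames: VecDeque<u32>,
}

impl BufferedCodec {
    fn poll_next(&mut self, _cx: &mut Context<'_>) -> Poll<Option<u32>> {
        Poll::Ready(self.frames.pop_front())
    }
}

/// Handle at most FRAME_BUDGET frames, then self-wake and return Pending so
/// the runtime can schedule other tasks before the rest of the burst.
fn poll_with_budget(
    codec: &mut BufferedCodec,
    cx: &mut Context<'_>,
    handled: &mut usize,
) -> Poll<()> {
    for _ in 0..FRAME_BUDGET {
        match codec.poll_next(cx) {
            Poll::Ready(Some(_frame)) => *handled += 1,
            Poll::Ready(None) => return Poll::Ready(()), // stream finished
            Poll::Pending => return Poll::Pending,       // codec holds the waker
        }
    }
    // Budget exhausted and frames may remain: request an immediate re-poll.
    cx.waker().wake_by_ref();
    Poll::Pending
}

/// Waker that counts wake-ups, used here only to observe the self-wake.
struct WakeCounter(AtomicUsize);

impl WakeCounter {
    fn wakes(&self) -> usize {
        self.0.load(Ordering::SeqCst)
    }
}

impl Wake for WakeCounter {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

fn counting_waker() -> (Arc<WakeCounter>, Waker) {
    let counter = Arc::new(WakeCounter(AtomicUsize::new(0)));
    (counter.clone(), Waker::from(counter))
}
```

With 100 buffered frames, the first three polls each handle 32 frames and self-wake, and the fourth handles the remaining 4 and completes. The self-wake is essential: returning Pending without it would leave the task parked with frames still buffered.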
Option B: configurable limit (more control, more invasive)
A per-connection configuration like:
builder.max_data_frames_per_poll(1);
Caveat: today ReceivedFrame::Continue does not distinguish DATA vs control frames, so implementing "count DATA frames" likely requires threading frame-type info out of recv_frame() or adding a dedicated variant—more invasive than Option A.
Trade-offs / concerns (why I'm asking before coding)
Throughput vs latency: current behavior is throughput-optimal; yielding introduces scheduling overhead and increases total burst drain time.
Fairness is best-effort: even if we return Pending, the task may be re-polled quickly if the runtime is otherwise idle.
API surface: Option B adds knobs; Option A avoids that.
Related work / precedent
coop::poll_proceed intended for tight polling loops (e.g. #4498 / #7405)
max_send_buffer_size)
Next steps
If maintainers think this is worth addressing, I'm happy to:
I'm also very open to being told "this is by design" for h2's goals, or to alternative suggestions consistent with h2's architecture.