Napi#629
Conversation
|
Warning Review limit reached
Next review available in: 33 minutes Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available. How can I continue?After more reviews become available, a review can be triggered using the To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews. How do review limits work?CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability. For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window. Please refer docs for additional details. Review details⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (8)
📝 WalkthroughWalkthroughThis PR adds NAPI receive mode behind a new Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modules/infra/datapath/main_loop.c`:
- Around line 224-227: The code unconditionally calls vec_add(*registered, *qm)
after attempting to register the queue with
rte_eth_dev_rx_intr_ctl_q(qm->port_id, qm->queue_id, RTE_EPOLL_PER_THREAD,
RTE_INTR_EVENT_ADD, NULL); so a failing registration still marks the queue as
registered and later gets skipped in napi_wait(). Change this to capture the
return value of rte_eth_dev_rx_intr_ctl_q, check for success (e.g., ret == 0),
only call vec_add(*registered, *qm) on success, and handle/log the failure path
(using process logging or similar) and do not mark the queue as registered when
the call fails; reference symbols: rte_eth_dev_rx_intr_ctl_q, vec_add,
registered, qm, napi_wait.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7c9daab1-60e9-44a6-b828-7de881b450d8
📒 Files selected for processing (4)
main/config.hmain/main.cmodules/infra/control/port.cmodules/infra/datapath/main_loop.c
vjardin
left a comment
There was a problem hiding this comment.
sched_setattr() shall be used from glibc.
| // utilization and downclocks the core even at line rate. Pin uclamp_min to the | ||
| // max capacity: the governor runs the core at full speed while the worker is | ||
| // runnable and lets it drop only when it actually sleeps on the interrupt. | ||
| // glibc exposes neither struct sched_attr nor a sched_setattr() wrapper. |
There was a problem hiding this comment.
|
Disclaimer: I haven't looked at the patch in detail, so my feedback is high-level only. Have you measured the RX interrupt wakeup latency? In other words: Are user-space interrupts fast enough for general use, or only for low-traffic hours? Another detail: |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modules/infra/control/worker.c`:
- Around line 50-54: The issue is that attr (used with pthread_attr_destroy) may
be uninitialized if eventfd() fails and the function jumps to end; to fix,
ensure pthread_attr_t attr is initialized before any early goto that can skip
pthread_attr_init or rearrange the control flow so pthread_attr_destroy is only
called when pthread_attr_init succeeded: either move the
pthread_attr_init(&attr) before the eventfd() call (so attr is always
initialized) or add a boolean/flag (e.g., attr_initialized) set after
pthread_attr_init and check it before calling pthread_attr_destroy(&attr);
update references in this function where worker->wakeup_fd, eventfd,
pthread_attr_init, pthread_attr_destroy, and attr are used.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 90a04b41-6c86-4ef5-a98d-1c2a45c64c6d
📒 Files selected for processing (7)
main/config.hmain/main.cmeson.buildmodules/infra/control/port.cmodules/infra/control/worker.cmodules/infra/control/worker.hmodules/infra/datapath/main_loop.c
✅ Files skipped from review due to trivial changes (1)
- main/config.h
🚧 Files skipped from review as they are similar to previous changes (3)
- main/main.c
- modules/infra/control/port.c
- modules/infra/datapath/main_loop.c
|
Thanks @MortenBroerup , both points are spot on. Let me split them. Wakeup latency / timerslackThe timerslack concern doesn't apply to the napi path: an idle worker doesn't So the dominant latency term isn't the scheduler, it's the PMD's interrupt To answer the underlying question directly: this is opt-in ( Other wakeup sourcesAgreed this is the important design point. Two cases: Control-plane events already work: the worker's own wakeup eventfd is in the Datapath timers: today grout's datapath is purely packet-driven, there's no |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
modules/infra/control/port.c (2)
259-282: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winWake workers before returning on MTU-set errors.
p->startedis cleared before these early returns, butworker_wakeup_all()only runs on success. A NAPI worker can stay blocked inrte_epoll_wait(-1)until some unrelated event arrives, so the failure path needs the same kick.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modules/infra/control/port.c` around lines 259 - 282, The MTU update path in the port control logic leaves workers asleep on error because `worker_wakeup_all()` only runs after a successful `rte_eth_dev_start` in the MTU-setting flow. Update the MTU change handling around `rte_eth_dev_stop`, `rte_eth_dev_set_mtu`, and `rte_eth_dev_start` so any early return after `p->started` is cleared also wakes workers before exiting. Keep the wakeup behavior consistent with the success path in the same function that updates `iface->mtu` and `subinterfaces`.
313-368: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick winWake workers on failed reconfig
p->startedis cleared before the stop/reset/configure path, but every early return before the finalworker_wakeup_all()skips the kick entirely. That leaves NAPI workers blocked on the old graph until some unrelated event arrives. Move the wakeup into a shared cleanup path so failures still break the sleep.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modules/infra/control/port.c` around lines 313 - 368, The reconfiguration path in port handling clears p->started and then has several early returns before reaching worker_wakeup_all(), which means workers are not kicked when stop/reset/configure fails. Update the control flow in the port reconfig logic so worker_wakeup_all() runs through a shared cleanup/exit path in the function that performs port_unplug(), port_configure(), rte_eth_dev_start(), and the reset/stop branch, ensuring workers are always awakened even on failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modules/infra/control/worker.c`:
- Around line 50-54: The early eventfd() failure path in worker initialization
is using the shared end cleanup even though the worker has not yet been inserted
into the queue or had its thread created. Update the worker setup flow in the
worker initialization routine to track acquisition state with flags like
inserted and thread_created, or split into staged cleanup labels, so
STAILQ_REMOVE() and pthread_cancel() only run after those resources actually
exist. Make sure the cleanup path for the eventfd() failure exits before any
teardown that assumes insertion or thread creation has occurred.
- Around line 139-140: The wakeup write in worker wakeup handling does not retry
interrupted writes, so an EINTR can leave the NAPI wakeup undelivered and stall
teardown. Update the wakeup path around the write() call in the worker wakeup
logic to loop and retry when errno is EINTR, while still treating EAGAIN as a
harmless already-pending wakeup; keep the existing error logging in the worker
code for real failures only.
---
Outside diff comments:
In `@modules/infra/control/port.c`:
- Around line 259-282: The MTU update path in the port control logic leaves
workers asleep on error because `worker_wakeup_all()` only runs after a
successful `rte_eth_dev_start` in the MTU-setting flow. Update the MTU change
handling around `rte_eth_dev_stop`, `rte_eth_dev_set_mtu`, and
`rte_eth_dev_start` so any early return after `p->started` is cleared also wakes
workers before exiting. Keep the wakeup behavior consistent with the success
path in the same function that updates `iface->mtu` and `subinterfaces`.
- Around line 313-368: The reconfiguration path in port handling clears
p->started and then has several early returns before reaching
worker_wakeup_all(), which means workers are not kicked when
stop/reset/configure fails. Update the control flow in the port reconfig logic
so worker_wakeup_all() runs through a shared cleanup/exit path in the function
that performs port_unplug(), port_configure(), rte_eth_dev_start(), and the
reset/stop branch, ensuring workers are always awakened even on failure.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 226ac4be-c212-4ebd-ace0-499661461c6e
📒 Files selected for processing (7)
main/config.hmain/main.cmeson.buildmodules/infra/control/port.cmodules/infra/control/worker.cmodules/infra/control/worker.hmodules/infra/datapath/main_loop.c
🚧 Files skipped from review as they are similar to previous changes (5)
- main/config.h
- meson.build
- modules/infra/control/worker.h
- main/main.c
- modules/infra/datapath/main_loop.c
| worker->wakeup_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC); | ||
| if (worker->wakeup_fd < 0) { | ||
| ret = errno; | ||
| goto end; | ||
| } |
There was a problem hiding this comment.
🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win
Split cleanup by acquired resource state.
Line 53 can jump to end before the worker is inserted or the thread is created, but the error path still runs STAILQ_REMOVE() and pthread_cancel() on those unacquired resources. Track inserted/thread_created flags or use staged cleanup labels before routing early eventfd() failures through the shared teardown.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modules/infra/control/worker.c` around lines 50 - 54, The early eventfd()
failure path in worker initialization is using the shared end cleanup even
though the worker has not yet been inserted into the queue or had its thread
created. Update the worker setup flow in the worker initialization routine to
track acquisition state with flags like inserted and thread_created, or split
into staged cleanup labels, so STAILQ_REMOVE() and pthread_cancel() only run
after those resources actually exist. Make sure the cleanup path for the
eventfd() failure exits before any teardown that assumes insertion or thread
creation has occurred.
| if (write(w->wakeup_fd, &one, sizeof(one)) != sizeof(one) && errno != EAGAIN) | ||
| LOG(ERR, "worker %u wakeup_fd write: %s", w->cpu_id, strerror(errno)); |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Retry interrupted eventfd writes.
If write() returns -1/EINTR, the NAPI wakeup is not delivered; a worker blocked in rte_epoll_wait() can remain asleep and make teardown hang. Retry EINTR, while keeping EAGAIN as “already pending”.
Proposed fix
- if (write(w->wakeup_fd, &one, sizeof(one)) != sizeof(one) && errno != EAGAIN)
- LOG(ERR, "worker %u wakeup_fd write: %s", w->cpu_id, strerror(errno));
+ ssize_t n;
+
+ do {
+ n = write(w->wakeup_fd, &one, sizeof(one));
+ } while (n < 0 && errno == EINTR);
+
+ if (n < 0 && errno != EAGAIN)
+ LOG(ERR, "worker %u wakeup_fd write: %s", w->cpu_id, strerror(errno));
+ else if (n >= 0 && n != sizeof(one))
+ LOG(ERR, "worker %u wakeup_fd short write", w->cpu_id);📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| if (write(w->wakeup_fd, &one, sizeof(one)) != sizeof(one) && errno != EAGAIN) | |
| LOG(ERR, "worker %u wakeup_fd write: %s", w->cpu_id, strerror(errno)); | |
| ssize_t n; | |
| do { | |
| n = write(w->wakeup_fd, &one, sizeof(one)); | |
| } while (n < 0 && errno == EINTR); | |
| if (n < 0 && errno != EAGAIN) | |
| LOG(ERR, "worker %u wakeup_fd write: %s", w->cpu_id, strerror(errno)); | |
| else if (n >= 0 && n != sizeof(one)) | |
| LOG(ERR, "worker %u wakeup_fd short write", w->cpu_id); |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modules/infra/control/worker.c` around lines 139 - 140, The wakeup write in
worker wakeup handling does not retry interrupted writes, so an EINTR can leave
the NAPI wakeup undelivered and stall teardown. Update the wakeup path around
the write() call in the worker wakeup logic to loop and retry when errno is
EINTR, while still treating EAGAIN as a harmless already-pending wakeup; keep
the existing error logging in the worker code for real failures only.
Add an opt-in --napi mode where an idle worker arms the interrupts on its rx queues and blocks on them through the generic rte_eth_dev_rx_intr_* / rte_epoll_wait API instead of busy-polling. A packet wakes the worker, which disarms and resumes polling: the usual poll/interrupt hybrid, with the interrupt acting only as a doorbell since frames are still pulled by the graph walk. A worker blocks only after staying idle for NAPI_EMPTY_WINDOWS housekeeping windows with all of its queues empty, so a single busy queue keeps it polling. --napi implies poll-mode and replaces the micro-sleep ramp with the interrupt block. As that block can last up to a second it is measured explicitly and the timestamp advanced past it, keeping the sleep in total_cycles but out of busy_cycles. napi_wait() tracks the queues it actually armed and disarms them through a single exit path, so a queue without interrupt support does not leave its predecessors armed, and only marks a queue epoll-registered once. A PMD without rx queue interrupt support keeps polling. Signed-off-by: Maxime Leroy <maxime@leroys.fr>
In --napi mode an idle worker blocks on the rxq interrupt, so the schedutil governor sees a low utilization and downclocks the core even when it later runs at line rate. Pin the worker's uclamp_min to the max capacity through sched_setattr(): the governor keeps the core at full speed while the worker is runnable and lets it drop only when it actually sleeps on the interrupt. glibc exposes neither struct sched_attr nor a sched_setattr() wrapper, so both are declared locally. The syscall fails on kernels without uclamp support or without the privilege to set it, which would otherwise warn on every worker. Report the expected EOPNOTSUPP/ENOSYS/EPERM/EINVAL cases at NOTICE and keep WARNING for anything unexpected. Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Opt-in --napi mode: an idle worker stops busy-polling and blocks on its rx queue interrupts via rte_eth_dev_rx_intr_* / rte_epoll_wait, resuming polling when a packet wakes it. A second commit pins the worker's uclamp_min to max so schedutil keeps the core at full clock while runnable, dropping only when it actually sleeps.
NAPI opt-in idle mode: RX interrupt blocking via per-thread epoll
Configuration / CLI
--napi(gr_config.napi): enables adaptive interrupt (NAPI) RX and forcesgr_config.poll_mode = true(so the idle behavior uses interrupt blocking rather than the non-poll sleep path).bool napi;tostruct gr_config(main/config.h).port_configure()enables RX queue interrupts (conf.intr_conf.rxq = 1) only whengr_config.napiis set.Port start wakeups
rte_eth_dev_start()succeeds,worker_wakeup_all()is called (inport_mtu_set()andiface_port_reconfig()) so workers can (re)arm/adjust their idle epoll state once the port is started.Idle detection and blocking (
modules/infra/datapath/main_loop.c)gr_config.napiis enabled:ctx.last_count == 0, incrementsnapi_emptyand afterNAPI_EMPTY_WINDOWS(=2) consecutive empty windows callsnapi_wait(w, &napi_registered).napi_emptyand attributes cycles tobusy_cycles.napi_wait()mechanics (napi_wait(struct worker *w, vec struct queue_map **registered))w->wakeup_fdinto the same per-thread epoll instance for this wait (once per thread), viarte_epoll_ctl(... EPOLL_CTL_ADD ...); tolerateserrno == EEXIST.w->rxqs:rte_eth_dev_rx_intr_enable(port_id, queue_id).armedlist.rte_eth_dev_rx_intr_ctl_q(..., RTE_INTR_EVENT_ADD, ...).registered(entries are not removed later; relies on epoll/single-portal sharing semantics).-EEXISTas already-registered.rte_eth_rx_queue_count(port_id, queue_id) > 0for any armed queue, it disables RX interrupts forarmedand returns without blocking.rte_rcu_qsbr_thread_offline(...).NAPI_SETTLE_TRIES(=3) short waits withrte_epoll_wait(..., NAPI_SETTLE_MS=100ms).rte_epoll_wait(..., -1).rte_rcu_qsbr_thread_online(...).w->wakeup_fdwith a loopedread()so the fd doesn’t stay readable on the next epoll wait.rte_eth_dev_rx_intr_disable) for queues in thearmedlist.Accounting while blocked
napi_wait(),sleep_cyclesandn_sleepsare incremented.timestamp_tmp/cyclesso the subsequent housekeeping accounting bills it as sleep rather than busy.CPU clock management while runnable (
worker_perf_floor)gr_config.napiis enabled, the worker callsworker_perf_floor()at startup:sched_setattrwithSCHED_FLAG_UTIL_CLAMP_MINandsched_util_min = 1024(uclamp_min max / SCHED_CAPACITY_SCALE), when available (HAVE_SCHED_SETATTRfrommeson.build).uclamp_minavailability failures atNOTICEforEOPNOTSUPP/ENOSYS/EPERM/EINVAL, otherwiseWARNING.Worker wakeup signaling for epoll-blocked workers
int wakeup_fdtostruct worker(worker.h) as an eventfd written by the control plane and drained by the dataplane NAPI epoll path.worker_create()createswakeup_fdaseventfd(0, EFD_NONBLOCK | EFD_CLOEXEC).worker_wakeup():uint64_t=1towakeup_fd.EAGAINis treated as “already pending/undrained”.worker_destroy()closeswakeup_fd.worker_wakeup_all()wakes every worker viaworker_wakeup().