Skip to content

Napi#629

Open
maxime-leroy wants to merge 2 commits into
DPDK:mainfrom
maxime-leroy:napi
Open

Napi#629
maxime-leroy wants to merge 2 commits into
DPDK:mainfrom
maxime-leroy:napi

Conversation

@maxime-leroy

@maxime-leroy maxime-leroy commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Opt-in --napi mode: an idle worker stops busy-polling and blocks on its rx queue interrupts via rte_eth_dev_rx_intr_* / rte_epoll_wait, resuming polling when a packet wakes it. A second commit pins the worker's uclamp_min to max so schedutil keeps the core at full clock while runnable, dropping only when it actually sleeps.

NAPI opt-in idle mode: RX interrupt blocking via per-thread epoll

Configuration / CLI

  • Added --napi (gr_config.napi): enables adaptive interrupt (NAPI) RX and forces gr_config.poll_mode = true (so the idle behavior uses interrupt blocking rather than the non-poll sleep path).
  • Added bool napi; to struct gr_config (main/config.h).
  • port_configure() enables RX queue interrupts (conf.intr_conf.rxq = 1) only when gr_config.napi is set.

Port start wakeups

  • After rte_eth_dev_start() succeeds, worker_wakeup_all() is called (in port_mtu_set() and iface_port_reconfig()) so workers can (re)arm/adjust their idle epoll state once the port is started.

Idle detection and blocking (modules/infra/datapath/main_loop.c)

  • In the housekeeping interval, when gr_config.napi is enabled:
    • If ctx.last_count == 0, increments napi_empty and after NAPI_EMPTY_WINDOWS (=2) consecutive empty windows calls napi_wait(w, &napi_registered).
    • Otherwise (non-empty window), resets napi_empty and attributes cycles to busy_cycles.

napi_wait() mechanics (napi_wait(struct worker *w, vec struct queue_map **registered))

  • Registers the worker’s w->wakeup_fd into the same per-thread epoll instance for this wait (once per thread), via rte_epoll_ctl(... EPOLL_CTL_ADD ...); tolerates errno == EEXIST.
  • For each owned RX queue in w->rxqs:
    • Skips RX queues whose port iface is not registered/started.
    • Tries to enable RX interrupts with rte_eth_dev_rx_intr_enable(port_id, queue_id).
    • Tracks only the successfully enabled queues in a local armed list.
  • Ensures each armed RX queue portal is present in this thread’s epoll set:
    • Uses rte_eth_dev_rx_intr_ctl_q(..., RTE_INTR_EVENT_ADD, ...).
    • Tracks newly registered queues in registered (entries are not removed later; relies on epoll/single-portal sharing semantics).
    • Treats -EEXIST as already-registered.
  • Rechecks for races before blocking:
    • If rte_eth_rx_queue_count(port_id, queue_id) > 0 for any armed queue, it disables RX interrupts for armed and returns without blocking.
  • Blocks:
    • Temporarily takes the worker QSBR thread offline via rte_rcu_qsbr_thread_offline(...).
    • Performs up to NAPI_SETTLE_TRIES (=3) short waits with rte_epoll_wait(..., NAPI_SETTLE_MS=100ms).
    • If still idle, blocks indefinitely with rte_epoll_wait(..., -1).
    • Brings the QSBR thread back online via rte_rcu_qsbr_thread_online(...).
  • Drains wakeups:
    • Drains pending writes from w->wakeup_fd with a looped read() so the fd doesn’t stay readable on the next epoll wait.
  • Disarms only what it armed:
    • Disables RX interrupts (rte_eth_dev_rx_intr_disable) for queues in the armed list.

Accounting while blocked

  • After waking from napi_wait(), sleep_cycles and n_sleeps are incremented.
  • The blocked interval is “folded” into timestamp_tmp/cycles so the subsequent housekeeping accounting bills it as sleep rather than busy.

CPU clock management while runnable (worker_perf_floor)

  • When gr_config.napi is enabled, the worker calls worker_perf_floor() at startup:
    • Applies sched_setattr with SCHED_FLAG_UTIL_CLAMP_MIN and sched_util_min = 1024 (uclamp_min max / SCHED_CAPACITY_SCALE), when available (HAVE_SCHED_SETATTR from meson.build).
    • Logs uclamp_min availability failures at NOTICE for EOPNOTSUPP/ENOSYS/EPERM/EINVAL, otherwise WARNING.

Worker wakeup signaling for epoll-blocked workers

  • Added int wakeup_fd to struct worker (worker.h) as an eventfd written by the control plane and drained by the dataplane NAPI epoll path.
  • worker_create() creates wakeup_fd as eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC).
  • worker_wakeup():
    • Signals the existing condition-variable, and also writes uint64_t=1 to wakeup_fd.
    • EAGAIN is treated as “already pending/undrained”.
  • worker_destroy() closes wakeup_fd.
  • worker_wakeup_all() wakes every worker via worker_wakeup().

@coderabbitai

coderabbitai Bot commented Jun 9, 2026

Copy link
Copy Markdown

Review Change Stack

Warning

Review limit reached

@maxime-leroy, you've reached your PR review limit, so we couldn't start this review.

Next review available in: 33 minutes

Enable usage-based reviews in Billing to review now. Otherwise, wait until the next included review is available.
You're only billed for reviews past your plan's rate limits ($0.25/file).

How can I continue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

To avoid repeated limits, reduce automatic review volume by pausing incremental auto-reviews earlier, using label-based review opt-in, excluding WIP or generated PR titles, or requesting reviews manually when the PR is ready. If your team needs uninterrupted high-volume reviews, an organization admin can enable usage-based reviews.

How do review limits work?

CodeRabbit enforces per-developer PR review limits for each organization. Most developers receive the normal plan review availability.

For paid Pro and Pro+ PR reviews, CodeRabbit uses adaptive limits for sustained high-volume activity. When a developer's recent PR review activity reaches the 95th percentile or higher among CodeRabbit users, additional reviews become available more gradually as earlier reviews age out of the rolling window.

Please refer docs for additional details.

Review details
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: c3bc828c-d03e-4623-8e37-b0bc60e43e54

📥 Commits

Reviewing files that changed from the base of the PR and between 7c1e504 and 8667ee6.

📒 Files selected for processing (8)
  • main/config.h
  • main/main.c
  • meson.build
  • modules/infra/control/iface_test.c
  • modules/infra/control/port.c
  • modules/infra/control/worker.c
  • modules/infra/control/worker.h
  • modules/infra/datapath/main_loop.c
📝 Walkthrough

Walkthrough

This PR adds NAPI receive mode behind a new -n/--napi flag and gr_config.napi. Port setup now enables RX queue interrupts when NAPI is active, workers gain eventfd-based wakeups, and the datapath worker loop can block on epoll after consecutive idle windows while handling QSBR offline/online transitions. The build also detects sched_setattr support for the worker CPU floor path.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modules/infra/datapath/main_loop.c`:
- Around line 224-227: The code unconditionally calls vec_add(*registered, *qm)
after attempting to register the queue with
rte_eth_dev_rx_intr_ctl_q(qm->port_id, qm->queue_id, RTE_EPOLL_PER_THREAD,
RTE_INTR_EVENT_ADD, NULL); so a failing registration still marks the queue as
registered and later gets skipped in napi_wait(). Change this to capture the
return value of rte_eth_dev_rx_intr_ctl_q, check for success (e.g., ret == 0),
only call vec_add(*registered, *qm) on success, and handle/log the failure path
(using process logging or similar) and do not mark the queue as registered when
the call fails; reference symbols: rte_eth_dev_rx_intr_ctl_q, vec_add,
registered, qm, napi_wait.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7c9daab1-60e9-44a6-b828-7de881b450d8

📥 Commits

Reviewing files that changed from the base of the PR and between e850e7a and 5d8ba94.

📒 Files selected for processing (4)
  • main/config.h
  • main/main.c
  • modules/infra/control/port.c
  • modules/infra/datapath/main_loop.c

Comment thread modules/infra/datapath/main_loop.c Outdated

@vjardin vjardin left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sched_setattr() shall be used from glibc.

Comment thread modules/infra/datapath/main_loop.c Outdated
// utilization and downclocks the core even at line rate. Pin uclamp_min to the
// max capacity: the governor runs the core at full speed while the worker is
// runnable and lets it drop only when it actually sleeps on the interrupt.
// glibc exposes neither struct sched_attr nor a sched_setattr() wrapper.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MortenBroerup

Copy link
Copy Markdown
Contributor

Disclaimer: I haven't looked at the patch in detail, so my feedback is high-level only.

Have you measured the RX interrupt wakeup latency? In other words: Are user-space interrupts fast enough for general use, or only for low-traffic hours?
AFAIK, sleep(), usleep() and nanosleep() all end up in the nanosleep() syscall.
With the default timerslack of 50 µs, nanosleep() wakeup latency exceeds 50 µs.
With timerslack 1 - prctl(PR_SET_TIMERSLACK, 1, 0, 0, 0) - I have seen nanosleep() wakeup latency around 2.5 µs on hardware, and 15-20 µs on a VMware virtual appliance.

Another detail:
We considered using RX interrupts in the SmartShare too, and I like the concept.
You should consider that other events may want to trigger graph wakeup too...
E.g. a host-originating packet (e.g. a CLI output via SSH).
Or a timer wheel or other pollling-based dataplane timer triggering some high-frequency event, e.g. for calling rte_sched_port_dequeue().

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modules/infra/control/worker.c`:
- Around line 50-54: The issue is that attr (used with pthread_attr_destroy) may
be uninitialized if eventfd() fails and the function jumps to end; to fix,
ensure pthread_attr_t attr is initialized before any early goto that can skip
pthread_attr_init or rearrange the control flow so pthread_attr_destroy is only
called when pthread_attr_init succeeded: either move the
pthread_attr_init(&attr) before the eventfd() call (so attr is always
initialized) or add a boolean/flag (e.g., attr_initialized) set after
pthread_attr_init and check it before calling pthread_attr_destroy(&attr);
update references in this function where worker->wakeup_fd, eventfd,
pthread_attr_init, pthread_attr_destroy, and attr are used.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 90a04b41-6c86-4ef5-a98d-1c2a45c64c6d

📥 Commits

Reviewing files that changed from the base of the PR and between 5d8ba94 and a5d9ec9.

📒 Files selected for processing (7)
  • main/config.h
  • main/main.c
  • meson.build
  • modules/infra/control/port.c
  • modules/infra/control/worker.c
  • modules/infra/control/worker.h
  • modules/infra/datapath/main_loop.c
✅ Files skipped from review due to trivial changes (1)
  • main/config.h
🚧 Files skipped from review as they are similar to previous changes (3)
  • main/main.c
  • modules/infra/control/port.c
  • modules/infra/datapath/main_loop.c

@maxime-leroy

Copy link
Copy Markdown
Collaborator Author

Thanks @MortenBroerup , both points are spot on. Let me split them.

Wakeup latency / timerslack

The timerslack concern doesn't apply to the napi path: an idle worker doesn't
sleep on a timer, it blocks on rte_epoll_wait() on the rx queue interrupt
eventfd (VFIO/UIO), so the wakeup is interrupt-driven and timerslack never
enters the picture. timerslack only widens the timeout part of a sleep/poll,
not the delivery of a real fd event. (For completeness, the existing
sleep-based poll mode already lowers timerslack to 1us via
PR_SET_TIMERSLACK at worker start, so it isn't stuck at the 50us default
either.)

So the dominant latency term isn't the scheduler, it's the PMD's interrupt
coalescing. The IRQ fires on the first of "N frames queued" or "holdoff timer
expires". On DPAA2 the default is threshold = 7 frames, holdoff = 100us. Under
any real load the frame threshold trips first and the holdoff never matters;
the holdoff only adds latency in the trickle case (1-2 frames then silence),
i.e. exactly the near-idle regime this mode targets. Worst-case single-packet
wakeup at trickle is therefore bounded by the holdoff (~100us on DPAA2,
tunable via DPAA2_PORTAL_INTR_TIMEOUT).

To answer the underlying question directly: this is opt-in (--napi) and aimed
at cutting power/clocks during quiet periods, not at minimum-latency max-pps
forwarding. It's "general use during low/moderate load", and you'd leave it off
on a latency-critical fast path. I haven't done a rigorous latency sweep across
PMDs yet; if that would help I can gather numbers.

Other wakeup sources

Agreed this is the important design point. Two cases:

Control-plane events already work: the worker's own wakeup eventfd is in the
same per-thread epoll set, so a reconfig or shutdown breaks the block
immediately instead of waiting for a packet. That's the generic kick path, so
anything host-originated that needs to poke a sleeping worker can reuse it.

Datapath timers: today grout's datapath is purely packet-driven, there's no
timer wheel, no rte_sched, nothing time-driven that a sleeping worker would
starve, so blocking until the next packet is safe as-is. The day a periodic
fast-path event shows up (QoS dequeue / shaping via rte_sched_port_dequeue(),
TX pacing, datapath aging), an indefinite block would indeed starve it, and the
fix is the standard one: add a timerfd to the same epoll set (or cap the
epoll_wait timeout to the next deadline) so the worker wakes on a packet or a
deadline. Same shape as the wakeup eventfd that's already wired in, so the
extension point is there when it's needed.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
modules/infra/control/port.c (2)

259-282: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Wake workers before returning on MTU-set errors. p->started is cleared before these early returns, but worker_wakeup_all() only runs on success. A NAPI worker can stay blocked in rte_epoll_wait(-1) until some unrelated event arrives, so the failure path needs the same kick.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modules/infra/control/port.c` around lines 259 - 282, The MTU update path in
the port control logic leaves workers asleep on error because
`worker_wakeup_all()` only runs after a successful `rte_eth_dev_start` in the
MTU-setting flow. Update the MTU change handling around `rte_eth_dev_stop`,
`rte_eth_dev_set_mtu`, and `rte_eth_dev_start` so any early return after
`p->started` is cleared also wakes workers before exiting. Keep the wakeup
behavior consistent with the success path in the same function that updates
`iface->mtu` and `subinterfaces`.

313-368: 🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Wake workers on failed reconfig

p->started is cleared before the stop/reset/configure path, but every early return before the final worker_wakeup_all() skips the kick entirely. That leaves NAPI workers blocked on the old graph until some unrelated event arrives. Move the wakeup into a shared cleanup path so failures still break the sleep.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modules/infra/control/port.c` around lines 313 - 368, The reconfiguration
path in port handling clears p->started and then has several early returns
before reaching worker_wakeup_all(), which means workers are not kicked when
stop/reset/configure fails. Update the control flow in the port reconfig logic
so worker_wakeup_all() runs through a shared cleanup/exit path in the function
that performs port_unplug(), port_configure(), rte_eth_dev_start(), and the
reset/stop branch, ensuring workers are always awakened even on failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modules/infra/control/worker.c`:
- Around line 50-54: The early eventfd() failure path in worker initialization
is using the shared end cleanup even though the worker has not yet been inserted
into the queue or had its thread created. Update the worker setup flow in the
worker initialization routine to track acquisition state with flags like
inserted and thread_created, or split into staged cleanup labels, so
STAILQ_REMOVE() and pthread_cancel() only run after those resources actually
exist. Make sure the cleanup path for the eventfd() failure exits before any
teardown that assumes insertion or thread creation has occurred.
- Around line 139-140: The wakeup write in worker wakeup handling does not retry
interrupted writes, so an EINTR can leave the NAPI wakeup undelivered and stall
teardown. Update the wakeup path around the write() call in the worker wakeup
logic to loop and retry when errno is EINTR, while still treating EAGAIN as a
harmless already-pending wakeup; keep the existing error logging in the worker
code for real failures only.

---

Outside diff comments:
In `@modules/infra/control/port.c`:
- Around line 259-282: The MTU update path in the port control logic leaves
workers asleep on error because `worker_wakeup_all()` only runs after a
successful `rte_eth_dev_start` in the MTU-setting flow. Update the MTU change
handling around `rte_eth_dev_stop`, `rte_eth_dev_set_mtu`, and
`rte_eth_dev_start` so any early return after `p->started` is cleared also wakes
workers before exiting. Keep the wakeup behavior consistent with the success
path in the same function that updates `iface->mtu` and `subinterfaces`.
- Around line 313-368: The reconfiguration path in port handling clears
p->started and then has several early returns before reaching
worker_wakeup_all(), which means workers are not kicked when
stop/reset/configure fails. Update the control flow in the port reconfig logic
so worker_wakeup_all() runs through a shared cleanup/exit path in the function
that performs port_unplug(), port_configure(), rte_eth_dev_start(), and the
reset/stop branch, ensuring workers are always awakened even on failure.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 226ac4be-c212-4ebd-ace0-499661461c6e

📥 Commits

Reviewing files that changed from the base of the PR and between a5d9ec9 and 7c1e504.

📒 Files selected for processing (7)
  • main/config.h
  • main/main.c
  • meson.build
  • modules/infra/control/port.c
  • modules/infra/control/worker.c
  • modules/infra/control/worker.h
  • modules/infra/datapath/main_loop.c
🚧 Files skipped from review as they are similar to previous changes (5)
  • main/config.h
  • meson.build
  • modules/infra/control/worker.h
  • main/main.c
  • modules/infra/datapath/main_loop.c

Comment on lines +50 to +54
worker->wakeup_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
if (worker->wakeup_fd < 0) {
ret = errno;
goto end;
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🔴 Critical | ⚡ Quick win

Split cleanup by acquired resource state.

Line 53 can jump to end before the worker is inserted or the thread is created, but the error path still runs STAILQ_REMOVE() and pthread_cancel() on those unacquired resources. Track inserted/thread_created flags or use staged cleanup labels before routing early eventfd() failures through the shared teardown.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modules/infra/control/worker.c` around lines 50 - 54, The early eventfd()
failure path in worker initialization is using the shared end cleanup even
though the worker has not yet been inserted into the queue or had its thread
created. Update the worker setup flow in the worker initialization routine to
track acquisition state with flags like inserted and thread_created, or split
into staged cleanup labels, so STAILQ_REMOVE() and pthread_cancel() only run
after those resources actually exist. Make sure the cleanup path for the
eventfd() failure exits before any teardown that assumes insertion or thread
creation has occurred.

Comment on lines +139 to +140
if (write(w->wakeup_fd, &one, sizeof(one)) != sizeof(one) && errno != EAGAIN)
LOG(ERR, "worker %u wakeup_fd write: %s", w->cpu_id, strerror(errno));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Retry interrupted eventfd writes.

If write() returns -1/EINTR, the NAPI wakeup is not delivered; a worker blocked in rte_epoll_wait() can remain asleep and make teardown hang. Retry EINTR, while keeping EAGAIN as “already pending”.

Proposed fix
-	if (write(w->wakeup_fd, &one, sizeof(one)) != sizeof(one) && errno != EAGAIN)
-		LOG(ERR, "worker %u wakeup_fd write: %s", w->cpu_id, strerror(errno));
+	ssize_t n;
+
+	do {
+		n = write(w->wakeup_fd, &one, sizeof(one));
+	} while (n < 0 && errno == EINTR);
+
+	if (n < 0 && errno != EAGAIN)
+		LOG(ERR, "worker %u wakeup_fd write: %s", w->cpu_id, strerror(errno));
+	else if (n >= 0 && n != sizeof(one))
+		LOG(ERR, "worker %u wakeup_fd short write", w->cpu_id);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if (write(w->wakeup_fd, &one, sizeof(one)) != sizeof(one) && errno != EAGAIN)
LOG(ERR, "worker %u wakeup_fd write: %s", w->cpu_id, strerror(errno));
ssize_t n;
do {
n = write(w->wakeup_fd, &one, sizeof(one));
} while (n < 0 && errno == EINTR);
if (n < 0 && errno != EAGAIN)
LOG(ERR, "worker %u wakeup_fd write: %s", w->cpu_id, strerror(errno));
else if (n >= 0 && n != sizeof(one))
LOG(ERR, "worker %u wakeup_fd short write", w->cpu_id);
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@modules/infra/control/worker.c` around lines 139 - 140, The wakeup write in
worker wakeup handling does not retry interrupted writes, so an EINTR can leave
the NAPI wakeup undelivered and stall teardown. Update the wakeup path around
the write() call in the worker wakeup logic to loop and retry when errno is
EINTR, while still treating EAGAIN as a harmless already-pending wakeup; keep
the existing error logging in the worker code for real failures only.

Add an opt-in --napi mode where an idle worker arms the interrupts on
its rx queues and blocks on them through the generic
rte_eth_dev_rx_intr_* / rte_epoll_wait API instead of busy-polling. A
packet wakes the worker, which disarms and resumes polling: the usual
poll/interrupt hybrid, with the interrupt acting only as a doorbell
since frames are still pulled by the graph walk.

A worker blocks only after staying idle for NAPI_EMPTY_WINDOWS
housekeeping windows with all of its queues empty, so a single busy
queue keeps it polling. --napi implies poll-mode and replaces the
micro-sleep ramp with the interrupt block. As that block can last up
to a second it is measured explicitly and the timestamp advanced past
it, keeping the sleep in total_cycles but out of busy_cycles.

napi_wait() tracks the queues it actually armed and disarms them
through a single exit path, so a queue without interrupt support does
not leave its predecessors armed, and only marks a queue
epoll-registered once. A PMD without rx queue interrupt support keeps
polling.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
In --napi mode an idle worker blocks on the rxq interrupt, so the
schedutil governor sees a low utilization and downclocks the core even
when it later runs at line rate. Pin the worker's uclamp_min to the max
capacity through sched_setattr(): the governor keeps the core at full
speed while the worker is runnable and lets it drop only when it
actually sleeps on the interrupt. glibc exposes neither struct
sched_attr nor a sched_setattr() wrapper, so both are declared locally.

The syscall fails on kernels without uclamp support or without the
privilege to set it, which would otherwise warn on every worker. Report
the expected EOPNOTSUPP/ENOSYS/EPERM/EINVAL cases at NOTICE and keep
WARNING for anything unexpected.

Signed-off-by: Maxime Leroy <maxime@leroys.fr>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants