mm/memcontrol: optimize __memcg_slab_post_alloc_hook batch charging#3
Draft
Copilot wants to merge 2 commits into
Draft
mm/memcontrol: optimize __memcg_slab_post_alloc_hook batch charging#3Copilot wants to merge 2 commits into
Copilot wants to merge 2 commits into
Conversation
…oc_hook
Rewrite the per-object slab memcg charging loop with several improvements:
1. Hoist the loop-invariant max_run computation before the loop.
The cap (INT_MAX - PAGE_SIZE) / obj_size is constant across all
iterations.
2. Replace the single upfront bulk charge + per-object vmstat updates
with per-pgdat-run batching. The scan-ahead loop groups consecutive
objects on the same pgdat node, then:
- obj_cgroup_charge() is called once per run (not once for all
objects upfront)
- mod_objcg_state() is called once per run instead of once per
object, reducing repeated local_lock_irqsave/irqrestore from
N acquisitions to approximately one per pgdat run
- Objects that fail alloc_slab_obj_exts() are simply skipped
rather than being charged upfront and then individually uncharged
3. Avoid redundant virt_to_slab() for the first object of each run
by reusing the slab pointer already computed at the top of the
outer loop.
4. Use run_end absolute index with while (++i < run_end) to advance i
directly, eliminating a separate run_len counter and relative
indexing. The skip_next flag pattern is not needed: when
alloc_slab_obj_exts() fails in the scan-ahead, we simply break
and the next outer iteration retries that object naturally.
Agent-Logs-Url: https://github.com/teawater/linux/sessions/cd4a349b-c91d-46eb-889a-fa3f75c83e73
Co-authored-by: teawater <432382+teawater@users.noreply.github.com>
Copilot
AI
changed the title
[WIP] Fix performance issues in batch memcg charging implementation
mm/memcontrol: optimize __memcg_slab_post_alloc_hook batch charging
Apr 16, 2026
teawater
pushed a commit
that referenced
this pull request
May 7, 2026
Eduard Zingerman says:
====================
bpf: replace min/max fields with struct cnum{32,64}
This RFC replaces s64, u64, s32, u32 scalar range domains tracked by
verifier by a pair of circular numbers (cnums): one for 64-bit domain
and another for 32-bit domain. Each cnum represents a range as a
single arc on the circular number line, from which signed and unsigned
bounds are derived on demand. See also wrapped intervals
representation as in [1].
The use of such representation simplifies arithmetic and conditions
handling in verifier.c and allows to express 32 <-> 64 bit deductions
in a more mathematically rigorous way.
[1] https://jorgenavas.github.io/papers/ACM-TOPLAS-wrapped.pdf
Changelog
=========
v2 -> v1:
- Fixes in source code comments highlighted by bot+bpf-ci.
RFCv1 -> v2:
- Dropped RFC tag.
- Dropped cnum{32,64}_mul(), too much complexity and no veristat
or selftests gains.
- cnum32_from_cnum64() normalizes to CNUM32_EMPTY when input
cnum64.size >= U32_MAX, previously only checked for > U32_MAX
(bot+bpf-ci).
- cnum32_from_cnum64() and cnum64_cnum32_intersect() now check for
empty inputs (sashiko).
- In regs_refine_cond_op() case BPF_JSLT use cnum{32,64}_from_srange
constructors instead of unsigned variants (sashiko).
- cnum{32,64}_intersect_with{,_urange,_srange}() helpers added
(Alexei)
[RFCv1] https://lore.kernel.org/bpf/0c47b0b7ea476647746806c46fded4353be885f7.camel@gmail.com/T/
[v2] https://lore.kernel.org/bpf/c4fb60bafd526ae2d92b86e2250255aef0ba5ee1.camel@gmail.com/T/
Patch set layout
================
Patch #1 introduces the cnum (circular number) data type and basic
operations: intersection, addition, multiplication, negation, 32-to-64
and 64-to-32 projections.
CBMC based proofs for these operations are available at [2].
(The proofs lag behind the current patch set and need an update if
this RFC moves further. Although, I don't expect any significant
changes).
[2] https://github.com/eddyz87/cnum-verif
Patch #2 mechanically converts all direct accesses to bpf_reg_state's
min/max fields to use accessor functions, preparing for the field
replacement.
Patch #3 replaces the eight independent min/max fields in
bpf_reg_state with two cnum fields. The signed and unsigned bounds are
derived on demand from the cnum via the accessor functions.
This eliminates the separate signed<->unsigned cross-deduction logic,
simplifies reg_bounds_sync(), regs_refine_cond_op() and some
arithmetic operations.
Patch #4 adds selftest cases for improved 32->64 range refinements
enabled by cnum64_cnum32_intersect().
Precision trade-offs
====================
A cnum represents a contiguous arc on the circular number line.
Current master tracks four independent ranges (s64, u64, s32, u32)
whose intersection could implicitly represent value sets that are
two disjoint arcs on the circle - something a single cnum cannot.
Cnums lose precision when a cross-domain conditional comparison
(e.g., unsigned comparison on a register with a signed-derived range)
produces an intersection that would be two disjoint arcs.
The cnum must over-approximate by returning one of the input arcs.
E.g. a pair of ranges u64:[1,U64_MAX] s:[-1,1] represents a set of two
values: {-1,1}, this can only be approximated as a set of three values
by cnum: {-1,0,1}.
Cnums gain precision in arithmetic: when addition, subtraction or
multiplication causes register value to cross both signed and unsigned
min/max boundaries. Current master resets the register as unbound in
such cases, while cnums preserve the exact wrapping arc.
In practice precision loss turned out to matter only for a set of
specifically crafted selftests from reg_bounds.c (see patch #3 for
more details). On the other hand, precision gains are observable for a
set real-life programs.
Verification performance measurements
=====================================
Tested on 6683 programs across four corpora: Cilium (134 programs),
selftests (4635), sched_ext (376), and Meta internal BPF corpus
(1540). Program verification verdicts are identical across all four
program sets - no program changes between success and failure.
Instruction count comparison (cnum vs signed/unsigned pair):
- 83 programs improve, 44 regress, 6596 unchanged
- Total saved: ~290K insns, total added: 10K insns
- Largest improvements:
-26% insns in a internal network firewall program;
-47% insns in rusty/wd40 sched_ext schedulers.
- Largest regression: +24% insns in one Cilium program (5062 insns
added).
Raw performance data is at the bottom of the email.
Raw verification performance metrics
====================================
Filtered out by -f insns_pct>5 -f !insns<10000
========= selftests: master vs cnums-everywhere-v2 =========
File Program Insns (A) Insns (B) Insns (DIFF)
---- ------- --------- --------- ------------
Total progs: 4665
total_insns diff min: -15.71%
total_insns diff max: 45.45%
-20 .. -10 %: 3
-5 .. 0 %: 6
0 .. 5 %: 4654
45 .. 50 %: 2
========= scx: master vs cnums-everywhere-v2 =========
File Program Insns (A) Insns (B) Insns (DIFF)
--------------- -------------- --------- --------- ----------------
scx_rusty.bpf.o rusty_enqueue 39842 22053 -17789 (-44.65%)
scx_rusty.bpf.o rusty_stopping 37738 19949 -17789 (-47.14%)
scx_wd40.bpf.o wd40_stopping 37729 19880 -17849 (-47.31%)
Total progs: 376
total_insns diff min: -47.31%
total_insns diff max: 19.61%
-50 .. -40 %: 3
-5 .. 0 %: 5
0 .. 5 %: 366
5 .. 15 %: 1
15 .. 20 %: 1
========= meta: master vs cnums-everywhere-v2 =========
File Program Insns (A) Insns (B) Insns (DIFF)
--------------------------------- ---------- --------- --------- ----------------
<meta-internal-firewall-program> ...egress 222327 164648 -57679 (-25.94%)
<meta-internal-firewall-program> ..._tc_eg 222839 164772 -58067 (-26.06%)
<meta-internal-firewall-program> ...egress 222327 164648 -57679 (-25.94%)
<meta-internal-firewall-program> ..._tc_eg 222839 164772 -58067 (-26.06%)
Total progs: 1540
total_insns diff min: -26.06%
total_insns diff max: 6.69%
-30 .. -20 %: 4
-15 .. -5 %: 6
-5 .. 0 %: 36
0 .. 5 %: 1493
5 .. 10 %: 1
========= cilium: master vs cnums-everywhere-v2 =========
File Program Insns (A) Insns (B) Insns (DIFF)
---------- ------------------------------- --------- --------- ---------------
bpf_host.o tail_handle_ipv4_cont_from_host 20962 26024 +5062 (+24.15%)
bpf_host.o tail_handle_ipv6_cont_from_host 17036 18672 +1636 (+9.60%)
Total progs: 134
total_insns diff min: -3.32%
total_insns diff max: 24.15%
-5 .. 0 %: 12
0 .. 5 %: 120
5 .. 15 %: 1
20 .. 25 %: 1
---
====================
Link: https://patch.msgid.link/20260424-cnums-everywhere-rfc-v1-v3-0-ca434b39a486@gmail.com
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
teawater
pushed a commit
that referenced
this pull request
May 21, 2026
When unregistered my self-written scx scheduler, the following panic occurs. [ 229.923133] Kernel text patching generated an invalid instruction at 0xffff80009bc2c1f8! [ 229.923146] Internal error: Oops - BRK: 00000000f2000100 [#1] SMP [ 230.077871] CPU: 48 UID: 0 PID: 1760 Comm: kworker/u583:7 Not tainted 7.0.0+ #3 PREEMPT(full) [ 230.086677] Hardware name: NVIDIA GB200 NVL/P3809-BMC, BIOS 02.05.12 20251107 [ 230.093972] Workqueue: events_unbound bpf_map_free_deferred [ 230.099675] Sched_ext: invariant_0.1.0_aarch64_unknown_linux_gnu_debug (disabling), task: runnable_at=-174ms [ 230.116843] pc : 0xffff80009bc2c1f8 [ 230.120406] lr : dequeue_task_scx+0x270/0x2d0 [ 230.217749] Call trace: [ 230.228515] 0xffff80009bc2c1f8 (P) [ 230.232077] dequeue_task+0x84/0x188 [ 230.235728] sched_change_begin+0x1dc/0x250 [ 230.240000] __set_cpus_allowed_ptr_locked+0x17c/0x240 [ 230.245250] __set_cpus_allowed_ptr+0x74/0xf0 [ 230.249701] ___migrate_enable+0x4c/0xa0 [ 230.253707] bpf_map_free_deferred+0x1a4/0x1b0 [ 230.258246] process_one_work+0x184/0x540 [ 230.262342] worker_thread+0x19c/0x348 [ 230.266170] kthread+0x13c/0x150 [ 230.269465] ret_from_fork+0x10/0x20 [ 230.281393] Code: d4202000 d4202000 d4202000 d4202000 (d4202000) [ 230.287621] ---[ end trace 0000000000000000 ]--- [ 231.160046] Kernel panic - not syncing: Oops - BRK: Fatal exception in interrupt The root cause is that the JIT page backing ops->quiescent() is freed before all callers of that function have stopped. The expected ordering during teardown is: bitmap_zero(sch->has_op) + synchronize_rcu() -> guarantees no CPU will ever call sch->ops.* again -> only THEN free the BPF struct_ops JIT page bpf_scx_unreg() is supposed to enforce the order, but after commit f4a6c50 ("sched_ext: Always bounce scx_disable() through irq_work"), disable_work is no longer queued directly, causing kthread_flush_work() to be a noop. Thus, the caller drops the struct_ops map too early and poisoned with AARCH64_BREAK_FAULT before disable_workfn ever execute. So the subsequent dequeue_task() still sees SCX_HAS_OP(sch, quiescent) as true and calls ops.quiescent, which hit on the poisoned page and BRK panic. Add a helper scx_flush_disable_work() so the future use cases that want to flush disable_work can use it. Also amend the call for scx_root_enable_workfn() and scx_sub_enable_workfn() which have similar pattern in the error path. Fixes: f4a6c50 ("sched_ext: Always bounce scx_disable() through irq_work") Signed-off-by: Richard Cheng <icheng@nvidia.com> Reviewed-by: Andrea Righi <arighi@nvidia.com> Reviewed-by: Cheng-Yang Chou <yphbchou0911@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>
teawater
pushed a commit
that referenced
this pull request
May 21, 2026
ice_reset_all_vfs() ignores the return value of ice_vf_rebuild_vsi().
When the VSI rebuild fails (e.g. during NVM firmware update via
nvmupdate64e), ice_vsi_rebuild() tears down the VSI on its error path,
leaving txq_map and rxq_map as NULL. The subsequent unconditional call
to ice_vf_post_vsi_rebuild() leads to a NULL pointer dereference in
ice_ena_vf_q_mappings() when it accesses vsi->txq_map[0].
The single-VF reset path in ice_reset_vf() already handles this
correctly by checking the return value of ice_vf_reconfig_vsi() and
skipping ice_vf_post_vsi_rebuild() on failure.
Apply the same pattern to ice_reset_all_vfs(): check the return value
of ice_vf_rebuild_vsi() and skip ice_vf_post_vsi_rebuild() and
ice_eswitch_attach_vf() on failure. The VF is left safely disabled
(ICE_VF_STATE_INIT not set, VFGEN_RSTAT not set to VFACTIVE) and can
be recovered via a VFLR triggered by a PCI reset of the VF
(sysfs reset or driver rebind).
Note that this patch does not prevent the VF VSI rebuild from failing
during NVM update — the underlying cause is firmware being in a
transitional state while the EMP reset is processed, which can cause
Admin Queue commands (ice_add_vsi, ice_cfg_vsi_lan) to fail. This
patch only prevents the subsequent NULL pointer dereference that
crashes the kernel when the rebuild does fail.
crash> bt
PID: 50795 TASK: ff34c9ee708dc680 CPU: 1 COMMAND: "kworker/u512:5"
#0 [ff72159bcfe5bb50] machine_kexec at ffffffffaa8850ee
#1 [ff72159bcfe5bba8] __crash_kexec at ffffffffaaa15fba
#2 [ff72159bcfe5bc68] crash_kexec at ffffffffaaa16540
#3 [ff72159bcfe5bc70] oops_end at ffffffffaa837eda
#4 [ff72159bcfe5bc90] page_fault_oops at ffffffffaa893997
#5 [ff72159bcfe5bce8] exc_page_fault at ffffffffab528595
#6 [ff72159bcfe5bd10] asm_exc_page_fault at ffffffffab600bb2
[exception RIP: ice_ena_vf_q_mappings+0x79]
RIP: ffffffffc0a85b29 RSP: ff72159bcfe5bdc8 RFLAGS: 00010206
RAX: 00000000000f0000 RBX: ff34c9efc9c00000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000010 RDI: ff34c9efc9c00000
RBP: ff34c9efc27d4828 R8: 0000000000000093 R9: 0000000000000040
R10: ff34c9efc27d4828 R11: 0000000000000040 R12: 0000000000100000
R13: 0000000000000010 R14: R15:
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#7 [ff72159bcfe5bdf8] ice_sriov_post_vsi_rebuild at ffffffffc0a85e2e [ice]
#8 [ff72159bcfe5be08] ice_reset_all_vfs at ffffffffc0a920b4 [ice]
#9 [ff72159bcfe5be48] ice_service_task at ffffffffc0a31519 [ice]
#10 [ff72159bcfe5be88] process_one_work at ffffffffaa93dca4
#11 [ff72159bcfe5bec8] worker_thread at ffffffffaa93e9de
#12 [ff72159bcfe5bf18] kthread at ffffffffaa946663
#13 [ff72159bcfe5bf50] ret_from_fork at ffffffffaa8086b9
The panic occurs attempting to dereference the NULL pointer in RDX at
ice_sriov.c:294, which loads vsi->txq_map (offset 0x4b8 in ice_vsi).
The faulting VSI is an allocated slab object but not fully initialized
after a failed ice_vsi_rebuild():
crash> struct ice_vsi 0xff34c9efc27d4828
netdev = 0x0,
rx_rings = 0x0,
tx_rings = 0x0,
q_vectors = 0x0,
txq_map = 0x0,
rxq_map = 0x0,
alloc_txq = 0x10,
num_txq = 0x10,
alloc_rxq = 0x10,
num_rxq = 0x10,
The nvmupdate64e process was performing NVM firmware update:
crash> bt 0xff34c9edd1a30000
PID: 49858 TASK: ff34c9edd1a30000 CPU: 1 COMMAND: "nvmupdate64e"
#0 [ff72159bcd617618] __schedule at ffffffffab5333f8
#4 [ff72159bcd617750] ice_sq_send_cmd at ffffffffc0a35347 [ice]
#5 [ff72159bcd6177a8] ice_sq_send_cmd_retry at ffffffffc0a35b47 [ice]
#6 [ff72159bcd617810] ice_aq_send_cmd at ffffffffc0a38018 [ice]
#7 [ff72159bcd617848] ice_aq_read_nvm at ffffffffc0a40254 [ice]
#8 [ff72159bcd6178b8] ice_read_flat_nvm at ffffffffc0a4034c [ice]
#9 [ff72159bcd617918] ice_devlink_nvm_snapshot at ffffffffc0a6ffa5 [ice]
dmesg:
ice 0000:13:00.0: firmware recommends not updating fw.mgmt, as it
may result in a downgrade. continuing anyways
ice 0000:13:00.1: ice_init_nvm failed -5
ice 0000:13:00.1: Rebuild failed, unload and reload driver
Fixes: 12bb018 ("ice: Refactor VF reset")
Signed-off-by: Petr Oros <poros@redhat.com>
Tested-by: Rafal Romanowski <rafal.romanowski@intel.com>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Link: https://patch.msgid.link/20260427-jk-iwl-net-petr-oros-fixes-v1-5-cdcb48303fd8@intel.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
teawater
pushed a commit
that referenced
this pull request
May 21, 2026
When a controller reset is triggered via sysfs (by writing to
/sys/class/nvme/<nvmedev>/reset_controller), the reset work tears down
and re-establishes all queues. The socket release using fput() defers
the actual cleanup to task_work delayed_fput workqueue. This deferred
cleanup can race with the subsequent queue re-allocation during reset,
potentially leading to use-after-free or resource conflicts.
Replace fput() with __fput_sync() to ensure synchronous socket release,
guaranteeing that all socket resources are fully cleaned up before the
function returns. This prevents races during controller reset where
new queue setup may begin before the old socket is fully released.
* Call chain during reset:
nvme_reset_ctrl_work()
-> nvme_tcp_teardown_ctrl()
-> nvme_tcp_teardown_io_queues()
-> nvme_tcp_free_io_queues()
-> nvme_tcp_free_queue() <-- fput() -> __fput_sync()
-> nvme_tcp_teardown_admin_queue()
-> nvme_tcp_free_admin_queue()
-> nvme_tcp_free_queue() <-- fput() -> __fput_sync()
-> nvme_tcp_setup_ctrl() <-- race with deferred fput
memalloc_noreclaim_save() sets PF_MEMALLOC which is intended for tasks
performing memory reclaim work that need reserve access. While PF_MEMALLOC
prevents the task from entering direct reclaim (causing __need_reclaim() to
return false), it does not strip __GFP_IO from gfp flags. The allocator can
therefore still trigger writeback I/O when __GFP_IO remains set, which is
unsafe when the caller holds block layer locks.
Switch to memalloc_noio_save() which sets PF_MEMALLOC_NOIO. This causes
current_gfp_context() to strip __GFP_IO|__GFP_FS from every allocation in
the scope, making it safe to allocate memory while holding elevator_lock and
set->srcu.
* The issue can be reproduced using blktests:
nvme_trtype=tcp ./check nvme/005
blktests (master) # nvme_trtype=tcp ./check nvme/005
nvme/005 (tr=tcp) (reset local loopback target) [failed]
runtime 0.725s ... 0.798s
something found in dmesg:
[ 108.473940] run blktests nvme/005 at 2025-11-22 16:12:20
[...]
...
(See '/root/blktests/results/nodev_tr_tcp/nvme/005.dmesg' for the entire message)
blktests (master) # cat /root/blktests/results/nodev_tr_tcp/nvme/005.dmesg
[ 108.473940] run blktests nvme/005 at 2025-11-22 16:12:20
[ 108.526983] loop0: detected capacity change from 0 to 2097152
[ 108.555606] nvmet: adding nsid 1 to subsystem blktests-subsystem-1
[ 108.572531] nvmet_tcp: enabling port 0 (127.0.0.1:4420)
[ 108.613061] nvmet: Created nvm controller 1 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 108.616832] nvme nvme0: creating 48 I/O queues.
[ 108.630791] nvme nvme0: mapped 48/0/0 default/read/poll queues.
[ 108.661892] nvme nvme0: new ctrl: NQN "blktests-subsystem-1", addr 127.0.0.1:4420, hostnqn: nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349
[ 108.746639] nvmet: Created nvm controller 2 for subsystem blktests-subsystem-1 for NQN nqn.2014-08.org.nvmexpress:uuid:0f01fb42-9f7f-4856-b0b3-51e60b8de349.
[ 108.748466] nvme nvme0: creating 48 I/O queues.
[ 108.802984] nvme nvme0: mapped 48/0/0 default/read/poll queues.
[ 108.829983] nvme nvme0: Removing ctrl: NQN "blktests-subsystem-1"
[ 108.854288] block nvme0n1: no available path - failing I/O
[ 108.854344] block nvme0n1: no available path - failing I/O
[ 108.854373] Buffer I/O error on dev nvme0n1, logical block 1, async page read
[ 108.891693] ======================================================
[ 108.895912] WARNING: possible circular locking dependency detected
[ 108.900184] 6.17.0nvme+ #3 Tainted: G N
[ 108.903913] ------------------------------------------------------
[ 108.908171] nvme/2734 is trying to acquire lock:
[ 108.911957] ffff88810210e610 (set->srcu){.+.+}-{0:0}, at: __synchronize_srcu+0x17/0x170
[ 108.917587]
but task is already holding lock:
[ 108.921570] ffff88813abea198 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0xa8/0x1c0
[ 108.927361]
which lock already depends on the new lock.
[ 108.933018]
the existing dependency chain (in reverse order) is:
[ 108.938223]
-> #4 (&q->elevator_lock){+.+.}-{4:4}:
[ 108.942988] __mutex_lock+0xa2/0x1150
[ 108.945873] elevator_change+0xa8/0x1c0
[ 108.948925] elv_iosched_store+0xdf/0x140
[ 108.952043] kernfs_fop_write_iter+0x16a/0x220
[ 108.955367] vfs_write+0x378/0x520
[ 108.957598] ksys_write+0x67/0xe0
[ 108.959721] do_syscall_64+0x76/0xbb0
[ 108.962052] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 108.965145]
-> #3 (&q->q_usage_counter(io)){++++}-{0:0}:
[ 108.968923] blk_alloc_queue+0x30e/0x350
[ 108.972117] blk_mq_alloc_queue+0x61/0xd0
[ 108.974677] scsi_alloc_sdev+0x2a0/0x3e0
[ 108.977092] scsi_probe_and_add_lun+0x1bd/0x430
[ 108.979921] __scsi_add_device+0x109/0x120
[ 108.982504] ata_scsi_scan_host+0x97/0x1c0
[ 108.984365] async_run_entry_fn+0x2d/0x130
[ 108.986109] process_one_work+0x20e/0x630
[ 108.987830] worker_thread+0x184/0x330
[ 108.989473] kthread+0x10a/0x250
[ 108.990852] ret_from_fork+0x297/0x300
[ 108.992491] ret_from_fork_asm+0x1a/0x30
[ 108.994159]
-> #2 (fs_reclaim){+.+.}-{0:0}:
[ 108.996320] fs_reclaim_acquire+0x99/0xd0
[ 108.998058] kmem_cache_alloc_node_noprof+0x4e/0x3c0
[ 109.000123] __alloc_skb+0x15f/0x190
[ 109.002195] tcp_send_active_reset+0x3f/0x1e0
[ 109.004038] tcp_disconnect+0x50b/0x720
[ 109.005695] __tcp_close+0x2b8/0x4b0
[ 109.007227] tcp_close+0x20/0x80
[ 109.008663] inet_release+0x31/0x60
[ 109.010175] __sock_release+0x3a/0xc0
[ 109.011778] sock_close+0x14/0x20
[ 109.013263] __fput+0xee/0x2c0
[ 109.014673] delayed_fput+0x31/0x50
[ 109.016183] process_one_work+0x20e/0x630
[ 109.017897] worker_thread+0x184/0x330
[ 109.019543] kthread+0x10a/0x250
[ 109.020929] ret_from_fork+0x297/0x300
[ 109.022565] ret_from_fork_asm+0x1a/0x30
[ 109.024194]
-> #1 (sk_lock-AF_INET-NVME){+.+.}-{0:0}:
[ 109.026634] lock_sock_nested+0x2e/0x70
[ 109.028251] tcp_sendmsg+0x1a/0x40
[ 109.029783] sock_sendmsg+0xed/0x110
[ 109.031321] nvme_tcp_try_send_cmd_pdu+0x13e/0x260 [nvme_tcp]
[ 109.034263] nvme_tcp_try_send+0xb3/0x330 [nvme_tcp]
[ 109.036375] nvme_tcp_queue_rq+0x342/0x3d0 [nvme_tcp]
[ 109.038528] blk_mq_dispatch_rq_list+0x297/0x800
[ 109.040448] __blk_mq_sched_dispatch_requests+0x3db/0x5f0
[ 109.042677] blk_mq_sched_dispatch_requests+0x29/0x70
[ 109.044787] blk_mq_run_work_fn+0x76/0x1b0
[ 109.046535] process_one_work+0x20e/0x630
[ 109.048245] worker_thread+0x184/0x330
[ 109.049890] kthread+0x10a/0x250
[ 109.051331] ret_from_fork+0x297/0x300
[ 109.053024] ret_from_fork_asm+0x1a/0x30
[ 109.054740]
-> #0 (set->srcu){.+.+}-{0:0}:
[ 109.056850] __lock_acquire+0x1468/0x2210
[ 109.058614] lock_sync+0xa5/0x110
[ 109.060048] __synchronize_srcu+0x49/0x170
[ 109.061802] elevator_switch+0xc9/0x330
[ 109.063950] elevator_change+0x128/0x1c0
[ 109.065675] elevator_set_none+0x4c/0x90
[ 109.067316] blk_unregister_queue+0xa8/0x110
[ 109.069165] __del_gendisk+0x14e/0x3c0
[ 109.070824] del_gendisk+0x75/0xa0
[ 109.072328] nvme_ns_remove+0xf2/0x230 [nvme_core]
[ 109.074365] nvme_remove_namespaces+0xf2/0x150 [nvme_core]
[ 109.076652] nvme_do_delete_ctrl+0x71/0x90 [nvme_core]
[ 109.078775] nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core]
[ 109.081009] nvme_sysfs_delete+0x34/0x40 [nvme_core]
[ 109.083082] kernfs_fop_write_iter+0x16a/0x220
[ 109.085009] vfs_write+0x378/0x520
[ 109.086539] ksys_write+0x67/0xe0
[ 109.087982] do_syscall_64+0x76/0xbb0
[ 109.089577] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 109.091665]
other info that might help us debug this:
[ 109.095478] Chain exists of:
set->srcu --> &q->q_usage_counter(io) --> &q->elevator_lock
[ 109.099544] Possible unsafe locking scenario:
[ 109.101708] CPU0 CPU1
[ 109.103402] ---- ----
[ 109.105103] lock(&q->elevator_lock);
[ 109.106530] lock(&q->q_usage_counter(io));
[ 109.109022] lock(&q->elevator_lock);
[ 109.111391] sync(set->srcu);
[ 109.112586]
*** DEADLOCK ***
[ 109.114772] 5 locks held by nvme/2734:
[ 109.116189] #0: ffff888101925410 (sb_writers#4){.+.+}-{0:0}, at: ksys_write+0x67/0xe0
[ 109.119143] #1: ffff88817a914e88 (&of->mutex#2){+.+.}-{4:4}, at: kernfs_fop_write_iter+0x10f/0x220
[ 109.123141] #2: ffff8881046313f8 (kn->active#185){++++}-{0:0}, at: sysfs_remove_file_self+0x26/0x50
[ 109.126543] #3: ffff88810470e1d0 (&set->update_nr_hwq_lock){++++}-{4:4}, at: del_gendisk+0x6d/0xa0
[ 109.129891] #4: ffff88813abea198 (&q->elevator_lock){+.+.}-{4:4}, at: elevator_change+0xa8/0x1c0
[ 109.133149]
stack backtrace:
[ 109.134817] CPU: 6 UID: 0 PID: 2734 Comm: nvme Tainted: G N 6.17.0nvme+ #3 PREEMPT(voluntary)
[ 109.134819] Tainted: [N]=TEST
[ 109.134820] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 109.134821] Call Trace:
[ 109.134823] <TASK>
[ 109.134824] dump_stack_lvl+0x75/0xb0
[ 109.134828] print_circular_bug+0x26a/0x330
[ 109.134831] check_noncircular+0x12f/0x150
[ 109.134834] __lock_acquire+0x1468/0x2210
[ 109.134837] ? __synchronize_srcu+0x17/0x170
[ 109.134838] lock_sync+0xa5/0x110
[ 109.134840] ? __synchronize_srcu+0x17/0x170
[ 109.134842] __synchronize_srcu+0x49/0x170
[ 109.134843] ? mark_held_locks+0x49/0x80
[ 109.134845] ? _raw_spin_unlock_irqrestore+0x2d/0x60
[ 109.134847] ? kvm_clock_get_cycles+0x14/0x30
[ 109.134853] ? ktime_get_mono_fast_ns+0x36/0xb0
[ 109.134858] elevator_switch+0xc9/0x330
[ 109.134860] elevator_change+0x128/0x1c0
[ 109.134862] ? kernfs_put.part.0+0x86/0x290
[ 109.134864] elevator_set_none+0x4c/0x90
[ 109.134866] blk_unregister_queue+0xa8/0x110
[ 109.134868] __del_gendisk+0x14e/0x3c0
[ 109.134870] del_gendisk+0x75/0xa0
[ 109.134872] nvme_ns_remove+0xf2/0x230 [nvme_core]
[ 109.134879] nvme_remove_namespaces+0xf2/0x150 [nvme_core]
[ 109.134887] nvme_do_delete_ctrl+0x71/0x90 [nvme_core]
[ 109.134893] nvme_delete_ctrl_sync+0x3b/0x50 [nvme_core]
[ 109.134899] nvme_sysfs_delete+0x34/0x40 [nvme_core]
[ 109.134905] kernfs_fop_write_iter+0x16a/0x220
[ 109.134908] vfs_write+0x378/0x520
[ 109.134911] ksys_write+0x67/0xe0
[ 109.134913] do_syscall_64+0x76/0xbb0
[ 109.134915] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[ 109.134916] RIP: 0033:0x7fd68a737317
[ 109.134917] Code: 0d 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 109.134919] RSP: 002b:00007ffded1546d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 109.134920] RAX: ffffffffffffffda RBX: 000000000054f7e0 RCX: 00007fd68a737317
[ 109.134921] RDX: 0000000000000001 RSI: 00007fd68a855719 RDI: 0000000000000003
[ 109.134921] RBP: 0000000000000003 R08: 0000000030407850 R09: 00007fd68a7cd4e0
[ 109.134922] R10: 00007fd68a65b130 R11: 0000000000000246 R12: 00007fd68a855719
[ 109.134923] R13: 00000000304074c0 R14: 00000000304074c0 R15: 0000000030408660
[ 109.134926] </TASK>
[ 109.962756] Key type psk unregistered
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Sagi Grimberg <sagi@grimberg.me>
Reviewed-by: Hannes Reinecke <hare@suse.de>
Signed-off-by: Chaitanya Kulkarni <kch@nvidia.com>
Signed-off-by: Keith Busch <kbusch@kernel.org>
teawater
pushed a commit
that referenced
this pull request
May 21, 2026
…equeue_peeked When red qdisc has children (eg qfq qdisc) whose peek() callback is qdisc_peek_dequeued(), we could get a kernel panic. When the parent of such qdiscs (eg illustrated in patch #3 as tbf) wants to retrieve an skb from its child (red in this case), it will do the following: 1a. do a peek() - and when sensing there's an skb the child can offer, then - the child in this case(red) calls its child's (qfq) peek. qfq does the right thing and will return the gso_skb queue packet. Note: if there wasnt a gso_skb entry then qfq will store it there. 1b. invoke a dequeue() on the child (red). And herein lies the problem. - red will call the child's dequeue() which will essentially just try to grab something of qfq's queue. [ 78.667668][ T363] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f] [ 78.667927][ T363] CPU: 1 UID: 0 PID: 363 Comm: ping Not tainted 7.1.0-rc1-00033-g46f74a3f7d57-dirty #790 PREEMPT(full) [ 78.668263][ T363] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 78.668486][ T363] RIP: 0010:qfq_dequeue+0x446/0xc90 [sch_qfq] [ 78.668718][ T363] Code: 54 c0 e8 dd 90 00 f1 48 c7 c7 e0 03 54 c0 48 89 de e8 ce 90 00 f1 48 8d 7b 48 b8 ff ff 37 00 48 89 fa 48 c1 e0 2a 48 c1 ea 03 <80> 3c 02 00 74 05 e8 ef a1 e1 f1 48 8b 7b 48 48 8d 54 24 58 48 8d [ 78.669312][ T363] RSP: 0018:ffff88810de573e0 EFLAGS: 00010216 [ 78.669533][ T363] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 [ 78.669790][ T363] RDX: 0000000000000009 RSI: 0000000000000004 RDI: 0000000000000048 [ 78.670044][ T363] RBP: ffff888110dc4000 R08: ffffffffb1b0885a R09: fffffbfff6ba9078 [ 78.670297][ T363] R10: 0000000000000003 R11: ffff888110e31c80 R12: 0000001880000000 [ 78.670560][ T363] R13: ffff888110dc4150 R14: ffff888110dc42b8 R15: 0000000000000200 [ 78.670814][ T363] FS: 00007f66a8f09c40(0000) GS:ffff888163428000(0000) knlGS:0000000000000000 [ 78.671110][ T363] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 78.671324][ T363] CR2: 000055db4c6a30a8 CR3: 000000010da67000 CR4: 0000000000750ef0 [ 78.671585][ T363] PKRU: 55555554 [ 78.671713][ T363] Call Trace: [ 78.671843][ T363] <TASK> [ 78.671936][ T363] ? __pfx_qfq_dequeue+0x10/0x10 [sch_qfq] [ 78.672148][ T363] ? __pfx__printk+0x10/0x10 [ 78.672322][ T363] ? srso_alias_return_thunk+0x5/0xfbef5 [ 78.672496][ T363] ? lockdep_hardirqs_on_prepare+0xa8/0x1a0 [ 78.672706][ T363] ? srso_alias_return_thunk+0x5/0xfbef5 [ 78.672875][ T363] ? trace_hardirqs_on+0x19/0x1a0 [ 78.673047][ T363] red_dequeue+0x65/0x270 [sch_red] [ 78.673217][ T363] ? srso_alias_return_thunk+0x5/0xfbef5 [ 78.673385][ T363] tbf_dequeue.cold+0xb0/0x70c [sch_tbf] [ 78.673566][ T363] __qdisc_run+0x169/0x1900 The right thing to do in #1b is to grab the skb off gso_skb queue. This patchset fixes that issue by changing #1b to use qdisc_dequeue_peeked() method instead. Fixes: 77be155 ("pkt_sched: Add peek emulation for non-work-conserving qdiscs.") Reported-by: Manas <ghandatmanas@gmail.com> Reported-by: Rakshit Awasthi <rakshitawasthi17@gmail.com> Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260430152957.194015-2-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
teawater
pushed a commit
that referenced
this pull request
May 21, 2026
…equeue_peeked When sfb has children (eg qfq qdisc) whose peek() callback is qdisc_peek_dequeued(), we could get a kernel panic. When the parent of such qdiscs (eg illustrated in patch #3 as tbf) wants to retrieve an skb from its child (sfb in this case), it will do the following: 1a. do a peek() - and when sensing there's an skb the child can offer, then - the child in this case(sfb) calls its child's (qfq) peek. qfq does the right thing and will return the gso_skb queue packet. Note: if there wasnt a gso_skb entry then qfq will store it there. 1b. invoke a dequeue() on the child (sfb). And herein lies the problem. - sfb will call the child's dequeue() which will essentially just try to grab something of qfq's queue. [ 127.594489][ T453] KASAN: null-ptr-deref in range [0x0000000000000048-0x000000000000004f] [ 127.594741][ T453] CPU: 2 UID: 0 PID: 453 Comm: ping Not tainted 7.1.0-rc1-00035-gac961974495b-dirty #793 PREEMPT(full) [ 127.595059][ T453] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011 [ 127.595254][ T453] RIP: 0010:qfq_dequeue+0x35c/0x1650 [sch_qfq] [ 127.595461][ T453] Code: 00 fc ff df 80 3c 02 00 0f 85 17 0e 00 00 4c 8d 73 48 48 89 9d b8 02 00 00 48 b8 00 00 00 00 00 fc ff df 4c 89 f2 48 c1 ea 03 <80> 3c 02 00 0f 85 76 0c 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8b [ 127.596081][ T453] RSP: 0018:ffff88810e5af440 EFLAGS: 00010216 [ 127.596337][ T453] RAX: dffffc0000000000 RBX: 0000000000000000 RCX: dffffc0000000000 [ 127.596623][ T453] RDX: 0000000000000009 RSI: 0000001880000000 RDI: ffff888104fd82b0 [ 127.596917][ T453] RBP: ffff888104fd8000 R08: ffff888104fd8280 R09: 1ffff110211893a3 [ 127.597165][ T453] R10: 1ffff110211893a6 R11: 1ffff110211893a7 R12: 0000001880000000 [ 127.597404][ T453] R13: ffff888104fd82b8 R14: 0000000000000048 R15: 0000000040000000 [ 127.597644][ T453] FS: 00007fc380cbfc40(0000) GS:ffff88816f2a8000(0000) knlGS:0000000000000000 [ 127.597956][ T453] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 127.598160][ T453] CR2: 00005610aa9890a8 CR3: 000000010369e000 CR4: 0000000000750ef0 [ 127.598390][ T453] PKRU: 55555554 [ 127.598509][ T453] Call Trace: [ 127.598629][ T453] <TASK> [ 127.598718][ T453] ? mark_held_locks+0x40/0x70 [ 127.598890][ T453] ? srso_alias_return_thunk+0x5/0xfbef5 [ 127.599053][ T453] sfb_dequeue+0x88/0x4d0 [ 127.599174][ T453] ? ktime_get+0x137/0x230 [ 127.599328][ T453] ? srso_alias_return_thunk+0x5/0xfbef5 [ 127.599480][ T453] ? qdisc_peek_dequeued+0x7b/0x350 [sch_qfq] [ 127.599670][ T453] ? srso_alias_return_thunk+0x5/0xfbef5 [ 127.599831][ T453] tbf_dequeue+0x6b1/0x1098 [sch_tbf] [ 127.599988][ T453] __qdisc_run+0x169/0x1900 The right thing to do in #1b is to grab the skb off gso_skb queue. This patchset fixes that issue by changing #1b to use qdisc_dequeue_peeked() method instead. Fixes: e13e02a ("net_sched: SFB flow scheduler") Signed-off-by: Victor Nogueria <victor@mojatatu.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://patch.msgid.link/20260430152957.194015-3-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
teawater
pushed a commit
that referenced
this pull request
May 21, 2026
Jamal Hadi Salim says: ==================== Replace direct dequeue call with qdisc_dequeue_peeked When sfb and red qdiscs have children (eg qfq qdisc) whose peek() callback is qdisc_peek_dequeued(), we could get a kernel panic. When the parent of such qdiscs (eg illustrated in patch #3 as tbf) wants to retrieve an skb from its child (red/sfb in this case), it will do the following: 1a. do a peek() - and when sensing there's an skb the child can offer, then - the child in this case(red/sfb) calls its child's (qfq) peek. qfq does the right thing and will return the gso_skb queue packet. Note: if there wasnt a gso_skb entry then qfq will store it there. 1b. invoke a dequeue() on the child (red/sfb). And herein lies the problem. - red/sfb will call the child's dequeue() which will essentially just try to grab something of qfq's queue. The right thing to do in #1b is to grab the skb off gso_skb queue. This patchset fixes that issue by changing #1b to use qdisc_dequeue_peeked() method instead. Patch 1 fixes the issue for red qdisc. Patch 2 fixes it for sfb. Patch 3 adds testcases for the two setups. ==================== Link: https://patch.msgid.link/20260430152957.194015-1-jhs@mojatatu.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
teawater
pushed a commit
that referenced
this pull request
May 21, 2026
Pavan Chebbi says: ==================== bnxt_en: Bug fixes This patchset adds the following fixes for bnxt: Patch #1 fixes DPC AER handling to make it more reliable Patch #2 fixes incorrect capping bp->max_tpa based on what the FW supports Patch #3 fixes ignoring of VNIC configuration result when RDMA driver is loading Patch #4 fixes logic to make phase adjustment on the PPS OUT signal ==================== Link: https://patch.msgid.link/20260504083611.1383776-1-pavan.chebbi@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
teawater
pushed a commit
that referenced
this pull request
May 21, 2026
syzbot reports for sleeping function called from invalid context [1].
The recently added code for resizable hash tables uses
hlist_bl bit locks in combination with spin_lock for
the connection fields (cp->lock).
Fix the following problems:
* avoid using spin_lock(&cp->lock) under locked bit lock
because it sleeps on PREEMPT_RT
* as the recent changes call ip_vs_conn_hash() only for newly
allocated connection, the spin_lock can be removed there because
the connection is still not linked to table and does not need
cp->lock protection.
* the lock can be removed also from ip_vs_conn_unlink() where we
are the last connection user.
* the last place that is fixed is ip_vs_conn_fill_cport()
where now the cp->lock is locked before the other locks to
ensure other packets do not modify the cp->flags in non-atomic
way. Here we make sure cport and flags are changed only once
if two or more packets race to fill the cport. Also, we fill
cport early, so that if we race with resizing there will be
valid cport key for the hashing. Add a warning if too many
hash table changes occur for our RCU read-side critical
section which is error condition but minor because the
connection still can expire gracefully. Still, restore the
cport to 0 to allow retransmitted packet to properly fill
the cport. Problems reported by Sashiko.
[1]:
BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 16, name: ktimers/0
preempt_count: 2, expected: 0
RCU nest depth: 3, expected: 3
8 locks held by ktimers/0/16:
#0: ffffffff8de5f260 (local_bh){.+.+}-{1:3}, at: __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163
#1: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163
#2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: spin_lock include/linux/spinlock_rt.h:45 [inline]
#2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: timer_base_lock_expiry kernel/time/timer.c:1502 [inline]
#2: ffff8880b8826360 (&base->expiry_lock){+...}-{3:3}, at: __run_timer_base+0x120/0x9f0 kernel/time/timer.c:2384
#3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:300 [inline]
#3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
#3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: __rt_spin_lock kernel/locking/spinlock_rt.c:50 [inline]
#3: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rt_spin_lock+0x1e0/0x400 kernel/locking/spinlock_rt.c:57
#4: ffffc90000157a80 ((&cp->timer)){+...}-{0:0}, at: call_timer_fn+0xd4/0x5e0 kernel/time/timer.c:1745
#5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_lock_acquire include/linux/rcupdate.h:300 [inline]
#5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: rcu_read_lock include/linux/rcupdate.h:838 [inline]
#5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:315 [inline]
#5: ffffffff8dfc80c0 (rcu_read_lock){....}-{1:3}, at: ip_vs_conn_expire+0x257/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260
#6: ffffffff8de5f260 (local_bh){.+.+}-{1:3}, at: __local_bh_disable_ip+0x3c/0x420 kernel/softirq.c:163
#7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: spin_lock include/linux/spinlock_rt.h:45 [inline]
#7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:324 [inline]
#7: ffff888068d4c3f0 (&cp->lock#2){+...}-{3:3}, at: ip_vs_conn_expire+0xd4a/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260
Preemption disabled at:
[<ffffffff898a6358>] bit_spin_lock include/linux/bit_spinlock.h:38 [inline]
[<ffffffff898a6358>] hlist_bl_lock+0x18/0x110 include/linux/list_bl.h:149
CPU: 0 UID: 0 PID: 16 Comm: ktimers/0 Tainted: G W L syzkaller #0 PREEMPT_{RT,(full)}
Tainted: [W]=WARN, [L]=SOFTLOCKUP
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 03/18/2026
Call Trace:
<TASK>
dump_stack_lvl+0xe8/0x150 lib/dump_stack.c:120
__might_resched+0x329/0x480 kernel/sched/core.c:9162
__rt_spin_lock kernel/locking/spinlock_rt.c:48 [inline]
rt_spin_lock+0xc2/0x400 kernel/locking/spinlock_rt.c:57
spin_lock include/linux/spinlock_rt.h:45 [inline]
ip_vs_conn_unlink net/netfilter/ipvs/ip_vs_conn.c:324 [inline]
ip_vs_conn_expire+0xd4a/0x2390 net/netfilter/ipvs/ip_vs_conn.c:1260
call_timer_fn+0x192/0x5e0 kernel/time/timer.c:1748
expire_timers kernel/time/timer.c:1799 [inline]
__run_timers kernel/time/timer.c:2374 [inline]
__run_timer_base+0x6a3/0x9f0 kernel/time/timer.c:2386
run_timer_base kernel/time/timer.c:2395 [inline]
run_timer_softirq+0xb7/0x170 kernel/time/timer.c:2405
handle_softirqs+0x1de/0x6d0 kernel/softirq.c:622
__do_softirq kernel/softirq.c:656 [inline]
run_ktimerd+0x69/0x100 kernel/softirq.c:1151
smpboot_thread_fn+0x541/0xa50 kernel/smpboot.c:160
kthread+0x388/0x470 kernel/kthread.c:436
ret_from_fork+0x514/0xb70 arch/x86/kernel/process.c:158
ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:245
</TASK>
Reported-by: syzbot+504e778ddaecd36fdd17@syzkaller.appspotmail.com
Link: https://sashiko.dev/#/patchset/20260415200216.79699-1-ja%40ssi.bg
Link: https://sashiko.dev/#/patchset/20260420165539.85174-4-ja%40ssi.bg
Link: https://sashiko.dev/#/patchset/20260422135823.50489-4-ja%40ssi.bg
Fixes: 2fa7cc9 ("ipvs: switch to per-net connection table")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
teawater
pushed a commit
that referenced
this pull request
May 21, 2026
Leon Hwang says: ==================== bpf: Extend BPF syscall with common attributes support This patch series builds upon the discussion in "[PATCH bpf-next v4 0/4] bpf: Improve error reporting for freplace attachment failure" [1]. This patch series introduces support for *common attributes* in the BPF syscall, providing a unified mechanism for passing shared metadata across all BPF commands, initially used by BPF_PROG_LOAD, BPF_BTF_LOAD, and BPF_MAP_CREATE. The initial set of common attributes includes: 1. 'log_buf': User-provided buffer for storing log output. 2. 'log_size': Size of the provided log buffer. 3. 'log_level': Verbosity level for logging. 4. 'log_true_size': Actual log size reported by kernel. With this extension, the BPF syscall will be able to return meaningful error messages (e.g., map creation failures), improving debuggability and user experience. Links: [1] https://lore.kernel.org/bpf/20250224153352.64689-1-leon.hwang@linux.dev/ Changes: v13 -> v14: * Replace __u64 with __aligned_u64 in struct bpf_common_attr in patch #1 (per bot+bpf-ci). * Add a single line comment for preserving original error in patch #6 (per bot+bpf-ci and Alexei). * Drop unused label in patch #6 (per bot+bpf-ci). * v13: https://lore.kernel.org/bpf/20260511152817.89191-1-leon.hwang@linux.dev/ v12 -> v13: * Rebase on bpf-next tree to resolve code conflict. * Report log_true_size on success path in patch #6. * Check size instead of common->log_buf in patch #6 (per bot+bpf-ci). * v12: https://lore.kernel.org/bpf/20260420141804.27179-1-leon.hwang@linux.dev/ v11 -> v12: * Drop "log_" prefix in struct bpf_log_attr in patch #3. * Drop "log_" prefix in struct bpf_log_opts in patch #7. * Copy log_true_size using copy_to_bpfptr_offset() in patch #3 (per Alexei). * v11: https://lore.kernel.org/bpf/20260216150445.68278-1-leon.hwang@linux.dev/ v10 -> v11: * Collect Acked-by from Andrii, thanks. * Validate whether log_buf, log_size, and log_level are valid by reusing bpf_verifier_log_attr_valid() in patch #4 (per Andrii). * v10: https://lore.kernel.org/bpf/20260211151115.78013-1-leon.hwang@linux.dev/ v9 -> v10: * Collect Acked-by from Andrii, thanks. * Address comments from Andrii: * Drop log NULL check in bpf_log_attr_finalize(). * Return -EFAULT early in bpf_log_attr_finalize(). * Validate whether log_buf, log_size, and log_level are set. * Keep log_buf, log_size, log_level, and user-pointer log_true_size in struct bpf_log_attr. * Make prog_load and btf_load work with the new struct bpf_log_attr. * Add comment to log_true_size of struct bpf_log_opts in libbpf. * Address comment from Alexei: * Avoid using BPF_LOG_FIXED as log_level in tests. * v9: https://lore.kernel.org/bpf/20260202144046.30651-1-leon.hwang@linux.dev/ v8 -> v9: * Rework reporting 'log_true_size' for prog_load, btf_load, and map_create to simplify struct bpf_log_attr (per Alexei). * v8: https://lore.kernel.org/bpf/20260126151409.52072-1-leon.hwang@linux.dev/ v7 -> v8: * Return 0 when fd < 0 and errno != EFAULT in probe_sys_bpf_ext(), then simplify probe_bpf_syscall_common_attrs() (per Alexei and Andrii). * v7: https://lore.kernel.org/bpf/20260123032445.125259-1-leon.hwang@linux.dev/ v6 -> v7: * Return -errno when fd < 0 and errno != EFAULT in probe_sys_bpf_ext(). * Convert return value of probe_sys_bpf_ext() to bool in probe_bpf_syscall_common_attrs(). * Address comments from Andrii: * Drop the comment, and handle fd >= 0 case explicitly in probe_sys_bpf_ext(). * Return an error when fd >= 0 in probe_sys_bpf_ext(). * v6: https://lore.kernel.org/bpf/20260120152424.40766-1-leon.hwang@linux.dev/ v5 -> v6: * Address comments from Andrii: * Update some variables' name. * Drop unnecessary 'close(fd)' in libbpf. * Rename FEAT_EXTENDED_SYSCALL to FEAT_BPF_SYSCALL_COMMON_ATTRS with updated description in libbpf. * Use EINVAL instead of EUSERS, as EUSERS is not used in bpf yet. * Rename struct bpf_syscall_common_attr_opts to bpf_log_opts in libbpf. * Add 'OPTS_SET(log_opts, log_true_size, 0);' in libbpf's 'bpf_map_create()'. * v5: https://lore.kernel.org/bpf/20260112145616.44195-1-leon.hwang@linux.dev/ v4 -> v5: * Rework reporting 'log_true_size' for prog_load, btf_load, and map_create (per Alexei). * v4: https://lore.kernel.org/bpf/20260106172018.57757-1-leon.hwang@linux.dev/ RFC v3 -> v4: * Drop RFC. * Address comments from Andrii: * Add parentheses in 'sys_bpf_ext()'. * Avoid creating new fd in 'probe_sys_bpf_ext()'. * Add a new struct to wrap log fields in libbpf. * Address comments from Alexei: * Do not skip writing to user space when log_true_size is zero. * Do not use 'bool' arguments. * Drop the adding WARN_ON_ONCE()'s. * v3: https://lore.kernel.org/bpf/20251002154841.99348-1-leon.hwang@linux.dev/ RFC v2 -> RFC v3: * Rename probe_sys_bpf_extended to probe_sys_bpf_ext. * Refactor reporting 'log_true_size' for prog_load. * Refactor reporting 'btf_log_true_size' for btf_load. * Add warnings for internal bugs in map_create. * Check log_true_size in test cases. * Address comment from Alexei: * Change kvzalloc/kvfree to kzalloc/kfree. * Address comments from Andrii: * Move BPF_COMMON_ATTRS to 'enum bpf_cmd' alongside brief comment. * Add bpf_check_uarg_tail_zero() for extra checks. * Rename sys_bpf_extended to sys_bpf_ext. * Rename sys_bpf_fd_extended to sys_bpf_ext_fd. * Probe the new feature using NULL and -EFAULT. * Move probe_sys_bpf_ext to libbpf_internal.h and drop LIBBPF_API. * Return -EUSERS when log attrs are conflict between bpf_attr and bpf_common_attr. * Avoid touching bpf_vlog_init(). * Update the reason messages in map_create. * Finalize the log using __cleanup(). * Report log size to users. * Change type of log_buf from '__u64' to 'const char *' and cast type using ptr_to_u64() in bpf_map_create(). * Do not return -EOPNOTSUPP when kernel doesn't support this feature in bpf_map_create(). * Add log_level support for map creation for consistency. * Address comment from Eduard: * Use common_attrs->log_level instead of BPF_LOG_FIXED. * v2: https://lore.kernel.org/bpf/20250911163328.93490-1-leon.hwang@linux.dev/ RFC v1 -> RFC v2: * Fix build error reported by test bot. * Address comments from Alexei: * Drop new uapi for freplace. * Add common attributes support for prog_load and btf_load. * Add common attributes support for map_create. * v1: https://lore.kernel.org/bpf/20250728142346.95681-1-leon.hwang@linux.dev/ ==================== Link: https://patch.msgid.link/20260512153157.28382-1-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
teawater
pushed a commit
that referenced
this pull request
May 21, 2026
Leon Hwang says: ==================== bpf: Follow-up fixes for BPF syscall common attributes Address sashiko reviews for BPF syscall common attributes series. 1. The tailing padding bytes in struct bpf_common_attr should be checked. [1] 2. There was a concurrent regression in syscall.c::map_create(). [2] 3. OPTS_VALID() was missing to validate the nested struct bpf_log_opts in libbpf. [3] 4. The token_fd should be -1 to avoid a valid token fd in test. [4] A test is added to verify the fix #1. The fix #2 is hard to be verified by test. So, I decide not to add a test for it to avoid over-engineering. Decide not to add a test for fix #3 to avoid over-engineering, as the fix looks really simple. Links: [1] https://lore.kernel.org/bpf/20260513224823.6494FC19425@smtp.kernel.org/ [2] https://lore.kernel.org/bpf/20260512233658.CEED7C2BCB0@smtp.kernel.org/ [3] https://lore.kernel.org/bpf/20260512235629.C5CABC2BCB0@smtp.kernel.org/ [4] https://lore.kernel.org/bpf/20260513003358.55836C2BCB0@smtp.kernel.org/ ==================== Link: https://patch.msgid.link/20260518145446.6794-1-leon.hwang@linux.dev Signed-off-by: Alexei Starovoitov <ast@kernel.org>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The per-object memcg charging loop in
__memcg_slab_post_alloc_hook()re-acquiredstock_lockonce per object viamod_objcg_state(), and charged all objects upfront requiring individual uncharg calls whenalloc_slab_obj_exts()failed.Changes
Per-pgdat run batching: scan-ahead groups consecutive objects sharing a pgdat node into a run;
obj_cgroup_charge()andmod_objcg_state()are called once per run instead of once per object — reducinglocal_lock_irqsave/irqrestorefrom N to ~1 for typical same-node bulk allocationsCharge-on-assign: upfront bulk charge (
size * obj_full_size(s)) replaced with per-run charging; objects that failalloc_slab_obj_exts()are skipped entirely, eliminating the per-failureobj_cgroup_uncharge()callHoist loop-invariant
max_run:(INT_MAX - PAGE_SIZE) / obj_sizecomputed once before the loop; also caps run length to keepbatch_byteswithinintrange formod_objcg_state()Reuse first object's
slabpointer:slabfrom the outer loop head is used directly for the first object'sobj_extassignment, saving avirt_to_slab()call per runrun_endabsolute indexing: replacesrun_len+ relativej+skip_nextflag;while (++i < run_end)advancesito the correct position automatically — failedalloc_slab_obj_exts()in scan-ahead simply breaks, and the next outer iteration retries naturallyOriginal prompt
Problem
Commit 7ef269f ("mm/memcontrol: batch memcg charging in __memcg_slab_post_alloc_hook") introduced batch memcg charging but the implementation has several performance issues that cause overhead even when the optimization helps:
Issues in the current code (mm/memcontrol.c, function
__memcg_slab_post_alloc_hook, around line 3280-3364):1.
max_sizerecomputed every outer iteration (line 3307-3308)This value is constant across all iterations —
obj_sizeandsizedon't change. It should be hoisted before the loop.2.
skip_nextflag adds unnecessary complexity (lines 3289, 3318-3319, 3360-3363)When
alloc_slab_obj_exts()fails for an object in the scan-ahead loop, the code setsskip_next = true, breaks, then at the end does:This is unnecessary. Without
skip_next, settingi += run_lenmakes the outer loop land on the failed object. The next iteration callsvirt_to_slab()+alloc_slab_obj_exts()which fails again →i++; continue;— same net result, and this is the rare error path so the one extra check is negligible.3. Redundant
virt_to_slab()callsEach object in a run gets
virt_to_slab()called twice:struct slab *slab_j = virt_to_slab(p[j]);slab = virt_to_slab(p[i + j]);For the first object of each run, it's called three times total (line 3291 + scan-ahead for next run's check + assignment).
The first object's slab pointer (
slabfrom line 3291) is already available but thrown away in the assignment loop which re-computes it atj=0.4. Confusing index arithmetic
The scan-ahead loop uses absolute index
jfromi+1, but the assignment loop uses relativejfrom 0 withp[i + j]. This is error-prone and harder to read.5.
run_leninitialized to 0 then immediately set to 1 (lines 3286, 3300)Minor but sloppy:
size_t run_len = 0;followed byrun_len = 1;a few lines later.Proposed Fix
Optimize the loop in
__memcg_slab_post_alloc_hook()with these changes:Hoist
max_runcomputation before the loop — it's a loop-invariant constant.Remove the
skip_nextflag entirely — simplify the control flow. Whenalloc_slab_obj_exts()fails in scan-ahead, justbreak. The next outer iteration handles it naturally.Avoid redundant
virt_to_slab()for the first object — handle it separately using theslabpointer already computed at the top of the outer loop, then loop over remaining objects.Use
run_endabsolute index — replacerun_len+jrelative indexing withrun_endabsolute index. The assignment loop usesdo { ... } while (++i < run_end)which advancesidirectly, eliminating separate index tracking.Here is the optimized replacement for the loop body in
__memcg_slab_post_alloc_hook()(mm/memcontrol.c). Replace lines 3280-3364 (thefor (i = 0; i < size; )loop and its body) with: