201 changes: 201 additions & 0 deletions pocs/linux/kernelctf/CVE-2026-23351_cos/docs/exploit.md
@@ -0,0 +1,201 @@
# Exploit

The pipapo GC UAF is triggered twice: first on a data map, to leak a heap pointer via `msg_msg` reclaim, then on a verdict map, to hijack `nft_do_chain` into a fake chain. The fake chain abuses `nft_immediate_eval` with an out-of-bounds `dreg` to write a ROP chain over the saved registers and return address on the kernel stack, skipping the canary. The chain calls `commit_creds(init_cred)` and returns to userspace as root via `swapgs; iretq`.

~20% end-to-end success per boot (dominated by the ~25% chance of a favourable `page_offset_base` alignment); root is typically obtained within 4-8 reboots.

- Vulnerable object: `struct nft_pipapo_elem` (`kmalloc-cg-512`)
- Attacking objects: `struct msg_msg` (heap leak + fake chain), skb data (verdict control)
- Primitive: UAF read for heap infoleak, UAF verdict redirect for code execution

# Setup

## Environment

`unshare(CLONE_NEWUSER | CLONE_NEWNET)` creates a user namespace with `CAP_NET_ADMIN`, required for nftables operations (`nftnl_batch_begin`, `nfnl_send`). The loopback interface is brought up inside the network namespace for IPv6 UDP packet delivery (`::1`).

## Threading and CPU pinning

Five threads, pinned to two CPUs:

- **CPU 0**: main thread (nftables setup + spray), commit thread (triggers `nft_trans_gc_catchall_sync` by toggling dormant flag), spray thread (`msg_msg` / skb allocation)
- **CPU 1**: flood thread (sends IPv6 UDP packets to `::1` triggering `nft_pipapo_lookup` under `rcu_read_lock`), GP-forcer thread (cycles `rcu_read_lock`/`rcu_read_unlock` to advance RCU grace periods)

Two CPUs are needed because the commit thread on CPU 0 must be inside `vmalloc`'s `cond_resched()` while the flood thread on CPU 1 cycles through RCU read-side critical sections. When all CPUs have reported quiescent states, the RCU grace period completes and `call_rcu` callbacks fire, freeing the element while lookups still reference it.

## Nftables configuration

Two pipapo sets in table "t", chain "c" (INET family, `NF_INET_LOCAL_OUT` hook), both with concatenated 2×16-byte IPv6 keys (src+dst):

- Set "s": data map (`NFT_SET_MAP`). The rule copies the lookup result into `meta mark`, which the flood thread reads back from the received packet via `getsockopt`. Used for the heap leak in Phase 1.
- Set "v": verdict map (`NFT_SET_MAP | NFT_SET_VMAP`). The lookup result is a verdict: `NFT_GOTO` redirects `nft_do_chain` to whatever chain the reclaimed element specifies. Used for code execution in Phase 4.

66,000 filler elements per set force `pipapo_clone` to exceed `KMALLOC_MAX_SIZE` (4 MB on `x86_64` with `MAX_ORDER=10`):

- 2 fields × 128 bits with `bb=4` grouping → 32 groups of 16 buckets per field
- `lt_size` per field = 4096 × ceil(66000/64) ≈ 4.03 MB (where 4096 = 32 groups × 16 buckets × 8 bytes/long)
- `kvzalloc(4.03 MB)` exceeds the 4 MB `KMALLOC_MAX_SIZE` limit, falling back to `vmalloc`
- `vmalloc` → `__vmalloc_area_node` → `cond_resched()`, reporting an RCU quiescent state for the current CPU
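
The sizing argument above can be checked with a few lines of arithmetic. This is a minimal sketch that follows the formula in the list; the function and constant names are illustrative rather than kernel identifiers, and it assumes 64-bit longs.

```python
import math

BITS_PER_LONG = 64
BYTES_PER_LONG = 8
KMALLOC_MAX_SIZE = 4 * 1024 * 1024  # 4 MiB, the limit quoted in the text


def lookup_table_bytes(field_bits: int, group_bits: int, n_rules: int) -> int:
    """Size of one field's lookup table, per the formula in the list above."""
    groups = field_bits // group_bits                      # 128 / 4 = 32 groups
    buckets = 1 << group_bits                              # 2^4 = 16 buckets per group
    longs_per_bucket = math.ceil(n_rules / BITS_PER_LONG)  # one bit per rule
    return groups * buckets * longs_per_bucket * BYTES_PER_LONG


size = lookup_table_bytes(field_bits=128, group_bits=4, n_rules=66000)
print(size, size > KMALLOC_MAX_SIZE)
```

Under this formula, 65,536 rules would land exactly on 4 MiB, so 66,000 fillers leave a small margin above the limit.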

# Triggering the vulnerability

The bug is in `nft_pipapo_commit()` (`net/netfilter/nft_set_pipapo.c`):

```c
static void nft_pipapo_commit(struct nft_set *set)
{
    struct nft_pipapo *priv = nft_set_priv(set);
    struct nft_pipapo_match *new_match, *old;

    pipapo_gc(set, priv->clone);                 // [1]
    new_match = pipapo_clone(priv->clone);       // [2]
    rcu_assign_pointer(priv->match, new_match);  // [3]
}
```

At [1], `pipapo_gc` walks the clone's element lists, calls `nft_pipapo_gc_deactivate` on each expired element to mark it dead, then queues the batch via `nft_trans_gc_queue_sync` for deferred freeing through `kfree_rcu`. At [2], `pipapo_clone` allocates new lookup tables. When the total size exceeds `KMALLOC_MAX_SIZE`, `kvzalloc` falls through to `vmalloc`, which calls `cond_resched()` inside `__vmalloc_area_node`. This reports a quiescent state for the current CPU. If all CPUs have gone through a quiescent state (the flood thread on CPU 1 cycles `rcu_read_lock`/`rcu_read_unlock` continuously), the RCU grace period completes and the `kfree_rcu` callbacks from [1] fire, freeing the elements.

But `rcu_assign_pointer` at [3] hasn't executed yet. Other CPUs still do packet lookups on the OLD match under `rcu_read_lock`, traversing `pipapo_lookup` which indexes into the now-freed element memory.

The target element has a 2-second timeout. After it expires, the commit thread toggles the chain's dormant flag to trigger `nft_pipapo_commit`. The vmalloc allocation takes ~10-30ms, enough for the RCU GP to complete.

# Phase 1: Heap leak

A target element is added to set "s" with a 2-second timeout and 200 bytes of `NFT_SET_EXT_USERDATA`, putting it in `kmalloc-cg-512` (305 bytes total: base `nft_pipapo_elem` + `NFT_SET_EXT_TIMEOUT` + `NFT_SET_EXT_EXPIRATION` + `NFT_SET_EXT_DATA` + userdata padding).

After expiry, the commit triggers the UAF.

## Heap spray and reclaim

`msg_msg` objects are sprayed to reclaim the freed element's slot:
- `msgsnd()` with 464-byte body → `msg_msg` header (48 bytes) + body = 512 bytes → `kmalloc-cg-512`
- 256 message queues × 4 messages each = 1024 spray objects

When a `msg_msg` lands on the freed element, the pipapo lookup reads the element's extension data at `ext->offset[NFT_SET_EXT_DATA]`. This offset field now overlaps byte 3 of `msg_msg.m_list.next` (a kernel heap pointer). If the `page_offset_base` alignment gives N=0 (probability ~25%; see the Stability section), that byte is < 0x10, so `offset[DATA]` points into the `msg_msg` header region and the lookup reads the 8 bytes of `m_list.next` as the "data map result."

The flood thread on CPU 1 sends IPv6 UDP packets to `::1`. Each packet traverses the `LOCAL_OUT` chain, triggering a pipapo lookup on set "s". The lookup result is copied into the packet's `meta mark`/payload fields by the nft rule, and the flood thread reads it back from the received packet.

## LIFO drain+refill for stable placement

After the leak, the `msg_msg` at the leaked address `A` must be replaced with a fake chain. The exploit processes queues one at a time: drain all 4 messages from queue `i` with `msgrcv()`, then immediately refill queue `i` with 4 new messages containing the fake chain payload before moving to queue `i+1`.

This per-queue interleaving is critical. SLUB's per-cpu freelist is LIFO: freeing 4 objects and immediately reallocating 4 pushes/pops the same slots in reverse order. Batch-draining all 256 queues before any refill would scatter freed slots across the freelist, with intervening kernel allocations stealing them. The interleaved approach keeps the working set small (4 objects) and maximises the probability that the refill lands on the exact same slots, placing the fake chain at the leaked address `A`.

# Phase 2: KASLR bypass

EntryBleed / prefetch side-channel (CVE-2022-4543):

`prefetchnta` on a kernel virtual address behaves differently depending on whether the address is backed by a TLB entry. Mapped `.text` pages show ~131-134 cycle latency; unmapped addresses show ~201+ cycles. The exploit scans 2 MB-aligned addresses across the kernel ASLR range (`0xffffffff80000000` to `0xffffffffc0000000`) and measures `rdtscp`-bracketed `prefetchnta` latency, taking the minimum across 64 rounds per address to filter noise.

The scan looks for the first pair of consecutive 2 MB-aligned addresses where both show "mapped" latency. This identifies `_stext` (kernel `.text` is >2 MB). A single scan occasionally produces false positives from speculative TLB fills, so the exploit runs 7 independent scans and applies Boyer-Moore majority vote to select the consensus `_stext`.
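
For reference, the Boyer-Moore majority vote mentioned above can be sketched as follows. This is a generic illustration of the algorithm, not the exploit's code: the function name is hypothetical, the sample values are placeholders, and the confirming second pass is an addition here (the text does not say whether the exploit performs one).

```python
from typing import Hashable, Optional, Sequence


def majority_vote(candidates: Sequence[Hashable]) -> Optional[Hashable]:
    """Return the value that appears in more than half of candidates, else None."""
    winner, count = None, 0
    # Pass 1: pair off differing values; a true majority always survives.
    for c in candidates:
        if count == 0:
            winner, count = c, 1
        elif c == winner:
            count += 1
        else:
            count -= 1
    # Pass 2: confirm the survivor really is a strict majority.
    occurrences = sum(1 for c in candidates if c == winner)
    if occurrences * 2 > len(candidates):
        return winner
    return None


# Seven placeholder scan results; four of them agree.
scans = [0x1000, 0x2000, 0x1000, 0x1000, 0x3000, 0x1000, 0x2000]
consensus = majority_vote(scans)
```

With seven trials, a value needs at least four votes to win; if no value reaches that, the sketch returns `None` instead of guessing.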

Returns `_stext`. All ROP gadgets and kernel symbols are computed as fixed offsets from `_stext`.

# Phase 3: Place fake chain

With heap leak address `A` from Phase 1, the fake chain is placed at `A+48` (inside `msg_msg.mtext`). The +48 offset skips the `msg_msg` header (48 bytes: `m_list` 16 + `m_type` 8 + `m_ts` 8 + `next` 8 + `security` 8), landing at the start of the user-controlled message body. All pointers within the fake chain reference `A+48+N` offsets, making the entire structure self-contained within a single `msg_msg` object.

Layout of the fake chain at `A+48`:

```
Offset     Content                                     Purpose
------     -------                                     -------
[0-7]      blob_gen_0 = A+48+16                        nft_chain.blob_gen_0 → rule blob
[8-15]     blob_gen_1 = A+48+16                        nft_chain.blob_gen_1 → rule blob
[16-23]    nft_rule_blob: { size=256, pad=0 }          rule blob header
[24-31]    rule_dp: { is_last=0, dlen=64 }             2 expressions × 32 bytes each
[32-39]    expr 1 ops = A+48+300                       → nft_immediate_eval
[40-55]    expr 1 priv.data = {0xFFFFFFFF, 0, 0, 0}    NFT_CONTINUE at data[0]
[56-63]    expr 1 priv: dreg=0, dlen=4, padding
[64-71]    expr 2 ops = A+48+300                       → nft_immediate_eval
[72-87]    expr 2 priv.data = {0, 0}                   136-byte copy source starts here (rbx=0, rbp=0)
[88-95]    expr 2 priv: dreg=54, dlen=136, padding     dreg/dlen land in saved r12 (harmless)
[96-207]   expr 2 copy payload (bytes 24-135)          saved r13-r15, return addr, iret frame
[208-299]  (padding)
[300-319]  fake nft_expr_ops: eval=nft_immediate_eval  embedded ops struct
```

The fake `nft_expr_ops` at `mtext[300]` sets `eval` to `nft_immediate_eval` (`_stext + 0x12323d0`), `size = 32`, everything else zeroed. This avoids needing the address of the real `nft_imm_ops` in kernel `.data`, so the fake ops struct is self-contained in the `msg_msg` payload.

Both expressions point to this same fake ops struct. Expression 1 writes `NFT_CONTINUE` (0xFFFFFFFF) to `regs->data[0]` (the verdict register), clearing the stale `NFT_GOTO` so the eval loop proceeds to expression 2. Expression 2 calls `nft_immediate_eval` with `dreg=54` and `dlen=136`, which is the novel OOB write technique (see `novel-techniques.md`).

# Phase 4: Verdict map UAF + code execution

A target element is added to verdict map set "v" (2-second timeout, `kmalloc-cg-512`). After expiry, a commit triggers the UAF on the verdict map.

## Verdict reclaim via `AF_UNIX` skb

`msg_msg` is not used for Phase 4 because the verdict map lookup reads different extension offsets than the data map, and the `msg_msg` header layout doesn't align well with the required `NFT_SET_EXT_DATA` offset for a verdict. Instead, `AF_UNIX` `SOCK_DGRAM` provides fine-grained control over the reclaimed object's contents.

`socketpair(AF_UNIX, SOCK_DGRAM)` + `write(sv[0], payload, 192)` creates an skb with `skb->head` allocated from `kmalloc-cg-512` (192 bytes data + 320 bytes `skb_shared_info` = 512 bytes). The full 192-byte payload is attacker-controlled, starting at `skb->data`.

The payload is crafted so the reclaimed element appears valid to `pipapo_lookup`:

- `ext->genmask = 0` (passes generation check in `nft_pipapo_lookup`)
- `ext->offset[NFT_SET_EXT_DATA] = 16` (points into controlled region within the skb data)
- `ext->offset[NFT_SET_EXT_EXPIRATION] = 160` (points far enough into the buffer that the expiration timestamp reads as a large future value, passing the timeout check)
- At payload offset 16: `verdict.code = NFT_GOTO`, `verdict.chain = A + 48` (the fake chain address from Phase 1)

When the flood thread's packet triggers a lookup on set "v", the verdict resolves to `NFT_GOTO` with the fake chain pointer. `nft_do_chain` follows the GOTO into the fake chain at `A+48`, entering the expression eval loop with attacker-controlled expressions.

## Stack layout proof

`nft_do_chain` prologue on COS-121 (6.6.122, objdump of vmlinux):

```
nft_do_chain:
    push %r15
    push %r14
    push %r13
    push %r12
    push %rbp
    push %rbx
    sub  $0xf8, %rsp         // [1] frame size = 0xf8
    ...
    lea  0x20(%rsp), %r12    // [2] regs = rsp + 0x20
```

From [1] and [2]: `regs` is at `rsp+0x20`. The stack canary is at `rsp+0xf0` (stored from `%gs:0x28` after the `sub`). Saved `rbx` starts at `rsp+0xf8`.

`dreg=54` → byte offset `54 × 4 = 0xd8` from `regs` → absolute position `rsp + 0x20 + 0xd8 = rsp + 0xf8`, exactly the first saved callee register, 8 bytes PAST the canary.
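
The offset arithmetic can be double-checked mechanically. The sketch below only restates the numbers from the prologue listing above (`regs` at `rsp+0x20`, canary at `rsp+0xf0`, first saved register at `rsp+0xf8`); the helper names are illustrative.

```python
REGS_OFFSET = 0x20        # regs = rsp + 0x20, from [2]
CANARY_OFFSET = 0xF0      # canary slot, 8 bytes wide
SAVED_REGS_OFFSET = 0xF8  # saved rbx, from [1]
REG32_SIZE = 4            # regs->data[] is an array of u32


def dreg_to_stack_offset(dreg: int) -> int:
    """Byte offset from rsp addressed by regs->data[dreg]."""
    return REGS_OFFSET + dreg * REG32_SIZE


def stack_offset_to_dreg(offset: int) -> int:
    """Index into regs->data[] that addresses rsp + offset."""
    assert (offset - REGS_OFFSET) % REG32_SIZE == 0, "offset must be u32-aligned"
    return (offset - REGS_OFFSET) // REG32_SIZE


# Indices that overlap the 8-byte canary slot.
canary_indices = [stack_offset_to_dreg(CANARY_OFFSET + i) for i in (0, 4)]
# First and one-past-last byte touched by dreg=54, dlen=136.
write_start = dreg_to_stack_offset(54)
write_end = write_start + 136
```

Here `write_start` comes out at `0xf8` and `write_end` at `0x180`, so the 136-byte copy spans seventeen 8-byte stack slots, and the canary occupies indices 52 and 53, immediately below the first index written.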

## ROP chain

| Stack position | Value | Symbol / Gadget |
|---------------|-------|-----------------|
| `rsp+0xf8` | 0 | saved rbx |
| `rsp+0x100` | 0 | saved rbp |
| `rsp+0x108` | (dreg/dlen) | saved r12 (harmless) |
| `rsp+0x110` | 1 | saved r13 |
| `rsp+0x118` | 0 | saved r14 |
| `rsp+0x120` | 0 | saved r15 |
| `rsp+0x128` | `_stext+0x160db4` | `pop rdi; ret` (`native_write_cr4+0x34`) |
| `rsp+0x130` | `_stext+0x2e72f20` | `&init_cred` |
| `rsp+0x138` | `_stext+0x1ffbe0` | `commit_creds` |
| `rsp+0x140` | `_stext+0x1601949` | `swapgs_restore_regs_and_return_to_usermode+0x99` (pop rax; pop rdi; swapgs; KPTI CR3 restore; iretq) |
| `rsp+0x148` | 0 | rax pad |
| `rsp+0x150` | 0 | rdi pad |
| `rsp+0x158` | `user_rip` | shell function address |
| `rsp+0x160` | 0x33 | user CS |
| `rsp+0x168` | `user_rflags` | saved RFLAGS |
| `rsp+0x170` | `user_rsp` | user stack pointer |
| `rsp+0x178` | 0x2b | user SS |

`nft_do_chain` returns through its epilogue (`pop rbx; pop rbp; pop r12; pop r13; pop r14; pop r15; ret`), popping the overwritten values, and its `ret` lands on `pop rdi; ret`. The chain runs `commit_creds(init_cred)`, then `swapgs_restore_regs_and_return_to_usermode+0x99` pops the two pad values at `rsp+0x148` and `rsp+0x150`, performs `swapgs`, traverses the KPTI return trampoline, and `iretq` returns to userspace using the iret frame that starts at `rsp+0x158`. The userspace shell function then calls `execl("/bin/sh")`, giving a root shell.

# Stability

| Component | Success rate | Notes |
|-----------|-------------|-------|
| `page_offset_base` alignment | ~25% per boot | N=0 mod 4 required (see below) |
| Prefetch KASLR | ~95% per trial | 7-trial Boyer-Moore majority vote |
| Heap reclaim (`msg_msg`) | >90% | Interleaved per-queue LIFO drain+refill |
| Verdict reclaim (skb) | >95% | `AF_UNIX` skb same cache, allocated on-demand |
| End-to-end per boot | ~20% | Dominated by alignment requirement |

The `page_offset_base` constraint: the heap leak reads byte 3 of `msg_msg.m_list.next` as `ext->offset[NFT_SET_EXT_DATA]`. The physical-to-virtual mapping randomises the upper bits of heap pointers. When `(page_offset_base & 0xFF000000) == 0`, byte 3 is < 0x10, producing a valid `offset[DATA]` that points into the `msg_msg` header, where the full 8-byte pointer is readable. Other alignments produce `offset[DATA]` ≥ 0x10, pointing outside the object. The exploit detects this (no valid leak after 100 packets) and exits with code 2, signalling the wrapper script to reboot and retry.
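
A small worked example may make the byte-3 condition concrete. The pointer values below are invented for illustration, and the helper names are hypothetical; the sketch only shows how byte 3 (bits 24-31) of a little-endian 64-bit pointer is extracted and compared against the 0x10 threshold from the paragraph above.

```python
def byte3(ptr: int) -> int:
    """Byte 3 (bits 24-31) of a 64-bit value, i.e. the 4th byte in memory on x86_64."""
    return (ptr >> 24) & 0xFF


def offset_usable(ptr: int) -> bool:
    """True when byte 3 would act as an offset that stays inside the header (< 0x10)."""
    return byte3(ptr) < 0x10


# Invented example pointers: same low bits, different bits 30-31.
favourable = 0xFFFF888005123400    # byte 3 = 0x05
unfavourable = 0xFFFF888045123400  # byte 3 = 0x45

for p in (favourable, unfavourable):
    print(hex(p), hex(byte3(p)), offset_usable(p))
```

Reading "N=0 mod 4" as "bits 30-31 of the base are zero" is an interpretation rather than something stated in the text, but it matches the quoted ~25% figure, since two random bits are both zero one time in four.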

With a 20% per-boot success rate, 8 independent boots give `1 - 0.8^8 ≈ 83%` cumulative probability. In practice, root is achieved within 4-8 reboots. The exploit exits with distinct codes:
- 0 = root shell obtained
- 1 = unrecoverable error (spray failure, KASLR mismatch)
- 2 = alignment check failed (retry on next boot)
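
The per-boot and cumulative figures follow from the table by simple multiplication. This is a rough sketch that treats the four stages as independent and uses the lower bounds from the table (for example 0.90 for ">90%"); the function names are illustrative.

```python
from math import prod


def end_to_end(*stage_rates: float) -> float:
    """Probability that every stage succeeds, assuming independence."""
    return prod(stage_rates)


def cumulative_success(p_per_boot: float, boots: int) -> float:
    """Probability of at least one success in `boots` independent attempts."""
    return 1.0 - (1.0 - p_per_boot) ** boots


per_boot = end_to_end(0.25, 0.95, 0.90, 0.95)  # alignment, KASLR, msg_msg, skb
for n in (4, 8):
    print(n, round(cumulative_success(0.20, n), 3))
```

The product is about 0.203, consistent with the ~20% end-to-end row, and at 20% per boot the cumulative success is roughly 59% after 4 boots and 83% after 8.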
72 changes: 72 additions & 0 deletions pocs/linux/kernelctf/CVE-2026-23351_cos/docs/novel-techniques.md
@@ -0,0 +1,72 @@
# `nft_immediate_eval` OOB dreg write

## Problem

After gaining a UAF on an nft set element and redirecting `nft_do_chain` to a fake chain on the heap, the standard next step is a stack pivot to a controlled ROP chain. On COS-121 (6.6.122), this fails:

- `.text` has no usable `xchg rsp, rXX`, `mov rsp, [rXX]`, or `push rXX; pop rsp` gadgets (checked via raw byte search for `48 94`, `48 87 xx`, `5c`, and `53 5c` across the full 19MB .text; occurrences exist but all are mid-instruction or followed by immediate faults)
- At the `expr->ops->eval` call site, RDI points to the fake expression (heap) but no other general-purpose register holds a controlled heap address; `push rdi; pop rsp` (`57 5c`) has no usable occurrence either
- `.rodata` is on a separate 2MB page, mapped NX, so no jump to data there

Without a stack pivot, the attacker has code execution via `nft_do_chain`'s expression eval loop but no way to run a ROP chain.

## Technique

`nft_immediate_eval` performs an unchecked `memcpy` from expression data into `regs->data[dreg]`:

```c
static void nft_immediate_eval(const struct nft_expr *expr,
                               struct nft_regs *regs, ...)
{
    const struct nft_immediate_expr *priv = nft_expr_priv(expr);

    nft_data_copy(&regs->data[priv->dreg], &priv->data, priv->dlen); // [1]
}
```

At [1], `dreg` (u8) is a u32 array index and `dlen` (u8) is the byte count. Neither is bounds-checked at eval time. Validation happens at rule install time in `nft_parse_register_store`, but this is bypassed because the expressions come from a fake chain on the heap, not from netlink.

In `nft_do_chain`, `regs` is a local variable on the stack at `rsp+0x20`. The stack canary is at `rsp+0xf0`, saved callee registers start at `rsp+0xf8`. A fake expression with `dreg=54` writes starting at byte offset `54 × 4 = 0xd8` from `regs`, landing at `rsp+0xf8`, 8 bytes past the canary. With `dlen=136`, this covers saved `rbx` through `r15`, the return address, and a full ROP payload + iret frame.

The fake chain needs two expressions:
1. Reset the verdict register (`regs->data[0]`) to `NFT_CONTINUE`, since the eval loop would otherwise re-enter GOTO handling from the stale verdict
2. OOB write with `dreg=54`, `dlen=136` containing the ROP chain

When `nft_do_chain` returns through its epilogue (`pop rbx; pop rbp; pop r12-r15; ret`), the overwritten return address starts the ROP chain. The canary check passes because `dreg=54` starts past it.

## Prior art

No prior kernelCTF submission uses this technique. Prior nft-based exploits either:
- Corrupt a function pointer in an nft object (e.g. `nft_expr_ops.eval`) and pivot to a heap ROP chain (requires a stack pivot gadget)
- Use page-level attacks (DirtyPagetable, `pipe_buffer` page UAF), which require different bug primitives
- Use `modprobe_path` / `core_pattern` overwrites, blocked by read-only mounts on COS

The OOB dreg write is distinct: the `expr->ops->eval` indirect call targets the real `nft_immediate_eval` function, whose own `memcpy` writes the ROP chain directly onto the kernel stack. No branch to arbitrary code, no stack pivot needed.

## Why the canary doesn't help

Stack canaries protect against linear buffer overflows that start below the canary and overwrite upwards. The OOB dreg write starts at `dreg=54`, which corresponds to `rsp+0xf8`, 8 bytes above the canary at `rsp+0xf0`. The canary bytes (register indices 52-53) are never touched, so the epilogue's canary comparison succeeds and `__stack_chk_fail` is never called.

Canaries only detect contiguous overwrites from below. A write with an attacker-controlled starting offset bypasses them.

## Generalisability

The technique applies wherever:
1. An eval/dispatch loop stores a register file on the stack
2. Expression/instruction data comes from attacker-controlled memory
3. The register index and write length are validated at load time but not at eval time

The `dreg` value is target-specific. To compute it for a given kernel:
```
dreg = (saved_regs_offset - regs_offset) / sizeof(u32)
= (canary_offset + 8 - regs_offset) / 4
```
On COS-121: `(0xf8 - 0x20) / 4 = 54`. On other kernels, check `nft_do_chain`'s `sub $N, %rsp` and `lea M(%rsp), %rREGS` in the prologue.

Beyond nftables, the same pattern could apply to any kernel interpreter that stores a scratch register file on the stack and dispatches attacker-influenced instructions, e.g. a custom VM in a kernel module or any expression evaluator where operand indices are validated at load time but not at eval time.

## Proposed mitigations

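1. **Bounds-check dreg at eval time**: `if (priv->dreg + DIV_ROUND_UP(priv->dlen, 4) > NFT_REG32_NUM) return;` in `nft_immediate_eval`, where `NFT_REG32_NUM` is the number of u32 slots in `regs->data`. Rounding `dlen` up covers lengths that are not a multiple of 4. Low overhead (a handful of instructions).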
2. **Move regs off the stack**: allocate `nft_regs` on the heap or in a per-CPU area, so OOB writes can't reach saved registers.
3. **Compiler-based stack layout randomisation**: randomise the relative position of local variables and saved registers, making `dreg` calculation target-specific and unpredictable.
4. **Validate fake chain pointers**: `nft_do_chain` could verify that `chain->blob_gen_X` points into a valid nft allocation, not arbitrary heap memory.
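
The first mitigation can be modelled in a few lines to show which writes it would reject. This is a sketch of the check's logic only, not kernel code; it assumes the register file holds 20 u32 slots (taken here as the value of `NFT_REG32_NUM`), and the function name is hypothetical.

```python
NFT_REG32_NUM = 20  # assumed number of u32 slots in regs->data
REG32_SIZE = 4


def immediate_write_in_bounds(dreg: int, dlen: int) -> bool:
    """True if a dlen-byte write starting at regs->data[dreg] stays inside the array."""
    slots = (dlen + REG32_SIZE - 1) // REG32_SIZE  # round up, like DIV_ROUND_UP
    return dreg + slots <= NFT_REG32_NUM


print(immediate_write_in_bounds(0, 4))     # verdict reset from the fake chain
print(immediate_write_in_bounds(54, 136))  # the out-of-bounds write
print(immediate_write_in_bounds(19, 5))    # 5 bytes at the last slot spills over
```

The last case shows why the length is rounded up: with truncating division, `19 + 5 // 4` equals 20 and a one-byte spill past the array would pass the check.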