Conquering the memory through io_uring - Analysis of CVE-2023-2598
CVE-2023-2598 is a very powerful bug in the io_uring subsystem of the Linux kernel. The vulnerability is in the handling of fixed-buffer registration via the IORING_REGISTER_BUFFERS opcode, which lets an application ‘pin’ and register memory for long-term use, exempting it from paging. The user passes an iovec holding an address and a length, from which the kernel constructs a bio_vec (essentially an iovec for physical memory). The problem arises when the kernel tries to optimize the buffer for compound pages.
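To make the registration step concrete, here is a minimal sketch of registering a single fixed buffer using raw syscalls rather than liburing. The helper name register_one_fixed_buffer and the queue depth of 8 are illustrative choices, not taken from the original writeup, and the call may fail with ENOSYS or EPERM on systems where io_uring is unavailable or blocked.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <linux/io_uring.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/uio.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a ring, register one anonymous buffer of
 * `len` bytes as a fixed buffer, then tear everything down again.
 * Returns 0 on success or a negative errno value on failure.
 */
static int register_one_fixed_buffer(size_t len)
{
    struct io_uring_params params;
    memset(&params, 0, sizeof(params));

    /* 8 submission queue entries is an arbitrary small depth. */
    int ring_fd = (int)syscall(__NR_io_uring_setup, 8, &params);
    if (ring_fd < 0)
        return -errno;

    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) {
        int err = errno;
        close(ring_fd);
        return -err;
    }

    /* The iovec describes the user buffer: start address and length. */
    struct iovec iov = { .iov_base = buf, .iov_len = len };
    int ret = (int)syscall(__NR_io_uring_register, ring_fd,
                           IORING_REGISTER_BUFFERS, &iov, 1);
    int err = ret < 0 ? errno : 0;

    close(ring_fd);
    munmap(buf, len);
    return -err;
}
```

Calling register_one_fixed_buffer(4096) registers one page; a larger length is what sends the kernel down the multi-page path discussed in this analysis.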
Background on compound pages / folio
Typically, a page refers to the smallest block of physical memory that the kernel can map, which in most cases these days is 4KB. In Linux, though, a page can refer to a single page or to a compound page, which is a group of pages that are contiguous in physical memory. In a compound page, the first (head) page holds information about the whole group, and the tail pages point back to it. This creates a problem for any kernel function that handles pages, since it needs to know whether it has been handed a tail page of a compound page. To solve this, the folio type was created: single pages and the head pages of compound pages are wrapped in a folio to distinguish them from tail pages.
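The head/tail relationship can be illustrated with a small userspace model. This is a toy sketch, not the kernel's real struct page layout: the toy_page type and the toy_* helpers are made up for illustration, and only the idea of tagging the low bit of the back-pointer mirrors how the kernel marks tail pages.

```c
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a page descriptor; not the real kernel layout. */
struct toy_page {
    uintptr_t compound_head; /* tail pages: address of the head page, with bit 0 set */
    unsigned int order;      /* head pages only: log2 of the number of pages */
};

/* Turn pages[0 .. (1 << order) - 1] into one compound page. */
static void toy_make_compound(struct toy_page *pages, unsigned int order)
{
    pages[0].compound_head = 0;
    pages[0].order = order;
    for (size_t i = 1; i < ((size_t)1 << order); i++) {
        pages[i].compound_head = (uintptr_t)&pages[0] | 1;
        pages[i].order = 0;
    }
}

/* A page is a tail page when the low bit of its back-pointer is set. */
static int toy_is_tail(const struct toy_page *page)
{
    return (int)(page->compound_head & 1);
}

/* Resolve any page to its head page, the role a folio lookup plays. */
static struct toy_page *toy_head(struct toy_page *page)
{
    if (toy_is_tail(page))
        return (struct toy_page *)(page->compound_head - 1);
    return page;
}
```

In this model, every page of the group resolves to the same head, which is why comparing head (folio) pointers tells you that pages belong to the same group but nothing about where in the group each page sits.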
Vulnerability
When you register a buffer that spans more than one physical page, the kernel checks whether those pages are part of a single compound page by comparing their folio pointers. If they are, it reduces the number of pages to 1 and marks the buffer as a folio buffer. The bug is that this check never verifies that the pages are physically contiguous. Multiple virtual pages can map to the same physical page, so a buffer can be virtually contiguous without being physically contiguous: every page passes the folio comparison, yet the kernel treats the buffer as one physically contiguous region of the full registered length. The result is out-of-bounds access to adjacent physical pages, which is an extremely powerful primitive.
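The flawed logic can be modelled in a few lines of userspace C. This is a simplified sketch, not the kernel source: model_page, same_folio_only, and same_folio_and_consecutive are hypothetical names, and the second predicate is just one way to add the contiguity test described above as missing.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy page descriptor: physical frame number plus the owning folio's id. */
struct model_page {
    unsigned long pfn;
    unsigned long folio_id;
};

/* Models the flawed check: only asks whether all pages share one folio. */
static bool same_folio_only(const struct model_page *pages, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (pages[i].folio_id != pages[0].folio_id)
            return false;
    return true;
}

/* Stricter variant: also requires each page to follow the previous one. */
static bool same_folio_and_consecutive(const struct model_page *pages, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (pages[i].folio_id != pages[0].folio_id ||
            pages[i].pfn != pages[i - 1].pfn + 1)
            return false;
    return true;
}
```

If the same physical page is mapped four times, all four entries carry the same frame number and folio, so same_folio_only accepts the buffer while same_folio_and_consecutive rejects it. A genuine four-page compound page passes both.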
Exploitation
Exploiting this was not only fairly straightforward but also very reliable. The exploit sprays pages filled with socket objects and tags each socket by setting a crafted pacing rate. Through the fixed buffer it can then read adjacent memory to find the sprayed socket objects, defeat kernel ASLR using the various pointers they contain, and write into them to replace the operations function table and gain code execution.