226 - A Heap of Linux Bugs
One vulnerability, a use-after-free in the Linux nf_tables subsystem, exploitable on all three kernelCTF targets: the latest Long-term Stable (LTS) release, the Container-optimized build as used by Google Cloud, and a Mitigation build that isn’t as up-to-date but includes experimental mitigations to be bypassed.
The vulnerability exists in the Netfilter tables subsystem of the Linux kernel. The issue occurs during processing of an NFT_MSG_NEWRULE operation inside of a transaction/batch; as the name implies, you are adding a new rule to a set. If an error happens during this, it can fall into the err_release_rule path, which calls into nf_tables_rule_release. That makes sense from a developer’s point of view: the rule is bad, so you want to release it. However, this function calls into nft_rule_expr_deactivate, which takes a parameter for the current phase. It is hard-coded to use the NFT_TRANS_RELEASE phase, so when the function is called with that phase it ends up unbinding the nft_set object the rule was being added to. A reference to that set is still kept earlier in the chain processing the transaction, leading to the use-after-free.
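The error path described above can be modeled in a few lines. This is a toy sketch only, not kernel code: every `toy_` name and both phase constants are hypothetical, and "destroyed" is just a flag standing in for the real free so the example stays well-defined.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical phases for illustration; these are not the kernel's enum. */
enum toy_phase { TOY_PHASE_ERROR_UNWIND, TOY_PHASE_RELEASE };

struct toy_set {
    bool bound;     /* a rule is currently bound to this set */
    bool destroyed; /* stands in for the real free; we only flag it */
};

struct toy_trans {
    struct toy_set *set; /* reference still held by the pending transaction */
};

/* Loose stand-in for the deactivate step: only the release phase
 * tears the set down, any other phase just unbinds it. */
static void toy_rule_deactivate(struct toy_set *set, enum toy_phase phase)
{
    assert(set != NULL);
    set->bound = false;
    if (phase == TOY_PHASE_RELEASE)
        set->destroyed = true;
}

/* Returns true if the transaction is left pointing at a destroyed set. */
static bool toy_newrule_error_path(enum toy_phase phase)
{
    struct toy_set set = { .bound = true, .destroyed = false };
    struct toy_trans trans = { .set = &set };

    toy_rule_deactivate(&set, phase); /* error path for the bad rule */
    return trans.set->destroyed;      /* what a later step would touch */
}
```

With the release phase hard-coded, `toy_newrule_error_path(TOY_PHASE_RELEASE)` reports a dangling reference; passing a phase that only unbinds does not.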
The patch seems fairly straightforward: rather than using the
nf_tables_rule_release function, they call the two functions that function would call, and change the
phase for the call to
nft_rule_expr_deactivate to the appropriate
With this vulnerability there is the initial use-after-free, but if execution keeps going, the prematurely freed nft_set structure will be freed again after everything has been processed, creating a double-free situation. A double-free is a much friendlier primitive for exploitation, so the authors pursued that route. They did have to introduce an extra set object into the process to interleave the frees in order to bypass a naive double-free check (you can’t free the same pointer twice in a row).
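To see why interleaving a second object defeats a "same pointer twice in a row" check, here is a minimal sketch of a LIFO free list with that kind of check. The `toy_cache` allocator is an assumption made for illustration and is far simpler than the kernel's slab allocator.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX 8

struct toy_cache {
    void *freelist[TOY_MAX]; /* LIFO stack of free objects */
    size_t top;
};

/* Returns 0 on success, -1 if the naive double-free check fires. */
static int toy_free(struct toy_cache *c, void *obj)
{
    if (c->top > 0 && c->freelist[c->top - 1] == obj)
        return -1; /* same pointer as the most recent free */
    assert(c->top < TOY_MAX);
    c->freelist[c->top++] = obj;
    return 0;
}

static void *toy_alloc(struct toy_cache *c)
{
    return c->top ? c->freelist[--c->top] : NULL;
}

/* Free a, then b, then a again: the check never sees a twice in a row. */
static int toy_interleaved_double_free(void)
{
    static char a, b;
    struct toy_cache c = { .top = 0 };

    if (toy_free(&c, &a) || toy_free(&c, &b) || toy_free(&c, &a))
        return 0;

    void *first = toy_alloc(&c);  /* a */
    void *second = toy_alloc(&c); /* b */
    void *third = toy_alloc(&c);  /* a again: two owners of one object */
    return first == (void *)&a && second == (void *)&b && third == (void *)&a;
}

/* Free a twice back to back: the second free is rejected. */
static int toy_back_to_back_double_free(void)
{
    static char a;
    struct toy_cache c = { .top = 0 };

    toy_free(&c, &a);
    return toy_free(&c, &a);
}
```

In this model the interleaved sequence leaves the same object on the free list twice, so two later allocations hand out the same memory, while the back-to-back case is caught.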
I won’t be diving too far into the exploitation here because usage of the msg_msgseg structures has been well explored. It is a very powerful object that can be sprayed from userland with a high degree of control over the data. Ultimately they corrupt the pipe_buf_operations structure, which contains various function pointers that can be triggered by operations on the pipe from userland, and then go for a ROP chain to escalate privileges.
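The reason an operations structure is such an attractive target can be shown with a small model. The types and functions below are hypothetical stand-ins, not the kernel's definitions, and the pointer swap stands in for whatever memory corruption the exploit actually performs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified operations table with a single callback. */
struct toy_pipe_ops {
    int (*release)(void);
};

struct toy_pipe_buf {
    const struct toy_pipe_ops *ops; /* the pointer an attacker wants to control */
};

static int toy_real_release(void) { return 0; }
static int toy_attacker_release(void) { return 1337; }

/* Returns 1 if redirecting ops changes which function the same call runs. */
static int toy_hijack_demo(void)
{
    static const struct toy_pipe_ops real_ops = { toy_real_release };
    static const struct toy_pipe_ops fake_ops = { toy_attacker_release };
    struct toy_pipe_buf buf = { &real_ops };

    assert(buf.ops != NULL);
    int before = buf.ops->release();

    buf.ops = &fake_ops; /* stands in for the corruption */
    int after = buf.ops->release();

    return before == 0 && after == 1337;
}
```

The calling code never changes; only the table it dereferences does, which is why an ordinary operation on the object from userland is enough to trigger the attacker's chosen target.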
I will call out one thing I found kinda fun: on the LTS kernelCTF box they did a standard escalation via a commit_creds call. On the Container-optimized build, while they used some different objects for their corruption, they still corrupted an operations structure and got in position for a ROP chain. Instead of doing a commit_creds call, they called set_memory_x to set some heap memory as executable and just ran plain shellcode they wrote into the heap that did the usual escalation technique.
A very powerful bug in the io_uring driver of the Linux kernel. In this case, the vulnerability is in the handling of registering fixed buffers via the IORING_REGISTER_BUFFERS opcode, which allows an application to ‘pin’ and register memory for long-term use, making it exempt from paging. The user passes an iovec with an address and length, which the kernel then uses to construct a bio_vec (essentially an iovec but for physical memory). The problem comes in when the driver tries to optimize the buffer for compound pages.
Background on compound pages / folio
Typically a page refers to the minimum-sized block of physical memory that the kernel can map, which in most cases these days is 4KB. In Linux though, a page can refer to a single page or a compound page, which is a group of pages that are contiguous in memory. With compound pages, the first page holds information about the group of pages comprising the compound page, and the tail pages point back to the first page. This leads to a problem for any kernel function that has to handle pages, as it needs to know whether it was handed a tail page of a compound page or not. To solve this problem, the folio object was created. Single pages and the first page of compound pages are wrapped with the folio type to distinguish them from tail pages.
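The head/tail relationship can be sketched with a plain pointer. This is a simplified model under stated assumptions: `toy_page` and its fields are hypothetical, and the real kernel encodes this information differently inside its page structure.

```c
#include <assert.h>
#include <stddef.h>

struct toy_page {
    struct toy_page *head; /* NULL on a single or head page, else the head */
    unsigned int order;    /* head only: the group spans 1 << order pages */
};

/* Turn the first (1 << order) entries of pages[] into one compound page. */
static void toy_make_compound(struct toy_page *pages, unsigned int order)
{
    assert(order > 0);
    pages[0].head = NULL;
    pages[0].order = order;
    for (size_t i = 1; i < ((size_t)1 << order); i++) {
        pages[i].head = &pages[0]; /* tail pages point back to the head */
        pages[i].order = 0;
    }
}

static int toy_is_tail(const struct toy_page *p)
{
    return p->head != NULL;
}

/* The page a folio-style API would hand out for any page in the group. */
static struct toy_page *toy_head_of(struct toy_page *p)
{
    return p->head ? p->head : p;
}

static int toy_compound_demo(void)
{
    struct toy_page pages[4];

    toy_make_compound(pages, 2); /* 4 pages, order 2 */
    return !toy_is_tail(&pages[0]) &&
           toy_is_tail(&pages[3]) &&
           toy_head_of(&pages[3]) == &pages[0];
}
```

Code that only ever receives the result of `toy_head_of` never has to ask whether it was handed a tail page, which is the problem the folio type is described as solving.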
When registering buffers, if you try to register a buffer that’s larger than a physical page, the kernel checks whether those pages are part of a compound page by comparing their folio pointers, and if they are, it reduces the number of pages to 1 and marks it as a folio buffer. The bug is that this check doesn’t make sure the pages are physically contiguous. You can have multiple virtual pages map to the same physical page, which leads to a situation where the buffer is virtually contiguous but not physically contiguous. Ultimately this gives you an out-of-bounds access on adjacent physical pages, which is an insanely powerful primitive.
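The missing check is easy to see in a toy model. The sketch below assumes a hypothetical `toy_page` with a folio identifier and compares a folio-only check against one that also requires each page to directly follow the previous one; it illustrates the idea and does not reproduce the kernel's code or its actual patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_page {
    int folio_id; /* which folio this physical page belongs to */
};

/* Folio-only check: coalesce if every pinned page shares one folio. */
static bool toy_can_coalesce_buggy(struct toy_page **pages, size_t n)
{
    assert(n > 0);
    for (size_t i = 1; i < n; i++)
        if (pages[i]->folio_id != pages[0]->folio_id)
            return false;
    return true;
}

/* Same check, plus: each page must be the physical successor of the last. */
static bool toy_can_coalesce_checked(struct toy_page **pages, size_t n)
{
    assert(n > 0);
    for (size_t i = 1; i < n; i++)
        if (pages[i]->folio_id != pages[0]->folio_id ||
            pages[i] != pages[i - 1] + 1)
            return false;
    return true;
}

/* Four adjacent physical pages, each in its own folio. */
static struct toy_page toy_phys[4] = { {1}, {2}, {3}, {4} };

/* Three virtual pages that all map to physical page 0. */
static bool toy_alias_demo(bool checked)
{
    struct toy_page *pinned[3] = { &toy_phys[0], &toy_phys[0], &toy_phys[0] };

    return checked ? toy_can_coalesce_checked(pinned, 3)
                   : toy_can_coalesce_buggy(pinned, 3);
}
```

In the folio-only version the aliased buffer is accepted and would be treated as three pages' worth of memory starting at physical page 0, so accesses past the first page land on pages 1 and 2, which the caller never owned.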
Exploiting this was not only fairly straightforward, but super reliable. By spraying pages filled with socket objects and tagging the sockets by setting crafted pacing rates, they can use the fixed buffer access to read and find their sprayed socket objects, defeat kernel ASLR via the various pointers, and also write into and replace the operations function table to gain code execution.
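Locating a tagged object through an out-of-bounds read amounts to scanning the leaked window for a known value. A minimal sketch, assuming a hypothetical 64-bit marker and 8-byte-aligned fields (the real object layout and tag values are not given here):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Scan a leaked memory window for a 64-bit tag at 8-byte alignment.
 * Returns the byte offset of the first match, or -1 if none is found. */
static long toy_find_tag(const unsigned char *buf, size_t len, uint64_t tag)
{
    assert(buf != NULL);
    for (size_t off = 0; off + sizeof(tag) <= len; off += sizeof(tag)) {
        uint64_t v;
        memcpy(&v, buf + off, sizeof(v)); /* avoids unaligned access */
        if (v == tag)
            return (long)off;
    }
    return -1;
}

static long toy_find_tag_demo(void)
{
    unsigned char window[64] = {0};
    uint64_t tag = 0x4142434445464748ULL; /* hypothetical marker value */

    memcpy(window + 24, &tag, sizeof(tag)); /* pretend a sprayed object sits here */
    return toy_find_tag(window, sizeof(window), tag);
}
```

Once the offset of the tag is known, neighbouring fields of the same object can be read at fixed distances from it, which is how pointers for defeating ASLR would be recovered in a scheme like the one described.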