
Adventures of porting MUSL to PS4

Specter

Over the last year or so, I've been working with the OpenOrbis team to develop a toolchain for building PS4 homebrew without relying on the official SDK materials, which would violate copyright law. This is not an easy task: building homebrew that will run on the system without official tools presents many challenges. Key among them were:

  • The executables use a customized ELF format that enables functionality unique to the console.
  • The libc used by Sony does not adhere to the standards that other libc implementations follow.
  • We cannot redistribute Sony libraries with the toolchain for legal reasons.

While tackling the custom ELF format was an interesting task that I may go into in the future, the process of porting MUSL also offered interesting hurdles and lessons.

Overview of libc

Many developers treat libc as a black box: they don't care how libc works, just that it does. This is perfectly fine when targeting an established platform that already has a mature libc. When porting to a new platform, however, you need to get into the nitty-gritty details, because libc essentially acts as the glue between userland and kernel. For example, you will very rarely see userland applications issue system calls directly, but this is commonplace inside libc. So if we break libc down, what are the fundamental components it provides to applications?

  1. The C Runtime (CRT) stub for initializing the environment, typically known as crt1.o.
  2. A layer to abstract common functionality (e.g. file I/O, networking, data types, memory management) away from the platform's low-level interface (system calls, sysctls/ioctls, etc).
  3. Type definitions and objects for things like stdin, stdout, stderr, and error handling (errno).

These components must be tailored to the operating system and architecture being targeted.
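To make the "glue" role concrete, here is a minimal sketch comparing a libc wrapper with a direct system call. It assumes a POSIX-like host such as Linux or FreeBSD that provides syscall() and SYS_getpid; the helper names getpid_direct and wrapper_matches_direct are made up for illustration and are not part of MUSL or the port.

```c
#define _DEFAULT_SOURCE   /* lets glibc declare syscall() in strict ISO C modes */
#include <sys/syscall.h>  /* SYS_getpid */
#include <sys/types.h>    /* pid_t */
#include <unistd.h>       /* getpid(), syscall() */

/* Hypothetical helper: ask the kernel for our PID by syscall number,
 * skipping the named libc wrapper. */
pid_t getpid_direct(void)
{
	return (pid_t)syscall(SYS_getpid);
}

/* Returns 1 when the libc wrapper and the direct syscall agree. */
int wrapper_matches_direct(void)
{
	return getpid() == getpid_direct();
}
```

Both paths end up at the same kernel entry point. The wrapper just hides the syscall number and the platform's calling convention, which is exactly the part that has to be rewritten for a new target.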

Comparing BSD libc and MUSL

For those not familiar with PS4 internals, the system is based on FreeBSD 9.0. Sony added their own system calls and modified some existing kernel code while also running a custom userland, but most of the standard syscalls are still pretty similar. Given it's a BSD system at its base, one might ask: why not use BSD libc instead of MUSL? After all, it should match what the system exposes more closely. With this in mind, I decided to give it a shot. Here are some of my observations of both BSD libc and MUSL.

BSD libc

  1. I assumed BSD libc would work out of the box with minor CRT stub changes. It built without issue, but at runtime this turned out not to be the case. Even something as simple as sprintf() failed with EINVAL, which is pretty incredible considering that's not even listed as a possible error in the man pages. Bummer.
  2. It requires a BSD system to build it, and you build it by building "BSD world", which is massive and includes more than just libc. Navigating the codebase was far more difficult, and the code was less easy to work with.
  3. Because of how much of a behemoth BSD world is, the build times were extremely long.

MUSL

  1. MUSL definitely wouldn't work out of the box without some significant changes, as it's a Linux-based libc and we're dealing with BSD. Even getting it to build required changes.
  2. Unlike BSD world, MUSL is just a libc (and a minimal one at that). Fixing things was relatively easy and the code was readable and easily understood even if you're not familiar with the codebase.
  3. MUSL build times were blazingly fast compared to BSD libc. Where BSD libc would take about 20 minutes to build, building MUSL took closer to 2 minutes.

Even though in theory porting MUSL would take more changes than building a modified BSD libc, MUSL was the better option when we considered how much easier it was to work with and how much faster it was to build.

A custom CRT stub

Most of the time in C programs, main() is not the true entry-point. While it's the entry-point as far as the application author is concerned, the real entry-point is usually _start(), unless changed through compiler flags or a pre-processor directive. The reason is that the C runtime wants to initialize its environment before running any user code. This is handled by the CRT stub, and the bulk of the bootstrapping code lives in the crt1.o object file. This file generally contains the entry-point _start(), which calls __libc_start_main() and passes it the address of the user-defined main(); __libc_start_main() executes main() once it has finished initializing the environment.
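The hand-off described above can be modelled in plain C. This is only a toy sketch of the control flow, not MUSL's actual implementation: model_start, model_libc_start_main, sample_main and run_sample are hypothetical names, and the "initialization" is reduced to a single flag.

```c
#include <stddef.h>

typedef int (*main_fn)(int argc, char **argv, char **envp);

/* Stand-in for the libc state that must exist before user code runs. */
static int runtime_ready = 0;

/* Hypothetical model of __libc_start_main(): set up, then call main(). */
int model_libc_start_main(main_fn user_main, int argc, char **argv)
{
	char **envp = argv + argc + 1;  /* environment follows argv's NULL terminator */
	runtime_ready = 1;              /* stands in for the real environment setup */
	return user_main(argc, argv, envp);  /* a real CRT would pass this to exit() */
}

/* Hypothetical model of _start(): gather the arguments and hand off to libc. */
int model_start(main_fn user_main, int argc, char **argv)
{
	return model_libc_start_main(user_main, argc, argv);
}

/* Example "user" main: returns argc only if the runtime was set up first. */
int sample_main(int argc, char **argv, char **envp)
{
	(void)argv;
	(void)envp;
	return runtime_ready ? argc : -1;
}

/* Drives the model with one argument and an empty environment. */
int run_sample(void)
{
	char *args[] = { "app", NULL, NULL };  /* argv, its NULL, then an empty envp */
	return model_start(sample_main, 1, args);
}
```

The point of the model is the ordering: user code never observes runtime_ready as 0, because nothing reaches main() except through the start-up path.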

The PS4 is a little different in this regard, because the PS4 doesn't have a conventional environment or set of arguments provided to it. As a matter of fact, the PS4 doesn't even want applications to ever return, as attempting to return from main() will crash the game. This makes sense, as games do not need or use arguments, and games should never return unless the player quits the game or application. With this in mind, I ignored __libc_start_main() and defined my own _start() which matches closer to the one produced by BSD libc.

__asm__(
".intel_syntax noprefix \n"
".global " START " \n"
START ": \n"
	"sub rsp, 0x28 \n"
	"mov rdi, r8 \n"
	"call atexit \n"
	"xor edx, edx \n"
	"mov edi, r9d \n"
	"mov rsi, r10 \n"
	"call main \n"
	"mov r11d, eax \n"
	"mov edi, r11d \n"
	"call exit \n"
);

The other significant change stems from the PS4's customized ELF loader. Dynamic linking is custom, and there are additional non-standard, Sony-defined segments linked into PS4 apps. While most of these segments are out of scope for this article, one of them is .sce_process_param. This segment holds metadata and needs to be linked into every application. The CRT stub is therefore a good place to put it, as the stub gets linked into every application built with the toolchain.

Some of this metadata information includes the version magic for "ORBIS" applications, entries for other specialized objects, the SDK version used to build the app, and other various information. Below is a snippet of this custom section; I won't paste the full thing for brevity's sake, but if you're interested, you can check this out by going to the port repo and looking at /arch/ps4/crt_arch.h.

__asm__(
".intel_syntax noprefix \n"
".align 0x8 \n"
".section \".data.sce_process_param\" \n"
"_sceProcessParam: \n"
	// size
	".quad 	0x50 \n"
	// magic "ORBI"
	".long   0x4942524F \n"
	// entry count
	".long 	0x3 \n"
	// ...
);
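As a quick sanity check on the magic value in the snippet above, the following sketch decodes a 32-bit constant byte by byte in little-endian order, which is how the .long ends up in memory on x86-64. The function names magic_to_string and is_orbis_magic are made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* Copy the four bytes of a 32-bit value, lowest byte first, into out[5]. */
void magic_to_string(uint32_t magic, char out[5])
{
	for (int i = 0; i < 4; i++)
		out[i] = (char)((magic >> (8 * i)) & 0xFF);
	out[4] = '\0';
}

/* Returns 1 if the value spells "ORBI" when stored little-endian. */
int is_orbis_magic(uint32_t magic)
{
	char text[5];
	magic_to_string(magic, text);
	return strcmp(text, "ORBI") == 0;
}
```

Passing 0x4942524F yields the bytes 0x4F, 0x52, 0x42, 0x49, which are the ASCII characters O, R, B, I. This is why the constant looks reversed when read as a hex number.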

Using BSD Syscalls

As MUSL is a Linux-based libc, it's going to invoke Linux syscalls to bridge the gap between userland and kernel. Linux and FreeBSD have different sets of syscalls. While there are a lot of similarities (for example open(), read(), write(), close()), differences start to appear when you hit less common system calls. Luckily MUSL was written with this in mind, and provides the handy ability for custom "architectures" (more like targets) to be defined, which allows you to specify (among various other things), a set of input syscall identifiers.

To define syscall numbers, MUSL reads the /bits/syscall.h.in file from the architecture directory.

#define	__NR_syscall	0
#define	__NR_exit	1
#define	__NR_fork	2
// ...

The differences in syscalls between Linux and FreeBSD vary depending on the subsystem. For some (such as file I/O), the differences are little to none. For others, the difference is merely a different name or a few arguments switched around. In some cases though, the syscall straight up doesn't exist in BSD where it does in Linux. This is the case for things like fast user-space mutexes (or "futexes" for short). While porting to the PS4, I encountered all of these types of differences.

In the case of different names, I just made alias definitions for the names MUSL expects and defined them to the BSD syscall that does the same thing. A good example of this is the set of syscalls used for signaling.

// Aliases for linux -> BSD
#define __NR_rt_sigqueueinfo    __NR_sigqueue
#define __NR_rt_sigaction       __NR_sigaction
#define __NR_rt_sigpending      __NR_sigpending
// ...

In the harder-to-resolve cases where the system call does not exist at all, the affected functionality was always exotic, so I ifdef'd out the calls to syscalls we didn't have. Any functionality that turns out to be needed can be implemented in the future. The ifdef'ing was done to avoid breaking MUSL for non-PS4 architectures. This is illustrated by the fanotify_init() syscall, which does not exist on FreeBSD.

int fanotify_init(unsigned flags, unsigned event_f_flags)
{
#ifndef PS4
	return syscall(SYS_fanotify_init, flags, event_f_flags);
#else
	return -1;
#endif
}
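The stub above returns -1 without touching errno. One possible refinement, which is an assumption on my part and not something the port is stated to do, is to also set errno to ENOSYS so callers can tell "not supported on this platform" apart from other failures. The function name fanotify_init_unsupported is hypothetical.

```c
#include <errno.h>

/* Hypothetical variant of the PS4 branch: fail the call and record why. */
int fanotify_init_unsupported(unsigned flags, unsigned event_f_flags)
{
	(void)flags;          /* unused: there is no kernel support to forward to */
	(void)event_f_flags;
	errno = ENOSYS;       /* "function not implemented" */
	return -1;
}
```

A caller that checks errno after a -1 return would then see ENOSYS rather than whatever value was left over from an earlier call.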

These are all the changes necessary to get MUSL to build, but you'll run into bugs at runtime. This is because of discrepancies between the kernel ABI of Linux and BSD.

Kernel ABI differences

The Application Binary Interface, or "ABI", is basically a specification that outlines the calling convention to be followed, binary formats, dynamic linking, and other things. The calling convention is what's most important, because we need to know what registers are preserved, what registers are trashed and are volatile (aka. "scratch") registers, and which registers are used for arguments and return values. Userland and kernel have slightly different calling conventions.

Linux and BSD both use the "System V" or "SYSV" ABI specification. We're going to focus on the calling convention for the purposes of this article. Under SYSV, the following specifications apply:

Userland

For function calls:

  • Registers RBX, RSP, RBP, R12, R13, R14, and R15 are preserved, meaning their values are saved by callee functions.
  • Registers RAX, RDI, RSI, RDX, RCX, R8, R9, R10, and R11 are volatile / scratch registers, meaning their values are not saved by callee functions.
  • For passing arguments and returning values, the following applies:
    • In order of 1st arg to 6th arg, registers RDI, RSI, RDX, RCX, R8, and R9 are used respectively. Beyond this, arguments are passed on the stack.
    • The return value is stored in RAX.

Kernel

For system calls:

  • Registers RBX, RSP, RBP, R12, R13, R14, and R15 are preserved.
  • For passing arguments, the following applies:
    • The RAX register is used to specify what system call index to invoke from the system call table.
    • In order of 1st arg to 6th arg, registers RDI, RSI, RDX, R10, R8, and R9 are used respectively.
    • The first (and usually only) return value is stored in RAX.
    • The second return value, if it exists, is stored in RDX.
  • Registers RCX and R11 are trashed: the syscall instruction itself stores the original instruction pointer in RCX and the saved RFLAGS in R11 before entering the kernel.
  • Usually, while the argument registers are volatile in userland context, their values are saved and restored after a syscall by the syscall exception handler itself.

Bug #1: When BSD/PS4 gaslights you

Notice that in the last section I said the argument register values are usually saved and restored after a syscall, with emphasis on "usually". On Linux this is the case, and on FreeBSD it used to be the case. On the PS4, and on FreeBSD versions released after the February 2019 fix discussed below, this assumption is broken.

From a system design point of view, you have to be careful with registers when switching privilege levels to and from kernel. You need to ensure you restore registers to their original values or otherwise change them before returning back to userland, ideally the former. For one, you don't want to "bait and switch" the register values out from underneath your userland caller if you can help it. Sometimes an argument register might be used to store data that's re-used later after the syscall returns, especially considering the argument registers are scratch registers, which are considered fair game by the compiler.

The second and probably more important reason, from a security point of view, is that you don't want to leak kernel register values to userland. Information disclosures like this can be powerful for exploitation; one can single-handedly destroy kernel ASLR. It seems this was a concern, but rather than restoring the registers to their pre-syscall userland values, they instead opted to clear them by XOR'ing the registers with themselves before returning to userland. This galaxy brain idea causes issues! I only discovered the issue with the R10 register, but I figured it was probably happening to more than just R10, so I checked out the Xfast_syscall exception handler for system calls in a PS4 kernel dump.

mov     rdi, qword [rsp]
mov     rsi, qword [rsp+0x8]
mov     rdx, qword [rsp+0x10]
mov     rax, qword [rsp+0x30]
mov     r11, qword [rsp+0xb8]
mov     rcx, qword [rsp+0xa8]
mov     rsp, qword [rsp+0xc0]
xor     r8, r8  {0x0}
xor     r9, r9  {0x0}
xor     r10, r10  {0x0}
swapgs  
sysret  

Not only does it clear R10, it clears R8 and R9 as well. What's even more interesting is this code was added by Sony, as this code is not present in FreeBSD 9. But this code does exist in FreeBSD 12. If we look at the git blame for BSD's /sys/amd64/amd64/exception.S which contains exception handler implementations, we'll notice the following commit:

https://github.com/freebsd/freebsd/commit/84203fed6bace55a9e7f89d83cf74bd81603e91e

amd64: clear callee-preserved registers on syscall exit.

%r8, %r10, and on non-KPTI configuration %r9 were not restored on fast
return from a syscall.

Reviewed by: markj
Approved by: so
Security: CVE-2019-5595
Sponsored by: The FreeBSD Foundation
MFC after: 0 minutes

If we check out CVE-2019-5595:

In FreeBSD before 11.2-STABLE(r343782), 11.2-RELEASE-p9, 12.0-STABLE(r343781), and 12.0-RELEASE-p3, kernel callee-save registers are not properly sanitized before return from system calls, potentially allowing some kernel data used in the system call to be exposed.

As suspected, there was a kernel information disclosure via the scratch registers. But instead of restoring them properly to their userland-saved values, the fix just zeroes them out. I have no idea how this does not break assumptions when building BSD's libc; they must have some special code that's aware of it that I couldn't find. Either that, or there are bugs created by this change that are as yet undiscovered or undisclosed.

The odd thing here is that the PS4 dump I looked at is from 5.05 firmware, which was released in January of 2018. The commit in FreeBSD mainline, however, was pushed in February of 2019, so Sony knew about this problem more than a year before the FreeBSD fix landed. Either Sony reported the issue and it took a while to land in FreeBSD, or the FreeBSD maintainers discovered it independently later on. Since the two fixes are basically identical, I'd guess Sony had a hand in the mainline FreeBSD patch.

This "fix" isn't optimal, because regular clang doesn't expect this behavior. Compilers like to be efficient, so it'll try to re-use registers where possible to avoid the performance cost of reloading a register. With this assumption broken, there are logic bugs introduced. Any time the compiler re-uses R8, R9, or R10 after invoking a syscall, it will trigger undefined behavior. In the case of pointers, this causes null pointer dereferences in code that from the source level looks perfectly valid. This issue was discovered thanks to these null pointer dereferences.

For example, printf() in MUSL calls a function called __stdout_write, which issues a TIOCGWINSZ ioctl syscall to get the window size. In the disassembly, a pointer is loaded into R10 before the syscall and used again after it returns.

This code looks OK until you consider that the syscall clobbers R10. Boom, null pointer dereference on the mov dword [r10+0x90], -1 instruction. To work around this, I modified the syscall wrappers to back up and restore these registers.

static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
	// ...

	__asm__ __volatile__
	(
		".intel_syntax\n\t"
		"push r8\n\t"
		"push r9\n\t"
		"push r10\n\t"
		"syscall\n\t"
		// ...
		"pop r10\n\t"
		"pop r9\n\t"
		"pop r8\n\t"
		: "=a"(ret)
		: "a"(n), "D"(a1), "S"(a2), "d"(a3), "r"(r10), "r"(r8), "r"(r9)
		: "rcx", "r11", "memory"
	);

	return ret;
}
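An alternative to pushing and popping, and not what the port does, is to tell the compiler that the registers are modified, so that it never assumes they survive the syscall. The sketch below marks R10, R8 and R9 as in/out operands with "+r". It is written to be testable on an x86-64 Linux host, where the kernel happens to preserve these registers, and it falls back to libc's syscall() elsewhere. The names syscall6_inout and demo_anonymous_map are hypothetical.

```c
#define _DEFAULT_SOURCE   /* for syscall() and MAP_ANONYMOUS on glibc */
#include <sys/mman.h>     /* PROT_*, MAP_*, munmap() */
#include <sys/syscall.h>  /* SYS_mmap */
#include <unistd.h>       /* syscall() */

/* Six-argument raw syscall that declares r10, r8 and r9 as read AND written. */
long syscall6_inout(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
#if defined(__x86_64__) && defined(__linux__)
	long ret;
	register long r10 __asm__("r10") = a4;  /* 4th arg goes in R10, not RCX */
	register long r8  __asm__("r8")  = a5;
	register long r9  __asm__("r9")  = a6;

	__asm__ __volatile__(
		"syscall"
		: "=a"(ret), "+r"(r10), "+r"(r8), "+r"(r9)  /* outputs: may change */
		: "a"(n), "D"(a1), "S"(a2), "d"(a3)
		: "rcx", "r11", "memory");
	return ret;
#else
	return syscall(n, a1, a2, a3, a4, a5, a6);  /* portable fallback */
#endif
}

/* Maps one anonymous page through the wrapper; returns 1 on success. */
int demo_anonymous_map(void)
{
	long addr = syscall6_inout(SYS_mmap, 0, 4096, PROT_READ | PROT_WRITE,
	                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr <= 0)
		return 0;  /* raw path returns -errno, fallback returns -1 */

	volatile char *page = (volatile char *)addr;
	page[0] = 42;
	int ok = (page[0] == 42);
	munmap((void *)addr, 4096);
	return ok;
}
```

For wrappers with fewer arguments, where R8, R9 or R10 are not inputs, the same effect would come from adding them to the clobber list. The trade-off versus push/pop is that the compiler may emit extra reloads after the call, but nothing is pushed below the stack pointer inside the asm.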

Bug #2: Syscall error handling

This bug was more my fault than anything, as I didn't realize that SYSV only defines the registers involved in a call. It does not define how syscalls should signal that an error occurred, or whether errno values are returned as positive or negative numbers. On Linux, system calls return 0 or a positive value on success (commonly a descriptor or a count) and a negative errno on failure. Expecting this, MUSL has a handler which checks whether -4095 <= RAX <= -1. If the return value is within this range, it negates it to get a positive errno, stores that in errno, and returns -1.

long __syscall_ret(unsigned long r)
{
	if (r > -4096UL) {
		errno = -r;
		return -1;
	}
	return r;
}

FreeBSD is a little different though. FreeBSD system calls return a positive errno, not a negative one. One may wonder how this works; how would you know the syscall had failed and the return value should be interpreted as an error rather than interpreted as valid data? It just so happens FreeBSD sets the carry flag or "CF" of FLAGS to 1 on error, and 0 on success.

When using the standard FreeBSD calling convention, the carry flag is cleared upon success, set upon failure.

It took me a bit of time to figure out how to do this in a way that didn't feel like a hack. I ended up doing a conditional jump based on the carry flag, and negated RAX if the flag was set. These changes were again added to the syscall wrappers.

static __inline long __syscall6(long n, long a1, long a2, long a3, long a4, long a5, long a6)
{
	// ...

	__asm__ __volatile__
	(
		".intel_syntax\n\t"
		// ...
		"syscall\n\t"
		// carry clear means success: skip the negation
		"jnc syscallexit%=\n\t"
		// carry set means failure: turn the positive errno into -errno
		"neg rax\n\t"
		"syscallexit%=:\n\t"
		// ...
		: "=a"(ret)
		: "a"(n), "D"(a1), "S"(a2), "d"(a3), "r"(r10), "r"(r8), "r"(r9)
		: "rcx", "r11", "memory"
	);

	return ret;
}
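The effect of the jnc/neg pair can be modelled in plain C. This is a sketch under the assumption that the carry flag is available as a separate value; bsd_result_to_linux, model_syscall_ret and model_bsd_syscall are hypothetical names, with model_syscall_ret repeating the logic of MUSL's __syscall_ret() shown earlier.

```c
#include <errno.h>

/* Hypothetical model: fold FreeBSD's (RAX, carry flag) pair into the single
 * Linux-style value MUSL expects, i.e. a negative errno on failure. */
long bsd_result_to_linux(unsigned long rax, int carry_set)
{
	return carry_set ? -(long)rax : (long)rax;
}

/* Same range check as MUSL's __syscall_ret(): -4095..-1 means error. */
long model_syscall_ret(unsigned long r)
{
	if (r > -4096UL) {
		errno = (int)-r;
		return -1;
	}
	return (long)r;
}

/* Full path: the kernel result is converted, then handed to the libc handler. */
long model_bsd_syscall(unsigned long rax, int carry_set)
{
	return model_syscall_ret((unsigned long)bsd_result_to_linux(rax, carry_set));
}
```

With the carry flag set and RAX holding EINVAL, the model returns -1 and leaves EINVAL in errno. With the carry flag clear, the value passes through unchanged, so a small positive result such as a file descriptor is no longer mistaken for an error code.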

Conclusion

There are a lot of low-level factors to consider when porting something like a libc, and it's easy to get frustrated and fall victim to subtle discrepancies. To test the MUSL port, I wrote a suite of unit tests some readers may find interesting, and compiled it with the OpenOrbis PS4 Toolchain against the newly-built MUSL libc static library. Below you can find a link to the MUSL PS4 port [1], the mentioned set of tests [2], the test results directly from the PS4 [3], and other references.

[1] https://github.com/OpenOrbis/musl

[2] https://github.com/OpenOrbis/OpenOrbis-PS4-Toolchain/tree/master/samples

[3] https://pastebin.com/Wzvbdk8s

[4] https://nvd.nist.gov/vuln/detail/CVE-2019-5595

[5] https://www.freebsd.org/doc/en_US.ISO8859-1/books/developers-handbook/x86-return-values.html