If you run the code below, you'll see that execve appears to return a process ID and the parent branch never executes. I looked for documentation but either couldn't find it or couldn't understand it. The clone man page describes CLONE_VFORK as quoted below, yet the parent never seems to run. If you uncomment the non-syscall vfork line, or use the fork syscall instead, it works as expected.
the execution of the calling process is suspended until the child releases its virtual memory resources via a call to execve(2) or _exit(2) (as with vfork(2)).
#include <unistd.h>
#include <syscall.h>
#include <stdio.h>
#include <string.h>
int main(int argc, char *argv[])
{
//int a = vfork();
//int a = syscall(__NR_fork);
int a = syscall(__NR_vfork);
if (a) {
write(2, "parent\n", 7);
} else {
char*args[] = {"/usr/bin/true", (char*)0};
int res = execve(args[0], args, &argv[2]);
char buf[256];
sprintf(buf, "child got %d\n", res);
write(2, buf, strlen(buf));
}
write(2, "Done\nChild\n", a?5:11);
}
CodePudding user response:
There are multiple instances of undefined behavior in the code.
You are invoking undefined behavior by making calls such as sprintf() and write() after execve() fails. Per POSIX:
... the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit() or one of the exec family of functions.
Even simply returning from main() after vfork() invokes undefined behavior.
@Barmar summed it up best: "you should just not use vfork() at all"
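One common way to follow that advice is posix_spawn(), which creates the child and runs the new program in a single call. The following is only a minimal sketch: the helper name spawn_and_wait is hypothetical, it inherits the caller's environment through environ, and it treats any abnormal termination as -1.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>

extern char **environ;  /* the caller's environment, passed through unchanged */

/* Hypothetical helper: run `path` with `argv` and return its exit status,
 * or -1 if the spawn or the wait fails or the child dies abnormally. */
int spawn_and_wait(const char *path, char *const argv[])
{
    pid_t pid;
    /* posix_spawn returns an error number directly; it does not set errno. */
    int err = posix_spawn(&pid, path, NULL, NULL, argv, environ);
    if (err != 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because no user code runs in the child between creation and exec, none of the vfork() restrictions quoted above come into play.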
This code also invokes undefined behavior:
char*args[] = {"/usr/bin/true", (char*)0};
int res = execve(args[0], args, &argv[2]);
When the program is run without arguments, argv[2] doesn't exist, so passing its address to execve() invokes undefined behavior. Note that taking the address of argv[2] does not in itself invoke undefined behavior: an address one past the actual end of an array does exist. But it can't be safely dereferenced, which execve() will do.
execve() expects a pointer to an array of environment pointers as its third argument:
Using execve()
The following example passes arguments to the ls command in the cmd array, and specifies the environment for the new process image using the env argument.
#include <unistd.h>
int ret;
char *cmd[] = { "ls", "-l", (char *)0 };
char *env[] = { "HOME=/usr/home", "LOGNAME=home", (char *)0 };
...
ret = execve ("/bin/ls", cmd, env);
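To make the requirement concrete: the environment array is walked element by element until a null pointer is found, which is why a pointer past the end of argv is not a safe substitute. The helper below, env_count, is a hypothetical illustration of that walk and not part of any library.

```c
#include <stddef.h>

/* Hypothetical helper: count the entries of a NULL-terminated environment
 * array, walking it the same way a consumer of envp has to. Without the
 * terminating null pointer this loop would read past the end of the array. */
size_t env_count(char *const envp[])
{
    size_t n = 0;
    while (envp[n] != NULL)
        n++;
    return n;
}
```

For the env array in the POSIX example above, env_count(env) yields 2.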
CodePudding user response:
I was curious what exactly did happen. I used strace -f ./a.out to see output like this, showing that it's the parent making a write(2, "Done\nChild\n", 11) system call. (It's the lower-numbered PID, not the new PID that strace reports attaching to after vfork.)
...
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
munmap(0x7f7e48c59000, 193483) = 0
vfork(strace: Process 515667 attached
<unfinished ...>
[pid 515667] execve("/usr/bin/true", ["/usr/bin/true"], 0x7ffc4447ce18 /* 60 vars */ <unfinished ...>
[pid 515666] <... vfork resumed>) = 515667
[pid 515666] write(2, "child got 515667\n", 17child got 515667
) = 17
[pid 515667] <... execve resumed>) = 0
[pid 515666] write(2, "Done\nChild\n", 11Done
Child
) = 11
[pid 515667] brk(NULL <unfinished ...>
[pid 515666] exit_group(0 <unfinished ...>
[pid 515667] <... brk resumed>) = 0x5603b644c000
[pid 515666] <... exit_group resumed>) = ?
[pid 515667] arch_prctl(0x3001 /* ARCH_??? */, 0x7ffc878f2720) = -1 EINVAL (Invalid argument)
[pid 515666] exited with 0
access("/etc/ld.so.preload", R_OK) = -1 ENOENT (No such file or directory)
... the parent has exited by now, leaving just the child running the dynamic linker for /usr/bin/true
This is terminal output mixed with strace output; I could have used strace -f -o vfork.trace ./a.out to capture the log separately, or ./a.out &>/dev/null.
The child overwrites the parent's return address, to the execve call site
The actual behaviour of this C code with undefined behaviour happened to be the same with gcc (-O0 by default), gcc -O3, and clang -O3. So for asm that was easier to single-step with GDB, I built it with gcc -O3 -fno-plt on my Arch GNU/Linux system (GCC 12.2 in case it matters). -fno-plt means that dynamic linking isn't "lazy", so we can step into library functions.
It was also handy to look at the compiler's asm source with symbolic names (https://godbolt.org/z/j6ME6rWaa).
After vfork, GDB detaches the child and lets it run, so you're still single-stepping the parent.
The parent's return from the glibc syscall() wrapper function is not to the test eax,eax instruction after call syscall; it's to the instruction after a different call.
It seems that after the child returns from vfork, it ends up overwriting the return address on the stack before the parent has a chance to run. That makes sense; the compiler-generated asm for main doesn't adjust RSP after function entry, so any other call would push a return address to the same place, overwriting the return address in the other process.
The glibc wrapper for vfork avoids this by popping the return address into a register before the syscall and pushing it back right after, to make it usually work under the conditions where POSIX and the Linux man page say it should. (Those conditions don't include the way you're using it, but even in a safe usage, a call to execve before the parent can ret from a wrapper function would be a problem.)
The actual place it returned to was a RIP-relative LEA following a call, not a test eax,eax. That was the lightbulb moment, the clue that a return address would have been overwritten. That LEA is setting up args for sprintf; the preceding call was call execve.
That makes sense; execve is the last thing the child did since it only returns on error; on success it replaces the process with a fresh address space that's no longer shared with the parent.
After the child returned from syscall(__NR_vfork), it branched and called execve, pushing that return address, overwriting the parent's return address from call syscall because they share an address space, including the stack.
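The sharing itself can be observed without any undefined behaviour by using clone() with CLONE_VM | CLONE_VFORK and giving the child its own stack. This is a Linux-specific sketch; the names shared_flag, child_fn and shared_write_visible, and the 64 KiB stack size, are assumptions made for illustration.

```c
#define _GNU_SOURCE            /* for clone(); must precede every #include */
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>

static volatile int shared_flag = 0;

/* Runs in the child, on its own stack, but in the parent's address space. */
static int child_fn(void *arg)
{
    (void)arg;
    shared_flag = 42;          /* lands in memory the parent also sees */
    return 0;                  /* the clone() wrapper exits the child with this */
}

/* Hypothetical helper: returns the value the parent observes in shared_flag
 * after the child has run, or -1 on failure. */
int shared_write_visible(void)
{
    const size_t stack_size = 64 * 1024;
    char *stack = malloc(stack_size);
    if (stack == NULL)
        return -1;

    /* CLONE_VM shares the address space; CLONE_VFORK suspends the parent
     * until the child exits or execs. The stack grows down on x86-64,
     * so the top of the buffer is passed. */
    pid_t pid = clone(child_fn, stack + stack_size,
                      CLONE_VM | CLONE_VFORK | SIGCHLD, NULL);
    if (pid == -1) {
        free(stack);
        return -1;
    }

    int status;
    waitpid(pid, &status, 0);
    free(stack);
    return shared_flag;
}
```

With a plain fork() the parent would still see 0 here; with a shared address space it sees the child's store. The raw vfork syscall shares the stack pointer as well, which is what lets the child's call execve clobber the parent's return address.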
That leaves just the parent, executing from the return path of execve(), which in a non-buggy (or non-hacky) program would only be reachable on error.
So it does the sprintf. It prints child got 515667 because that PID was the value in EAX as the parent was returning from vfork (to this block of code, which takes res from the EAX return value of this other call site).
As for how it manages to pick 11 instead of 5 as the length for the write system call, the details probably differ in debug vs. optimized builds. In an optimized build, different branches of the if(a) leave a different number in a register which the call to write() uses.
In a debug build, only the child returned to the vfork call site and stored a value for a (zero) to the stack, so the parent later loads a == 0 from that shared stack slot and picks 11.
Shenanigans like this are why nobody uses vfork anymore; a couple of copy-on-write page faults are cheap enough that it's not worth playing with fire.
It's also why the rules on how you're allowed to use vfork are very restrictive; you'd better have your args for execve already constructed before you call vfork, so the very next thing can be a call execve.
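A usage that stays inside those rules might look like the sketch below. It assumes the libc vfork() wrapper rather than the raw syscall; the helper name vfork_exec_wait is hypothetical, and the exit code 127 for a failed exec is only a convention borrowed from shells.

```c
#define _DEFAULT_SOURCE        /* exposes vfork() under strict -std modes */
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: everything execve needs is built by the caller
 * before vfork(), so the child does nothing except execve or _exit. */
int vfork_exec_wait(const char *path, char *const argv[], char *const envp[])
{
    pid_t pid = vfork();       /* only a pid_t is written after this point */
    if (pid == -1)
        return -1;

    if (pid == 0) {
        execve(path, argv, envp);
        _exit(127);            /* reached only if execve failed */
    }

    /* Parent: resumes once the child has exec'd or exited. */
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

The child never returns from the function that called vfork(), never touches a variable, and never calls sprintf() or write(), which are exactly the things the code in the question does.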
syscall(__NR_vfork) isn't safe; it needs special handling
Single-stepping into the glibc wrapper (stepi aka si in GDB, in layout asm TUI mode), we can see its asm.
│ 0x7ffff7e7d830 <vfork> endbr64
│ 0x7ffff7e7d834 <vfork+4> pop rdi
│ 0x7ffff7e7d835 <vfork+5> mov eax,0x3a
│ 0x7ffff7e7d83a <vfork+10> syscall
│ 0x7ffff7e7d83c <vfork+12> push rdi
│ > 0x7ffff7e7d83d <vfork+13> cmp eax,0xfffff001 # EAX >= -ERRNO_MAX
│ 0x7ffff7e7d842 <vfork+18> jae 0x7ffff7e7d858 <vfork+40>
# else no-error return path.
│ 0x7ffff7e7d844 <vfork+20> xor esi,esi
│ 0x7ffff7e7d846 <vfork+22> rdsspq rsi
│ 0x7ffff7e7d84b <vfork+27> test rsi,rsi # if shadow stack not in use
│ 0x7ffff7e7d84e <vfork+30> je 0x7ffff7e7d857 <vfork+39>
│ 0x7ffff7e7d850 <vfork+32> test eax,eax # in parent, normal return
│ 0x7ffff7e7d852 <vfork+34> jne 0x7ffff7e7d857 <vfork+39>
│ 0x7ffff7e7d854 <vfork+36> pop rdi # pop real return address
│ 0x7ffff7e7d855 <vfork+37> jmp rdi # and manually return to the correct address from the shadow stack?
# no shadow-stack path of execution, return normally.
│ 0x7ffff7e7d857 <vfork+39> ret
# error handling, set errno and return -1
│ 0x7ffff7e7d858 <vfork+40> mov rcx,QWORD PTR [rip+0x105509] # 0x7ffff7f82d68
│ 0x7ffff7e7d85f <vfork+47> neg eax
│ 0x7ffff7e7d861 <vfork+49> mov DWORD PTR fs:[rcx],eax
│ 0x7ffff7e7d864 <vfork+52> or rax,0xffffffffffffffff # code-size optimization for mov rax,-1 (really rarely executed for most system calls)
│ 0x7ffff7e7d868 <vfork+56> ret
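The cmp eax,0xfffff001 / jae pair and the error path at the end implement the usual Linux convention that a raw system call returns -errno in the range -4095..-1 on failure. A rough C rendering of that logic follows; the function name raw_to_libc_result is hypothetical, and this is not glibc's actual source.

```c
#include <errno.h>

/* Hypothetical helper mirroring the error path in the listing above:
 * raw kernel return values in [-4095, -1] are errors carrying -errno. */
long raw_to_libc_result(long raw)
{
    /* Unsigned compare, like cmp/jae: true exactly for -4095..-1. */
    if ((unsigned long)raw >= (unsigned long)-4095L) {
        errno = (int)-raw;     /* neg eax; mov fs:[rcx],eax */
        return -1;             /* or rax,-1 */
    }
    return raw;                /* success: a PID in the parent, 0 in the child */
}
```

On success the value is passed through untouched, which is how the child's PID ends up in EAX when the parent returns.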
rdsspq reads the "shadow stack" pointer, in case the caller was using CET, Control-flow Enforcement Technology. I'm not familiar with CET, so my comments on that part are guesswork based on what this function probably needs to do, and how it's using these instructions.
I should have just looked at the hand-written glibc source, which has comments: glibc/sysdeps/unix/sysv/linux/x86_64/vfork.S. I've updated the listing above with some comments from there.
It seems like there could still be a race with the child, like if our push rdi runs before the child returns and calls execve. Under normal scheduling conditions, though, the child does run first.
Maybe some special logic stops the parent task from returning to user-space until after the child has made one more system call?