vfork and execve behave strangely when called via syscall()

If you execute the code below, you'll see that execve appears to return a process ID and the parent branch never executes. I looked for documentation, but I either can't find it or can't understand it. The clone(2) man page describes vfork (CLONE_VFORK) as quoted below, yet the parent never seems to run. If you uncomment the non-syscall vfork() line, or use syscall(__NR_fork) instead, it works as expected.

the execution of the calling process is suspended until the child releases its virtual memory resources via a call to execve(2) or _exit(2) (as with vfork(2)).

#include <unistd.h>
#include <syscall.h>
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    //int a = vfork();
    //int a = syscall(__NR_fork);
    int a = syscall(__NR_vfork);
    if (a) {
        write(2, "parent\n", 7);
    } else {
        char*args[] = {"/usr/bin/true", (char*)0};
        int res = execve(args[0], args, &argv[2]);
        char buf[256];
        sprintf(buf, "child got %d\n", res);
        write(2, buf, strlen(buf));
    }
    write(2, "Done\nChild\n", a?5:11);
}

CodePudding user response：

There are multiple instances of undefined behavior in the code.

You are invoking undefined behavior by making calls such as sprintf() and write() after execve() fails. Per POSIX:

... the behavior is undefined if the process created by vfork() either modifies any data other than a variable of type pid_t used to store the return value from vfork(), or returns from the function in which vfork() was called, or calls any other function before successfully calling _exit() or one of the exec family of functions.

Even simply returning from main() after vfork() invokes undefined behavior.

@Barmar summed it up best: "you should just not use vfork() at all"

This code also invokes undefined behavior:

    char*args[] = {"/usr/bin/true", (char*)0};
    int res = execve(args[0], args, &argv[2]);

argv[2] doesn't exist, so passing its address to execve() invokes undefined behavior. Note that taking the address of argv[2] does not in itself invoke undefined behavior: an address one past the actual end of an array does exist. But it can't be safely dereferenced, which execve() will do.

execve() expects a pointer to an array of environment pointers as its third argument:

Using execve()

The following example passes arguments to the ls command in the cmd array, and specifies the environment for the new process image using the env argument.
#include <unistd.h>


int ret;
char *cmd[] = { "ls", "-l", (char *)0 };
char *env[] = { "HOME=/usr/home", "LOGNAME=home", (char *)0 };
...
ret = execve ("/bin/ls", cmd, env);
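Putting both points together, here is a minimal sketch of what a conforming version of the question's program could look like. It assumes the glibc vfork() wrapper rather than syscall(__NR_vfork), builds the argument vector before calling vfork(), passes the process's own environ as the environment, and does nothing in the child except execve() or _exit(). The helper name run_and_wait is hypothetical and not part of the original code.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;   /* the calling process's environment vector */

/* Hypothetical helper (not from the question): run `path` with the
 * argument vector `args` and the caller's environment.
 * Returns the child's exit status, or -1 on failure. */
int run_and_wait(const char *path, char *const args[])
{
    pid_t pid = vfork();      /* library wrapper, not syscall(__NR_vfork) */
    if (pid == -1)
        return -1;            /* vfork itself failed */

    if (pid == 0) {
        /* Child: the only permitted actions are exec or _exit. */
        execve(path, args, environ);
        _exit(127);           /* reached only if execve() failed */
    }

    /* Parent: resumes once the child has called execve() or _exit(). */
    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

With the question's arguments, the call would be run_and_wait(args[0], args), where args is the same {"/usr/bin/true", (char*)0} array, constructed before the call.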

CodePudding user response：

I was curious what exactly happened. I ran strace -f ./a.out and got output like this, showing that it's the parent making the write(2, "Done\nChild\n", 11) system call (the lower-numbered PID, not the new PID that strace reports attaching to after vfork).

...
prlimit64(0, RLIMIT_STACK, NULL, {rlim_cur=8192*1024, rlim_max=RLIM64_INFINITY}) = 0
munmap(0x7f7e48c59000, 193483)          = 0
vfork(strace: Process 515667 attached
 <unfinished ...>
[pid 515667] execve("/usr/bin/true", ["/usr/bin/true"], 0x7ffc4447ce18 /* 60 vars */ <unfinished ...>
[pid 515666] <... vfork resumed>)       = 515667
[pid 515666] write(2, "child got 515667\n", 17child got 515667
) = 17
[pid 515667] <... execve resumed>)      = 0
[pid 515666] write(2, "Done\nChild\n", 11Done
Child
) = 11
[pid 515667] brk(NULL <unfinished ...>
[pid 515666] exit_group(0 <unfinished ...>
[pid 515667] <... brk resumed>)         = 0x5603b644c000
[pid 515666] <... exit_group resumed>)  = ?
[pid 515667] arch_prctl(0x3001 /* ARCH_??? */, 0x7ffc878f2720) = -1 EINVAL (Invalid argument)
[pid 515666]     exited with 0    
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
... the parent has exited by now, leaving just the child running the dynamic linker for /usr/bin/true

This is terminal output mixed with strace output; I could have used strace -f -o vfork.trace ./a.out to capture the log separately, or ./a.out &>/dev/null.

The child overwrites the parent's return address, to the `execve` call site

The actual behaviour of this C code with undefined behaviour happened to be the same with gcc (-O0 by default), gcc -O3, and clang -O3. So, to get asm that was easier to single-step with GDB, I built it with gcc -O3 -fno-plt on my Arch GNU/Linux system (GCC 12.2, in case it matters). -fno-plt means that dynamic linking isn't "lazy", so we can step into library functions.

It was also handy to look at the compiler's asm source with symbolic names (https://godbolt.org/z/j6ME6rWaa).

After vfork, GDB detaches the child and lets it run, so you're still single-stepping the parent.

The parent's return from the glibc syscall() wrapper function is not to the test eax,eax instruction after call syscall; it's to the instruction after a different call. It seems that after the child returns from vfork, it ends up overwriting the return address on the stack before the parent has a chance to run. That makes sense; the compiler-generated asm for main doesn't adjust RSP after function entry, so any other call would push a return address to the same place, overwriting the return address in the other process.

The glibc wrapper for vfork avoids this by popping the return address before the syscall and pushing it back right after, to make it usually work under the conditions where POSIX and the Linux man page say it should. (Those conditions don't include the way you're using it, but even in a safe usage, the child's call execve running before the parent can ret from a wrapper function would be a problem.)

The actual place it returned to was a RIP-relative LEA following a call, not a test eax,eax. That was the lightbulb moment, the clue that a return address would have been overwritten. That LEA is setting up args for sprintf; the preceding call was call execve.

That makes sense; execve is the last thing the child did since it only returns on error; on success it replaces the process with a fresh address space that's no longer shared with the parent.

After the child returned from syscall(__NR_vfork), it branched and called execve, pushing that return address and overwriting the parent's return address from call syscall, because they share an address space, including the stack.

That leaves just the parent, executing from the return path of execve(), which in a non-buggy (or non-hacky) program would only be reachable on error.

So it does the sprintf. It prints child got 515667 because that PID was the value in EAX as the parent was returning from vfork, into this block of code, which takes res from the EAX return value of that other call site.

As for how it manages to pick 11 instead of 5 as the length for the write system call, the details probably differ in debug vs. optimized builds. In an optimized build, different branches of the if(a) leave a different number in a register which the call to write() uses.

In a debug build, only the child returned to the vfork call site and stored a value for a on the stack. That value is 0, so the parent later reads a == 0 from the shared stack and picks 11.

Shenanigans like this are why nobody uses vfork anymore; a couple of copy-on-write page faults are cheap enough that it's not worth playing with fire.
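If the goal is just "start another program and wait for it", one way to avoid calling vfork yourself is posix_spawn(), which leaves the process-creation details to the C library. The following is only a sketch under that assumption, not something from the original question; the helper name spawn_and_wait is hypothetical.

```c
#include <spawn.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

extern char **environ;   /* the calling process's environment vector */

/* Hypothetical helper: start `path` with `args` and the caller's
 * environment, wait for it, and return its exit status (-1 on failure). */
int spawn_and_wait(const char *path, char *const args[])
{
    pid_t pid;

    /* No file actions and no spawn attributes: both may be NULL.
     * posix_spawn() returns an error number directly, not via errno. */
    int err = posix_spawn(&pid, path, NULL, NULL, args, environ);
    if (err != 0)
        return -1;

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

None of the caller's code runs in the new process here, so there is no shared stack for it to corrupt.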

It's also why the rules on how you're allowed to use vfork are very restrictive; you'd better have your args for execve already constructed before you call vfork, so the very next thing can be a call execve.

`syscall(__NR_vfork)` isn't safe; it needs special handling

Single-stepping into the glibc wrapper (stepi aka si in GDB, in layout asm TUI mode), we can see its asm.

│    0x7ffff7e7d830 <vfork>          endbr64
│    0x7ffff7e7d834 <vfork+4>        pop    rdi
│    0x7ffff7e7d835 <vfork+5>        mov    eax,0x3a
│    0x7ffff7e7d83a <vfork+10>       syscall
│    0x7ffff7e7d83c <vfork+12>       push   rdi
│  > 0x7ffff7e7d83d <vfork+13>       cmp    eax,0xfffff001     # EAX >= -ERRNO_MAX
│    0x7ffff7e7d842 <vfork+18>       jae    0x7ffff7e7d858 <vfork+40>
               # else no-error return path.
│    0x7ffff7e7d844 <vfork+20>       xor    esi,esi
│    0x7ffff7e7d846 <vfork+22>       rdsspq rsi
│    0x7ffff7e7d84b <vfork+27>       test   rsi,rsi   # if shadow stack not in use
│    0x7ffff7e7d84e <vfork+30>       je     0x7ffff7e7d857 <vfork+39>
│    0x7ffff7e7d850 <vfork+32>       test   eax,eax   # in parent, normal return
│    0x7ffff7e7d852 <vfork+34>       jne    0x7ffff7e7d857 <vfork+39>
│    0x7ffff7e7d854 <vfork+36>       pop    rdi         # pop real return address
│    0x7ffff7e7d855 <vfork+37>       jmp    rdi         # and manually return to the correct address from the shadow stack?

     # no shadow-stack path of execution, return normally.
│    0x7ffff7e7d857 <vfork+39>       ret

  # error handling, set errno and return -1
│    0x7ffff7e7d858 <vfork+40>       mov    rcx,QWORD PTR [rip+0x105509]        # 0x7ffff7f82d68
│    0x7ffff7e7d85f <vfork+47>       neg    eax
│    0x7ffff7e7d861 <vfork+49>       mov    DWORD PTR fs:[rcx],eax
│    0x7ffff7e7d864 <vfork+52>       or     rax,0xffffffffffffffff   # code-size optimization for mov rax,-1   (really rarely executed for most system calls)
│    0x7ffff7e7d868 <vfork+56>       ret

rdsspq reads the "shadow stack" pointer, in case the caller was using CET, Control-flow Enforcement Technology. I'm not familiar with CET, so my comments on that part are guesswork based on what this function probably needs to do, and how it's using these instructions.

I should have just looked at the hand-written glibc source, which has comments: glibc/sysdeps/unix/sysv/linux/x86_64/vfork.S. I've updated the comments above with some from there.

It seems like there could still be a race with the child, like if our push rdi runs before the child returns and calls execve. Under normal scheduling conditions, though, the child does run first.

Maybe some special logic stops the parent task from returning to user-space until after the child has made one more system call?
