Experiment:
I ran sleep 1 under strace -tt (which reports timestamps of all syscalls) on the host and in a QEMU guest, and noticed that the time required to reach a certain syscall (clock_nanosleep) is almost twice as long in the guest:
- 1.813 ms on the host vs
- 3.396 ms in the guest.
Here is the full host strace -tt sleep 1 and here is the full QEMU strace -tt sleep 1.
Below are excerpts where you can already see the difference:
Host:
time since first syscall / syscall duration (ms), then the timestamp (as reported by strace):
0.000 / 0.653 ms: 13:13:56.452820 execve("/usr/bin/sleep", ["sleep", "1"], 0x7ffded01ecb0 /* 53 vars */) = 0
0.653 / 0.023 ms: 13:13:56.453473 brk(NULL) = 0x5617efdea000
0.676 / 0.063 ms: 13:13:56.453496 arch_prctl(0x3001 /* ARCH_??? */, 0x7fffeb7041b0) = -1 EINVAL (Invalid argument)
QEMU:
time since first syscall / syscall duration (ms), then the timestamp (as reported by strace):
0.000 / 1.008 ms: 12:12:03.164063 execve("/usr/bin/sleep", ["sleep", "1"], 0x7ffd0bd93e50 /* 13 vars */) = 0
1.008 / 0.119 ms: 12:12:03.165071 brk(NULL) = 0x55b78c484000
1.127 / 0.102 ms: 12:12:03.165190 arch_prctl(0x3001 /* ARCH_??? */, 0x7ffcb5dfd850) = -1 EINVAL (Invalid argument)
The questions:
- What causes the slowdown and overhead? The command does not use any hardware (GPU, disks, etc.), so there are no translation layers involved. I also ran the command several times to make sure that everything that can be cached in the guest is cached.
- Is there a way to speed it up?
Update:
- With cpupower frequency-set --governor performance, the timings are:
  - Host: 0.922 ms
  - Guest: 1.412 ms
- With the image in /dev/shm (-drive file=/dev/shm/root):
  - Host: 0.922 ms
  - Guest: 1.280 ms
PS
I modified the "bare" output of strace so that it includes (1) the time starting from 0 at the first syscall, followed by (2) the duration of the syscall, for easier understanding. For completeness, the script is here.
I started qemu in this way:
qemu-system-x86_64 -enable-kvm -cpu host -smp 4 -m 4G -nodefaults -no-user-config -nographic -no-reboot \
-kernel $HOME/devel/vmlinuz-5.13.0-20-generic \
-append 'earlyprintk=hvc0 console=hvc0 root=/dev/sda rw' \
-drive file=$HOME/devel/images/root,if=ide,index=0,media=disk,format=raw \
-device virtio-serial,id=virtio-serial0 -chardev stdio,mux=on,id=host-io,signal=off -device virtconsole,chardev=host-io,id=console0
CodePudding user response:
That's expected, considering the way strace is implemented, i.e. via the ptrace(2) system call: every time the traced process performs a system call or receives a signal, it is forcefully stopped and control is passed to the tracing process, which in the case of strace does all the unpacking and printing synchronously, i.e. while keeping the traced process stopped. That is exactly the kind of path that multiplies any per-event emulation overhead.
It would be instructive to strace strace itself -- you will see that it does not let the traced process continue (with ptrace(PTRACE_SYSCALL, ...)) until it has processed and written out everything related to the current system call.
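To make that concrete, here is a bare-bones sketch of the stop-and-report loop strace is built around (x86-64 only, error handling and signal bookkeeping omitted; the real strace does far more work at every stop):

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 2;
    }

    pid_t pid = fork();
    if (pid == 0) {                       /* child: ask to be traced, then exec */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execvp(argv[1], &argv[1]);
        _exit(127);
    }

    int status;
    waitpid(pid, &status, 0);             /* first stop: the initial execve */
    while (!WIFEXITED(status)) {
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, pid, NULL, &regs);
        /* strace's decoding & printing happens here, tracee still frozen */
        fprintf(stderr, "syscall stop: nr=%lld\n", (long long)regs.orig_rax);
        ptrace(PTRACE_SYSCALL, pid, NULL, NULL);  /* resume until next stop */
        waitpid(pid, &status, 0);
    }
    return 0;
}

Every system call produces two such stops (entry and exit), and the tracee stays frozen across each round trip to the tracer, so any per-stop cost -- including whatever the hypervisor adds -- is paid dozens of times even for a trivial command.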
Notice that in order to run a "trivial" sleep 1 command, the dynamic linker will perform a couple dozen system calls before even getting to the entry point of the sleep binary.
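You can see the tally yourself with strace's summary mode:

strace -c sleep 1

Most of the counted calls (openat, read, mmap, mprotect, close, ...) come from the dynamic loader and libc startup rather than from sleep itself.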
I don't think that optimizing strace is worth spending time on; if you were planning to run strace as an auditing tool instead of a debugging one (by running production tasks under strace or similar), you should reconsider your design ;-)
CodePudding user response:
It turned out that my custom-built kernel was missing the CONFIG_HYPERVISOR_GUEST=y option (and a couple of nested options).
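For reference, the nested options in question are presumably the paravirtualization ones gated by that symbol; a KVM guest kernel would typically have at least the following (check against your own .config):

CONFIG_HYPERVISOR_GUEST=y
CONFIG_PARAVIRT=y
CONFIG_KVM_GUEST=y

Without them, the guest kernel cannot use paravirtual facilities such as kvmclock, so operations that could be handled cheaply inside the guest cause extra VM exits instead.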