Experiment:
I ran sleep 1 under strace -tt (which reports timestamps of all syscalls) on the host and in a QEMU guest, and noticed that the time required to reach a certain syscall (clock_nanosleep) is almost twice as long in the guest:
- 1.813 ms on the host vs
- 3.396 ms in the guest.
Here is the full host strace -tt sleep 1 and here is the full QEMU strace -tt sleep 1.
Below are excerpts where you can already see the difference:
Host:
time since first syscall / syscall duration (ms), then the timestamp (as reported by strace):
0.000 / 0.653 ms: 13:13:56.452820 execve("/usr/bin/sleep", ["sleep", "1"], 0x7ffded01ecb0 /* 53 vars */) = 0
0.653 / 0.023 ms: 13:13:56.453473 brk(NULL) = 0x5617efdea000
0.676 / 0.063 ms: 13:13:56.453496 arch_prctl(0x3001 /* ARCH_??? */, 0x7fffeb7041b0) = -1 EINVAL (Invalid argument)
QEMU:
time since first syscall / syscall duration (ms), then the timestamp (as reported by strace):
0.000 / 1.008 ms: 12:12:03.164063 execve("/usr/bin/sleep", ["sleep", "1"], 0x7ffd0bd93e50 /* 13 vars */) = 0
1.008 / 0.119 ms: 12:12:03.165071 brk(NULL) = 0x55b78c484000
1.127 / 0.102 ms: 12:12:03.165190 arch_prctl(0x3001 /* ARCH_??? */, 0x7ffcb5dfd850) = -1 EINVAL (Invalid argument)
The questions:
- What causes the slowdown and overhead? The command does not use any hardware (GPU, disks, etc.), so there are no translation layers involved. I also ran the command several times to make sure that everything that can be cached in the guest is cached.
- Is there a way to speed it up?
Update:
- With cpupower frequency-set --governor performance, the timings are:
  - Host: 0.922 ms
  - Guest: 1.412 ms
- With the image in /dev/shm (-drive file=/dev/shm/root):
  - Host: 0.922 ms
  - Guest: 1.280 ms
PS
I modified the "bare" output of strace so that it includes (1) the time starting from 0 at the first syscall, followed by (2) the duration of the syscall, for easier understanding. For completeness, the script is here.
I started qemu in this way:
qemu-system-x86_64 -enable-kvm -cpu host -smp 4 -m 4G -nodefaults -no-user-config -nographic -no-reboot \
-kernel $HOME/devel/vmlinuz-5.13.0-20-generic \
-append 'earlyprintk=hvc0 console=hvc0 root=/dev/sda rw' \
-drive file=$HOME/devel/images/root,if=ide,index=0,media=disk,format=raw \
-device virtio-serial,id=virtio-serial0 -chardev stdio,mux=on,id=host-io,signal=off -device virtconsole,chardev=host-io,id=console0
CodePudding user response:
That's expected, considering the way strace is implemented, i.e. via the ptrace(2) system call: every time the traced process performs a system call or receives a signal, it is forcefully stopped and control is passed to the tracing process, which in the case of strace does all the unpacking and printing synchronously, i.e. while keeping the traced process stopped. That is exactly the kind of path that multiplies any per-event emulation overhead.
It would be instructive to strace strace itself -- you will see that it does not let the traced process continue (with ptrace(PTRACE_SYSCALL, ...)) until it has processed and written out everything related to the current system call.
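To make that concrete, here is a bare-bones sketch of the stop-and-report loop strace is built around (x86-64 only, error handling and signal bookkeeping omitted; the real strace does far more work at every stop):

#include <stdio.h>
#include <sys/ptrace.h>
#include <sys/user.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 2;
    }

    pid_t pid = fork();
    if (pid == 0) {                       /* child: ask to be traced, then exec */
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);
        execvp(argv[1], &argv[1]);
        _exit(127);
    }

    int status;
    waitpid(pid, &status, 0);             /* first stop: the initial execve */
    while (!WIFEXITED(status)) {
        struct user_regs_struct regs;
        ptrace(PTRACE_GETREGS, pid, NULL, &regs);
        /* strace's decoding & printing happens here, tracee still frozen */
        fprintf(stderr, "syscall stop: nr=%lld\n", (long long)regs.orig_rax);
        ptrace(PTRACE_SYSCALL, pid, NULL, NULL);  /* resume until next stop */
        waitpid(pid, &status, 0);
    }
    return 0;
}

Every system call produces two such stops (entry and exit), and the tracee stays frozen across each round trip to the tracer, so any per-stop cost -- including whatever the hypervisor adds -- is paid dozens of times even for a trivial command.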
Notice that in order to run a "trivial" sleep 1 command, the dynamic linker will perform a couple dozen system calls before even getting to the entry point of the sleep binary.
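You can see the tally yourself with strace's summary mode:

strace -c sleep 1

Most of the counted calls (openat, read, mmap, mprotect, close, ...) come from the dynamic loader and libc startup rather than from sleep itself.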
I don't think that optimizing strace is worth spending time on; if you were planning to run strace as an auditing tool instead of a debugging one (by running production tasks under strace or similar), you should reconsider your design ;-)
CodePudding user response:
It turned out that my custom-built kernel was missing the CONFIG_HYPERVISOR_GUEST=y option (and a couple of nested options).
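For reference, the nested options in question are presumably the paravirtualization ones gated by that symbol; a KVM guest kernel would typically have at least the following (check against your own .config):

CONFIG_HYPERVISOR_GUEST=y
CONFIG_PARAVIRT=y
CONFIG_KVM_GUEST=y

Without them, the guest kernel cannot use paravirtual facilities such as kvmclock, so operations that could be handled cheaply inside the guest cause extra VM exits instead.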