I am following the book "Beginning x64 Assembly Programming" on a 64-bit Linux system, using NASM and gcc.
In the chapter on floating-point operations, the book gives the code below for adding two floating-point numbers. In the book, and in other online sources, I have read that register RAX specifies the number of XMM registers used, according to the calling convention.
The code in the book goes as follows:
extern printf
section .data
num1 dq 9.0
num2 dq 73.0
fmt db "The numbers are %f and %f",10,0
f_sum db "%f %f = %f",10,0
section .text
global main
main:
push rbp
mov rbp, rsp
printn:
movsd xmm0, [num1]
movsd xmm1, [num2]
mov rdi, fmt
mov rax, 2 ;for printf rax specifies amount of xmm registers
call printf
sum:
movsd xmm2, [num1]
addsd xmm2, [num2]
printsum:
movsd xmm0, [num1]
movsd xmm1, [num2]
mov rdi, f_sum
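mov rax, 3 ;three XMM registers are used: xmm0, xmm1, xmm2
call printf
leave ;undo the push rbp / mov rbp, rsp prologue so main can return
ret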
That works as expected.
Then, before the last printf call, I changed mov rax, 3 to mov rax, 1, reassembled, and ran the program.
I was expecting some different nonsense output, but I was surprised the output was exactly the same. printf outputs the 3 float values correctly:
The numbers are 9.000000 and 73.000000
9.000000 73.000000 = 82.000000
I suppose there is some kind of override when printf is expecting the use of several XMM registers, and as long as RAX is not 0, it will use consecutive XMM registers. I have searched for an explanation in the calling conventions and the NASM manual, but didn't find one.
What is the reason why this works?
Answer:
The x86-64 SysV ABI's strict rules allow implementations that only save the exact number of XMM regs specified, but current implementations only check for zero / non-zero because that's efficient, especially for the AL=0 common case.
If you pass a number in AL (footnote 1) lower than the actual number of XMM register args, or a number higher than 8, you'd be violating the ABI, and it's only this implementation detail which stops your code from breaking. (i.e. it "happens to work", but is not guaranteed by any standard or documentation, and isn't portable to some other real implementations, like older GNU/Linux distros that were built with GCC4.5 or earlier.)
The Q&A "Assembly executable doesn't show anything (x64)" shows a current build of glibc printf which just checks for AL != 0, vs. an old build of glibc which computes a jump target into a sequence of movaps stores. (That Q&A is about that code breaking when AL > 8, making the computed jump go somewhere it shouldn't.)
The Q&A "Why does eax contain the number of vector parameters?" quotes the ABI doc, and shows ICC code-gen which similarly does a computed jump using the same instructions as old GCC.
Glibc's printf implementation is compiled from C source, normally by GCC. When modern GCC compiles a variadic function like printf, it makes asm that only checks for a zero vs. non-zero AL, dumping all 8 arg-passing XMM registers to an array on the stack if non-zero.
GCC4.5 and earlier actually did use the number in AL to do a computed jump into a sequence of movaps stores, to only actually save as many XMM regs as necessary.
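The difference between the two strategies can be modeled in plain C. This is only an illustrative sketch, not glibc or GCC code: the function names save_computed_jump and save_test_al, the xmm array standing in for XMM0..XMM7, and the save_area array are all hypothetical. Switch fall-through plays the role of the computed jump into the run of movaps stores.

```c
#include <assert.h>

#define NUM_XMM_ARGS 8  /* XMM0..XMM7 carry floating-point arguments */

/* Model of the old strategy (GCC 4.5 and earlier): enter the run of stores
 * at a point chosen by AL, so exactly `al` registers are saved.
 * Returns the number of stores executed. */
int save_computed_jump(const double xmm[NUM_XMM_ARGS],
                       double save_area[NUM_XMM_ARGS], unsigned al)
{
    int stores = 0;
    /* AL > 8 violates the ABI; the real computed jump would land in
     * unrelated code instead of failing cleanly like this assert. */
    assert(al <= NUM_XMM_ARGS);
    switch (al) {
    case 8: save_area[7] = xmm[7]; stores++; /* fall through */
    case 7: save_area[6] = xmm[6]; stores++; /* fall through */
    case 6: save_area[5] = xmm[5]; stores++; /* fall through */
    case 5: save_area[4] = xmm[4]; stores++; /* fall through */
    case 4: save_area[3] = xmm[3]; stores++; /* fall through */
    case 3: save_area[2] = xmm[2]; stores++; /* fall through */
    case 2: save_area[1] = xmm[1]; stores++; /* fall through */
    case 1: save_area[0] = xmm[0]; stores++; /* fall through */
    case 0: break;
    }
    return stores;
}

/* Model of the newer strategy: one zero / non-zero test on AL, then
 * either no stores at all or all eight. */
int save_test_al(const double xmm[NUM_XMM_ARGS],
                 double save_area[NUM_XMM_ARGS], unsigned al)
{
    if (al == 0)
        return 0;
    for (int i = 0; i < NUM_XMM_ARGS; i++)
        save_area[i] = xmm[i];
    return NUM_XMM_ARGS;
}
```

In this model, with three doubles in xmm[0..2] and al = 1, save_computed_jump stores only xmm[0], so a later va_arg would read stale save-area slots for the second and third values. save_test_al stores all eight, which matches why the modified program in the question still prints the right numbers.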
Nate's simple example from comments on Godbolt with GCC4.5 vs. GCC11 shows the same difference as the linked answer with disassembly of old/new glibc (built by GCC), unsurprisingly. This function only ever uses va_arg(v, double), never integer types, so it doesn't dump the incoming RDI...R9 anywhere, unlike printf. And it's a leaf function so it can use the red-zone (128 bytes below RSP).
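The C source of that example is not reproduced here. A function of the kind described might look like the following sketch; the signature (an int count followed by variadic doubles) and the loop are assumptions, chosen only to match the description "only ever uses va_arg(v, double)" and "leaf function".

```c
#include <assert.h>
#include <stdarg.h>

/* Hypothetical reconstruction: sums `count` variadic double arguments.
 * It never reads an integer via va_arg and calls no other function. */
double add_them(int count, ...)
{
    va_list v;
    double sum = 0.0;

    assert(count >= 0);
    va_start(v, count);
    for (int i = 0; i < count; i++)
        sum += va_arg(v, double);  /* each double comes from the XMM save area or the stack */
    va_end(v);
    return sum;
}
```

A C caller such as add_them(2, 9.0, 73.0) leaves it to the compiler to set AL before the call. The ABI only requires AL to be an upper bound on the number of vector registers used, in the range 0 to 8.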
# GCC4.5.3 -O3 -fPIC to compile like glibc would
add_them:
movzx eax, al
sub rsp, 48 # reserve stack space, needed either way
lea rdx, 0[0+rax*4] # each movaps is 4 bytes long
lea rax, .L2[rip] # code pointer to after the last movaps
lea rsi, -136[rsp] # used later by va_arg. test/jz version does the same, but after the movaps stores
sub rax, rdx
lea rdx, 39[rsp] # used later by va_arg, test/jz version also does an LEA like this
jmp rax # AL=0 case jumps to L2
movaps XMMWORD PTR -15[rdx], xmm7 # using RDX as a base makes each movaps 4 bytes long, vs. 5 with RSP
movaps XMMWORD PTR -31[rdx], xmm6
movaps XMMWORD PTR -47[rdx], xmm5
movaps XMMWORD PTR -63[rdx], xmm4
movaps XMMWORD PTR -79[rdx], xmm3
movaps XMMWORD PTR -95[rdx], xmm2
movaps XMMWORD PTR -111[rdx], xmm1
movaps XMMWORD PTR -127[rdx], xmm0 # xmm0 last, will be ready for store-forwarding last
.L2:
lea rax, 56[rsp] # first stack arg (if any), I think
## rest of the function
vs.
# GCC11.2 -O3 -fPIC
add_them:
sub rsp, 48
test al, al
je .L15 # only one test&branch macro-fused uop
movaps XMMWORD PTR -88[rsp], xmm0 # xmm0 first
movaps XMMWORD PTR -72[rsp], xmm1
movaps XMMWORD PTR -56[rsp], xmm2
movaps XMMWORD PTR -40[rsp], xmm3
movaps XMMWORD PTR -24[rsp], xmm4
movaps XMMWORD PTR -8[rsp], xmm5
movaps XMMWORD PTR 8[rsp], xmm6
movaps XMMWORD PTR 24[rsp], xmm7
.L15:
lea rax, [rsp+56] # first stack arg (if any), I think
lea rsi, -136[rsp] # used by va_arg. done after the movaps stores instead of before.
...
lea rdx, 56[rsp] # used by va_arg. With a different offset than older GCC, but used somewhat similarly. Redundant with the LEA into RAX; silly compiler.
GCC presumably changed strategy because the computed jump takes more static code size (I-cache footprint), and a test/jz is easier to predict than an indirect jump. Even more importantly, it's fewer uops executed in the common AL=0 (no-XMM) case (footnote 2). And not many more even for the AL=1 worst case (7 dead movaps stores but no work done computing a branch target).
Related Q&As:
- Assembly executable doesn't show anything (x64) AL != 0 vs. computed jump code-gen for glibc printf
- Why is