Why does printf still work with RAX lower than the number of FP args in XMM registers?


I am following the book "Beginning x64 Assembly Programming" on a 64-bit Linux system, using NASM and gcc.
In the chapter about floating-point operations, the book gives the code below for adding two floating-point numbers. In the book, and in other online sources, I have read that register RAX specifies the number of XMM registers to be used, according to the calling convention.
The code in the book goes as follows:

extern printf
section .data
num1        dq  9.0
num2        dq  73.0
fmt     db  "The numbers are %f and %f",10,0
f_sum       db  "%f   %f = %f",10,0

section .text
global main
main:
    push rbp
    mov rbp, rsp
printn:
    movsd xmm0, [num1]
    movsd xmm1, [num2]
    mov rdi, fmt
    mov rax, 2      ; for printf, rax specifies the number of xmm registers used
    call printf

sum:
    movsd xmm2, [num1]
    addsd xmm2, [num2]
printsum:
    movsd xmm0, [num1]
    movsd xmm1, [num2]
    mov rdi, f_sum
    mov rax, 3
    call printf

    pop rbp
    xor eax, eax    ; return 0 from main
    ret

That works as expected.
Then, before the last printf call, I tried changing

mov rax, 3

for

mov rax, 1

Then I reassembled and ran the program.

I was expecting some kind of nonsense output, but to my surprise the output was exactly the same; printf printed the 3 float values correctly:

The numbers are 9.000000 and 73.000000
9.000000   73.000000 = 82.000000

I suppose there is some kind of override when printf is expecting several XMM registers to be used, and as long as RAX is not 0, it will use consecutive XMM registers. I have searched for an explanation in the calling convention and in the NASM manual, but didn't find one.

What is the reason this works?

CodePudding user response:

The x86-64 SysV ABI's strict rules allow implementations that only save the exact number of XMM regs specified, but current implementations only check for zero / non-zero because that's efficient, especially for the AL=0 common case.

If you pass a number in AL lower than the actual number of XMM register args, or a number higher than 8, you'd be violating the ABI, and it's only this implementation detail which stops your code from breaking. (i.e. it "happens to work", but is not guaranteed by any standard or documentation, and isn't portable to some other real implementations, like older GNU/Linux distros that were built with GCC 4.5 or earlier.)

This Q&A shows a current build of glibc printf which just checks for AL != 0, vs. an old build of glibc which computes a jump target into a sequence of movaps stores. (That Q&A is about that code breaking when AL > 8, making the computed jump go somewhere it shouldn't.)

"Why does eax contain the number of vector parameters?" quotes the ABI doc, and shows ICC code-gen which similarly does a computed jump using the same instructions as old GCC.
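You can see the caller side of the convention in compiler output yourself. A minimal C caller (my own example, not from the question) compiled with gcc -O2 sets EAX to the number of vector-register args before each variadic call:

#include <stdio.h>

int main(void) {
    printf("%d\n", 42);          /* no FP args: gcc emits  xor eax, eax  before call printf */
    printf("%f %f\n", 1.0, 2.0); /* doubles in XMM0/XMM1: gcc emits  mov eax, 2 */
    return 0;
}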


Glibc's printf implementation is compiled from C source, normally by GCC. When modern GCC compiles a variadic function like printf, it emits asm that only checks for zero vs. non-zero AL, dumping all 8 arg-passing XMM registers to an array on the stack if it's non-zero.

GCC 4.5 and earlier actually did use the number in AL, doing a computed jump into a sequence of movaps stores to save only as many XMM regs as necessary.
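Conceptually, that computed jump works like a fall-through switch (Duff's-device style) that enters the store sequence at the right point. A C analogue (my sketch, not glibc source; n stands in for AL, regs[i] for XMMi):

/* Sketch of the old strategy: exactly n stores execute, highest register first. */
void save_vector_regs(double *save_area, const double *regs, unsigned n) {
    switch (n) {                     /* GCC 4.5 did this as a computed jump */
    case 8: save_area[7] = regs[7];  /* fall through */
    case 7: save_area[6] = regs[6];  /* fall through */
    case 6: save_area[5] = regs[5];  /* fall through */
    case 5: save_area[4] = regs[4];  /* fall through */
    case 4: save_area[3] = regs[3];  /* fall through */
    case 3: save_area[2] = regs[2];  /* fall through */
    case 2: save_area[1] = regs[1];  /* fall through */
    case 1: save_area[0] = regs[0];  /* fall through */
    case 0: break;                   /* AL=0: skip all the stores */
    }
}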

Nate's simple example from comments on Godbolt with GCC4.5 vs. GCC11 shows the same difference as the linked answer's disassembly of old/new glibc (built by GCC), unsurprisingly. This function only ever uses va_arg(v, double);, never integer types, so it doesn't dump the incoming RDI...R9 anywhere, unlike printf. And it's a leaf function, so it can use the red-zone (128 bytes below RSP).
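The linked source isn't reproduced in this answer; a variadic function along these lines (my reconstruction, the exact Godbolt source may differ) is enough to get code like the listings below:

#include <stdarg.h>

/* Sums n double varargs; only va_arg(v, double) is ever used. */
double add_them(int n, ...) {
    va_list v;
    va_start(v, n);
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += va_arg(v, double);
    va_end(v);
    return sum;
}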

# GCC4.5.3 -O3 -fPIC    to compile like glibc would
add_them:
        movzx   eax, al
        sub     rsp, 48                  # reserve stack space, needed either way
        lea     rdx, 0[0+rax*4]          # each movaps is 4 bytes long
        lea     rax, .L2[rip]            # code pointer to after the last movaps
        lea     rsi, -136[rsp]             # used later by va_arg.  test/jz version does the same, but after the movaps stores
        sub     rax, rdx
        lea     rdx, 39[rsp]               # used later by va_arg, test/jz version also does an LEA like this
        jmp     rax                      # AL=0 case jumps to L2
        movaps  XMMWORD PTR -15[rdx], xmm7     # using RDX as a base makes each movaps 4 bytes long, vs. 5 with RSP
        movaps  XMMWORD PTR -31[rdx], xmm6
        movaps  XMMWORD PTR -47[rdx], xmm5
        movaps  XMMWORD PTR -63[rdx], xmm4
        movaps  XMMWORD PTR -79[rdx], xmm3
        movaps  XMMWORD PTR -95[rdx], xmm2
        movaps  XMMWORD PTR -111[rdx], xmm1
        movaps  XMMWORD PTR -127[rdx], xmm0   # xmm0 last, will be ready for store-forwarding last
.L2:
        lea     rax, 56[rsp]       # first stack arg (if any), I think
     ## rest of the function

vs.

# GCC11.2 -O3 -fPIC
add_them:
        sub     rsp, 48
        test    al, al
        je      .L15                          # only one test&branch macro-fused uop
        movaps  XMMWORD PTR -88[rsp], xmm0    # xmm0 first
        movaps  XMMWORD PTR -72[rsp], xmm1
        movaps  XMMWORD PTR -56[rsp], xmm2
        movaps  XMMWORD PTR -40[rsp], xmm3
        movaps  XMMWORD PTR -24[rsp], xmm4
        movaps  XMMWORD PTR -8[rsp], xmm5
        movaps  XMMWORD PTR 8[rsp], xmm6
        movaps  XMMWORD PTR 24[rsp], xmm7
.L15:
        lea     rax, [rsp+56]       # first stack arg (if any), I think
        lea     rsi, -136[rsp]      # used by va_arg.  done after the movaps stores instead of before.
...
        lea     rdx, 56[rsp]        # used by va_arg.  With a different offset than older GCC, but used somewhat similarly.  Redundant with the LEA into RAX; silly compiler.

GCC presumably changed strategy because the computed jump takes more static code size (I-cache footprint), and a test/jz is easier to predict than an indirect jump. Even more importantly, it's fewer uops executed in the common AL=0 (no-XMM) case. And not many more even in the AL=1 worst case (7 dead movaps stores, but no work spent computing a branch target).

