Home > Net >  Strange uses of movzx by Clang and GCC
Strange uses of movzx by Clang and GCC

Time:07-12

I know that movzx can be used for dependency breaking, but I stumbled on some movzx uses by both Clang and GCC that I really can't see what good they are for. Here's a simple example I tried on godbolt:

#include <stdint.h>

int add2bytes(uint8_t* a, uint8_t* b) {
    return uint8_t(*a   *b);
}

with gcc12 -O3:

add2bytes(unsigned char*, unsigned char*):
        movzx   eax, BYTE PTR [rsi]
        add     al, BYTE PTR [rdi]
        movzx   eax, al
        ret

If I understand correctly, the first movzx here breaks dependency on previous eax value, but what is the second movzx doing? I don't think there's any dependency it can break, and it shouldn't affect the result either.

with clang14 -O3, it's even more weird:

add2bytes(unsigned char*, unsigned char*):                       # @add2bytes(unsigned char*, unsigned char*)
        mov     al, byte ptr [rsi]
        add     al, byte ptr [rdi]
        movzx   eax, al
        ret

It uses mov where movzx seems more reasonable, and then zero extends al to eax, but wouldn't it be much better to do movzx at the start?

I have 2 more examples here: https://godbolt.org/z/aPaE8v5x7 GCC generates both sensible and strange movzx, and Clang's use of mov and movzx just makes no sense to me. I also tried adding -march=skylake to make sure this isn't a feature for really old architectures, but the generated assembly looks more or less the same.

The closest post I have found is https://stackoverflow.com/a/64915219/14730360 where they showed similar movzx uses that seem useless and/or out of place.

Do the compilers really use movzx poorly here, or am I missing something?

CodePudding user response:

The clang bit actually seems reasonable. You get a partial register stall if you write to al and then read from eax. Using movzx breaks this partial register stall.

The initial mov to al has no dependencies on existing values of eax (due to register renaming), so the dependencies are just the unavoidable dependencies (wait for [rsi], wait for [rdi], wait for add to complete before zero-extending).

In other words, the top 24 bits must be zeroed and the lower 8 bits must be calculated, but the two actions can be done in either order. clang just chooses to add first, zero later.

[EDIT] As for GCC, it seems a particularly bad choice. If it had chosen bl as the temporary register, that last movzx would be zero-latency on Haswell/SkyLake, but move elimination does not work on al to eax.

CodePudding user response:

The final MOVZX is mandated by the fact that the function returns an int, extended from a byte. It must be there in the clang version, but with gcc one is extra.

  • Related