Why is the XOR swap optimized into a normal swap using the MOV instruction?


While testing things around Compiler Explorer, I tried out the following overflow-free function for calculating the average of two unsigned 32-bit integers:

uint32_t average_1(uint32_t a, uint32_t b)
{
    if(a < b){
        a ^= b;
        b ^= a;
        a ^= b;
    }
    return b + (a - b) / 2;
}
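
As a quick sanity check of "overflow-free" (an illustrative test, not part of the original code; it assumes average_1 above is in the same file): the conditional swap guarantees a >= b, so b + (a - b) / 2 cannot wrap, even where the naive (a + b) / 2 would.

#include <assert.h>
#include <stdint.h>

uint32_t average_1(uint32_t a, uint32_t b);   /* the function above */

int main(void)
{
    /* naive (a + b) / 2 would wrap to 0 here; the swap + difference form doesn't */
    assert(average_1(1u, UINT32_MAX) == 0x80000000u);
    assert(average_1(UINT32_MAX, UINT32_MAX) == UINT32_MAX);
    return 0;
}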

The function gets compiled into this (same output with -O1, -O2, or -O3 enabled):

average_1:
        cmp     edi, esi
        jnb     .L2
        mov     eax, edi
        mov     edi, esi
        mov     esi, eax
.L2:
        sub     edi, esi
        shr     edi
        lea     eax, [rdi+rsi]
        ret

where the swapping part of the code is optimized to use mov instructions with one additional register.

I've read through these questions: Why don't people use xor swaps?, Cost of swapping variables through mov, xor,

and gather that the XOR swap is harder to read and explain, and has no benefit (or even a negative effect) on execution speed, since it can require more memory reads.

But in this case, seeing that only the eax, esi, and edi registers are used and no memory is touched, I thought the compiled assembly could also have been generated as:

average_1:
        cmp     edi, esi
        jnb     .L2
        xor     edi, esi
        xor     esi, edi
        xor     edi, esi

...

With no memory reads and the same number of instructions, I don't see any downside, and it feels odd that it gets changed. Clearly there is something I haven't thought through, but what is it?

Edit: To be clear, my question here is not "why not use the XOR swap" but rather:
"when the XOR swap is used, and it doesn't seem to affect execution speed, why is it still optimized away?"

CodePudding user response:

Clang does the same thing. Probably for compiler-construction and CPU architecture reasons:

  • Disentangling that logic into just a swap may allow better optimization in some cases; definitely something it makes sense for a compiler to do early so it can follow values through the swap.

  • Xor-swap is total garbage for swapping registers, the only advantage being that it doesn't need a temporary. But xchg reg,reg already does that better.


I'm not surprised that GCC's optimizer recognizes the xor-swap pattern and disentangles it to follow the original values. In general, this makes constant-propagation and value-range optimizations possible through swaps, especially for cases where the swap wasn't conditional on the values of the vars being swapped. This pattern-recognition probably happens soon after transforming the program logic to GIMPLE (SSA) representation, so at that point it will forget that the original source ever used an xor swap, and not think about emitting asm that way.

Hopefully sometimes that lets it then optimize down to only a single mov, or two movs, depending on register allocation for the surrounding code (e.g. if one of the vars can move to a new register, instead of having to end up back in the original locations). And whether both variables are actually used later, or only one. Or if it can fully disentangle an unconditional swap, maybe no mov instructions.
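
As a tiny made-up example of that disentangling paying off: once the xor pattern is recognized as a swap, constant propagation can see straight through it, and with optimization enabled GCC and clang fold the whole function to a constant.

#include <stdint.h>

uint32_t swap_then_use(void)
{
    uint32_t a = 10, b = 20;
    a ^= b;  b ^= a;  a ^= b;      /* unconditional swap: now a = 20, b = 10 */
    return a * 100 + b;            /* folds to 2010 at -O1 and above */
}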

But worst case, three mov instructions needing a temporary register is still better, unless it's running out of registers. I'd guess GCC is not smart enough to use xchg reg,reg instead of spilling something else or saving/restoring another tmp reg, so there might be corner cases where this optimization actually hurts.

(Apparently GCC -Os does have a peephole optimization to use xchg reg,reg instead of 3x mov: PR 92549 was fixed for GCC10. It looks for that quite late, during RTL -> assembly. And yes, it works here: turning your xor-swap into an xchg: https://godbolt.org/z/zs969xh47)
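
At -Os the function then ends up shaped roughly like this (a sketch of what that Godbolt link shows, not copied verbatim):

average_1:
        cmp     edi, esi
        jnb     .L2
        xchg    edi, esi        # one 2-byte instruction instead of three mov's
.L2:
        sub     edi, esi
        shr     edi
        lea     eax, [rdi+rsi]
        ret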

xor-swap has worse latency and defeats mov-elimination

With no memory reads and the same number of instructions, I don't see any downside, and it feels odd that it gets changed. Clearly there is something I did not think through, but what is it?

Instruction count is only a rough proxy for one of three things that are relevant for perf analysis: front-end uops, latency, and back-end execution ports. (And machine-code size in bytes: x86 machine-code instructions are variable-length.)

It's the same size in machine-code bytes, and same number of front-end uops, but the critical-path latency is worse: 3 cycles from input a to output a for xor-swap, and 2 from input b to output a, for example.

MOV-swap has at worst 1-cycle and 2-cycle latencies from inputs to outputs, or less with mov-elimination. (Which can also avoid using back-end execution ports, especially relevant for CPUs like IvyBridge and Tiger Lake with a front-end wider than the number of integer ALU ports. And Ice Lake, except Intel disabled mov-elimination on it as an erratum workaround; not sure if it's re-enabled for Tiger Lake or not.)
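
To make those chains concrete, here are the two swaps annotated by hand (a in edi and b in esi as in the listings above, assuming 1-cycle ALU latency):

# xor-swap: a single serial dependency chain
        xor     edi, esi        # t = a^b            ready 1 cycle after a and b
        xor     esi, edi        # esi = b^t = a      ready after 2 cycles
        xor     edi, esi        # edi = t^esi = b    ready after 3 cycles

# mov-swap: two short, independent chains
        mov     eax, edi        # copy of a          (eliminated at rename on many CPUs)
        mov     edi, esi        # edi = b            1 cycle, or 0 with mov-elimination
        mov     esi, eax        # esi = a            2 cycles at worst from input a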

If you're going to branch, just duplicate the averaging code

GCC's real missed optimization here (even with -O3) is failing to tail-duplicate: doing so gives about the same static code size, just a couple of extra bytes since these are mostly 2-byte instructions. The big win is that the a<b path then becomes the same length as the other, instead of twice as long: first doing a swap and then running the same 3 uops for averaging.

update: GCC will do this for you with -ftracer (https://godbolt.org/z/es7a3bEPv), optimizing away the swap. (That's only enabled manually or as part of -fprofile-use, not at -O3, so it's probably not a good idea to use all the time without PGO, potentially bloating machine code in cold functions / code-paths.)

Doing it manually in the source (Godbolt):

uint32_t average_1_taildup(uint32_t a, uint32_t b)
{
    if(a < b){
        return a + (b - a) / 2;
    } else {
        return b + (a - b) / 2;
    }
}
# GCC11.2 -O3
average_1_taildup:
        cmp     edi, esi
        jnb     .L5
        sub     esi, edi
        shr     esi
        lea     eax, [rsi+rdi]
        ret
.L5:
        sub     edi, esi
        shr     edi
        lea     eax, [rdi+rsi]
        ret

Clang compiles both version 1 and 1_taildup into code using cmov (e.g. cmp / mov / cmovb / cmovb, or making a bit of a mess for the tail-duplication version).
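
For illustration, a hand-written sketch of that cmov shape (not clang's actual output; a in edi, b in esi as before):

average_cmov:
        cmp     edi, esi        # sets CF if a < b
        mov     eax, edi        # eax = a
        cmovb   eax, esi        # eax = max(a, b)
        cmovb   esi, edi        # esi = min(a, b)   (cmov doesn't clobber flags)
        sub     eax, esi        # max - min
        shr     eax             # (max - min) / 2
        add     eax, esi        # min + (max - min) / 2
        ret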

But if you're going to go branchless then your average_3 is superior:

uint32_t average_3(uint32_t a, uint32_t b)
{
    return (a & b) + ((a ^ b) >> 1);   /* a + b == 2*(a & b) + (a ^ b), so this halves without overflow */
}
# clang 13.0 -O3
average_3:
        mov     eax, esi
        and     eax, edi
        xor     esi, edi
        shr     esi
        add     eax, esi
        ret

Both GCC and clang's versions are only 5 instructions (plus ret), but clang arranges it so the critical-path latency is only 3 cycles (3 single-uop instructions) from either input to the output, even without mov-elimination. (GCC has one chain that's 4 instructions long including a mov.)

Efficient non-overflowing unsigned mean

See also Efficient overflow-immune arithmetic mean in C/C++ - widening to uint64_t can be even cheaper, especially when inlining, on a 64-bit machine. (As discussed in comments under the question, e.g. https://godbolt.org/z/sz53eEYh9 shows code from the existing answers at the time I commented.)
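
A sketch of the widening approach (function name mine): the 64-bit sum can't overflow, so a plain average works and typically compiles to just a few instructions on x86-64.

#include <stdint.h>

uint32_t average_widen(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a + b) / 2);   /* sum always fits in 64 bits */
}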

Another good option is this, but usually not as good as widening:

  return (a&b) + (a^b)/2;

If compilers recognized any of these idioms, they could use the asm add/rcr trick, though, which is even more efficient than widening to unsigned __int128 for averaging uint64_t.
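
For reference, the add/rcr trick in hand-written asm (a sketch, not something compilers currently emit): the 64-bit add leaves the carry-out of the 65-bit sum in CF, and rcr rotates that whole 65-bit value right by one.

# uint64_t average; System V x86-64: a in rdi, b in rsi, result in rax
        mov     rax, rdi
        add     rax, rsi        # 64-bit sum; bit 64 of the true sum goes to CF
        rcr     rax, 1          # rotate through carry: (CF:rax) >> 1
        ret                     # floor((a + b) / 2), no overflow possible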
