While testing things around Compiler Explorer, I tried out the following overflow-free function for calculating the average of two unsigned 32-bit integers:
uint32_t average_1(uint32_t a, uint32_t b)
{
    if (a < b) {
        a ^= b;
        b ^= a;
        a ^= b;
    }
    return b + (a - b) / 2;
}
which gets compiled into this (the same with -O1, -O2, or -O3 optimization enabled):
average_1:
cmp edi, esi
jnb .L2
mov eax, edi
mov edi, esi
mov esi, eax
.L2:
sub edi, esi
shr edi
lea eax, [rdi+rsi]
ret
in which the swapping part of the code is optimized to use mov instructions with one additional register.
I've read through these questions: Why don't people use xor swaps?, Cost of swapping variables through mov, xor, and understand that the XOR swap is harder to read, and has no benefit or even a negative effect on execution speed because it requires more memory reads.
But in this case, seeing that no memory is involved and only the eax, esi, and edi registers are used, I thought the compiled assembly could also have been generated as:
average_1:
cmp edi, esi
jnb .L2
xor edi, esi
xor esi, edi
xor edi, esi
...
With no memory reads and the same number of instructions, I don't see any negative impact, and it feels odd that it gets changed. Clearly there is something I did not think through, but what is it?
Edit: To be clear, my question here is not "why not use the XOR swap?" but rather "when the XOR swap is used, and it doesn't seem to affect execution speed, why is it still optimized out?"
CodePudding user response:
Clang does the same thing. Probably for compiler-construction and CPU architecture reasons:
Disentangling that logic into just a swap may allow better optimization in some cases; definitely something it makes sense for a compiler to do early so it can follow values through the swap.
Xor-swap is total garbage for swapping registers, the only advantage being that it doesn't need a temporary. But xchg reg,reg already does that better.
I'm not surprised that GCC's optimizer recognizes the xor-swap pattern and disentangles it to follow the original values. In general, this makes constant-propagation and value-range optimizations possible through swaps, especially for cases where the swap wasn't conditional on the values of the vars being swapped. This pattern-recognition probably happens soon after transforming the program logic to GIMPLE (SSA) representation, so at that point it will forget that the original source ever used an xor swap, and not think about emitting asm that way.
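As an illustration of what that disentangling buys, here's a toy case where seeing through the swap enables constant propagation (a sketch; the exact folding depends on the GCC version, so treat the expected output as an assumption):
#include <stdint.h>

// The xor-swap of two known constants.  Once the compiler rewrites the xor
// sequence into a plain swap in GIMPLE/SSA, constant propagation can follow
// the values through it, so one would expect this whole function to fold to
// "return 7" at -O2 (an assumption about typical behaviour, not a verified
// listing).
uint32_t swap_of_constants(void)
{
    uint32_t a = 7, b = 42;
    a ^= b;      // after these three statements a == 42, b == 7
    b ^= a;
    a ^= b;
    return b;    // the original value of a, i.e. 7
}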
Hopefully sometimes that lets it then optimize down to only a single mov, or two movs, depending on register allocation for the surrounding code (e.g. if one of the vars can move to a new register, instead of having to end up back in the original locations), and whether both variables are actually used later, or only one. Or if it can fully disentangle an unconditional swap, maybe no mov instructions.
But worst case, three mov instructions needing a temporary register is still better, unless it's running out of registers. I'd guess GCC is not smart enough to use xchg reg,reg instead of spilling something else or saving/restoring another tmp reg, so there might be corner cases where this optimization actually hurts.
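Here's a sketch of the fully-disentangled case mentioned above (a hypothetical example; the exact codegen depends on the surrounding code and register allocation):
#include <stdint.h>

// Unconditional xor-swap whose result is consumed right away: after the
// compiler disentangles it, it can simply read the inputs in swapped order,
// so one would expect no mov or xchg at all for the swap itself (again an
// assumption about the usual outcome, not a verified listing).
uint32_t use_after_swap(uint32_t a, uint32_t b)
{
    a ^= b;           // unconditional swap of a and b
    b ^= a;
    a ^= b;
    return 2 * a + b; // equivalent to 2 * original_b + original_a
}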
(Apparently GCC -Os does have a peephole optimization to use xchg reg,reg instead of 3x mov: PR 92549 was fixed for GCC10. It looks for that quite late, during RTL -> assembly. And yes, it works here: turning your xor-swap into an xchg: https://godbolt.org/z/zs969xh47)
xor-swap has worse latency and defeats mov-elimination
With no memory reads and the same number of instructions, I don't see any negative impact, and it feels odd that it gets changed. Clearly there is something I did not think through, but what is it?
Instruction count is only a rough proxy for one of three things that are relevant for perf analysis: front-end uops, latency, and back-end execution ports. (And machine-code size in bytes: x86 machine-code instructions are variable-length.)
It's the same size in machine-code bytes, and the same number of front-end uops, but the critical-path latency is worse: 3 cycles from input a to output a for xor-swap, and 2 cycles from input b to output a, for example.
MOV-swap has at worst 1-cycle and 2-cycle latencies from inputs to outputs, or less with mov-elimination. (Which can also avoid using back-end execution ports, especially relevant for CPUs like IvyBridge and Tiger Lake with a front-end wider than the number of integer ALU ports. And Ice Lake, except Intel disabled mov-elimination on it as an erratum workaround; not sure if it's re-enabled for Tiger Lake or not.)
Also related:
- Why is XCHG reg, reg a 3 micro-op instruction on modern Intel architectures? - and those 3 uops can't benefit from mov-elimination. But on modern AMD, xchg reg,reg is only 2 uops.
If you're going to branch, just duplicate the averaging code
GCC's real missed optimization here (even with -O3) is that tail-duplication results in about the same static code size, just a couple extra bytes since these are mostly 2-byte instructions. The big win is that the a<b path then becomes the same length as the other, instead of twice as long to first do a swap and then run the same 3 uops for averaging.
Update: GCC will do this for you with -ftracer (https://godbolt.org/z/es7a3bEPv), optimizing away the swap. (That's only enabled manually or as part of -fprofile-use, not at -O3, so it's probably not a good idea to use all the time without PGO, potentially bloating machine code in cold functions / code-paths.)
Doing it manually in the source (Godbolt):
uint32_t average_1_taildup(uint32_t a, uint32_t b)
{
    if (a < b) {
        return a + (b - a) / 2;
    } else {
        return b + (a - b) / 2;
    }
}
# GCC11.2 -O3
average_1_taildup:
cmp edi, esi
jnb .L5
sub esi, edi
shr esi
lea eax, [rsi+rdi]
ret
.L5:
sub edi, esi
shr edi
lea eax, [rdi+rsi]
ret
Clang compiles both version 1 and 1_taildup into code using cmov (e.g. cmp / mov / cmovb / cmovb, or making a bit of a mess of the tail-duplication version).
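Roughly what that cmov-style selection looks like when written out in C (a sketch of the idea, not clang's literal output; whether the ternaries actually become cmov is up to the compiler):
#include <stdint.h>

// Branchless min/max selection followed by a single averaging step.
uint32_t average_cmov(uint32_t a, uint32_t b)
{
    uint32_t lo = a < b ? a : b;   // likely a cmov
    uint32_t hi = a < b ? b : a;   // likely another cmov
    return lo + (hi - lo) / 2;     // no overflow: hi >= lo
}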
But if you're going to go branchless, then your average_3 is superior:
uint32_t average_3(uint32_t a, uint32_t b)
{
    return (a & b) + ((a ^ b) >> 1);
}
# clang 13.0 -O3
average_3:
mov eax, esi
and eax, edi
xor esi, edi
shr esi
add eax, esi
ret
Both GCC's and clang's versions are only 5 instructions (plus ret), but clang arranges it so the critical-path latency is only 3 cycles (3 single-uop instructions) from either input to the output, even without mov-elimination. (GCC has one chain that's 4 instructions long, including a mov.)
Efficient non-overflowing unsigned mean
See also Efficient overflow-immune arithmetic mean in C/C++ - widening to uint64_t can be even cheaper, especially when inlining, on a 64-bit machine. (As discussed in comments under the question, e.g. https://godbolt.org/z/sz53eEYh9 shows code from the existing answers at the time I commented.)
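The widening version, for reference (a minimal sketch, assuming a 64-bit target where the 64-bit add and shift are cheap):
#include <stdint.h>

// Promote to uint64_t so the sum can't overflow, then halve and narrow back.
uint32_t average_widen(uint32_t a, uint32_t b)
{
    return (uint32_t)(((uint64_t)a + b) / 2);
}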
Another good option is this, but usually not as good as widening:
return (a&b) + (a^b)/2;
If compilers recognized any of these idioms, they could use the asm add/rcr trick, though, which is even more efficient than widening to unsigned __int128 for averaging uint64_t.
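What that add/rcr trick looks like, sketched as GNU C inline asm for x86-64 (an illustration of the idea, not something compilers currently emit for you):
#include <stdint.h>

// add produces the 65-bit sum (low 64 bits in the register, bit 64 in CF);
// rcr-by-1 rotates the carry back in at the top, yielding (a + b) >> 1
// without any widening.  x86-64 GNU C only; purely illustrative.
uint64_t average_u64_rcr(uint64_t a, uint64_t b)
{
    uint64_t avg = a;
    __asm__ ("add %[b], %[avg]\n\t"
             "rcr $1, %[avg]"
             : [avg] "+r" (avg)
             : [b] "r" (b)
             : "cc");
    return avg;
}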