How "bitwise AND mask equals mask" can be optimized?
Example:
bool foo(uint64_t x)
{
    return (x & 0x7ff0000000000000) == 0x7ff0000000000000;
}
leads to (ARM 32-bit):
gcc 12.1 (linux) -O3:
f:
movs r3, #0
movt r3, 32752
bics r3, r3, r1
ite eq
moveq r0, #1
movne r0, #0
bx lr
armv7-a clang 11.0.1 -O3:
f:
mov r0, #267386880
orr r0, r0, #1879048192
bic r0, r0, r1
rsbs r1, r0, #0
adc r0, r0, r1
bx lr
Can the C code above be rewritten in such a way that faster ASM code is produced?
Perhaps there are relevant bit-twiddling hacks, combinations of them, or something similar?
CodePudding user response:
One option is
bool foo4(uint64_t x)
{
    // shift out the sign bit, move the 11 exponent bits to the bottom,
    // then add 1: only the all-ones value 0x7ff carries into bit 11
    return (((x << 1) >> 53) + 1) >> 11;
}
which compiles with gcc to
foo:
ubfx r0, r1, #20, #11
adds r0, r0, #1
ubfx r0, r0, #11, #1
bx lr
The saving here mostly comes from not having to convert a condition flag into a 0/1 result: the 0/1 bit is produced directly. If this function is inlined and the result is only used for a branch, this is not helpful and might actually result in slower code.
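To illustrate that last point (my own sketch, not part of the original answer; handle_special and handle_normal are hypothetical stand-ins): when the result only feeds a branch, the plain mask comparison lets the compiler branch on the flags directly, while the shift-based version first materialises a 0/1 value that may then need a separate test.

#include <stdint.h>

void handle_special(void);   /* hypothetical */
void handle_normal(void);    /* hypothetical */

/* Plain comparison: after inlining, the compiler can typically
   branch straight on the flags produced by the masked compare. */
void caller_cmp(uint64_t x)
{
    if ((x & 0x7ff0000000000000) == 0x7ff0000000000000)
        handle_special();
    else
        handle_normal();
}

/* Shift-based version: a 0/1 value is built first and then tested,
   which may cost an extra instruction in a branchy caller. */
void caller_shift(uint64_t x)
{
    if ((((x << 1) >> 53) + 1) >> 11)
        handle_special();
    else
        handle_normal();
}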
CodePudding user response:
On clang the code is already as good as it gets:
bool foo(uint64_t x)
{
    return (x & 0x7ff0000000000000) == 0x7ff0000000000000;
}
mov x8, #9218868437227405312
bics xzr, x8, x0
cset w0, eq
bool foo2(uint64_t x)
{
    // check if x*2 overflows (i.e. produces a carry)
    // by adding one to the LSB of the (shifted) exponent field
    return ((x * 2) + 0x0020000000000000) < (x * 2);
}
ubfx x8, x0, #52, #11
cmp x8, #2046
cset w0, hi
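A quick worked example of the carry trick (my own illustration, not from the answer): with the exponent field all ones, x*2 has bits 53..63 set, so it is at least 0xFFE0000000000000 and adding 1<<53 wraps past 2^64; with any exponent bit clear, no wrap occurs.

#include <assert.h>
#include <stdint.h>

int main(void)
{
    uint64_t inf = 0x7ff0000000000000u;  /* exponent field all ones */
    uint64_t big = 0x7fe0000000000000u;  /* largest exponent that is not all ones */

    /* inf*2 = 0xFFE0000000000000; adding 1<<53 wraps, so the sum compares lower. */
    assert(((inf * 2) + 0x0020000000000000u) < (inf * 2));

    /* big*2 = 0xFFC0000000000000; no wrap, so the sum stays larger. */
    assert(((big * 2) + 0x0020000000000000u) >= (big * 2));

    return 0;
}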
The first version in particular is quite clever: bics computes mask & ~x, and clearing the bits of 0x7ff0000000000000 produces zero only if all 11 of those bits are set in the source register.
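For what it's worth (my own addition, not from the answer), the same and-with-complement idea can be expressed directly in C, which may map more naturally onto a bic-style instruction:

#include <stdbool.h>
#include <stdint.h>

/* Equivalent to (x & mask) == mask: the mask bits missing from x
   are mask & ~x, and the test asks whether that set is empty. */
bool foo_bics(uint64_t x)
{
    return (0x7ff0000000000000 & ~x) == 0;
}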
The second version was something I hoped would generate code like
mov  x8, #0x0020000000000000
adds x8, x8, x0, lsl #1
cset x0, cs          // set the result from the carry flag
But this would only be on par with the bic method -- essentially just two instructions once the mov x8, constant can be hoisted and reused.
On Arm64, with its plentiful conditional-select operations, it would make no difference whether the result ends up in CF, ZF, or any other status flag.
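As a sanity check (my own addition, not part of either answer), a small harness can confirm that the variants above agree on a few representative bit patterns:

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static bool foo(uint64_t x)  { return (x & 0x7ff0000000000000) == 0x7ff0000000000000; }
static bool foo2(uint64_t x) { return ((x * 2) + 0x0020000000000000) < (x * 2); }
static bool foo4(uint64_t x) { return (((x << 1) >> 53) + 1) >> 11; }

int main(void)
{
    const uint64_t tests[] = {
        0x0000000000000000, /* +0.0 */
        0x3ff0000000000000, /* 1.0 */
        0x7fefffffffffffff, /* largest finite double */
        0x7ff0000000000000, /* +infinity */
        0x7ff8000000000000, /* quiet NaN */
        0xfff0000000000000, /* -infinity */
        0xffffffffffffffff, /* NaN with all payload bits set */
    };

    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++) {
        assert(foo(tests[i]) == foo2(tests[i]));
        assert(foo(tests[i]) == foo4(tests[i]));
    }
    puts("all variants agree");
    return 0;
}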