How "bitwise AND mask equals mask" can be optimized?

Example:

bool foo(uint64_t x)
{
      // 0x7ff0000000000000 is the exponent-field mask of an IEEE-754 double
      return (x & 0x7ff0000000000000) == 0x7ff0000000000000;
}

leads to (ARM 32-bit):

gcc 12.1 (linux) -O3:
f:
        movs    r3, #0
        movt    r3, 32752
        bics    r3, r3, r1
        ite     eq
        moveq   r0, #1
        movne   r0, #0
        bx      lr

armv7-a clang 11.0.1 -O3:
f:
        mov     r0, #267386880
        orr     r0, r0, #1879048192
        bic     r0, r0, r1
        rsbs    r1, r0, #0
        adc     r0, r0, r1
        bx      lr

Can the C code above be rewritten so that faster assembly is produced?

Perhaps there are relevant bit-twiddling hacks, or combinations of them, that apply here?
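
For reference, 0x7ff0000000000000 is the exponent-field mask of an IEEE-754 binary64 double, so the test is true exactly for Inf and NaN bit patterns. A quick sanity check (a sketch; the test values and harness are illustrative, not part of the original code):

#include <math.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

static bool foo(uint64_t x)
{
    return (x & 0x7ff0000000000000) == 0x7ff0000000000000;
}

int main(void)
{
    double vals[] = { 1.0, -2.5, INFINITY, NAN };
    for (size_t i = 0; i < sizeof vals / sizeof vals[0]; i++) {
        uint64_t bits;
        memcpy(&bits, &vals[i], sizeof bits);  // reinterpret the double's bit pattern
        printf("%g -> %d\n", vals[i], foo(bits));
    }
    return 0;
}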

CodePudding user response:

One option is

bool foo4(uint64_t x)
{
    // Shift the exponent field down to bits 0-10; adding 1 carries
    // into bit 11 only when all 11 bits are set (i.e. exponent 0x7ff).
    return (((x << 1) >> 53) + 1) >> 11;
}

which compiles with gcc to

foo:
        ubfx    r0, r1, #20, #11
        adds    r0, r0, #1
        ubfx    r0, r0, #11, #1
        bx      lr

The saving here mostly comes from not having to convert a flags result into 0/1, instead generating the 1 bit directly. If this function is inlined and the result is used for a branch, this is not helpful and might actually result in slower code.
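
For example, in a caller where the result only feeds a branch, the compiler can branch on the flags set by the masking operation directly, so the plain form tends to be preferable there (a sketch; clamp_inf_nan is a made-up example function):

#include <stdint.h>
#include <string.h>

/* Hypothetical caller: the comparison result drives a branch, so the
   compiler can branch on the flags from the AND/BIC directly instead
   of materializing a 0/1 value first. */
double clamp_inf_nan(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    if ((bits & 0x7ff0000000000000) == 0x7ff0000000000000)
        return 0.0;  /* d was Inf or NaN */
    return d;
}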

CodePudding user response:

On clang (targeting AArch64 here) the code is already about as good as it gets:

bool foo(uint64_t x)
{
      return (x & 0x7ff0000000000000) == 0x7ff0000000000000;
}
        mov     x8, #9218868437227405312
        bics    xzr, x8, x0
        cset    w0, eq

bool foo2(uint64_t x)
{
      // check if x*2 overflows (i.e. produces a carry)
      // by adding one at the LSB of the exponent field
      return ((x * 2) + 0x0020000000000000) < (x * 2);
}
        ubfx    x8, x0, #52, #11
        cmp     x8, #2046
        cset    w0, hi

The first version in particular is quite clever: bics computes mask & ~x, which is zero only if all 11 bits of 0x7ff0000000000000 are set in the source register.
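
That bics-style formulation can also be written directly in C, which is the pattern clang matches here (a sketch; foo_bic is just an illustrative name):

#include <stdbool.h>
#include <stdint.h>

/* mask & ~x is zero exactly when every bit of the mask is set in x,
   which maps onto a single flag-setting bics instruction. */
bool foo_bic(uint64_t x)
{
    return (0x7ff0000000000000 & ~x) == 0;
}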

With the second version I hoped to generate code like

    mov  x8, #0x0020000000000000
    adds x8, x8, x0, lsl #1
    cset x0, cs

for carry. But this would only be on par with the bic method, being essentially just two instructions when the mov of the constant into x8 could be reused.
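
The carry test can also be expressed explicitly with the GCC/Clang builtin __builtin_add_overflow, which lets the compiler consume the carry flag instead of comparing the sum (a sketch assuming GCC or Clang; foo_carry is an illustrative name):

#include <stdbool.h>
#include <stdint.h>

bool foo_carry(uint64_t x)
{
    uint64_t sum;
    /* Unsigned overflow of (x*2) + 2^53 occurs exactly when the 11
       exponent bits, now at the top of x*2, are all ones. */
    return __builtin_add_overflow(x << 1, (uint64_t)1 << 53, &sum);
}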

On Arm64, with its many predicated operations, it makes no difference whether the result lands in CF, ZF, or any other status flag.
