How do I count leading zeros on both mac M1 and x86-64?-CodePudding

Originally I tried lzcnt but that doesn't seem to work on a mac. I'm working with someone who is using the apple M1 CPU which is ARM64v8.4

In this arm document which list ARM 8 it appears clz supports using 0

CLZ Xd, Xm
Count Leading Zeros (64-bit): sets Xd to the number of binary zeros at the most significant end of Xm. The result will be in the range 0 to 64 inclusive.

The CPU we originally support is x86-64 which has _lzcnt_u64

Both instructions appear to return 64 if the value is 0. Specifically "0 to 64 inclusive" on ARM and the intel site suggest it too (and confirmed by my code)

However GCC says the below

Built-in Function: int __builtin_clzll (unsigned long long)

Similar to __builtin_clz, except the argument type is unsigned long long.

Can I safely use 0 or does this builtin use a different instruction? I tried on clang and the sanitizer stop the program and told me it is a problem which was surprising to me.

How should I get the leading zero count when I want to get 64 if I pass in 0 like these two instructions do

CodePudding user response：

If C 20 features are available, std::countl_zero solves the problem, and it's up to compiler devs to implement it in a way that compiles efficiently even for an input of zero.

That's great in theory, but unfortunately they're faced with the same problem you are of using actual builtin functions. GCC compiles libstdc 's implementation to waste instructions on system where a good instruction is available, like ARM64 always, or x86-64 with lzcnt.

__builtin_clzll will compile to 63-bsr(x) on x86 if you don't use a -march= option that includes BMI1 (or for some AMD, at least LZCNT without the rest of BMI1). BSR leaves its destination unmodified for an input of 0, not producing -1 or something. https://www.felixcloutier.com/x86/bsr (This was probably a big part of the motivation of defining the builtin that way, so it can just compile to a single BSR or BSF instruction without conditional branches or cmov on x86 long before lzcnt existed. Some use-cases don't need to use it with zero.)

What's the goal here, portable GNU C ? Or perfectly optimized asm for a couple of compile targets with specific options?

You could try x ? __builtin_clzll(x) : 64 and hope GCC optimizes that to x86-64 lzcnt when available. clang does do this optimization, but unfortunately GCC11 doesn't. (Godbolt)

unsigned clz(unsigned long long x) {
    return x ? __builtin_clzll(x) : 64;
}

BTW, libstdc 's std::countl_zero compiles the same way, presumably because it's written more or less like that. (Probably with some std::numeric_limits<T> stuff instead of a hard-coded 64).

# x86-64  clang 13 -O3 -march=haswell
clz(unsigned long long):
        lzcnt   rax, rdi
        ret

x86-64 GCC 11.2 -O3 -march=haswell
clz(unsigned long long):
        xor     edx, edx          # break output dependency for lzcnt on Intel pre-SKL
        mov     eax, 64
        lzcnt   rdx, rdi
        test    rdi, rdi
        cmovne  eax, edx          # actually do the ternary
        ret

Same story for ARM64: clang optimizes away the redundant selection operation to just clz, GCC doesn't:

# ARM64 gcc11.2 -O3 -march=cortex-a53
clz(unsigned long long):
        cmp     x0, 0
        clz     x0, x0
        mov     w1, 64
        csel    w0, w0, w1, ne
        ret

Explicitly use lzcnt when available:

#ifdef __BMI__
#include <immintrin.h>
unsigned clz_lzcnt(unsigned long long x) {
    return _lzcnt_u64(x);
}
#endif

# x86-64 GCC11 -O3 -march=haswell
clz_lzcnt(unsigned long long):
        xor     eax, eax
        lzcnt   rax, rdi
        ret

(So really you'd want to use this #ifdef inside another function, with a #else using the ternary as a fallback.)