Originally I tried lzcnt, but that doesn't seem to work on a Mac. I'm working with someone who is using the Apple M1 CPU, which is ARM64 (ARMv8.4).
In this ARM document, which covers ARMv8, it appears clz supports an input of 0:
CLZ Xd, Xm
Count Leading Zeros (64-bit): sets Xd to the number of binary zeros at the most significant end of Xm. The result will be in the range 0 to 64 inclusive.
The CPU we originally support is x86-64, which has _lzcnt_u64.
Both instructions appear to return 64 if the value is 0. Specifically, the ARM document says "0 to 64 inclusive", and the Intel site suggests the same (confirmed by my code).
However, GCC documents the builtin as follows:
Built-in Function: int __builtin_clzll (unsigned long long)
Similar to __builtin_clz, except the argument type is unsigned long long.
Can I safely pass 0, or does this builtin use a different instruction? I tried it on clang, and the sanitizer stopped the program and told me it is a problem, which surprised me.
How should I get the leading-zero count so that I get 64 when I pass in 0, like these two instructions do?
CodePudding user response:
If C++20 features are available, std::countl_zero solves the problem, and it's up to compiler devs to implement it in a way that compiles efficiently even for an input of zero.
That's great in theory, but unfortunately they're faced with the same problem you are of using the actual builtin functions. GCC compiles libstdc++'s implementation to wasted instructions on systems where a good instruction is available, like ARM64 always, or x86-64 with lzcnt.
__builtin_clzll will compile to 63-bsr(x) on x86 if you don't use a -march= option that includes BMI1 (or, for some AMD targets, at least LZCNT without the rest of BMI1). BSR leaves its destination unmodified for an input of 0, not producing -1 or anything useful: https://www.felixcloutier.com/x86/bsr (This was probably a big part of the motivation for defining the builtin that way, so it can compile to a single BSR or BSF instruction without conditional branches or cmov on x86, long before lzcnt existed. Some use-cases don't need it to work with zero.)
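To make the 63-bsr(x) correspondence concrete, here is a sketch (function name is my own) showing what the builtin guarantees for nonzero inputs; calling it with 0 is the undefined behavior that UBSan flags:

```cpp
#include <cassert>

// For nonzero x, __builtin_clzll(x) equals 63 minus the index of the highest
// set bit -- exactly what a single BSR instruction computes. An input of 0
// is undefined behavior for this builtin, hence the assert precondition.
unsigned clz_nonzero(unsigned long long x) {
    assert(x != 0);
    return __builtin_clzll(x);
}
```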
What's the goal here, portable GNU C++? Or perfectly optimized asm for a couple of compile targets with specific options?
You could try x ? __builtin_clzll(x) : 64 and hope GCC optimizes that to x86-64 lzcnt when available. clang does do this optimization, but unfortunately GCC 11 doesn't. (Godbolt)
unsigned clz(unsigned long long x) {
    return x ? __builtin_clzll(x) : 64;
}
BTW, libstdc++'s std::countl_zero compiles the same way, presumably because it's written more or less like that. (Probably with some std::numeric_limits<T> stuff instead of a hard-coded 64.)
# x86-64 clang 13 -O3 -march=haswell
clz(unsigned long long):
lzcnt rax, rdi
ret
# x86-64 GCC 11.2 -O3 -march=haswell
clz(unsigned long long):
xor edx, edx # break output dependency for lzcnt on Intel pre-SKL
mov eax, 64
lzcnt rdx, rdi
test rdi, rdi
cmovne eax, edx # actually do the ternary
ret
Same story for ARM64: clang optimizes away the redundant select operation down to just clz; GCC doesn't:
# ARM64 gcc11.2 -O3 -march=cortex-a53
clz(unsigned long long):
cmp x0, 0
clz x0, x0
mov w1, 64
csel w0, w0, w1, ne
ret
Explicitly use lzcnt when available:
#ifdef __BMI__
#include <immintrin.h>
unsigned clz_lzcnt(unsigned long long x) {
return _lzcnt_u64(x);
}
#endif
# x86-64 GCC11 -O3 -march=haswell
clz_lzcnt(unsigned long long):
xor eax, eax
lzcnt rax, rdi
ret
(So really you'd want to use this #ifdef inside another function, with a #else using the ternary as a fallback.)
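Putting those pieces together, a sketch of such a wrapper (the name clz64_portable is my own) might look like this:

```cpp
// Use the hardware-defined _lzcnt_u64 when the target guarantees LZCNT
// (GCC/clang define __BMI__ for -march targets with BMI1); otherwise fall
// back to the ternary, which avoids the builtin's UB for 0 and which the
// compiler may or may not optimize down to a single instruction.
#ifdef __BMI__
#include <immintrin.h>
static inline unsigned clz64_portable(unsigned long long x) {
    return _lzcnt_u64(x);  // defined by hardware: returns 64 for x == 0
}
#else
static inline unsigned clz64_portable(unsigned long long x) {
    return x ? __builtin_clzll(x) : 64;  // builtin alone is UB for x == 0
}
#endif
```

Either branch returns 64 for a zero input, matching the behavior of both the x86-64 lzcnt and ARM64 clz instructions.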