bool equal(uint8_t * b1, uint8_t * b2){
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 64);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 64);
    for(int ii = 0; ii < 64; ii++){
        if(b1[ii] != b2[ii]){
            return false;
        }
    }
    return true;
}
Looking at the assembly (with flags -O3 -mavx512f -msse4.2), clang and gcc don't seem to apply any optimization beyond loop unrolling. I would think it's pretty easy to just load both memory regions into 512-bit registers and compare them. Even more surprisingly, both compilers also fail to optimize this (ideally only a single 64-bit compare is required, and no special large registers at all):
bool equal(uint8_t * b1, uint8_t * b2){
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 8);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 8);
    for(int ii = 0; ii < 8; ii++){
        if(b1[ii] != b2[ii]){
            return false;
        }
    }
    return true;
}
So, are both compilers just dumb, or is there a reason this code isn't vectorized? And is there any way to force vectorization short of writing inline assembly?
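For reference, this is roughly what I'd hope the 8-byte case boils down to: a sketch (the function name is mine) where memcpy expresses the two 64-bit loads, so a single scalar compare does the rest. Both compilers typically fold the fixed-size memcpy calls into plain register loads.
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Sketch of the expected code: two 64-bit loads and one compare.
bool equal8(const uint8_t *b1, const uint8_t *b2) {
    uint64_t a, b;
    memcpy(&a, b1, sizeof a);  // single 64-bit load
    memcpy(&b, b2, sizeof b);
    return a == b;             // single 64-bit compare
}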
CodePudding user response:
"I assume" the following is most efficient
memcmp(b1, b2, any_size_you_need);
especially for huge arrays!
(For small arrays, there is not a lot to gain anyway!)
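For the fixed 64-byte case from the question, a call with a compile-time-constant size lets the compiler expand the comparison inline and pick the widest compare it knows how to emit. A minimal sketch (function name is mine):
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// memcmp with a constant size is usually expanded inline by gcc/clang.
bool equal64(const uint8_t *b1, const uint8_t *b2) {
    return memcmp(b1, b2, 64) == 0;
}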
Otherwise, you would need to vectorize manually using Intel intrinsics (also mentioned by chtz). I started to look at that until I thought about memcmp.
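If you do go the intrinsics route, a minimal sketch for the 64-byte, 64-byte-aligned case could look like this. Note it assumes AVX-512BW (compile with -mavx512bw), since byte-granularity mask compares are not part of AVX-512F alone; the function name is mine.
#include <immintrin.h>
#include <stdbool.h>
#include <stdint.h>

bool equal64_avx512(const uint8_t *b1, const uint8_t *b2) {
    __m512i a = _mm512_load_si512((const void *)b1);  // aligned 64-byte load
    __m512i b = _mm512_load_si512((const void *)b2);
    // One bit per byte: bit i is set if byte i is equal in both blocks.
    __mmask64 m = _mm512_cmpeq_epi8_mask(a, b);
    return m == ~(__mmask64)0;  // all 64 bytes equal
}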
CodePudding user response:
The compiler must assume that once the function returns (or the loop is exited early), it must not have read any bytes beyond the current index -- for example, in case one of the pointers happens to point somewhere near the boundary of invalid memory.
You can give the compiler a chance to optimize this by using the (non-short-circuiting) bitwise & / | operators to combine the results of the individual comparisons:
bool equal(uint8_t * b1, uint8_t * b2){
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 64);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 64);
    bool ret = true;
    for(int ii = 0; ii < 64; ii++){
        ret &= (b1[ii] == b2[ii]);
    }
    return ret;
}
Godbolt demo: https://godbolt.org/z/3ePh7q5rM
gcc still fails to vectorize this, though. So you may need to write manually vectorized versions if this function is performance critical.