bool equal(uint8_t * b1, uint8_t * b2){
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 64);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 64);
    for(int ii = 0; ii < 64; ii++){
        if(b1[ii] != b2[ii]){
            return false;
        }
    }
    return true;
}
Looking at the assembly (with flags -O3 -mavx512f -msse4.2), clang and gcc don't seem to apply any optimization beyond loop unrolling. I would think it's pretty easy to just load both memory regions into 512-bit registers and compare them. Even more surprisingly, both compilers also fail to optimize this (ideally only a single 64-bit compare is required, and no special large registers at all):
bool equal(uint8_t * b1, uint8_t * b2){
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 8);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 8);
    for(int ii = 0; ii < 8; ii++){
        if(b1[ii] != b2[ii]){
            return false;
        }
    }
    return true;
}
So, are both compilers just dumb, or is there a reason this code isn't vectorized? And is there any way to force vectorization short of writing inline assembly?
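For reference, this is roughly what I'd hope the 8-byte case boils down to: a sketch (the function name is mine) where memcpy expresses the two 64-bit loads, so a single scalar compare does the rest. Both compilers typically fold the fixed-size memcpy calls into plain register loads.
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// Sketch of the expected code: two 64-bit loads and one compare.
bool equal8(const uint8_t *b1, const uint8_t *b2) {
    uint64_t a, b;
    memcpy(&a, b1, sizeof a);  // single 64-bit load
    memcpy(&b, b2, sizeof b);
    return a == b;             // single 64-bit compare
}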
CodePudding user response:
"I assume" the following is most efficient
memcmp(b1, b2, any_size_you_need);
especially for huge arrays!
(For small arrays, there is not a lot to gain anyway!)
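For the fixed 64-byte case from the question, a call with a compile-time-constant size lets the compiler expand the comparison inline and pick the widest compare it knows how to emit. A minimal sketch (function name is mine):
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

// memcmp with a constant size is usually expanded inline by gcc/clang.
bool equal64(const uint8_t *b1, const uint8_t *b2) {
    return memcmp(b1, b2, 64) == 0;
}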
Otherwise, you would need to vectorize manually using Intel intrinsics (also mentioned by chtz). I started to look at that until I thought about memcmp.
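If you do go the intrinsics route, a minimal sketch for the 64-byte, 64-byte-aligned case could look like this. Note it assumes AVX-512BW (compile with -mavx512bw), since byte-granularity mask compares are not part of AVX-512F alone; the function name is mine.
#include <immintrin.h>
#include <stdbool.h>
#include <stdint.h>

bool equal64_avx512(const uint8_t *b1, const uint8_t *b2) {
    __m512i a = _mm512_load_si512((const void *)b1);  // aligned 64-byte load
    __m512i b = _mm512_load_si512((const void *)b2);
    // One bit per byte: bit i is set if byte i is equal in both blocks.
    __mmask64 m = _mm512_cmpeq_epi8_mask(a, b);
    return m == ~(__mmask64)0;  // all 64 bytes equal
}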
CodePudding user response:
The compiler must assume that once the function returns (or the loop is exited early), it must not have read any bytes beyond the current index -- for example, in case one of the pointers happens to point somewhere near the boundary of invalid memory.
You can give the compiler a chance to optimize this by using the (non-short-circuiting) bitwise & / | operators to combine the results of the individual comparisons:
bool equal(uint8_t * b1, uint8_t * b2){
    b1 = (uint8_t*)__builtin_assume_aligned(b1, 64);
    b2 = (uint8_t*)__builtin_assume_aligned(b2, 64);
    bool ret = true;
    for(int ii = 0; ii < 64; ii++){
        ret &= (b1[ii] == b2[ii]);
    }
    return ret;
}
Godbolt demo: https://godbolt.org/z/3ePh7q5rM
gcc still fails to vectorize this, though. So you may need to write manually vectorized versions if this function is performance critical.