Home > Blockchain >  How do I get clang/gcc to vectorize looped array comparisons?
How do I get clang/gcc to vectorize looped array comparisons?

Time:08-18

bool equal(uint8_t * b1,uint8_t * b2){
    b1=(uint8_t*)__builtin_assume_aligned(b1,64);
    b2=(uint8_t*)__builtin_assume_aligned(b2,64);
    for(int ii = 0; ii < 64;   ii){
        if(b1[ii]!=b2[ii]){
            return false;
        }
    }
    return true;
}

Looking at the assembly, clang and gcc don't seem to have any optimizations to add(with flags -O3 -mavx512f -msse4.2) apart from loop unrolling. I would think its pretty easy to just put both memory regions in 512 bit registers and compare them. Even more surprisingly both compilers also fail to optimize this(ideally only a single 64 bit compare required and no special large registers required):

bool equal(uint8_t * b1,uint8_t * b2){
    b1=(uint8_t*)__builtin_assume_aligned(b1,8);
    b2=(uint8_t*)__builtin_assume_aligned(b2,8);
    for(int ii = 0; ii < 8;   ii){
        if(b1[ii]!=b2[ii]){
            return false;
        }
    }
    return true;
}

So are both compilers just dumb or is there a reason that this code isn't vectorized? And is there any way to force vectorization short of writing inline assembly?

CodePudding user response:

"I assume" the following is most efficient

memcmp(b1, b2, any_size_you_need);

especially for huge arrays!

(For small arrays, there is not a lot to gain anyway!)

Otherwise, you would need to vectorize manually using Intel Intrinsics. (Also mentioned by chtz.) I started to look at that until i thought about memcmp.

CodePudding user response:

The compiler must assume that once the function returns (or it is exiting the loop), it can't read any bytes behind the current index -- for example if one of the pointers happens to point to somewhere near the boundary of invalid memory. You can give the compiler a chance to optimize this by using (non-lazy) bitwise &/| operators to combine the results of the individual comparisons:

bool equal(uint8_t * b1,uint8_t * b2){
    b1=(uint8_t*)__builtin_assume_aligned(b1,64);
    b2=(uint8_t*)__builtin_assume_aligned(b2,64);
    bool ret = true;
    for(int ii = 0; ii < 64;   ii){
        ret &= (b1[ii]==b2[ii]);
    }
    return ret;
}

Godbolt demo: https://godbolt.org/z/3ePh7q5rM

gcc still fails to vectorize this, though. So you may need to write manual vectorized versions of this, if this function is performance critical.

  • Related