Is there a better way to detect any bits that are set in a 16-byte array of flags?

    ALIGNTO(16) uint8_t noise_frame_flags[16] = { 0 };

    // Code that detects noise and sets noise_frame_flags omitted

    __m128i xmm0            = _mm_load_si128((__m128i*)noise_frame_flags);
    bool    isNoiseToCancel = _mm_extract_epi64(xmm0, 0) | _mm_extract_epi64(xmm0, 1);

    if (isNoiseToCancel)
        cancelNoises(audiobuffer, nAudioChannels, audio_samples, noise_frame_flags);

This is a code snippet from my AV Capture tool on Linux. Here, noise_frame_flags is an array of flags for 16-channel audio. For each channel, the corresponding byte can be either 0 or 1, with 1 indicating that the channel has some noise to cancel. For example, noise_frame_flags[0] == 1 means the first channel's noise flag is set (by the omitted code).

If even a single "flag" is set, I need to call cancelNoises, and this code seems to work fine for that. As you can see, I used _mm_load_si128 to load the whole array of flags (which is correctly aligned) and then two _mm_extract_epi64 calls to extract the "flags". My question: is there a better way to do this (using popcount, maybe)?

Note: ALIGNTO(16) is a macro that expands to the correct GCC equivalent but is nicer looking.
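
For reference, it is presumably defined as something like this (an assumption; only the note above describes it):

    #define ALIGNTO(n) __attribute__((aligned(n)))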

CodePudding user response:

Yes, you eventually want a 64-bit OR to look for any non-zero bits in either half, but it's not efficient to get those uint64_t values from a 128-bit load and then extract.

In asm you just want a mov load and then a memory-source or (or add), which will set ZF just like your current code does. Two loads from the same cache line are very cheap; current CPUs have at least 2/clock load throughput. The extra ALU work to extract from a single 128-bit load is just not worth it, even if you did shuffle / por to set up for a single movq.

In C++, use memcpy to do strict-aliasing-safe loads into uint64_t tmp vars, then if(a | b). This is still SIMD, just SWAR (SIMD Within A Register).

add is even better than or: it can macro-fuse with most jcc instructions on Intel Sandybridge-family (but not on AMD), while or can't fuse with branch instructions on any CPU. Since your values are 0 or 1, two non-zero values can't wrap around and add up to zero; that wraparound risk is why you'd normally use or in the general case.

(Some addressing modes may defeat micro- or macro-fusion, or maybe it always works since there's no immediate involved. It really is possible for add rax, [mem] / jnz to go through the front-end and ROB as a single uop, and execute in the back-end as only 2 uops (load + add-and-branch). Assuming it fuses about the same as cmp does on my Skylake, except that add does write its destination, so Haswell and later can maybe keep it micro-fused even for indexed addressing modes.)

    uint64_t a, b;
    memcpy(&a, noise_frame_flags + 0, sizeof(a));   // strict-aliasing-safe loads
    memcpy(&b, noise_frame_flags + 8, sizeof(b));   // which optimize to MOV
    bool  isNoiseToCancel = a + b;   // equivalent to a | b for bool inputs

This should compile to 3 asm instructions which will decode to 2 uops total.
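
As a self-contained sketch (a minimal version; the function name any_flag_set is mine, not from the question):

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    static bool any_flag_set(const uint8_t flags[16])
    {
        uint64_t a, b;
        memcpy(&a, flags + 0, sizeof(a));   // low 8 flag bytes
        memcpy(&b, flags + 8, sizeof(b));   // high 8 flag bytes
        return a + b;   // add is fuse-friendly and safe here: every byte is 0 or 1
    }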


Alternatives:

movdqa 16-byte load / SSE4.1 ptest xmm0, xmm0 / jnz - 4 uops on Intel CPUs, 3 on AMD.
Intel runs ptest as 2 uops, and it can't macro-fuse with jcc.
AMD CPUs run ptest as 1 uop, but it still can't fuse.
If you had an all-ones constant in a register, ptest xmm0, [mem] would work to save a uop on Intel, but that's still 3 total.
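
In intrinsics that would look something like the sketch below; _mm_testz_si128 is the SSE4.1 intrinsic that compiles to ptest:

    #include <smmintrin.h>   // SSE4.1

    __m128i v = _mm_load_si128((const __m128i*)noise_frame_flags);
    bool    isNoiseToCancel = !_mm_testz_si128(v, v);   // ptest xmm,xmm sets ZF iff all bits are zero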

PTEST is only good for checking a 32-byte array with AVX1 or AVX2. (Then it's about break-even with AVX2 vmovdqa / vpslld ymm0, 7 / vpmovmskb eax,ymm0 / test+jnz). See TrentP's answer for portable GNU C native vector source code that should compile to vptest on x86 with AVX available, and maybe to something clunky on other ISAs like ARM, depending on how good their horizontal OR support is.
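
For example, a hypothetical 32-byte flag array (call it flags32, 32-byte aligned; not part of the question) could be checked with a single 256-bit vptest:

    #include <immintrin.h>   // AVX

    __m256i v   = _mm256_load_si256((const __m256i*)flags32);
    bool    any = !_mm256_testz_si256(v, v);   // vptest ymm,ymm / jnz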


popcnt wouldn't be useful unless you want to break down the work depending on how many bits are set.
In that case, yes, sure, you can turn the bool array into a bitmap that you can scan easily, probably more efficient than _mm_sad_epu8 against a zeroed register to sum into two 8-byte halves.

   __m128i vflags = _mm_load_si128((__m128i*)noise_frame_flags);
   vflags = _mm_slli_epi32(vflags, 7);   // move each byte's 0/1 up to bit 7 for movemask (no byte shift exists; dword shift is safe since only bit 0 can be set)
   unsigned flagmask = _mm_movemask_epi8(vflags);
   if (flagmask) {
       unsigned flagcount = __builtin_popcount(flagmask);  // popcnt with -march=nehalem or higher
       unsigned first_setflag = __builtin_ctz(flagmask);   // tzcnt if available, else BSF
       flagmask &= flagmask - 1;   // clear lowest set bit.  blsr if compiled with -march=haswell or bdver2 or newer.
       ...
   }
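
For comparison, the _mm_sad_epu8 byte-counting approach mentioned above would look roughly like this sketch (psadbw against a zeroed register sums the bytes of each 8-byte half):

   __m128i vflags = _mm_load_si128((__m128i*)noise_frame_flags);
   __m128i sums   = _mm_sad_epu8(vflags, _mm_setzero_si128());
   unsigned flagcount = _mm_extract_epi64(sums, 0) + _mm_extract_epi64(sums, 1);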

(Don't actually use -march=bdver2 or -march=nehalem, unless you want to set an ISA baseline but also use -mtune=haswell or something more modern. There are individual options like -mpopcnt and -mbmi, but it's generally good to enable all the ISA extensions some CPU supports, so you don't miss out on useful stuff the compiler can use.)
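
For example, a build line following that advice might be (file name hypothetical):

    gcc -O2 -march=nehalem -mtune=haswell -c noise_cancel.c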

CodePudding user response:

Here's what I came up with for doing this:

    #define VLEN 8
    typedef int vNb __attribute__((vector_size(VLEN*sizeof(int))));

    // Constants for 128 or 256 bit registers
    #if VLEN == 8
    #define V(a,b,c,d,e,f,g,h) a,b,c,d,e,f,g,h
    #else
    #define V(a,b,c,d,e,f,g,h) a,b,c,d
    #endif
    #define SWAP128 V(4,5,6,7, 0,1,2,3)
    #define SWAP64 V(2,3, 0,1,  6,7, 4,5)
    #define SWAP32 V(1, 0,  3, 2,  5, 4,  7, 6)

    static bool any(vNb x) {
        if (VLEN >= 8)
            x |= __builtin_shufflevector(x,x, SWAP128);
        x |= __builtin_shufflevector(x,x, SWAP64);
        x |= __builtin_shufflevector(x,x, SWAP32);
        return x[0];
    }

With VLEN = 8, this will use 256-bit registers if the arch supports it. Change it to 4 to use 128-bit registers.

This should compile to a single vptest instruction.
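
For the 16-byte case in the question, usage might look like this sketch (with VLEN changed to 4 so the vector is 16 bytes; the memcpy load is my addition):

    vNb v;
    memcpy(&v, noise_frame_flags, sizeof(v));   // 16 flag bytes -> one 128-bit vector
    if (any(v))
        cancelNoises(audiobuffer, nAudioChannels, audio_samples, noise_frame_flags);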
