ALIGNTO(16) uint8_t noise_frame_flags[16] = { 0 };
// Code that detects noise and sets noise_frame_flags omitted
__m128i xmm0 = _mm_load_si128((__m128i*)noise_frame_flags);
bool isNoiseToCancel = _mm_extract_epi64(xmm0, 0) | _mm_extract_epi64(xmm0, 1);
if (isNoiseToCancel)
cancelNoises(audiobuffer, nAudioChannels, audio_samples, noise_frame_flags);
This is a code snippet from my AV Capture tool on Linux. noise_frame_flags here is an array of flags for 16-channel audio. For each channel, the corresponding byte can be either 0 or 1, where 1 indicates that the channel has some noise to cancel. For example, if noise_frame_flags[0] == 1, the first channel's noise flag is set (by the omitted code).
Even if only a single "flag" is set, I need to call cancelNoises, and this code seems to work fine in that regard. As you can see, I used _mm_load_si128 to load the whole array of flags (which is correctly aligned) and then two _mm_extract_epi64 to extract the "flags". My question: is there a better way to do this (using pop count maybe)?
Note: ALIGNTO(16) is a macro that expands to the correct GCC equivalent but looks nicer.
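For reference, a plausible definition (an assumption; the post doesn't show the macro):

// Hypothetical; C++11 alignas(n) would also work.
#define ALIGNTO(n) __attribute__((aligned(n)))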
CodePudding user response:
Yes, you eventually want a 64-bit OR to look for any non-zero bits in either half, but it's not efficient to get those uint64_t values from a 128-bit load and then extract.
In asm you just want a mov load and a memory-source or or add, which will set ZF just like you're doing now. Two loads from the same cache line are very cheap; current CPUs have at least 2/clock load throughput. The extra ALU work to extract from a single 128-bit load is just not worth it, even if you did shuffle / por to set up for a single movq.
In C++, use memcpy to do strict-aliasing-safe loads into uint64_t tmp vars, then if (a | b). This is still SIMD, just SWAR (SIMD Within A Register).
add is even better than or: it can macro-fuse with most jcc instructions on Intel Sandybridge-family (but not AMD). or can't fuse with branch instructions on any CPUs. Since your values are 0 or 1, we can't have a case of two non-zero values adding to produce a zero, which is why you'd normally use or for the general case.
(Some addressing modes may defeat micro or macro-fusion. Or maybe it always works since there's no immediate involved. It really is possible for add rax, [mem] / jnz to go through the front-end and ROB as a single uop, and execute in the back-end as only 2 (load, and add/sub-and-branch). Assuming it's about the same as cmp on my Skylake, except it does write the destination, so Haswell and later can maybe keep it micro-fused even for indexed addressing modes.)
uint64_t a, b;
memcpy(&a, noise_frame_flags + 0, sizeof(a));  // strict-aliasing-safe loads
memcpy(&b, noise_frame_flags + 8, sizeof(b));  // which optimize to MOV
bool isNoiseToCancel = a + b;   // equivalent to a | b for these 0/1 inputs
This should compile to 3 asm instructions which will decode to 2 uops total.
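Wrapped up as a self-contained function (a sketch with a hypothetical name, not verbatim from this answer):

#include <cstdint>
#include <cstring>

// Hypothetical wrapper for the snippet above. Used in an if(), GCC/clang -O2
// should emit roughly mov reg,[mem] / add reg,[mem+8] / jnz, with add+jnz
// macro-fusing on Intel Sandybridge-family.
static bool any_noise_flag(const uint8_t noise_frame_flags[16])
{
    uint64_t a, b;
    memcpy(&a, noise_frame_flags + 0, sizeof(a));
    memcpy(&b, noise_frame_flags + 8, sizeof(b));
    return a + b;   // bytes are 0 or 1, so no carries: + works like |
}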
Alternatives:
movdqa 16-byte load / SSE4.1 ptest xmm0, xmm0 / jnz - 4 uops on Intel CPUs, 3 on AMD. Intel runs ptest as 2 uops, and it can't macro-fuse with jcc. AMD CPUs run ptest as 1 uop, but it still can't fuse. If you had an all-ones constant in a register, ptest xmm0, [mem] would work to save a uop on Intel, but that's still 3 total.
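If you wanted to see that version in intrinsics anyway, a minimal sketch (hypothetical function name; assumes SSE4.1 and a 16-byte-aligned array):

#include <immintrin.h>
#include <cstdint>

// ptest xmm0,xmm0 / jnz: correct, but more uops than the memcpy version above.
static bool any_noise_flag_ptest(const uint8_t *noise_frame_flags)
{
    __m128i v = _mm_load_si128((const __m128i*)noise_frame_flags);
    return !_mm_testz_si128(v, v);  // _mm_testz_si128 returns 1 when v & v == 0
}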
PTEST is only good for checking a 32-byte array with AVX1 or AVX2. (Then it's about break-even with AVX2 vmovdqa / vpslld ymm0, 7 / vpmovmskb eax, ymm0 / test+jnz.) See TrentP's answer for portable GNU C native vector source code that should compile to vptest on x86 with AVX available, and maybe to something clunky on other ISAs like ARM, depending on how good their horizontal-OR support is.
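For a hypothetical 32-byte flag array, that AVX2 movemask sequence might look like this (a sketch; the array size and function name are assumptions):

#include <immintrin.h>
#include <cstdint>

// vmovdqa / vpslld ymm,7 / vpmovmskb / test+jnz as described above.
static bool any_noise_flag_avx2(const uint8_t *flags32)  // 32 flags, 32-byte aligned
{
    __m256i v = _mm256_load_si256((const __m256i*)flags32);
    v = _mm256_slli_epi32(v, 7);             // move each byte's 0/1 to its sign bit
    return _mm256_movemask_epi8(v) != 0;
}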
popcnt wouldn't be useful unless you want to break down the work depending on how many bits are set. In that case, yes, sure, you can turn the bool array into a bitmap that you can scan easily, probably more efficiently than using _mm_sad_epu8 against a zeroed register to sum into two 8-byte halves.
__m128i vflags = _mm_load_si128((__m128i*)noise_frame_flags);
vflags = _mm_slli_epi32(vflags, 7);   // move each byte's 0/1 up to its sign bit for movemask
unsigned flagmask = _mm_movemask_epi8(vflags);
if (flagmask) {
    unsigned flagcount = __builtin_popcount(flagmask);  // popcnt with -march=nehalem or higher
    unsigned first_setflag = __builtin_ctz(flagmask);   // tzcnt if available, else BSF
    flagmask &= flagmask - 1;  // clear lowest set bit. blsr if compiled with -march=haswell or bdver2 or newer.
    ...
}
(Don't actually use -march=bdver2 or -march=nehalem, unless you want to set an ISA baseline but also use -mtune=haswell or something more modern. There are individual options like -mpopcnt and -mbmi, but it's generally good to enable all the ISA extensions that some CPU supports, so you don't miss out on useful stuff the compiler can use.)
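The loop body is elided above; one plausible way to consume the bitmap, iterating over the flagged channels (illustrative only, not from the original answer):

// Hypothetical continuation of the "..." above: visit each set flag.
while (flagmask) {
    unsigned channel = __builtin_ctz(flagmask);  // index of the lowest flagged channel
    // ... cancel noise on this channel ...
    flagmask &= flagmask - 1;                    // clear that bit (blsr with -mbmi)
}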
CodePudding user response:
Here's what I came up with for doing this:
#define VLEN 8
typedef int vNb __attribute__((vector_size(VLEN*sizeof(int))));
// Shuffle-index constants for 128 or 256 bit registers
#if VLEN == 8
#define V(a,b,c,d,e,f,g,h) a,b,c,d,e,f,g,h
#else
#define V(a,b,c,d,e,f,g,h) a,b,c,d
#endif
#define SWAP128 V(4,5,6,7, 0,1,2,3)
#define SWAP64 V(2,3, 0,1, 6,7, 4,5)
#define SWAP32 V(1,0, 3,2, 5,4, 7,6)
// Horizontal OR: returns true if any element of x is non-zero
static bool any(vNb x) {
    if (VLEN >= 8)
        x |= __builtin_shufflevector(x, x, SWAP128); // fold the 256-bit halves together
    x |= __builtin_shufflevector(x, x, SWAP64);
    x |= __builtin_shufflevector(x, x, SWAP32);
    return x[0];
}
With VLEN = 8, this will use 256-bit registers if the arch supports it. Change it to 4 to use 128-bit registers. This should compile to a single vptest instruction.
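To use it on the asker's 16-byte noise_frame_flags, VLEN would be 4 so that vNb is 16 bytes; hypothetical glue (not part of the original answer):

#include <cstring>
#include <cstdint>

// Load the 16 flag bytes into a vNb (VLEN == 4) and test them; any non-zero
// byte makes its int lane non-zero, which any() then detects.
static bool isNoiseToCancel_vec(const uint8_t noise_frame_flags[16])
{
    vNb v;
    memcpy(&v, noise_frame_flags, sizeof(v));  // strict-aliasing-safe load
    return any(v);
}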