I have several functions that compare floating-point math vectors and fill an array of booleans (with the result of each comparison). Currently I am comparing them element by element, but I would like to use SIMD operations to optimize this.
The issue, however, is that Intel intrinsics such as _mm_cmpeq_ps return a mask where every element is 32 bits wide. I am a little lost on how to convert the comparison mask to an array of booleans (guaranteed to be 8-bit).
I could shuffle every element of the SIMD vector and then extract the low elements, but I don't think that would provide an efficiency boost over manual element-by-element comparison.
Is there a way to cast the vector compare mask to a boolean array?
CodePudding user response:
A bitmap is a more efficient way to store it, if you can have the rest of your program use that. (E.g. via "Fastest way to unpack 32 bits to a 32 byte SIMD vector" or "is there an inverse instruction to the movemask instruction in intel avx2?" if you want to use it with other vectors.)
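For example, a minimal sketch of the bitmap route, using _mm_cmpeq_ps as a stand-in for whatever comparison you actually need:

#include <immintrin.h>
#include <stdint.h>

// Pack 4 float compare results into the low 4 bits of a byte.
// _mm_movemask_ps grabs bit 31 of each element, which is set
// in the all-ones elements a true compare produces.
static inline uint8_t cmp4_to_bitmap(__m128 a, __m128 b) {
    __m128 mask = _mm_cmpeq_ps(a, b);       // 0 / -1 per element
    return (uint8_t)_mm_movemask_ps(mask);  // one bit per element
}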
Or, if you can cache-block it and use at most a couple of KiB of mask vectors, you could just store the compare results directly for reuse without packing them down (in an array of alignas(16) int32_t masks[], in case you want to access it from scalar code). But only do that if you can keep a small footprint in L1d. Or, much better, use the mask on the fly for another vector operation so you're not storing/reloading mask data.
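A sketch of that on-the-fly idea, with the equality compare and the blend as my own example (assuming SSE4.1 for _mm_blendv_ps; an AND/ANDNOT/OR combination works on plain SSE2):

#include <immintrin.h>

// Use the compare result directly as a mask: take elements of 'val'
// where a == b, and elements of 'other' where it isn't.
static inline __m128 select_where_equal(__m128 a, __m128 b,
                                         __m128 val, __m128 other) {
    __m128 mask = _mm_cmpeq_ps(a, b);        // all-ones where true
    return _mm_blendv_ps(other, val, mask);  // picks 'val' where mask is set
}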
packssdw / packsswb dword compare results down to bytes
You're correct: if you don't want your elements packed down to single bits, don't use _mm_movemask_ps or _mm_movemask_epi8. Instead, use vector pack instructions. cmpps produces elements of all-zero / all-one bits, i.e. integer 0 (false) or -1 (true).
Signed integer pack instructions preserve 0 / -1 values, since signed saturation leaves both unchanged. Unsigned packs use unsigned saturation, which would clamp -1 (0xFFFFFFFF) to 0, and the unsigned dword->word instruction (packusdw) requires SSE4.1 instead of SSE2 anyway.
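As a quick illustration, packing constant 0 / -1 dwords keeps them as 0 / -1 words:

#include <immintrin.h>

// packssdw with signed saturation: -1 stays -1 (0xFFFF) and 0 stays 0,
// so compare results survive the dword->word step unchanged.
static inline __m128i pack_demo(void) {
    __m128i trues  = _mm_set1_epi32(-1);
    __m128i falses = _mm_setzero_si128();
    return _mm_packs_epi32(trues, falses);  // words: -1,-1,-1,-1, 0,0,0,0
}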
To keep the compiler happy, you need _mm_castps_si128 to reinterpret a __m128 as a __m128i.
This works most efficiently when packing 4 vectors of 4 float compare results each down to one vector of 16 separate bytes. (Or, with AVX2, 4 vectors of 8 floats down to 1 vector of 32 bytes, requiring an extra permute at the end because _mm256_packs_epi32 and so on operate in-lane, as two separate 16-byte pack operations.)
#include <immintrin.h>
#include <stdint.h>

void cmp(int8_t *result, const float *a){
    __m128 cmp0 = _mm_cmp_ps(...);   // placeholder: whatever compares you need
    __m128 cmp1 = _mm_cmp_ps(...);
    __m128 cmp2 = _mm_cmp_ps(...);
    __m128 cmp3 = _mm_cmp_ps(...);
    // dword 0 / -1 -> word 0 / -1 (packssdw), then word -> byte (packsswb)
    __m128i lo_words  = _mm_packs_epi32(_mm_castps_si128(cmp0), _mm_castps_si128(cmp1));
    __m128i hi_words  = _mm_packs_epi32(_mm_castps_si128(cmp2), _mm_castps_si128(cmp3));
    __m128i cmp_bytes = _mm_packs_epi16(lo_words, hi_words);
    // if necessary create 0 / 1 bools. If not, just store cmp_bytes
    cmp_bytes = _mm_abs_epi8(cmp_bytes);                        // SSSE3
    //cmp_bytes = _mm_and_si128(cmp_bytes, _mm_set1_epi8(1));   // SSE2
    _mm_storeu_si128((__m128i*)result, cmp_bytes);
}
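As a concrete usage sketch with the ... placeholders filled in (the a == b comparison, the second input array, and the SSE2-only 0 / 1 conversion are my own choices, not from the question):

#include <immintrin.h>
#include <stdint.h>

// Compare 16 floats from a[] against b[] for equality and store
// 16 bytes of 0 / 1 into result[].
void cmp16_eq(uint8_t *result, const float *a, const float *b) {
    __m128 cmp0 = _mm_cmpeq_ps(_mm_loadu_ps(a +  0), _mm_loadu_ps(b +  0));
    __m128 cmp1 = _mm_cmpeq_ps(_mm_loadu_ps(a +  4), _mm_loadu_ps(b +  4));
    __m128 cmp2 = _mm_cmpeq_ps(_mm_loadu_ps(a +  8), _mm_loadu_ps(b +  8));
    __m128 cmp3 = _mm_cmpeq_ps(_mm_loadu_ps(a + 12), _mm_loadu_ps(b + 12));

    __m128i lo_words  = _mm_packs_epi32(_mm_castps_si128(cmp0), _mm_castps_si128(cmp1));
    __m128i hi_words  = _mm_packs_epi32(_mm_castps_si128(cmp2), _mm_castps_si128(cmp3));
    __m128i cmp_bytes = _mm_packs_epi16(lo_words, hi_words);    // 0 / -1 bytes

    cmp_bytes = _mm_and_si128(cmp_bytes, _mm_set1_epi8(1));     // 0 / 1
    _mm_storeu_si128((__m128i*)result, cmp_bytes);
}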
Getting a 0/1 instead of 0/-1 takes a _mm_and_si128 or SSSE3 _mm_abs_epi8, if you truly need bool instead of a zero/non-zero uint8_t[] or int8_t[].
If you only have a single vector of 4 floats, you'd want SSSE3 _mm_shuffle_epi8 (pshufb) to grab 1 byte from each dword, for a 4-byte store with _mm_storeu_si32. (Beware that it was broken in early GCC 11 versions, and wasn't supported at all before then; but it is now supported as a strict-aliasing-safe unaligned store. Otherwise use _mm_cvtsi128_si32 to get an int, and memcpy that to an array of bool.)
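A sketch of that single-vector case, assuming SSSE3 and using the memcpy fallback for the 4-byte store (the function name, the equality compare, and the 0 / 1 conversion are mine):

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Pack one vector of 4 compare results down to 4 bytes of 0 / 1.
void cmp4_to_bytes(uint8_t *result, __m128 a, __m128 b) {
    __m128i mask = _mm_castps_si128(_mm_cmpeq_ps(a, b));    // 0 / -1 dwords
    // Grab the low byte of each dword: source byte indices 0, 4, 8, 12.
    __m128i ctrl = _mm_set_epi8(-1,-1,-1,-1, -1,-1,-1,-1,
                                -1,-1,-1,-1, 12, 8, 4, 0);
    __m128i bytes = _mm_shuffle_epi8(mask, ctrl);           // pshufb (SSSE3)
    bytes = _mm_and_si128(bytes, _mm_set1_epi8(1));         // 0 / 1
    int packed = _mm_cvtsi128_si32(bytes);                  // low 32 bits
    memcpy(result, &packed, 4);                             // strict-aliasing safe
}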
Compilers can auto-vectorize your scalar code for you, but they do a rather poor job: https://godbolt.org/z/3o58W919Y. Clang packs each vector down separately, for 4-byte stores. GCC uses unsigned pack instructions like packusdw as the first step, doing unnecessary pand instructions on each input to each pack instruction, including between the two steps. And without SSE4.1 it does even worse.
(SSE2 packssdw preserves -1 or 0 just fine, without even saturating. It seems GCC isn't keeping track of the limited value range of compare results, so it doesn't realize it can let the pack instructions just work, and doesn't realize that if it's going to AND early, it could AND down to 0 / 1 in the first place.)