Intel intrinsics: vector comparison result to array of bool conversion-CodePudding

I have several functions used to compare floating-point math vectors that fill an array of booleans (with result of each comparison). Currently, i am comparing them element-by-element, however i would like to use SIMD operations to optimize it.

The issue is, however, that intel intrinsics such as _mm_cmpeq_ps return a mask where every element is 32-bit. I am a little lost on how to convert the comparison mask to an array of booleans (guaranteed to be 8-bit).

I could shuffle every element of the SIMD vector, then extract the low elements, but i dont think that would provide an efficiency boost over manual element-by-element comparison.

Is there a way to cast the vector compare mask to a boolean array?

CodePudding user response：

A bitmap is a more efficient way to store it, if you can have the rest of your program use that. (e.g. via Fastest way to unpack 32 bits to a 32 byte SIMD vector or is there an inverse instruction to the movemask instruction in intel avx2? if you want to use it with other vectors).

Or if you can cache-block it and use at most a couple KiB of mask vectors, you could just store the compare results directly for reuse without packing them down. (In an array of alignas(16) int32_t masks[], in case you want to access from scalar code). But only if you can do it with a small footprint in L1d. Or much better, use it on the fly as a mask for another vector operation so you're not storing/reloading mask data.

`packssdw`/`packsswb` dword compare results down to bytes

You're correct, if you don't want your elements packed down to single bits, don't use _mm_movemask_ps or epi8. Instead, use vector pack instructions. cmpps produces elements of all-zero / all-one bits, i.e. integer 0 (false) or -1 (true).

Signed integer pack instructions preserve 0 / -1 values. Unsigned packs would also saturate 0xFFFFFFFF to 0xFFFF, but the dword->word instruction requires SSE4.1 instead of SSE2.

To keep the compiler happy, you need _mm_castps_si128 to reinterpret a __m128 as a __m128i.

This works most efficiently packing 4 vectors of 4 float compare results each down to one vector of 16 separate bytes. (Or with AVX, 4 vecs of 8 floats down to 1 vec of 32 bytes, requiring an extra permute at the end because _mm256_packs_epi32 and so on operate in-lane, two separate 16-byte pack operations.)

void cmp(int8_t *result, const float *a){
  __m128 cmp0 = _mm_cmp_ps(...);
  __m128 cmp1 = _mm_cmp_ps(...);
  __m128 cmp2 = _mm_cmp_ps(...);
  __m128 cmp3 = _mm_cmp_ps(...);

  __m128i lo_words = _mm_packs_epi32(_mm_castps_si128(cmp0), _mm_castps_si128(cmp1));
  __m128i hi_words  = _mm_packs_epi32(_mm_castps_si128(cmp2), _mm_castps_si128(cmp3));

  __m128i cmp_bytes = _mm_packs_epi8(lo_words, hi_words);

 // if necessary create 0 / 1 bools.  If not, just store cmp_bytes
  cmp_bytes = _mm_abs_epi8(cmp_bytes);                        // SSSE3
  //cmp_bytes = _mm_and_si128(cmp_bytes, _mm_set1_epi8(1));   // SSE2

  _mm_storeu_si128((__m128i*)result, cmp_bytes); 
}

Getting a 0/1 instead of 0/-1 takes a _mm_and_si128 or SSSE3 _mm_abs_epi8, if you truly need bool instead of a zero/non-zero uint8_t[] or int8_t[].

If you only have a single vector of float, you'd want SSSE3 _mm_shuffle_epi8 (pshufb) to grab 1 byte from each dword, for _mm_storeu_si32 (beware it was broken in early GCC11 versions, and wasn't even supported before then. But now it is supported as a strict-aliasing-safe unaligned store. Otherwise use _mm_cvtsi128_si32 to int, and memcpy that to an array of bool.)

Compilers can auto-vectorize your scalar code for you, but they do a rather poor job: https://godbolt.org/z/3o58W919Y - clang packs each vector down separately, for 4-byte stores. GCC uses unsigned pack instructions like packusdw as the first step, doing unnecessary pand instructions on each input to each pack instruction, including between the two steps. And without SSE4.1 it does even worse.

(SSE2 packssdw preserves -1 or 0 just fine, without even saturating. Seems GCC isn't keeping track of the limited value-range of compare results, so doesn't realize it can let the pack instructions just work. And doesn't realize that if it's going to AND early, it could AND down to 0 / 1 in the first place.)

packssdw/packsswb dword compare results down to bytes

`packssdw`/`packsswb` dword compare results down to bytes