I'm optimizing a hot path in my codebase and i have turned to vectorization. Keep in mind, I'm still quite new to all of this SIMD stuff. Here is the problem I'm trying to solve, implemented using non-SIMD
inline int count_unique(int c1, int c2, int c3, int c4)
{
return 4 - (c2 == c1)
- ((c3 == c1) || (c3 == c2))
- ((c4 == c1) || (c4 == c2) || (c4 == c3));
}
the assembly output after compiling with -O3
:
count_unique:
xor eax, eax
cmp esi, edi
mov r8d, edx
setne al
add eax, 3
cmp edi, edx
sete dl
cmp esi, r8d
sete r9b
or edx, r9d
movzx edx, dl
sub eax, edx
cmp edi, ecx
sete dl
cmp r8d, ecx
sete dil
or edx, edi
cmp esi, ecx
sete cl
or edx, ecx
movzx edx, dl
sub eax, edx
ret
How would something like this be done when storing c1,c2,c3,c4 as a 16byte integer vector?
CodePudding user response:
For your simplified problem (test all 4 lanes for equality), I would do it slightly differently, here’s how. This way it only takes 3 instructions for the complete test.
// True when the input vector has the same value in all 32-bit lanes
inline bool isSameValue( __m128i v )
{
// Rotate vector by 4 bytes
__m128i v2 = _mm_shuffle_epi32( v, _MM_SHUFFLE( 0, 3, 2, 1 ) );
// The XOR outputs zero for equal bits, 1 for different bits
__m128i xx = _mm_xor_si128( v, v2 );
// Use PTEST instruction from SSE 4.1 set to test the complete vector for all zeros
return (bool)_mm_testz_si128( xx, xx );
}
CodePudding user response:
Ok, I have "simplified" the problem, because the only case when i was using the unique count, was if it was 1, but that is the same as checking if all elements are the same, which can be done by comparing the input with itself, but shifted over by one element (4 bytes) using the _mm_alignr_epi8
function.
inline int is_same_val(__m128i v1) {
__m128i v2 = _mm_alignr_epi8(v1, v1, 4);
__m128i vcmp = _mm_cmpeq_epi32(v1, v2);
return ((uint16_t)_mm_movemask_epi8(vcmp) == 0xffff);
}