SSE interleave/merge/combine 2 vectors using a mask, per-element conditional move?


Essentially I am trying to implement a ternary-like (per-element select) operation on two SSE (__m128) vectors. The mask is another __m128 vector obtained from _mm_cmplt_ps.

What I want to achieve is to select the element of vector a when the corresponding element of the mask is 0xFFFFFFFF, and the element of b when the mask's element is 0.

Example of the desired operation (in semi-pseudocode):

const __m128i mask = {0xffffffff, 0, 0xffffffff, 0};  // e.g. a compare result
const __m128 a = {1.0, 1.1, 1.2, 1.3};
const __m128 b = {2.0, 2.1, 2.2, 2.3};
const __m128 c = interleave(a, b, mask); // c contains {1.0, 2.1, 1.2, 2.3}

I am having trouble implementing this operation in SIMD (SSE) intrinsics. My original idea was to mix a and b using moves and then shuffle the elements using the mask; however, _mm_shuffle_ps takes an 8-bit immediate made of four 2-bit indices, not an __m128 mask.

Another idea was to use something akin to a conditional move, but there does not seem to be a conditional move in SSE (or at least I did not manage to find one in Intel's intrinsics guide).

How is this normally done in SSE?

CodePudding user response:

That's called a "blend".
Intel's intrinsics guide groups blend instructions under the "swizzle" category, along with shuffles.

You're looking for SSE4.1 blendvps (intrinsic _mm_blendv_ps). The other element sizes are _mm_blendv_pd and _mm_blendv_epi8. These use the high bit of the corresponding mask element as the control, so you can even use a float directly (without a compare) if its sign bit encodes the condition you want.

__m128 mask = _mm_cmplt_ps(x, y);      // each element is an all-0 / all-1 bit pattern
__m128 c = _mm_blendv_ps(b, a, mask);  // copy the element from the 2nd operand (a) where the mask is set

Note that I reversed a, b to b, a because SSE blends take the element from the 2nd operand in positions where the mask is set, like a conditional move which copies when the condition is true. If you name your constants / variables accordingly, you can write blend(a, b, mask) instead of having them backwards, or give them meaningful names like ones and twos.
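If the reversed argument order bothers you, a thin wrapper restores the usual (mask ? a : b) reading. A minimal sketch (the name select_ps is mine, not a standard intrinsic):

#include <smmintrin.h>  // SSE4.1 for _mm_blendv_ps

// select_ps(mask, a, b): per-element (mask ? a : b), assuming each mask element
// is all-ones / all-zero (or at least has its sign bit set appropriately).
static inline __m128 select_ps(__m128 mask, __m128 a, __m128 b)
{
    return _mm_blendv_ps(b, a, mask);  // blendv takes from the 2nd operand where the mask is set
}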


In other cases, where your control operand is a compile-time constant, there's also _mm_blend_ps / _mm_blend_pd / _mm_blend_epi16 (an 8-bit immediate operand can only control 8 separate elements, hence 16-bit granularity for the integer version).
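For example, the pattern from the question (keep a in lanes 0 and 2, take b in lanes 1 and 3) is fixed, so the immediate form works. A sketch, with a helper name of my choosing:

#include <smmintrin.h>  // SSE4.1 for _mm_blend_ps

// Set bits of the immediate select the 2nd operand: 0b1010 takes b in lanes 1 and 3.
static inline __m128 interleave_fixed(__m128 a, __m128 b)
{
    return _mm_blend_ps(a, b, 0xA);  // {a0, b1, a2, b3}
}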

Performance

blendps xmm, xmm, imm8 is a single-uop instruction for any vector ALU port on Intel CPUs, as cheap as andps. (https://uops.info/). pblendw is also single-uop, but only runs on port 5 on Intel, competing with shuffles. AVX2 vpblendd blends with dword granularity, an integer version of vblendps, and with the same very good efficiency. (It's an integer-SIMD instruction; unlike shuffles, blends have extra bypass latency on Intel CPUs if you mix integer and FP SIMD.)

But variable blendvps is 2 uops on Intel before Skylake (and only for port 5). And the AVX version (vblendvps) is unfortunately still 2 uops on Intel (3 on Alder Lake-P, 4 on Alder Lake-E). Although the uops can at least run on any of 3 vector ALU ports.

The vblendvps version is funky in asm because it has 4 operands, not overwriting any of its input registers. (The non-AVX version overwrites one input and uses XMM0 implicitly as the mask input.) Intel uops apparently can't handle 4 separate registers, only 3 for stuff like FMA, adc, and cmov. (AVX-512 vpternlogd can do a bitwise blend as a single uop, though.)
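For reference, a sketch of such a vpternlogd bitwise blend through the AVX-512VL intrinsic _mm_ternarylogic_epi32. The immediate 0xCA is my reading of the truth-table encoding for (first operand ? second : third), so verify it against the intrinsics guide before relying on it:

#include <immintrin.h>  // AVX-512F + AVX-512VL for _mm_ternarylogic_epi32

// Bitwise select in one instruction: result = (mask & ones) | (~mask & twos).
static inline __m128 select_ternlog_ps(__m128 mask, __m128 ones, __m128 twos)
{
    __m128i r = _mm_ternarylogic_epi32(_mm_castps_si128(mask),
                                       _mm_castps_si128(ones),
                                       _mm_castps_si128(twos),
                                       0xCA);  // imm8 truth table: a ? b : c
    return _mm_castsi128_ps(r);
}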

AMD has fully efficient handling of vblendvps, single uop (except for YMM on Zen1) with 2/clock throughput.


Without SSE4.1, you can emulate with ANDN/AND/OR

(x&~mask) | (y&mask) is equivalent to _mm_blendv_ps(x,y,mask), except it's pure bitwise so all the bits of each mask element should match the top bit. (e.g. a compare result, or broadcast the top bit with _mm_srai_epi32(mask, 31).)

Compilers know this trick and will use it when auto-vectorizing scalar code if you compile without any arch options like -march=haswell or whatever. (SSE4.1 was new in 2nd-gen Core 2, so it's increasingly widespread but not universal.)
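As a sketch, that fallback wrapped in a helper (SSE2 only; the name blendv_ps_sse2 is mine). It assumes every bit of each mask element matches the sign bit, as a compare result does:

#include <emmintrin.h>  // SSE2

// Per-element (mask ? y : x), matching _mm_blendv_ps(x, y, mask) for all-ones/all-zero masks.
static inline __m128 blendv_ps_sse2(__m128 x, __m128 y, __m128 mask)
{
    return _mm_or_ps(_mm_andnot_ps(mask, x),  // x & ~mask
                     _mm_and_ps(mask, y));    // y &  mask
}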

For constant / loop-invariant a^b without SSE4.1

a ^ ((a ^ b) & mask) saves one operation if you can reuse a ^ b (suggested in the comments by Aki). Otherwise it's worse: longer critical-path latency and no instruction-level parallelism.

Without AVX non-destructive 3-operand instructions, this way would need a movaps xmm,xmm register-copy to save b, but it can choose to destroy the mask instead of a. The AND/ANDN/OR way would normally destroy its 2nd operand, the one you use with y&mask, and destroy the mask with ANDN (~mask & x).

With AVX, vblendvps is guaranteed available. Although if you're targeting Intel (especially Haswell) and don't care about AMD, you might still choose the XOR/AND way if a^b can be pre-computed.
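A sketch of that pre-computed variant (SSE2 only; the helper names are mine): hoist a ^ b out of the loop, leaving one AND and one XOR per blend inside it:

#include <emmintrin.h>  // SSE2

// Hoist outside the loop when a and b are loop-invariant: XOR of their bit patterns.
static inline __m128 xor_ab_ps(__m128 a, __m128 b)
{
    return _mm_xor_ps(a, b);
}

// Per-element (mask ? b : a), computed as a ^ ((a ^ b) & mask) with the hoisted a^b.
static inline __m128 blend_via_xor(__m128 a, __m128 xor_ab, __m128 mask)
{
    return _mm_xor_ps(a, _mm_and_ps(xor_ab, mask));
}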

CodePudding user response:

As suggested by @Peter Cordes in the comments to the question, the blendvps instruction (the _mm_blendv_* intrinsics) is used to perform the interleave/conditional-move operation.

It should be noted that the _mm_blendv_* family selects the left-hand element where the mask contains 0 rather than 0xFFFFFFFF, so a and b should be passed in reverse order.

The implementation would then look like this:

#include <immintrin.h>  // covers xmmintrin.h / emmintrin.h / smmintrin.h

const __m128i mask = _mm_setr_epi32(-1, 0, -1, 0);  // e.g. a compare result
const __m128 m_ps = _mm_castsi128_ps(mask);
const __m128 a = _mm_setr_ps(1.0f, 1.1f, 1.2f, 1.3f);
const __m128 b = _mm_setr_ps(2.0f, 2.1f, 2.2f, 2.3f);

#ifdef __SSE4_1__ // _mm_blendv_ps requires SSE4.1
const __m128 c = _mm_blendv_ps(b, a, m_ps);
#else             // SSE2 fallback: (a & mask) | (b & ~mask)
const __m128 c = _mm_or_ps(_mm_and_ps(m_ps, a), _mm_andnot_ps(m_ps, b));
#endif
// c contains {1.0, 2.1, 1.2, 2.3}