I've been learning/experimenting with SIMD in C# and came across this problem:
Given two 256-bit vectors each containing 4 x uint64, rearrange them such that a = <0, 2, 4, 6>, b = <1, 3, 5, 7> becomes c = <0, 1, 2, 3>, d = <4, 5, 6, 7>.
My current solution uses unpack low/high plus two permutes; I'm sure there must be a better way of doing this, using only two permutes or a better unpack low/high. Is there a better way of doing this?
Vector256<ulong> a = Vector256.Create((ulong)0, 2, 4, 6);
Vector256<ulong> b = Vector256.Create((ulong)1, 3, 5, 7);
Vector256<ulong> low = Avx2.UnpackLow(a, b);   // 0,1,4,5
Vector256<ulong> high = Avx2.UnpackHigh(a, b); // 2,3,6,7
var c = Avx2.Permute2x128(low, high, 0b_00_10_00_00); // 0,1,2,3
var d = Avx2.Permute2x128(low, high, 0b_00_11_00_01); // 4,5,6,7
// Translated to C - I haven't tried running it.
// given __m256i a, b, low, high, c, d
low = _mm256_unpacklo_epi64(a, b);              // 0,1,4,5
high = _mm256_unpackhi_epi64(a, b);             // 2,3,6,7
c = _mm256_permute2x128_si256(low, high, 0x20); // 0,1,2,3
d = _mm256_permute2x128_si256(low, high, 0x31); // 4,5,6,7
CodePudding user response:
Here’s a slightly better way:
Vector256<ulong> a = Vector256.Create( (ulong)0, 2, 4, 6 );
Vector256<ulong> b = Vector256.Create( (ulong)1, 3, 5, 7 );
Vector256<ulong> low = Avx2.UnpackLow( a, b );           // 0,1,4,5
Vector256<ulong> high = Avx2.UnpackHigh( a, b );         // 2,3,6,7
var d = Avx2.Permute2x128( low, high, 0b_00_11_00_01 );  // 4,5,6,7
var c = Avx2.InsertVector128( low, high.GetLower(), 1 ); // 0,1,2,3
Same speed as your code on Intel CPUs, but slightly faster on AMD: on CPUs like Zen 2 or Zen 3 the vinserti128 instruction has only 1 cycle of latency, while vperm2i128 has 3 cycles of latency.
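If you want to sanity-check either version, Vector256<T>.ToString() prints the elements in order, so a quick check (a minimal sketch, assuming the a/b/c/d locals from above) is just:
Console.WriteLine(c); // <0, 1, 2, 3>
Console.WriteLine(d); // <4, 5, 6, 7>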
CodePudding user response:
If you locally have more pressure on your shuffle port than on the other ports, you can trade one shuffle for two blends, like so (apologies, this is only C/C++, but I assume you can translate it to C# if needed; a rough C# translation is sketched at the end of this answer):
__m256i t0 = _mm256_unpacklo_epi64(a, b);               // 0,1,4,5
__m256i t1 = _mm256_unpackhi_epi64(a, b);               // 2,3,6,7
__m256i swap = _mm256_permute2x128_si256(t0, t1, 0x21); // 4,5,2,3
c = _mm256_blend_epi32(t0, swap, 0xf0);                 // 0,1,2,3
d = _mm256_blend_epi32(swap, t1, 0xf0);                 // 4,5,6,7
Note that clang actually "optimizes" this back to a variant with two vperm2f128 (this may depend on context, though): https://godbolt.org/z/499abhWYb (check the interleave2 method).
If you want to store the result to memory (maybe after doing some other operations on it), then in some contexts a set of vextracti128 stores (the memory-destination form of the instruction) could also be an option, using more store operations but saving shuffle/blend operations.
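Here is a rough C# translation of the blend variant above, as an untested sketch. Avx2.Blend maps to vpblendd and works on 32-bit elements, so the vectors are reinterpreted as uint for the blend and back to ulong afterwards:
Vector256<ulong> t0 = Avx2.UnpackLow(a, b);              // 0,1,4,5
Vector256<ulong> t1 = Avx2.UnpackHigh(a, b);             // 2,3,6,7
Vector256<ulong> swap = Avx2.Permute2x128(t0, t1, 0x21); // 4,5,2,3
var c = Avx2.Blend(t0.AsUInt32(), swap.AsUInt32(), 0xF0).AsUInt64(); // 0,1,2,3
var d = Avx2.Blend(swap.AsUInt32(), t1.AsUInt32(), 0xF0).AsUInt64(); // 4,5,6,7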
CodePudding user response:
That looks pretty reasonable without AVX-512 for 2x vpermt2q.
AVX2 doesn't have any 2-input lane-crossing shuffles with granularity narrower than the 128-bit vperm2i128 (aka _mm256_permute2x128_si256).
And shuffling each input with vpermq to set up for blends would probably be 4x shuffles + 2x vpblendd, so that's not better.
Maybe there's a more clever trick I'm missing / forgetting about, but I don't expect you can do better.
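For what it's worth, with AVX-512VL available the whole thing is just two vpermt2q instructions. A rough C# sketch of that, assuming the .NET 8 Avx512F.VL.PermuteVar4x64x2 binding for vpermt2q with the same (table, indices, table) argument order as _mm256_permutex2var_epi64 (the exact API name and availability are an assumption, so check your runtime):
// Index bit 2 selects the source vector (0 = a, 1 = b), bits 1:0 select the element within it.
Vector256<ulong> idxLow = Vector256.Create((ulong)0, 4, 1, 5);  // a0,b0,a1,b1 -> 0,1,2,3
Vector256<ulong> idxHigh = Vector256.Create((ulong)2, 6, 3, 7); // a2,b2,a3,b3 -> 4,5,6,7
var c = Avx512F.VL.PermuteVar4x64x2(a, idxLow, b);
var d = Avx512F.VL.PermuteVar4x64x2(a, idxHigh, b);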