Better way of interweaving two vectors - AVX2


I've been learning and experimenting with SIMD in C# and came across this problem: given two 256-bit vectors each containing 4 x uint64, rearrange them such that a = <0,2,4,6>, b = <1,3,5,7> becomes c = <0,1,2,3>, d = <4,5,6,7>.

My current solution uses unpack low/high followed by two 128-bit permutes. I suspect there must be a better way, perhaps using only two permutes or a smarter use of unpack low/high. Is there a better way of doing this?

Vector256<ulong> a = Vector256.Create((ulong)0, 2, 4, 6);
Vector256<ulong> b = Vector256.Create((ulong)1, 3, 5, 7);

Vector256<ulong> low = Avx2.UnpackLow(a, b);
Vector256<ulong> high = Avx2.UnpackHigh(a, b);

var c = Avx2.Permute2x128(low, high, 0b_00_10_00_00);
var d = Avx2.Permute2x128(low, high, 0b_00_11_00_01);
// Translated to C - I haven't tried running it.
// given __m256i a, b, low, high, c, d
low = _mm256_unpacklo_epi64(a,b); // 0,1,4,5
high = _mm256_unpackhi_epi64(a,b); // 2,3,6,7

c = _mm256_permute2x128_si256(low, high, 0x20); // 0,1,2,3
d = _mm256_permute2x128_si256(low, high, 0x31); // 4,5,6,7
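
A quick way to check the translation is a small standalone harness like the one below (a minimal sketch; build with something like gcc -O2 -mavx2):

#include <immintrin.h>
#include <stdio.h>

int main(void)
{
    __m256i a = _mm256_setr_epi64x(0, 2, 4, 6);
    __m256i b = _mm256_setr_epi64x(1, 3, 5, 7);

    __m256i low  = _mm256_unpacklo_epi64(a, b);   // 0,1,4,5
    __m256i high = _mm256_unpackhi_epi64(a, b);   // 2,3,6,7

    __m256i c = _mm256_permute2x128_si256(low, high, 0x20);  // 0,1,2,3
    __m256i d = _mm256_permute2x128_si256(low, high, 0x31);  // 4,5,6,7

    long long out[8];
    _mm256_storeu_si256((__m256i *)(out + 0), c);
    _mm256_storeu_si256((__m256i *)(out + 4), d);

    for (int i = 0; i < 8; i++)
        printf("%lld ", out[i]);   // expected: 0 1 2 3 4 5 6 7
    printf("\n");
    return 0;
}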

CodePudding user response:

Here’s a slightly better way:

Vector256<ulong> a = Vector256.Create( (ulong)0, 2, 4, 6 );
Vector256<ulong> b = Vector256.Create( (ulong)1, 3, 5, 7 );

Vector256<ulong> low = Avx2.UnpackLow( a, b );
Vector256<ulong> high = Avx2.UnpackHigh( a, b );

var d = Avx2.Permute2x128( low, high, 0b_00_11_00_01 );
var c = Avx2.InsertVector128( low, high.GetLower(), 1 );

Same speed as your code on Intel CPUs, but slightly faster on AMD: on CPUs like Zen 2 or Zen 3, the vinserti128 instruction has only 1 cycle of latency, while vperm2i128 has 3 cycles of latency.
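
For reference, a rough C-intrinsics sketch of the same variant (an untested translation; Avx2.InsertVector128 maps to _mm256_inserti128_si256, i.e. vinserti128, and the helper name is just for illustration):

#include <immintrin.h>

// untested C translation of the C# above
static void interleave_insert(__m256i a, __m256i b, __m256i *c, __m256i *d)
{
    __m256i low  = _mm256_unpacklo_epi64(a, b);   // 0,1,4,5
    __m256i high = _mm256_unpackhi_epi64(a, b);   // 2,3,6,7

    // vperm2i128: upper lane of low (4,5) + upper lane of high (6,7)
    *d = _mm256_permute2x128_si256(low, high, 0x31);                     // 4,5,6,7
    // vinserti128: keep the lower lane of low (0,1), insert the lower lane of high (2,3)
    *c = _mm256_inserti128_si256(low, _mm256_castsi256_si128(high), 1);  // 0,1,2,3
}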

CodePudding user response:

If you locally have more pressure on your shuffle port than on the other ports, you can trade one shuffle for two blends, like so (apologies, this is only C/C++, but I assume you can translate it to C# if needed):

__m256i t0 = _mm256_unpacklo_epi64(a, b);   // 0,1,4,5
__m256i t1 = _mm256_unpackhi_epi64(a, b);   // 2,3,6,7

// upper lane of t0 (4,5) + lower lane of t1 (2,3)
__m256i swap = _mm256_permute2x128_si256(t0, t1, 0x21);   // 4,5,2,3

c = _mm256_blend_epi32(t0, swap, 0xf0);   // 0,1,2,3
d = _mm256_blend_epi32(swap, t1, 0xf0);   // 4,5,6,7

Note that clang actually "optimizes" this back to a variant with two vperm2f128 (this may depend on context, though): https://godbolt.org/z/499abhWYb (check the interleave2 method).

If you want to store the result to memory (maybe after doing some other operations on it), then in some contexts using a set of vextracti128 stores (the memory-destination form) could also be an option, using more store operations but saving shuffle/blend operations.
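
Roughly, such a store-based variant could look like the sketch below (the helper name and the uint64_t destination buffer are just illustrative assumptions):

#include <immintrin.h>
#include <stdint.h>

// sketch: interleave a and b straight into dst[0..7], with no lane-crossing shuffle or blend
static void interleave_store(__m256i a, __m256i b, uint64_t *dst)
{
    __m256i t0 = _mm256_unpacklo_epi64(a, b);   // 0,1,4,5
    __m256i t1 = _mm256_unpackhi_epi64(a, b);   // 2,3,6,7

    _mm_storeu_si128((__m128i *)(dst + 0), _mm256_castsi256_si128(t0));       // 0,1
    _mm_storeu_si128((__m128i *)(dst + 2), _mm256_castsi256_si128(t1));       // 2,3
    _mm_storeu_si128((__m128i *)(dst + 4), _mm256_extracti128_si256(t0, 1));  // 4,5 (vextracti128 store)
    _mm_storeu_si128((__m128i *)(dst + 6), _mm256_extracti128_si256(t1, 1));  // 6,7 (vextracti128 store)
}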

CodePudding user response:

That looks pretty reasonable without AVX-512 for 2x vpermt2q.

AVX2 doesn't have any 2-input lane-crossing shuffles with granularity narrower than 128-bit vperm2i128 (aka _mm256_permute2x128_si256).

And shuffling each input with vpermq to set up for blends would probably be 4x shuffles + 2x vpblendd, so that's not better.

Maybe there's a more clever trick I'm missing / forgetting about, but I don't expect you can do better.
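
For comparison, the AVX-512 route mentioned above would look roughly like the sketch below (requires AVX-512F + AVX-512VL for the 256-bit _mm256_permutex2var_epi64, which compiles to vpermt2q/vpermi2q; the helper name and index vectors are just illustrative):

#include <immintrin.h>

// sketch: one 2-input qword shuffle per output vector
static void interleave_avx512(__m256i a, __m256i b, __m256i *c, __m256i *d)
{
    // index values 0..3 select from a, 4..7 select from b
    const __m256i idx_c = _mm256_setr_epi64x(0, 4, 1, 5);
    const __m256i idx_d = _mm256_setr_epi64x(2, 6, 3, 7);

    *c = _mm256_permutex2var_epi64(a, idx_c, b);   // 0,1,2,3
    *d = _mm256_permutex2var_epi64(a, idx_d, b);   // 4,5,6,7
}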
