I've been learning/experimenting with SIMD in C# and came across this problem:
Given two 256-bit vectors each containing 4 x uint64, rearrange them such that a = <0, 2, 4, 6>, b = <1, 3, 5, 7> becomes c = <0, 1, 2, 3>, d = <4, 5, 6, 7>.
My current solution uses unpack low/high plus two permutes; I'm sure there must be a better way of doing this, using only two permutes or a better unpack low/high. Is there a better way of doing this?
Vector256<ulong> a = Vector256.Create((ulong)0, 2, 4, 6);
Vector256<ulong> b = Vector256.Create((ulong)1, 3, 5, 7);
Vector256<ulong> low = Avx2.UnpackLow(a, b);   // 0,1,4,5
Vector256<ulong> high = Avx2.UnpackHigh(a, b); // 2,3,6,7
var c = Avx2.Permute2x128(low, high, 0b_00_10_00_00); // 0,1,2,3
var d = Avx2.Permute2x128(low, high, 0b_00_11_00_01); // 4,5,6,7
// Translated to C - I haven't tried running it.
// given __m256i a, b, low, high, c, d
low = _mm256_unpacklo_epi64(a, b);              // 0,1,4,5
high = _mm256_unpackhi_epi64(a, b);             // 2,3,6,7
c = _mm256_permute2x128_si256(low, high, 0x20); // 0,1,2,3
d = _mm256_permute2x128_si256(low, high, 0x31); // 4,5,6,7
CodePudding user response:
Here’s a slightly better way:
Vector256<ulong> a = Vector256.Create( (ulong)0, 2, 4, 6 );
Vector256<ulong> b = Vector256.Create( (ulong)1, 3, 5, 7 );
Vector256<ulong> low = Avx2.UnpackLow( a, b );           // 0,1,4,5
Vector256<ulong> high = Avx2.UnpackHigh( a, b );         // 2,3,6,7
var d = Avx2.Permute2x128( low, high, 0b_00_11_00_01 );  // 4,5,6,7
var c = Avx2.InsertVector128( low, high.GetLower(), 1 ); // 0,1,2,3
Same speed as your code on Intel CPUs, but slightly faster on AMD: on CPUs like Zen 2 or Zen 3 the vinserti128 instruction has only 1 cycle of latency, while vperm2i128 has 3 cycles of latency.
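If you want to sanity-check either version, Vector256<T>.ToString() prints the elements in order, so a quick check (a minimal sketch, assuming the a/b/c/d locals from above) is just:
Console.WriteLine(c); // <0, 1, 2, 3>
Console.WriteLine(d); // <4, 5, 6, 7>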
CodePudding user response:
If you locally have more pressure on your shuffle port than on the other ports, you can trade one shuffle for two blends, like so (apologies, this is only C/C++, but I assume you can translate it to C# if needed; a rough C# translation is sketched at the end of this answer):
__m256i t0 = _mm256_unpacklo_epi64(a, b);               // 0,1,4,5
__m256i t1 = _mm256_unpackhi_epi64(a, b);               // 2,3,6,7
__m256i swap = _mm256_permute2x128_si256(t0, t1, 0x21); // 4,5,2,3
c = _mm256_blend_epi32(t0, swap, 0xf0);                 // 0,1,2,3
d = _mm256_blend_epi32(swap, t1, 0xf0);                 // 4,5,6,7
Note that clang actually "optimizes" this back to a variant with two vperm2f128 (this may depend on context, though): https://godbolt.org/z/499abhWYb (check the interleave2 method).
If you want to store the result to memory (maybe after doing some other operations on it), then in some contexts a set of vextracti128 stores (the memory-destination form of the instruction) could also be an option, using more store operations but saving shuffle/blend operations.
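Here is a rough C# translation of the blend variant above, as an untested sketch. Avx2.Blend maps to vpblendd and works on 32-bit elements, so the vectors are reinterpreted as uint for the blend and back to ulong afterwards:
Vector256<ulong> t0 = Avx2.UnpackLow(a, b);              // 0,1,4,5
Vector256<ulong> t1 = Avx2.UnpackHigh(a, b);             // 2,3,6,7
Vector256<ulong> swap = Avx2.Permute2x128(t0, t1, 0x21); // 4,5,2,3
var c = Avx2.Blend(t0.AsUInt32(), swap.AsUInt32(), 0xF0).AsUInt64(); // 0,1,2,3
var d = Avx2.Blend(swap.AsUInt32(), t1.AsUInt32(), 0xF0).AsUInt64(); // 4,5,6,7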
CodePudding user response:
That looks pretty reasonable without AVX-512 for 2x vpermt2q.
AVX2 doesn't have any 2-input lane-crossing shuffles with granularity narrower than the 128-bit vperm2i128 (aka _mm256_permute2x128_si256).
And shuffling each input with vpermq to set up for blends would probably be 4x shuffles + 2x vpblendd, so that's not better.
Maybe there's a more clever trick I'm missing / forgetting about, but I don't expect you can do better.
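For what it's worth, with AVX-512VL available the whole thing is just two vpermt2q instructions. A rough C# sketch of that, assuming the .NET 8 Avx512F.VL.PermuteVar4x64x2 binding for vpermt2q with the same (table, indices, table) argument order as _mm256_permutex2var_epi64 (the exact API name and availability are an assumption, so check your runtime):
// Index bit 2 selects the source vector (0 = a, 1 = b), bits 1:0 select the element within it.
Vector256<ulong> idxLow = Vector256.Create((ulong)0, 4, 1, 5);  // a0,b0,a1,b1 -> 0,1,2,3
Vector256<ulong> idxHigh = Vector256.Create((ulong)2, 6, 3, 7); // a2,b2,a3,b3 -> 4,5,6,7
var c = Avx512F.VL.PermuteVar4x64x2(a, idxLow, b);
var d = Avx512F.VL.PermuteVar4x64x2(a, idxHigh, b);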