I would like to take the result of an 8-bit vertical SIMD comparison between 256-bit vectors and pack the bits into the lowest byte of each 32-bit element for a vpshufb
lookup on the lowest bytes. This isn't terribly difficult with AVX-512 (replace the &
with a masked move if using 512-bit vectors):
__m256i cmp_8_into_32(__m256i a, __m256i b) {
return _mm256_popcnt_epi32(_mm256_cmpeq_epi8(a, b)
& _mm256_set1_epi32(0xff0f0301 /* can be any order */));
}
That's three uops and, assuming perfect scheduling, a throughput of 1 according to uops.info—not bad. Alas, vpopcntd
isn't in AVX2. What's the optimal way to do this operation there? The best I can think of is to mask the pairs of bits at indices 7,8 and 15,16, then perform two constant-amount vpsrld
and a vpor
. So that's 6 uops, throughput of 2.5 ish. Not bad, but I wonder if there's something better.
CodePudding user response:
Following chtz's comment (thanks!), I realize it's actually fairly easy:
__m256i cmp_8_into_32_1(__m256i a, __m256i b) {
const __m256i weights = _mm256_set1_epi32(0x08040201);
const __m256i all_n1 = _mm256_set1_epi16(-0x1);
__m256i cmp = _mm256_cmpeq_epi8(a, b);
__m256i hsum16 = _mm256_maddubs_epi16(weights, cmp);
return _mm256_madd_epi16(hsum16, all_n1);
}
Peter Cordes's suggestion saved an additional vpand
. The two multiply–add instructions both run on either port 0 or 1, so this has the same throughput as the original popcount-based solution, although with a latency of about 11 instead of 5.
CodePudding user response:
Uses 1 multiply:
__m256i cmp_8_into_32(__m256i a, __m256i b) {
__m256i cmp = _mm256_cmpeq_epi8(a, b);
__m256i msk = _mm256_and_si256(cmp, _mm256_set1_epi32(0x08040201));
__m256i hsum = _mm256_madd_epi16(msk, _mm256_set1_epi8(1));
return _mm256_srli_epi16(hsum, 8);
}
A 32-bit multiply (_mm256_mullo_epi32
) is not used because it is slow.
If the results are not needed "in-lane" then one could use a _mm256_packs_epi16
immediately after the comparison to process twice as much data at once. If you don't need all of the possible states (say we want to treat no-matches the same as only the lowest byte matches) then you could do 4x as much per instruction. If the results from the vpshufb
lookup are getting gathered together then there may be other possible optimizations...