I'm trying to search an array with AVX-512:
__attribute__((target("avx512bw"))) int search(int* nums, int numsSize, int target) {
// align nums
int arr[16] __attribute__((aligned(512)));
__builtin_memcpy(arr, nums, numsSize*sizeof(int));
// build vectors
const __m512i valueVec = _mm512_set1_epi32(target);
const __m512i searchVec = _mm512_load_epi32(&arr[0]);
// compare
const __mmask16 equalBits = _mm512_cmpeq_epi32_mask(searchVec, valueVec);
return equalBits;
}
When I have a 0 in the input for nums, like [0,1,3,5,9,12], and target=0, I get wrong results that are close to high powers of 2: 33282, 33281, 2692.
Is this due to the undefined bits in searchVec? Does it match the first zero among the elements that were never filled, because my input does not fill the vector completely?
Also, is there a way to convert the equalBits bitmask, which is 1,2,4,8,16, to the vector's index of the matching value, like 1,2,3,4,5? I tried _tzcnt_u32( (unsigned int) equalBits) but it looks like it needs to be cast to a vector, unsigned int __X.
CodePudding user response:
Yes, you need to mask off the unused elements.
int search(int* nums, int numsSize, int target) {
// mask unused values -- assumes 0 <= numsSize <= 16
auto const mask = (1 << numsSize) - 1;
// build vectors
const __m512i valueVec = _mm512_set1_epi32(target);
const __m512i searchVec = _mm512_maskz_loadu_epi32(mask, nums);
// compare
const __mmask16 equalBits = _mm512_mask_cmpeq_epi32_mask(mask, searchVec, valueVec);
return equalBits;
}
You don't need to copy to an aligned temporary array; you can use the loadu (for "unaligned") intrinsics.
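To see why both the masked load and the masked compare matter, here is a minimal scalar model of the two intrinsics. It is only a sketch: tail_mask and model_search are hypothetical names, not part of any library, and the loop just mimics lane by lane what the masked instructions do, assuming numsSize is at most 16.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: bit i is set for every valid lane i < n (0 <= n <= 16).
inline std::uint16_t tail_mask(int n) {
    assert(n >= 0 && n <= 16);
    return static_cast<std::uint16_t>((1u << n) - 1u);
}

// Scalar model of _mm512_maskz_loadu_epi32 followed by a compare.
// mask_compare == true models _mm512_mask_cmpeq_epi32_mask (masked-off lanes
// never report a match); false models the unmasked _mm512_cmpeq_epi32_mask.
inline std::uint16_t model_search(const int* nums, int n, int target,
                                  bool mask_compare) {
    const std::uint16_t mask = tail_mask(n);
    std::uint16_t equalBits = 0;
    for (int lane = 0; lane < 16; ++lane) {
        const bool valid = ((mask >> lane) & 1u) != 0;
        const int value = valid ? nums[lane] : 0;  // zero-masking load
        if (value == target && (valid || !mask_compare)) {
            equalBits = static_cast<std::uint16_t>(equalBits | (1u << lane));
        }
    }
    return equalBits;
}
```

With nums = [0,1,3,5,9,12] and target = 0, the masked compare reports only lane 0, while the unmasked variant also reports the ten zero-filled lanes 6 through 15.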
is there a way to convert the equalBits bitmask, which is 1,2,4,8,16, to the vector's index of the matching value, like 1,2,3,4,5?
If you have more than one match, an easy way is to make a vector of indices, then compress it with the comparison mask:
auto const indices = _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);
// k is the comparison mask (equalBits above).
// This assumes exactly one match; with several matches, store to an array instead:
int index;
_mm512_mask_compressstoreu_epi32(&index, k, indices);
// See also _mm512_maskz_compress_epi32 to return a zmm instead of storing to a ptr:
// __m512i matchedIndices = _mm512_maskz_compress_epi32(k, indices);
// int index = _mm_cvtsi128_si32(
// _mm256_castsi256_si128(
// _mm512_castsi512_si256(matchedIndices)));
If you're doing this in a loop, use _mm512_set1_epi32(16) and add that to indices in each iteration.
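A rough sketch of that loop, assuming you want every matching index from an array of any length. find_all is a hypothetical helper name; the vector path is only compiled when the compiler defines __AVX512F__ (for example with -mavx512f), and a plain scalar loop is used otherwise.

```cpp
#include <cassert>
#include <vector>
#if defined(__AVX512F__)
#include <immintrin.h>
#endif

// Hypothetical helper: returns the index of every element equal to target.
std::vector<int> find_all(const int* nums, int numsSize, int target) {
    assert(numsSize >= 0);
    std::vector<int> out(numsSize);  // worst case: every element matches
    int count = 0;
#if defined(__AVX512F__)
    const __m512i valueVec = _mm512_set1_epi32(target);
    const __m512i step = _mm512_set1_epi32(16);
    // Lane i holds the array index of the element loaded into lane i.
    __m512i indices = _mm512_set_epi32(15, 14, 13, 12, 11, 10, 9, 8,
                                       7, 6, 5, 4, 3, 2, 1, 0);
    for (int i = 0; i < numsSize; i += 16) {
        const int remaining = numsSize - i;
        // Full mask for whole chunks, low `remaining` bits for the tail.
        const __mmask16 mask = remaining >= 16
            ? static_cast<__mmask16>(0xFFFF)
            : static_cast<__mmask16>((1u << remaining) - 1u);
        const __m512i searchVec = _mm512_maskz_loadu_epi32(mask, nums + i);
        const __mmask16 k =
            _mm512_mask_cmpeq_epi32_mask(mask, searchVec, valueVec);
        // Pack the indices of the matching lanes contiguously into out.
        _mm512_mask_compressstoreu_epi32(out.data() + count, k, indices);
        count += __builtin_popcount(k);
        indices = _mm512_add_epi32(indices, step);
    }
#else
    // Scalar fallback when AVX-512F is not enabled at compile time.
    for (int i = 0; i < numsSize; ++i) {
        if (nums[i] == target) out[count++] = i;
    }
#endif
    out.resize(count);
    return out;
}
```

The compress store writes only as many elements as there are set bits in k, so the output buffer never needs more than numsSize slots.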
If you have exactly one match, then you're correct about tzcnt and just need to cast the mask to an int:
_tzcnt_u32(static_cast<uint32_t>(k))
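For completeness, a small sketch of the mask-to-index step done in scalar code. only_match and all_matches are hypothetical helper names, and __builtin_ctz (a GCC/Clang builtin) stands in for _tzcnt_u32 so the snippet does not need BMI1. Unlike _tzcnt_u32, which returns 32 for a zero input, __builtin_ctz is undefined for zero, hence the checks. Clearing the lowest set bit in a loop is an alternative to the compress approach, not something the intrinsics above do for you.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: lane index of the single match.
// Precondition: exactly one bit of equalBits is set.
inline int only_match(std::uint16_t equalBits) {
    assert(equalBits != 0 && (equalBits & (equalBits - 1)) == 0);
    return __builtin_ctz(equalBits);
}

// Hypothetical helper: lane indices of all matches, lowest lane first.
inline std::vector<int> all_matches(std::uint16_t equalBits) {
    std::vector<int> lanes;
    unsigned bits = equalBits;
    while (bits != 0) {
        lanes.push_back(__builtin_ctz(bits));  // position of lowest set bit
        bits &= bits - 1;                      // clear that bit
    }
    return lanes;
}
```

For example, the value 33282 from the question is 0x8202, so all_matches would report lanes 1, 9 and 15.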