I want to compare two ZMM vectors and use the result of the comparison to perform a vpandn.
In AVX2, I do this:
vpcmpgtb ymm3, ymm3, ymm1
vpandn ymm3, ymm3, ymm2
But in AVX-512BW, vpcmpgtb returns its result in a k mask register.
How should I perform vpandn on its result in AVX-512BW?
vpcmpgtb k0, zmm3, zmm1
vpandn ??
CodePudding user response:
There are separate instructions for k registers; their mnemonics all start with k so they're easy to find in the table of instructions, like kandnq k0, k0, k1.
As well as kunpck... (concatenate, not interleave), kadd/kshift, kor/kand/knot/kxor, and even a kxnor (handy way to generate all-ones for gather/scatter). Also of course kmov (including to/from memory or GP-integer), and kortest and ktest for branching.
They all come in byte/word/dword/qword sizes for the number of mask bits affected, zero-extending the result. (Without AVX-512BW, e.g. on Xeon Phi, only the byte and word sizes exist, since 16 bits covers a ZMM with elements as small as dword. But all mainstream CPUs with AVX-512 have AVX-512BW and thus 64-bit mask registers.)
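For example, a minimal sketch (the register choices k1/k2/k3 and rax are arbitrary, not from the question) showing the size suffixes and a few of the instructions named above:
kandnw   k2, k1, k2     ; 16 mask bits: enough for a ZMM of dwords (AVX-512F baseline)
kandnq   k2, k1, k2     ; 64 mask bits: needed for a ZMM of bytes (AVX-512BW)
kxnorw   k3, k3, k3     ; generate an all-ones mask, e.g. as the initial mask for a gather
kmovq    rax, k2        ; copy 64 mask bits to a GP-integer register
kortestq k2, k2         ; ZF=1 if k2 is all zero, CF=1 if all ones; branch with jz/jc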
You can sometimes fold that into another operation to avoid needing a separate instruction to combine masks. Either invert the compare so you can use ktest directly to branch, or if you want to mask, use a zero-masked compare-into-mask. (Merge-masked compare/test into a 3rd existing mask is not supported.)
vpcmpngtb k1, zmm3, zmm1 ; k0 can't be used for masking, only with k instructions
vpcmpeqb k2{k1}, zmm4, zmm1 ; This is zero-masking even without {z}, because merge masking isn't supported for this
This is equivalent (except for performance) to:
vpcmpngtb k1, zmm3, zmm1
vpcmpeqb k2, zmm4, zmm1
kandq k2, k2, k1            ; kand needs a width suffix: q = 64 mask bits for byte elements
Also equivalent to kandn with a gt compare as the NOTed (first) operand, like in your question.
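A minimal sketch of that kandn form (zmm4 and the k-register choices are placeholders, not from your code):
vpcmpgtb k1, zmm3, zmm1     ; gt compare, as in the question
vpcmpeqb k2, zmm4, zmm1
kandnq   k2, k1, k2         ; k2 = ~k1 & k2, same operand order as vpandn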
The k... mask instructions can usually only run on port 0, which is not great for performance. https://uops.info/.
A masked compare (or other masked instruction) has to wait for the mask-register input to be ready before starting work on the other operands. You might hope it could forward the mask late, since it's only needed at write-back, but IIRC it doesn't. Still, 1 instruction instead of 2 is better. Having the first of two instructions able to execute in parallel isn't a win unless it was high latency, the mask operation is low latency, and you're latency-bound. But often execution-unit throughput is more of a bottleneck when using 512-bit registers, since the vector ALUs on port 1 are shut down.
Some k instructions are only 1 cycle latency on current CPUs, while others (like kshift, kunpck, and kadd) are 4 cycle latency.
The intrinsics for these masked compare-into-mask instructions are _mm256_mask_cmp_ep[iu]_mask (and the other vector widths), with a __mmask8/16/32/64 input operand (as well as two vectors and an immediate predicate) and a mask return value. Like the asm, they use ..._mask_... instead of ..._maskz_... in their names, despite this being zero-masking, not merge-masking.
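A minimal C sketch of the zero-masked compare sequence above, assuming the 512-bit members of that family (_mm512_cmp_epi8_mask and _mm512_mask_cmp_epi8_mask); the function name and variables are made up for illustration:
#include <immintrin.h>

// Mask of byte positions where !(a > b) && (c == b), matching the
// vpcmpngtb / masked vpcmpeqb sequence; the compiler allocates the k registers.
__mmask64 both_conditions(__m512i a, __m512i b, __m512i c)
{
    __mmask64 k1 = _mm512_cmp_epi8_mask(a, b, _MM_CMPINT_LE);    // not-greater-than == less-or-equal (signed bytes)
    return _mm512_mask_cmp_epi8_mask(k1, c, b, _MM_CMPINT_EQ);   // zero-masking despite the _mask_ name
}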