Let's say I have 2 arrays of bools of length 64 (or whatever the register size is) and I want to AND all the corresponding bools into a resultant 3rd array. Obviously it's possible to pack the arrays into 2 registers and perform a bitwise AND in a single instruction, but this is much slower if bit fiddling is necessary to pack and unpack. Is there any x86 instruction (or any x86 extension instruction) that performs the packing?
Answer:
You'd normally keep your arrays packed all the time if you wanted to be able to do that efficiently, and access them with bit-indexing within a 64-bit register, e.g. with bt rdi, rax to set CF according to the bit-number indexed by RAX: bool CF = rdi & (1ULL<<(rax&63)).
Don't use bt or bts with a memory destination; they have crazy-CISC bit-string semantics where bt [rdi], rax can index outside the qword at [rdi], using the whole RAX as a bit-index if the destination isn't a register.
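As a minimal C sketch of that keep-them-packed approach (the function names here are just illustrative):

#include <stdint.h>

// AND two packed 64-bool bitmaps: a single instruction's worth of work.
uint64_t and64(uint64_t a, uint64_t b) { return a & b; }

// Read bit i (0..63) of a packed bitmap, like bt reg,reg / setc.
int get_bit(uint64_t bits, unsigned i) { return (bits >> (i & 63)) & 1; }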
If your arrays are stored 1 bool per byte, you'd normally just use two vpand instructions to bitwise-AND 32 bytes at a time (AVX2). Just like if you were ANDing 256-bit bitmaps where only every 8th bit might be non-zero.
vmovdqu ymm0, [rdi]            ; load 32 bytes
vpand   ymm0, ymm0, [rsi]      ; load+AND 32 bytes from the 2nd source
vmovdqu [rdx], ymm0            ; store 32 bytes
vmovdqu ymm0, [rdi + 32]       ; and repeat for the next 32 bytes
vpand   ymm0, ymm0, [rsi + 32]
vmovdqu [rdx + 32], ymm0
A compiler should do this for you if you write for(int i=0;i<64;i++) c[i] = a[i]&b[i]; for uint8_t or bool elements.
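As a standalone sketch of that loop (the function name is mine; compile with AVX2 enabled, e.g. -O3 -mavx2, to get the vpand version above):

#include <stdint.h>

// The compiler should auto-vectorize this into 32-byte vpand operations.
void and_bools(uint8_t *c, const uint8_t *a, const uint8_t *b) {
    for (int i = 0; i < 64; i++)
        c[i] = a[i] & b[i];
}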
Packing bools to bitmaps with SSE2 or AVX2
But if you want to pack bools to bitmaps, yeah, pmovmskb is the special x86 instruction you want for this, packing the top bit of each SIMD vector element into an integer. It's existed since SSE2, but AVX2 is fairly widely available and can go 32 at a time instead of just 16.
See also How to create a byte out of 8 bool values (and vice versa)? for that and a multiply bithack for 8 bytes at a time.
e.g. making a std::bitset<64> from a std::array<bool, 64>, using AVX2:
vmovdqu ymm0, [rdi]       ; first 32 bool elements
vpslld ymm0, ymm0, 7      ; shift the 0/1 to the top, 0x80 or 0x00 in each byte
vpmovmskb eax, ymm0
vmovdqu ymm0, [rdi + 32]
vpslld ymm0, ymm0, 7
vpmovmskb edx, ymm0
vzeroupper                ; if you might do any legacy SSE before next use of 256-bit vectors
shl rdx, 32               ; combine hi:lo halves
or rax, rdx               ; ((uint64_t)hi << 32) | lo
# The 64 bits in RAX come from the bools in [rdi + 0..63]
So it's more work than just ANDing 32 bytes at a time from two inputs. If you wanted a packed result from two unpacked inputs, you'd probably want to _mm256_and_si256() them and then _mm256_slli_epi32 / _mm256_movemask_epi8 those AND results.
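A sketch of that AND-then-pack with AVX2 intrinsics (assuming one bool per byte and 64 elements; and_pack64 is just a placeholder name):

#include <immintrin.h>
#include <stdint.h>

// AND two bool-per-byte arrays and pack the 64 results into a 64-bit mask.
uint64_t and_pack64(const uint8_t *a, const uint8_t *b) {
    __m256i lo = _mm256_and_si256(_mm256_loadu_si256((const __m256i*)a),
                                  _mm256_loadu_si256((const __m256i*)b));
    __m256i hi = _mm256_and_si256(_mm256_loadu_si256((const __m256i*)(a + 32)),
                                  _mm256_loadu_si256((const __m256i*)(b + 32)));
    // Shift each byte's 0/1 up to the sign bit, then grab the top bit of every byte.
    uint32_t lo_bits = (uint32_t)_mm256_movemask_epi8(_mm256_slli_epi32(lo, 7));
    uint32_t hi_bits = (uint32_t)_mm256_movemask_epi8(_mm256_slli_epi32(hi, 7));
    return ((uint64_t)hi_bits << 32) | lo_bits;
}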
To unpack again, see How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)? - it's less efficient without AVX-512.
Using AVX-512
AVX-512 can compare or test into a mask register, skipping the [v]pmovmskb step. But the k0..7 mask registers are limited in what you can do with them (especially if you care about efficiency; kand can only run on port 5 on existing CPUs; https://uops.info/). And it takes a kmov to get data from them into GP registers like RAX.
For example with intrinsics:
#include <immintrin.h>
// or I could have declared these as taking bool *p args
__mmask64 foo(char *p){
    __m512i v = _mm512_loadu_si512(p);
    return _mm512_test_epi8_mask(v, v);
}

__mmask64 bar(char *p){
    __m512i v = _mm512_loadu_si512(p);
    return _mm512_cmpneq_epi8_mask(_mm512_setzero_si512(), v);
}
Compiles on Godbolt to:
# GCC12 -O3 -march=skylake-avx512
foo(char*):
        vmovdqu64 zmm0, ZMMWORD PTR [rdi]   # 64-byte load
        vptestmb  k0, zmm0, zmm0            # test into mask
        kmovq     rax, k0
        vzeroupper          # could have used ZMM16..31 to avoid this
        ret
bar(char*):
        vpxor     xmm0, xmm0, xmm0
        vpcmpb    k0, zmm0, ZMMWORD PTR [rdi], 4
        kmovq     rax, k0
        vzeroupper          # not actually needed, this version doesn't write a ZMM register
        ret
If I'd used two different input arrays, we could AND them together into a bitmask with one vptestmb instruction. So it's still better to do that, rather than separately pack the inputs for a kand k0, k1.
vmovdqu32 zmm0, [rdi]
vptestmb k1, zmm0, [rsi] ; k1 = packed bits of a[0..63] & b[0..63]
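In intrinsics, that one-instruction AND+pack could look something like this sketch (assuming AVX-512BW; the name and_pack64_avx512 is mine, not from any library):

#include <immintrin.h>

// AND two bool-per-byte arrays and pack the result into a 64-bit mask
// with a single vptestmb: bit i is set where (a[i] & b[i]) != 0.
__mmask64 and_pack64_avx512(const char *a, const char *b) {
    __m512i va = _mm512_loadu_si512(a);
    __m512i vb = _mm512_loadu_si512(b);
    return _mm512_test_epi8_mask(va, vb);
}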
See Does Skylake need vzeroupper for turbo clocks to recover after a 512-bit instruction that only reads a ZMM register, writing a k mask? re: vzeroupper being needed or not when you only read a ZMM register after zeroing it implicitly via XMM zeroing. Either way, compilers could have just used ZMM16..31 to avoid touching the upper part of y/zmm0..15. That would avoid transition stalls, and AFAIK there wouldn't be other penalties even though there'd be a non-zero ZMM register for the remainder of the program.
Using 512-bit vectors can have some performance downsides if you don't make heavy use of them everywhere in your program, which is why compilers default to -mprefer-vector-width=256 for auto-vectorizing:
- SIMD instructions lowering CPU frequency
- why does gcc auto-vectorization for tigerlake use ymm not zmm registers
If you do compare in two 32-byte halves, you might want kunpckdq k1, k1, k2 after comparing into k1 and k2, then kmovq rax, k1. That concatenates the low 32 bits of k1 and k2.
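As a sketch of that with 256-bit intrinsics (assuming AVX-512BW+VL; _mm512_kunpackd is the intrinsic corresponding to kunpckdq, and the function name is just illustrative):

#include <immintrin.h>
#include <stdint.h>

// Test two 32-byte halves into 32-bit masks, then concatenate them
// with kunpckdq instead of shift+OR in integer registers.
uint64_t and_pack64_halves(const char *a, const char *b) {
    __mmask32 lo = _mm256_test_epi8_mask(_mm256_loadu_si256((const __m256i*)a),
                                         _mm256_loadu_si256((const __m256i*)b));
    __mmask32 hi = _mm256_test_epi8_mask(_mm256_loadu_si256((const __m256i*)(a + 32)),
                                         _mm256_loadu_si256((const __m256i*)(b + 32)));
    return _mm512_kunpackd(hi, lo);   // result[63:32] = hi, result[31:0] = lo
}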
Unpacking
AVX-512 finally added direct support for turning a mask into a vector of 0 / -1 elements, with vpmovm2b zmm0, k1 (docs). You could vpandd that with a vector of set1_epi8(1) to get bools.
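A sketch of that with intrinsics (assuming AVX-512BW; unpack64 is just an illustrative name):

#include <immintrin.h>

// Expand a 64-bit mask back to one bool (0 or 1) per byte.
void unpack64(__mmask64 m, char *out) {
    __m512i v = _mm512_movm_epi8(m);               // vpmovm2b: 0 / -1 in each byte
    v = _mm512_and_si512(v, _mm512_set1_epi8(1));  // AND with set1_epi8(1) -> 0 / 1
    _mm512_storeu_si512(out, v);
}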
Otherwise, see:
- How to perform the inverse of _mm256_movemask_epi8 (VPMOVMSKB)?
- is there an inverse instruction to the movemask instruction in intel avx2? - various combos of element sizes and number of bits