I have a lot of 32b values and I need to count the occurrence of each nth true bit over the entire length of the data and I need to do it as fast as possible because this is the performance bottleneck of the whole simulation. I created a naive c approach that does this for 8 bit values to illustrate the question:
#include <iostream>
#include <vector>
#include <cstdint>
std::vector<uint32_t> vertical_popcount(std::vector<uint8_t>& data) {
std::vector<uint32_t> result({0, 0, 0, 0, 0, 0, 0, 0});
for (auto i = 0; i < data.size(); i ) {
result[0] = (data[i] & 0b10000000) > 0;
result[1] = (data[i] & 0b01000000) > 0;
result[2] = (data[i] & 0b00100000) > 0;
result[3] = (data[i] & 0b00010000) > 0;
result[4] = (data[i] & 0b00001000) > 0;
result[5] = (data[i] & 0b00000100) > 0;
result[6] = (data[i] & 0b00000010) > 0;
result[7] = (data[i] & 0b00000001) > 0;
}
return result;
}
int main() {
std::vector<uint8_t> data({0b00000001, 0b00000100, 0b00000101});
auto result = vertical_popcount(data);
std::cout << "occurrence of bits: " << result[0] << ", " << result[1] << ", " << result[2] << ", " << result[3] << ", " << result[4] << ", " << result[5] << ", " << result[6] << ", " << result[7] << "\n";
return 0;
}
Is there an algorithm that does the same but (much) faster?
CodePudding user response:
Here is a sketch of a possible approach (you can improve on this a lot if you can tweak your input to multiples of 8 bytes, or if you know the size of your vectors up front). But it gives you an idea how to use popcount. (For 32bits architecture the pattern is the same but you get less performance)
// #include <intrin.h> // for popcnt which counts number of set bits in a variable
#include <array>
#include <iostream>
#include <vector>
#include <cstdint>
#include <memory>
// work on copy on purpose so we can pad memory of vector to 64bit alignment (kind of a quick hack for now, the extra memory (re)allocations might slow you down too much)
auto vertical_popcount(std::vector<std::uint8_t> values)
{
// use 64bit architecture to do 8 values per cycle
static constexpr std::array<std::uint64_t, 8> masks
{
0x0101010101010101,
0x0202020202020202,
0x0404040404040404,
0x0808080808080808,
0x1010101010101010,
0x2020202020202020,
0x4040404040404040,
0x8080808080808080
};
//using an array instead of vector safes at least one dynamic allocation
std::array<std::size_t, 8> counts{};
// align data to multiple of 8 bytes
// add a few extra 0 bytes, they wont impact the counting
for (std::size_t n = 0; n < values.size() % 8; n)
values.push_back(0);
// make a uint64_t pointer into 8 bit data
// to pickup 8 bytes at a time for masking
auto ptr = reinterpret_cast<std::uint64_t*>(values.data());
for (std::size_t n = 0; n < values.size() / 8; n)
{
for (std::size_t m = 0; m < 8; m)
{
// mask 8 bytes at a time
auto masked_value = (*ptr & masks[m]);
// count bits of 8 uint8_t's in one loop
auto bitcount = std::popcount(masked_value);
counts[m] = bitcount;
}
ptr;
}
return counts;
}
int main()
{
std::vector<uint8_t> data({ 0b00001001, 0b00000100, 0b00000101 });
auto result = vertical_popcount(data);
std::cout << "occurrence of bits: " << result[0] << ", " << result[1] << ", " << result[2] << ", " << result[3] << ", " << result[4] << ", " << result[5] << ", " << result[6] << ", " << result[7] << "\n";
return 0;
}
CodePudding user response:
Pepijn Kramers answer shows how to parallelize the operations to do 8 byte at once. My answer looks at doing more bits at once.
In your code you extract each bit and increment a counter. You do a SIMD operation on that manually on blocks of 8 uint64_t.
The idea is to spread the bits out alternating 0 and data bits so that they can be added without overflowing into the next data bit. First step is to spread them out into 2bit units, then 4bit units, 8bit units and then sum the bytes in each uint64_t. If you want to extend this to 32 bit counts then you need to add 2 more steps to separate into 16bit units and 32bit units. The example below works on 8 uint64_t but if you have larger arrays you can merge more values per step. Just keep track of how many bits you have for each count (2, 4, 8, 16, 32) and don't merge more than 2^n-1
values.
uint64_t data[8] = 0x0123456789ABCDEF;
static const uint64_t mask0 = 0x5555555555555555;
static const uint64_t mask1 = 0x3333333333333333;
static const uint64_t mask2 = 0x0F0F0F0F0F0F0F0F;
// split even and odd bits and add 2 values together
// 2 bit per count, max value 2
uint64_t t000 = data[0] & mask0 data[1] & mask0;
uint64_t t010 = data[2] & mask0 data[3] & mask0;
uint64_t t020 = data[4] & mask0 data[5] & mask0;
uint64_t t030 = data[6] & mask0 data[7] & mask0;
uint64_t t001 = (data[0] >> 1) & mask0 (data[1] >> 1) & mask0;
uint64_t t011 = (data[2] >> 1) & mask0 (data[3] >> 1) & mask0;
uint64_t t021 = (data[4] >> 1) & mask0 (data[5] >> 1) & mask0;
uint64_t t031 = (data[6] >> 1) & mask0 (data[7] >> 1) & mask0;
// split into nibbles and build sum of 4 values
// 4 bit per count, max value 4
uint64_t t100 = t000 & mask1 t010 & mask1;
uint64_t t101 = t001 & mask1 t011 & mask1;
uint64_t t102 = (t000 >> 2) & mask1 (t010 >> 2) & mask1;
uint64_t t103 = (t001 >> 2) & mask1 (t011 >> 2) & mask1;
uint64_t t110 = t020 & mask1 t030 & mask1;
uint64_t t111 = t021 & mask1 t031 & mask1;
uint64_t t112 = (t020 >> 2) & mask1 (t030 >> 2) & mask1;
uint64_t t113 = (t021 >> 2) & mask1 (t031 >> 2) & mask1;
// split into bytes, and build sum of 8 values
// 8 bit per count, max 8
uint64_t sum[] = { t100 & mask2 t110 & mask2;
t101 & mask2 t111 & mask2;
t102 & mask2 t112 & mask2;
t103 & mask2 t113 & mask2;
(t100 >> 4) & mask2 (t110 >> 4) & mask2;
(t101 >> 4) & mask2 (t111 >> 4) & mask2;
(t102 >> 4) & mask2 (t112 >> 4) & mask2;
(t103 >> 4) & mask2 (t113 >> 4) & mask2; }
// add 8 bytes of sum[i] into a single byte
for(int i = 0; i < 8; i) {
sum[i] = sum[i] & 0xFFFFFFFF (sum[i] >> 32);
sum[i] = sum[i] & 0xFFFF (sum[i] >> 16);
sum[i] = sum[i] & 0xFF (sum[i] >> 8);
}
Unless I made a mistake sum
should now hold the bit count for each bit 0-7 for the block of 8 uint64_t.
You can improve this for larger blocks. When the bits are split into even and odd each count has 2 bits. That can hold the sum of 3 uint64_t and I only use 2. Similar the split into nibbles has 4 bits per count so it can hold the sum of 15 uint64_t. The split into bytes can hold the sum of 255 uint64_t.
You can also extend this to SIMD registers and do 128bit - 512bit at once. And I think there is a SIMD sum of bytes in vector intrinsic you can use instead of the loop at the end.