Is there an efficient way to calculate a bitwise sum of uint8_t
buffers (assume the number of buffers is <= 255, so that each sum fits in a uint8_t)?
Basically I want to know, for each bit position i, how many of the buffers have that bit set.
Ex: For 2 buffers
uint8 buf1[k] -> 0011 0001 ...
uint8 buf2[k] -> 0101 1000 ...
uint8 sum[k*8]-> 0 1 1 2 1 0 0 1...
Are there any BLAS or Boost routines for such a requirement?
This is a highly vectorizable operation IMO.
UPDATE: Following is a naive impl of the requirement
for (auto&& buf : buffers) {
    for (int i = 0; i < buf_len; i++) {
        for (int b = 0; b < 8; b++) {
            sum[i*8 + b] += (buf[i] >> b) & 1;
        }
    }
}
CodePudding user response:
An alternative to OP's naive code:
Perform 8 additions at once: use a lookup table to expand the 8 bits into 8 bytes, with each bit mapped to a corresponding byte - see ones[] below.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

void sumit(uint8_t number_of_buf, uint8_t k, const uint8_t buf[number_of_buf][k]) {
  // ones[n] has bit b of n expanded into byte b, so one table lookup
  // performs 8 independent byte-wide additions at once.
  static const uint64_t ones[256] = { 0, 0x1, 0x100, 0x101, 0x10000, 0x10001,
      /* 249 more pre-computed constants */ 0x0101010101010101};
  uint64_t sum[k];
  memset(sum, 0, sizeof sum);
  for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
    for (size_t i = 0; i < k; i++) {
      sum[i] += ones[buf[buf_index][i]];
    }
  }
  for (size_t i = 0; i < k; i++) {
    for (size_t bit = 0; bit < 8; bit++) {
      printf("%llu ", (unsigned long long) (0xFF & (sum[i] >> (8 * bit))));
    }
  }
}
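As an aside, each ones[n] entry just has bit b of n placed into byte b of a 64-bit word. If you would rather not spell out all 256 constants, the table can be generated once at start-up; a minimal sketch (not part of the original answer, with ones[] moved to file scope and init_ones() as an illustrative name):

#include <stdint.h>

static uint64_t ones[256];

// Fill ones[]: entry n gets bit b of n stored in byte b, so for example
// ones[3] == 0x101 and ones[255] == 0x0101010101010101.
static void init_ones(void) {
    for (unsigned n = 0; n < 256; n++) {
        uint64_t v = 0;
        for (unsigned b = 0; b < 8; b++) {
            v |= (uint64_t)((n >> b) & 1) << (8 * b);
        }
        ones[n] = v;
    }
}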
See also @Eric Postpischil.
CodePudding user response:
As a modification of chux's approach, the lookup table can be replaced with a vector shift and mask. Here's an example using GCC's vector extensions.
#include <stdint.h>
#include <stddef.h>

typedef uint8_t vec8x8 __attribute__((vector_size(8)));

void sumit(uint8_t number_of_buf,
           uint8_t k,
           const uint8_t buf[number_of_buf][k],
           vec8x8 * restrict sums) {
    static const vec8x8 shift = {0,1,2,3,4,5,6,7};
    for (size_t i = 0; i < k; i++) {
        sums[i] = (vec8x8){0};
        for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
            sums[i] += (buf[buf_index][i] >> shift) & 1;
        }
    }
}
I interchanged the loops from chux's answer because it seemed more natural to accumulate the sum for one buffer index at a time (then the sum can be cached in a register throughout the inner loop). There might be a tradeoff in cache performance because we now have to read the elements of the two-dimensional buf
in column-major order.
Taking ARM64 as an example, GCC 11.1 compiles the inner loop as follows.
// v1 = sums[i]
// v2 = {0,-1,-2,...,-7} (right shift is done as left shift with negative count)
// v3 = {1,1,1,1,1,1,1,1}
.L4:
ld1r {v0.8b}, [x1] // replicate buf[buf_index][i] to all elements of v0
add x0, x0, 1
add x1, x1, x20
ushl v0.8b, v0.8b, v2.8b // shift
and v0.8b, v0.8b, v3.8b // mask
add v1.8b, v1.8b, v0.8b // accumulate
cmp x0, x19
bne .L4
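As an aside, if the column-major traversal of buf turns out to dominate, the same shift-and-mask expression also works with chux's original loop order, reading buf row-major at the cost of streaming the accumulators through memory instead of keeping one in a register. A rough, untested sketch (reusing the vec8x8 typedef and headers from above; sumit_rowmajor is just an illustrative name):

void sumit_rowmajor(uint8_t number_of_buf,
                    uint8_t k,
                    const uint8_t buf[number_of_buf][k],
                    vec8x8 * restrict sums) {
    static const vec8x8 shift = {0,1,2,3,4,5,6,7};
    for (size_t i = 0; i < k; i++)
        sums[i] = (vec8x8){0};
    for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
        for (size_t i = 0; i < k; i++) {
            // broadcast the byte, shift each lane by its index, mask to the
            // low bit, and accumulate into the in-memory counters
            sums[i] += (buf[buf_index][i] >> shift) & 1;
        }
    }
}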
I think it'd be more efficient to do two bytes at a time (so unrolling the loop on i
by a factor of 2) and use 128-bit vector operations. I leave this as an exercise :)
It's not immediately clear to me whether this would end up being faster or slower than the lookup table. You might have to profile both on the target machine(s) of interest.
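For what it's worth, a rough, untested sketch of the two-bytes-at-a-time variant could look like the following (sumit2 and the sums16 layout are made up here; it assumes k is even, and sums16[i/2] holds the eight counts for byte i in its low half and for byte i+1 in its high half):

#include <stdint.h>
#include <stddef.h>

typedef uint8_t vec8x16 __attribute__((vector_size(16)));

void sumit2(uint8_t number_of_buf,
            uint8_t k,
            const uint8_t buf[number_of_buf][k],
            vec8x16 * restrict sums16) {
    static const vec8x16 shift = {0,1,2,3,4,5,6,7, 0,1,2,3,4,5,6,7};
    for (size_t i = 0; i < k; i += 2) {
        vec8x16 acc = (vec8x16){0};
        for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
            uint8_t b0 = buf[buf_index][i];
            uint8_t b1 = buf[buf_index][i + 1];
            // replicate each source byte across its own eight lanes
            vec8x16 v = {b0, b0, b0, b0, b0, b0, b0, b0,
                         b1, b1, b1, b1, b1, b1, b1, b1};
            acc += (v >> shift) & 1;
        }
        sums16[i / 2] = acc;
    }
}

Whether the compiler actually turns this into 128-bit loads and shifts (and whether it beats the 64-bit version or the lookup table) is exactly the kind of thing the profiling mentioned above would have to show.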