Is there an efficient way to calculate a bitwise sum of uint8_t
buffers (assume the number of buffers is <= 255, so that each sum fits in a uint8_t)?
Basically I want to know, for each bit position i, how many of the buffers have that bit set.
Ex: For 2 buffers
uint8 buf1[k] -> 0011 0001 ...
uint8 buf2[k] -> 0101 1000 ...
uint8 sum[k*8]-> 0 1 1 2 1 0 0 1...
Are there any BLAS or Boost routines for such a requirement?
This is a highly vectorizable operation IMO.
UPDATE: Following is a naive impl of the requirement
for (auto&& buf : buffers) {
    for (int i = 0; i < buf_len; i++) {
        for (int b = 0; b < 8; b++) {
            sum[i*8 + b] += (buf[i] >> b) & 1;
        }
    }
}
CodePudding user response:
An alternative to OP's naive code:
Perform 8 additions at once: use a lookup table to expand the 8 bits into 8 bytes, with each bit mapped to a corresponding byte - see ones[] below.
#include <stdint.h>
#include <stdio.h>
#include <string.h>

void sumit(uint8_t number_of_buf, uint8_t k, const uint8_t buf[number_of_buf][k]) {
  // ones[n] has bit b of n expanded into byte b, so one table lookup
  // performs 8 independent byte-wide additions at once.
  static const uint64_t ones[256] = { 0, 0x1, 0x100, 0x101, 0x10000, 0x10001,
      /* 249 more pre-computed constants */ 0x0101010101010101};
  uint64_t sum[k];
  memset(sum, 0, sizeof sum);
  for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
    for (size_t i = 0; i < k; i++) {
      sum[i] += ones[buf[buf_index][i]];
    }
  }
  for (size_t i = 0; i < k; i++) {
    for (size_t bit = 0; bit < 8; bit++) {
      printf("%llu ", (unsigned long long) (0xFF & (sum[i] >> (8 * bit))));
    }
  }
}
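As an aside, each ones[n] entry just has bit b of n placed into byte b of a 64-bit word. If you would rather not spell out all 256 constants, the table can be generated once at start-up; a minimal sketch (not part of the original answer, with ones[] moved to file scope and init_ones() as an illustrative name):

#include <stdint.h>

static uint64_t ones[256];

// Fill ones[]: entry n gets bit b of n stored in byte b, so for example
// ones[3] == 0x101 and ones[255] == 0x0101010101010101.
static void init_ones(void) {
    for (unsigned n = 0; n < 256; n++) {
        uint64_t v = 0;
        for (unsigned b = 0; b < 8; b++) {
            v |= (uint64_t)((n >> b) & 1) << (8 * b);
        }
        ones[n] = v;
    }
}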
See also @Eric Postpischil.
CodePudding user response:
As a modification of chux's approach, the lookup table can be replaced with a vector shift and mask. Here's an example using GCC's vector extensions.
#include <stdint.h>
#include <stddef.h>

typedef uint8_t vec8x8 __attribute__((vector_size(8)));

void sumit(uint8_t number_of_buf,
           uint8_t k,
           const uint8_t buf[number_of_buf][k],
           vec8x8 * restrict sums) {
    static const vec8x8 shift = {0,1,2,3,4,5,6,7};
    for (size_t i = 0; i < k; i++) {
        sums[i] = (vec8x8){0};
        for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
            sums[i] += (buf[buf_index][i] >> shift) & 1;
        }
    }
}
I interchanged the loops from chux's answer because it seemed more natural to accumulate the sum for one buffer index at a time (then the sum can be cached in a register throughout the inner loop). There might be a tradeoff in cache performance because we now have to read the elements of the two-dimensional buf
in column-major order.
Taking ARM64 as an example, GCC 11.1 compiles the inner loop as follows.
// v1 = sums[i]
// v2 = {0,-1,-2,...,-7} (right shift is done as left shift with negative count)
// v3 = {1,1,1,1,1,1,1,1}
.L4:
ld1r {v0.8b}, [x1] // replicate buf[buf_index][i] to all elements of v0
add x0, x0, 1
add x1, x1, x20
ushl v0.8b, v0.8b, v2.8b // shift
and v0.8b, v0.8b, v3.8b // mask
add v1.8b, v1.8b, v0.8b // accumulate
cmp x0, x19
bne .L4
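As an aside, if the column-major traversal of buf turns out to dominate, the same shift-and-mask expression also works with chux's original loop order, reading buf row-major at the cost of streaming the accumulators through memory instead of keeping one in a register. A rough, untested sketch (reusing the vec8x8 typedef and headers from above; sumit_rowmajor is just an illustrative name):

void sumit_rowmajor(uint8_t number_of_buf,
                    uint8_t k,
                    const uint8_t buf[number_of_buf][k],
                    vec8x8 * restrict sums) {
    static const vec8x8 shift = {0,1,2,3,4,5,6,7};
    for (size_t i = 0; i < k; i++)
        sums[i] = (vec8x8){0};
    for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
        for (size_t i = 0; i < k; i++) {
            // broadcast the byte, shift each lane by its index, mask to the
            // low bit, and accumulate into the in-memory counters
            sums[i] += (buf[buf_index][i] >> shift) & 1;
        }
    }
}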
I think it'd be more efficient to do two bytes at a time (so unrolling the loop on i
by a factor of 2) and use 128-bit vector operations. I leave this as an exercise :)
It's not immediately clear to me whether this would end up being faster or slower than the lookup table. You might have to profile both on the target machine(s) of interest.
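For what it's worth, a rough, untested sketch of the two-bytes-at-a-time variant could look like the following (sumit2 and the sums16 layout are made up here; it assumes k is even, and sums16[i/2] holds the eight counts for byte i in its low half and for byte i+1 in its high half):

#include <stdint.h>
#include <stddef.h>

typedef uint8_t vec8x16 __attribute__((vector_size(16)));

void sumit2(uint8_t number_of_buf,
            uint8_t k,
            const uint8_t buf[number_of_buf][k],
            vec8x16 * restrict sums16) {
    static const vec8x16 shift = {0,1,2,3,4,5,6,7, 0,1,2,3,4,5,6,7};
    for (size_t i = 0; i < k; i += 2) {
        vec8x16 acc = (vec8x16){0};
        for (size_t buf_index = 0; buf_index < number_of_buf; buf_index++) {
            uint8_t b0 = buf[buf_index][i];
            uint8_t b1 = buf[buf_index][i + 1];
            // replicate each source byte across its own eight lanes
            vec8x16 v = {b0, b0, b0, b0, b0, b0, b0, b0,
                         b1, b1, b1, b1, b1, b1, b1, b1};
            acc += (v >> shift) & 1;
        }
        sums16[i / 2] = acc;
    }
}

Whether the compiler actually turns this into 128-bit loads and shifts (and whether it beats the 64-bit version or the lookup table) is exactly the kind of thing the profiling mentioned above would have to show.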