Home > Net >  C how to store Boolean data for fast bitwise AND operations
C how to store Boolean data for fast bitwise AND operations

Time:07-17

I have a database with Boolean values with (at compile-time) unknown size.

For example:

    Dtb[100][6] = { {0,1,0,1,1,0},
                    {1,0,0,1,1,0},
                    {0,0,1,0,1,0},
                    ...          }

I will initialize it once (while running) and will not change it afterwards.

Then, I want to check for a item i = {0,0,0,1,1,0}, whether it appears in each row.

For example:

    for (row in Dtb) {
        // check
        i == i & row
    }

So if an entry in i is true, whether the corresponding entry is true in the row, as well.

I am interested in way to store my database in C , so that these checks work very fast.

Yet, I considered to use std:vector<std::vector<bool>>, std:vector<std::vector<char>> (or not nested, saved columnwise), and I also came across boost::dynamic_bitset. On my research I found so far that

  • I should not use something like std::vector<bool>, because it is not a container
  • boost::dynamic_bitset needs less storage than std::vector<bool>
  • std::vector<char>is faster accessible, but it takes longer to initialize

As I could use the bitwise AND, I think, I do not need to access single bits, so boost::dynamic_bitset might be my best option?

Edit:

I must not use the std::array class, since I do not know the size of my database when compiling the code, as mentioned above.

The dimensions of the database can become big, so it should be possible to hold rows with about 100 values, as well.

CodePudding user response:

boost::dynamic_bitset<T> is just a fancy wrapper around std::vector<T>. Therefore performance comparisons between the two come down to how good you are at implementing your operation on vectors compared to the boost developers.

Unless you want to get your hands really dirty with micro-optimizing your comparison, dynamic_bitset is probably the better choice as it comes with handy functions such as is_subset which implement exactly what you seem to want.

Picking the integer type

The integer type T for the bitset can be chosen freely. Pick the largest reasonable type. Usually uint64_t or size_t unless you know you have fewer bits. The reasons are simple:

  1. Your comparison function is limited by instruction throughput. A larger type allows checking more bits per instruction. Therefore picking something small like unsigned char would limit you severely (or put you at the mercy of your compilers vectorizer which I wouldn't count on here)

  2. Compared to the overhead of the dynamic allocation and the bitset datatype itself, wasting a few bytes by picking a larger payload type is insignificant

vector<bool>

Internally, vector<bool> uses the same layout. There is nothing to be gained here. Its interface makes large set-like comparisons like the one you want slower than they need to be.

boolean matrix

Two observations:

  1. A few 100 bit isn't really a large number. That's only a couple 64 bit integers. Using a dynamic allocation for that is pretty wasteful and kills the locality of data
  2. From your example it seems like all rows are equally sized

In this case, you could consider a densely packed boolean matrix type instead. Writing one is simple enough. The improved prefetching alone may already make it worthwhile. Something like this:

struct BoolMatrix
{
  std::size_t rows, bits_per_row, ints_per_row;
  std::vector<std::uint64_t> data;

  BoolMatrix(std::size_t rows, std::size_t bits_per_row)
  : rows(rows),
    bits_per_row(bits_per_row),
    ints_per_row((bits_per_row   63) / 64),
    data(rows * ints_per_row)
  {}
  void set(std::size_t row, std::size_t col) noexcept
  {
    data[row * ints_per_row   col / 64] |= 1 << (col % 64);
  }
  std::size_t first_match(const std::uint64_t* subset) const noexcept
  {
    std::size_t i;
    for(i = 0; i < rows;   i) {
      std::size_t j;
      for(j = 0; j < ints_per_row;   j)
        if((data[i * ints_per_row   j] & subset[j]) != subset[j])
          break; // mismatch
      if(j == ints_per_row)
        return i;
    }
    return i; // no match. return N
  }
};
  •  Tags:  
  • c
  • Related