C - portable fast dot product between a float matrix and a sparse boolean value matrix


I am working on a spiking neural network project in C where spikes are boolean values. Right now I have built a custom bit matrix type to represent the spike matrices.

I frequently need the dot product of the bit matrix and a matrix of single-precision floats of the same size, so I am wondering how to speed this up.

I also need to do pointwise multiplication of the float matrix and the bit matrix later.

My plan right now is just to loop through with an if statement and a bit shift. I want to speed this up.

float current = 0.f;
for (int i = 0; i < n_elem; ++i, bit_vec >>= 1) {
    if (bit_vec & 1)                /* add the weight only when the spike bit is set */
        current += weights[i];
}

I don't necessarily need to use a bit vector, it could be represented in other ways too. I have seen other answers here, but they are hardware specific and I am looking for something that can be portable.

I am not using any BLAS functions either, mostly because I am never operating on two float matrices. Should I be?

Thanks.

CodePudding user response:

The bit_vec >>= 1 and current += weights[i] operations cause a loop-carried dependency that will certainly prevent the compiler from generating a fast implementation (and also prevent the processor from executing it efficiently). You can solve this by unrolling the loop. Additionally, most mainstream compilers are not smart enough to optimize out the condition and use a blend instruction, which is available on most architectures. Conditional branches are slow, especially when they cannot be easily predicted (which is certainly your case). You can use a multiplication instead, to help the compiler generate better instructions. Here is the result:

    /* Use a small block of independent accumulators to break the
       loop-carried dependency, and multiply by the bit (0 or 1)
       instead of branching. */
    enum { blockSize = 4 };
    float current[blockSize] = {0.f};
    int i;

    for (i = 0; i < n_elem - blockSize + 1; i += blockSize, bit_vec >>= blockSize)
        for (int j = 0; j < blockSize; ++j)
            current[j] += weights[i + j] * ((bit_vec >> j) & 1);

    /* Scalar tail loop for the last few elements. */
    for (; i < n_elem; ++i, bit_vec >>= 1)
        if (bit_vec & 1)
            current[0] += weights[i];

    /* Final reduction of the partial sums. */
    float sum = 0.f;
    for (int j = 0; j < blockSize; ++j)
        sum += current[j];

This code should be faster, assuming n_elem is relatively big. It is still far from efficient, though, since compilers like GCC and Clang fail to auto-vectorize it. This is a shame, because it would be several times faster with SIMD instructions (like SSE, AVX or Neon). That being said, this is exactly why people use non-portable code: to use efficient instructions manually, since compilers often fail to do so in non-trivial cases.
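For example, a minimal SSE2 sketch of the same reduction could look as follows. It assumes, for brevity, that n_elem is a multiple of 4 and that the bits fit in a single uint64_t; masked_sum_sse is just an illustrative name:

    #include <immintrin.h>
    #include <stdint.h>

    /* Sums weights[i] for every set bit i of bit_vec, 4 floats at a time.
       A scalar tail loop like the one above would handle leftover elements. */
    float masked_sum_sse(const float *weights, uint64_t bit_vec, int n_elem)
    {
        __m128 acc = _mm_setzero_ps();
        for (int i = 0; i < n_elem; i += 4, bit_vec >>= 4) {
            /* Expand the low 4 bits into a per-lane all-ones/all-zeros mask. */
            __m128i bits = _mm_set_epi32((bit_vec >> 3) & 1, (bit_vec >> 2) & 1,
                                         (bit_vec >> 1) & 1, bit_vec & 1);
            __m128 mask = _mm_castsi128_ps(_mm_cmpgt_epi32(bits, _mm_setzero_si128()));
            /* Keep only the selected weights and accumulate 4 lanes at once. */
            acc = _mm_add_ps(acc, _mm_and_ps(mask, _mm_loadu_ps(weights + i)));
        }
        /* Horizontal reduction of the 4 partial sums. */
        float tmp[4];
        _mm_storeu_ps(tmp, acc);
        return tmp[0] + tmp[1] + tmp[2] + tmp[3];
    }

An AVX or Neon version follows the same pattern with wider vectors, which is why hardware-specific code ends up significantly faster than what compilers currently produce from the portable loop.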
