C++ performance optimization for linear combination of large matrices?


I have a large tensor of floating point data with the dimensions 35k(rows) x 45(cols) x 150(slices) which I have stored in an armadillo cube container. I need to linearly combine all the 150 slices together in under 35 ms (a must for my application). The linear combination floating point weights are also stored in an armadillo container. My fastest implementation so far takes 70 ms, averaged over a window of 30 frames, and I don't seem to be able to beat that. Please note I'm allowed CPU parallel computations but not GPU.

I have tried multiple different ways of performing this linear combination but the following code seems to be the fastest I can get (70 ms) as I believe I'm maximizing the cache hit chances by fetching the largest possible contiguous memory chunk at each iteration.

Please note that Armadillo stores data in column major format. So in a tensor, it first stores the columns of the first channel, then the columns of the second channel, then third and so forth.
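
(As a small illustration of that layout, assuming Armadillo's usual dense, unpadded cube storage; at() is just a hypothetical helper equivalent to t(r, c, s):)

// slice s starts at offset s*rows*cols in the cube's single column-major buffer
inline float at(const arma::fcube &t, arma::uword r, arma::uword c, arma::uword s) {
    return t.memptr()[r + c * t.n_rows + s * t.n_rows * t.n_cols];  // same value as t(r, c, s)
}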

#include <armadillo>
#include <chrono>
#include <iostream>
using namespace std;

typedef std::chrono::system_clock Timer;
typedef std::chrono::duration<double> Duration;

int rows = 35000;
int cols = 45;
int slices = 150;
arma::fcube tensor(rows, cols, slices, arma::fill::randu);
arma::fvec w(slices, arma::fill::randu);

double overallTime = 0;
int window = 30;
for (int n = 0; n < window; n++) {

    Timer::time_point start = Timer::now();

    arma::fmat result(rows, cols, arma::fill::zeros);
    for (int i = 0; i < slices; i++)
        result += tensor.slice(i) * w(i);

    Timer::time_point end = Timer::now();
    Duration span = end - start;
    double t = span.count();
    overallTime += t;
    cout << "n = " << n << " --> t = " << t * 1000.0 << " ms" << endl;
}

cout << endl << "average time = " << overallTime * 1000.0 / window << " ms" << endl;

I need to optimize this code by at least 2x and I would very much appreciate any suggestions.

CodePudding user response:

First of all, I have to admit that I'm not familiar with the arma framework or the memory layout; least of all whether the syntax result += slice(i) * weight evaluates lazily.

The primary problem, and its solution, lies in the memory layout and in the memory-to-arithmetic computation ratio.

To say a += b*c is problematic because it needs to read both b and a, write a, and uses up to two arithmetic operations (two, if the architecture does not combine multiplication and accumulation).

If the memory layout is of the form float tensor[rows][columns][channels], the problem is converted to making rows * columns dot products of length channels, and it should be expressed as such.
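
As a sketch of that formulation (this assumes a hypothetical row-major float tensor[rows][cols][channels] layout, which is not what Armadillo actually stores; blend_dot is an illustrative name):

// One dot product of length 'channels' per output position; in this layout the
// channel values for a given position are contiguous in memory.
void blend_dot(float const *tensor, float const *w, float *result,
               int rows, int cols, int channels) {
    for (int p = 0; p < rows * cols; p++) {
        float const *src = tensor + (size_t)p * channels;
        float acc = 0.0f;
        for (int ch = 0; ch < channels; ch++)
            acc += src[ch] * w[ch];
        result[p] = acc;
    }
}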

If it's float tensor[c][h][w], it's better to unroll the loop to result += slice(i) + slice(i+1) + .... Reading four slices at a time reduces the memory transfers by 50%.

It might even be better to process the results in chunks of 4*N results (reading from all the 150 channels/slices) where N<16, so that the accumulators can be allocated explicitly or implicitly by the compiler to SIMD registers.

There's a possibility of a minor improvement by padding the slice count to a multiple of 4 or 8, by compiling with -ffast-math to enable fused multiply-accumulate (if available), and by multithreading.
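
For example, a possible build line (the file name and exact flag set are assumptions; adjust for your compiler and target):

g++ -O3 -march=native -ffast-math -fopenmp blend.cpp -o blend -larmadillo

Here -ffast-math allows the compiler to reorder and contract floating-point operations (e.g. into fused multiply-accumulate where the hardware has it), and -fopenmp enables OpenMP multithreading for loops annotated with the corresponding pragmas.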

The constraints indicate the need to perform about 13.5 GFLOPS, which is a reasonable number in terms of arithmetic (for many modern architectures), but it also means at least about 54 GB/s of memory bandwidth, which could be relaxed with fp16 or 16-bit fixed-point arithmetic.
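
For reference, a back-of-the-envelope derivation of those figures (assuming 4-byte floats, one multiply and one add per element, and the 35 ms budget):

35000 * 45 * 150                  ≈ 236.25e6 multiply-accumulates per frame
236.25e6 MACs * 2 ops / 0.035 s   ≈ 13.5 GFLOPS
236.25e6 MACs * 8 B   / 0.035 s   ≈ 54 GB/s   (reading one tensor element and one accumulator per MAC)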

EDIT

Knowing the memory order to be float tensor[150][45][35000], i.e. float tensor[kSlices][kCols * kRows], suggests to me to first try unrolling the outer loop by 4 streams (or maybe even 5, since 150 is not divisible by 4 and would require a special case for the excess).

void blend(int kCols, int kRows, float const *tensor, float *result, float const *w) {
    // ensure that cols*rows is a multiple of 4 (pad if necessary)
    // - allows the auto vectorizer to skip handling the 'excess' code where the data
    //   length mod simd width != 0
    // one could try even a SIMD width of 16*4, as clang 14
    // can further unroll the inner loop to 4 ymm registers
    auto const stride = (kCols * kRows + 3) & ~3;
    // try also s += 6, s += 3, or s += 4, which would require a dedicated inner loop (for s += 2)
    for (int s = 0; s < 150; s += 5) {
        auto src0 = tensor + s * stride;
        auto src1 = src0 + stride;
        auto src2 = src1 + stride;
        auto src3 = src2 + stride;
        auto src4 = src3 + stride;
        auto dst = result;
        for (int x = 0; x < stride; x++) {
            // clang should be able to optimize caching the weights
            // to registers outside the inner loop
            auto add = src0[x] * w[s] +
                       src1[x] * w[s + 1] +
                       src2[x] * w[s + 2] +
                       src3[x] * w[s + 3] +
                       src4[x] * w[s + 4];
            // clang should be able to optimize this comparison
            // out of the loop, generating two inner kernels
            if (s == 0) {
                dst[x] = add;
            } else {
                dst[x] += add;
            }
        }
    }
}
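
As it happens, for the sizes in the question 45 * 35000 is already a multiple of 4 (and of 8), so the padded stride equals the real slice size and the function above could be fed Armadillo's raw buffers directly. A usage sketch, assuming the rows/cols/tensor/w variables from the question:

// blend() overwrites every element on the s == 0 pass, so no zero-fill is needed
arma::fmat result(rows, cols, arma::fill::none);
blend(cols, rows, tensor.memptr(), result.memptr(), w.memptr());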

EDIT 2

Another starting point (before adding multithreading) would be to consider changing the layout to

float tensor[kCols][kRows][kSlices + kPadding]; // padding is optional

The downside now is that with kSlices = 150 all the weights can no longer fit in registers (and, secondly, kSlices is not a multiple of 4 or 8). Furthermore, the final reduction needs to be horizontal.

The upside is that reduction no longer needs to go through memory, which is a big thing with the added multithreading.

void blendHWC(float const *tensor, float const *w, float *dst, int n, int c) {
     // each thread will read from 4 positions in order
     // to share the weights -- finding the best distance
     // might need some iterations
     auto src0 = tensor;
     auto src1 = src0 + c;
     auto src2 = src1 + c;
     auto src3 = src2 + c;
     for (int i = 0; i < n / 4; i++) {
         vec8 acc0(0.0f), acc1(0.0f), acc2(0.0f), acc3(0.0f);
         // #pragma unroll?
         for (int j = 0; j < c; j += 8) {
             vec8 wj(w + j);
             acc0 += wj * vec8(src0 + j);
             acc1 += wj * vec8(src1 + j);
             acc2 += wj * vec8(src2 + j);
             acc3 += wj * vec8(src3 + j);
         }
         vec4 sum = horizontal_reduct(acc0, acc1, acc2, acc3);
         sum.store(dst); dst += 4;
         // advance to the next group of 4 positions
         src0 += 4 * c; src1 += 4 * c; src2 += 4 * c; src3 += 4 * c;
     }
}

These vec4 and vec8 are some custom SIMD classes, which map to SIMD instructions either through intrinsics, or by virtue of the compiler being able to compile using vec4 = float __attribute__((vector_size(16))); to efficient SIMD code.
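
For concreteness, here is a minimal sketch of what such a wrapper could look like with the GCC/Clang vector extension mentioned above (vec8 here is one possible hand-rolled implementation, not an existing library type):

// 8 packed floats; on x86 this typically maps to one AVX ymm register
typedef float v8sf __attribute__((vector_size(32)));

struct vec8 {
    v8sf v;
    explicit vec8(float x) : v{x, x, x, x, x, x, x, x} {}                 // broadcast
    explicit vec8(float const *p) { __builtin_memcpy(&v, p, sizeof v); }  // unaligned load
    vec8 &operator+=(vec8 o) { v += o.v; return *this; }
    friend vec8 operator*(vec8 a, vec8 b) { a.v *= b.v; return a; }
};
// vec4 and horizontal_reduct would follow the same pattern
// (vector_size(16) plus a pairwise add of the four accumulators).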

CodePudding user response:

As @hbrerkere suggested in the comment section, by using the -O3 flag and making the following changes, the performance improved considerably: the code now runs at 45 ms as opposed to the initial 70 ms (roughly a 35% reduction in runtime).

int lastStep = (slices / 4 - 1) * 4;
int i = 0;
while (i <= lastStep) {
    blendshapes += tensor.slice(i) * w_id(i)
                 + tensor.slice(i + 1) * w_id(i + 1)
                 + tensor.slice(i + 2) * w_id(i + 2)
                 + tensor.slice(i + 3) * w_id(i + 3);
    i += 4;
}
while (i < slices) {
    blendshapes += tensor.slice(i) * w_id(i);
    i++;
}

CodePudding user response:

Without having the actual code, I'm guessing that

 += tensor.slice(i) * w_id(i)

creates a temporary object and then adds it to the lhs. Yes, overloaded operators look nice, but I would write a function

addto(lhs, slice1, w1, slice2, w2, ... /* unroll to 4 */)

which translates to pure loops over the elements. It would surprise me if that doesn't buy you an extra factor.
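
A hedged sketch of what such a function might look like for the four-slice case (addto and its parameter names are illustrative, not an existing Armadillo API):

// Accumulate four weighted slices into lhs with a single pass over the elements,
// avoiding any intermediate matrix temporaries.
void addto(arma::fmat &lhs,
           const arma::fmat &s0, float w0, const arma::fmat &s1, float w1,
           const arma::fmat &s2, float w2, const arma::fmat &s3, float w3) {
    float *out = lhs.memptr();
    const float *p0 = s0.memptr(), *p1 = s1.memptr(),
                *p2 = s2.memptr(), *p3 = s3.memptr();
    const arma::uword n = lhs.n_elem;
    for (arma::uword i = 0; i < n; ++i)
        out[i] += p0[i] * w0 + p1[i] * w1 + p2[i] * w2 + p3[i] * w3;
}

The remaining slices (150 mod 4 = 2) would still need a scalar tail loop, just like in the accepted answer above.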
