C/C++ fast absolute difference between two series

Time:01-08

I am interested in generating efficient C/C++ code to get the differences between two time series. More precisely: the time series values are stored as uint16_t arrays with fixed and equal length == 128.

I am fine with either a pure C or a pure C++ implementation. My code examples are in C++.

My intentions are:

Let A, B and C be discrete time series of length l with a value type of uint16_t.
∀ n < l: C_n = |A_n - B_n|

What I can think of in pseudocode:

for index i:
 if a[i] > b[i]:
    c[i] = a[i] - b[i]
 else:
    c[i] = b[i] - a[i]

Or in C/C++:

for (uint8_t i = 0; i < 128; i++) {
    c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
}

But I really don't like the if/else statement in the loop. I am okay with looping, since this can be unrolled by the compiler. Something like:

void getBufDiff(const uint16_t (&a)[128], const uint16_t (&b)[128], uint16_t (&c)[128]) {
    // Compiler-specific unroll hint; compilers that do not know it ignore it or warn.
#pragma unroll 16
    for (uint8_t i = 0; i < 128; i++) {
        c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];
    }
}

What I am looking for is some 'magic code' that speeds up the if/else and gets me the absolute difference between the two unsigned values.

I am okay with a +/- 1 precision (in case this would allow some bit magic to happen). I am also okay with changing the data type to get faster results. And I am also okay with dropping the loop for something else.

So something like:

void getBufDiff(const uint16_t (&a)[128], const uint16_t (&b)[128], uint16_t (&c)[128]) {
    // Compiler-specific unroll hint; compilers that do not know it ignore it or warn.
#pragma unroll 16
    for (uint8_t i = 0; i < 128; i++) {
        c[i] = magic_code_for_abs_diff(a[i], b[i]);
    }
}

I tried XORing the two values, but that gives correct results for only one of the two cases.

EDIT 2:

I did a quick test of different approaches on my laptop.

For 250000000 entries this is the performance (256 rounds):

c[i] = a[i] > b[i] ? a[i] - b[i] : b[i] - a[i];  ~500ms
c[i] = std::abs(a[i] - b[i]);                    ~800ms
c[i] = ((a[i] - b[i]) + ((a[i] - b[i]) >> 15)) ^ (i >> 15) ~425ms
uint16_t tmp = (a[i] - b[i]); c[i] = tmp * ((tmp > 0) - (tmp < 0)); ~600ms
uint16_t ret[2] = { a[i] - b[i], b[i] - a[i] };c[i] = ret[a[i] < b[i]] ~900ms
c[i] = ((a[i] - b[i]) >> 31 | 1) * (a[i] - b[i]); ~375ms
c[i] = ((a[i] - b[i])) ^ ((a[i] - b[i]) >> 15) ~425ms

CodePudding user response:

Your problem is a good candidate for SIMD. GCC can do it automatically, here is a simplified example: https://godbolt.org/z/36nM8bYYv

void absDiff(const uint16_t* a, const uint16_t* b, uint16_t* __restrict__ c)
{
    for (uint8_t i = 0; i < 16; i++)
        c[i] = a[i] - b[i];
}

Note that I added __restrict__ to enable autovectorization, otherwise the compiler has to assume your arrays may overlap and it isn't safe to use SIMD (because some writes could change future reads in the loop).

I simplified it to just 16 at a time, and removed the absolute value for the sake of illustration. The generated assembly is:

    vld1.16 {q9}, [r0]!
    vld1.16 {q11}, [r1]!
    vld1.16 {q8}, [r0]
    vld1.16 {q10}, [r1]
    vsub.i16        q9, q9, q11
    vsub.i16        q8, q8, q10
    vst1.16 {q9}, [r2]!
    vst1.16 {q8}, [r2]
    bx      lr

That means it loads 8 integers at once from a, then from b, repeats that once, then does 8 subtracts at once, then again, then stores 8 values twice into c. Many fewer instructions than without SIMD.

Of course it requires benchmarking to see if this is actually faster on your system (after you add back the absolute value part, I suggest using your ?: approach which does not defeat autovectorization), but I expect it will be significantly faster.
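Putting the absolute value back in, a minimal sketch of the full 128-element version could look like the following. The names absDiff128 and kLen are made up for illustration, and __restrict__ is a GCC/Clang extension rather than standard C++. Whether the loop actually gets vectorized depends on the compiler, target and optimization flags, so it is worth checking the generated assembly.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical constant; the length 128 is taken from the question.
constexpr std::size_t kLen = 128;

// __restrict__ (GCC/Clang extension) promises that c does not overlap a or b.
void absDiff128(const std::uint16_t* a, const std::uint16_t* b,
                std::uint16_t* __restrict__ c)
{
    for (std::size_t i = 0; i < kLen; ++i) {
        // Both branches subtract the smaller value from the larger one,
        // so the result is never negative and always fits in uint16_t.
        c[i] = static_cast<std::uint16_t>(a[i] > b[i] ? a[i] - b[i]
                                                      : b[i] - a[i]);
    }
}
```

The loop index is a std::size_t here instead of uint8_t; that is a stylistic choice in this sketch, not something the answer above requires.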

CodePudding user response:

A fast abs (for two's complement integers) can be implemented as (x + (x >> N)) ^ (x >> N), where N is the bit width of the integer type minus 1, i.e. 15 in your case. That's a possible implementation of std::abs. Still, you can try it.

– answer by freakish
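As a rough sketch of that formula: uint16_t operands are promoted to int before the subtraction, and the difference of two uint16_t values can lie anywhere in [-65535, 65535], so the example below computes it in 32 bits and uses N = 31 instead of 15. The function names are made up for illustration, and the code assumes that >> on a negative value is an arithmetic shift (guaranteed since C++20, implementation-defined before that).

```cpp
#include <cstdint>

// Branchless absolute value of a 32-bit signed integer.
// Assumes an arithmetic right shift: mask is 0 for x >= 0 and -1 for x < 0.
// Not valid for INT32_MIN, which cannot occur for a uint16_t difference.
std::int32_t branchlessAbs(std::int32_t x)
{
    const std::int32_t mask = x >> 31;
    return (x + mask) ^ mask;
}

// Hypothetical wrapper for the question's use case.
std::uint16_t absDiffBranchless(std::uint16_t a, std::uint16_t b)
{
    // Widen first so that the full range [-65535, 65535] is representable.
    const std::int32_t d =
        static_cast<std::int32_t>(a) - static_cast<std::int32_t>(b);
    return static_cast<std::uint16_t>(branchlessAbs(d));
}
```

For a negative x the mask is -1, so x + mask is x - 1 and the XOR with -1 flips all bits, which yields -x. For a non-negative x the mask is 0 and x passes through unchanged.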

CodePudding user response:

Since you write "I am okay with a +/- 1 precision", you can use an XOR solution: instead of abs(x), do x ^ (x >> 15). This will give an off-by-1 result for negative values.

If you want to calculate the correct result even for negative values, use the other answer (with x >> 15 correction).

In any case, this XOR-trick only works if overflow is impossible. The compiler can't replace abs by code which uses XOR because of that.
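A small sketch of the XOR variant, under the assumption that both inputs are at most 32767 so that their difference fits into int16_t (the function name approxAbsDiff is made up for illustration). It also assumes an arithmetic right shift for negative values, which is guaranteed since C++20 and implementation-defined before that.

```cpp
#include <cstdint>

// Approximate |a - b|: exact when a >= b, one too small when a < b.
// Assumption: a and b are both <= 32767, so a - b fits in int16_t.
std::uint16_t approxAbsDiff(std::uint16_t a, std::uint16_t b)
{
    const std::int16_t d = static_cast<std::int16_t>(a - b);
    // d >> 15 is 0 for d >= 0 and -1 for d < 0, so the XOR either leaves
    // d unchanged or produces ~d, which equals -d - 1.
    return static_cast<std::uint16_t>(d ^ (d >> 15));
}
```

With these inputs approxAbsDiff(10, 3) gives 7, while approxAbsDiff(3, 10) gives 6 instead of 7. If the inputs can exceed 32767, the 16-bit difference can wrap around and the result is no longer within 1 of the true value.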
