I need to use NEON intrinsics (on AArch64) with slightly modified data from an array passed as a parameter of a function:
void scenario1(uint16x8_t* X) {
    uint16x8_t arrayTest01[4] = {
        {X[0][4], X[0][5], X[0][6], X[0][7], -X[0][0], -X[0][1], -X[0][2], -X[0][3]},
        {X[1][4], X[1][5], X[1][6], X[1][7], -X[1][0], -X[1][1], -X[1][2], -X[1][3]},
        {X[2][4], X[2][5], X[2][6], X[2][7], -X[2][0], -X[2][1], -X[2][2], -X[2][3]},
        {X[3][4], X[3][5], X[3][6], X[3][7], -X[3][0], -X[3][1], -X[3][2], -X[3][3]}
    };

    uint16x8_t arrayTest02[4];
    arrayTest02[0] = vextq_u16(X[0], vmulq_n_u16(X[0], -1), 4);
    arrayTest02[1] = vextq_u16(X[1], vmulq_n_u16(X[1], -1), 4);
    arrayTest02[2] = vextq_u16(X[2], vmulq_n_u16(X[2], -1), 4);
    arrayTest02[3] = vextq_u16(X[3], vmulq_n_u16(X[3], -1), 4);

    // Rest of code which uses arrayTest01 and/or arrayTest02
}
The idea is that both arrayTest01 and arrayTest02 are lookup tables populated from an array of structures outside the scenario1 function. The vectors in both arrayTest01 and arrayTest02 are half positive and half negative, modulo 65536.
- If I use arrayTest01, the generated assembly does some ANDs and negations, but I'm unsure whether that still holds after compiling with -O3 (it's hard to debug with -O3 and hit that breakpoint). I'm also not sure whether each element is loaded individually from memory during initialization.
- The operation vmulq_n_u16 multiplies by -1 (producing -X[i]), and vextq_u16 extracts the upper half of X[i] followed by the lower half of -X[i].
- The operations vmulq_n_u16 and vextq_u16 should execute in one cycle each, since they are SIMD instructions, but I'm still unsure whether they end up faster or slower than the plain initialization.
My concern is that both arrayTest01 and arrayTest02 will have thousands of entries and scenario1 will be called multiple times, so any execution time / cycle count I can save would help greatly.
Questions
Are the elements of the initialized array (arrayTest01) loaded from memory individually? If so, are the SIMD operations faster?
In general, which produces a faster execution time: the initialized array, or constructing the array using SIMD? (Again, the final arrays will have thousands of entries.)
Thank you!
CodePudding user response:
First off, 02 is faster ONLY because:
- auto-vectorization is very inefficient for 01 in this case
- the compilers happen to handle 02 rather well in this case
However, both are suboptimal.
How about this?
uint16x8_t arrayTest03[4];
int64x2x4_t temp, rslt;

temp = vld4q_s64((int64_t *)X);             // de-interleaving load: val[j] holds 64-bit chunks j and j+4
rslt.val[0] = temp.val[1];                  // upper halves of X[0] and X[2]
rslt.val[1] = vreinterpretq_s64_s16(vnegq_s16(vreinterpretq_s16_s64(temp.val[0])));  // negated lower halves
rslt.val[2] = temp.val[3];                  // upper halves of X[1] and X[3]
rslt.val[3] = vreinterpretq_s64_s16(vnegq_s16(vreinterpretq_s16_s64(temp.val[2])));  // negated lower halves
vst4q_s64((int64_t *)arrayTest03, rslt);    // interleaving store puts the chunks back in order
Six instructions total.
You should think outside the box. Don't get tied to data types and short code in C.
In particular, be explicit about memory loads and stores. You never know what kind of mess compilers generate.
BTW, all arrays should be aligned to 64 bytes for maximum performance.