I need to use NEON intrinsics (on AArch64) with slightly modified data from an array passed as a parameter of a function:
void scenario1(uint16x8_t* X) {
    uint16x8_t arrayTest01[4] = {
        {X[0][4], X[0][5], X[0][6], X[0][7], -X[0][0], -X[0][1], -X[0][2], -X[0][3]},
        {X[1][4], X[1][5], X[1][6], X[1][7], -X[1][0], -X[1][1], -X[1][2], -X[1][3]},
        {X[2][4], X[2][5], X[2][6], X[2][7], -X[2][0], -X[2][1], -X[2][2], -X[2][3]},
        {X[3][4], X[3][5], X[3][6], X[3][7], -X[3][0], -X[3][1], -X[3][2], -X[3][3]}
    };

    uint16x8_t arrayTest02[4];
    arrayTest02[0] = vextq_u16(X[0], vmulq_n_u16(X[0], -1), 4);
    arrayTest02[1] = vextq_u16(X[1], vmulq_n_u16(X[1], -1), 4);
    arrayTest02[2] = vextq_u16(X[2], vmulq_n_u16(X[2], -1), 4);
    arrayTest02[3] = vextq_u16(X[3], vmulq_n_u16(X[3], -1), 4);

    // Rest of code which uses arrayTest01 and/or arrayTest02
}
The idea is that both arrayTest01 and arrayTest02 are lookup tables populated from an array of structures outside the scenario1 function. The vectors in both arrayTest01 and arrayTest02 are half positive and half negative, modulo 65536.
- If I use arrayTest01, the generated assembly does some ANDs and negations, but I'm unsure whether that still holds after compiling with -O3 (it's hard to debug with -O3 and hit that breakpoint). I'm also not sure whether each element is loaded individually from memory during initialization.
- The operation vmulq_n_u16 multiplies by -1 (producing -X[i]), and vextq_u16 extracts the upper half of X[i] followed by the lower half of -X[i].
- The operations vmulq_n_u16 and vextq_u16 should execute in one cycle each, since they are SIMD instructions, but I'm still unsure whether they end up faster or slower than the plain initialization.
My concern is that both arrayTest01 and arrayTest02 will have thousands of entries and scenario1 will be called multiple times, so any execution time / cycle count I can save would help greatly.
Questions
Are the elements of the initialized array (arrayTest01) loaded from memory individually? If so, are the SIMD operations faster?
In general, which produces a faster execution time: the initialized array, or constructing the array using SIMD? (Again, the final arrays will have thousands of entries.)
Thank you!
CodePudding user response:
First off, 02 is faster ONLY because:
- auto-vectorization is very inefficient for 01 in this case
- the compilers happen to handle 02 rather well in this case
However, both are suboptimal.
How about this?
uint16x8_t arrayTest03[4];
int64x2x4_t temp, rslt;

temp = vld4q_s64((int64_t *)X);             // de-interleaving load: val[j] holds 64-bit chunks j and j+4
rslt.val[0] = temp.val[1];                  // upper halves of X[0] and X[2]
rslt.val[1] = vreinterpretq_s64_s16(vnegq_s16(vreinterpretq_s16_s64(temp.val[0])));  // negated lower halves
rslt.val[2] = temp.val[3];                  // upper halves of X[1] and X[3]
rslt.val[3] = vreinterpretq_s64_s16(vnegq_s16(vreinterpretq_s16_s64(temp.val[2])));  // negated lower halves
vst4q_s64((int64_t *)arrayTest03, rslt);    // interleaving store puts the chunks back in order
Six instructions total.
You should think outside the box. Don't get tied to data types and short code in C.
In particular, be explicit about memory loads and stores. You never know what kind of mess compilers generate.
BTW, all arrays should be aligned to 64 bytes for maximum performance.