I need to do a simple multiply-accumulate of two signed 8-bit arrays.
The routine runs every millisecond on an ARMv7 embedded device, and I am trying to speed it up a bit. I have already tried optimizing and enabling vector ops:
-mtune=cortex-a15.cortex-a7 -mfpu=neon-vfpv4 -ftree-vectorize -ffast-math -mfloat-abi=hard
This helped, but I am still running close to the edge.
This is the C code:
for (i = 4095; i >= 0; --i)
{
    accum += arr1[i] * arr2[i];
}
I am trying to use NEON intrinsics. This loop runs ~5 times faster, but I get different results. I am pretty sure I am not properly retrieving the accumulation, or it rolls over before I do. Any help/pointers are greatly appreciated. Any detailed docs would also be helpful.
for (int i = 256; i > 0; --i)
{
    int8x16_t vec16a = vld1q_s8(&arr1[index]);
    int8x16_t vec16b = vld1q_s8(&arr2[index]);
    vec16res = vmlaq_s8(vec16res, vec16a, vec16b);
    index += 16;
}
EDIT to post solution.
Thanks for the tips, everyone! I dropped down to 8x8 multiplies and now have a fast solution.
Using the code below I achieved a "fast enough" time. Not as fast as the 128-bit version, but good enough.
I added __builtin_prefetch() for the data and took a 10-pass average. NEON is substantially faster:
$ ./test 10
original code time ~ 30392 ns
optimized C time  ~ 8458 ns
NEON elapsed time ~ 3199 ns
int32_t sum = 0;
int16x8_t vecSum = vdupq_n_s16(0);
int8x8_t vec8a;
int8x8_t vec8b;
int32x4_t sum32x4;
int32x2_t sum32x2;

#pragma unroll
for (i = 512; i > 0; --i)
{
    vec8a = vld1_s8(&A[index]);
    vec8b = vld1_s8(&B[index]);
    /* widening multiply-accumulate: 8-bit x 8-bit products into 16-bit lanes */
    vecSum = vmlal_s8(vecSum, vec8a, vec8b);
    index += 8;
}

/* horizontal reduction: 8 x s16 -> 4 x s32 -> 2 x s32 -> scalar */
sum32x4 = vaddl_s16(vget_high_s16(vecSum), vget_low_s16(vecSum));
sum32x2 = vadd_s32(vget_high_s32(sum32x4), vget_low_s32(sum32x4));
sum = vget_lane_s32(vpadd_s32(sum32x2, sum32x2), 0);
CodePudding user response:
Your issue is likely overflow: an 8-bit accumulator wraps almost immediately, so you need to lengthen (widen) the lanes when you multiply-accumulate. On ARMv7 you want vmlal_s8, which widens the 8-bit products into 16-bit accumulator lanes. ARMv8 A64 adds vmlal_high_s8, which lets you stay in 128-bit vectors for a further speed-up.
As mentioned in the comments, it is well worth seeing what auto-vectorization does with the right -O options / #pragma unroll, and studying the generated assembly on Compiler Explorer (godbolt). Unrolling by hand often gives speed-ups as well.
There are lots more valuable optimization tips in the Arm Neon programming resources.