I need to do a simple multiply-accumulate of two signed 8-bit arrays.
The routine runs every millisecond on an ARMv7 embedded device, and I am trying to speed it up a bit. I have already tried optimizing and enabling vector ops:
-mtune=cortex-a15.cortex-a7 -mfpu=neon-vfpv4 -ftree-vectorize -ffast-math -mfloat-abi=hard
This helped, but I am still running close to the edge.
This is the C code:
for (i = 4095; i >= 0; --i)
{
    accum += arr1[i] * arr2[i];
}
I am trying to use NEON intrinsics. This loop runs ~5 times faster, but I get different results. I am pretty sure I am not properly retrieving the accumulation, or it rolls over before I do. Any help/pointers are greatly appreciated. Any detailed docs would also be helpful.
for (int i = 256; i > 0; --i)
{
    int8x16_t vec16a = vld1q_s8(&arr1[index]);
    int8x16_t vec16b = vld1q_s8(&arr2[index]);
    vec16res = vmlaq_s8(vec16res, vec16a, vec16b);
    index += 16;
}
EDIT to post solution.
Thanks for the tips, everyone! I dropped down to 8x8 multiplies and now have a fast solution.
Using the code below I achieved a "fast enough" time. Not as fast as the 128-bit version, but good enough.
I added __builtin_prefetch() for the data and took a 10-pass average. NEON is substantially faster:
$ ./test 10
original code time ~ 30392 ns
optimized C time  ~ 8458 ns
NEON elapsed time ~ 3199 ns
int32_t sum = 0;
int16x8_t vecSum = vdupq_n_s16(0);
int8x8_t vec8a;
int8x8_t vec8b;
int32x4_t sum32x4;
int32x2_t sum32x2;

#pragma unroll
for (i = 512; i > 0; --i)
{
    vec8a = vld1_s8(&A[index]);
    vec8b = vld1_s8(&B[index]);
    /* widening multiply-accumulate: 8-bit x 8-bit products into 16-bit lanes */
    vecSum = vmlal_s8(vecSum, vec8a, vec8b);
    index += 8;
}

/* horizontal reduction: 8 x s16 -> 4 x s32 -> 2 x s32 -> scalar */
sum32x4 = vaddl_s16(vget_high_s16(vecSum), vget_low_s16(vecSum));
sum32x2 = vadd_s32(vget_high_s32(sum32x4), vget_low_s32(sum32x4));
sum = vget_lane_s32(vpadd_s32(sum32x2, sum32x2), 0);
CodePudding user response:
Your issue is likely overflow: an 8-bit accumulator wraps almost immediately, so you need to lengthen (widen) the lanes when you multiply-accumulate. On ARMv7 you want vmlal_s8, which widens the 8-bit products into 16-bit accumulator lanes. ARMv8 A64 adds vmlal_high_s8, which lets you stay in 128-bit vectors for a further speed-up.
As mentioned in the comments, it is well worth seeing what auto-vectorization does with the right -O options / #pragma unroll, and studying the generated assembly on Compiler Explorer (godbolt). Unrolling by hand often gives speed-ups as well.
There are lots more valuable optimization tips in the Arm Neon programming resources.