ARM Neon intrinsics, addition of two vectors

Time:11-02

I have a very simple C function that pairwise adds two arrays of ints:

void add_arrays(int* a, int* b, int* target, int size) {
    for(int i=0; i<size; i++) {
        target[i] = a[i] + b[i];
    }
}

I see that on ARM the Neon intrinsics are available in <arm_neon.h>, and you are supposed to be able to add, multiply, etc. with vectors, but all the examples I saw are super convoluted. Can somebody show how to perform something simple like pairwise addition using ARM Neon intrinsics?

UPDATE: My terminology is wrong; I am looking to implement element-wise addition.

Thanks!

CodePudding user response:

First off, as Jake mentioned, what this code does is not pairwise addition. Pairwise addition would be adding adjacent pairs; something like

void add_arrays(int* a, int* target, int size) {
    for(int i=0; i<size; i++) {
        target[i] = a[i * 2] + a[(i * 2) + 1];
    }
}

That can be done using NEON, but I'm going to assume your code is right but your terminology is wrong for the rest of this answer. If that's reversed, this answer plus looking at the docs for vpaddq_s32 (or maybe vpaddl_s32) should get you most of the way there.
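In case the adjacent-pairs version is what you actually want, here is a minimal sketch of it. It uses vpadd_s32 on 64-bit vectors rather than vpaddq_s32, since as far as I know the q variant is only available on AArch64. The name add_adjacent_pairs and the int32_t signature are my own choices for illustration, and the scalar loop doubles as a fallback when NEON is not available.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical helper: target[i] = a[2i] + a[2i+1] for i in [0, size).
   `a` must hold 2 * size elements. */
void add_adjacent_pairs(const int32_t* a, int32_t* target, int size) {
    assert(size >= 0);
    int i = 0;
#if defined(__ARM_NEON)
    /* Each iteration consumes 4 inputs and produces 2 outputs. */
    for (; i + 2 <= size; i += 2) {
        int32x2_t lo = vld1_s32(&a[2 * i]);     /* a[2i],   a[2i+1] */
        int32x2_t hi = vld1_s32(&a[2 * i + 2]); /* a[2i+2], a[2i+3] */

        /* vpadd_s32 returns { lo[0] + lo[1], hi[0] + hi[1] } */
        vst1_s32(&target[i], vpadd_s32(lo, hi));
    }
#endif
    /* Scalar path: the leftover output, or everything on non-NEON targets. */
    for (; i < size; i++) {
        target[i] = a[2 * i] + a[2 * i + 1];
    }
}
```

Calling it with a = {1, 2, 3, 4, 5, 6} and size 3 fills target with 3, 7, 11.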

For simplicity, I'm going to assume that size is a multiple of 4 (since 4 32-bit elements = 1 128-bit vector), so:

void add_arrays(int* a, int* b, int* target, int size) {
    for(int i=0; i<size; i += 4) {
        target[  i  ] = a[  i  ] + b[  i  ];
        target[i + 1] = a[i + 1] + b[i + 1];
        target[i + 2] = a[i + 2] + b[i + 2];
        target[i + 3] = a[i + 3] + b[i + 3];
    }
}

Now let's add some NEON intrinsics:

#include <arm_neon.h>

void add_arrays(int* a, int* b, int* target, int size) {
    for(int i=0; i<size; i += 4) {
        /* Load data into NEON register */
        int32x4_t av = vld1q_s32(&(a[i]));
        int32x4_t bv = vld1q_s32(&(b[i]));

        /* Perform the addition */
        int32x4_t targetv = vaddq_s32(av, bv);

        /* Store the result */
        vst1q_s32(&(target[i]), targetv);
    }
}

That's it. You can see the difference in generated code at https://godbolt.org/z/W6KPv186x.
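The NEON version above relies on size being a multiple of 4. If that does not hold, one common approach is to let the vector loop handle as many full groups of 4 as it can and finish the remainder with a plain scalar loop. The following is a sketch under that assumption; the name add_arrays_any_size and the int32_t/const signature are mine, not part of the question, and the preprocessor guard just lets the same file build on targets without NEON.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Hypothetical variant of add_arrays that accepts any non-negative size. */
void add_arrays_any_size(const int32_t* a, const int32_t* b,
                         int32_t* target, int size) {
    assert(size >= 0);
    int i = 0;
#if defined(__ARM_NEON)
    /* Vector loop: runs only while a full group of 4 elements remains. */
    for (; i + 4 <= size; i += 4) {
        int32x4_t av = vld1q_s32(&a[i]);
        int32x4_t bv = vld1q_s32(&b[i]);
        vst1q_s32(&target[i], vaddq_s32(av, bv));
    }
#endif
    /* Scalar tail: at most 3 leftover elements when NEON is available,
       or the whole array when it is not. */
    for (; i < size; i++) {
        target[i] = a[i] + b[i];
    }
}
```

With size = 7, for example, the vector loop covers elements 0 to 3 and the scalar loop covers elements 4 to 6.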
