NEON assembly code requires more cycles on Cortex-A72 vs Cortex-A53-CodePudding

I am benchmarking an ARMv7 NEON code on two ARMv8 processors in AArch32 mode: the Cortex-A53 and Cortex-A72. I am using the Raspberry Pi 3B and Raspberry Pi 4B boards with 32-bit Raspbian Buster.

My benchmarking method is as follows:

uint32_t x[4];
uint32_t t0 = ccnt_read();
for(int i = 0; i < 1000; i  )
    armv7_neon(x);
uint32_t t1 = ccnt_read();
printf("%u\n",(t1-t0)/1000);

where the armv7_neon function is defined by the following instructions:

.global armv7_neon
.func armv7_neon, armv7_neon
.type armv7_neon, %function
armv7_neon:
    vld1.32 {q0}, [r0]
    vmvn.i32 q0, q0
    vmov.i32 q8, #0x11111111
    vshr.u32 q1, q0, #2
    vshr.u32 q2, q0, #3
    vmov.i32 q9, #0x20202020
    vand q1, q1, q2
    vmov.i32 q10, #0x40404040
    vand q1, q1, q8
    vmov.i32 q11, #0x80808080
    veor q0, q0, q1
    vmov.i32 q12, #0x02020202
    vshl.u32 q1, q0, #5
    vshl.u32 q2, q0, #1
    vmov.i32 q13, #0x04040404
    vand q1, q1, q2
    vmov.i32 q14, #0x08080808
    vand q3, q1, q9
    vshl.u32 q1, q0, #5
    vshl.u32 q2, q0, #4
    veor q0, q0, q3
    vand q1, q1, q2
    vmov.i32 q15, #0x32323232
    vand q1, q1, q10
    vmov.i32 q8, #0x01010101
    veor q0, q0, q1
    vshl.u32 q1, q0, #2
    vshl.u32 q2, q0, #1
    vand q1, q1, q2
    vand q3, q1, q11
    vshr.u32 q1, q0, #2
    vshl.u32 q2, q0, #1
    veor q0, q0, q3
    vand q1, q1, q2
    vand q1, q1, q12
    veor q0, q0, q1
    vshr.u32 q1, q0, #5
    vshl.u32 q2, q0, #1
    vand q1, q1, q2
    vand q3, q1, q13
    vshr.u32 q1, q0, #1
    vshr.u32 q2, q0, #2
    veor q0, q0, q3
    vand q1, q1, q2
    vand q1, q1, q14
    veor q0, q0, q1
    vmvn.i32 q0, q0
    vand q1,  q0, q14
    vand q2,  q0, q15
    vand q3,  q0, q8
    vand q8,  q0, q11
    vand q9,  q0, q10
    vand q10, q0, q13
    vshl.u32 q1,  q1,  #1
    vshl.u32 q2,  q2,  #2
    vshl.u32 q3,  q3,  #5
    vshr.u32 q8,  q8,  #6
    vshr.u32 q9,  q9,  #4
    vshr.u32 q10, q10, #2
    vorr q0, q1, q2
    vorr q1, q3, q8
    vorr q2, q9, q10
    vorr q3, q0, q1
    vorr q0, q3, q2
    vst1.32 {q0}, [r0]
    bx lr
.endfunc

The code is simply compiled with the following options:

gcc -O3 -mfpu=neon-fp-armv8 -mcpu=cortex-a53
gcc -O3 -mfpu=neon-fp-armv8 -mcpu=cortex-a72

I get 74 and 99 cycles on the Cortex-A53 and Cortex-A72, respectively. I've come across this blogpost discussing some performance issues on the Cortex-A72 for tbl instructions, but the code I'm running does not contain any.

Where could this gap come from?

CodePudding user response：

I compared the instruction cycle timing of A72 and A55 (nothing available on A53):

vshl and vshr:

A72: throughput(IPC) 1, latency 3, executes on F1 pipeline only
A55: throughput(IPC) 2, latency 2, executes on both pipelines (restricted though)

That pretty much nails it since there are many of them in your code.

There are some drawbacks in your assembly code, too:

vadd has less restrictions and better throughput/latency than vshl. You should replace all vshl by immediate 1 with vadd. Barrel shifters are more costly than arithmetic on SIMD.
You should not repeat the same instructions unnecesarily (<<5)
The second vmvn is unnecessary. You can replace all the following vand with vbic instead.
Compilers generate acceptable machine codes as long as no permutations are involved. Hence I'd write the code in neon intrinsics in this case.

#include <arm_neon.h>

void armv7_neon(uint32_t * pData) {
    const uint32x4_t cx11 = vdupq_n_u32(0x11111111);
    const uint32x4_t cx20 = vdupq_n_u32(0x20202020);
    const uint32x4_t cx40 = vdupq_n_u32(0x40404040);
    const uint32x4_t cx80 = vdupq_n_u32(0x80808080);
    const uint32x4_t cx02 = vdupq_n_u32(0x02020202);
    const uint32x4_t cx04 = vdupq_n_u32(0x04040404);
    const uint32x4_t cx08 = vdupq_n_u32(0x08080808);
    const uint32x4_t cx32 = vdupq_n_u32(0x32323232);
    const uint32x4_t cx01 = vdupq_n_u32(0x01010101);

    uint32x4_t temp1, temp2, temp3, temp4, temp5, temp6;
    uint32x4_t in = vld1q_u32(pData);

    in = vmvnq_u32(in);

    temp1 = (in >> 2) & (in >> 3);
    temp1 &= cx11;
    in ^= temp1;

    temp1 = (in << 5) & (in   in);
    temp1 &= cx20;
    temp2 = (in << 5) & (in << 4);
    temp2 &= cx40;
    in ^= temp1;
    in ^= temp2;

    temp1 = (in << 2) & (in   in);
    temp1 &= cx80;
    temp2 = (in >> 2) & (in >> 1);
    temp2 &= cx02;
    in ^= temp1;
    in ^= temp2;

    temp1 = (in >> 5) & (in   in);
    temp1 &= cx04;
    temp2 = (in >> 1) & (in >> 2);
    temp2 &= cx08;
    in ^= temp1;
    in ^= temp2;

    temp1 = vbicq_u32(cx08, in);
    temp2 = vbicq_u32(cx32, in);
    temp3 = vbicq_u32(cx01, in);
    temp4 = vbicq_u32(cx80, in);
    temp5 = vbicq_u32(cx40, in);
    temp6 = vbicq_u32(cx04, in);

    temp1  = temp1;
    temp2 <<= 2;
    temp3 <<= 5;
    temp4 >>= 6;
    temp5 >>= 4;
    temp6 >>= 2;

    temp1 |= temp2 | temp3 | temp4 | temp5 | temp6;

    vst1q_u32(pData, temp1);
}

godbolt link

You can see that the -mcpu option makes a clear difference here.

But GCC never disappoints: It refuses to use vbic even though I explicitly ordered it to (the same for Clang. I HATE them both)

I'd take the disassembly, remove the second vmvn, and replace all the vand attached with vbic for best performance.

Keep in mind that writing in assembly doesn't automatically make the code run faster, and newer architectures don't necessarily come with more favorable ICT: A72 is largely inferior to A53 when it comes to ICT.

PS: With -mcpu=cortex-a53 option the generated code is identical to a55's. We can assume A55 is just an extension to A53 by armv8.2 ISA.