I am benchmarking an ARMv7 NEON code on two ARMv8 processors in AArch32 mode: the Cortex-A53 and Cortex-A72. I am using the Raspberry Pi 3B and Raspberry Pi 4B boards with 32-bit Raspbian Buster.
My benchmarking method is as follows:
uint32_t x[4];
uint32_t t0 = ccnt_read();
for(int i = 0; i < 1000; i )
armv7_neon(x);
uint32_t t1 = ccnt_read();
printf("%u\n",(t1-t0)/1000);
where the armv7_neon function is defined by the following instructions:
.global armv7_neon
.func armv7_neon, armv7_neon
.type armv7_neon, %function
armv7_neon:
vld1.32 {q0}, [r0]
vmvn.i32 q0, q0
vmov.i32 q8, #0x11111111
vshr.u32 q1, q0, #2
vshr.u32 q2, q0, #3
vmov.i32 q9, #0x20202020
vand q1, q1, q2
vmov.i32 q10, #0x40404040
vand q1, q1, q8
vmov.i32 q11, #0x80808080
veor q0, q0, q1
vmov.i32 q12, #0x02020202
vshl.u32 q1, q0, #5
vshl.u32 q2, q0, #1
vmov.i32 q13, #0x04040404
vand q1, q1, q2
vmov.i32 q14, #0x08080808
vand q3, q1, q9
vshl.u32 q1, q0, #5
vshl.u32 q2, q0, #4
veor q0, q0, q3
vand q1, q1, q2
vmov.i32 q15, #0x32323232
vand q1, q1, q10
vmov.i32 q8, #0x01010101
veor q0, q0, q1
vshl.u32 q1, q0, #2
vshl.u32 q2, q0, #1
vand q1, q1, q2
vand q3, q1, q11
vshr.u32 q1, q0, #2
vshl.u32 q2, q0, #1
veor q0, q0, q3
vand q1, q1, q2
vand q1, q1, q12
veor q0, q0, q1
vshr.u32 q1, q0, #5
vshl.u32 q2, q0, #1
vand q1, q1, q2
vand q3, q1, q13
vshr.u32 q1, q0, #1
vshr.u32 q2, q0, #2
veor q0, q0, q3
vand q1, q1, q2
vand q1, q1, q14
veor q0, q0, q1
vmvn.i32 q0, q0
vand q1, q0, q14
vand q2, q0, q15
vand q3, q0, q8
vand q8, q0, q11
vand q9, q0, q10
vand q10, q0, q13
vshl.u32 q1, q1, #1
vshl.u32 q2, q2, #2
vshl.u32 q3, q3, #5
vshr.u32 q8, q8, #6
vshr.u32 q9, q9, #4
vshr.u32 q10, q10, #2
vorr q0, q1, q2
vorr q1, q3, q8
vorr q2, q9, q10
vorr q3, q0, q1
vorr q0, q3, q2
vst1.32 {q0}, [r0]
bx lr
.endfunc
The code is simply compiled with the following options:
gcc -O3 -mfpu=neon-fp-armv8 -mcpu=cortex-a53
gcc -O3 -mfpu=neon-fp-armv8 -mcpu=cortex-a72
I get 74 and 99 cycles on the Cortex-A53 and Cortex-A72, respectively. I've come across this blogpost discussing some performance issues on the Cortex-A72 for tbl instructions, but the code I'm running does not contain any.
Where could this gap come from?
CodePudding user response:
I compared the instruction cycle timing of A72 and A55 (nothing available on A53):
vshl
and vshr
:
A72:
throughput(IPC) 1, latency 3, executes on F1 pipeline only
A55:
throughput(IPC) 2, latency 2, executes on both pipelines (restricted though)
That pretty much nails it since there are many of them in your code.
There are some drawbacks in your assembly code, too:
vadd
has less restrictions and better throughput/latency thanvshl
. You should replace allvshl
by immediate 1 withvadd
. Barrel shifters are more costly than arithmetic on SIMD.- You should not repeat the same instructions unnecesarily (
<<5
) - The second
vmvn
is unnecessary. You can replace all the followingvand
withvbic
instead. - Compilers generate acceptable machine codes as long as no permutations are involved. Hence I'd write the code in neon intrinsics in this case.
#include <arm_neon.h>
void armv7_neon(uint32_t * pData) {
const uint32x4_t cx11 = vdupq_n_u32(0x11111111);
const uint32x4_t cx20 = vdupq_n_u32(0x20202020);
const uint32x4_t cx40 = vdupq_n_u32(0x40404040);
const uint32x4_t cx80 = vdupq_n_u32(0x80808080);
const uint32x4_t cx02 = vdupq_n_u32(0x02020202);
const uint32x4_t cx04 = vdupq_n_u32(0x04040404);
const uint32x4_t cx08 = vdupq_n_u32(0x08080808);
const uint32x4_t cx32 = vdupq_n_u32(0x32323232);
const uint32x4_t cx01 = vdupq_n_u32(0x01010101);
uint32x4_t temp1, temp2, temp3, temp4, temp5, temp6;
uint32x4_t in = vld1q_u32(pData);
in = vmvnq_u32(in);
temp1 = (in >> 2) & (in >> 3);
temp1 &= cx11;
in ^= temp1;
temp1 = (in << 5) & (in in);
temp1 &= cx20;
temp2 = (in << 5) & (in << 4);
temp2 &= cx40;
in ^= temp1;
in ^= temp2;
temp1 = (in << 2) & (in in);
temp1 &= cx80;
temp2 = (in >> 2) & (in >> 1);
temp2 &= cx02;
in ^= temp1;
in ^= temp2;
temp1 = (in >> 5) & (in in);
temp1 &= cx04;
temp2 = (in >> 1) & (in >> 2);
temp2 &= cx08;
in ^= temp1;
in ^= temp2;
temp1 = vbicq_u32(cx08, in);
temp2 = vbicq_u32(cx32, in);
temp3 = vbicq_u32(cx01, in);
temp4 = vbicq_u32(cx80, in);
temp5 = vbicq_u32(cx40, in);
temp6 = vbicq_u32(cx04, in);
temp1 = temp1;
temp2 <<= 2;
temp3 <<= 5;
temp4 >>= 6;
temp5 >>= 4;
temp6 >>= 2;
temp1 |= temp2 | temp3 | temp4 | temp5 | temp6;
vst1q_u32(pData, temp1);
}
You can see that the -mcpu
option makes a clear difference here.
But GCC never disappoints: It refuses to use vbic
even though I explicitly ordered it to (the same for Clang. I HATE them both)
I'd take the disassembly, remove the second vmvn
, and replace all the vand
attached with vbic
for best performance.
Keep in mind that writing in assembly doesn't automatically make the code run faster, and newer architectures don't necessarily come with more favorable ICT: A72 is largely inferior to A53 when it comes to ICT.
PS: With -mcpu=cortex-a53
option the generated code is identical to a55's. We can assume A55 is just an extension to A53 by armv8.2
ISA.