I need to copy large amounts of memory (on the order of 47k) (example, from a USB buffer to a more permanent buffer).
This is using an ARM Cortex A8.
(The ARM has the NEON code.)
The ARM NEON instruction can copy 4 32-bit elements at a time (per instruction).
The ARM LDM and STM instructions can load and store (copy) more than 4 registers at a time (per instruction).
Questions:
Which is more efficient for copying large amounts (e.g. 47k) of memory, the ARM NEON instruction or the ARM LDM and STM instructions? (I don't have benchmarking tools available; this is on an embedded system).
What is the advantage of the ARM NEON instructions for copying memory?
The project is primarily C language, but also has some assembly language. Is there a method to suggest to the compiler to use ARM NEON or the LDM/STM instructions, without optimizations? (We are launching code without optimizations so there are no differences when the product is returned. There is a possibility that optimization can be responsible for issues in the product.)
Tools:
ARM Cortex A8 processor
IAR Electronic Workbench IDE & Compiler.
Development on Windows 10 PC, to remote embedded ARM processor (via JTAG).
CodePudding user response:
Neon has the advantage of unaligned load and store, but it consumes more power.
And since you are copying form the USB buffer to a permant one where you have full control over alignment and size, it would be better without neon, because memory speed is the same.
The standard memcpy
most probably already utilizes neon (it depends on the BSP), hence I'd write a mini version utilizing ldrd
and strd
which is slightly faster than ldm
and stm
.
.balign 64
push {r4-r11}
sub r1, r1, #8
sub r0, r0, #8
b 1f
.balign 64
1:
ldrd r4, r5, [r1, #8]
ldrd r6, r7, [r1, #16]
ldrd r8, r9, [r1, #24]
ldrd r10, r11, [r1, #32]!
subs r2, r2, #32
strd r4, r5, [r0, #8]
strd r6, r7, [r0, #16]
strd r8, r9, [r0, #24]
strd r10, r11, [r0, #32]!
bgt 1b
.balign 16
pop {r4-r11}
bx lr
I think you have no problem making the buffer size a multiple of 32, and both buffers aligned to 64 bytes(cache line length) or even better, 4096 bytes (page size).