Best process to optimize C code for multiple architectures


I am currently optimizing a piece of C code with a lot of loops that add and multiply two-dimensional float vectors. The code is so slow that I cannot process my data in real time on ARM Cortex-M, or even on ARM Cortex-A in a low-power CPU mode. I am close to being fast enough on Cortex-A, but on Cortex-M... I will also need to run this code on a lot of different architectures and environments.

This is the first time I have needed to deeply optimize an algorithm to make it real-time. I have found a lot of papers/articles about loop optimization and vectorization to help me with this task. I am also exploring multi-architecture solutions such as the OpenBLAS library.

The problem is that my two ARM environments are quite painful to work with: iterating, rebuilding, deploying the code and measuring the performance is quite a slow process.

Any advice to help me accelerate this process?

  • Should I target cross-architecture optimization first, or optimization for a specific target?
  • Is it a good idea to iterate on my x86 host and test my optimizations on the targets later? I am afraid that the best optimizations will only work for a specific architecture.
  • Could I perhaps use an emulator like QEMU to iterate more quickly? Does that make sense?
  • Is analyzing the assembly code without running it the best method to check the result of an optimization and the improvement in performance? I have tried making some minor modifications and comparing the GCC -S output, but the output changes a lot (a minimal kernel for this kind of comparison is sketched after this list).
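
For reference, a minimal, self-contained kernel of the kind being compared might look like the sketch below. The function name, sizes and compiler invocations in the comments are only placeholders/examples, not the real code:

```c
/* vec_kernel.c - toy kernel representative of the real loops.
 *
 * Example invocations to compare generated assembly (flags are illustrative):
 *   x86 host   : gcc -O3 -S vec_kernel.c -o vec_kernel_x86.s
 *   Cortex-A   : aarch64-linux-gnu-gcc -O3 -S vec_kernel.c -o vec_kernel_a.s
 *   Cortex-M4F : arm-none-eabi-gcc -O3 -mcpu=cortex-m4 -mfpu=fpv4-sp-d16 \
 *                -mfloat-abi=hard -S vec_kernel.c -o vec_kernel_m.s
 */
void vec_mac(float *restrict out,
             const float *restrict a,
             const float *restrict b,
             int n)
{
    /* out[i] += a[i] * b[i]: the add/multiply pattern from the real code. */
    for (int i = 0; i < n; ++i)
        out[i] += a[i] * b[i];
}
```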

CodePudding user response:

Since this is about processing float vectors, it is probably worth checking whether you can rewrite the algorithms in terms of BLAS or even LAPACK primitives.

This will not only remove the loops but also let you use highly optimized BLAS libraries that are available for many CPU architectures.

For ARM there are the Arm Performance Libraries, which include BLAS routines (among other math routines).

So to answer your question: it is probably best to call functions through a standard interface and deploy optimized implementations of those functions per target.
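
As a rough illustration (the right routine depends on how your loops are actually structured; `cblas_sgemm` here is just one plausible mapping, and the header name assumes an OpenBLAS-style CBLAS interface), a hand-written float matrix multiply can be replaced by a single CBLAS call:

```c
#include <cblas.h>  /* CBLAS interface, e.g. from OpenBLAS or Arm Performance Libraries */

/* Hand-written version: C = A * B with MxK and KxN float matrices. */
static void matmul_naive(int M, int N, int K,
                         const float *A, const float *B, float *C)
{
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k)
                acc += A[i * K + k] * B[k * N + j];
            C[i * N + j] = acc;
        }
}

/* BLAS version: same result, but the vendor-optimized kernel does the work. */
static void matmul_blas(int M, int N, int K,
                        const float *A, const float *B, float *C)
{
    /* C = 1.0 * A * B + 0.0 * C, row-major storage, no transposes. */
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K,
                1.0f, A, K,
                B, N,
                0.0f, C, N);
}
```

The same pattern applies to the plain vector add/multiply loops via routines like `cblas_saxpy`; linking against OpenBLAS or the Arm Performance Libraries then selects an implementation tuned for each architecture without changing the calling code.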

CodePudding user response:

QEMU is not a cycle-accurate simulator at all. It doesn't even try to model performance; it just emulates as quickly as it can.

Probably your best bet is to read up on the major bottlenecks of the slowest targets you care about, like Cortex-M, especially if you're using a part without caches or SIMD; that is going to be a problem if you need to do some FP heavy lifting. If your workload needs more FP throughput than the theoretical maximum of the target chip, you need algorithmic changes before it's worth benchmarking anything. Or you need to choose a more capable microcontroller; some of the higher-end Cortex-M parts (e.g. Cortex-M7) have instruction/data caches, and the newest ones (Cortex-M55/M85) offer the Helium/MVE vector extension rather than NEON.
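
To make the "theoretical max" point concrete, here is a back-of-envelope budget check. Every number in it (clock, FLOPs per cycle, sample rate, operations per sample) is a made-up placeholder you would replace with your own figures:

```c
#include <stdio.h>

int main(void)
{
    /* Placeholder target: a Cortex-M4F-class part (numbers are illustrative only). */
    const double clock_hz         = 100e6;  /* 100 MHz core clock             */
    const double flops_per_cycle  = 1.0;    /* ~1 single-precision FLOP/cycle */
    const double peak_flops       = clock_hz * flops_per_cycle;

    /* Placeholder workload: the real-time processing requirement. */
    const double sample_rate_hz   = 48e3;   /* samples per second             */
    const double flops_per_sample = 4000;   /* adds + multiplies per sample   */
    const double needed_flops     = sample_rate_hz * flops_per_sample;

    printf("peak   : %.1f MFLOP/s\n", peak_flops / 1e6);
    printf("needed : %.1f MFLOP/s\n", needed_flops / 1e6);
    printf(needed_flops > peak_flops
           ? "-> over budget even at 100%% FPU utilisation: change the algorithm\n"
           : "-> fits in theory: worth benchmarking and tuning\n");
    return 0;
}
```

If the "needed" figure is several times the peak, no amount of loop tuning will close the gap on that core; that is the point where you change the algorithm or pick a part with caches and a vector unit.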
