Efficient/parallel matrix routines on Apple silicon


I'm working in Physics, and as part of my job I have to do a lot of numerical calculations ("numerics"), 90% of which involve the diagonalization of large matrices. At the moment, I use NumPy/SciPy in Python; specifically numpy.linalg.eigh, or scipy.sparse.linalg.eigsh if the matrices are sparse and I don't need all of the eigenvalues/eigenvectors. I use PyCharm and Anaconda, and have never thought about how these routines are implemented or whether they are efficient.

However, I've just got a new MacBook with an M1 Pro chip, and I thought this would be a good time to make sure what I'm doing is actually optimised! Unfortunately, I know very little about what goes on "behind the scenes" of these calculations.

Therefore, my question: how do I install Python and NumPy/SciPy, making sure that the matrix routines that I need are optimised to take full advantage of my computer? Specifically, that they are running natively on Apple silicon and are as parallel as possible?

A small boost to efficiency in diagonalizing matrices would have a big cumulative impact on my work!

CodePudding user response:

The M1 is a big-little processor, and such architectures are still quite poorly supported by most BLAS implementations. This is even more true for LAPACK implementations. In fact, it is hard to support big-little processors efficiently (it makes BLAS/LAPACK code even more complex than it already is), and no mainstream HPC processors use such an architecture. Because of that, you can see unstable execution times with some methods, and sometimes pretty bad ones. That being said, the M1's memory has a pretty high throughput, which can make some LAPACK methods much faster.

As for performance, you can check the FLOP efficiency by running a BLAS gemm on a 1024x1024 matrix: measure the execution time T, look up the peak FLOP/s K your processor can achieve, and compute the efficiency E = ((2*1024**3)/T)/K. If E > 0.75, your BLAS implementation is quite good. For E in the range 0.5-0.75 it is not great, and for E < 0.25 there is a problem and you should try another BLAS implementation.
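Here is a minimal sketch of that measurement in NumPy. The peak value K is a placeholder you have to replace with your chip's theoretical double-precision FLOP/s (this snippet cannot discover it for you):

```python
import time
import numpy as np

n = 1024
a = np.random.rand(n, n)
b = np.random.rand(n, n)
a @ b  # warm-up run so the timing excludes one-off setup costs

# time a few runs and keep the best
times = []
for _ in range(5):
    t0 = time.perf_counter()
    a @ b
    times.append(time.perf_counter() - t0)
t = min(times)

k = 2.6e11  # assumption: replace with your chip's theoretical peak FLOP/s
e = (2 * n**3 / t) / k
print(f"gemm time {t:.4f} s, efficiency E = {e:.2f}")
```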

Note that the leading BLAS/LAPACK implementation is currently the MKL, which is developed by Intel and, AFAIK, not supported on M1/ARM processors (and AFAIK there is no plan to support them). The default BLAS backend for NumPy is generally OpenBLAS, which so far has no M1-specific implementation and performs pretty badly there. Accelerate/vecLib (AFAIK made/supported by Apple) significantly outperforms OpenBLAS on M1 processors, so using it is certainly a good choice (AFAIK only on the M1).
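To see which backend your NumPy build is actually linked against, you can ask NumPy directly:

```python
import numpy as np

# prints the BLAS/LAPACK libraries NumPy was built against;
# look for "accelerate", "openblas", or "mkl" in the output
np.show_config()
```

If you installed Python through Anaconda/conda-forge, the BLAS variant can usually be switched at the package level; conda-forge documents commands of the form `conda install "libblas=*=*accelerate"`, though availability of the Accelerate variant has varied over time, so treat that as something to verify for your setup.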

There is a related open OpenBLAS issue with benchmarks you can use to check the performance of the BLAS implementation actually being used.

Put shortly: test Accelerate/vecLib and check that the results are similar to the benchmarks provided in the OpenBLAS issue. You can also test other BLAS implementations such as BLIS, which is known to provide good performance on a wide range of architectures. For LAPACK it is a bit more complex, since only a few implementations can actually run on the M1 (the inefficient Netlib reference implementation is generally the one used by default on most systems).
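Since your workload is diagonalization, it is also worth timing numpy.linalg.eigh directly rather than relying only on gemm numbers; a rough sketch like the one below, run under each candidate BLAS/LAPACK build, gives a workload-level comparison (the matrix size here is arbitrary):

```python
import time
import numpy as np

n = 2000  # arbitrary; pick something representative of your real matrices
rng = np.random.default_rng(0)
m = rng.standard_normal((n, n))
h = (m + m.T) / 2  # symmetrize so eigh's assumptions hold

t0 = time.perf_counter()
w, v = np.linalg.eigh(h)
print(f"eigh on {n}x{n}: {time.perf_counter() - t0:.2f} s")
```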
