Recently, I had a debate with a colleague about comparing Python vs C++ in terms of performance. Both of us use these languages mostly for linear algebra, so I wrote two scripts: one in Python 3 using NumPy, and the other in C++ using Eigen.
Python3 numpy version matmul_numpy.py:
import numpy as np
import time
a=np.random.rand(2000,2000)
b=np.random.rand(2000,2000)
start=time.time()
c=a*b
end=time.time()
print(end-start)
If I run this script with
python3 matmul_numpy.py
This will return:
0.07 seconds
The C++ Eigen version matmul_eigen.cpp:
#include <iostream>
#include <Eigen/Dense>
#include "time.h"
int main(){
clock_t start,end;
size_t n=2000;
Eigen::MatrixXd a=Eigen::MatrixXd::Random(n,n);
Eigen::MatrixXd b=Eigen::MatrixXd::Random(n,n);
start=clock();
Eigen::MatrixXd c=a*b;
end=clock();
std::cout<<(double)(end-start)/CLOCKS_PER_SEC<<std::endl;
return 0;}
The way I compile it is:
g++ matmul_eigen.cpp -I/usr/include/eigen3 -O3 -march=native -std=c++17 -o matmul_eigen
This will return (with both c++11 and c++17):
0.35 seconds
This is very odd to me. 1- Why is NumPy so much faster than C++ here? Am I missing any optimization flags?
I thought maybe the Python interpreter was somehow executing the program faster, so I compiled the Python script with Cython, following a thread on Stack Overflow.
The compiled Python script was still faster (0.11 seconds). This adds two more questions for me:
2- Why did the compiled version take longer? Does the interpreter do any further optimization?
3- Why is the binary file of the Python script (37 KB) smaller than the C++ one (57 KB)?
I would appreciate any help,
thanks
CodePudding user response:
The biggest issue is that you are comparing two completely different things:
- In Numpy, a*b performs an element-wise multiplication, since a and b are 2D arrays and not considered as matrices; a@b performs a matrix multiplication.
- In Eigen, a*b performs a matrix multiplication, not an element-wise one (see the documentation). This is because a and b are matrices, not just 2D arrays.
The two give completely different results. Moreover, a matrix multiplication runs in O(n**3) time while an element-wise multiplication runs in O(n**2) time. Matrix multiplication kernels are generally highly optimized and compute-bound, and they are parallelized by most BLAS libraries. Element-wise multiplications are memory-bound (especially here due to page faults). As a result, it is not surprising that the matrix multiplication is slower than the element-wise multiplication, and the gap is not huge either because the latter is memory-bound.
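To make the difference concrete, here is a small NumPy sketch (the matrix size and random inputs are just illustrative assumptions) showing that the two operations produce different results and can be timed separately:
import numpy as np
import time

n = 2000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

# Element-wise product: each output entry is a[i, j] * b[i, j], O(n**2) work.
elementwise = a * b

# Matrix product: each output entry is a dot product of a row and a column, O(n**3) work.
matmul = a @ b  # equivalent to np.matmul(a, b) or np.dot(a, b) for 2D arrays

# The two results are mathematically different objects.
print(np.allclose(elementwise, matmul))  # False for random inputs

# Rough wall-clock timings of the two operations.
t0 = time.perf_counter(); a * b; t1 = time.perf_counter()
t2 = time.perf_counter(); a @ b; t3 = time.perf_counter()
print(f"a*b: {t1 - t0:.3f} s, a@b: {t3 - t2:.3f} s")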
On my i5-9600KF processor (with 6 cores), Numpy takes 9 ms to do a*b (sequentially) and 65 ms to do a@b (in parallel, using OpenBLAS).
Note that Numpy element-wise multiplications like this are not parallel (at least not in the standard default implementation of Numpy). The matrix multiplication in Numpy uses a BLAS library, which is generally OpenBLAS by default (this depends on the actual target platform). Eigen should also use a BLAS library, but it might not be the same as the one used by Numpy.
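If you want to check which BLAS backend your own NumPy build uses (the output format varies between NumPy versions and platforms), a quick way is:
import numpy as np

# Prints the build configuration, including the BLAS/LAPACK libraries
# NumPy was linked against (e.g. OpenBLAS, MKL, or a reference BLAS).
np.show_config()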
Also note that clock is not a good way to measure parallel code, as it does not measure the wall-clock time but the CPU time (see this post for more information). std::chrono::steady_clock is generally a better alternative in C++.
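The same pitfall can be reproduced on the Python side: time.process_time() returns CPU time summed over all threads of the process, while time.perf_counter() returns wall-clock time. A minimal sketch, assuming a multi-threaded BLAS backend:
import time
import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

cpu0, wall0 = time.process_time(), time.perf_counter()
c = a @ b  # typically runs on several threads through the BLAS library
cpu1, wall1 = time.process_time(), time.perf_counter()

# With a multi-threaded BLAS, CPU time can be several times larger than wall time,
# which is exactly why clock()-style timers are misleading for parallel code.
print(f"CPU time:  {cpu1 - cpu0:.3f} s")
print(f"Wall time: {wall1 - wall0:.3f} s")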
3- Why is the binary file of the Python script (37 KB) smaller than the C++ one (57 KB)?
Python is generally compiled to bytecode, which is not native assembly code. C++ is generally compiled to an executable that contains assembled binary code, additional information used to run the program, as well as metadata. Bytecode is generally pretty compact because it is higher-level. Native compilers can perform optimizations that make programs bigger, such as loop unrolling and inlining. Such optimizations are not done by CPython (the default Python interpreter); in fact, CPython performs no (advanced) optimizations on the bytecode. Note that you can tell native compilers like GCC to generate smaller (though generally slower) code using flags like -Os and -s.
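As an illustration of how high-level the bytecode is, you can inspect what CPython generates for a line like c = a @ b (the exact opcodes differ between Python versions):
import dis

# The whole matrix multiplication compiles to a handful of generic opcodes;
# all the real numerical work is deferred to NumPy's compiled code at runtime.
dis.dis(compile("c = a @ b", "<example>", "exec"))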
CodePudding user response:
Based on what I learned from @Jérôme Richard's response and the comments of @user17732522, it seems that I made two mistakes in the comparison:
1- I made a mistake defining the multiplication in the Python script; it should be np.matmul(a,b), np.dot(a,b), or a@b, not a*b, which is an element-wise multiplication.
2- I didn't measure the time in the C++ code correctly; clock_t measures CPU time rather than wall-clock time, so std::chrono::steady_clock works better here.
After applying these corrections, the C++ Eigen code is about 10 times faster than the Python one.
The updated code for matmul_eigen.cpp:
#include <iostream>
#include <Eigen/Dense>
#include <chrono>
int main(){
size_t n=2000;
Eigen::MatrixXd a=Eigen::MatrixXd::Random(n,n);
Eigen::MatrixXd b=Eigen::MatrixXd::Random(n,n);
auto t1=std::chrono::steady_clock::now();
Eigen::MatrixXd c=a*b;
auto t2=std::chrono::steady_clock::now();
std::cout<<(double)std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count()/1000000.0f<<std::endl;
return 0;}
To compile, both the vectorization and multi-threading flags should be enabled:
g++ matmul_eigen.cpp -I/usr/include/eigen3 -O3 -std=c++17 -march=native -fopenmp -o eigen_matmul
To run the code with multiple threads:
OMP_NUM_THREADS=4 ./eigen_matmul
where "4" is the number of CPU(s) that openmp can use, you can see how many you have with:
lscpu | grep "CPU(s):"
This will return 0.104 seconds.
The updated python script matmul_numpy.py:
import numpy as np
import time
a=np.random.rand(2000,2000)
b=np.random.rand(2000,2000)
a=np.array(a, dtype=np.float64)
b=np.array(b, dtype=np.float64)
start=time.time()
c=np.dot(a,b)
end=time.time()
print(end-start)
To run the code,
python3 matmul_numpy.py
This will return 1.0531 seconds.
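For a fairer comparison, it can help to give NumPy's BLAS backend the same number of threads as the Eigen run. A minimal sketch, assuming an OpenBLAS-backed NumPy, which usually respects environment variables such as OMP_NUM_THREADS or OPENBLAS_NUM_THREADS when they are set before NumPy is imported:
import os

# Must be set before importing NumPy to take effect; which variable applies
# depends on the BLAS library NumPy was built with (assumption: OpenBLAS).
os.environ.setdefault("OMP_NUM_THREADS", "4")
os.environ.setdefault("OPENBLAS_NUM_THREADS", "4")

import numpy as np
import time

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

start = time.perf_counter()
c = np.dot(a, b)
end = time.perf_counter()
print(end - start)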
As for the reason why, I think @Jérôme Richard's response is a better reference.