Recently, I had a debate with a colleague about comparing Python vs C++ in terms of performance. Both of us use these languages mostly for linear algebra, so I wrote two scripts: one in Python 3 using NumPy, and the other in C++ using Eigen.
Python3 numpy version matmul_numpy.py:
import numpy as np
import time
a=np.random.rand(2000,2000)
b=np.random.rand(2000,2000)
start=time.time()
c=a*b
end=time.time()
print(end-start)
If I run this script with
python3 matmul_numpy.py
This will return:
0.07 seconds
The C++ Eigen version matmul_eigen.cpp:
#include <iostream>
#include <Eigen/Dense>
#include "time.h"
int main(){
clock_t start,end;
size_t n=2000;
Eigen::MatrixXd a=Eigen::MatrixXd::Random(n,n);
Eigen::MatrixXd b=Eigen::MatrixXd::Random(n,n);
start=clock();
Eigen::MatrixXd c=a*b;
end=clock();
std::cout<<(double)(end-start)/CLOCKS_PER_SEC<<std::endl;
return 0;}
The way I compile it is:
g++ matmul_eigen.cpp -I/usr/include/eigen3 -O3 -march=native -std=c++17 -o matmul_eigen
This will return (with both c++11 and c++17):
0.35 seconds
This is very odd to me. 1- Why is NumPy so much faster than C++ here? Am I missing any optimization flags?
I thought maybe the Python interpreter was somehow executing the program faster, so I compiled the Python script with Cython, following a thread on Stack Overflow.
The compiled Python script was still faster (0.11 seconds). This adds two more questions for me:
2- Why did the compiled version take longer? Does the interpreter do any further optimization?
3- Why is the binary file of the Python script (37 KB) smaller than the C++ one (57 KB)?
I would appreciate any help,
thanks
CodePudding user response:
The biggest issue is that you are comparing two completely different things:
- In Numpy, a*b performs an element-wise multiplication, since a and b are 2D arrays and not considered as matrices; a@b performs a matrix multiplication.
- In Eigen, a*b performs a matrix multiplication, not an element-wise one (see the documentation). This is because a and b are matrices, not just 2D arrays.
The two give completely different results. Moreover, a matrix multiplication runs in O(n**3) time while an element-wise multiplication runs in O(n**2) time. Matrix multiplication kernels are generally highly optimized and compute-bound, and they are parallelized by most BLAS libraries. Element-wise multiplications are memory-bound (especially here due to page faults). As a result, it is not surprising that the matrix multiplication is slower than the element-wise multiplication, and the gap is not huge either because the latter is memory-bound.
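To make the difference concrete, here is a small NumPy sketch (the matrix size and random inputs are just illustrative assumptions) showing that the two operations produce different results and can be timed separately:
import numpy as np
import time

n = 2000
a = np.random.rand(n, n)
b = np.random.rand(n, n)

# Element-wise product: each output entry is a[i, j] * b[i, j], O(n**2) work.
elementwise = a * b

# Matrix product: each output entry is a dot product of a row and a column, O(n**3) work.
matmul = a @ b  # equivalent to np.matmul(a, b) or np.dot(a, b) for 2D arrays

# The two results are mathematically different objects.
print(np.allclose(elementwise, matmul))  # False for random inputs

# Rough wall-clock timings of the two operations.
t0 = time.perf_counter(); a * b; t1 = time.perf_counter()
t2 = time.perf_counter(); a @ b; t3 = time.perf_counter()
print(f"a*b: {t1 - t0:.3f} s, a@b: {t3 - t2:.3f} s")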
On my i5-9600KF processor (with 6 cores), Numpy takes 9 ms to do a*b (sequentially) and 65 ms to do a@b (in parallel, using OpenBLAS).
Note that Numpy element-wise multiplications like this are not parallel (at least not in the standard default implementation of Numpy). The matrix multiplication in Numpy uses a BLAS library, which is generally OpenBLAS by default (this depends on the actual target platform). Eigen should also use a BLAS library, but it might not be the same as the one used by Numpy.
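If you want to check which BLAS backend your own NumPy build uses (the output format varies between NumPy versions and platforms), a quick way is:
import numpy as np

# Prints the build configuration, including the BLAS/LAPACK libraries
# NumPy was linked against (e.g. OpenBLAS, MKL, or a reference BLAS).
np.show_config()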
Also note that clock is not a good way to measure parallel code, as it does not measure the wall-clock time but the CPU time (see this post for more information). std::chrono::steady_clock is generally a better alternative in C++.
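The same pitfall can be reproduced on the Python side: time.process_time() returns CPU time summed over all threads of the process, while time.perf_counter() returns wall-clock time. A minimal sketch, assuming a multi-threaded BLAS backend:
import time
import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

cpu0, wall0 = time.process_time(), time.perf_counter()
c = a @ b  # typically runs on several threads through the BLAS library
cpu1, wall1 = time.process_time(), time.perf_counter()

# With a multi-threaded BLAS, CPU time can be several times larger than wall time,
# which is exactly why clock()-style timers are misleading for parallel code.
print(f"CPU time:  {cpu1 - cpu0:.3f} s")
print(f"Wall time: {wall1 - wall0:.3f} s")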
3- Why is the binary file of the Python script (37 KB) smaller than the C++ one (57 KB)?
Python is generally compiled to bytecode, which is not native assembly code. C++ is generally compiled to an executable that contains assembled binary code, additional information used to run the program, as well as metadata. Bytecode is generally pretty compact because it is higher-level. Native compilers can perform optimizations that make programs bigger, such as loop unrolling and inlining. Such optimizations are not done by CPython (the default Python interpreter); in fact, CPython performs no (advanced) optimizations on the bytecode. Note that you can tell native compilers like GCC to generate smaller (though generally slower) code using flags like -Os and -s.
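As an illustration of how high-level the bytecode is, you can inspect what CPython generates for a line like c = a @ b (the exact opcodes differ between Python versions):
import dis

# The whole matrix multiplication compiles to a handful of generic opcodes;
# all the real numerical work is deferred to NumPy's compiled code at runtime.
dis.dis(compile("c = a @ b", "<example>", "exec"))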
CodePudding user response:
Based on what I learned from @Jérôme Richard's response and the comments of @user17732522, it seems that I made two mistakes in the comparison:
1- I made a mistake defining the multiplication in the Python script; it should be np.matmul(a,b), np.dot(a,b), or a@b, not a*b, which is an element-wise multiplication.
2- I didn't measure the time in the C++ code correctly; clock_t measures CPU time rather than wall-clock time, so std::chrono::steady_clock works better here.
After applying these corrections, the C++ Eigen code is about 10 times faster than the Python one.
The updated code for matmul_eigen.cpp:
#include <iostream>
#include <Eigen/Dense>
#include <chrono>
int main(){
size_t n=2000;
Eigen::MatrixXd a=Eigen::MatrixXd::Random(n,n);
Eigen::MatrixXd b=Eigen::MatrixXd::Random(n,n);
auto t1=std::chrono::steady_clock::now();
Eigen::MatrixXd c=a*b;
auto t2=std::chrono::steady_clock::now();
std::cout<<(double)std::chrono::duration_cast<std::chrono::microseconds>(t2-t1).count()/1000000.0f<<std::endl;
return 0;}
To compile, both the vectorization and multi-threading flags should be enabled:
g++ matmul_eigen.cpp -I/usr/include/eigen3 -O3 -std=c++17 -march=native -fopenmp -o eigen_matmul
To run the code with multiple threads:
OMP_NUM_THREADS=4 ./eigen_matmul
where "4" is the number of CPU(s) that openmp can use, you can see how many you have with:
lscpu | grep "CPU(s):"
This will return 0.104 seconds.
The updated python script matmul_numpy.py:
import numpy as np
import time
a=np.random.rand(2000,2000)
b=np.random.rand(2000,2000)
a=np.array(a, dtype=np.float64)
b=np.array(b, dtype=np.float64)
start=time.time()
c=np.dot(a,b)
end=time.time()
print(end-start)
To run the code,
python3 matmul_numpy.py
This will return 1.0531 seconds.
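For a fairer comparison, it can help to give NumPy's BLAS backend the same number of threads as the Eigen run. A minimal sketch, assuming an OpenBLAS-backed NumPy, which usually respects environment variables such as OMP_NUM_THREADS or OPENBLAS_NUM_THREADS when they are set before NumPy is imported:
import os

# Must be set before importing NumPy to take effect; which variable applies
# depends on the BLAS library NumPy was built with (assumption: OpenBLAS).
os.environ.setdefault("OMP_NUM_THREADS", "4")
os.environ.setdefault("OPENBLAS_NUM_THREADS", "4")

import numpy as np
import time

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

start = time.perf_counter()
c = np.dot(a, b)
end = time.perf_counter()
print(end - start)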
As for the reason why, I think @Jérôme Richard's response is a better reference.