I'm trying to understand the performance differences I am seeing between various numba implementations of an algorithm. In particular, I would expect func1d from below to be the fastest implementation, since it is the only one that does not copy data; however, according to my timings, func1b appears to be the fastest.
import numpy
import numba

def func1a(data, a, b, c):
    # pure numpy
    return a * (1 + numpy.tanh((data / b) - c))

@numba.njit(fastmath=True)
def func1b(data, a, b, c):
    # whole-array expression compiled by numba
    new_data = a * (1 + numpy.tanh((data / b) - c))
    return new_data

@numba.njit(fastmath=True)
def func1c(data, a, b, c):
    # explicit loops writing into a newly allocated array
    new_data = numpy.empty(data.shape)
    for i in range(new_data.shape[0]):
        for j in range(new_data.shape[1]):
            new_data[i, j] = a * (1 + numpy.tanh((data[i, j] / b) - c))
    return new_data

@numba.njit(fastmath=True)
def func1d(data, a, b, c):
    # explicit loops overwriting the input array in place
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            data[i, j] = a * (1 + numpy.tanh((data[i, j] / b) - c))
    return data
Helper functions for testing memory copying
def get_data_base(arr):
    """For a given NumPy array, find the base array
    that owns the actual data.
    https://ipython-books.github.io/45-understanding-the-internals-of-numpy-to-avoid-unnecessary-array-copying/
    """
    base = arr
    while isinstance(base.base, numpy.ndarray):
        base = base.base
    return base

def arrays_share_data(x, y):
    return get_data_base(x) is get_data_base(y)

def test_share(func):
    data = numpy.random.randn(100, 3)
    print(arrays_share_data(data, func(data, 0.5, 2.5, 2.5)))
Timings
# force compiling
data = numpy.random.randn(10_000, 300)
_ = func1a(data, 0.5, 2.5, 2.5)
_ = func1b(data, 0.5, 2.5, 2.5)
_ = func1c(data, 0.5, 2.5, 2.5)
_ = func1d(data, 0.5, 2.5, 2.5)
data = numpy.random.randn(10_000, 300)
%timeit func1a(data, 0.5, 2.5, 2.5)
%timeit func1b(data, 0.5, 2.5, 2.5)
%timeit func1c(data, 0.5, 2.5, 2.5)
%timeit func1d(data, 0.5, 2.5, 2.5)
67.2 ms ± 230 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
13 ms ± 10.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
69.8 ms ± 60.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
69.8 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
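For anyone who wants to reproduce the timings outside IPython, here is a minimal sketch using the standard timeit module. The helper name best_time_ms is made up for this example, and it reports the best run rather than the mean ± std. dev. that %timeit prints, so the numbers are not directly comparable.

```python
import timeit

import numpy


def func1a(data, a, b, c):
    # pure-NumPy variant of the expression being benchmarked
    return a * (1 + numpy.tanh((data / b) - c))


def best_time_ms(func, *args, repeat=7, number=10):
    """Best per-call time in milliseconds over `repeat` runs of `number` calls.

    Hypothetical helper for this sketch; it is not part of the question's code.
    """
    timings = timeit.repeat(lambda: func(*args), repeat=repeat, number=number)
    return min(timings) / number * 1e3


if __name__ == "__main__":
    data = numpy.random.randn(10_000, 300)
    elapsed = best_time_ms(func1a, data, 0.5, 2.5, 2.5, repeat=3, number=5)
    print(f"func1a: {elapsed:.1f} ms per call")
```

For the numba variants, call each function once before timing so that compilation is not included in the measurement.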
Test which implementations copy memory
test_share(func1a)
test_share(func1b)
test_share(func1c)
test_share(func1d)
False
False
False
True
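As a cross-check of the helper above, NumPy also ships numpy.shares_memory, which answers the same question directly. The sketch below uses two toy functions (scale_copy and scale_inplace, both invented for illustration) instead of the numba functions, so it runs without numba installed.

```python
import numpy


def returns_input_memory(func, *args):
    """True if the array returned by `func` overlaps the input array in memory."""
    data = numpy.random.randn(100, 3)
    return bool(numpy.shares_memory(data, func(data, *args)))


def scale_copy(data, a):
    # allocates a new array for the result
    return a * data


def scale_inplace(data, a):
    # overwrites the input and returns the same array object
    data *= a
    return data


if __name__ == "__main__":
    print(returns_input_memory(scale_copy, 0.5))
    print(returns_input_memory(scale_inplace, 0.5))
```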
CodePudding user response:
Here, copying of data doesn't play a big role: the bottleneck is how fast the tanh function is evaluated. There are many algorithms: some are faster, some slower; some are more precise, some less.
Different numpy distributions use different implementations of the tanh function, e.g. it could be the one from mkl/vml or the one from the GNU math library (libm).
Depending on the numba version, either the mkl/svml implementation or the GNU math library is used.
The easiest way to look inside is to use a profiler, for example perf.
For the numpy-version on my machine I get:
>>> perf record python run.py
>>> perf report
Overhead Command Shared Object Symbol
46,73% python libm-2.23.so [.] __expm1
24,24% python libm-2.23.so [.] __tanh
4,89% python _multiarray_umath.cpython-37m-x86_64-linux-gnu.so [.] sse2_binary_scalar2_divide_DOUBLE
3,59% python [unknown] [k] 0xffffffff8140290c
As one can see, numpy uses the slow GNU math library (libm) functionality.
For the numba-function I get:
53,98% python libsvml.so [.] __svml_tanh4_e9
3,60% python [unknown] [k] 0xffffffff81831c57
2,79% python python3.7 [.] _PyEval_EvalFrameDefault
which means that fast mkl/svml functionality is used.
That is (almost) all there is to it.
As @user2640045 has rightly pointed out, the numpy performance will be hurt by additional cache misses due to creation of temporary arrays.
However, cache misses don't play as big a role as the calculation of tanh:
%timeit func1a(data, 0.5, 2.5, 2.5) # 91.5 ms ± 2.88 ms per loop
%timeit numpy.tanh(data) # 76.1 ms ± 539 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
i.e. the creation of temporary arrays accounts for less than 20% of the running time: (91.5 - 76.1) / 91.5 ≈ 17%.
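To measure the cost of the temporaries more directly, one option is a NumPy variant that allocates a single output array and reuses it through the out= argument of the ufuncs. This is only a sketch; func1a_single_buffer is a name I made up, and I have not timed it on the machine above.

```python
import numpy


def func1a_single_buffer(data, a, b, c):
    """Same result as a * (1 + tanh(data / b - c)), with one allocation."""
    out = numpy.divide(data, b)        # the only new array
    numpy.subtract(out, c, out=out)    # every later step overwrites `out`
    numpy.tanh(out, out=out)
    numpy.add(out, 1.0, out=out)
    numpy.multiply(out, a, out=out)
    return out


if __name__ == "__main__":
    data = numpy.random.randn(1_000, 300)
    expected = 0.5 * (1 + numpy.tanh((data / 2.5) - 2.5))
    print(numpy.allclose(func1a_single_buffer(data, 0.5, 2.5, 2.5), expected))
```

Timing this against func1a and against a bare numpy.tanh(data) should show how much of the gap is due to the temporaries and how much is due to tanh itself.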
FWIW, my numba version (0.50.1) is also able to vectorize the version with the handwritten loops and call the mkl/svml functionality. If this does not happen for some other version, numba will fall back to the GNU math library functionality, which seems to be what is happening on your machine.
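A quick way to see which path your numba build takes, without a profiler, is to search the generated assembly for SVML symbols such as the __svml_tanh4_e9 from the perf output above. The sketch below relies on the dispatcher's inspect_asm() method; the helper names are mine, numba is treated as optional so the snippet still imports without it, and whether the loop actually gets vectorized depends on your numba/LLVM build.

```python
import numpy

try:
    import numba
except ImportError:  # numba is optional for this sketch
    numba = None


def asm_mentions_svml(asm_by_signature):
    """True if any compiled specialisation references an SVML symbol."""
    return any("__svml_" in asm for asm in asm_by_signature.values())


def tanh_loop_uses_svml():
    """Compile a small tanh loop and report whether its assembly calls SVML.

    Returns None when numba is not installed.
    """
    if numba is None:
        return None

    @numba.njit(fastmath=True)
    def kernel(data):
        out = numpy.empty_like(data)
        for i in range(data.shape[0]):
            out[i] = numpy.tanh(data[i])
        return out

    kernel(numpy.random.randn(1_000))  # trigger compilation for float64 input
    return asm_mentions_svml(kernel.inspect_asm())


if __name__ == "__main__":
    print("SVML used:", tanh_loop_uses_svml())
```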
Listing of run.py:
import numpy

# TODO: define func1b for checking numba
def func1a(data, a, b, c):
    # pure numpy
    return a * (1 + numpy.tanh((data / b) - c))

data = numpy.random.randn(10_000, 300)
for _ in range(100):
    func1a(data, 0.5, 2.5, 2.5)
CodePudding user response:
The performance difference is NOT in the evaluation of the tanh-function
I must disagree with @ead. Let's assume for the moment that
the main performance difference is in the evaluation of the tanh-function
Then one would expect that running just tanh from numpy and from numba with fast math would show that speed difference.
import numpy as np
import numba as nb

def func_a(data):
    return np.tanh(data)

@nb.njit(fastmath=True)
def func_b(data):
    new_data = np.tanh(data)
    return new_data

data = np.random.randn(10_000, 300)
%timeit func_a(data)
%timeit func_b(data)
Yet on my machine the above code shows almost no difference in performance.
15.7 ms ± 129 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
15.8 ms ± 82 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Short detour on NumExpr
I tried a NumExpr version of your code. But before being amazed that it runs almost 7 times faster, keep in mind that it uses all 10 cores available on my machine. After allowing numba to run in parallel too and optimising that a little, the performance benefit is small but still there: 2.56 ms vs 3.87 ms. See the code below.
import numexpr as ne

# a, b, c are read as globals below; these values are an assumption,
# taken from the arguments used in the question's timings
a, b, c = 0.5, 2.5, 2.5

@nb.njit(fastmath=True)
def func_a(data):
    new_data = a * (1 + np.tanh((data / b) - c))
    return new_data

@nb.njit(fastmath=True, parallel=True)
def func_b(data):
    new_data = a * (1 + np.tanh((data / b) - c))
    return new_data

@nb.njit(fastmath=True, parallel=True)
def func_c(data):
    # note: overwrites data in place
    for i in nb.prange(data.shape[0]):
        for j in range(data.shape[1]):
            data[i, j] = a * (1 + np.tanh((data[i, j] / b) - c))
    return data

def func_d(data):
    return ne.evaluate('a * (1 + tanh((data / b) - c))')

data = np.random.randn(10_000, 300)
%timeit func_a(data)
%timeit func_b(data)
%timeit func_c(data)
%timeit func_d(data)
17.4 ms ± 146 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
4.31 ms ± 193 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
3.87 ms ± 152 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
2.56 ms ± 104 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
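To compare NumExpr with the single-threaded numba version on equal terms, one could also pin NumExpr to one thread. The sketch below uses numexpr.set_num_threads, which returns the previous setting, to do that temporarily. The helper name evaluate_with_threads is mine, numexpr is treated as optional so the snippet still imports without it, and I have not timed this variant.

```python
import numpy as np

try:
    import numexpr as ne
except ImportError:  # numexpr is optional for this sketch
    ne = None


def evaluate_with_threads(expr, variables, n_threads):
    """Evaluate `expr` with NumExpr using `n_threads`, then restore the old setting."""
    if ne is None:
        raise RuntimeError("numexpr is not installed")
    previous = ne.set_num_threads(n_threads)
    try:
        return ne.evaluate(expr, local_dict=variables)
    finally:
        ne.set_num_threads(previous)


if __name__ == "__main__" and ne is not None:
    data = np.random.randn(1_000, 300)
    variables = {"data": data, "a": 0.5, "b": 2.5, "c": 2.5}
    result = evaluate_with_threads("a * (1 + tanh((data / b) - c))", variables, 1)
    print(np.allclose(result, 0.5 * (1 + np.tanh((data / 2.5) - 2.5))))
```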
The actual explanation
The ~34% of time that NumExpr saves compared to numba is nice, but even nicer is that its authors have a concise explanation of why it is faster than numpy. I am pretty sure that this applies to numba too.
From the NumExpr github page:
The main reason why NumExpr achieves better performance than NumPy is that it avoids allocating memory for intermediate results. This results in better cache utilization and reduces memory access in general.
So
a * (1 + numpy.tanh((data / b) - c))
is slower because it performs many steps, each of which produces an intermediate result.
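To make the intermediate results visible, here is the same expression spelled out one operation at a time, with the size of the temporaries added up. This is only an illustration: func1a_spelled_out is a made-up name, and NumPy can sometimes reuse a temporary in place when evaluating the one-line expression, so the real number of allocations may be lower than what this sketch counts.

```python
import numpy


def func1a_spelled_out(data, a, b, c):
    """Evaluate a * (1 + tanh(data / b - c)) step by step.

    Returns the result and the total size in bytes of the four
    full-size intermediate arrays created along the way.
    """
    t1 = data / b          # intermediate 1
    t2 = t1 - c            # intermediate 2
    t3 = numpy.tanh(t2)    # intermediate 3
    t4 = 1 + t3            # intermediate 4
    result = a * t4        # final array
    temporaries_bytes = t1.nbytes + t2.nbytes + t3.nbytes + t4.nbytes
    return result, temporaries_bytes


if __name__ == "__main__":
    data = numpy.random.randn(10_000, 300)
    _, n_bytes = func1a_spelled_out(data, 0.5, 2.5, 2.5)
    print(f"{n_bytes / 1e6:.0f} MB of intermediate arrays")
```

For a 10_000 x 300 float64 input each array is 24 MB, so the spelled-out version touches 96 MB of intermediates on top of the input and the result.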