I am not sure if the out parameter is worth the trouble for my project, so I am doing a series of tests.
But so far, in every case I've tested, the out parameter makes performance the same or slightly slower than the simpler implementation, and I can't figure out why.
Here's an example:
test1 and test11 are meant to be equivalent, but the latter uses an out parameter to avoid allocating a new array every time. The same goes for test2 and test22.
import numpy as np
from timeit import timeit
N = 1000
weights = np.arange(N * N).astype('f4').reshape((N, N))
inputs = np.arange(N).astype('f4')
cache_out1 = np.empty((N, N), dtype='f4')
cache_out2 = np.empty(N, dtype='f4')
def test1():
    return (weights * inputs).sum(axis=1)

def test11():
    np.multiply(weights, inputs, out=cache_out1)
    np.sum(cache_out1, axis=1, out=cache_out2)
    return cache_out2

def test2():
    return (weights * inputs[:, np.newaxis]).sum(axis=0)

def test22():
    np.multiply(weights, inputs[:, np.newaxis], out=cache_out1)
    np.sum(cache_out1, axis=0, out=cache_out2)
    return cache_out2
print('test1:', timeit(test1, number=1000))
print('test11:', timeit(test11, number=1000))
print('test2:', timeit(test2, number=1000))
print('test22:', timeit(test22, number=1000))
output:
test1: 1.1015455439919606
test11: 1.0834621820104076
test2: 1.1083468289871234
test22: 1.1045935050060507
CodePudding user response:
The out parameter will make a bigger impact when your arrays are bigger and the allocation takes longer. Using out lets you amortize that allocation cost by re-using the same memory on every call. To take your example:
import numpy as np
from timeit import timeit
N = 4096
weights = np.arange(N * N).astype('f4').reshape((N, N))
inputs = np.arange(N).astype('f4')
cache_out1 = np.empty((N, N), dtype='f4')
cache_out2 = np.empty(N, dtype='f4')
def test1():
    return (weights * inputs).sum(axis=1)

def test11():
    np.multiply(weights, inputs, out=cache_out1)
    np.sum(cache_out1, axis=1, out=cache_out2)
    return cache_out2

def test2():
    return (weights * inputs[:, np.newaxis]).sum(axis=0)

def test22():
    np.multiply(weights, inputs[:, np.newaxis], out=cache_out1)
    np.sum(cache_out1, axis=0, out=cache_out2)
    return cache_out2
n = 100
print('test1:', timeit(test1, number=n))
print('test11:', timeit(test11, number=n))
print('test2:', timeit(test2, number=n))
print('test22:', timeit(test22, number=n))
test1: 2.5047981239913497
test11: 1.7144565229973523
test2: 2.4683585959865013
test22: 1.6845238849928137
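As a side note, and only as a sketch with made-up function names: this particular multiply-then-sum is a matrix-vector product, so you can also skip the N x N intermediate entirely and let BLAS do the reduction:

def test1_matmul():
    # same result as test1, up to floating-point rounding
    return weights @ inputs

def test2_matmul():
    # same result as test2, up to floating-point rounding
    return inputs @ weights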
CodePudding user response:
If you really want to squeeze out the last nanoseconds, look at Numba or Cython.
Especially if you can parallelize the operations, you can get substantial speedups.
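For illustration, here is a minimal sketch of the Numba route (it assumes Numba is installed, and the function name is invented for this example). The kernel fuses the multiply and the row-wise sum so no N x N temporary is materialized, and prange parallelizes over the rows:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def weighted_row_sums(weights, inputs, out):
    # equivalent to (weights * inputs).sum(axis=1), written into out
    for i in prange(weights.shape[0]):
        acc = 0.0
        for j in range(weights.shape[1]):
            acc += weights[i, j] * inputs[j]
        out[i] = acc
    return out

Called as weighted_row_sums(weights, inputs, cache_out2). Keep in mind the first call includes JIT compilation time, so warm it up before timing.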