I am not sure if the out parameter is worth the trouble for my project, so I am doing a series of tests.
But so far, in every case I've tested, the out parameter makes performance the same or slightly slower than the simpler implementation, and I can't figure out why.
Here's an example:
test1 and test11 are meant to be equivalent, but the latter uses an out parameter to avoid allocating a new array every time. The same goes for test2 and test22.
import numpy as np
from timeit import timeit
N = 1000
weights = np.arange(N * N).astype('f4').reshape((N, N))
inputs = np.arange(N).astype('f4')
cache_out1 = np.empty((N, N), dtype='f4')
cache_out2 = np.empty(N, dtype='f4')
def test1():
    return (weights * inputs).sum(axis=1)

def test11():
    np.multiply(weights, inputs, out=cache_out1)
    np.sum(cache_out1, axis=1, out=cache_out2)
    return cache_out2

def test2():
    return (weights * inputs[:, np.newaxis]).sum(axis=0)

def test22():
    np.multiply(weights, inputs[:, np.newaxis], out=cache_out1)
    np.sum(cache_out1, axis=0, out=cache_out2)
    return cache_out2
print('test1:', timeit(test1, number=1000))
print('test11:', timeit(test11, number=1000))
print('test2:', timeit(test2, number=1000))
print('test22:', timeit(test22, number=1000))
output:
test1: 1.1015455439919606
test11: 1.0834621820104076
test2: 1.1083468289871234
test22: 1.1045935050060507
CodePudding user response:
The out parameter will make a bigger impact when your arrays are bigger and the allocation takes longer. Using out lets you amortize that allocation cost by re-using the same memory on every call. To take your example:
import numpy as np
from timeit import timeit
N = 4096
weights = np.arange(N * N).astype('f4').reshape((N, N))
inputs = np.arange(N).astype('f4')
cache_out1 = np.empty((N, N), dtype='f4')
cache_out2 = np.empty(N, dtype='f4')
def test1():
    return (weights * inputs).sum(axis=1)

def test11():
    np.multiply(weights, inputs, out=cache_out1)
    np.sum(cache_out1, axis=1, out=cache_out2)
    return cache_out2

def test2():
    return (weights * inputs[:, np.newaxis]).sum(axis=0)

def test22():
    np.multiply(weights, inputs[:, np.newaxis], out=cache_out1)
    np.sum(cache_out1, axis=0, out=cache_out2)
    return cache_out2
n = 100
print('test1:', timeit(test1, number=n))
print('test11:', timeit(test11, number=n))
print('test2:', timeit(test2, number=n))
print('test22:', timeit(test22, number=n))
test1: 2.5047981239913497
test11: 1.7144565229973523
test2: 2.4683585959865013
test22: 1.6845238849928137
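As a side note, and only as a sketch with made-up function names: this particular multiply-then-sum is a matrix-vector product, so you can also skip the N x N intermediate entirely and let BLAS do the reduction:

def test1_matmul():
    # same result as test1, up to floating-point rounding
    return weights @ inputs

def test2_matmul():
    # same result as test2, up to floating-point rounding
    return inputs @ weights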
CodePudding user response:
If you really want to squeeze out the last nanoseconds, look at Numba or Cython.
Especially if you can parallelize the operations, you can get substantial speedups.
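For illustration, here is a minimal sketch of the Numba route (it assumes Numba is installed, and the function name is invented for this example). The kernel fuses the multiply and the row-wise sum so no N x N temporary is materialized, and prange parallelizes over the rows:

import numpy as np
from numba import njit, prange

@njit(parallel=True)
def weighted_row_sums(weights, inputs, out):
    # equivalent to (weights * inputs).sum(axis=1), written into out
    for i in prange(weights.shape[0]):
        acc = 0.0
        for j in range(weights.shape[1]):
            acc += weights[i, j] * inputs[j]
        out[i] = acc
    return out

Called as weighted_row_sums(weights, inputs, cache_out2). Keep in mind the first call includes JIT compilation time, so warm it up before timing.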