I need to compute the mean of a 2D array along one axis. Here I keep all rows:
import numpy as np, time
x = np.random.random((100000, 500))
t0 = time.time()
y = x.mean(axis=0) # y.shape is (500,) as expected
print(time.time() - t0) # 36 milliseconds
When I filter and select some rows, I notice it is 8 times slower. So I tried an easy test where selected_rows is in fact all rows. Still, it is 8 times slower:
selected_rows = np.arange(100000)
t0 = time.time()
y = x[selected_rows, :].mean(axis=0) # selecting all rows!
print(time.time() - t0) # 280 milliseconds! (for the same result as above!)
Is there a way to speed up the process of selecting certain rows (selected_rows) and computing .mean(axis=0)?
In the specific case where selected_rows = all rows, it would be interesting to not have 8x slower execution.
CodePudding user response:
When you do x[selected_rows, :] where selected_rows is an array, NumPy performs advanced (aka fancy) indexing to create a new array. This copy is what takes time.
If, instead, you use a slice operation, a view of the original array is created, and that takes much less time. For example:
import timeit
import numpy as np
selected_rows = np.arange(0, 100000, 2)
array = np.random.random((100000, 500))
t1 = timeit.timeit("array[selected_rows, :].mean(axis=0)", globals=globals(), number=10)
t2 = timeit.timeit("array[::2, :].mean(axis=0)", globals=globals(), number=10)
print(t1, t2, t1 / t2) # 1.3985465039731935 0.18735826201736927 7.464557414839488
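A quick way to see the view-vs-copy difference directly is np.shares_memory (this check is not in the original timing code, just an illustration):

```python
import numpy as np

array = np.random.random((1000, 50))
selected_rows = np.arange(0, 1000, 2)

fancy = array[selected_rows, :]   # advanced indexing -> allocates a new array
sliced = array[::2, :]            # basic slicing -> view into the original

print(np.shares_memory(array, fancy))   # False: data was copied
print(np.shares_memory(array, sliced))  # True: no copy was made
```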
Unfortunately, there's no good way to represent all possible selected_rows as slices, so if you have a selected_rows that can't be represented as a slice, you don't have any other option but to take the hit in performance. There's more information in the answers to these questions:
- Fast numpy fancy indexing
- get view of numpy array using boolean or sequence object (advanced indexing)
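That said, when selected_rows happens to form a uniform arithmetic progression (as in the all-rows case from the question), you can detect that and fall back to a slice. The helper below is hypothetical, not part of NumPy, and only covers that one pattern:

```python
import numpy as np

def as_slice(indices):
    """Return an equivalent slice if `indices` is a uniform arithmetic
    progression with positive step, else None. (Hypothetical helper.)"""
    if len(indices) == 0:
        return None
    if len(indices) == 1:
        return slice(int(indices[0]), int(indices[0]) + 1)
    step = int(indices[1] - indices[0])
    if step <= 0 or np.any(np.diff(indices) != step):
        return None
    return slice(int(indices[0]), int(indices[-1]) + 1, step)

x = np.random.random((1000, 50))
selected_rows = np.arange(1000)  # all rows: representable as a slice

s = as_slice(selected_rows)
if s is not None:
    y = x[s, :].mean(axis=0)            # view, no copy
else:
    y = x[selected_rows, :].mean(axis=0)  # fall back to fancy indexing
```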
dankal444's answer here doesn't help in your case, since the axis of the mean call is the axis you wanted to filter in the first place. It is, however, the best way to do this if the filter axis and the mean axis are different: save the creation of the new array until after you've condensed one axis. You still take a performance hit compared to basic slicing, but it is not as large as if you indexed before the mean call.
For example, if you wanted .mean(axis=1):
t1 = timeit.timeit("array[selected_rows, :].mean(axis=1)", globals=globals(), number=10)
t2 = timeit.timeit("array.mean(axis=1)[selected_rows]", globals=globals(), number=10)
t3 = timeit.timeit("array[::2, :].mean(axis=1)", globals=globals(), number=10)
t4 = timeit.timeit("array.mean(axis=1)[::2]", globals=globals(), number=10)
print(t1, t2, t3, t4)
# 1.4732236850004483 0.3643951010008095 0.21357544500006043 0.32832237200000236
Which shows that:
- Indexing before mean is the worst by far (t1)
- Slicing before mean is best, since you don't have to spend extra time calculating means for the unnecessary rows (t3)
- Both indexing (t2) and slicing (t4) after mean are better than indexing before mean, but not better than slicing before mean
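As a quick sanity check (not in the original answer), reducing first and then indexing gives the same values as indexing first, since each selected row's mean is computed identically either way:

```python
import numpy as np

array = np.random.random((1000, 50))
selected_rows = np.arange(0, 1000, 3)

a = array[selected_rows, :].mean(axis=1)  # copy the rows, then reduce
b = array.mean(axis=1)[selected_rows]     # reduce every row, then index

print(np.allclose(a, b))  # True: same result either way
```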
CodePudding user response:
(Sorry for first version of this answer)
I tried using numba:
import numba
@numba.jit('float64[:](float64[:, :], int64[:])')
def selective_mean(array, indices):
    total = np.zeros(array.shape[1], dtype=np.float64)
    for idx in indices:
        total += array[idx]        # accumulate the selected rows
    return total / len(indices)    # divide by the number of selected rows
t0 = time.time()
y2 = selective_mean(x, selected_rows)
print(time.time() - t0)
There is a little slowdown compared to numpy, but a much smaller one (20% slower?). After compilation (the first call to this function) I got about the same timings. For a smaller indices array you should see some gain.