I need to compute the mean of a 2D array along one axis. Here I keep all rows:
import numpy as np, time
x = np.random.random((100000, 500))
t0 = time.time()
y = x.mean(axis=0) # y.shape is (500,) as expected
print(time.time() - t0) # 36 milliseconds
When I filter and select some rows, I notice it is 8 times slower. So I tried an easy test where selected_rows is in fact all rows. Still, it is 8 times slower:
selected_rows = np.arange(100000)
t0 = time.time()
y = x[selected_rows, :].mean(axis=0) # selecting all rows!
print(time.time() - t0) # 280 milliseconds! (for the same result as above!)
Is there a way to speed up the process of selecting certain rows (selected_rows) and computing .mean(axis=0)?
In the specific case where selected_rows = all rows, it would be interesting to not have 8x slower execution.
CodePudding user response:
When you do x[selected_rows, :] where selected_rows is an array, NumPy performs advanced (aka fancy) indexing to create a new array. This copy is what takes time.
If, instead, you use a slice operation, a view of the original array is created, and that takes much less time. For example:
import timeit
import numpy as np
selected_rows = np.arange(0, 100000, 2)
array = np.random.random((100000, 500))
t1 = timeit.timeit("array[selected_rows, :].mean(axis=0)", globals=globals(), number=10)
t2 = timeit.timeit("array[::2, :].mean(axis=0)", globals=globals(), number=10)
print(t1, t2, t1 / t2) # 1.3985465039731935 0.18735826201736927 7.464557414839488
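A quick way to see the view-vs-copy difference directly is np.shares_memory (this check is not in the original timing code, just an illustration):

```python
import numpy as np

array = np.random.random((1000, 50))
selected_rows = np.arange(0, 1000, 2)

fancy = array[selected_rows, :]   # advanced indexing -> allocates a new array
sliced = array[::2, :]            # basic slicing -> view into the original

print(np.shares_memory(array, fancy))   # False: data was copied
print(np.shares_memory(array, sliced))  # True: no copy was made
```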
Unfortunately, there's no good way to represent all possible selected_rows as slices, so if you have a selected_rows that can't be represented as a slice, you don't have any other option but to take the hit in performance. There's more information in the answers to these questions:
- Fast numpy fancy indexing
- get view of numpy array using boolean or sequence object (advanced indexing)
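That said, when selected_rows happens to form a uniform arithmetic progression (as in the all-rows case from the question), you can detect that and fall back to a slice. The helper below is hypothetical, not part of NumPy, and only covers that one pattern:

```python
import numpy as np

def as_slice(indices):
    """Return an equivalent slice if `indices` is a uniform arithmetic
    progression with positive step, else None. (Hypothetical helper.)"""
    if len(indices) == 0:
        return None
    if len(indices) == 1:
        return slice(int(indices[0]), int(indices[0]) + 1)
    step = int(indices[1] - indices[0])
    if step <= 0 or np.any(np.diff(indices) != step):
        return None
    return slice(int(indices[0]), int(indices[-1]) + 1, step)

x = np.random.random((1000, 50))
selected_rows = np.arange(1000)  # all rows: representable as a slice

s = as_slice(selected_rows)
if s is not None:
    y = x[s, :].mean(axis=0)            # view, no copy
else:
    y = x[selected_rows, :].mean(axis=0)  # fall back to fancy indexing
```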
dankal444's answer here doesn't help in your case, since the axis of the mean call is the axis you wanted to filter in the first place. It is, however, the best way to do this if the filter axis and the mean axis are different: save the creation of the new array until after you've condensed one axis. You still take a performance hit compared to basic slicing, but it is not as large as if you indexed before the mean call.
For example, if you wanted .mean(axis=1):
t1 = timeit.timeit("array[selected_rows, :].mean(axis=1)", globals=globals(), number=10)
t2 = timeit.timeit("array.mean(axis=1)[selected_rows]", globals=globals(), number=10)
t3 = timeit.timeit("array[::2, :].mean(axis=1)", globals=globals(), number=10)
t4 = timeit.timeit("array.mean(axis=1)[::2]", globals=globals(), number=10)
print(t1, t2, t3, t4)
# 1.4732236850004483 0.3643951010008095 0.21357544500006043 0.32832237200000236
Which shows that:
- Indexing before mean is the worst by far (t1)
- Slicing before mean is best, since you don't have to spend extra time calculating means for the unnecessary rows (t3)
- Both indexing (t2) and slicing (t4) after mean are better than indexing before mean, but not better than slicing before mean
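As a quick sanity check (not in the original answer), reducing first and then indexing gives the same values as indexing first, since each selected row's mean is computed identically either way:

```python
import numpy as np

array = np.random.random((1000, 50))
selected_rows = np.arange(0, 1000, 3)

a = array[selected_rows, :].mean(axis=1)  # copy the rows, then reduce
b = array.mean(axis=1)[selected_rows]     # reduce every row, then index

print(np.allclose(a, b))  # True: same result either way
```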
CodePudding user response:
(Sorry for first version of this answer)
I tried using numba:
import numba
@numba.jit('float64[:](float64[:, :], int64[:])')
def selective_mean(array, indices):
    total = np.zeros(array.shape[1], dtype=np.float64)
    for idx in indices:
        total += array[idx]        # accumulate the selected rows
    return total / len(indices)    # divide by the number of selected rows
t0 = time.time()
y2 = selective_mean(x, selected_rows)
print(time.time() - t0)
There is a little slowdown compared to numpy, but a much smaller one (20% slower?). After compilation (the first call to this function) I got about the same timings. For a smaller indices array you should see some gain.