I'm trying to compute average values of shifted diagonals of a square array.
Given input matrix like (in reality much larger than 3x3):
[[a, b, c],
[d, e, f],
[g, h, i]]
correct answer would be
[g, (d h)/2, (a e i)/3, (b f)/2, c]
A code to compute such average could be:
import numpy as np
def offset_diag_mean(mat):
n = len(mat)
return np.array([np.mean(np.diag(mat,k)) for k in range(-n 1,n)])
I assume one can do without a list comprehension (speed is quite important for me). Any clever solutions I'm missing?
CodePudding user response:
The tricky part is summing the diagonals. I've questioned multiple times in the past about the fastest approach to accomplish this.
One solution is to use bincount
:
def offset_diag_mean(mat):
n = len(mat)
return np.array([np.mean(np.diag(mat,k)) for k in range(-n 1,n)])
def f(a):
A = a[::-1]
n = A.shape[0]
indices = np.arange(2*n-1)
indices = np.lib.stride_tricks.as_strided(indices, A.shape, (8,8))
triangle = np.r_[np.arange(1, n 1), np.arange(n-1, 0, -1)]
return np.bincount(indices.ravel(), A.ravel()) / triangle
a = np.arange(9).reshape(3,3) # 3x3 matrix example
assert (f(a) == offset_diag_mean(a)).all()
Timings
% timeit offset_diag_mean(a)
43.8 µs ± 962 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
%timeit f(a)
24.2 µs ± 362 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
For larger arrays:
a = np.arange(1024*1024).reshape(1024,1024)
%timeit offset_diag_mean(a)
21.5 ms ± 548 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit f(a)
8.71 ms ± 135 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
This might give you an entry point to further optimize it. I would suggest implementing it in c or use numba for a considerable speedup.
CodePudding user response:
On efficient solution is to accumulate lines of the input 2D array directly in the output array at a specific position and then perform the division.
The idea is to zero-initialize an output
array, then add [a, b, c]
to output[2:5]
, then add [d, e, f]
to output[1:4]
and then add [g, h, i]
to output[0:3]
. Finally, we can divide result
by [1, 2, 3, 2, 1]
.
Here is an implementation:
# Use the decorator @nb.njit here to use Numba
def compute(mat):
n = mat.shape[0]
output = np.zeros(n*2-1, dtype=np.float64)
for i in range(n-1, -1, -1):
output[i:i n] = mat[n-1-i]
output[0:n] /= np.arange(1, n 1, 1, dtype=np.float64)
output[n:] /= np.arange(n-1, 0, -1, dtype=np.float64)
return output
Here is the resulting performance on my machine on a 1000x1000 array:
Initial code: 15.31 ms
Kevin's code: 3.26 ms ( x4.7)
This code: 1.42 ms (x10.8)
This code Numba: 0.66 ms (x23.2)
Thus, this implementation is 10.8 times faster than the initial implementation and 2.3 times faster than the fastest alternative version. Note that using Numba results in an even faster code with almost not change to the code: this Numba version is 23.2 times faster than the initial implementation.