Problem:
- I have M objects sampled at different frames, and I want to calculate the distance between each pair at each frame. I store the distances as a multidimensional array xij with three axes, where the element xij[t,i,j] corresponds to the distance between objects i and j at time t. For example, we could have:
N = 10**5
M = 10
xij = np.random.uniform(0, 10, N).reshape(int(N/M**2), M, M)
- Now I want to calculate, for each object, its average distance to the other objects (that is, excluding the self-pairs xij[t,i,i]). I implemented this by first setting those entries to NaN and then using np.nanmean():
xij[...,np.arange(M), np.arange(M)] = np.nan
mean = np.nanmean(xij, axis = -1)
- However, setting all these values to np.nan becomes a bottleneck in my program, and it seems to me that it may not be necessary. Is there a faster alternative? I see that np.mean has a where argument that takes a boolean array selecting the elements to include in the calculation. I wonder whether that array could be created more efficiently than with the NaN trick I implemented. Alternatively, would masked arrays work? I am not familiar with them.
CodePudding user response:
You could sum, subtract the diagonal, and divide by M-1:
meanDistance = (np.sum(xij, axis = -1) - np.diagonal(xij, axis1=-2, axis2=-1)) / (M - 1)
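Written out with the symbols from the question (x for xij, M objects), the expression above is just the mean over j != i rewritten as a full row sum minus the diagonal entry. This is only a restatement of the one-liner, and it holds whether or not the diagonal entries are zero:

```latex
\bar{x}_{t,i}
  \;=\; \frac{1}{M-1}\sum_{j \neq i} x_{t,i,j}
  \;=\; \frac{1}{M-1}\left(\sum_{j=1}^{M} x_{t,i,j} \;-\; x_{t,i,i}\right)
```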
Demo results:
(sum-diag) / (M-1):
time in seconds: 0.03786587715148926
t=0 first three means: [5.42617836 5.03198446 5.67675881]
nanmean:
time in seconds: 0.18410110473632812
t=0 first three means: [5.42617836 5.03198446 5.67675881]
Demo code:
import numpy as np
from time import time
N = 10**7
M = 10
xij = np.random.uniform(0, 10, N).reshape(int(N/M**2), M, M)
print('(sum-diag) / (M-1):')
t0 = time()
meanDistance = (np.sum(xij, axis = -1) - np.diagonal(xij, axis1=-2, axis2=-1)) / (M - 1)
print(' time in seconds:', time() - t0)
print(' t=0 first three means:', meanDistance[0,:3])
print()
print('nanmean:')
t0 = time()
xij[...,np.arange(M), np.arange(M)] = np.nan
meanDistance = np.nanmean(xij, axis = -1)
print(' time in seconds:', time() - t0)
print(' t=0 first three means:', meanDistance[0,:3])
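For comparison, the two alternatives mentioned in the question (the where argument of np.mean and masked arrays) can be sketched as below. This is a small illustration on made-up sizes (T and M here are arbitrary), not a benchmark; the where argument of np.mean needs a reasonably recent NumPy (1.20 or later, as far as I know). Neither variant modifies xij.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M = 4, 5                      # small illustrative sizes (assumed)
xij = rng.uniform(0, 10, size=(T, M, M))

# Boolean (M, M) mask that is True off the diagonal; it broadcasts over the time axis.
off_diag = ~np.eye(M, dtype=bool)

# Variant 1: np.mean with a where mask.
mean_where = np.mean(xij, axis=-1, where=off_diag)

# Variant 2: masked array; in np.ma, mask=True means "ignore this element".
masked = np.ma.array(xij, mask=np.broadcast_to(~off_diag, xij.shape).copy())
mean_masked = masked.mean(axis=-1).filled(np.nan)

# Reference: sum, subtract the diagonal, divide by M-1.
mean_sum = (xij.sum(axis=-1) - np.diagonal(xij, axis1=-2, axis2=-1)) / (M - 1)

print(np.allclose(mean_where, mean_sum), np.allclose(mean_masked, mean_sum))
```

Both variants avoid writing NaN into the array, but they still do extra work per element, so the sum-minus-diagonal form is likely to remain the cheapest; timing on your own data is the only way to be sure.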
CodePudding user response:
This isn't a direct answer to your question: rather than only averaging precomputed pair distances, it does the distance calculation and the averaging all at once.
Assumptions
- Euclidean distance between pairs
- The distance calculation is based on one array, diagonal elements are zero
- points is an array with axes corresponding to (time, element, coordinate of the position)
Code
import numpy as np
import numba as nb

@nb.njit(fastmath=True,inline="never")
def mean_dist_inner(points,res):
    # points: (element, coordinate) for one frame; res: (element,) output
    div=1/(points.shape[0]-1)
    for i in range(points.shape[0]):
        acc=0.0
        for j in range(points.shape[0]):
            # squared Euclidean distance between elements i and j
            dist=0.0
            for k in range(points.shape[1]):
                dist+=(points[i,k]-points[j,k])**2
            # the i == j term adds zero, so dividing by M-1 excludes it
            acc+=np.sqrt(dist)
        res[i]=acc*div
    return

@nb.njit(fastmath=True,parallel=True,cache=True)
def mean_dist_time(points):
    res=np.empty((points.shape[0],points.shape[1]),dtype=np.float64)
    # frames are independent, so they can be processed in parallel
    for t in nb.prange(points.shape[0]):
        mean_dist_inner(points[t],res[t])
    return res
Timing
points=np.random.rand(10000,40,40)
%timeit mean_dist_time(points)
#40.1 ms ± 9.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
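As a cross-check for the loops above, the same quantity can be sketched in plain NumPy with broadcasting, under the same assumptions (Euclidean distance, points with axes (time, element, coordinate)). The name mean_dist_numpy is just an illustrative helper, and this version builds a (time, element, element, coordinate) temporary, so it is only meant for small inputs:

```python
import numpy as np

def mean_dist_numpy(points):
    """points has shape (time, element, coordinate); returns shape (time, element)."""
    # Pairwise coordinate differences: shape (time, element, element, coordinate).
    diff = points[:, :, None, :] - points[:, None, :, :]
    # Euclidean distances: shape (time, element, element); the diagonal is zero.
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    n_elements = points.shape[1]
    # The zero diagonal adds nothing to the sum, so divide by n_elements - 1.
    return dist.sum(axis=-1) / (n_elements - 1)

# One frame, three 2-D points: two at the origin and one at (3, 4).
points = np.array([[[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]])
# Per-element means are 2.5, 5.0 and 2.5.
print(mean_dist_numpy(points))
```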
CodePudding user response:
Edit: I wrongly thought a distance needed to be calculated first.
This looks like a reshape exercise combined with numpy.triu_indices. If the distances are not symmetric (x[i,j] != x[j,i]), you need to combine triu_indices and tril_indices.
Assuming x[i,j] == x[j,i], then:
import numpy as np
N = 10000
xij = np.random.uniform(0, 10, (N,N))
# k=-1 selects the strictly lower triangle, which excludes the diagonal
np.mean( xij[ np.tril_indices(N, k=-1) ] )
If there is also a time dimension, as in the question:
N = 10**5
M = 10
xij = np.random.uniform(0, 10, N).reshape(int(N/M**2), M, M)
You could use
N_dim = xij.shape[-1]
[ np.mean( xij[t,:][np.tril_indices(N_dim, k=-1)] ) for t in range(xij.shape[0]) ]
to get a list of means (one per frame), or get the overall mean with
N_dim = xij.shape[-1]
np.mean( [ np.mean( xij[t,:][np.tril_indices(N_dim, k=-1)] ) for t in range(xij.shape[0]) ] )
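The per-frame loop can probably be avoided by indexing the last two axes with the triangle indices directly. The sketch below assumes symmetric distances and uses the strictly lower triangle (k=-1), which excludes the diagonal; the sizes and the symmetrised test array are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
T, M = 6, 4                                 # small illustrative sizes (assumed)
a = rng.uniform(0, 10, size=(T, M, M))
xij = (a + a.transpose(0, 2, 1)) / 2        # make each frame symmetric

# Row/column indices of the strictly lower triangle (no diagonal).
rows, cols = np.tril_indices(M, k=-1)

# Select those pairs in every frame at once: shape (T, M*(M-1)/2).
per_frame = xij[:, rows, cols].mean(axis=-1)
overall = per_frame.mean()

# Same result with an explicit loop over frames, for comparison.
loop = np.array([np.mean(xij[t][np.tril_indices(M, k=-1)]) for t in range(T)])
print(np.allclose(per_frame, loop))
```

Note that this gives one mean per frame over all pairs, not one mean per object as asked in the question.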