Most efficient way to calculate the average of a function between pairs for each element in Python?-CodePudding

Problem:

I have M objects sampled at different frames and I want to calculate the distance between pairs at each frame. I store the distances as a multidimensional array xij with three axis, where the element xij[t,i,j] corresponds to the distance between the objects i and j at time t. For example, we could have:
```
N  = 10**5
M = 10
xij = np.random.uniform(0, 10, N).reshape(int(N/M**2), M, M)
```
Now I want to calculate for each element the average distance to other pairs (that is, excluding pairs between the same objects xij[t,i,i]). The way I implemented this was first changing the values of these indices to NaN and then using np.nanmean():
```
xij[...,np.arange(M), np.arange(M)] = np.nan
mean = np.nanmean(xij, axis = -1) 
```
However, changing all these values to np.nan becomes a bottleneck in my program and it seems to me that maybe is not necessary. Is there a faster alternative? I see there is an argument where in np.mean to choose the elements to include in the calculation as a boolen array. I wonder if you could create this array more efficiently than using the Nan trick I implemented. Or alternatively, maybe using masked arrays? Although I am not familiar with them.

CodePudding user response：

You could sum, subtract the diagonal, and divide by M-1:

meanDistance = (np.sum(xij, axis = -1) - np.diagonal(xij, axis1=-2, axis2=-1))  / (M - 1)

Demo results:

(sum-diag) / (M-1):
  time in seconds: 0.03786587715148926
  t=0 first three means: [5.42617836 5.03198446 5.67675881]

nanmean:
  time in seconds: 0.18410110473632812
  t=0 first three means: [5.42617836 5.03198446 5.67675881]

Demo code (Try it online!):

import numpy as np
from time import time

N  = 10**7
M = 10
xij = np.random.uniform(0, 10, N).reshape(int(N/M**2), M, M)

print('(sum-diag) / (M-1):')
t0 = time()
meanDistance = (np.sum(xij, axis = -1) - np.diagonal(xij, axis1=-2, axis2=-1))  / (M - 1)
print('  time in seconds:', time() - t0)
print('  t=0 first three means:', meanDistance[0,:3])

print()
print('nanmean:')
t0 = time()
xij[...,np.arange(M), np.arange(M)] = np.nan
meanDistance = np.nanmean(xij, axis = -1)
print('  time in seconds:', time() - t0)
print('  t=0 first three means:', meanDistance[0,:3])

CodePudding user response：

This isn't a direct answer to your question, because it is not only on calculating the average of distances between pairs, but doing the distance calculation and averaging all at once.

Assumptions

Euclidean distance between pairs
The distance calculation is based on one array, diagonal elements are zero
points is an array with axis corresponding to (time, element, coordinate of the position)

Code

import numpy as np
import numba as nb

@nb.njit(fastmath=True,inline="never")
def mean_dist_inner(points,res):
    div=1/(points.shape[0]-1)

    for i in range(points.shape[0]):
        acc=0
        for j in range(points.shape[0]):
            dist=0
            for k in range(points.shape[1]):
                dist =(points[i,k]-points[j,k])**2
            acc =np.sqrt(dist)
        res[i]=acc*div
    return

@nb.njit(fastmath=True,parallel=True,cache=True)
def mean_dist_time(points):

    res=np.empty((points.shape[0],points.shape[1]),dtype=np.float64)

    for t in nb.prange(points.shape[0]):
        mean_dist_inner(points[t],res[t])
    return res

Timing

points=np.random.rand(10000,40,40)
%timeit mean_dist_time(points)
#40.1 ms ± 9.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

CodePudding user response：

Edit: I wrongly thoughted a distance needed to be calculated first. This seems a reshape exercise together with numpy.triu_indices. If the distance x[i,j] != x[j,i] you need a combination with triu_indices & tril_indices.

I assume x[i,j] = x[j,i], than:

import numpy as np

N = 10000
xij = np.random.uniform(0, 10, (N,N))

np.mean( xij[ np.tril_indices(N, k=1) ] )

If there is a dimension time, like

N  = 10**5
M = 10
xij = np.random.uniform(0, 10, N).reshape(int(N/M**2), M, M)

You could

N_dim = xij.shape[-1]

[ np.mean( xij[t,:][np.tril_indices(N_dim, k=1)] ) for t in range(xij.shape[0]) ]

To get a list of means, or the overal mean with

N_dim = xij.shape[-1]

np.mean( [ np.mean( xij[t,:][np.tril_indices(N_dim, k=1)] ) for t in range(xij.shape[0]) ] )