Most efficient way to calculate the average of a function between pairs for each element in Python?

Time:09-23

Problem:

  • I have M objects sampled at different frames, and I want to calculate the distance between each pair of objects at each frame. I store the distances in a multidimensional array xij with three axes, where the element xij[t,i,j] is the distance between objects i and j at time t. For example, we could have:
    N  = 10**5
    M = 10
    xij = np.random.uniform(0, 10, N).reshape(int(N/M**2), M, M)
    
  • Now I want to calculate, for each object, its average distance to the other objects (that is, excluding the self-pairs xij[t,i,i]). I implemented this by first setting the values at these indices to NaN and then using np.nanmean():
    xij[...,np.arange(M), np.arange(M)] = np.nan
    mean = np.nanmean(xij, axis = -1) 
    
  • However, setting all these values to np.nan has become a bottleneck in my program, and it seems to me that it may not be necessary. Is there a faster alternative? I see that np.mean has a where argument that takes a boolean array selecting the elements to include in the calculation. I wonder whether this array could be created more efficiently than the NaN trick I implemented. Alternatively, could masked arrays be used? I am not familiar with them.
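
As a minimal sketch of the two ideas raised in the question (assuming NumPy 1.20 or later, where np.mean accepts `where`), a boolean mask that is False on the diagonal can be built once and broadcast over the time axis. The masked-array variant is shown for comparison. `T`, `off_diag` and the random data are illustrative names and values, and neither variant is claimed to be the fastest option.

```python
import numpy as np

M = 10     # number of objects
T = 1000   # number of frames (illustrative value)
rng = np.random.default_rng(0)
xij = rng.uniform(0, 10, (T, M, M))

# Boolean mask of shape (M, M): True off the diagonal, False on it.
off_diag = ~np.eye(M, dtype=bool)

# The mask broadcasts against the trailing (M, M) axes of xij,
# so xij itself is left unmodified.
mean_where = np.mean(xij, axis=-1, where=off_diag)

# Masked-array alternative: mask=True marks the entries to ignore.
mask = np.broadcast_to(~off_diag, xij.shape).copy()
masked = np.ma.masked_array(xij, mask=mask)
mean_masked = masked.mean(axis=-1).filled(np.nan)
```

Both results have shape (T, M), one mean per object per frame.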

CodePudding user response:

You could sum, subtract the diagonal, and divide by M-1:

meanDistance = (np.sum(xij, axis = -1) - np.diagonal(xij, axis1=-2, axis2=-1))  / (M - 1)
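
Why this works: each row sum contains exactly one diagonal term, so subtracting it leaves the M-1 off-diagonal terms. A small check on a hand-written 3x3 frame, where the values are illustrative and the diagonal is deliberately set to 9 so that its exclusion is visible:

```python
import numpy as np

# One frame, three objects; shape (1, 3, 3).
xij = np.array([[[9.0, 1.0, 2.0],
                 [1.0, 9.0, 4.0],
                 [2.0, 4.0, 9.0]]])
M = xij.shape[-1]

row_sum = xij.sum(axis=-1)                    # includes the diagonal term
diag = np.diagonal(xij, axis1=-2, axis2=-1)   # xij[t, i, i] for every t, i
mean_distance = (row_sum - diag) / (M - 1)

print(mean_distance)  # [[1.5 2.5 3. ]]
```

Unlike the NaN approach, this leaves xij unchanged.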

Demo results:

(sum-diag) / (M-1):
  time in seconds: 0.03786587715148926
  t=0 first three means: [5.42617836 5.03198446 5.67675881]

nanmean:
  time in seconds: 0.18410110473632812
  t=0 first three means: [5.42617836 5.03198446 5.67675881]

Demo code:

import numpy as np
from time import time

N  = 10**7
M = 10
xij = np.random.uniform(0, 10, N).reshape(int(N/M**2), M, M)

print('(sum-diag) / (M-1):')
t0 = time()
meanDistance = (np.sum(xij, axis = -1) - np.diagonal(xij, axis1=-2, axis2=-1))  / (M - 1)
print('  time in seconds:', time() - t0)
print('  t=0 first three means:', meanDistance[0,:3])

print()
print('nanmean:')
t0 = time()
xij[...,np.arange(M), np.arange(M)] = np.nan
meanDistance = np.nanmean(xij, axis = -1)
print('  time in seconds:', time() - t0)
print('  t=0 first three means:', meanDistance[0,:3])

CodePudding user response:

This isn't a direct answer to your question, because it does not only average precomputed distances between pairs: it does the distance calculation and the averaging in one pass.

Assumptions

  • Euclidean distance between pairs
  • The distance calculation is based on one array, diagonal elements are zero
  • points is an array with axes corresponding to (time, element, coordinate of the position)

Code

import numpy as np
import numba as nb

@nb.njit(fastmath=True,inline="never")
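def mean_dist_inner(points,res):
    # points: (element, coordinate) for one frame; res: (element,) output
    div=1/(points.shape[0]-1)

    for i in range(points.shape[0]):
        acc=0.0
        for j in range(points.shape[0]):
            dist=0.0
            for k in range(points.shape[1]):
                dist+=(points[i,k]-points[j,k])**2  # accumulate squared differences
            acc+=np.sqrt(dist)  # the j == i term contributes 0
        res[i]=acc*div
    return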

@nb.njit(fastmath=True,parallel=True,cache=True)
def mean_dist_time(points):

    res=np.empty((points.shape[0],points.shape[1]),dtype=np.float64)

    for t in nb.prange(points.shape[0]):
        mean_dist_inner(points[t],res[t])
    return res

Timing

points=np.random.rand(10000,40,40)
%timeit mean_dist_time(points)
#40.1 ms ± 9.04 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
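
For checking a compiled version on small inputs, a plain NumPy reference can be useful. This is a sketch under the same assumptions as above (Euclidean distance, points laid out as (time, element, coordinate)); `mean_dist_reference` is a hypothetical helper name, and it materialises a (time, M, M, K) intermediate array, so it needs far more memory than the loop version.

```python
import numpy as np

def mean_dist_reference(points):
    """Mean Euclidean distance from each element to all others, per frame."""
    # Pairwise coordinate differences, shape (T, M, M, K).
    diff = points[:, :, None, :] - points[:, None, :, :]
    # Pairwise distances, shape (T, M, M); the diagonal is exactly zero.
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    m = points.shape[1]
    # The zero diagonal adds nothing to the sum, so divide by M - 1.
    return dist.sum(axis=-1) / (m - 1)

# Small illustrative input: 50 frames, 8 elements, 3 coordinates.
points = np.random.default_rng(0).random((50, 8, 3))
ref = mean_dist_reference(points)
```

Comparing `ref` with the output of mean_dist_time on the same input via np.allclose is one way to validate the Numba code.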

CodePudding user response:

Edit: I wrongly assumed that a distance needed to be calculated first. This looks like an indexing exercise using numpy.tril_indices or numpy.triu_indices. If x[i,j] != x[j,i], you need to combine triu_indices and tril_indices.

I assume x[i,j] = x[j,i], then:

import numpy as np

N = 10000
xij = np.random.uniform(0, 10, (N,N))
np.mean( xij[ np.tril_indices(N, k=-1) ] )  # k=-1 excludes the diagonal

If there is a time dimension, like

N  = 10**5
M = 10
xij = np.random.uniform(0, 10, N).reshape(int(N/M**2), M, M)

You could

N_dim = xij.shape[-1]

[ np.mean( xij[t,:][np.tril_indices(N_dim, k=-1)] ) for t in range(xij.shape[0]) ]

To get a list of means, or the overall mean with

N_dim = xij.shape[-1]

np.mean( [ np.mean( xij[t,:][np.tril_indices(N_dim, k=-1)] ) for t in range(xij.shape[0]) ] )
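
The per-frame list comprehension can also be written without a Python loop by indexing the last two axes at once. A sketch, assuming a symmetric matrix per frame so that the strict lower triangle (k=-1, which excludes the diagonal) holds every pair exactly once; the data and the names `T`, `rows`, `cols` and `off_diag` are illustrative:

```python
import numpy as np

M = 10
T = 1000  # number of frames (illustrative value)
xij = np.random.default_rng(0).uniform(0, 10, (T, M, M))
xij = (xij + xij.transpose(0, 2, 1)) / 2  # symmetrise for this illustration

# Index pairs (i, j) with i > j: strict lower triangle, no diagonal.
rows, cols = np.tril_indices(M, k=-1)
per_frame_mean = xij[:, rows, cols].mean(axis=-1)  # shape (T,)
overall_mean = per_frame_mean.mean()

# If x[i,j] != x[j,i], select every off-diagonal entry instead.
off_diag = ~np.eye(M, dtype=bool)
per_frame_mean_all = xij[:, off_diag].mean(axis=-1)  # shape (T,)
```

Note that this gives one mean per frame, not one per object as asked in the question.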