Speeding up distance calculation between all combinations of two sets of paths-CodePudding

I have two agents, each having paths that are defined by an x and y coordinate for 100 timesteps. In other words, a single path has shape (100,2). Now agent A and B both have 1000 unique paths. I want to calculate the distance between all combinations of paths at every timestep. In other words, my final output should have shape (1000,1000,100). Currently I use the following approach:

import numpy as np
import time

np.random.seed(0)

N = 1000
A = np.random.rand(N,100,2)
B = np.random.rand(N,100,2)

t0 = time.time()
combinations = np.array(np.meshgrid(np.arange(N), np.arange(N)))
combinations = combinations.T.reshape(-1,2)

# Calculate distances 
diff = A[combinations[:,0],:] - B[combinations[:,1],:]  # Differences
distances = np.sqrt(np.einsum('ijk,ijk->ij',diff,diff)) # A bit faster than linalg norm
distances = np.reshape(distances, (N,N,100))
print('Time:', time.time() - t0)

However, I have to say that this method is quite slow (about 1.2 seconds on my machine). Is there a faster way to do this?

CodePudding user response：

The problem with the provided Numpy implementation is that it create huge temporary buffers that are costly to fill (mainly because they are stored in RAM and are allocated on the fly causing slow page faults). As a result, the code is totally memory-bound.

You can do that much more efficiently using Numba using plain loops and no temporary buffers (thanks to the JIT compiler). Moreover, Numba help you to easily compute that in parallel. Here is an implementation:

import numpy as np
import numba as nb
import time

# Calculate distances and put it in `distances`
@nb.njit('void(float64[:,:,::1], float64[:,:,::1], float64[:,:,::1])', parallel=True)
def computeDistance(A, B, distances):
    n = A.shape[0]
    assert A.shape == (n,100,2)
    assert A.shape == B.shape
    assert distances.shape == (n,n,100)
    for i in nb.prange(n):
        for j in range(n):
            for k in range(100):
                distances[i,j,k] = np.sqrt((A[i,k,0] - B[j,k,0])**2   (A[i,k,1] - B[j,k,1])**2)

np.random.seed(0)

N = 1000
A = np.random.rand(N,100,2)
B = np.random.rand(N,100,2)
distances = np.full((N,N,100), 1.0) # Pre-allocation

t0 = time.time()
computeDistance(A, B, distances)
print('Time:', time.time() - t0)

The loop is compiled to a very efficient parallel implementation (using SIMD instructions). distances is pre-allocated for better performance in iterative loops. On my 6-core machine, this code is 32 times faster (and it is still mainly memory-bound). If you want to get better performance, you could store the distances in 32-bit floats (up to 2 times faster).