Calculate distance between all points of two vectors in Python: linalg, ase.geometry, parallelization

Time:07-12

I have two sets of 500,000 3D coordinates, pos1 and pos2. I need to calculate the distance between all points of these two sets, giving a (500000, 500000) array. So pos1[0] needs to be compared with pos2[0], pos2[1], ..., pos2[499999], and the same for pos1[1], ...

It is a very slow process, and I would like it to take less than an hour on a cluster with 8 nodes.

I have been trying the following with 15,000 positions to see how to proceed:

  • numpy.linalg.norm (43 s):

dist = np.linalg.norm(pos1[:, np.newaxis, :] - pos2, axis=2)

  • numpy.einsum (12 s):

diff = pos1[:, None, :] - pos2
dist = np.sqrt(np.einsum('ijk,ijk->ij', diff, diff))
  • ase.geometry.find_mic (3 min 36 s):

The code would look like this:

from ase.geometry import find_mic

diff = pos2[:, np.newaxis, :] - pos1
dist = np.empty((pos1.shape[0], pos2.shape[0]))
for i, d in enumerate(diff):
    # Fill the upper triangle, then mirror it: dist(x, y) == dist(y, x).
    dist[i, i:] = find_mic(d[i:], cell=[[50., 0.0, 0.0], [25., 45., 0.0], [0.0, 0.0, 100.]])[1]
    if i + 1 < len(diff):
        dist[i + 1:, i] = dist[i, i + 1:]
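For reference, here is a small self-contained check (with random data and tiny arrays, a sketch rather than a benchmark) that the einsum formulation produces the same distances as numpy.linalg.norm:

```python
import numpy as np

rng = np.random.default_rng(0)
pos1 = rng.random((200, 3))  # small stand-ins for the 500,000-point sets
pos2 = rng.random((300, 3))

# Broadcasting approach with linalg.norm.
dist_norm = np.linalg.norm(pos1[:, np.newaxis, :] - pos2, axis=2)

# einsum approach: sum of squared differences, then a single sqrt.
diff = pos1[:, None, :] - pos2
dist_einsum = np.sqrt(np.einsum('ijk,ijk->ij', diff, diff))

assert np.allclose(dist_norm, dist_einsum)
```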

So my question is: how can I make it faster? Since einsum scales like N², it would take 12 × (500000/15000)² / 3600 ≈ 3 h 42 min to find a solution.

Are there other (faster, or better-scaling) methods? Can I take advantage of the fact that I can use 20 threads on 8 nodes? Or of the fact that the distance from point x to point y is the same as from y to x? I also have access to a GPU node, if that helps.

CodePudding user response:

This is an example for a cube.

You can split the cube into 32³ smaller cubes and calculate the distance d (and 1/d) between the centers of each pair of cubes. This saves a lot of calculations because 32² cubes share the same x, 32² share the same y, and 32² share the same z, and because of the symmetries you do not need to process all pairs: if you calculate the distances for 1/8 of the cubes, you know all of them.

Let's say that you have a million points.

Now, each cube will have, on average, 1,000,000/32³ ≈ 30 points.

You need to calculate the offset of those 30 points from the center of their own cube: (dx, dy, dz).

The distance between two points is then the distance d between their cubes, plus a correction dd = (x·dx + y·dy + z·dz)/d.

[You have (x, y, z)/d precalculated, and it is the same for all points in the cube.]

(I didn't do the exact calculation; I'm just giving the general idea.)

You still have to compare every point with every point, which is O(n²), but you save the calculation of the root, for each pair of points.

You can compare the result with the exact calculation, and if the difference is not small enough, you can raise the number of cubes.

Precision will be higher for distant cubes, so you may need to tweak below what distance it is better to calculate everything exactly, and beyond what distance this trick is good enough.
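The bucketing step described above can be sketched as follows (a minimal illustration, assuming points inside a 50-unit cube; the grid size, variable names, and bounding box are illustrative, not from the original answer):

```python
import numpy as np

rng = np.random.default_rng(1)
L = 50.0                      # side of the bounding cube (illustrative)
n = 8                         # cubes per axis (32 in the answer above)
points = rng.random((1000, 3)) * L

# Assign each point to a small cube via integer grid coordinates.
idx = np.minimum((points / L * n).astype(int), n - 1)        # (N, 3)

# Centers of all n**3 small cubes, computed once.
grid = np.indices((n, n, n)).reshape(3, -1).T                # (n**3, 3)
centers = (grid + 0.5) * (L / n)

# Offset (dx, dy, dz) of each point from the center of its own cube.
flat = idx[:, 0] * n * n + idx[:, 1] * n + idx[:, 2]
offsets = points - centers[flat]

# Cube-to-cube center distances can then be precomputed once and
# reused for every pair of points falling in those two cubes.
```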

CodePudding user response:

Please reconsider your needs. A (500000, 500000) array of 64-bit values takes roughly 1862 GiB of RAM, which is insanely huge, and most computing machines do not have that much. Computing this on the GPU is useless if you end up storing the result in RAM, because the computation is memory-bound and CPU-GPU transfers are significantly slower than the main RAM. AFAIK, no GPU in the world has ~2 TiB of embedded memory yet.

Storing this huge array in RAM is actually the main problem. The main RAM is slow compared to caches, and CPUs can already compute this very quickly in parallel. The same applies to GPUs: the VRAM is generally significantly faster, and GPUs are also generally faster for this kind of computation (though VRAM and caches are smaller). This means you should compute the data on the fly to use a modern computer efficiently. Consequently, whatever computation consumes the matrix should also be implemented on the GPU if you plan to use a GPU.

Honestly, a parallel CPU computation should already be quite good for your task, and I advise you to try a GPU only if the fastest CPU implementation fails to meet your timing budget (using a GPU efficiently is far more difficult than using a CPU). The @Colim answer provides some hints on how to compute the result on the fly (on both a CPU and a GPU).
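As a sketch of what computing on the fly can look like (the function name, the chunk size, and the choice of nearest-neighbour distances as the consuming computation are all illustrative), here is a blocked reduction that never materializes the full (N1, N2) matrix:

```python
import numpy as np

def chunked_min_distances(pos1, pos2, chunk=256):
    """Distance from each point of pos1 to its nearest point in pos2,
    computed block by block so only a (chunk, N2) slice exists at once."""
    out = np.full(len(pos1), np.inf)
    for start in range(0, len(pos1), chunk):
        # Pairwise differences for one block of pos1 against all of pos2.
        diff = pos1[start:start + chunk, None, :] - pos2[None, :, :]
        d = np.sqrt(np.einsum('ijk,ijk->ij', diff, diff))
        # Reduce immediately; the block of distances is then discarded.
        out[start:start + chunk] = d.min(axis=1)
    return out

rng = np.random.default_rng(2)
a, b = rng.random((1000, 3)), rng.random((800, 3))
nearest = chunked_min_distances(a, b)
```

The same loop structure parallelizes naturally: each block is independent, so blocks can be distributed across threads, nodes, or GPU kernels.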

The question is quite an XY problem: the actual problem is not computing the distance matrix, but what you want to do with it and how to do that efficiently (and computing the whole matrix is an inefficient solution). The complexity of the method is O(n²) and cannot be improved, because the output itself is of size O(n²). Generating a distance matrix is rarely useful and is known not to scale to large datasets. There are relatively fast algorithms to classify large datasets, compute nearest neighbours, etc., without computing the whole matrix (typically in O(n log n) time).
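For instance, if SciPy is available, a k-d tree answers nearest-neighbour queries without ever building the matrix (a sketch; whether nearest neighbours is the actual downstream goal here is an assumption):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(3)
pos1 = rng.random((1000, 3))
pos2 = rng.random((1000, 3))

tree = cKDTree(pos2)                    # index pos2 once
dists, indices = tree.query(pos1, k=1)  # nearest pos2 point for each pos1 point
```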
