How to make the loop more efficient in my case? (numpy)

Time:11-21

I'm running Python code to calculate the distance between certain coordinates. The original data looks like:

a = np.array([[1,40,70],[2,41,71],[3,42,73]])    #id, latitude, longitude

and I want to get the distance between every pair. The result should look like:

[1, 2, 100(km)]
[1, 3, 200(km)]
[2, 1, 100(km)]
[2, 3, 300(km)]
[3, 1, 200(km)]
[3, 2, 300(km)]

The result should contain both pair (m, n) and pair (n, m).

The actual data has 39,000 rows (points), so I have a great need for code efficiency. Currently I'm using a double loop, which is really slow:

line = 0
result = np.zeros((6,3))
for i in a:
    for j in a:
        if i[0] == j[0]:
            continue  # skip the (m, m) self-pair
        dis = getDistance(i[1],i[2],j[1],j[2])  # my function to calculate the distance between two coordinates
        result[line] = [i[0],j[0],dis]
        line += 1

Can anyone help me to improve the code?

CodePudding user response:

If your distance function is simple, you can compute the whole distance_matrix in one vectorized call and then convert it to a data structure similar to the output of np.ndenumerate(distance_matrix):

def get_dist(X, Y):
    return 100*np.hypot(X-X[:,None], Y-Y[:,None])

M = np.array([[1,40,70],[2,41,71],[3,42,73]])
names, X, Y = np.transpose(M)
distance_matrix = get_dist(X, Y)
>>> list(np.ndenumerate(distance_matrix))
[((0, 0), 0.0),
 ((0, 1), 141.4213562373095),
 ((0, 2), 360.5551275463989),
 ((1, 0), 141.4213562373095),
 ((1, 1), 0.0),
 ((1, 2), 223.60679774997897),
 ((2, 0), 360.5551275463989),
 ((2, 1), 223.60679774997897),
 ((2, 2), 0.0)]

Note that we still need to map the row/column indices back to your id names. Also, the ids are integers and the distances are floats, so we can't keep both in one single (non-structured) array without casting. You might like to avoid iterating with np.ndenumerate and recover the names a different way:

x,y = np.indices([len(M), len(M)])
>>> names[x].ravel(), names[y].ravel(), distance_matrix.ravel()
(array([1, 1, 1, 2, 2, 2, 3, 3, 3]),
 array([1, 2, 3, 1, 2, 3, 1, 2, 3]),
 array([  0.        , 141.42135624, 360.55512755, 141.42135624,
          0.        , 223.60679775, 360.55512755, 223.60679775,
          0.        ]))

Or alternatively:

>>> np.transpose([names[x].ravel(), names[y].ravel(), distance_matrix.ravel()])
array([[  1.        ,   1.        ,   0.        ],
       [  1.        ,   2.        , 141.42135624],
       [  1.        ,   3.        , 360.55512755],
       [  2.        ,   1.        , 141.42135624],
       [  2.        ,   2.        ,   0.        ],
       [  2.        ,   3.        , 223.60679775],
       [  3.        ,   1.        , 360.55512755],
       [  3.        ,   2.        , 223.60679775],
       [  3.        ,   3.        ,   0.        ]])
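Since the desired output excludes the (m, m) self-pairs, you can also mask the diagonal out of the flattened result. A small sketch reusing the toy get_dist from above (the 100 * np.hypot placeholder, not a real great-circle distance):

```python
import numpy as np

def get_dist(X, Y):
    # toy planar distance, standing in for the real great-circle function
    return 100 * np.hypot(X - X[:, None], Y - Y[:, None])

M = np.array([[1, 40, 70], [2, 41, 71], [3, 42, 73]])
names, X, Y = np.transpose(M)
distance_matrix = get_dist(X, Y)

x, y = np.indices([len(M), len(M)])
pairs = np.transpose([names[x].ravel(), names[y].ravel(), distance_matrix.ravel()])
pairs = pairs[x.ravel() != y.ravel()]  # drop the zero-distance (m, m) rows
```

This leaves exactly the six (m, n) and (n, m) rows the question asked for.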

CodePudding user response:

First of all, you are doing calculations twice. Your code:

for i in a:
    for j in a:

calculates the distance from i to j and from j to i, even though these are the same. You can cut your time in half simply by doing something like

for i in range(len(a)):
    for j in range(i + 1, len(a)):
        D[i,j] = distance_calc(i,j)

Even if you need the distance in both directions, I wouldn't calculate it twice; just assign the value in two places. And if you can vectorize your code, it will likely speed things up further:

for i in range(len(a)):
    D[i, i + 1:] = distance_calc(i)
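The compute-once-then-mirror pattern above can be sketched end to end with a toy one-dimensional "distance" (plain absolute difference, just to keep the example self-contained; the data is made up):

```python
import numpy as np

vals = np.array([0.0, 3.0, 7.0, 10.0])
n = len(vals)
D = np.zeros((n, n))
for i in range(n):
    # compute each pair once, vectorized over j = i+1 .. n-1 ...
    d = np.abs(vals[i + 1:] - vals[i])
    # ... then write it into both triangles of the symmetric matrix
    D[i, i + 1:] = d
    D[i + 1:, i] = d
```

Each distance is computed exactly once, and the diagonal stays zero for free.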

To give an example for your problem: the trig calculations are all done in radians, so I would start by converting your locations from degrees to radians. Even so, we can speed things up quite a bit. I am assuming here that you are using the Haversine estimate for distance on a spherical Earth; I'm basing my algorithm on the info from the post referenced in the comments to your question. To start, I'll consider N = 1000 points rather than your full set.

import numpy as np
import time

r_Earth = 6371
N = 1000

def haversine_one_element(data1, data2):
    # calculates the Haversine distance one element at a time
    lat1 = data1[0]                     
    lng1 = data1[1]         

    lat2 = data2[0]                     
    lng2 = data2[1]         

    diff_lat = lat1 - lat2
    diff_lng = lng1 - lng2
    d = np.sin(diff_lat/2)**2 + np.cos(lat1)*np.cos(lat2) * np.sin(diff_lng/2)**2
    return 2 * r_Earth * np.arcsin(np.sqrt(d))            

def haversine_slow(a, N, do_all = True):
    D = np.zeros((N,N))
    for i in range(N):
        if do_all:
            for j in range(N):
                D[i,j] = haversine_one_element(a[i,1:], a[j,1:])
        else: 
            for j in range(i + 1, N): # note that D[i,i] = 0 so we can skip it
                D[i,j] = D[j,i] = haversine_one_element(a[i,1:], a[j,1:])
    return D

a = np.array([np.arange(N).ravel(),
              np.arcsin(np.random.random(N) * 2 - 1) * 180 / np.pi,  # latitudes in [-90, 90]
              np.random.random(N) * 360 - 180]).transpose()          # longitudes in [-180, 180)
a[:,1:] = a[:,1:] * np.pi / 180

# Doing the entire array
start = time.time()
haversine_slow(a,N)
print(time.time() - start) # 8.777195453643799

# Doing only the upper half
start = time.time()
haversine_slow(a,N,do_all = False)
print(time.time() - start) # 4.547634840011597

However, we can rewrite our Haversine formula to take in a swath of values at once so we can calculate the distance from one point to all of the other points at once.

def haversine(data1, data2):
    # data1, data2 are the data arrays with 2 cols and they hold
    # lat., lng. values in those cols respectively
    lat1 = data1[0]                     
    lng1 = data1[1]         

    lat2 = data2[0,:]                     
    lng2 = data2[1,:]         

    diff_lat = lat1 - lat2
    diff_lng = lng1 - lng2
    d = np.sin(diff_lat/2)**2 + np.cos(lat1)*np.cos(lat2) * np.sin(diff_lng/2)**2
    return 2 * r_Earth * np.arcsin(np.sqrt(d))            

def haversine_d(a,N):
    D = np.zeros((N,N))
    for i in range(N):
        d1 = a[i,1:]
        d2 = a[i+1:,1:].transpose()
        # fill the upper triangle row by row, then mirror it
        D[i,i+1:] = haversine(d1, d2)
        D[i+1:,i] = D[i, i+1:].transpose()
    return(D)

start = time.time()
d1 = haversine_d(a,N)
print(time.time() - start) # 0.03420424461364746

So, for my case here, that has reduced the compute time by another factor of about 100, and I expect the gain to be even larger for longer vectors like the one you are talking about. (Subject to memory-bound problems; see below.)
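If memory allows, the remaining Python loop can be dropped entirely by broadcasting over both axes at once. This is a sketch under the same assumptions (inputs already in radians; haversine_full is a name I made up), and note that it materializes several N x N temporaries, so it trades memory for speed:

```python
import numpy as np

r_Earth = 6371.0

def haversine_full(lat, lng):
    # lat, lng: 1-D arrays in radians; returns the full N x N distance matrix
    dlat = lat - lat[:, None]
    dlng = lng - lng[:, None]
    d = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(lat[:, None]) * np.sin(dlng / 2) ** 2
    return 2 * r_Earth * np.arcsin(np.sqrt(d))

lat = np.radians(np.array([40.0, 41.0, 42.0]))
lng = np.radians(np.array([70.0, 71.0, 73.0]))
D = haversine_full(lat, lng)
```

This computes each pair twice (both triangles), so it only wins when the per-call overhead of the row loop dominates.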

However, there is one more thing we can do if we look at the problem in terms of linear algebra instead of trigonometry. If you've taken linear algebra, you know the relationship between dot products and cosines, and we can use that to our advantage by describing the locations as vectors in (x, y, z) instead of longitude and latitude:

def get_xyz(a):
    x = np.cos(a[:,1]) * np.cos(a[:,2])
    y = np.cos(a[:,1]) * np.sin(a[:,2])
    z = np.sin(a[:,1]) 
    v = np.stack([x, y, z]).transpose()
    return(np.hstack([a[:,:1], v]))

def distance_dot(a,N):
    b = get_xyz(a)
    
    D = np.zeros((N,N))
    for i in range(N):
        D[i,i+1:] = np.arccos(b[i,1] * b[i+1:,1] + \
                              b[i,2] * b[i+1:,2] + \
                              b[i,3] * b[i+1:,3]) * r_Earth
        D[i+1:,i] = D[i, i+1:].transpose()
    return(D)

start = time.time()
distance_dot(a,N)
print(time.time() - start) # 0.019799470901489258

This algebra approach should give the same answers to within floating-point rounding: if you compare the Haversine and linear-algebra results with == the test will fail, but a tolerance test such as abs(difference) <= 10**-10 will pass.
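One convenient way to run that tolerance check is np.allclose. This sketch (variable names are mine) builds both distance matrices for a few random points; the np.clip guards arccos against dot products that land just outside [-1, 1] from rounding:

```python
import numpy as np

r_Earth = 6371.0
rng = np.random.default_rng(0)
lat = np.arcsin(rng.uniform(-1, 1, 5))   # radians, area-uniform latitudes
lng = rng.uniform(-np.pi, np.pi, 5)      # radians

# Haversine distances
dlat = lat - lat[:, None]
dlng = lng - lng[:, None]
h = np.sin(dlat / 2) ** 2 + np.cos(lat) * np.cos(lat[:, None]) * np.sin(dlng / 2) ** 2
D_hav = 2 * r_Earth * np.arcsin(np.sqrt(h))

# Dot products of the unit vectors; clip so rounding can't push arccos out of domain
v = np.stack([np.cos(lat) * np.cos(lng), np.cos(lat) * np.sin(lng), np.sin(lat)], axis=1)
cos_angle = np.clip(v @ v.T, -1.0, 1.0)
D_dot = np.arccos(cos_angle) * r_Earth

# == generally fails; a tolerance comparison passes (arccos is ill-conditioned
# near the diagonal, so a modest absolute tolerance in km is the safe choice)
assert np.allclose(D_hav, D_dot, atol=1e-3)
```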

These methods have reduced the calculation time by a factor of about 400 on my dataset of 1,000 points. For a larger dataset, I did not run the first two for N = 40,000, but the last two (Haversine and linear algebra) both came in at about the same time (290 s), probably because the problem needed all the memory on my laptop. If memory hadn't been an issue, I would expect the time to scale roughly as N**2, which would have been around 60 s for the Haversine and 30 s for the linear algebra.

If you start encountering memory problems, there might be a way to speed things up by calculating chunks of the array and saving them to disk as you go, but given the time involved in figuring that out, unless you need production-level code or will run this over and over, I probably wouldn't bother.
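For what it's worth, a chunked version might look like the sketch below (function names and the chunk size are mine): each pass computes one slab of rows against all points, so the large trig temporaries stay at chunk x N, and writing each slab out with np.save instead of keeping the full array would cap memory further.

```python
import numpy as np

r_Earth = 6371.0

def haversine_block(lat1, lng1, lat2, lng2):
    # distances between every point in block 1 and every point in block 2
    d = (np.sin((lat2 - lat1[:, None]) / 2) ** 2
         + np.cos(lat1[:, None]) * np.cos(lat2) * np.sin((lng2 - lng1[:, None]) / 2) ** 2)
    return 2 * r_Earth * np.arcsin(np.sqrt(d))

def haversine_chunked(lat, lng, chunk=512):
    # process the rows in chunks so only a (chunk x N) slab of temporaries
    # exists at once; swap the assignment for np.save(...) if even the
    # full result matrix does not fit in RAM
    out = np.empty((len(lat), len(lat)))
    for start in range(0, len(lat), chunk):
        stop = start + chunk
        out[start:stop] = haversine_block(lat[start:stop], lng[start:stop], lat, lng)
    return out
```

The chunked result matches the one-shot computation exactly, since each row goes through the same operations either way.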
