import haversine as hs
from haversine import Unit
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'Latitude': [33.110348, 33.500308, 30.428149, 33.493309, 36.433678],
                   'Longitude': [-83.259740, -86.792691, -92.326981, -86.828201, -89.025341]})
def getDists(x, y):
    return df.apply(lambda row: hs.haversine((x, y), (row['Latitude'], row['Longitude']), unit=Unit.MILES), axis=1)

df.apply(lambda row: getDists(row['Latitude'], row['Longitude']), axis=1)
I have a DataFrame with lat/long values and a unique identifier. I would like to create a new DataFrame holding the haversine distance for every combination of rows. I also realized that the distance from x to y is the same as from y to x, so as long as I sort the values I can eliminate half the calculations.
I could easily do this with a for loop / iterrows, but I was trying to do it as efficiently as possible.
I think I can accomplish this using .apply() and a lambda over my haversine function, but after thinking about it for hours I've only managed to completely confuse myself.
CodePudding user response:
Are you looking for something along the lines of
df['distance'] = df.apply(lambda r: hs.haversine(origin, (r['Latitude'], r['Longitude'])), axis=1)
where origin is some fixed (lat, lon) reference point? (haversine takes two points, so a per-row call has to compare each row against something.)
I've thought about this kind of problem before, and my original solution was a distance matrix: loop over half of the matrix (plus the diagonal), and mirror each value into the corresponding symmetric position.
CodePudding user response:
To achieve the best performance, the distance function you are computing should be vectorized. It looks like haversine already supports vectorization via haversine_vector (see the comb argument), so you should be able to do something like:
from haversine import haversine_vector, Unit

coords = df[['Latitude', 'Longitude']].to_numpy()
haversine_vector(coords, coords, Unit.MILES, comb=True)
This is based on the performance-improvements section of the haversine documentation.
CodePudding user response:
I would first build the non-duplicated, exhaustive combinations of rows in a DataFrame.
It is essentially the Cartesian product minus the self-pairs and the mirrored duplicates of each 2-location combination.
In matrix terms, what you need is the strict upper triangle.
>>> import numpy as np
>>> np.triu(1 - np.eye(len(df)))
array([[0., 1., 1., 1., 1.],
[0., 0., 1., 1., 1.],
[0., 0., 0., 1., 1.],
[0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0.]])
Using the nonzero positions of this matrix as row indices into the DataFrame, I get the exhaustive list of 2-location combinations.
>>> i, j = np.where(np.triu(1 - np.eye(len(df))))
>>> df = (df.iloc[i].reset_index(drop=True)
.join(df.iloc[j].reset_index(drop=True), lsuffix='_x', rsuffix='_y'))
>>> df
id_x Latitude_x Longitude_x id_y Latitude_y Longitude_y
0 1 33.110348 -83.259740 2 33.500308 -86.792691
1 1 33.110348 -83.259740 3 30.428149 -92.326981
2 1 33.110348 -83.259740 4 33.493309 -86.828201
3 1 33.110348 -83.259740 5 36.433678 -89.025341
4 2 33.500308 -86.792691 3 30.428149 -92.326981
5 2 33.500308 -86.792691 4 33.493309 -86.828201
6 2 33.500308 -86.792691 5 36.433678 -89.025341
7 3 30.428149 -92.326981 4 33.493309 -86.828201
8 3 30.428149 -92.326981 5 36.433678 -89.025341
9 4 33.493309 -86.828201 5 36.433678 -89.025341
This doesn't include any duplicated combinations, so I can now calculate the distance for each row. (Note that hs.haversine defaults to kilometers; pass unit=Unit.MILES if you want miles.)
>>> df['result'] = df.apply(lambda row: hs.haversine([row['Latitude_x'], row['Longitude_x']], [row['Latitude_y'], row['Longitude_y']]), axis=1)
>>> df
id_x Latitude_x Longitude_x id_y Latitude_y Longitude_y result
0 1 33.110348 -83.259740 2 33.500308 -86.792691 331.157711
1 1 33.110348 -83.259740 3 30.428149 -92.326981 907.184077
2 1 33.110348 -83.259740 4 33.493309 -86.828201 334.342337
3 1 33.110348 -83.259740 5 36.433678 -89.025341 643.134695
4 2 33.500308 -86.792691 3 30.428149 -92.326981 623.748640
5 2 33.500308 -86.792691 4 33.493309 -86.828201 3.383468
6 2 33.500308 -86.792691 5 36.433678 -89.025341 384.390670
7 3 30.428149 -92.326981 4 33.493309 -86.828201 620.539431
8 3 30.428149 -92.326981 5 36.433678 -89.025341 734.575638
9 4 33.493309 -86.828201 5 36.433678 -89.025341 383.356886
If your DataFrame is big, apply is not the most performant method, so if you are worried about its performance you can try np.vectorize.
%%timeit
df['result'] = df.apply(lambda row: hs.haversine([row['Latitude_x'], row['Longitude_x']], [row['Latitude_y'], row['Longitude_y']]), axis=1)
1.02 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df['result'] = np.vectorize(lambda lat1, lon1, lat2, lon2: hs.haversine((lat1, lon1), (lat2, lon2)))(
    df['Latitude_x'], df['Longitude_x'], df['Latitude_y'], df['Longitude_y'])
430 µs ± 9.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)