import haversine as hs
from haversine import Unit
import pandas as pd
df = pd.DataFrame({'id': [1, 2, 3, 4, 5],
                   'Latitude': [33.110348, 33.500308, 30.428149, 33.493309, 36.433678],
                   'Longitude': [-83.259740, -86.792691, -92.326981, -86.828201, -89.025341]})
def getDists(x, y):
    return df.apply(lambda row: hs.haversine((x, y), (row['Latitude'], row['Longitude']), unit=Unit.MILES), axis=1)

df.apply(lambda row: getDists(row['Latitude'], row['Longitude']), axis=1)
I have a DataFrame with lat/long values and a unique identifier. I would like to create a new DataFrame holding the haversine distance for every combination of rows. I also realized that the distance from x to y is the same as from y to x, so as long as I sort the values I can eliminate half the calculations.
I could easily do this with a for loop / iterrows, but I was trying to do it as efficiently as possible.
I think I can accomplish this using .apply() and a lambda over my haversine function, but after thinking about it for hours I've only managed to completely confuse myself.
CodePudding user response:
Are you looking for something along the lines of
df['distance'] = df.apply(lambda r: hs.haversine(origin, (r['Latitude'], r['Longitude'])), axis=1)
where origin is some fixed (lat, lon) reference point? (haversine takes two points, so a per-row call has to compare each row against something.)
I've thought about this kind of problem before, and my original solution was a distance matrix: loop over half of the matrix (plus the diagonal), and mirror each value into the corresponding symmetric position.
CodePudding user response:
To achieve the best performance, the distance function you are computing should be vectorized. It looks like haversine already supports vectorization via haversine_vector (see the comb argument), so you should be able to do something like:
from haversine import haversine_vector, Unit

coords = df[['Latitude', 'Longitude']].to_numpy()
haversine_vector(coords, coords, Unit.MILES, comb=True)
This is based on the performance-improvements section of the haversine documentation.
CodePudding user response:
I would first build the non-duplicated, exhaustive combinations of rows in a DataFrame.
It is essentially the Cartesian product minus the self-pairs and the mirrored duplicates of each 2-location combination.
In matrix terms, what you need is the strict upper triangle.
>>> import numpy as np
>>> np.triu(1 - np.eye(len(df)))
array([[0., 1., 1., 1., 1.],
[0., 0., 1., 1., 1.],
[0., 0., 0., 1., 1.],
[0., 0., 0., 0., 1.],
[0., 0., 0., 0., 0.]])
Using the nonzero positions of this matrix as row indices into the DataFrame, I get the exhaustive list of 2-location combinations.
>>> i, j = np.where(np.triu(1 - np.eye(len(df))))
>>> df = (df.iloc[i].reset_index(drop=True)
.join(df.iloc[j].reset_index(drop=True), lsuffix='_x', rsuffix='_y'))
>>> df
id_x Latitude_x Longitude_x id_y Latitude_y Longitude_y
0 1 33.110348 -83.259740 2 33.500308 -86.792691
1 1 33.110348 -83.259740 3 30.428149 -92.326981
2 1 33.110348 -83.259740 4 33.493309 -86.828201
3 1 33.110348 -83.259740 5 36.433678 -89.025341
4 2 33.500308 -86.792691 3 30.428149 -92.326981
5 2 33.500308 -86.792691 4 33.493309 -86.828201
6 2 33.500308 -86.792691 5 36.433678 -89.025341
7 3 30.428149 -92.326981 4 33.493309 -86.828201
8 3 30.428149 -92.326981 5 36.433678 -89.025341
9 4 33.493309 -86.828201 5 36.433678 -89.025341
This doesn't include any duplicated combinations, so I can now calculate the distance for each row. (Note that hs.haversine defaults to kilometers; pass unit=Unit.MILES if you want miles.)
>>> df['result'] = df.apply(lambda row: hs.haversine([row['Latitude_x'], row['Longitude_x']], [row['Latitude_y'], row['Longitude_y']]), axis=1)
>>> df
id_x Latitude_x Longitude_x id_y Latitude_y Longitude_y result
0 1 33.110348 -83.259740 2 33.500308 -86.792691 331.157711
1 1 33.110348 -83.259740 3 30.428149 -92.326981 907.184077
2 1 33.110348 -83.259740 4 33.493309 -86.828201 334.342337
3 1 33.110348 -83.259740 5 36.433678 -89.025341 643.134695
4 2 33.500308 -86.792691 3 30.428149 -92.326981 623.748640
5 2 33.500308 -86.792691 4 33.493309 -86.828201 3.383468
6 2 33.500308 -86.792691 5 36.433678 -89.025341 384.390670
7 3 30.428149 -92.326981 4 33.493309 -86.828201 620.539431
8 3 30.428149 -92.326981 5 36.433678 -89.025341 734.575638
9 4 33.493309 -86.828201 5 36.433678 -89.025341 383.356886
If your DataFrame is big, apply is not the most performant method, so if you are worried about its performance you can try np.vectorize.
%%timeit
df['result'] = df.apply(lambda row: hs.haversine([row['Latitude_x'], row['Longitude_x']], [row['Latitude_y'], row['Longitude_y']]), axis=1)
1.02 ms ± 33.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
%%timeit
df['result'] = np.vectorize(lambda lat1, lon1, lat2, lon2: hs.haversine((lat1, lon1), (lat2, lon2)))(
    df['Latitude_x'], df['Longitude_x'], df['Latitude_y'], df['Longitude_y'])
430 µs ± 9.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)