I am trying to calculate the distance from each point (coordinate) of a column in a Dataframe to each point (coordinate) of another column in another Dataframe and save the minimum distance to a Dataframe, such that the resulting Dataframe has the same length as the first column. data: intern:
ID | coordinates |
---|---|
1 | (50.939266, 6.934996) |
2 | (50.935998, 6.931481) |
360 entries
stops:
Name | coordinates |
---|---|
A | (50.93576, 6.96046) |
B | (50.9357, 6.95957) |
2082 entries
desired result:
ID | coordinates | min_distance |
---|---|---|
1 | (50.939266, 6.934996) | 1.8263939732112 |
2 | (50.935998, 6.931481) | 0.3 |
intern: 3 columns with 360 entries each, where the last column is the minimum distance of the coordinate in the second column to each coordinate of the stops Dataframe
I have tried:
intern_index = 0
stops_index = 0
min_distance = 99999
hs.haversine((50.939266, 6.934996), (50.93576, 6.96046), unit=Unit.KILOMETERS)
Out: 1.8263939732112031
for i in intern['coordinates']:
for j in stops['coordinates']:
new_distance = hs.haversine(i, j, unit=Unit.KILOMETERS)
if new_distance < new_distance:
min_distance = new_distance
intern['min_distance'] = min_distance
this yields:
ID | coordinates | min_distance |
---|---|---|
1 | (50.939266, 6.934996) | 22.941973 |
2 | (50.935998, 6.931481) | 22.941973 |
why doesn't it save the correct value? even the first try was smaller than this value and it cannot all be the same distance either
CodePudding user response:
I would suggest implementing what you want in the vectorised form as it's much faster. Numpy is super-efficient in the kind of calculation you need.
Initialising some test data:
df_1 = pd.DataFrame({"ID": [1, 2],
"coordinates": [(1.0, 0.0), (-0.5, 0.0)]})
df_1.set_index("ID", inplace=True)
df_2 = pd.DataFrame({"Name": ["A", "B", "C"],
"coordinates": [(1.0, 0.0), (0.0, 0.0), (-1.0, 0.0)]})
df_2.set_index("Name", inplace=True)
With the help of Vectorizing Haversine distance calculation in Python writing a vectorised version of haversine distance calculation between each pair of the two collections of inputs:
def haversine(df_1, df_2):
np_1 = np.array(df_1['coordinates'].tolist())
np_2 = np.array(df_2['coordinates'].tolist())
np_1 = np.deg2rad(np_1)
np_2 = np.deg2rad(np_2)
diff_lat = np_1[:, 0, None] - np_2[:, 0]
diff_lng = np_1[:, 1, None] - np_2[:, 1]
d = np.sin(diff_lat/2)**2 np.cos(np_1[:, 0, None])*np.cos(np_2[:, 0]) * np.sin(diff_lng/2)**2
return 2 * 6371 * np.arcsin(np.sqrt(d))
Applying:
df_1["min_haversine_dist"] = np.min(haversine(df_1, df_2), axis=1)
df_1.head()
Returns:
ID, coordinates, min_haversine_dist
1 (1.0, 0.0) 0.000000
2 (-0.5, 0.0) 55.597463
Note. My personal rule of thumb: if I'm looping over a dataFrame I'm doing something wrong, if I have a nested loop over records in a dataFrame I'm doing something completely wrong.