I have two DataFrames. One contains several power plants and their locations as longitude and latitude, each in its own column. The other contains several substations, also with longitude and latitude. I am trying to assign each power plant to its closest substation.
df1 = pd.DataFrame({'ID_pp':['p1','p2','p3','p4'],'x':[12.644881,11.563269, 12.644881, 8.153184], 'y':[48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss':['s1','s2','s3','s4'],'x':[9.269, 9.390, 9.317, 10.061], 'y':[55.037, 54.940, 54.716, 54.349]})
I found this solution, which is basically exactly what I need:
import pandas as pd
import geopy.distance

for i, row in df1.iterrows():  # A: every power plant
    a = row.y, row.x  # geopy expects (latitude, longitude)
    distances = []
    for j, row2 in df2.iterrows():  # B: every substation
        b = row2.y, row2.x
        distances.append(geopy.distance.geodesic(a, b).km)
    min_distance = min(distances)
    min_index = distances.index(min_distance)
    # Write the result for this row only, not for the whole column
    df1.loc[i, 'assigned_to'] = min_index
It works, but it is incredibly slow and my dataset is quite large. I think an approach using .apply would be much quicker, but I can't think of a way to use it. Does anybody have an idea how to rewrite the loop above into an .apply approach that doesn't need iteration?
CodePudding user response:
A solution that is about 5x faster on the sample data, going by the timings below (inspired by this topic):
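import numpy as np

def closest_node(node, nodes):
    # Return the index of the row in `nodes` that is closest to `node`.
    # This compares squared Euclidean distances on the raw x/y values,
    # not the geodesic distance used in the question.
    nodes = np.asarray(nodes)
    deltas = nodes - node
    dist_2 = np.einsum('ij,ij->i', deltas, deltas)
    return np.argmin(dist_2)

df2_array = df2[["x","y"]].values
df1['assigned_to'] = df1.apply(lambda x: closest_node(np.array([x.x, x.y]), df2_array), axis=1)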
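One caveat: closest_node ranks substations by squared Euclidean distance on raw longitude/latitude values, while the loop in the question used a geodesic distance. The two can disagree, because a degree of longitude covers less ground the further you are from the equator. If that matters for your data, a rough sketch using the haversine formula could look like the following. The helper name haversine_km and the 6371 km mean Earth radius are my own assumptions, and haversine treats the Earth as a sphere, so the values will differ slightly from geopy's ellipsoidal geodesic.

```python
import numpy as np

def haversine_km(lon1, lat1, lon2, lat2, radius_km=6371.0):
    """Great-circle distance in km on a sphere (hypothetical helper, not from geopy)."""
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))

# Sample coordinates from the question (x = longitude, y = latitude)
pp_lon = np.array([12.644881, 11.563269, 12.644881, 8.153184])
pp_lat = np.array([48.099206, 48.020081, 48.099206, 49.153766])
ss_lon = np.array([9.269, 9.390, 9.317, 10.061])
ss_lat = np.array([55.037, 54.940, 54.716, 54.349])

# Broadcasting a column of plants against a row of substations
# yields a (n_plants, n_substations) distance matrix in km.
dist_km = haversine_km(pp_lon[:, None], pp_lat[:, None],
                       ss_lon[None, :], ss_lat[None, :])

# Positional index of the nearest substation for each plant
nearest = dist_km.argmin(axis=1)
```

On the sample data both metrics pick the same substation for every plant, but that is not guaranteed in general.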
Speed comparison:
before:
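%%timeit
for i, row in df1.iterrows():  # A: every power plant
    a = row.y, row.x  # geopy expects (latitude, longitude)
    distances = []
    for j, row2 in df2.iterrows():  # B: every substation
        b = row2.y, row2.x
        distances.append(geopy.distance.geodesic(a, b).km)
    min_distance = min(distances)
    min_index = distances.index(min_distance)
    # Write the result for this row only, not for the whole column
    df1.loc[i, 'assigned_to'] = min_index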
gives 8.75 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
after:
%%timeit
df2_array = df2[["x","y"]].values
df1['assigned_to'] = df1.apply(lambda x: closest_node(np.array([x.x, x.y]), df2_array), axis=1)
gives 1.7 ms ± 31.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
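The .apply call still runs a Python function once per power plant. If that becomes the bottleneck on a large dataset, one option is to drop .apply entirely and compute all pairwise squared distances in a single NumPy broadcast. This is only a sketch under the same squared-Euclidean assumption as closest_node. The column name assigned_ID_ss is one I made up for illustration. I have not timed this variant.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID_pp': ['p1', 'p2', 'p3', 'p4'],
                    'x': [12.644881, 11.563269, 12.644881, 8.153184],
                    'y': [48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss': ['s1', 's2', 's3', 's4'],
                    'x': [9.269, 9.390, 9.317, 10.061],
                    'y': [55.037, 54.940, 54.716, 54.349]})

pp = df1[['x', 'y']].to_numpy()  # shape (n_plants, 2)
ss = df2[['x', 'y']].to_numpy()  # shape (n_substations, 2)

# (n_plants, 1, 2) - (1, n_substations, 2) -> (n_plants, n_substations, 2)
deltas = pp[:, None, :] - ss[None, :, :]

# Squared Euclidean distance for every plant/substation pair
dist_2 = np.einsum('ijk,ijk->ij', deltas, deltas)

# Positional index of the nearest substation per plant
nearest = dist_2.argmin(axis=1)
df1['assigned_to'] = nearest

# Hypothetical extra column: map the position back to the substation ID
df1['assigned_ID_ss'] = df2['ID_ss'].to_numpy()[nearest]
```

Note that the intermediate deltas array holds n_plants × n_substations × 2 floats, so for very large inputs you would probably need to process df1 in chunks to keep memory in check.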