Rewrite iteration into an .apply function with several inputs in pandas

I have two DataFrames. One contains several power plants with their locations given as longitude and latitude, each in its own column. The other contains several substations, also with longitude and latitude. I am trying to assign each power plant to its closest substation.

df1 = pd.DataFrame({'ID_pp':['p1','p2','p3','p4'],'x':[12.644881,11.563269, 12.644881,  8.153184], 'y':[48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss':['s1','s2','s3','s4'],'x':[9.269, 9.390, 9.317, 10.061], 'y':[55.037, 54.940, 54.716, 54.349]})

I found this solution, which is basically exactly what I need:

import pandas as pd
import geopy.distance

for i, row in df1.iterrows():  # A: every power plant
    a = row.y, row.x  # geopy expects (latitude, longitude)
    distances = []
    for j, row2 in df2.iterrows():  # B: every substation
        b = row2.y, row2.x
        distances.append(geopy.distance.geodesic(a, b).km)

    min_distance = min(distances)
    min_index = distances.index(min_distance)

    # write the result for this row only, not for the whole column
    df1.loc[i, 'assigned_to'] = min_index

It works, but it is very slow and my dataset is quite large. I think an approach using .apply would be much quicker, but I can't think of a way to use it. Does anybody have an idea how to rewrite the loop above into an .apply approach that doesn't need explicit iteration?

CodePudding user response：

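A solution that is roughly 5x faster on the sample data, per the timings below (inspired by this topic). Note that it compares squared Euclidean distances computed directly on the longitude/latitude values rather than geodesic distances, so the assignment is an approximation: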

import numpy as np

def closest_node(node, nodes):
    # index of the row in `nodes` with the smallest squared Euclidean distance to `node`
    nodes = np.asarray(nodes)
    deltas = nodes - node
    dist_2 = np.einsum('ij,ij->i', deltas, deltas)  # row-wise sum of squares
    return np.argmin(dist_2)

df2_array = df2[["x", "y"]].values

df1['assigned_to'] = df1.apply(lambda row: closest_node(np.array([row.x, row.y]), df2_array), axis=1)
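
The .apply call above still runs a Python function once per row of df1. If the full distance matrix fits in memory, the same squared-Euclidean nearest-neighbour search can probably be done in one NumPy broadcasting step instead. The following is only a sketch under that assumption; the variable names `plants`, `stations`, `nearest` and the extra column `assigned_id` are illustrative and not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample data from the question
df1 = pd.DataFrame({'ID_pp': ['p1', 'p2', 'p3', 'p4'],
                    'x': [12.644881, 11.563269, 12.644881, 8.153184],
                    'y': [48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss': ['s1', 's2', 's3', 's4'],
                    'x': [9.269, 9.390, 9.317, 10.061],
                    'y': [55.037, 54.940, 54.716, 54.349]})

plants = df1[['x', 'y']].to_numpy()     # shape (n_plants, 2)
stations = df2[['x', 'y']].to_numpy()   # shape (n_stations, 2)

# Pairwise coordinate differences via broadcasting: (n_plants, n_stations, 2)
deltas = plants[:, None, :] - stations[None, :, :]

# Squared Euclidean distance for every plant/substation pair: (n_plants, n_stations)
dist_2 = np.einsum('ijk,ijk->ij', deltas, deltas)

# Position of the nearest substation for each plant
nearest = dist_2.argmin(axis=1)
df1['assigned_to'] = nearest

# Hypothetical extra column: map the position back to the substation ID
df1['assigned_id'] = df2['ID_ss'].to_numpy()[nearest]
```

The intermediate array has n_plants × n_stations × 2 entries, so for very large inputs it may be necessary to process df1 in chunks.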

Speed comparison:

Before:

%%timeit
for i,row in df1.iterrows(): # A
    a = row.x, row.y
    distances = []
    for j,row2 in df2.iterrows(): # B
        b = row2.x, row2.y
        distances.append(geopy.distance.geodesic(a, b).km)

    min_distance = min(distances)
    min_index = distances.index(min_distance)

 
    df1['assigned_to'] =  min_index

gives 8.75 ms ± 108 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

After:

%%timeit
df2_array = df2[["x","y"]].values

df1['assigned_to'] = df1.apply(lambda x: closest_node(np.array([x.x, x.y]), df2_array), axis=1)

gives 1.7 ms ± 31.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
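
One caveat on the comparison: the original loop measures geodesic distance in kilometres, while the faster version compares Euclidean distances in degrees, and a degree of longitude covers less ground than a degree of latitude away from the equator. If that difference matters, a vectorized haversine (great-circle) distance might be a reasonable middle ground. The sketch below assumes a spherical Earth with a mean radius of 6371 km, so it will not match geopy's ellipsoidal geodesic exactly; the function name `nearest_by_haversine` is hypothetical.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0  # assumption: mean Earth radius, spherical model


def nearest_by_haversine(lon1, lat1, lon2, lat2):
    """For each point in set 1, return the position of the nearest point
    in set 2 and the great-circle distance to it in km."""
    # Set 1 becomes a column, set 2 a row, so all pairs are formed by broadcasting
    lon1 = np.radians(np.asarray(lon1, dtype=float))[:, None]
    lat1 = np.radians(np.asarray(lat1, dtype=float))[:, None]
    lon2 = np.radians(np.asarray(lon2, dtype=float))[None, :]
    lat2 = np.radians(np.asarray(lat2, dtype=float))[None, :]

    dlon = lon2 - lon1
    dlat = lat2 - lat1

    # Haversine formula
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    dist_km = 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))

    idx = dist_km.argmin(axis=1)
    return idx, dist_km[np.arange(len(idx)), idx]


# Usage with the sample coordinates from the question (x = longitude, y = latitude)
pp_lon = [12.644881, 11.563269, 12.644881, 8.153184]
pp_lat = [48.099206, 48.020081, 48.099206, 49.153766]
ss_lon = [9.269, 9.390, 9.317, 10.061]
ss_lat = [55.037, 54.940, 54.716, 54.349]

idx, km = nearest_by_haversine(pp_lon, pp_lat, ss_lon, ss_lat)
```

With DataFrames, the same call would take `df1['x']`, `df1['y']`, `df2['x']`, `df2['y']`, and `idx` could be assigned to `df1['assigned_to']`.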