Add rows from one dataframe to another based on distances between their geographic coordinates

Time:05-25

I have a problem similar to the one in "How to get the distance between two geographic coordinates of two different dataframes?". I have two dataframes:

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3],
                    'lat': [-23.48, -22.94, -23.22],
                    'long': [-46.36, -45.40, -45.80]})

df2 = pd.DataFrame({'id': [100, 200, 300],
                    'lat': [-28.48, -22.94, -23.22],
                    'long': [-46.36, -46.40, -45.80]})

My question is: using the solution suggested by Ben.T there, how can I append a row from df2 to df1 when that df2 point is not near any point in df1? I think it can be done based on this matrix of distances:

import numpy as np
from sklearn.metrics.pairwise import haversine_distances

# distance threshold in meters; change as needed
threshold = 100  # meters

# mean Earth radius, used to convert radians to meters
earth_radius = 6371000  # meters

distance_matrix = (
    # get the distance between all points of each DF
    haversine_distances(
        # note that you need to convert degrees to radians with *np.pi/180
        X=df1[['lat','long']].to_numpy()*np.pi/180,
        Y=df2[['lat','long']].to_numpy()*np.pi/180)
    # get the distance in meters
    *earth_radius
    # compare to your threshold
    < threshold
    # **here I want to add rows from df2 to df1 if point from df2 is NOT near df1**
    )

The desired output would look like this:

   id     lat    long
    1  -23.48  -46.36
    2  -22.94  -45.40
    3  -23.22  -45.80
    4  -28.48  -46.36
    5  -22.94  -46.40

CodePudding user response:

The distance matrix gives you a (len(df1), len(df2)) boolean array, where True indicates that a pair of points is "close" (within the threshold). To find out whether each point in df2 has any close point in df1, reduce the matrix with any along axis 0:

In [33]: df2_has_close_point_in_df1 = distance_matrix.any(axis=0)

In [34]: df2_has_close_point_in_df1
Out[34]: array([False, False,  True])

You can then use this as a mask to filter df2. Use the bitwise negation operator ~ to invert the True/False values, so that you get only the rows in df2 that are not close to any point in df1:

In [35]: df2.iloc[~df2_has_close_point_in_df1]
Out[35]:
    id    lat   long
0  100 -28.48 -46.36
1  200 -22.94 -46.40

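This can now be concatenated with df1 to get a combined dataset. The original index labels are kept, so 0 and 1 appear twice; pass ignore_index=True to pd.concat if you want a fresh index: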

In [36]: combined = pd.concat([df1, df2.iloc[~df2_has_close_point_in_df1]], axis=0)

In [37]: combined
Out[37]:
    id    lat   long
0    1 -23.48 -46.36
1    2 -22.94 -45.40
2    3 -23.22 -45.80
0  100 -28.48 -46.36
1  200 -22.94 -46.40
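
Putting the steps above together, here is a minimal, self-contained sketch. The function name append_far_points and the final renumbering of id are my own assumptions, not part of the original solution; the renumbering only reproduces the ids 4 and 5 from the question's desired output, and it assumes you do not need to keep the original df2 ids.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import haversine_distances

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters


def append_far_points(df1, df2, threshold_m=100.0):
    """Return df1 plus the rows of df2 that have no df1 point within threshold_m.

    Hypothetical helper name; assumes both frames have 'lat' and 'long'
    columns in degrees.
    """
    # haversine_distances expects [latitude, longitude] pairs in radians
    a = np.radians(df1[["lat", "long"]].to_numpy())
    b = np.radians(df2[["lat", "long"]].to_numpy())
    # shape (len(df1), len(df2)); True where the pair is within the threshold
    close = haversine_distances(a, b) * EARTH_RADIUS_M < threshold_m
    # keep the df2 rows whose column contains no True at all
    far_rows = df2.loc[~close.any(axis=0)]
    combined = pd.concat([df1, far_rows], ignore_index=True)
    # assumption: renumber ids 1..n to match the question's desired output
    combined["id"] = np.arange(1, len(combined) + 1)
    return combined


df1 = pd.DataFrame({"id": [1, 2, 3],
                    "lat": [-23.48, -22.94, -23.22],
                    "long": [-46.36, -45.40, -45.80]})
df2 = pd.DataFrame({"id": [100, 200, 300],
                    "lat": [-28.48, -22.94, -23.22],
                    "long": [-46.36, -46.40, -45.80]})

result = append_far_points(df1, df2)
print(result)
```

If you want to keep the original ids from df2 (100 and 200 here), drop the renumbering line.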