merge_asof pandas python (multiple rows)


I have two pandas data frames. Their rows need to be matched on two values, the (X, Y) coordinates, which are not exact matches but fall within a tolerance. I need to add the 'VAL' term from df2 to the 'VAL' term from df1 where (X, Y) match.

I've attempted to use merge_asof without luck. I think if I could get merge_asof to match the way I'm expecting, I could do the addition and re-assignment, but so far I'm unable to get the matching to work out.

Here is an example of the code I have attempted.

import pandas as pd

d1 = [ ['wp1',0.0,0.0,10],['wp2',10.0,10.0,5],['wp3',7.0,7.0,5]]
d2 = [ ['wp4',0.1,0.1,5],['wp5',10.2,10.2,8]]
headers = ['Name','X','Y','VAL']
df1 = pd.DataFrame(d1,columns = headers)
df2 = pd.DataFrame(d2,columns = headers)


#attempt at solution just to find matching values
df3 = pd.merge_asof(df1.sort_values('VAL'), df2.sort_values('VAL'),on=['VAL'],by=['X','Y'],direction='nearest',tolerance=1)


#desired solution is to sum 'VAL' from df2 into df1 
d1_final = [ ['wp1',0.0,0.0,10 + 5],['wp2',10.0,10.0,5 + 8]]
df1_final = pd.DataFrame(d1_final,columns = headers)

CodePudding user response:

As stated, your tolerance matching is required on X and Y. With merge_asof() the tolerance matching happens on the on= parameter. (The by= parameter provides exact matching that happens prior to the on= tolerance matching - think groupby.) BUT on= can only tolerance-match on a single key, so you cannot do on=['X','Y'] to get where you want to go. merge_asof won't work for a tolerance match on multiple keys.
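
(A minimal sketch, not part of this answer, of why the attempt above comes back empty: by= pairs rows only where X and Y are exactly equal, so with 0.0 vs 0.1 every right-hand column ends up NaN.)

df3 = pd.merge_asof(df1.sort_values('VAL'), df2.sort_values('VAL'),
                    on='VAL', by=['X','Y'], direction='nearest', tolerance=1)
print(df3)   # Name_y is NaN on every row - no (X, Y) pair matches exactly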

Quick thought - wish I had more time right now. Tolerance-match on X only, then dropna(). Then do a secondary tolerance filter on the Y values; that info will be part of the merge. If the Y difference is within the tolerance, just add the VALs for your final answer.

df4 = pd.merge_asof(df1.sort_values(['X','Y']), df2.sort_values(['X','Y']), on='X', direction='nearest', tolerance=1)  
df4

  Name_x     X   Y_x  VAL_x Name_y   Y_y  VAL_y
0    wp1   0.0   0.0     10    wp4   0.1    5.0
1    wp3   7.0   7.0      5    NaN   NaN    NaN
2    wp2  10.0  10.0      5    wp5  10.2    8.0

Then do the process outlined above:

import numpy as np

df4.assign(VAL=np.where((df4['Y_x'] - df4['Y_y']).abs() < 1, df4['VAL_x'] + df4['VAL_y'], np.nan)) \
    .rename(columns={'Name_x':'Name','Y_x':'Y'}).dropna()[['Name','X','Y','VAL']]

  Name     X     Y   VAL
0  wp1   0.0   0.0  15.0
2  wp2  10.0  10.0  13.0
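
If you really do need a tolerance match on both keys at once, here is a rough sketch of an alternative (assuming pandas >= 1.2 for how='cross', and a per-axis tolerance of 1): a plain cross merge followed by a distance filter.

# Pair every row of df1 with every row of df2, then keep the pairs whose X and Y
# are each within the tolerance.  The cross join grows as len(df1) * len(df2),
# so this is only practical for modest frame sizes.
tol = 1
pairs = df1.merge(df2, how='cross', suffixes=('_1', '_2'))
pairs = pairs[(pairs['X_1'] - pairs['X_2']).abs().le(tol) &
              (pairs['Y_1'] - pairs['Y_2']).abs().le(tol)]
pairs.assign(VAL=pairs['VAL_1'] + pairs['VAL_2'])[['Name_1', 'X_1', 'Y_1', 'VAL']]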

CodePudding user response:

jch's solution works to set up a new df with the Names and VALs required to update df1, as follows:

import pandas as pd
import numpy as np

d1 = [ ['wp1',0.0,0.0,10],['wp2',10.0,10.0,5],['wp3',7.0,7.0,5]]
d2 = [ ['wp4',0.1,0.1,5],['wp5',10.2,10.2,8]]
headers = ['Name','X','Y','VAL']
df1 = pd.DataFrame(d1,columns = headers)
df2 = pd.DataFrame(d2,columns = headers)
df4 = pd.merge_asof(df1.sort_values(['X','Y']), df2.sort_values(['X','Y']), on='X', direction='nearest', tolerance=1)
print(df4.head())

df5 = df4.assign(VAL=np.where((df4['Y_x'] - df4['Y_y']).abs() < 1, df4['VAL_x'] + df4['VAL_y'], np.nan)) \
         .rename(columns={'Name_x':'Name','Y_x':'Y'}).dropna()[['Name','X','Y','VAL']]
print(df5.head())
print('df1')
print(df1.head())

Now I am trying (unsuccessfully) to update the original df1 and keep the original values where there is not a match.

d = df5.set_index('Name')['VAL'].to_dict()
v = df5.filter(like='VAL')
df1[v.columns] = v.replace(d)
print(df1.head())

Edit: I now know that this mapping scheme is not a good approach; the values below aren't correctly corresponding to the original df.

  Name     X     Y   VAL
0  wp1   0.0   0.0  15.0
1  wp2  10.0  10.0   NaN
2  wp3   7.0   7.0  13.0

I'd like df1 to be updated like this, but instead of NaN for wp2 it should just show the original VAL of wp2 from df1.
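
A rough sketch of one way to do that last step (assuming df1 and df5 as built above): map the summed VALs back onto df1 by Name, and fall back to the original VAL wherever there is no match.

# Matched names (wp1, wp2) get the summed VAL from df5; unmatched names (wp3)
# keep their original VAL via fillna.
df1['VAL'] = df1['Name'].map(df5.set_index('Name')['VAL']).fillna(df1['VAL'])
print(df1)
# expected output:
#   Name     X     Y   VAL
# 0  wp1   0.0   0.0  15.0
# 1  wp2  10.0  10.0  13.0
# 2  wp3   7.0   7.0   5.0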
