I have two pandas data frames. Their rows need to be matched by two values, the (X,Y) coordinates, which are not exact matches but fall within a tolerance. I need to add the 'VAL' term from df2 to the 'VAL' term from df1 where (X,Y) match.
I've attempted using merge_asof without luck. I think if I could get merge_asof to work the way I'm expecting, I could do the addition and re-assignment to df2, but so far I'm unable to get the matching to work out.
Here is an example of the code I have attempted.
import pandas as pd

d1 = [['wp1', 0.0, 0.0, 10], ['wp2', 10.0, 10.0, 5], ['wp3', 7.0, 7.0, 5]]
d2 = [['wp4', 0.1, 0.1, 5], ['wp5', 10.2, 10.2, 8]]
headers = ['Name', 'X', 'Y', 'VAL']
df1 = pd.DataFrame(d1, columns=headers)
df2 = pd.DataFrame(d2, columns=headers)
#attempt at solution just to find matching values
df3 = pd.merge_asof(df1.sort_values('VAL'), df2.sort_values('VAL'), on='VAL', by=['X', 'Y'], direction='nearest', tolerance=1)
#desired solution is to sum 'VAL' from df2 into df1
d1_final = [['wp1', 0.0, 0.0, 10 + 5], ['wp2', 10.0, 10.0, 5 + 8]]
df1_final = pd.DataFrame(d1_final,columns = headers)
CodePudding user response:
As stated, your matching within tolerance is required on X,Y. With merge_asof() the tolerance matching happens via the on= parameter. (The by= parameter provides exact matching that happens prior to the on= tolerance matching - think groupby.) BUT on= can only tolerance-match a single key, so you cannot do on=['X','Y'] to get you where you want to go. merge_asof won't work for a tolerance match on multiple keys.
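If a true two-key tolerance match is needed, one alternative to merge_asof (a sketch, not part of the original answer: it uses a plain cross join, which requires pandas >= 1.2 and materializes len(df1) * len(df2) candidate pairs, so it only suits small frames) is to pair every row of df1 with every row of df2 and filter both coordinates by the tolerance:

```python
import pandas as pd

d1 = [['wp1', 0.0, 0.0, 10], ['wp2', 10.0, 10.0, 5], ['wp3', 7.0, 7.0, 5]]
d2 = [['wp4', 0.1, 0.1, 5], ['wp5', 10.2, 10.2, 8]]
headers = ['Name', 'X', 'Y', 'VAL']
df1 = pd.DataFrame(d1, columns=headers)
df2 = pd.DataFrame(d2, columns=headers)

tol = 1

# Cross join: every df1 row paired with every df2 row.
pairs = df1.merge(df2, how='cross', suffixes=('_1', '_2'))

# Keep only pairs where BOTH coordinates are within tolerance.
mask = ((pairs['X_1'] - pairs['X_2']).abs() <= tol) & \
       ((pairs['Y_1'] - pairs['Y_2']).abs() <= tol)
matched = pairs[mask]

# Sum the VALs for the matched pairs.
result = matched.assign(VAL=matched['VAL_1'] + matched['VAL_2'])[['Name_1', 'X_1', 'Y_1', 'VAL']]
print(result)
```

This scales quadratically, but for small frames it sidesteps the single-key limitation of merge_asof entirely.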
Quick thought - wish I had more time right now. Just tolerance-match on X, then dropna(). Then do a secondary tolerance filtering on the Y values; that info will be part of the merge. If the Y differences are less than the tolerance, just add the VALs for your final answer.
df4 = pd.merge_asof(df1.sort_values(['X','Y']), df2.sort_values(['X','Y']), on='X', direction='nearest', tolerance=1)
df4
Name_x X Y_x VAL_x Name_y Y_y VAL_y
0 wp1 0.0 0.0 10 wp4 0.1 5.0
1 wp3 7.0 7.0 5 NaN NaN NaN
2 wp2 10.0 10.0 5 wp5 10.2 8.0
Then do the process outlined above:
df4.assign(VAL=np.where((df4['Y_x']-df4['Y_y']).abs()<1, df4['VAL_x'] + df4['VAL_y'], np.nan)) \
   .rename(columns={'Name_x':'Name','Y_x':'Y'}).dropna()[['Name','X','Y','VAL']]
Name X Y VAL
0 wp1 0.0 0.0 15.0
2 wp2 10.0 10.0 13.0
CodePudding user response:
jch's solution works to set up a new df with the Names and VALs required to update df1, as follows:
import pandas as pd
import numpy as np
d1 = [ ['wp1',0.0,0.0,10],['wp2',10.0,10.0,5],['wp3',7.0,7.0,5]]
d2 = [ ['wp4',0.1,0.1,5],['wp5',10.2,10.2,8]]
headers = ['Name','X','Y','VAL']
df1 = pd.DataFrame(d1,columns = headers)
df2 = pd.DataFrame(d2,columns = headers)
df4 = pd.merge_asof(df1.sort_values(['X','Y']), df2.sort_values(['X','Y']), on='X', direction='nearest', tolerance=1)
print(df4.head())
df5 = df4.assign(VAL=np.where((df4['Y_x']-df4['Y_y']).abs()<1, df4['VAL_x'] + df4['VAL_y'], np.nan)).rename(columns={'Name_x':'Name','Y_x':'Y'}).dropna()[['Name','X','Y','VAL']]
print(df5.head())
print('df1')
print(df1.head())
Now I am trying (unsuccessfully) to update the original df1 and keep the original values where there is not a match.
d = df5.set_index('Name')['VAL'].to_dict()
v = df5.filter(like='VAL')
df1[v.columns] = v.replace(d)
print(df1.head())
##Edit I now know that this mapping scheme is not a good approach; the values below aren't correctly corresponding to the original df.
Name X Y VAL
0 wp1 0.0 0.0 15.0
1 wp2 10.0 10.0 NaN
2 wp3 7.0 7.0 13.0
I'd like df1 to be updated like this, but instead of NaN for wp2, it should just show the original VAL of wp2 from df1.
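One way to get that behavior (a sketch building on jch's df5 above; the Series.map-plus-fillna step is my suggestion, not something from the original posts) is to map the summed VALs back onto df1 by Name, and fall back to df1's own VAL wherever there was no match:

```python
import pandas as pd
import numpy as np

d1 = [['wp1', 0.0, 0.0, 10], ['wp2', 10.0, 10.0, 5], ['wp3', 7.0, 7.0, 5]]
d2 = [['wp4', 0.1, 0.1, 5], ['wp5', 10.2, 10.2, 8]]
headers = ['Name', 'X', 'Y', 'VAL']
df1 = pd.DataFrame(d1, columns=headers)
df2 = pd.DataFrame(d2, columns=headers)

# Rebuild df5 as in the answer above: merge on X within tolerance,
# keep rows whose Y values also fall within tolerance, and sum the VALs.
df4 = pd.merge_asof(df1.sort_values(['X', 'Y']), df2.sort_values(['X', 'Y']),
                    on='X', direction='nearest', tolerance=1)
df5 = (df4.assign(VAL=np.where((df4['Y_x'] - df4['Y_y']).abs() < 1,
                               df4['VAL_x'] + df4['VAL_y'], np.nan))
          .rename(columns={'Name_x': 'Name', 'Y_x': 'Y'})
          .dropna()[['Name', 'X', 'Y', 'VAL']])

# Map the summed VALs back onto df1 by Name; unmatched rows (wp3)
# keep their original VAL via fillna.
df1['VAL'] = df1['Name'].map(df5.set_index('Name')['VAL']).fillna(df1['VAL'])
print(df1)
```

This keeps df1's row order intact: wp1 and wp2 pick up the summed values, and wp3, which had no tolerance match, retains its original VAL.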