I would like to merge two dataframes, df1 and df2, on key1 (left) and key2 (right), with the condition that if a row's key1 value has no match in key2, then alternative_key is used for the match instead. Is there a clean way to do that?
import pandas as pd
a = {'key1': ['a','b','c'], 'alternative_key':['f','g','h']}
df1 = pd.DataFrame(data=a)
b = {'key2':['a','b','h'], 'some_stuff': [1,2,3]}
df2 = pd.DataFrame(data=b)
df_final=df1.merge(df2, left_on='key1', right_on='key2', how='left')
Expected result in df_final:
  key1 alternative_key  some_stuff
0    a               f           1
1    b               g           2
2    c               h           3
CodePudding user response:
Perform both merges, then combine them with combine_first in the desired order of priority:
(df1.merge(df2, left_on='key1', right_on='key2', how='left')
    .combine_first(df1.merge(df2, left_on='alternative_key',
                             right_on='key2', how='left'))
    .drop(columns='key2')
)
An alternative way of writing the code (using variables):
merge1 = df1.merge(df2, left_on='key1', right_on='key2', how='left')
merge2 = df1.merge(df2, left_on='alternative_key', right_on='key2', how='left')
out = merge1.combine_first(merge2).drop(columns='key2')
Output:
  key1 alternative_key  some_stuff
0    a               f         1.0
1    b               g         2.0
2    c               h         3.0
NB. The advantage of this method is that df2 can have as many other columns as you want. All of them will be filled in the same order of priority.
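As a rough illustration of that point, here is a minimal sketch in which df2 carries a second value column. The column name more_stuff and its values are made up for this example and are not part of the question's data.

```python
import pandas as pd

df1 = pd.DataFrame({'key1': ['a', 'b', 'c'],
                    'alternative_key': ['f', 'g', 'h']})
# 'more_stuff' is a hypothetical extra column, added only for illustration
df2 = pd.DataFrame({'key2': ['a', 'b', 'h'],
                    'some_stuff': [1, 2, 3],
                    'more_stuff': ['x', 'y', 'z']})

# First priority: match on key1; second priority: match on alternative_key
merge1 = df1.merge(df2, left_on='key1', right_on='key2', how='left')
merge2 = df1.merge(df2, left_on='alternative_key', right_on='key2', how='left')

# combine_first fills the NaN cells of merge1 with values from merge2,
# column by column, so both value columns are completed for row 2
out = merge1.combine_first(merge2).drop(columns='key2')
print(out)
```

Row 2 has no match on key1, so both some_stuff and more_stuff come from the match on alternative_key.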
For an arbitrary number of alternative keys:
from functools import reduce
keys = ['key1', 'alternative_key']
out = (reduce(lambda a,b: a.combine_first(b),
              [df1.merge(df2, left_on=k, right_on='key2', how='left')
               for k in keys])
       .drop(columns='key2')
      )
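If this pattern is needed in several places, the reduce version could be wrapped in a small helper. The sketch below is one possible way to do that. The function name merge_with_fallback, the third_key column and the extra rows are hypothetical and only serve as illustration. It assumes the values in key2 are unique, so that every merge keeps exactly one row per row of df1 and the results line up on the index.

```python
from functools import reduce
import pandas as pd

def merge_with_fallback(left, right, left_keys, right_key):
    """Left-merge on each key in left_keys; earlier keys take priority.

    Assumes right[right_key] has no duplicates, so each merge keeps
    the row count and index of `left`.
    """
    merges = [left.merge(right, left_on=k, right_on=right_key, how='left')
              for k in left_keys]
    combined = reduce(lambda a, b: a.combine_first(b), merges)
    return combined.drop(columns=right_key)

# Hypothetical data: 'third_key' and the fourth row are not in the question
df1 = pd.DataFrame({'key1': ['a', 'b', 'c', 'd'],
                    'alternative_key': ['f', 'g', 'h', 'i'],
                    'third_key': ['p', 'q', 'r', 's']})
df2 = pd.DataFrame({'key2': ['a', 'b', 'h', 's'],
                    'some_stuff': [1, 2, 3, 4]})

out = merge_with_fallback(df1, df2,
                          ['key1', 'alternative_key', 'third_key'], 'key2')
print(out)
```

Here rows 0 and 1 match on key1, row 2 falls back to alternative_key, and row 3 falls back to third_key.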
CodePudding user response:
What if you base the merge on your key columns, which gives NaN for the non-matches, and then set the index so that fillna can fill the gaps? fillna aligns on matching column names and index labels:
df_final = (df1.merge(df2, left_on='key1', right_on='key2', how='left')
               .set_index('alternative_key')
               .fillna(df2.set_index('key2'))
               .drop(columns='key2')
               .reset_index())
prints:
  alternative_key key1  some_stuff
0               f    a         1.0
1               g    b         2.0
2               h    c         3.0
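To see why setting the index matters, here is a small standalone sketch of how fillna behaves when it is given a DataFrame: a missing cell is filled from the cell with the same index label and column name. The frames left and right are made-up examples, not data from the question.

```python
import numpy as np
import pandas as pd

# 'left' has a gap at index label 'h'
left = pd.DataFrame({'some_stuff': [1.0, np.nan]}, index=['f', 'h'])
# 'right' offers a value for label 'h' and an unrelated label 'z'
right = pd.DataFrame({'some_stuff': [3, 9]}, index=['h', 'z'])

# Only NaN cells are filled, and only from the matching label and column;
# the row labelled 'z' has no counterpart in 'left' and is ignored
filled = left.fillna(right)
print(filled)
```

This is the same mechanism the answer relies on: after set_index, the row with alternative_key 'h' lines up with the row of df2 whose key2 is 'h'.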