Use an alternative key if key missing in pandas merge

Time:09-09

I would like to merge two dataframes, df1 and df2, with the condition that if a row's key1 value on the left side has no match in key2 on the right side, then alternative_key is used for the match instead. Is there a nice way to do that?

import pandas as pd

a = {'key1': ['a', 'b', 'c'], 'alternative_key': ['f', 'g', 'h']}
df1 = pd.DataFrame(data=a)

b = {'key2': ['a', 'b', 'h'], 'some_stuff': [1, 2, 3]}
df2 = pd.DataFrame(data=b)

# Current attempt: matches on key1 only, so row 'c' gets NaN in some_stuff
df_final = df1.merge(df2, left_on='key1', right_on='key2', how='left')

Expected result in df_final:

  key1 alternative_key  some_stuff
0    a               f           1
1    b               g           2
2    c               h           3

CodePudding user response:

Perform both merges and combine_first in the desired order of priority:

(df1.merge(df2, left_on='key1', right_on='key2', how='left')
    .combine_first(df1.merge(df2, left_on='alternative_key',
                             right_on='key2', how='left'))
    .drop(columns='key2')
)

Alternative way of writing the code (using intermediate variables):

merge1 = df1.merge(df2, left_on='key1', right_on='key2', how='left')
merge2 = df1.merge(df2, left_on='alternative_key', right_on='key2', how='left')

out = merge1.combine_first(merge2).drop(columns='key2')

output:

  key1 alternative_key  some_stuff
0    a               f         1.0
1    b               g         2.0
2    c               h         3.0
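
The values come back as floats (1.0 rather than 1) because the left merge introduces NaN for the unmatched row before combine_first fills it. A minimal sketch, assuming every row ends up matched by one of the two keys, that casts the column back to integers:

```python
import pandas as pd

df1 = pd.DataFrame({'key1': ['a', 'b', 'c'], 'alternative_key': ['f', 'g', 'h']})
df2 = pd.DataFrame({'key2': ['a', 'b', 'h'], 'some_stuff': [1, 2, 3]})

merge1 = df1.merge(df2, left_on='key1', right_on='key2', how='left')
merge2 = df1.merge(df2, left_on='alternative_key', right_on='key2', how='left')
out = merge1.combine_first(merge2).drop(columns='key2')

# Assumption: no NaN is left in some_stuff; otherwise astype(int) raises.
# The nullable 'Int64' dtype is an alternative that tolerates missing values.
out['some_stuff'] = out['some_stuff'].astype(int)
print(out)
```

If some rows might match neither key, keeping the float column or using 'Int64' is probably the safer choice.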

NB. The advantage of this method is that you can have as many other columns as you want; all of them will be merged, in the same order of priority.
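
To illustrate the point about other columns, here is a small sketch in which df2 carries a second value column. The column name more_stuff and its values are hypothetical, added only for this example:

```python
import pandas as pd

df1 = pd.DataFrame({'key1': ['a', 'b', 'c'], 'alternative_key': ['f', 'g', 'h']})
# 'more_stuff' is a hypothetical extra column added for illustration.
df2 = pd.DataFrame({'key2': ['a', 'b', 'h'],
                    'some_stuff': [1, 2, 3],
                    'more_stuff': ['x', 'y', 'z']})

merge1 = df1.merge(df2, left_on='key1', right_on='key2', how='left')
merge2 = df1.merge(df2, left_on='alternative_key', right_on='key2', how='left')

# combine_first fills every NaN cell of merge1 from merge2, column by column,
# so both value columns are completed for row 'c'.
out = merge1.combine_first(merge2).drop(columns='key2')
print(out)
```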

For an arbitrary number of alternative keys:

from functools import reduce

keys = ['key1', 'alternative_key']

out = (reduce(lambda a,b: a.combine_first(b),
              [df1.merge(df2, left_on=k, right_on='key2', how='left')
               for k in keys])
       .drop(columns='key2')
      )
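
As a worked example of the reduce version, the sketch below adds a second fallback column. The name third_key and the extra row are hypothetical, and the approach assumes key2 is unique in df2 so that every merge keeps one row per row of df1:

```python
from functools import reduce

import pandas as pd

# 'third_key' is a hypothetical second fallback column, added for illustration.
df1 = pd.DataFrame({'key1': ['a', 'b', 'c', 'd'],
                    'alternative_key': ['f', 'g', 'h', 'i'],
                    'third_key': ['p', 'q', 'r', 's']})
df2 = pd.DataFrame({'key2': ['a', 'b', 'h', 's'], 'some_stuff': [1, 2, 3, 4]})

keys = ['key1', 'alternative_key', 'third_key']  # highest priority first

# One left merge per candidate key; all share df1's row order and index.
merges = [df1.merge(df2, left_on=k, right_on='key2', how='left') for k in keys]

# Earlier merges win; later ones only fill cells that are still NaN.
out = reduce(lambda a, b: a.combine_first(b), merges).drop(columns='key2')
print(out)
```

Row 'd' matches on neither key1 nor alternative_key, so its value is taken from the third_key merge.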

CodePudding user response:

Alternatively, base the merge on the key columns, which gives NaN for the non-matches, then set the index to alternative_key and call fillna with df2 indexed by key2. fillna aligns on matching column names and index labels:

df_final = (df1.merge(df2, left_on='key1', right_on='key2', how='left')
               .set_index('alternative_key').fillna(df2.set_index('key2'))
               .drop(columns='key2').reset_index())

prints:

  alternative_key key1  some_stuff
0               f    a         1.0
1               g    b         2.0
2               h    c         3.0
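
The chain above can be unpacked step by step to show where the alignment happens. This is a sketch of the same idea, with intermediate names (merged, left, lookup, filled) chosen here for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'key1': ['a', 'b', 'c'], 'alternative_key': ['f', 'g', 'h']})
df2 = pd.DataFrame({'key2': ['a', 'b', 'h'], 'some_stuff': [1, 2, 3]})

merged = df1.merge(df2, left_on='key1', right_on='key2', how='left')

# Index the merged frame by the fallback key and df2 by its own key, so that
# fillna can look up missing values by label.
left = merged.set_index('alternative_key')
lookup = df2.set_index('key2')

# fillna with a DataFrame aligns on index labels and column names:
# only 'some_stuff' exists in both, and only row 'h' has a NaN to fill.
filled = left.fillna(lookup)

df_final = filled.drop(columns='key2').reset_index()
print(df_final)
```

Note that this variant handles a single fallback key and puts alternative_key first in the column order.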