Merging Pandas dataframe on 2 columns in either order without duplication

My question is similar to this one here, but with a couple of critical differences which I'll try to make clear below.

I have two dataframes:

df1 = pd.DataFrame({'id_A':['0001', '0002', '0003', '0004', '0005'],
                      'id_B':['0010', '0020', '0030', '0040', '0050'],
                      'value':['A','B','C','D','E']})

df2 = pd.DataFrame({'id_a':['0020', '0010', '0004', '0003', '0005'],
                      'id_b':['0002', None, '0040', None, '0050'],
                      'value':[1,2,3,4,5]})

>>> df1
   id_A  id_B value
0  0001  0010     A
1  0002  0020     B
2  0003  0030     C
3  0004  0040     D
4  0005  0050     E

>>> df2
   id_a  id_b  value
0  0020  0002      1
1  0010  None      2
2  0004  0040      3
3  0003  None      4
4  0005  0050      5

Each item (or row) has one or two unique id numbers. These unique id numbers appear in both tables, but one table may be less complete than the other and may only list one of these id numbers for a row when two actually exist. What I want as an output is something like this:

>>> df_final
   id_A  id_B Value value
0  0001  0010     A     2
1  0002  0020     B     1
2  0003  0030     C     4
3  0004  0040     D     3
4  0005  0050     E     5

The final dataframe should have the same number of rows as df1. Currently I'm at a loss, so any help would be appreciated.

CodePudding user response：

One option is to use update before merging:

df2.columns = df1.columns
df1 = df1.rename(columns={'value':'Value'})
df2.update(df1)
df1.merge(df2, on = ['id_A', 'id_B'])
   id_A  id_B Value  value
0  0001  0010     A      1
1  0002  0020     B      2
2  0003  0030     C      3
3  0004  0040     D      4
4  0005  0050     E      5

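This is restrictive, since update aligns on the index before merging: rows are matched by position rather than by id, which is why the values above come out as 1 to 5 in order instead of following the ids.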
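To make the index alignment concrete, here is a minimal sketch on two toy frames. The names target and source and their contents are made up for illustration and are not part of the question's data.

```python
import pandas as pd

# Hypothetical toy frames: the same two ids, listed in opposite order.
target = pd.DataFrame({'id': ['b', 'a'], 'val': [10, 20]})
source = pd.DataFrame({'id': ['a', 'b']})

# update() matches rows by index label (0 with 0, 1 with 1), not by id,
# and overwrites the non-missing values of target's 'id' column in place.
target.update(source)

print(target)
```

After the update, 'a' sits next to 10, which originally belonged to 'b'. The same thing happens to df2 above whenever its rows are not already in the same order as df1.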

CodePudding user response：

Try this:

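cols1 = ['id_A', 'id_B']  # id columns of df1
cols2 = ['id_a', 'id_b']  # id columns of df2
# keep only the df1 ids that also occur in df2, then build an order-independent key per row
df1['key'] = df1[cols1].where(df1[cols1].isin(df2[cols2].stack().tolist())).fillna(0).apply(frozenset,axis=1)
df2['key'] = df2[cols2].fillna(0).apply(frozenset,axis=1)
ndf = pd.merge(df1,df2[['key','value']].rename(columns={'value':'Value'}),on = 'key',how='left').drop('key',axis=1)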

Output:

   id_A  id_B value  Value
0  0001  0010     A      2
1  0002  0020     B      1
2  0003  0030     C      4
3  0004  0040     D      3
4  0005  0050     E      5
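
As an alternative to the set-based key, here is a rough sketch of a different technique: reshape df2 into a single id-to-value lookup and map both id columns of df1 against it. It is not taken from either answer above. It assumes every id occurs at most once in df2 (as the question implies) and that every row of df1 finds a match. The column name Value2 is made up for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'id_A': ['0001', '0002', '0003', '0004', '0005'],
                    'id_B': ['0010', '0020', '0030', '0040', '0050'],
                    'value': ['A', 'B', 'C', 'D', 'E']})
df2 = pd.DataFrame({'id_a': ['0020', '0010', '0004', '0003', '0005'],
                    'id_b': ['0002', None, '0040', None, '0050'],
                    'value': [1, 2, 3, 4, 5]})

# Long format: one row per (id, value) pair, with the missing ids dropped.
lookup = (df2.melt(id_vars='value', value_vars=['id_a', 'id_b'], value_name='id')
             .dropna(subset=['id'])
             .set_index('id')['value'])

# Try id_A first, then fall back to id_B where id_A was not found in df2.
matched = df1['id_A'].map(lookup).fillna(df1['id_B'].map(lookup))

# 'Value2' is a hypothetical column name; astype(int) assumes no row is unmatched.
result = df1.assign(Value2=matched.astype(int))
print(result)
```

Because the lookup is keyed on individual ids, it does not matter which of the two columns an id appears in, and the result keeps one row per row of df1.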