I am trying to merge two dataframes if the right_on
matches with any of the left_on
columns.
Dataframe examples:
df_test_1 = pd.DataFrame({'A':['S1','S2','S3','S4','S5'],
'B':['S6','','','S7',''],
'C':[10,11,12,13,14]})
df_test_2 = pd.DataFrame({'D':['S1','S2','S3','S5','S7','S9'],
'E':[1,2,3,4,5,6]})
Merging this way through or
gives this output:
pd.merge(df_test_1,df_test_2,left_on=['A' or 'B'],right_on=['D'],how='left')
A B C D E
0 S1 S6 10 S1 1.0
1 S2 11 S2 2.0
2 S3 12 S3 3.0
3 S4 S7 13 NaN NaN
4 S5 14 S5 4.0
Merging this way through and
gives this output:
pd.merge(df_test_1,df_test_2,left_on=['A' and 'B'],right_on=['D'],how='left')
A B C D E
0 S1 S6 10 NaN NaN
1 S2 11 NaN NaN
2 S3 12 NaN NaN
3 S4 S7 13 S7 5.0
4 S5 14 NaN NaN
From the help guide this is how left_on
:
left_on : label or list, or array-like
Column or index level names to join on in the left DataFrame. Can also
be an array or list of arrays of the length of the left DataFrame.
These arrays are treated as if they are columns.
I can still imagine how ['A' or 'B'] is behaving but no clue why ['A' and 'B'] is returning 5, when it could have returned 2 as well!
How is left_on=['A' or 'B'] / ['A' and 'B']
behaving here? Array, list of arrays? I can't figure it out how is pandas reading that line. Any idea?
Edit: I slightly edited the df_test_1 to show the expected output.
pd.merge(df_test_1,df_test_2,left_on=['A' or 'B'],right_on=['D'],how='left')
A B C D E
0 S1 S6 10 S1 1.0
1 S2 11 S2 2.0
2 S3 12 S3 3.0
3 S4 S7 13 NaN 5.0
4 S5 14 S5 4.0
Since S7 has the value '5' in df_test_2
CodePudding user response:
What you tried didn't work because python first interpreted A' or 'B'
as 'A'
and 'A' and 'B'
as 'B'. Thus only passing one value to merge
.
Even if you had used a list of columns, this would have been interpreted by merge
as multiple pairs of columns to use as keys for the merge, and you would need the same number of columns in left_on
/right_on
to form the pairs.
workaround
Assuming you want to let-merge on A, with a failback on B in case there is no match for A, you have to perform multiple merges, then combine them.
Here is a code for a generic case of an arbitrary number of columns with help of functools.reduce
:
from functools import reduce
left_cols = ['A', 'B']
out = reduce(lambda a, b: a.combine_first(b),
[pd.merge(df_test_1, df_test_2, left_on=col, right_on='D', how='left')
for col in left_cols])
Output (note that I changed the first row of df_test_1
for the demo):
A B C D E
0 S6 S9 10 S9 6.0 # no match for A; B was used
1 S2 S7 11 S2 2.0
2 S3 12 S3 3.0
3 S4 S8 13 NaN NaN
4 S5 14 S5 4.0
Assuming you want to combine inner merges with multiple reference columns, you could use a list comprehension and concat
:
cols = ['A', 'B']
out = pd.concat([pd.merge(df_test_1, df_test_2, left_on=col, right_on='D', how='inner')
for col in cols])
Output:
A B C D E
0 S2 S7 11 S2 2 # merged on A
1 S3 12 S3 3 #
2 S5 14 S5 4 #
0 S6 S9 10 S9 6 # merged on B
1 S2 S7 11 S7 5 #