Home > Net >  How is the Merge function reading the boolean (or/and) here?
How is the Merge function reading the boolean (or/and) here?

Time:01-14

I am trying to merge two dataframes if the right_on matches with any of the left_on columns.

Dataframe examples:

df_test_1 = pd.DataFrame({'A':['S1','S2','S3','S4','S5'],
                          'B':['S6','','','S7',''],
                          'C':[10,11,12,13,14]})
df_test_2 = pd.DataFrame({'D':['S1','S2','S3','S5','S7','S9'],
                          'E':[1,2,3,4,5,6]})

Merging this way through or gives this output:

pd.merge(df_test_1,df_test_2,left_on=['A' or 'B'],right_on=['D'],how='left')

    A   B   C   D   E
0   S1  S6  10  S1  1.0
1   S2      11  S2  2.0
2   S3      12  S3  3.0
3   S4  S7  13  NaN NaN
4   S5      14  S5  4.0

Merging this way through and gives this output:

pd.merge(df_test_1,df_test_2,left_on=['A' and 'B'],right_on=['D'],how='left')

    A   B   C   D   E
0   S1  S6  10  NaN NaN
1   S2      11  NaN NaN
2   S3      12  NaN NaN
3   S4  S7  13  S7  5.0
4   S5      14  NaN NaN

From the help guide this is how left_on:

left_on : label or list, or array-like
    Column or index level names to join on in the left DataFrame. Can also
be an array or list of arrays of the length of the left DataFrame.
These arrays are treated as if they are columns.

I can still imagine how ['A' or 'B'] is behaving but no clue why ['A' and 'B'] is returning 5, when it could have returned 2 as well! How is left_on=['A' or 'B'] / ['A' and 'B'] behaving here? Array, list of arrays? I can't figure it out how is pandas reading that line. Any idea?

Edit: I slightly edited the df_test_1 to show the expected output.

pd.merge(df_test_1,df_test_2,left_on=['A' or 'B'],right_on=['D'],how='left')

    A   B   C   D   E
0   S1  S6  10  S1  1.0
1   S2      11  S2  2.0
2   S3      12  S3  3.0
3   S4  S7  13  NaN 5.0
4   S5      14  S5  4.0

Since S7 has the value '5' in df_test_2

CodePudding user response:

What you tried didn't work because python first interpreted A' or 'B' as 'A' and 'A' and 'B' as 'B'. Thus only passing one value to merge.

Even if you had used a list of columns, this would have been interpreted by merge as multiple pairs of columns to use as keys for the merge, and you would need the same number of columns in left_on/right_on to form the pairs.

workaround

Assuming you want to let-merge on A, with a failback on B in case there is no match for A, you have to perform multiple merges, then combine them.

Here is a code for a generic case of an arbitrary number of columns with help of functools.reduce:

from functools import reduce

left_cols = ['A', 'B']

out = reduce(lambda a, b: a.combine_first(b),
             [pd.merge(df_test_1, df_test_2, left_on=col, right_on='D', how='left')
              for col in left_cols]) 

Output (note that I changed the first row of df_test_1 for the demo):

    A   B   C    D    E
0  S6  S9  10   S9  6.0  # no match for A; B was used
1  S2  S7  11   S2  2.0
2  S3      12   S3  3.0
3  S4  S8  13  NaN  NaN
4  S5      14   S5  4.0

Assuming you want to combine inner merges with multiple reference columns, you could use a list comprehension and concat:

cols = ['A', 'B']
out = pd.concat([pd.merge(df_test_1, df_test_2, left_on=col, right_on='D', how='inner')
                 for col in cols])

Output:

    A   B   C   D  E
0  S2  S7  11  S2  2 # merged on A
1  S3      12  S3  3 #
2  S5      14  S5  4 #
0  S6  S9  10  S9  6   # merged on B
1  S2  S7  11  S7  5   #
  • Related