How to do a pandas join if the match might appear in either of two columns?-CodePudding

Here is the situation: I have two pandas data frames:

TABLE 1:

name	alias	col3
str	str	str

TABLE 2:

name_or_alias	col2	col3
str	str	str

table1.name and table1.alias all contain unique values. Meaning, there are no duplicates between either of the two columns. I need to do a left join on table2, but the problem is that the column to join on may be either table1.name OR table1.alias.

So, if I do:

table2.merge(table2, how=left, on=name),

I will only get some of the matches. If I do:

table2.merge(table2, how=left, on=alias),

I will also only get some of the matches. What I tried to do is concat the two merges and remove the duplicates

pd.concat([df1.merge(df2, how='left', left_on='name', right_on='name_or_alias'), df1.merge(df2, how='left', left_on='alias', right_on='name_or_alias')], axis=0).pipe(lambda x: x[x.index.duplicated()])

but this doesn't remove the duplicates correctly because if the match was in one of the matches but not the other, then it wont be duplicated (since the rows will be null for one of the merges and not the other).

I need to figure out how to remove the rows if the match was found in the other table. Any ideas?

CodePudding user response：

You can melt, merge, and drop_duplicates:

(df1
 .reset_index()
 .melt(['index', 'col3'], value_name='name_or_alias')
 .merge(df2, on='name_or_alias', suffixes=(None, '_2'), how='left')
 .drop_duplicates('index')
 .set_index('index')
 )

NB. To keep the original DataFrame, join the output to it.

Output:

       col3 variable name_or_alias  col2  col3_2
index                                           
0         0     name             A   5.0     8.0
1         1     name             B   3.0     6.0
2         2     name             C   NaN     NaN

Used input:

df1 = pd.DataFrame({'name': ['A', 'B', 'C'], 'alias': ['D', 'E', 'F'], 'col3': [0, 1, 2]})
df2 = pd.DataFrame({'name_or_alias': ['B', 'D', 'A'], 'col2': [3, 4, 5], 'col3': [6, 7, 8]})

CodePudding user response：

I think the problem here is that when you concat the 2 merged dataframes, it will generate duplicated records.

You can use combine_first instead of concat to combine 2 merged dataframes

It will fill null values in first dataframe with non-null values from second dataframe

df3=df1.merge(df2, how='left', left_on='name', right_on='name_or_alias')
df4=df1.merge(df2, how='left', left_on='alias', right_on='name_or_alias')
df3.combine_first(df4)