Here is the situation: I have two pandas data frames:
TABLE 1:
name | alias | col3 |
---|---|---|
str | str | str |
TABLE 2:
name_or_alias | col2 | col3 |
---|---|---|
str | str | str |
table1.name and table1.alias all contain unique values. Meaning, there are no duplicates between either of the two columns. I need to do a left join on table2, but the problem is that the column to join on may be either table1.name OR table1.alias.
So, if I do:
table2.merge(table2, how=left, on=name),
I will only get some of the matches. If I do:
table2.merge(table2, how=left, on=alias),
I will also only get some of the matches. What I tried to do is concat the two merges and remove the duplicates
pd.concat([df1.merge(df2, how='left', left_on='name', right_on='name_or_alias'), df1.merge(df2, how='left', left_on='alias', right_on='name_or_alias')], axis=0).pipe(lambda x: x[x.index.duplicated()])
but this doesn't remove the duplicates correctly because if the match was in one of the matches but not the other, then it wont be duplicated (since the rows will be null for one of the merges and not the other).
I need to figure out how to remove the rows if the match was found in the other table. Any ideas?
CodePudding user response:
You can melt
, merge
, and drop_duplicates
:
(df1
.reset_index()
.melt(['index', 'col3'], value_name='name_or_alias')
.merge(df2, on='name_or_alias', suffixes=(None, '_2'), how='left')
.drop_duplicates('index')
.set_index('index')
)
NB. To keep the original DataFrame, join
the output to it.
Output:
col3 variable name_or_alias col2 col3_2
index
0 0 name A 5.0 8.0
1 1 name B 3.0 6.0
2 2 name C NaN NaN
Used input:
df1 = pd.DataFrame({'name': ['A', 'B', 'C'], 'alias': ['D', 'E', 'F'], 'col3': [0, 1, 2]})
df2 = pd.DataFrame({'name_or_alias': ['B', 'D', 'A'], 'col2': [3, 4, 5], 'col3': [6, 7, 8]})
CodePudding user response:
I think the problem here is that when you concat the 2 merged dataframes, it will generate duplicated records.
You can use combine_first instead of concat
to combine 2 merged dataframes
It will fill null values in first dataframe with non-null values from second dataframe
df3=df1.merge(df2, how='left', left_on='name', right_on='name_or_alias')
df4=df1.merge(df2, how='left', left_on='alias', right_on='name_or_alias')
df3.combine_first(df4)