I have two dataframes
df1 = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]})
df2 = pd.DataFrame({'col3': [1,5,3]})
and would like to left merge df1
to df2
. I don't have a fixed merge column in df1
though. I would like to merge on col1
if the cell value of col1
exists in df2.col3
and on col2
if the cell value of col2
exists in df2.col3
. So in the above example merge on col1
, col2
and then col1
. (This is just an example, I actually have more than only two columns).
I could do this but I'm not sure if it's ok.
df1 = df1.assign(merge_col = np.where(df1.col1.isin(df2.col3), df1.col1, df1.col2))
df1.merge(df2, left_on='merge_col', right_on='col3', how='left')
Are there any better ways to solve it?
CodePudding user response:
Perform the merges in the preferred order, and use combine_first
to combine the merges:
(df1.merge(df2, left_on='col1', right_on='col3', how='left')
.combine_first(df1.merge(df2, left_on='col2', right_on='col3', how='left')
)
)
For a generic method with many columns:
cols = ['col1', 'col2']
from functools import reduce
out = reduce(
lambda a,b: a.combine_first(b),
[df1.merge(df2, left_on=col, right_on='col3', how='left')
for col in cols]
)
Output:
col1 col2 col3
0 1 4 1.0
1 2 5 5.0
2 3 6 3.0
Better example:
Adding another column to df2
to illustrate the merge:
df2 = pd.DataFrame({'col3': [1,5,3], 'new': ['A', 'B', 'C']})
Output:
col1 col2 col3 new
0 1 4 1.0 A
1 2 5 5.0 B
2 3 6 3.0 C
CodePudding user response:
I think your solution is possible modify with get merged Series
with compare all columns from list and then merge with this Series
:
Explanation of s
: Compare all columns by DataFrame.isin
, create missing values if no match by DataFrame.where
and for priority marge back filling missing values with select first column by position:
cols = ['col1', 'col2']
s = df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1).iloc[:, 0]
print (s)
0 1.0
1 5.0
2 3.0
Name: col1, dtype: float64
df = df1.merge(df2, left_on=s, right_on='col3', how='left')
print (df)
col1 col2 col3
0 1 4 1
1 2 5 5
2 3 6 3
Your solution with helper column:
cols = ['col1', 'col2']
df1 = (df1.assign(merge_col = = df1[cols].where(df1[cols].isin(df2.col3))
.bfill(axis=1).iloc[:, 0]))
df = df1.merge(df2, left_on='merge_col', right_on='col3', how='left')
print (df)
col1 col2 merge_col col3
0 1 4 1.0 1
1 2 5 5.0 5
2 3 6 3.0 3
Explanation of s
: Compare all columns by DataFrame.isin
, create missing values if no match by DataFrame.where
and for priority marge back filling missing values with select first column by position:
print (df1[cols].isin(df2.col3))
col1 col2
0 True False
1 False True
2 True False
print (df1[cols].where(df1[cols].isin(df2.col3)))
col1 col2
0 1.0 NaN
1 NaN 5.0
2 3.0 NaN
print (df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1))
col1 col2
0 1.0 NaN
1 5.0 5.0
2 3.0 NaN
print (df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1).iloc[:, 0])
0 1.0
1 5.0
2 3.0
Name: col1, dtype: float64