merge two dataframes on common cell values of different columns

Time:11-21

I have two dataframes

import numpy as np
import pandas as pd
df1 = pd.DataFrame({'col1': [1,2,3], 'col2': [4,5,6]})
df2 = pd.DataFrame({'col3': [1,5,3]})

and would like to left-merge df1 with df2. There is no fixed merge column in df1, though: each row should merge on col1 if its col1 value exists in df2.col3, and on col2 if its col2 value exists in df2.col3. In the example above, the three rows would merge on col1, col2 and col1 respectively. (This is just an example; I actually have more than two columns.) I could do the following, but I'm not sure if it's OK:

df1 = df1.assign(merge_col = np.where(df1.col1.isin(df2.col3), df1.col1, df1.col2))
df1.merge(df2, left_on='merge_col', right_on='col3', how='left')

Are there any better ways to solve it?

CodePudding user response:

Perform the merges in the preferred order, and use combine_first to combine the merges:

(df1.merge(df2, left_on='col1', right_on='col3', how='left')
    .combine_first(df1.merge(df2, left_on='col2', right_on='col3', how='left'))
)
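
To see why combine_first works here, it can help to look at the two intermediate merges. This is a minimal sketch using the frames from the question; the names m1, m2 and out are only for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
df2 = pd.DataFrame({'col3': [1, 5, 3]})

# Left-merge on each candidate column separately.
m1 = df1.merge(df2, left_on='col1', right_on='col3', how='left')
m2 = df1.merge(df2, left_on='col2', right_on='col3', how='left')

print(m1['col3'].tolist())  # [1.0, nan, 3.0]
print(m2['col3'].tolist())  # [nan, 5.0, nan]

# combine_first keeps m1's values and fills its NaN cells from m2.
out = m1.combine_first(m2)
print(out['col3'].tolist())  # [1.0, 5.0, 3.0]
```

Both merges keep the rows of df1 in order, so the two results line up by index. This relies on df2.col3 having no duplicate values; with duplicates, a left merge can add rows and the alignment would no longer hold.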

For a generic method with many columns:

cols = ['col1', 'col2']

from functools import reduce

out = reduce(
  lambda a,b: a.combine_first(b),
  [df1.merge(df2, left_on=col, right_on='col3', how='left')
   for col in cols]
)

Output:

   col1  col2  col3
0     1     4   1.0
1     2     5   5.0
2     3     6   3.0
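
col3 comes back as float because the intermediate merges hold NaN for unmatched rows. If integers are wanted, one option is the nullable Int64 dtype, which also tolerates rows that never match. A sketch, assuming the frames from the question:

```python
import pandas as pd

df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
df2 = pd.DataFrame({'col3': [1, 5, 3]})

m1 = df1.merge(df2, left_on='col1', right_on='col3', how='left')
m2 = df1.merge(df2, left_on='col2', right_on='col3', how='left')
print(m1['col3'].dtype)  # float64, because of the NaN in row 1

out = m1.combine_first(m2)
# Convert the whole-number floats to pandas' nullable integer dtype.
out['col3'] = out['col3'].astype('Int64')
print(out['col3'].tolist())  # [1, 5, 3]
```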

Better example: Adding another column to df2 to illustrate the merge:

df2 = pd.DataFrame({'col3': [1,5,3], 'new': ['A', 'B', 'C']})

Output:

   col1  col2  col3 new
0     1     4   1.0   A
1     2     5   5.0   B
2     3     6   3.0   C
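
The order of cols decides which match wins when a row matches through more than one column. The sketch below wraps the reduce approach in a helper (merge_first_match is a hypothetical name, not a pandas function) and uses a modified df1 whose first row matches through both col1 and col2:

```python
from functools import reduce

import pandas as pd


def merge_first_match(left, right, cols, right_on):
    """Left-merge `left` with `right`, trying each column in `cols` in priority order."""
    merges = [left.merge(right, left_on=c, right_on=right_on, how='left')
              for c in cols]
    # Earlier merges take priority; later ones only fill remaining NaN cells.
    return reduce(lambda a, b: a.combine_first(b), merges)


# Row 0 matches df2.col3 via col1 (1 -> 'A') and via col2 (3 -> 'C').
df1 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [3, 5, 6]})
df2 = pd.DataFrame({'col3': [1, 5, 3], 'new': ['A', 'B', 'C']})

by_col1_first = merge_first_match(df1, df2, ['col1', 'col2'], 'col3')
by_col2_first = merge_first_match(df1, df2, ['col2', 'col1'], 'col3')

print(by_col1_first['new'].tolist())  # ['A', 'B', 'C']
print(by_col2_first['new'].tolist())  # ['C', 'B', 'C']
```

As before, this assumes the values in df2.col3 are unique so that every intermediate merge has exactly one row per row of df1.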

CodePudding user response:

I think your solution can be adapted: build the merge key as a Series by comparing all columns in the list against df2.col3, then merge on that Series:

The Series s holds, for each row, the first value (in the order given by cols) that exists in df2.col3:

cols = ['col1', 'col2']

s = df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1).iloc[:, 0]
print (s)
0    1.0
1    5.0
2    3.0
Name: col1, dtype: float64

df = df1.merge(df2, left_on=s, right_on='col3', how='left')
print (df)
   col1  col2  col3
0     1     4     1
1     2     5     5
2     3     6     3

Your solution with helper column:

cols = ['col1', 'col2']

df1 = (df1.assign(merge_col = df1[cols].where(df1[cols].isin(df2.col3))
                                       .bfill(axis=1).iloc[:, 0]))
df = df1.merge(df2, left_on='merge_col', right_on='col3', how='left')

print (df)
   col1  col2  merge_col  col3
0     1     4        1.0     1
1     2     5        5.0     5
2     3     6        3.0     3

Explanation of s: compare all columns with DataFrame.isin, replace non-matching values with missing values using DataFrame.where, then back fill along the rows so that the first column holds the highest-priority match, and select that first column by position:

print (df1[cols].isin(df2.col3))
    col1   col2
0   True  False
1  False   True
2   True  False

print (df1[cols].where(df1[cols].isin(df2.col3)))
   col1  col2
0   1.0   NaN
1   NaN   5.0
2   3.0   NaN

print (df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1))
   col1  col2
0   1.0   NaN
1   5.0   5.0
2   3.0   NaN

print (df1[cols].where(df1[cols].isin(df2.col3)).bfill(axis=1).iloc[:, 0])
0    1.0
1    5.0
2    3.0
Name: col1, dtype: float64
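
The steps above can be collected into a small helper. This is only a sketch: first_match is a hypothetical name, and the third row of this df1 is changed so that neither column matches, to show what happens in that case:

```python
import pandas as pd


def first_match(df, cols, values):
    """Per row, return the first value among `cols` that is present in `values`."""
    sub = df[cols]
    # Keep matching values, blank out the rest, then pull the first
    # surviving value of each row into the leftmost column.
    return sub.where(sub.isin(values)).bfill(axis=1).iloc[:, 0]


df1 = pd.DataFrame({'col1': [1, 2, 7], 'col2': [4, 5, 8]})
df2 = pd.DataFrame({'col3': [1, 5, 3]})

key = first_match(df1, ['col1', 'col2'], df2['col3'])
print(key.tolist())  # [1.0, 5.0, nan]
```

A row with no match in any column gets NaN as its key. In a left merge it is kept and receives missing values from df2, as long as df2.col3 itself contains no missing values.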