Merge 2 data frames with multiple conditions, on ID and column names, automatically based on common-CodePudding

I have 2 large data frames with thousands of columns for each df. I need to left-join the two tables, namely df1 and df2. However, I don't think I'll be able to manually list all of the common columns/keys between the 2 data frames. Below is an example of the data frames:

df1 = pd.DataFrame({'id' : ['a', 'b', 'c'], # Define dictionary
                'test' : [0,0,0],
                'beautiful' : [0,0,0],
                'crazy' : [0,0,0],
                'word' : [0,0,0]})

  id  test  beautiful  crazy  word
0  a     0          0      0     0
1  b     0          0      0     0
2  c     0          0      0     0

df2 = pd.DataFrame({'id' : ['a', 'b', 'c'], # Define dictionary
                'test' : [1,0,0],
                'autumn' : [0,1,0],
                'fall' : [0,0,1],
                'word' : [1,1,0]})

  id  test  autumn  fall  word
0  a     1       0     0     1
1  b     0       1     0     1
2  c     0       0     1     0

df_result = pd.DataFrame({'id':['a','b','c'],
                          'test' : [1,0,0],
                        'beautiful' : [0,0,0],
                        'crazy' : [0,0,0],
                        'word' : [1,1,0]})

  id  test  beautiful  crazy  word
0  a     1          0      0     1
1  b     0          0      0     1
2  c     0          0      0     0

As you can see from the code, I need to join the two data frames based on two conditions. If id matches and column names match, then transfer the value from df2 to df1. I found this post with a similar problem to mine, but it is left unsolved. Thanks in advance!

CodePudding user response：

The DataFrame.update() method should do it.

df1.update(df2)

After running this line, we can test to see if it matches your desired output:

print(df1 == df_result)

Here's the result:

     id  test  beautiful  crazy  word
0  True  True       True   True  True
1  True  True       True   True  True
2  True  True       True   True  True

Additionally, df1 itself (.update works in place):

    id  test    beautiful   crazy   word
0   a   1   0   0   1
1   b   0   0   0   1
2   c   0   0   0   0

CodePudding user response：

df_result = df2.combine_first(df1)
df_result = df_result[['id', 'test', 'beautiful', 'crazy', 'word']]
print(df_result)

Output:

  id  test  beautiful  crazy  word
0  a     1          0      0     1
1  b     0          0      0     1
2  c     0          0      0     0

CodePudding user response：

You just need to set the index to id and use pandas.DataFrame.update No need to worry about other columns in df2 because:

join{‘left’}: default ‘left’
Only left join is implemented, keeping the index and columns of the original object.

df1 = df1.set_index('id')
df1.update(df2.set_index('id'))
df1.reset_index(inplace=True)

Output df1:

  id  test  beautiful  crazy  word
0  a     1          0      0     1
1  b     0          0      0     1
2  c     0          0      0     0