Home > Software design >  Merge 2 data frames with multiple conditions, on ID and column names, automatically based on common
Merge 2 data frames with multiple conditions, on ID and column names, automatically based on common

Time:07-29

I have 2 large data frames with thousands of columns for each df. I need to left-join the two tables, namely df1 and df2. However, I don't think I'll be able to manually list all of the common columns/keys between the 2 data frames. Below is an example of the data frames:

df1 = pd.DataFrame({'id' : ['a', 'b', 'c'], # Define dictionary
                'test' : [0,0,0],
                'beautiful' : [0,0,0],
                'crazy' : [0,0,0],
                'word' : [0,0,0]})

  id  test  beautiful  crazy  word
0  a     0          0      0     0
1  b     0          0      0     0
2  c     0          0      0     0

df2 = pd.DataFrame({'id' : ['a', 'b', 'c'], # Define dictionary
                'test' : [1,0,0],
                'autumn' : [0,1,0],
                'fall' : [0,0,1],
                'word' : [1,1,0]})

  id  test  autumn  fall  word
0  a     1       0     0     1
1  b     0       1     0     1
2  c     0       0     1     0

df_result = pd.DataFrame({'id':['a','b','c'],
                          'test' : [1,0,0],
                        'beautiful' : [0,0,0],
                        'crazy' : [0,0,0],
                        'word' : [1,1,0]})

  id  test  beautiful  crazy  word
0  a     1          0      0     1
1  b     0          0      0     1
2  c     0          0      0     0

As you can see from the code, I need to join the two data frames based on two conditions. If id matches and column names match, then transfer the value from df2 to df1. I found this post with a similar problem to mine, but it is left unsolved. Thanks in advance!

CodePudding user response:

The DataFrame.update() method should do it.

df1.update(df2)

After running this line, we can test to see if it matches your desired output:

print(df1 == df_result)

Here's the result:

     id  test  beautiful  crazy  word
0  True  True       True   True  True
1  True  True       True   True  True
2  True  True       True   True  True

Additionally, df1 itself (.update works in place):

    id  test    beautiful   crazy   word
0   a   1   0   0   1
1   b   0   0   0   1
2   c   0   0   0   0

CodePudding user response:

df_result = df2.combine_first(df1)
df_result = df_result[['id', 'test', 'beautiful', 'crazy', 'word']]
print(df_result)

Output:

  id  test  beautiful  crazy  word
0  a     1          0      0     1
1  b     0          0      0     1
2  c     0          0      0     0

CodePudding user response:

You just need to set the index to id and use pandas.DataFrame.update No need to worry about other columns in df2 because:

join{‘left’}: default ‘left’
Only left join is implemented, keeping the index and columns of the original object.

df1 = df1.set_index('id')
df1.update(df2.set_index('id'))
df1.reset_index(inplace=True)

Output df1:

  id  test  beautiful  crazy  word
0  a     1          0      0     1
1  b     0          0      0     1
2  c     0          0      0     0
  • Related