I have 2 large data frames with thousands of columns for each df. I need to left-join the two tables, namely df1 and df2. However, I don't think I'll be able to manually list all of the common columns/keys between the 2 data frames. Below is an example of the data frames:
df1 = pd.DataFrame({'id' : ['a', 'b', 'c'], # Define dictionary
'test' : [0,0,0],
'beautiful' : [0,0,0],
'crazy' : [0,0,0],
'word' : [0,0,0]})
id test beautiful crazy word
0 a 0 0 0 0
1 b 0 0 0 0
2 c 0 0 0 0
df2 = pd.DataFrame({'id' : ['a', 'b', 'c'], # Define dictionary
'test' : [1,0,0],
'autumn' : [0,1,0],
'fall' : [0,0,1],
'word' : [1,1,0]})
id test autumn fall word
0 a 1 0 0 1
1 b 0 1 0 1
2 c 0 0 1 0
df_result = pd.DataFrame({'id':['a','b','c'],
'test' : [1,0,0],
'beautiful' : [0,0,0],
'crazy' : [0,0,0],
'word' : [1,1,0]})
id test beautiful crazy word
0 a 1 0 0 1
1 b 0 0 0 1
2 c 0 0 0 0
As you can see from the code, I need to join the two data frames based on two conditions. If id
matches and column names
match, then transfer the value from df2 to df1. I found this post with a similar problem to mine, but it is left unsolved. Thanks in advance!
CodePudding user response:
The DataFrame.update() method should do it.
df1.update(df2)
After running this line, we can test to see if it matches your desired output:
print(df1 == df_result)
Here's the result:
id test beautiful crazy word
0 True True True True True
1 True True True True True
2 True True True True True
Additionally, df1 itself (.update works in place):
id test beautiful crazy word
0 a 1 0 0 1
1 b 0 0 0 1
2 c 0 0 0 0
CodePudding user response:
df_result = df2.combine_first(df1)
df_result = df_result[['id', 'test', 'beautiful', 'crazy', 'word']]
print(df_result)
Output:
id test beautiful crazy word
0 a 1 0 0 1
1 b 0 0 0 1
2 c 0 0 0 0
CodePudding user response:
You just need to set the index to id
and use pandas.DataFrame.update
No need to worry about other columns in df2
because:
join{‘left’}: default ‘left’
Only left join is implemented, keeping the index and columns of the original object.
df1 = df1.set_index('id')
df1.update(df2.set_index('id'))
df1.reset_index(inplace=True)
Output df1
:
id test beautiful crazy word
0 a 1 0 0 1
1 b 0 0 0 1
2 c 0 0 0 0