Say I have the following dataframes:
>>> df1 = pd.DataFrame({'fruit':['apple','orange','orange'],'taste':['sweet','sweet','sour']})
>>> df1
fruit taste
0 apple sweet
1 orange sweet
2 orange sour
>>> df2 = pd.DataFrame({'fruit':['apple','orange','orange'],'price':['high','low','low']})
>>> df2
fruit price
0 apple high
1 orange low
2 orange low
When I do df3=df1.merge(df2,on='fruit')
, I got the following result:
fruit taste price
0 apple sweet high
1 orange sweet low
2 orange sweet low
3 orange sour low
4 orange sour low
Here it looks like 2 duplicate rows were created; instead, I would expect something like
fruit taste price
0 apple sweet high
1 orange sweet low
3 orange sour low
How should I understand this behavior and how to obtain the result I was looking for?
CodePudding user response:
you should remove duplicates rows before merge:
df3 = df1.merge(df2.drop_duplicates(), on='fruit')
CodePudding user response:
if you want to merge row1,row2,row3 from df1
with row1,row2,row3 from df
2 then the code below works.
import pandas as pd
df1 = pd.DataFrame({'fruit':['apple','orange','orange'],'taste':['sweet','sweet','sour']})
df2 = pd.DataFrame({'fruit':['apple','orange','orange'],'price':['high','low','low']})
df3=df1.copy()
df3["price"]=None
df3.update(df2, join="left")
print(df3)
the reason why you get duplicated rows usingdf3=df1.merge(df2,on='fruit')
is because merge uses df1 cross join df2
(aka df1 X df2
).
More information on that if you research sql cross join.