I am trying to merge two dataframes on Origin, Destination and Service. The first df is
df = {'Origin': ['London','London','London','Geneva','Amsterdam','Amsterdam','Mardid'],
'Destination': ['Paris','Paris','Paris','Vienna','Berlin','Berlin','Barcelona'],
'Service': ['Express','Express','Express','Standard', 'Express', 'Express','Standard'],
'Goal': [12,12,12,3,8,8,10],
'Date' : ['5/6','3/4','5/6','1/2','7/8','8/9','9/10']}
df = pd.DataFrame(data=df)
The second df is
df1 = {'Origin': ['London','London','Geneva','Amsterdam','Mardid'],
'Destination': ['Paris','Paris','Vienna','Berlin','Barcelona'],
'Service': ['Express','Express','Standard','Express','Standard'],
'Price': [55,65,110,78,60]}
df1 = pd.DataFrame(data=df1)
I am using the following code to merge the two:
df2 = pd.merge(df,df1, how='left',
left_on=['Origin','Destination','Service'],
right_on=['Origin','Destination','Service'])
Is there a way to merge the two dataframes but only use the lower price from df2 if there are any duplicate rows that I am matching on? When I use the code above to merge the result is duplicate rows with both prices. I considered removing duplicates in my merged df2 ones but my original df already contains some duplicate rows that should not be removed.
My second dataframe is thousands of rows long so I will not be able to remove them and I will also need to keep them in case they are needed further.
Thank you
CodePudding user response:
Filter df1
to keep the lowest price during the merge:
cols = ['Origin','Destination','Service']
df2 = pd.merge(df,
df1.loc[df1.groupby(cols)['Price'].idxmin()],
how='left',
on=cols,
)
NB. I simplified the merge
command here as you are using the same columns in both inputs.
Output:
Origin Destination Service Goal Date Price
0 London Paris Express 12 5/6 55
1 London Paris Express 12 3/4 55
2 London Paris Express 12 5/6 55
3 Geneva Vienna Standard 3 1/2 110
4 Amsterdam Berlin Express 8 7/8 78
5 Amsterdam Berlin Express 8 8/9 78
6 Mardid Barcelona Standard 10 9/10 60
CodePudding user response:
Use drop_duplicates
with keep="first"
(keep is "first" by default) on df1
.
before drop_duplicates
, sort by price.
cols = ["Origin", "Destination", "Service"]
df2 = pd.merge(
df,
df1.sort_values(cols ["Price"]).drop_duplicates(cols),
how="left",
on=cols,
)