Pandas merge duplicate rows-CodePudding

I am trying to merge two dataframes on Origin, Destination and Service. The first df is

df = {'Origin': ['London','London','London','Geneva','Amsterdam','Amsterdam','Mardid'],
      'Destination': ['Paris','Paris','Paris','Vienna','Berlin','Berlin','Barcelona'],
      'Service': ['Express','Express','Express','Standard', 'Express', 'Express','Standard'],
      'Goal': [12,12,12,3,8,8,10],
      'Date' : ['5/6','3/4','5/6','1/2','7/8','8/9','9/10']}
df = pd.DataFrame(data=df)

The second df is

df1 = {'Origin': ['London','London','Geneva','Amsterdam','Mardid'],
      'Destination': ['Paris','Paris','Vienna','Berlin','Barcelona'],
      'Service': ['Express','Express','Standard','Express','Standard'],
      'Price': [55,65,110,78,60]}
df1 = pd.DataFrame(data=df1)

I am using the following code to merge the two:

df2 = pd.merge(df,df1, how='left',
              left_on=['Origin','Destination','Service'],
              right_on=['Origin','Destination','Service'])

Is there a way to merge the two dataframes but only use the lower price from df2 if there are any duplicate rows that I am matching on? When I use the code above to merge the result is duplicate rows with both prices. I considered removing duplicates in my merged df2 ones but my original df already contains some duplicate rows that should not be removed.

My second dataframe is thousands of rows long so I will not be able to remove them and I will also need to keep them in case they are needed further.

Thank you

CodePudding user response：

Filter df1 to keep the lowest price during the merge:

cols = ['Origin','Destination','Service']

df2 = pd.merge(df,
               df1.loc[df1.groupby(cols)['Price'].idxmin()],
               how='left',
               on=cols,
              )

NB. I simplified the merge command here as you are using the same columns in both inputs.

Output:

      Origin Destination   Service  Goal  Date  Price
0     London       Paris   Express    12   5/6     55
1     London       Paris   Express    12   3/4     55
2     London       Paris   Express    12   5/6     55
3     Geneva      Vienna  Standard     3   1/2    110
4  Amsterdam      Berlin   Express     8   7/8     78
5  Amsterdam      Berlin   Express     8   8/9     78
6     Mardid   Barcelona  Standard    10  9/10     60

CodePudding user response：

Use drop_duplicates with keep="first" (keep is "first" by default) on df1.

before drop_duplicates, sort by price.

cols = ["Origin", "Destination", "Service"]

df2 = pd.merge(
    df,
    df1.sort_values(cols   ["Price"]).drop_duplicates(cols),
    how="left",
    on=cols,
)