How to calculate max values in a dataframe column while removing duplicates in another column?-CodePudding

I have a dataset containing hourly temperatures for a year. So, I have 24 entries for each day (temp for every hour) and I want to find out the 5 days with highest temp. I am aware of nlargest() function to find out 5 max values but those values happen to be on a single day only. How do I find out the 5 max values but on different days?

I tried using nlargest() and .loc() but could not find the solution. Please help.

I have attached what the dataset looks like.

CodePudding user response：

You might want to get the max per group with groupby.max then find the top 5 with nlargest

df.groupby(['year','month','day'])['temp'].max().nlargest(5)

CodePudding user response：

You can use groupby

An example would be

max_5_temps = df.groupby('date_column')['temperature_column'].max().nlargest(5)

CodePudding user response：

Use GroupBy.max for rows per days by maximal temperature with Series.nlargest:

df1 = df.groupby(['year','month','day'])['temp'].max().nlargest(5).reset_index(name='top5')

If need also another rows use DataFrameGroupBy.idxmax with DataFrame.nlargest:

df2 = df.loc[df.groupby(['year','month','day'])['temp'].idxmax()].nlargest(5, 'temp')