Suppose I have following data frame:
d = {'Date': ['2020-1-1', '2020-1-2', '2020-1-3', '2020-1-1', '2020-1-2',
'2020-1-3','2020-1-1', '2020-1-2', '2020-1-3'], 'col2': ['A','A','A', 'B',
'B','B', 'C','C','C'],
'col3':[0.01,0.02,0.03,0.02,0.03,0.04,0.05,0.1,0.01]}
d = pd.DataFrame(data=d)
d['Date'] = pd.to_datetime(d['Date'])
d
and get:
Date col2 col3
0 2020-01-01 A 0.01
1 2020-01-02 A 0.02
2 2020-01-03 A 0.03
3 2020-01-01 B 0.02
4 2020-01-02 B 0.03
5 2020-01-03 B 0.04
6 2020-01-01 C 0.05
7 2020-01-02 C 0.10
8 2020-01-03 C 0.01
How could I iterate over rows, so that I get for each Date
the 2 highest values of col3
and group from col2
?
So for example I should get:
2020-01-01 B 0.02
2020-01-01 C 0.05
2020-01-02 B 0.03
2020-01-02 C 0.10
2020-01-03 A 0.03
2020-01-03 B 0.04
And at the end to sum for each day col 3:
Date sum
2020-01-01 0.07
2020-01-02 0.13
2020-01-03 0.07
But of course real example has much more groups (col2
) and more dates.
CodePudding user response:
I think this:
d[['Date', 'col3']].groupby('Date')['col3'].nlargest(2).groupby('Date').sum()
CodePudding user response:
grouped_df = d.groupby("Date")
results = []
for name, group in grouped_df:
group.sort_values(by="col3", ascending=False, inplace=True)
top_2 = group.nlargest(2, "col3")
top_2_sum = top_2["col3"].sum()
results.append((name, top_2_sum))
sum_df = pd.DataFrame(results, columns=["Date", "Sum"])
print(sum_df)
Output:
Date Sum
0 2020-01-01 0.07
1 2020-01-02 0.13
2 2020-01-03 0.07