Home > Enterprise >  Iterating over rows and groups in dataframe
Iterating over rows and groups in dataframe

Time:01-09

Suppose I have following data frame:

d = {'Date': ['2020-1-1', '2020-1-2', '2020-1-3', '2020-1-1', '2020-1-2', 
              '2020-1-3','2020-1-1', '2020-1-2', '2020-1-3'], 'col2': ['A','A','A', 'B',
                                                    'B','B', 'C','C','C'],
     'col3':[0.01,0.02,0.03,0.02,0.03,0.04,0.05,0.1,0.01]}
d = pd.DataFrame(data=d)
d['Date'] = pd.to_datetime(d['Date'])
d

and get:

    Date      col2  col3
0   2020-01-01  A   0.01
1   2020-01-02  A   0.02
2   2020-01-03  A   0.03
3   2020-01-01  B   0.02
4   2020-01-02  B   0.03
5   2020-01-03  B   0.04
6   2020-01-01  C   0.05
7   2020-01-02  C   0.10
8   2020-01-03  C   0.01

How could I iterate over rows, so that I get for each Date the 2 highest values of col3 and group from col2? So for example I should get:

2020-01-01  B   0.02
2020-01-01  C   0.05
2020-01-02  B   0.03
2020-01-02  C   0.10
2020-01-03  A   0.03
2020-01-03  B   0.04

And at the end to sum for each day col 3:

Date        sum
2020-01-01  0.07
2020-01-02  0.13
2020-01-03  0.07

But of course real example has much more groups (col2) and more dates.

CodePudding user response:

I think this:

d[['Date', 'col3']].groupby('Date')['col3'].nlargest(2).groupby('Date').sum()

CodePudding user response:

grouped_df = d.groupby("Date")
results = []
for name, group in grouped_df:
    group.sort_values(by="col3", ascending=False, inplace=True)
    top_2 = group.nlargest(2, "col3")
    top_2_sum = top_2["col3"].sum()
    results.append((name, top_2_sum))

sum_df = pd.DataFrame(results, columns=["Date", "Sum"])
print(sum_df)

Output:

        Date   Sum
0 2020-01-01  0.07
1 2020-01-02  0.13
2 2020-01-03  0.07
  • Related