Take the sum of every N rows per group in a pandas DataFrame


Just to note: this question is not a duplicate of Take the sum of every N rows in pandas series.

My problem is slightly different: I want to aggregate every N rows per group. My current code looks like this:

import pandas as pd

df = pd.DataFrame({'ID':['AA','AA','AA','BB','BB','BB'], 
                   'DATE':['2021-01-01','2021-01-03','2021-01-08','2021-03-04','2021-03-06','2021-03-08'],
                   'VALUE':[10,15,25,40,60,90]})

df['DATE'] = pd.to_datetime(df['DATE'])
df = df.sort_values(by=['ID','DATE'])
df.head(10)

Sample DataFrame:

| ID | DATE       | VALUE |
|----|------------|-------|
| AA | 2021-01-01 |    10 |
| AA | 2021-01-03 |    15 |
| AA | 2021-01-08 |    25 |
| BB | 2021-03-04 |    40 |
| BB | 2021-03-06 |    60 |
| BB | 2021-03-08 |    90 |

Based on that post, I apply this aggregation:

# Aggregate every 2 rows; chunk labels come from the positional DataFrame index
df.groupby(['ID', df.index//2]).agg({'VALUE':'mean', 'DATE':'median'}).reset_index()

I get this:

| ID | VALUE | DATE       |
|----|-------|------------|
| AA | 12.5  | 2021-01-02 |
| AA | 25    | 2021-01-08 |
| BB | 40    | 2021-03-04 |
| BB | 75    | 2021-03-07 |

But I want this:

| ID | VALUE | DATE       |
|----|-------|------------|
| AA | 12.5  | 2021-01-02 |
| AA | 25    | 2021-01-08 |
| BB | 50    | 2021-03-05 |
| BB | 90    | 2021-03-08 |

It seems the index-based chunking does not work when my groups are not aligned to the chunk size: a chunk boundary spills into the next group and distorts the aggregations. Any suggestions? Note that my dates can be completely irregular. A quick check below (my own illustration, assuming the default 0..5 RangeIndex of the sample frame) prints the chunk label that df.index//2 assigns to each row; because 'BB' starts at an odd row position, its first row forms a one-row chunk instead of pairing with the next row:
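# Quick check: chunk labels produced by the positional approach
# (assumes the sample frame above with its default 0..5 RangeIndex)
print(df.assign(CHUNK=df.index // 2)[['ID', 'VALUE', 'CHUNK']])
#    ID  VALUE  CHUNK
# 0  AA     10      0
# 1  AA     15      0
# 2  AA     25      1
# 3  BB     40      1   <- 'BB' starts mid-chunk, so this row sits alone
# 4  BB     60      2
# 5  BB     90      2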

CodePudding user response:

You can use groupby.cumcount to form a subgroup:

N = 2
# per-ID row counter, divided by N -> chunk label that restarts for each ID
group = df.groupby('ID').cumcount() // N

out = (df.groupby(['ID', group])
         .agg({'VALUE': 'mean', 'DATE': 'median'})
         .droplevel(1)     # drop the helper chunk level
         .reset_index()
      )

output:

   ID  VALUE       DATE
0  AA   12.5 2021-01-02
1  AA   25.0 2021-01-08
2  BB   50.0 2021-03-05
3  BB   90.0 2021-03-08
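To see why this works, here is a small sketch (my own check, not part of the original answer) printing the intermediate counter: cumcount restarts at 0 within each ID, so dividing by N yields chunk labels that never straddle two IDs:

# cumcount restarts per ID, so the chunk labels stay inside each group
print(df.assign(SUBGROUP=df.groupby('ID').cumcount() // N)[['ID', 'DATE', 'SUBGROUP']])
#    ID       DATE  SUBGROUP
# 0  AA 2021-01-01         0
# 1  AA 2021-01-03         0
# 2  AA 2021-01-08         1
# 3  BB 2021-03-04         0   <- counter restarts here
# 4  BB 2021-03-06         0
# 5  BB 2021-03-08         1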