Just want to mention that this question is not a duplicate of this: Take the sum of every N rows in pandas series
My problem is a bit different as I want to calculate N rows per group. My current code looks like this:
import pandas as pd
df = pd.DataFrame({'ID':['AA','AA','AA','BB','BB','BB'],
'DATE':['2021-01-01','2021-01-03','2021-01-08','2021-03-04','2021-03-06','2021-03-08'],
'VALUE':[10,15,25,40,60,90]})
df['DATE'] = pd.to_datetime(df['DATE'])
df = df.sort_values(by=['ID','DATE'])
df.head(10)
Sample DataFrame:
---- ------------ -------
| ID | DATE | VALUE |
---- ------------ -------
| AA | 2021-01-01 | 10 |
---- ------------ -------
| AA | 2021-01-03 | 15 |
---- ------------ -------
| AA | 2021-01-08 | 25 |
---- ------------ -------
| BB | 2021-03-04 | 40 |
---- ------------ -------
| BB | 2021-03-06 | 60 |
---- ------------ -------
| BB | 2021-03-08 | 90 |
---- ------------ -------
I apply this preprocessing based on the post:
#Calculate result
df.groupby(['ID', df.index//2]).agg({'VALUE':'mean', 'DATE':'median'}).reset_index()
I get this:
---- ------- ------------
| ID | VALUE | DATE |
---- ------- ------------
| AA | 12.5 | 2021-01-02 |
---- ------- ------------
| AA | 25 | 2021-01-08 |
---- ------- ------------
| BB | 40 | 2021-03-04 |
---- ------- ------------
| BB | 75 | 2021-03-07 |
---- ------- ------------
But I want this:
---- ------- ------------
| ID | VALUE | DATE |
---- ------- ------------
| AA | 12.5 | 2021-01-02 |
---- ------- ------------
| AA | 25 | 2021-01-08 |
---- ------- ------------
| BB | 50 | 2021-03-05 |
---- ------- ------------
| BB | 90 | 2021-03-08 |
---- ------- ------------
It seems like pandas index does not work well when my groups are not perfectly aligned and it messes up the beginning of the next series and how aggregations happen. Any suggestions? My dates can be completely irregular by the way.
CodePudding user response:
You can use groupby.cumcount
to form a subgroup:
N = 2
group = df.groupby('ID').cumcount()//N
out = (df.groupby(['ID', group])
.agg({'VALUE':'mean', 'DATE':'median'})
.droplevel(1).reset_index()
)
output:
ID VALUE DATE
0 AA 12.5 2021-01-02
1 AA 25.0 2021-01-08
2 BB 50.0 2021-03-05
3 BB 90.0 2021-03-08