How do i simultaneously groupby and mean of n rows?-CodePudding

I have a large DataFrame with multiple cols, one of which has ID of string, and the others are of float

example of the dataframe:

df = pd.DataFrame({'ID': ['Child', 'Child', 'Child', 'Child', 'Baby', 'Baby', 'Baby', 'Baby'],
'income': [40000, 50000, 42000, 300, 2000, 4000, 2000, 3000],
'Height': [1.3, 1.5, 1.9, 2.0, 2.3, 1.4, 0.9, 0.8]})

What I want to do is a combination of calculating the average of every n rows of all cols, inside every ID group. desired output:

steps = 3
df = pd.DataFrame({'ID': ['Child', 'Child', 'Baby', 'Baby'],
'income': [44000, 300, 2600 , 3000],
'Height': [1.567, 2.0, 1.533, 0.8],
'Values': [3, 1, 3, 1]})

Where the rows are first grouped by ID and then the mean is taken over every 3 values in the same group. I added Values such that i can track how many rows are taken for that row's average of all cols.

I have found similar questions but I cannot seem to combine them to solve my problem: This question gives averages of n rows. [This question2 covers pd.cut which I might need as well, I just dont understand how the bins work.

How can I make this happen?

CodePudding user response：

You can use a double groupby:

# set up secondary grouper
group = df.groupby('ID').cumcount().floordiv(steps)

# groupy agg
(df.groupby(['ID', group], as_index=False, sort=False)
   .agg(**{'income': ('income', 'mean'),
           'Height': ('Height', 'mean'),
           'Values': ('Height', 'count'),
          })
)

output:

      ID        income    Height  Values
0  Child  44000.000000  1.566667       3
1  Child    300.000000  2.000000       1
2   Baby   2666.666667  1.533333       3
3   Baby   3000.000000  0.800000       1