I have a large DataFrame with multiple cols
, one of which has ID
of string
, and the others are of float
example of the dataframe:
df = pd.DataFrame({'ID': ['Child', 'Child', 'Child', 'Child', 'Baby', 'Baby', 'Baby', 'Baby'],
'income': [40000, 50000, 42000, 300, 2000, 4000, 2000, 3000],
'Height': [1.3, 1.5, 1.9, 2.0, 2.3, 1.4, 0.9, 0.8]})
What I want to do is a combination of calculating the average of every n
rows of all cols
, inside every ID
group.
desired output:
steps = 3
df = pd.DataFrame({'ID': ['Child', 'Child', 'Baby', 'Baby'],
'income': [44000, 300, 2600 , 3000],
'Height': [1.567, 2.0, 1.533, 0.8],
'Values': [3, 1, 3, 1]})
Where the rows are first grouped by ID
and then the mean is taken over every 3 values in the same group. I added Values
such that i can track how many rows are taken for that row's average of all cols
.
I have found similar questions but I cannot seem to combine them to solve my problem:
This question gives averages of n
rows.
[This question2 covers pd.cut
which I might need as well, I just dont understand how the bins work.
How can I make this happen?
CodePudding user response:
You can use a double groupby:
# set up secondary grouper
group = df.groupby('ID').cumcount().floordiv(steps)
# groupy agg
(df.groupby(['ID', group], as_index=False, sort=False)
.agg(**{'income': ('income', 'mean'),
'Height': ('Height', 'mean'),
'Values': ('Height', 'count'),
})
)
output:
ID income Height Values
0 Child 44000.000000 1.566667 3
1 Child 300.000000 2.000000 1
2 Baby 2666.666667 1.533333 3
3 Baby 3000.000000 0.800000 1