I'm reading some Excel files and I want to calculate the mean of the column 'Duration', but I want to see how the mean changes after recording 1, 2, 3 and so on. For the table below that would be:

Mean1 = (3 + 3 + 3) / 3
Mean2 = (3 + 3 + 3 + 4 + 6) / 5
Mean3 = (3 + 3 + 3 + 4 + 6 + 4 + 3 + 1 + 4) / 9

Each file can have a different number of recordings (but always starting from 1) and a different number of durations for each recording.
Is it possible to write a Python script that does this automatically?
| Recording | Duration |
|---|---|
| 1 | 3 |
| 1 | 3 |
| 1 | 3 |
| 2 | 4 |
| 2 | 6 |
| 3 | 4 |
| 3 | 3 |
| 3 | 1 |
| 3 | 4 |
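For reference, the table above can be reproduced as a pandas DataFrame (assuming the data has already been read in; the literal values here just mirror the example):

```python
import pandas as pd

# Sample data matching the table in the question
df = pd.DataFrame({
    "Recording": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Duration":  [3, 3, 3, 4, 6, 4, 3, 1, 4],
})
```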
CodePudding user response:
IIUC, you could `groupby` "Recording" and take the ratio of the cumulated sum to the cumulated count (using `cumsum`):

```python
g = df.groupby('Recording')['Duration']
s = g.sum().cumsum() / g.count().cumsum()
```
output:

```
Recording
1    3.000000
2    3.800000
3    3.444444
Name: Duration, dtype: float64
```
intermediates:

```
>>> g.sum().cumsum()
Recording
1     9
2    19
3    31
Name: Duration, dtype: int64

>>> g.count().cumsum()
Recording
1    3
2    5
3    9
Name: Duration, dtype: int64
```
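The same running mean can also be sketched without the sum/count ratio, using `expanding().mean()` over the whole column and then keeping the last value per recording (the DataFrame construction here is just the question's sample data):

```python
import pandas as pd

df = pd.DataFrame({
    "Recording": [1, 1, 1, 2, 2, 3, 3, 3, 3],
    "Duration":  [3, 3, 3, 4, 6, 4, 3, 1, 4],
})

# Running mean over all rows seen so far...
running = df["Duration"].expanding().mean()
# ...then keep the last running mean within each recording
result = running.groupby(df["Recording"]).last()
```

This relies on the rows being sorted by `Recording`, which the question's data already is.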
CodePudding user response:
You can compute the average by definition: first get the `sum` and count (`size`) per Recording, take the cumulative sum of both with `DataFrame.cumsum`, and for the mean divide the `sum` column by the `size` column:

```python
df = df.groupby('Recording')['Duration'].agg(['sum','size']).cumsum().reset_index()
df['avg'] = df.pop('sum').div(df.pop('size'))
print(df)
```

```
   Recording       avg
0          1  3.000000
1          2  3.800000
2          3  3.444444
```
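To run this over several Excel files, as the question describes, either answer's logic can be wrapped in a small loop. This is a sketch only: the file names are hypothetical, and `pd.read_excel` needs an engine such as `openpyxl` installed.

```python
import os
import pandas as pd

def cumulative_means(df):
    """Cumulative mean of 'Duration' after each recording."""
    g = df.groupby("Recording")["Duration"]
    return g.sum().cumsum() / g.count().cumsum()

# Hypothetical file names; skip any that don't exist
for path in ["session1.xlsx", "session2.xlsx"]:
    if not os.path.exists(path):
        continue
    print(path)
    print(cumulative_means(pd.read_excel(path)))
```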