A process records a batch of data (e.g. batch_size = 27) from multiple iterations (e.g. n_iterations = 10) and flattens them into a CSV file. In this case, the file would have 270 rows. I need to group/aggregate the data by batch to be able to calculate metrics like average, median and standard deviation on the Duration field. The entries in the Name field repeat in the same order within each batch, so df[0]["Name"] == df[batch_size]["Name"], but they are not necessarily unique within a batch.
Name Duration
0 A 9.664
1 B 25.184
2 A 10.656
3 C 9.408
4 D 8.608
... ... ...
265 A 9.920
266 D 4.064
267 A 11.616
268 B 13.664
269 A 9.440
The final result should be a dataframe with batch_size rows that contains all Durations as columns Duration_{i}, or as a list [Duration_0, Duration_1, ...], or directly the statistical metrics Duration_mean, Duration_median, ... as columns. Any format is acceptable.
Name Duration_0 Duration_1 ...
0 A 9.664 7.221 ...
1 B 25.184 24.110 ...
... ... ... ... ...
25 B 13.664 12.853 ...
26 A 9.440 8.502 ...
or
Name Duration
0 A [9.664, 7.221, ...]
1 B [25.184, 24.110, ...]
... ... ...
25 B [13.664, 12.853, ...]
26 A [9.440, 8.502, ...]
or
Name Duration_mean Duration_median ...
0 A 9.552 8.997 ...
1 B 24.444 24.321 ...
...
25 B 13.521 13.415 ...
26 A 9.440 9.502 ...
How can this be done with Pandas?
EDIT: Simpler data for verification with batch_size=3 and n_iterations=2:
Input:
Name Duration
0 A 1 -
1 B 2 | batch_0
2 A 2 -
3 A 5 -
4 B 2 | batch_1
5 A 4 -
Output:
Name mean
0 A 3 // from (1 5)/2
1 B 2 // from (2 2)/2
2 A 3 // from (2 4)/2
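The simplified data above can be reconstructed as follows; the key idea is that row i and row i + batch_size occupy the same position within their respective batches, so a modulo over the index gives the grouping key (a minimal sketch, values taken from the edit):

```python
import pandas as pd

# Simplified verification data: 2 batches of 3 rows each
batch_size = 3
df = pd.DataFrame({
    "Name": ["A", "B", "A", "A", "B", "A"],
    "Duration": [1, 2, 2, 5, 2, 4],
})

# Position of each row within its batch: rows i and i + batch_size
# get the same key, so they end up in the same group
key = df.index % batch_size
print(key.tolist())  # [0, 1, 2, 0, 1, 2]
```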
CodePudding user response:
If you need to aggregate per n_batch rows, taking the first value of Name per group and aggregating Duration by mean and median, use:
n_batch = 3

# Group rows by their position within each batch
g = df.index % n_batch

df = df.groupby(g).agg(Name=('Name', 'first'),
                       Duration_mean=('Duration', 'mean'),
                       Duration_median=('Duration', 'median'))
print(df)
  Name  Duration_mean  Duration_median
0    A              3                3
1    B              2                2
2    A              3                3
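The question also mentions two other acceptable formats. The same grouping key covers both; a sketch using the simplified data from the edit (the column names Duration_{i} and the helper columns pos/it are illustrative choices, not required by any API):

```python
import pandas as pd

batch_size = 3
df = pd.DataFrame({
    "Name": ["A", "B", "A", "A", "B", "A"],
    "Duration": [1, 2, 2, 5, 2, 4],
})
g = df.index % batch_size

# Format with lists: collect all Durations for a position into one cell
lists = df.groupby(g).agg(Name=("Name", "first"),
                          Duration=("Duration", list))

# Format with one Duration_{i} column per iteration, via a pivot:
# pos = position within the batch, it = iteration (batch) number
wide = (df.assign(pos=g, it=df.index // batch_size)
          .pivot(index="pos", columns="it", values="Duration")
          .add_prefix("Duration_"))
wide.insert(0, "Name", df.groupby(g)["Name"].first())

print(lists)
print(wide)
```

From the wide format, the statistics can still be computed afterwards, e.g. `wide.filter(like="Duration_").mean(axis=1)`.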