Home > Mobile >  How to group by batch of rows in pandas?
How to group by batch of rows in pandas?

Time:06-10

A process records a batch of data (e.g. batch_size = 27) from multiple iterations (e.g. n_iterations = 10) and flattens them into a CSV file. In this case, the file would have 270 rows. I need to group/aggregate the data by batch to be able to calculate metrics like average, median and standard deviation on the Duration field. The entries in the Name field repeat and have the same order within each batch, so df[0]["Name"] == df[batch_size]["Name"], but might not be unique within each batch.

    Name    Duration
0   A       9.664
1   B       25.184
2   A       10.656
3   C       9.408
4   D       8.608
... ... ...
265 A       9.920
266 D       4.064
267 A       11.616
268 B       13.664
269 A       9.440

The final result should be a dataframe with batch_size rows that contains all Durations as columns Duration_{i}, or as a list [Duration_0, Duration_1, ...], or directly the statistical metrics Duration_mean, Duration_median, ... as columns. Any format is acceptable.

    Name    Duration_0  Duration_1  ...
0   A       9.664       7.221       ...
1   B       25.184      24.110      ...
... ...     ...         ...         ...
25  B       13.664      12.853      ...
26  A       9.440       8.502       ...

or

    Name    Duration
0   A       [9.664, 7.221, ...]
1   B       [25.184,  24.110, ...
... ...     ...
25  B       [13.664, 12.853, ...]
26  A       [9.440, 8.502, ...]

or

    Name    Duration_mean   Duration_median ...
0   A       9.552           8.997           ...
1   B       24.444          24.321          ...
...
25  B       13.521          13.415          ...
26  A       9.440           9.502           ...

How can this be done with Pandas?

EDIT: Simpler data for verification with batch_size=3 and n_iterations=2:

Input:

    Name    Duration
0   A       1           - 
1   B       2            | batch_0
2   A       2           - 
3   A       5           - 
4   B       2            | batch_1
5   A       4           - 

Output:

    Name    mean
0   A       3   // from (1 5)/2
1   B       2   // from (2 2)/2
2   A       5   // from (0 10)/2

CodePudding user response:

If need aggregate per n_batch rows with first value of Name per groups and Duration aggregate by mean and median use:

n_batch = 3
g = df.index % n_batch
df = df.groupby(g).agg(Neme=('Name','first'),
                        Duration_mean=('Duration','mean'),
                        Duration_median=('Duration','median') )

print (df)
  Neme  Duration_mean  Duration_median
0    A              3                3
1    B              2                2
2    A              3                3
  • Related