Group and sum every 3 rows per group in Python-CodePudding

I have a dataframe, df, where I would like to group and sum every 3 rows per group in Python

Data

location    date    one         two         three
aa          Q4 16   1.444444    4.111111    2.444444
aa          Q4 16   1.611111    5.111111    2.777777
aa          Q4 16   2.111111    4.222222    2.999999
aa          Q4 23   2.444444    6.111111    2.444444
aa          Q4 23   3.611111    5.111111    2.777777
aa          Q4 23   1.111111    8.222222    1.999999
bb          Q1 24   0.111111    1.111111    1.111111
bb          Q1 24   2.111111    2.111111    2.111111
bb          Q1 24   3.122222    2.222222    1.999999

Desired

group and sum by every 3 rows, making sure each location is grouped

location    date    one         two         three
aa          Q4 16   5.166666    13.44444    8.22222
aa          Q4 23   7.61111     19.55556    7.22222
bb          Q1 24   5.344444    5.444444    5.222221

Doing

I am setting the number I wish to group by and then

N = 3
df.groupby(df.index // N).sum()

CodePudding user response：

In this case, we can use groupby agg and declare each transformation ('first' for values we want to take the first instance of and 'sum' for the remaining):

N = 3
res_df = df.groupby(df.index // N).agg({
    'location': 'first',
    'date': 'first',
    'one': 'sum',
    'two': 'sum',
    'three': 'sum'
})

The dictionary of aggregations can also be built programmatically if needing many columns summed:

N = 3
first_cols = ['location', 'date']
res_df = df.groupby(df.index // N).agg({
    # Cols not to sum
    **{k: 'first' for k in first_cols},
    # Sum all other cols
    **{k: 'sum' for k in df.columns if k not in first_cols}
})

In any case res_df is:

  location   date       one        two     three
0       aa  Q4 16  5.166666  13.444444  8.222220
1       aa  Q4 23  7.166666  19.444444  7.222220
2       bb  Q1 24  5.344444   5.444444  5.222221

Assuming we don't need every three rows, but just unique location date pairs. We can just groupby and set sort=False to ensure values appear in order of appearance:

res_df = df.groupby(['location', 'date'], as_index=False, sort=False).sum()

If we need every three rows within each group we can use groupby cumcount and then floor divide which will split each group into sub-groups of N rows. droplevel is needed here to remove the last index level from groupby (which was added by creating the sub-groups):

N = 3
grp_cols = ['location', 'date']
res_df = df.groupby([
    *grp_cols, df.groupby(grp_cols).cumcount() // N
], sort=False).sum().droplevel(-1).reset_index()

With this sample data these options also produce the same res_df:

  location   date       one        two     three
0       aa  Q4 16  5.166666  13.444444  8.222220
1       aa  Q4 23  7.166666  19.444444  7.222220
2       bb  Q1 24  5.344444   5.444444  5.222221

Setup:

import pandas as pd

df = pd.DataFrame({
    'location': ['aa', 'aa', 'aa', 'aa', 'aa', 'aa', 'bb', 'bb', 'bb'],
    'date': ['Q4 16', 'Q4 16', 'Q4 16', 'Q4 23', 'Q4 23', 'Q4 23', 'Q1 24',
             'Q1 24', 'Q1 24'],
    'one': [1.444444, 1.611111, 2.111111, 2.444444, 3.611111, 1.111111,
            0.111111, 2.111111, 3.122222],
    'two': [4.111111, 5.111111, 4.222222, 6.111111, 5.111111, 8.222222,
            1.111111, 2.111111, 2.222222],
    'three': [2.444444, 2.777777, 2.999999, 2.444444, 2.777777, 1.999999,
              1.111111, 2.111111, 1.999999]
})