I have a dataframe, df, where I would like to group and sum every 3 rows per group in Python
Data
location date one two three
aa Q4 16 1.444444 4.111111 2.444444
aa Q4 16 1.611111 5.111111 2.777777
aa Q4 16 2.111111 4.222222 2.999999
aa Q4 23 2.444444 6.111111 2.444444
aa Q4 23 3.611111 5.111111 2.777777
aa Q4 23 1.111111 8.222222 1.999999
bb Q1 24 0.111111 1.111111 1.111111
bb Q1 24 2.111111 2.111111 2.111111
bb Q1 24 3.122222 2.222222 1.999999
Desired
group and sum by every 3 rows, making sure each location is grouped
location date one two three
aa Q4 16 5.166666 13.44444 8.22222
aa Q4 23 7.61111 19.55556 7.22222
bb Q1 24 5.344444 5.444444 5.222221
Doing
I am setting the number I wish to group by and then
N = 3
df.groupby(df.index // N).sum()
CodePudding user response:
In this case, we can use groupby agg
and declare each transformation ('first' for values we want to take the first instance of and 'sum' for the remaining):
N = 3
res_df = df.groupby(df.index // N).agg({
'location': 'first',
'date': 'first',
'one': 'sum',
'two': 'sum',
'three': 'sum'
})
The dictionary of aggregations can also be built programmatically if needing many columns summed:
N = 3
first_cols = ['location', 'date']
res_df = df.groupby(df.index // N).agg({
# Cols not to sum
**{k: 'first' for k in first_cols},
# Sum all other cols
**{k: 'sum' for k in df.columns if k not in first_cols}
})
In any case res_df
is:
location date one two three
0 aa Q4 16 5.166666 13.444444 8.222220
1 aa Q4 23 7.166666 19.444444 7.222220
2 bb Q1 24 5.344444 5.444444 5.222221
Assuming we don't need every three rows, but just unique location
date
pairs. We can just groupby
and set sort=False
to ensure values appear in order of appearance:
res_df = df.groupby(['location', 'date'], as_index=False, sort=False).sum()
If we need every three rows within each group we can use groupby cumcount
and then floor divide which will split each group into sub-groups of N rows. droplevel
is needed here to remove the last index level from groupby
(which was added by creating the sub-groups):
N = 3
grp_cols = ['location', 'date']
res_df = df.groupby([
*grp_cols, df.groupby(grp_cols).cumcount() // N
], sort=False).sum().droplevel(-1).reset_index()
With this sample data these options also produce the same res_df
:
location date one two three
0 aa Q4 16 5.166666 13.444444 8.222220
1 aa Q4 23 7.166666 19.444444 7.222220
2 bb Q1 24 5.344444 5.444444 5.222221
Setup:
import pandas as pd
df = pd.DataFrame({
'location': ['aa', 'aa', 'aa', 'aa', 'aa', 'aa', 'bb', 'bb', 'bb'],
'date': ['Q4 16', 'Q4 16', 'Q4 16', 'Q4 23', 'Q4 23', 'Q4 23', 'Q1 24',
'Q1 24', 'Q1 24'],
'one': [1.444444, 1.611111, 2.111111, 2.444444, 3.611111, 1.111111,
0.111111, 2.111111, 3.122222],
'two': [4.111111, 5.111111, 4.222222, 6.111111, 5.111111, 8.222222,
1.111111, 2.111111, 2.222222],
'three': [2.444444, 2.777777, 2.999999, 2.444444, 2.777777, 1.999999,
1.111111, 2.111111, 1.999999]
})