I've been trying to groupby and the bin from the values of each group and get the average but I can't seem to find a straight way to do it.
dataframe:
code1 code2 code3 day amount
abc1 xyz1 123 1 25
abc1 xyz1 123 2 5
abc1 xyz1 123 3 15
. . . . .
. . . . .
abc1 xyz1 123 20 10
abc2 xyz1 456 1 4
. . . . .
. . . . .
abc10 xyz5 890 21 5
There are 3 different codes that I want to groupby and then bin and get the average of that bin to have a dataframe like this:
code1 code2 code3 day amount
abc1 xyz1 123 [1-3] 15
abc1 xyz1 123 [4-6] 13
abc1 xyz1 123 [7-9] 17
. . . . .
. . . . .
abc10 xyz5 890 [19-21] 18
I have try:
df(['code1', 'code2', 'code3'])[day].apply(pd.cut, bins=7)
but not giving me the results i want plus I still need the average of the bin.
Edit note: not all groups are the same size and not all groups are equally distributed in number of days, such that some groups are ending in 20, for example, breaking the ability to straight up divide by a number. Perhaps this can be solved only by grouping and then loop through each group to create the bins.
Any help is appreciated.
CodePudding user response:
Try groupby
with agg
:
df.groupby(df.index // 3).agg({k: ('last' if k != 'day' else lambda x: f'[{min(x)}-{max(x)}]') for k in df.columns})
Or better with just specifying column names:
df.groupby(df.index // 3).agg({'code1': 'last', 'code2': 'last', 'code3': 'last', 'day': lambda x: f'[{min(x)}-{max(x)}]', 'amount': 'last'})
CodePudding user response:
A good thing to do to generate your bins in each group is to groupby.transform
:
>>> binned_days = df.groupby(['code1', 'code2', 'code3'])['day'].transform(pd.cut, bins=7, precision=0, right=False)
>>> binned_days
0 [1.0, 4.0)
1 [1.0, 4.0)
2 [1.0, 4.0)
5 [17.0, 20.0)
6 [0.9999, 1.0001)
9 [20.997, 21.003)
Name: day, dtype: interval
The notation is slightly different than your bins, with [1.0, 4.0)
instead of [1, 3]
− but the meaning is the same. In fact as the datatype are intervals it’s easy to convert, even though you should use .apply
while the .interval
accessor does not exist yet:
>>> binned_days = binned_days.apply(lambda iv: pd.Interval(int(iv.left), int(iv.right), closed='both'))
>>> binned_days
0 [1, 4]
1 [1, 4]
2 [1, 4]
5 [17, 20]
6 [0, 1]
9 [20, 21]
Name: day, dtype: interval
Now we can use the columns and these day definitions to compute the mean:
>>> df.groupby(['code1', 'code2', 'code3', binned_days])[['amount']].mean().reset_index()
code1 code2 code3 day amount
0 abc1 xyz1 123 [1, 4] 15.0
1 abc1 xyz1 123 [17, 20] 10.0
2 abc10 xyz5 890 [20, 21] 5.0
3 abc2 xyz1 456 [0, 1] 4.0