Pandas groupby, bin and average-CodePudding

I've been trying to groupby and the bin from the values of each group and get the average but I can't seem to find a straight way to do it.

dataframe:

code1 code2 code3 day amount
abc1   xyz1   123   1   25
abc1   xyz1   123   2   5
abc1   xyz1   123   3   15
 .      .      .    .   .
 .      .      .    .   .
abc1   xyz1   123   20  10
abc2   xyz1   456   1   4
 .      .      .    .   .
 .      .      .    .   .
abc10  xyz5   890   21  5

There are 3 different codes that I want to groupby and then bin and get the average of that bin to have a dataframe like this:

code1 code2 code3  day    amount
abc1   xyz1   123  [1-3]    15
abc1   xyz1   123  [4-6]    13
abc1   xyz1   123  [7-9]    17
 .      .      .    .       .
 .      .      .    .       .
abc10  xyz5   890  [19-21]  18

I have try: df(['code1', 'code2', 'code3'])[day].apply(pd.cut, bins=7) but not giving me the results i want plus I still need the average of the bin.

Edit note: not all groups are the same size and not all groups are equally distributed in number of days, such that some groups are ending in 20, for example, breaking the ability to straight up divide by a number. Perhaps this can be solved only by grouping and then loop through each group to create the bins.

Any help is appreciated.

CodePudding user response：

Try groupby with agg:

df.groupby(df.index // 3).agg({k: ('last' if k != 'day' else lambda x: f'[{min(x)}-{max(x)}]') for k in df.columns})

Or better with just specifying column names:

df.groupby(df.index // 3).agg({'code1': 'last', 'code2': 'last', 'code3': 'last', 'day': lambda x: f'[{min(x)}-{max(x)}]', 'amount': 'last'})

CodePudding user response：

A good thing to do to generate your bins in each group is to groupby.transform:

>>> binned_days = df.groupby(['code1', 'code2', 'code3'])['day'].transform(pd.cut, bins=7, precision=0, right=False)
>>> binned_days
0          [1.0, 4.0)
1          [1.0, 4.0)
2          [1.0, 4.0)
5        [17.0, 20.0)
6    [0.9999, 1.0001)
9    [20.997, 21.003)
Name: day, dtype: interval

The notation is slightly different than your bins, with [1.0, 4.0) instead of [1, 3] − but the meaning is the same. In fact as the datatype are intervals it’s easy to convert, even though you should use .apply while the .interval accessor does not exist yet:

>>> binned_days = binned_days.apply(lambda iv: pd.Interval(int(iv.left), int(iv.right), closed='both'))
>>> binned_days
0      [1, 4]
1      [1, 4]
2      [1, 4]
5    [17, 20]
6      [0, 1]
9    [20, 21]
Name: day, dtype: interval

Now we can use the columns and these day definitions to compute the mean:

>>> df.groupby(['code1', 'code2', 'code3', binned_days])[['amount']].mean().reset_index()
   code1 code2 code3       day  amount
0   abc1  xyz1   123    [1, 4]    15.0
1   abc1  xyz1   123  [17, 20]    10.0
2  abc10  xyz5   890  [20, 21]     5.0
3   abc2  xyz1   456    [0, 1]     4.0