groupby with totals/subtotals-CodePudding

Say I have the following dataframe

Strategy    AssetClass    Symbol         Value    Indicator
Strat1      OPT           OPT_ABC1       50       -0.3
Strat1      OPT           OPT_ABC2       50       1.5
Strat1      STK           STK_ABC        50       2.7
Strat2      STK           STK_XYZ        70       -3.8
Strat3      OPT           OPT_MNO        25       10

I would like to produce the following:

Strategy    AssetClass    Symbol    Value    Indicator
Strat1                                       3.9
            OPT                              1.2
                          OPT_ABC1  50       -0.3
                          OPT_ABC2  50       1.5
            STK                              2.7
                          STK_ABC   50       2.7
Strat2                                       -3.8
            STK                              -3.8
                          STK_XYZ   70       -3.8
Strat3                                       10
            OPT                              10
                          OPT_MNO   25       10

So the idea is to rearrange the data with totals per Strategy, then AssetClass and then per Symbol. The column "Value" is available at the Symbol level, while the column "Indicator" is the sum of the subgroup.

I thought of using pd.pivot_table but it doesn't seem to produce the totals/sub_totals I am looking for. I think I should use/loop over pd.groupby on Strategy and then loop over another groupby on Strategy/AssetClass and then loop over a groupby on Strategy/AssetClass/Symbol

With df being the dataframe above, I did this:

container = []
for label, _df in df.groupby(['Strategy', 'AssetClass', 'Symbol']):
    _df.loc[f'{label}'] = _df[['Indicator']].sum()
    container.append(_df)

df_res = pd.concat(container)
print(df_res.fillna(''))

My problem is that the subtotal is inserted after the corresponding rows and the label is used as index. Besides I can't figure out an easy/pythonic way of adding the other lopps(ie subtotals)

CodePudding user response：

You can aggregate by different columns, so for performance is better not use nested groupby.apply but rather multple aggregation, last join them togehether by concat, change order of columns by DataFrame.reindex and last sorting per first 2 columns:

df1 = df.groupby(['Strategy', 'AssetClass', 'Symbol'], as_index=False).sum()

df2 = (df1.groupby(['Strategy', 'AssetClass'], as_index=False)['Indicator'].sum()
          .assign(Symbol = ''))

df3 = (df1.groupby('Strategy', as_index=False)['Indicator'].sum()
          .assign(AssetClass = ''))

df = (pd.concat([df3, df2, df1])
        .reindex(df.columns, axis=1)
        .fillna('')
        .sort_values(['Strategy','AssetClass'], ignore_index=True))
print (df)
   Strategy AssetClass    Symbol Value  Indicator
0    Strat1                                   3.9
1    Strat1        OPT                        1.2
2    Strat1        OPT  OPT_ABC1  50.0       -0.3
3    Strat1        OPT  OPT_ABC2  50.0        1.5
4    Strat1        STK                        2.7
5    Strat1        STK   STK_ABC  50.0        2.7
6    Strat2                                  -3.8
7    Strat2        STK                       -3.8
8    Strat2        STK   STK_XYZ  70.0       -3.8
9    Strat3                                  10.0
10   Strat3        OPT                       10.0
11   Strat3        OPT   OPT_MNO  25.0       10.0