Get max row of sums in group with pandas in python?-CodePudding

I have a dataframe (df) like this:

(not real timestamps)

                           group system  mb_used
2022-04-13 08:42:42.465237     A  sys_1      353
2022-04-13 08:42:42.465241     A  sys_2      122
2022-04-13 08:42:42.465241     A  sys_1      965
2022-04-13 08:42:42.465242     A  sys_1      412
2022-04-13 08:42:42.465243     A  sys_2      986
2022-04-13 08:42:42.465244     B  sys_3      513
2022-04-13 08:42:42.465244     B  sys_3      843
2022-04-13 08:42:42.465245     B  sys_3     1103
2022-04-13 08:42:42.465245     B  sys_3      266
2022-04-13 08:42:42.465246     B  sys_4      983
...

It keeps track of mb used by sql queries of that system.

One row equals one query.
Each system belongs to a group.
there is no fixed amount of systems per group etc.

What I'm trying to get is the sum of all mb_used per system per group, but then only the max value per group. And all of that on a per day basis.

In my head that would be:

group by day, group and system
get sum mb_used
group again by day and system
get max mb_used

getting the sum is easy:

sum_df = (
    df
    .groupby([pd.Grouper(freq="1D"), pd.Grouper('group'), pd.Grouper('system')])
    .mb_used
    .sum()
    .round(1)
    .fillna(0)
    .to_frame()
    .reset_index("group")
    .reset_index("system")
)

           system group  mb_used
2022-04-13  sys_1     A     1730
2022-04-13  sys_2     A     1108
2022-04-13  sys_3     B     2725
2022-04-13  sys_4     B      983

what I'm struggeling with is trying to get only the row with the max value per group. I tried the following (and other things) without luck:

table = (
    sum_df
    .groupby([pd.Grouper(freq="1D"), "system"])
    .mb_used
    .max()
    .unstack()
    .fillna(0)
    .tail(14)
    .sort_index(ascending=False)
    .assign(day=lambda x: x.index.strftime("%d-%m"))
    .set_index('day')
    .transpose()
)

table_sorted = (
    table
    .sort_values(table.columns[0], ascending=False)
    .head(20)

)

day     13-04   12-04   11-04
system       
sys_3    2725     ...     ...
sys_1    1730     ...     ...
sys_2    1108     ...     ...
sys_4     983     ...     ...
...

What I'm trying to get would be this:

         13-04   12-04   11-04
sys_1     1730     ...     ...
sys_3     2725     ...     ...
...

I'm rather new to pandas so any help is welcome, thanks!

CodePudding user response：

First use your solution with removed pd.Grouper and convert all levels to columns by reset_index:

sum_df = (
    df.rename_axis('date')
    .groupby([pd.Grouper(freq="1D"), 'group', 'system'])
    .mb_used
    .sum()
    .round(1)
    .reset_index()
)
print (sum_df)
        date group system  mb_used
0 2022-04-13     A  sys_1     1730
1 2022-04-13     A  sys_2     1108
2 2022-04-13     B  sys_3     2725
3 2022-04-13     B  sys_4      983

Then get rows by maximal mb_used per groups group by DataFrameGroupBy.idxmax:

df = sum_df.loc[sum_df.groupby('group')['mb_used'].idxmax()]
print (df)
        date group system  mb_used
0 2022-04-13     A  sys_1     1730
2 2022-04-13     B  sys_3     272

and last create helper column with DataFrame.pivot_table:

table_sorted = (df.assign(day=lambda x: x['date'].dt.strftime("%d-%m"))
                      .pivot_table(index='system', 
                                   columns='day', 
                                   values='mb_used', 
                                   aggfunc='max'))
print (table_sorted)
day     13-04
system       
sys_1    1730
sys_3    2725