Home > Net >  Get max row of sums in group with pandas in python?
Get max row of sums in group with pandas in python?

Time:04-13

I have a dataframe (df) like this:

(not real timestamps)

                           group system  mb_used
2022-04-13 08:42:42.465237     A  sys_1      353
2022-04-13 08:42:42.465241     A  sys_2      122
2022-04-13 08:42:42.465241     A  sys_1      965
2022-04-13 08:42:42.465242     A  sys_1      412
2022-04-13 08:42:42.465243     A  sys_2      986
2022-04-13 08:42:42.465244     B  sys_3      513
2022-04-13 08:42:42.465244     B  sys_3      843
2022-04-13 08:42:42.465245     B  sys_3     1103
2022-04-13 08:42:42.465245     B  sys_3      266
2022-04-13 08:42:42.465246     B  sys_4      983
...

It keeps track of mb used by sql queries of that system.

  • One row equals one query.
  • Each system belongs to a group.
  • there is no fixed amount of systems per group etc.

What I'm trying to get is the sum of all mb_used per system per group, but then only the max value per group. And all of that on a per day basis.

In my head that would be:

  1. group by day, group and system
  2. get sum mb_used
  3. group again by day and system
  4. get max mb_used

getting the sum is easy:

sum_df = (
    df
    .groupby([pd.Grouper(freq="1D"), pd.Grouper('group'), pd.Grouper('system')])
    .mb_used
    .sum()
    .round(1)
    .fillna(0)
    .to_frame()
    .reset_index("group")
    .reset_index("system")
)
           system group  mb_used
2022-04-13  sys_1     A     1730
2022-04-13  sys_2     A     1108
2022-04-13  sys_3     B     2725
2022-04-13  sys_4     B      983

what I'm struggeling with is trying to get only the row with the max value per group. I tried the following (and other things) without luck:

table = (
    sum_df
    .groupby([pd.Grouper(freq="1D"), "system"])
    .mb_used
    .max()
    .unstack()
    .fillna(0)
    .tail(14)
    .sort_index(ascending=False)
    .assign(day=lambda x: x.index.strftime("%d-%m"))
    .set_index('day')
    .transpose()
)

table_sorted = (
    table
    .sort_values(table.columns[0], ascending=False)
    .head(20)

)
day     13-04   12-04   11-04
system       
sys_3    2725     ...     ...
sys_1    1730     ...     ...
sys_2    1108     ...     ...
sys_4     983     ...     ...
...

What I'm trying to get would be this:

         13-04   12-04   11-04
sys_1     1730     ...     ...
sys_3     2725     ...     ...
...

I'm rather new to pandas so any help is welcome, thanks!

CodePudding user response:

First use your solution with removed pd.Grouper and convert all levels to columns by reset_index:

sum_df = (
    df.rename_axis('date')
    .groupby([pd.Grouper(freq="1D"), 'group', 'system'])
    .mb_used
    .sum()
    .round(1)
    .reset_index()
)
print (sum_df)
        date group system  mb_used
0 2022-04-13     A  sys_1     1730
1 2022-04-13     A  sys_2     1108
2 2022-04-13     B  sys_3     2725
3 2022-04-13     B  sys_4      983

Then get rows by maximal mb_used per groups group by DataFrameGroupBy.idxmax:

df = sum_df.loc[sum_df.groupby('group')['mb_used'].idxmax()]
print (df)
        date group system  mb_used
0 2022-04-13     A  sys_1     1730
2 2022-04-13     B  sys_3     272

and last create helper column with DataFrame.pivot_table:

table_sorted = (df.assign(day=lambda x: x['date'].dt.strftime("%d-%m"))
                      .pivot_table(index='system', 
                                   columns='day', 
                                   values='mb_used', 
                                   aggfunc='max'))
print (table_sorted)
day     13-04
system       
sys_1    1730
sys_3    2725
  • Related