I have a dataframe (df) like this:
(not real timestamps)
group system mb_used
2022-04-13 08:42:42.465237 A sys_1 353
2022-04-13 08:42:42.465241 A sys_2 122
2022-04-13 08:42:42.465241 A sys_1 965
2022-04-13 08:42:42.465242 A sys_1 412
2022-04-13 08:42:42.465243 A sys_2 986
2022-04-13 08:42:42.465244 B sys_3 513
2022-04-13 08:42:42.465244 B sys_3 843
2022-04-13 08:42:42.465245 B sys_3 1103
2022-04-13 08:42:42.465245 B sys_3 266
2022-04-13 08:42:42.465246 B sys_4 983
...
It keeps track of mb used by sql queries of that system.
- One row equals one query.
- Each system belongs to a group.
- there is no fixed amount of systems per group etc.
What I'm trying to get is the sum of all mb_used per system per group, but then only the max value per group. And all of that on a per day basis.
In my head that would be:
- group by day, group and system
- get sum mb_used
- group again by day and system
- get max mb_used
getting the sum is easy:
sum_df = (
df
.groupby([pd.Grouper(freq="1D"), pd.Grouper('group'), pd.Grouper('system')])
.mb_used
.sum()
.round(1)
.fillna(0)
.to_frame()
.reset_index("group")
.reset_index("system")
)
system group mb_used
2022-04-13 sys_1 A 1730
2022-04-13 sys_2 A 1108
2022-04-13 sys_3 B 2725
2022-04-13 sys_4 B 983
what I'm struggeling with is trying to get only the row with the max value per group. I tried the following (and other things) without luck:
table = (
sum_df
.groupby([pd.Grouper(freq="1D"), "system"])
.mb_used
.max()
.unstack()
.fillna(0)
.tail(14)
.sort_index(ascending=False)
.assign(day=lambda x: x.index.strftime("%d-%m"))
.set_index('day')
.transpose()
)
table_sorted = (
table
.sort_values(table.columns[0], ascending=False)
.head(20)
)
day 13-04 12-04 11-04
system
sys_3 2725 ... ...
sys_1 1730 ... ...
sys_2 1108 ... ...
sys_4 983 ... ...
...
What I'm trying to get would be this:
13-04 12-04 11-04
sys_1 1730 ... ...
sys_3 2725 ... ...
...
I'm rather new to pandas so any help is welcome, thanks!
CodePudding user response:
First use your solution with removed pd.Grouper
and convert all levels to columns by reset_index
:
sum_df = (
df.rename_axis('date')
.groupby([pd.Grouper(freq="1D"), 'group', 'system'])
.mb_used
.sum()
.round(1)
.reset_index()
)
print (sum_df)
date group system mb_used
0 2022-04-13 A sys_1 1730
1 2022-04-13 A sys_2 1108
2 2022-04-13 B sys_3 2725
3 2022-04-13 B sys_4 983
Then get rows by maximal mb_used
per groups group
by DataFrameGroupBy.idxmax
:
df = sum_df.loc[sum_df.groupby('group')['mb_used'].idxmax()]
print (df)
date group system mb_used
0 2022-04-13 A sys_1 1730
2 2022-04-13 B sys_3 272
and last create helper column with DataFrame.pivot_table
:
table_sorted = (df.assign(day=lambda x: x['date'].dt.strftime("%d-%m"))
.pivot_table(index='system',
columns='day',
values='mb_used',
aggfunc='max'))
print (table_sorted)
day 13-04
system
sys_1 1730
sys_3 2725