How to create a multi-index pivot table that sums the max values within a sub-group-CodePudding

I have a somewhat large dataframe of customers assigned to a hub and each hub is in a specific location. The hubs get flagged whenever there's an issue and I'd like to know the number of customers affected each time this happens.

So I'd like to find the max number of customers assigned to each hub (this would then exclude the times the hub may have been flagged multiple times) and then group the rows by location and the columns by type, then show the sum of the max count of customers over a period of months.

The data looks like:

Hub	Location	DateTime	Month	Type	Customers
J01	NY	01/01/2022	January	Type 1	250
J03	CA	01/21/2022	January	Type 2	111
J01	NY	04/01/2022	April	Type 1	250
J05	CA	06/01/2022	June	Type 1	14
J03	CA	08/18/2022	August	Type 2	111

I did the following code to generate a pivot table and it generates the max values for each hub, but there are hundreds of hubs.

`    pd.pivot_table (out,values='Customers',index=['Location','Hub'], columns=
     ['Type','Month'],aggfunc='max')    `

Results mostly look like:

	Type	Type 1			Type 2
	Month	January	February	March	January
Location	Hub				NA
NY	J01	0	250	250	NA
	J04	222	222	222	NA
CA	J03	NA	NA	NA	111
CA	J05	14	14	0	NA

I would like the results to look like:

	Type	Type 1			Type 2
	Month	January	February	March	January
Location
NY		222	472	472	0
CA		14	14	0	111

Is there an easier way to achieve this?

CodePudding user response：

You're on the right start! pivot_table is the right way to group the table with columns by type. You also identified that you can perform the max aggregation at pivot-time:

df = pd.pivot_table(
    out,
    values='Customers',
    index=['Location','Hub'],
    columns=['Type','Month'],
    aggfunc='max'
)

Next, we need to groupby Location and add (sum) the values:

df = df.groupby("Location").max()

The final result (trying to build a table that matches your data):

out = pd.read_csv(io.StringIO("""
Hub Location    DateTime    Month   Type    Customers
J01 NY  01/01/2022  January Type 1  250
J03 CA  01/21/2022  January Type 2  111
J01 NY  04/01/2022  April   Type 1  250
J04 NY  01/01/2022  January Type 1  222
J04 NY  02/01/2022  February    Type 1  222
J04 NY  04/01/2022  April   Type 1  222
J05 CA  06/01/2022  June    Type 1  14
J03 CA  08/18/2022  August  Type 2  111
"""), sep='\t')

pd.pivot_table(
    out,
    values='Customers',
    index=['Location','Hub'],
    columns=['Type','Month'],
    aggfunc='max'
).groupby("Location").sum()

gives:

    Type    Type 1  Type 2
   Month    April   February    January June    August  January
Location                        
      CA    0.0     0.0         0.0     14.0    111.0   111.0
      NY    472.0   222.0       472.0   0.0     0.0     0.0