Home > Blockchain >  How to create a multi-index pivot table that sums the max values within a sub-group
How to create a multi-index pivot table that sums the max values within a sub-group

Time:11-15

I have a somewhat large dataframe of customers assigned to a hub and each hub is in a specific location. The hubs get flagged whenever there's an issue and I'd like to know the number of customers affected each time this happens.

So I'd like to find the max number of customers assigned to each hub (this would then exclude the times the hub may have been flagged multiple times) and then group the rows by location and the columns by type, then show the sum of the max count of customers over a period of months.

The data looks like:

Hub Location DateTime Month Type Customers
J01 NY 01/01/2022 January Type 1 250
J03 CA 01/21/2022 January Type 2 111
J01 NY 04/01/2022 April Type 1 250
J05 CA 06/01/2022 June Type 1 14
J03 CA 08/18/2022 August Type 2 111

I did the following code to generate a pivot table and it generates the max values for each hub, but there are hundreds of hubs.

`    pd.pivot_table (out,values='Customers',index=['Location','Hub'], columns=
     ['Type','Month'],aggfunc='max')    `

Results mostly look like:

Type Type 1 Type 2
Month January February March January
Location Hub NA
NY J01 0 250 250 NA
J04 222 222 222 NA
CA J03 NA NA NA 111
CA J05 14 14 0 NA

I would like the results to look like:

Type Type 1 Type 2
Month January February March January
Location
NY 222 472 472 0
CA 14 14 0 111

Is there an easier way to achieve this?

CodePudding user response:

You're on the right start! pivot_table is the right way to group the table with columns by type. You also identified that you can perform the max aggregation at pivot-time:

df = pd.pivot_table(
    out,
    values='Customers',
    index=['Location','Hub'],
    columns=['Type','Month'],
    aggfunc='max'
)

Next, we need to groupby Location and add (sum) the values:

df = df.groupby("Location").max()

The final result (trying to build a table that matches your data):

out = pd.read_csv(io.StringIO("""
Hub Location    DateTime    Month   Type    Customers
J01 NY  01/01/2022  January Type 1  250
J03 CA  01/21/2022  January Type 2  111
J01 NY  04/01/2022  April   Type 1  250
J04 NY  01/01/2022  January Type 1  222
J04 NY  02/01/2022  February    Type 1  222
J04 NY  04/01/2022  April   Type 1  222
J05 CA  06/01/2022  June    Type 1  14
J03 CA  08/18/2022  August  Type 2  111
"""), sep='\t')

pd.pivot_table(
    out,
    values='Customers',
    index=['Location','Hub'],
    columns=['Type','Month'],
    aggfunc='max'
).groupby("Location").sum()

gives:

    Type    Type 1  Type 2
   Month    April   February    January June    August  January
Location                        
      CA    0.0     0.0         0.0     14.0    111.0   111.0
      NY    472.0   222.0       472.0   0.0     0.0     0.0
  • Related