I have a somewhat large dataframe of customers assigned to a hub and each hub is in a specific location. The hubs get flagged whenever there's an issue and I'd like to know the number of customers affected each time this happens.
So I'd like to find the max number of customers assigned to each hub (this would then exclude the times the hub may have been flagged multiple times) and then group the rows by location and the columns by type, then show the sum of the max count of customers over a period of months.
The data looks like:
Hub | Location | DateTime | Month | Type | Customers |
---|---|---|---|---|---|
J01 | NY | 01/01/2022 | January | Type 1 | 250 |
J03 | CA | 01/21/2022 | January | Type 2 | 111 |
J01 | NY | 04/01/2022 | April | Type 1 | 250 |
J05 | CA | 06/01/2022 | June | Type 1 | 14 |
J03 | CA | 08/18/2022 | August | Type 2 | 111 |
I did the following code to generate a pivot table and it generates the max values for each hub, but there are hundreds of hubs.
` pd.pivot_table (out,values='Customers',index=['Location','Hub'], columns=
['Type','Month'],aggfunc='max') `
Results mostly look like:
Type | Type 1 | Type 2 | |||
---|---|---|---|---|---|
Month | January | February | March | January | |
Location | Hub | NA | |||
NY | J01 | 0 | 250 | 250 | NA |
J04 | 222 | 222 | 222 | NA | |
CA | J03 | NA | NA | NA | 111 |
CA | J05 | 14 | 14 | 0 | NA |
I would like the results to look like:
Type | Type 1 | Type 2 | |||
---|---|---|---|---|---|
Month | January | February | March | January | |
Location | |||||
NY | 222 | 472 | 472 | 0 | |
CA | 14 | 14 | 0 | 111 |
Is there an easier way to achieve this?
CodePudding user response:
You're on the right start! pivot_table
is the right way to group the table with columns by type. You also identified that you can perform the max
aggregation at pivot-time:
df = pd.pivot_table(
out,
values='Customers',
index=['Location','Hub'],
columns=['Type','Month'],
aggfunc='max'
)
Next, we need to groupby Location
and add (sum) the values:
df = df.groupby("Location").max()
The final result (trying to build a table that matches your data):
out = pd.read_csv(io.StringIO("""
Hub Location DateTime Month Type Customers
J01 NY 01/01/2022 January Type 1 250
J03 CA 01/21/2022 January Type 2 111
J01 NY 04/01/2022 April Type 1 250
J04 NY 01/01/2022 January Type 1 222
J04 NY 02/01/2022 February Type 1 222
J04 NY 04/01/2022 April Type 1 222
J05 CA 06/01/2022 June Type 1 14
J03 CA 08/18/2022 August Type 2 111
"""), sep='\t')
pd.pivot_table(
out,
values='Customers',
index=['Location','Hub'],
columns=['Type','Month'],
aggfunc='max'
).groupby("Location").sum()
gives:
Type Type 1 Type 2
Month April February January June August January
Location
CA 0.0 0.0 0.0 14.0 111.0 111.0
NY 472.0 222.0 472.0 0.0 0.0 0.0