I am attempting to aggregate a dataframe in Pandas to provide useful summary statistics such as total spend, total number of customers, or most frequently used item in a given category. Naturally, some areas of these aggregated categories do not have any customers or any dollars spent. My original dataframe appears as the following:
(Ex.1):
Cost ID Item Category Date Expiration Brand Type Product Type
1 100.00 Kh98hkM Item Cat 1 02/05/2017 30 Days Store Brand General
2 200.00 A23kzs0 Item Cat 2 01/23/2014 60 Days Name Brand Specialty
3 130.00 JQ2309d Item Cat 3 11/19/2019 90 Days Store Brand Specialty
4 0.00 JD31132 Item Cat 1 12/20/2020 15 Days Store Brand General
..........................................................................................
..........................................................................................
..........................................................................................
100 512.00 QDlkew123 Item Cat 3 03/12/2021 60 Days Name Brand Specialty
My objective is to aggregate the columns based upon item category, Expiration, Brand Type, and Product type. For product expiration to avoid dozens of categorizations, I have created a new column called shelf life. Here, products which have an expiration of 30 days or less have a monthly shelf life. Products which have an expiration of more than 30 days have a seasonal shelf life. So df['Shelf Life']
would look as follows:
Expiration Shelf Life
30 Days Monthly
60 Days Seasonal
90 Days Seasonal
15 Days Monthly
The end result should aggregate as follows: (Ex. 2)
Item Category Shelf Life Product Type Brand Type Purchases Cost Sales
Item Cat 1 Monthly General Store Brand 571 15243.00 18546.00
Name Brand 1565 43535.00 56400.00
Specialty Store Brand 0 0.00 0.00
Name Brand 0 0.00 0.00
Seasonal General Store Brand 245 41015.00 65416.00
Name Brand 341 6168.54 11406.56
Specialty Store Brand 456 11688.53 6557.56
Name Brand 1995 1234.12 19865.56
Item Cat 2 Monthly General Store Brand 6231 1238191.00 2152345.00
Name Brand 6123 674122.00 701000.00
Specialty Store Brand 0 0.00 0.00
Name Brand 1 50.00 20.00
Seasonal General Store Brand 0 0.00 0.00
Name Brand 456 4165.00 5465.00
Specialty Store Brand 0 0.00 0.00
Name Brand 156 49843.00 71650.00
Item Cat 3 Monthly General Store Brand 5 1565.00 756.47
Name Brand 71 54045.00 31856.45
Specialty Store Brand 793 198301.98 179833.56
Name Brand 565 798846.46 1268798.55
Seasonal General Store Brand 1985 32656.46 24989.48
Name Brand 0 0.00 0.00
Specialty Store Brand 0 0.00 0.00
Name Brand 0 0.00 0.00
However, the end result looks like this: (Ex. 3)
Item Category Shelf Life Product Type Brand Type Purchases Cost Sales
Item Cat 1 Monthly General Store Brand 571 15243.00 18546.00
Name Brand 1565 43535.00 56400.00
Seasonal General Store Brand 245 41015.00 65416.00
Name Brand 341 6168.54 11406.56
Specialty Store Brand 456 11688.53 6557.56
Name Brand 1995 1234.12 19865.56
Item Cat 2 Monthly General Store Brand 6231 1238191.00 2152345.00
Name Brand 6123 674122.00 701000.00
Specialty Name Brand 1 50.00 20.00
Seasonal General Name Brand 456 4165.00 5465.00
Specialty Name Brand 156 49843.00 71650.00
Item Cat 3 Monthly General Store Brand 5 1565.00 756.47
Name Brand 71 54045.00 31856.45
Specialty Store Brand 793 198301.98 179833.56
Name Brand 565 798846.46 1268798.55
Seasonal General Store Brand 1985 32656.46 24989.48
To rectify this, I have attempted to prevent the dropping of NA values and replace them with 0.
groupby_list = ['Item Category', 'Shelf Life', 'Product Type', 'Brand Type']
grouped_df = (by = groupby_list, observed = False, dropna=False)[['Purchases', 'Cost','Sales']].agg('sum').fillna(0)
However, the end result was still the same as it was in Example 3. I have read the documentation was unable to find out how to prevent this automatic reduction of NA rows. Is there a way to prevent aggregated NA rows from autodeleting?
CodePudding user response:
You need to reindex your dataframe after the groupby result is obtained. That’s easily done by using the index levels and pd.MultiIndex.from_product
:
>>> grouped_df
0
item 1 seasonal a
monthly b
item 2 monthly c
>>> grouped_df.reindex(pd.MultiIndex.from_product(grouped_df.index.levels), fill_value=0)
0
item 1 monthly b
seasonal a
item 2 monthly c
seasonal 0