Home > Back-end >  How does one avoid autodropping <NA> rows when aggregating a dataframe in Pandas?
How does one avoid autodropping <NA> rows when aggregating a dataframe in Pandas?

Time:10-02

I am attempting to aggregate a dataframe in Pandas to provide useful summary statistics such as total spend, total number of customers, or most frequently used item in a given category. Naturally, some areas of these aggregated categories do not have any customers or any dollars spent. My original dataframe appears as the following:

(Ex.1):

        Cost      ID      Item Category      Date         Expiration   Brand Type  Product Type
    1   100.00   Kh98hkM    Item Cat 1     02/05/2017       30 Days     Store Brand    General 
    2   200.00   A23kzs0    Item Cat 2     01/23/2014       60 Days     Name Brand     Specialty 
    3   130.00   JQ2309d    Item Cat 3     11/19/2019       90 Days     Store Brand    Specialty
    4    0.00    JD31132    Item Cat 1     12/20/2020       15 Days     Store Brand    General 
    ..........................................................................................
    ..........................................................................................
    ..........................................................................................
   100  512.00 QDlkew123    Item Cat 3      03/12/2021      60 Days     Name Brand     Specialty

My objective is to aggregate the columns based upon item category, Expiration, Brand Type, and Product type. For product expiration to avoid dozens of categorizations, I have created a new column called shelf life. Here, products which have an expiration of 30 days or less have a monthly shelf life. Products which have an expiration of more than 30 days have a seasonal shelf life. So df['Shelf Life'] would look as follows:

Expiration   Shelf Life
 30 Days      Monthly
 60 Days      Seasonal
 90 Days      Seasonal
 15 Days      Monthly

The end result should aggregate as follows: (Ex. 2)

Item Category   Shelf Life  Product Type    Brand Type     Purchases   Cost       Sales
  Item Cat 1      Monthly     General       Store Brand      571     15243.00    18546.00
                                            Name Brand      1565     43535.00    56400.00
                              Specialty     Store Brand        0         0.00        0.00
                                            Name Brand         0         0.00        0.00
                  Seasonal    General       Store Brand      245     41015.00    65416.00
                                            Name Brand       341      6168.54    11406.56
                              Specialty     Store Brand      456     11688.53     6557.56
                                            Name Brand      1995      1234.12    19865.56
  Item Cat 2      Monthly     General       Store Brand     6231   1238191.00  2152345.00
                                            Name Brand      6123    674122.00   701000.00
                              Specialty     Store Brand        0         0.00        0.00 
                                            Name Brand         1        50.00       20.00  
                  Seasonal    General       Store Brand        0         0.00        0.00
                                            Name Brand       456      4165.00     5465.00
                              Specialty     Store Brand        0         0.00        0.00           
                                            Name Brand       156     49843.00    71650.00
  Item Cat 3      Monthly     General       Store Brand        5      1565.00      756.47
                                            Name Brand        71     54045.00    31856.45 
                              Specialty     Store Brand      793    198301.98   179833.56
                                            Name Brand       565    798846.46  1268798.55
                  Seasonal    General       Store Brand     1985     32656.46    24989.48
                                            Name Brand       0           0.00        0.00 
                              Specialty     Store Brand      0           0.00        0.00 
                                            Name Brand       0           0.00        0.00 

However, the end result looks like this: (Ex. 3)

Item Category   Shelf Life  Product Type    Brand Type     Purchases   Cost       Sales
  Item Cat 1      Monthly     General       Store Brand      571     15243.00    18546.00
                                            Name Brand      1565     43535.00    56400.00
                  Seasonal    General       Store Brand      245     41015.00    65416.00
                                            Name Brand       341      6168.54    11406.56
                              Specialty     Store Brand      456     11688.53     6557.56
                                            Name Brand      1995      1234.12    19865.56
  Item Cat 2      Monthly     General       Store Brand     6231   1238191.00  2152345.00
                                            Name Brand      6123    674122.00   701000.00
                              Specialty     Name Brand         1        50.00       20.00  
                  Seasonal    General       Name Brand       456      4165.00     5465.00
                              Specialty     Name Brand       156     49843.00    71650.00
  Item Cat 3      Monthly     General       Store Brand        5      1565.00      756.47
                                            Name Brand        71     54045.00    31856.45 
                              Specialty     Store Brand      793    198301.98   179833.56
                                            Name Brand       565    798846.46  1268798.55
                  Seasonal    General       Store Brand     1985     32656.46    24989.48

To rectify this, I have attempted to prevent the dropping of NA values and replace them with 0.

groupby_list = ['Item Category', 'Shelf Life', 'Product Type', 'Brand Type'] 
grouped_df = (by = groupby_list, observed = False, dropna=False)[['Purchases', 'Cost','Sales']].agg('sum').fillna(0)

However, the end result was still the same as it was in Example 3. I have read the documentation was unable to find out how to prevent this automatic reduction of NA rows. Is there a way to prevent aggregated NA rows from autodeleting?

CodePudding user response:

You need to reindex your dataframe after the groupby result is obtained. That’s easily done by using the index levels and pd.MultiIndex.from_product:

>>> grouped_df
                 0
item 1 seasonal  a
       monthly   b
item 2 monthly   c
>>> grouped_df.reindex(pd.MultiIndex.from_product(grouped_df.index.levels), fill_value=0)
                 0
item 1 monthly   b
       seasonal  a
item 2 monthly   c
       seasonal  0
  • Related