Trying to organize a large dataset and then determine the mean days and standard deviation using Spy-CodePudding

*Edit: Posted code as dataframe not sheets link.

I have a large data set consisting of ~6.5M rows and 6 columns. The rows are BrandId's (e.g., 01-00058) associated with unique items and the 3 columns I need utilized are: BrandId, InventoryDate, and OnHand.

          BrandID  SalesPrice InventoryDate   Size  OnHand  PurchasePrice
0        01-00058        9.28    2018-06-30  750mL       6           6.77
1        01-00058        9.28    2018-07-01  750mL       6           6.77
2        01-00058        9.28    2018-07-02  750mL       6           6.77
3        01-00058        9.28    2018-07-03  750mL     102           6.77
4        01-00058        9.28    2018-07-04  750mL      96           6.77
          ...         ...           ...    ...     ...            ...
6531265  02-90631       12.74    2019-06-26  400mL      60           8.49
6531266  02-90631       12.74    2019-06-27  400mL      60           8.49
6531267  02-90631       12.74    2019-06-28  400mL      60           8.49
6531268  02-90631       12.74    2019-06-29  400mL      60           8.49
6531269  02-90631       12.74    2019-06-30  400mL      60           8.49

[6531270 rows x 6 columns]

I would like to determine how many days each particular BrandId has no inventory on hand. For example, BrandId 01-00058 has 27 unique days where OnHand = 0. I would like summarize that information for all unique BrandId's.

I would then like to find the mean and standard deviation of these unique BrandId's from the days each is stocked out.

Ideally, I would love this information to be viewed in the variable explorer as a table that reads:

BrandID     Sum OnHand = 0
01-00058    27
01-00061    39
01-00062    14
``'

CodePudding user response：

IIUC, try with groupby:

>>> df[df["OnHand"].eq(0)].groupby("BrandID")["InventoryDate"].nunique()